1. 20

  2. 9

    Mostly good, but conflates consistency with durability in a number of places – almost didn’t keep reading after this bit near the beginning:

    the guarantee that data has been committed to somewhere durable when a write() call returns without an error is a semantic aspect of the POSIX write() API

    That’s just not correct at all.
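
    For what it’s worth, here is a minimal Python sketch of the distinction, assuming a Linux-like system. The helper names `write_buffered` and `write_durable` are made up for illustration. A successful write() only means the kernel accepted the bytes, typically into the page cache; asking for durability takes an explicit fsync() (or opening with O_SYNC/O_DSYNC).

    ```python
    import os
    import tempfile


    def write_buffered(path: str, data: bytes) -> None:
        """Write data; on return it may still sit in the page cache as dirty pages."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            offset = 0
            while offset < len(data):
                # os.write may accept fewer bytes than requested, so loop.
                offset += os.write(fd, data[offset:])
        finally:
            os.close(fd)


    def write_durable(path: str, data: bytes) -> None:
        """Write data, then ask the kernel to flush it to stable storage."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            offset = 0
            while offset < len(data):
                offset += os.write(fd, data[offset:])
            # Only after fsync returns without error has durability been requested.
            os.fsync(fd)
        finally:
            os.close(fd)


    if __name__ == "__main__":
        demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
        write_durable(demo_path, b"hello")
    ```

    Both helpers read back the same on a running system; the difference only shows up after a crash or power loss. Making a newly created file’s directory entry durable would also need an fsync() on the parent directory, which this sketch leaves out.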

    1. 6

      Yeah, that was really weird. Then later it goes on to talk about dirty pages in the cache, which is an obvious contradiction.

      1. 5

        It also took me a while to realize that the author seemed mostly interested in distributed file systems, where I think POSIX I/O semantics are pretty well known to be a poor fit. For a single system, though, are they really so bad?

    2. 8

      Really nice article.

      Still, while I’m not a POSIX fan (Jehanne itself is not POSIX compliant, and I’m designing a new filesystem protocol to replace 9P2000 that will require a pretty different API), the research effort spent on these HPC issues looks excessive to my untrained eye.
      The number of people who can benefit from these kinds of performance improvements is very low. They own or work for large and powerful tech companies with tons of money and huge interests in those topics (both for cost reduction and performance improvement), but for 99.9999% of people and companies this research is completely irrelevant. Isn’t it?

      1. 4

        The “large and powerful tech companies with tons of money” account for a huge amount of the market. If Amazon was able to utilize some new IO innovation to speed up EBS volumes, how many people would that affect?

        We aren’t talking about tiny performance gains here. Storage IO is frequently a bottleneck, and proper IO management can easily win you 10x gains or more. You can’t optimize past the guarantees of the lowest-level API. You’re stuck with them, with no way to opt out. There are applications that could trivially use stateless IO APIs, the most obvious being databases. Right now many databases get around this with mmap, but unfortunately mmap doesn’t scale.

        1. 3

          If Amazon was able to utilize some new IO innovation to speed up EBS volumes, how many people would that affect?

          For sure, many people would be affected by such a speedup, but how many would benefit from it?

          For sure Amazon would benefit: it could save on hardware, for example, as the same hardware could serve more customers. Or it could lower prices, attracting more customers. But how many of its customers would even notice the performance improvement?
          Since Amazon is the one that really uses the new technology, it’s the one that benefits from it most.

          Let’s take another example: suppose Google uses such a new technology to improve GMail performance. How many people do you think would even notice the gain? It’s much more probable that Google would decide to use the technology to reduce costs.

          I’m not saying that this is a pointless research field: to me, good research is always welcome, since I really think that computer science is still in the stone age.

          But while, for example, fully homomorphic encryption could easily improve the lives of a billion people, whenever I see such a huge effort on a technology like this… don’t know… I find it far less than optimal.

          1. 3

            If Amazon was able to utilize some new IO innovation to speed up EBS volumes, how many people would that affect?

            Mostly Amazon’s internal systems, unless they’re also rewriting your code to use it, because the speedups being talked about here involve user code keeping track of I/O instead of system code doing it.

            but unfortunately mmap doesn’t scale.

            I’m curious what you mean by this. Are you referring to address space limitations, or are you imagining multiple database nodes concurrently modifying the same shared database storage?

        2. 2

          I wonder whether the two major open source file systems/object stores, Ceph and Gluster, can offer an API similar to that of the object stores described in the article.

          1. 1

            If there are bottlenecks with multiple writes on a big file (database?), maybe it is time to split the big file into smaller files.

            Is object-based storage immune to this problem? One file per object, each able to be written independently.

            1. 2

              The premise is that you don’t change application code. So if I’ve got a Tetris clone from 1990 that writes to a high_scores file, and I install it in my cluster and millions of users play, all writing to the same file, that should be fast.
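
              As a rough sketch of what that unmodified program effectively does (the names `append_score` and `simulate_players` and the file layout are hypothetical, and threads stand in for separate players), every writer appends to the same file and relies on POSIX O_APPEND to place each write at the current end of file:

              ```python
              import os
              import tempfile
              import threading


              def append_score(path: str, player: str, score: int) -> None:
                  """Append one score line; O_APPEND positions each write at end of file."""
                  line = f"{player} {score}\n".encode()
                  fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
                  try:
                      os.write(fd, line)
                  finally:
                      os.close(fd)


              def simulate_players(path: str, players: int, rounds: int) -> list[str]:
                  """Run several writers against one shared file and return its lines."""
                  def play(pid: int) -> None:
                      for r in range(rounds):
                          append_score(path, f"player{pid}", r)

                  threads = [threading.Thread(target=play, args=(i,)) for i in range(players)]
                  for t in threads:
                      t.start()
                  for t in threads:
                      t.join()
                  with open(path) as f:
                      return f.read().splitlines()


              if __name__ == "__main__":
                  scores = os.path.join(tempfile.mkdtemp(), "high_scores")
                  print(len(simulate_players(scores, players=4, rounds=25)))
              ```

              On one machine with a local file system the kernel serializes these small appends. Giving the same behaviour to writers spread across a cluster is where the coordination cost comes from, as far as I understand the article.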

              1. 1


                Better to think about these issues early, but it’s hard to imagine how the programs we write will be used tomorrow.