1. 8

    This is a pretty inspirational article, even though I disagree about some specifics.

    It was posted here 5 months ago in a comment on a related story, and there was a good discussion, worth rereading.

    In general, I think the filesystem as we know it is one of those OS concepts from the 1960s that really needs to be rethought. The more I work with and on databases, the more I realize that a filesystem is basically just a shitty database, with a fixed and limited schema and very poor durability for client data.

    1. 1

      Agreed with your last paragraph. A typical hierarchical file system is functionally like a NoSQL key-value document store. There is no schema, no concept of fields/columns, no relations or foreign keys, no complicated queries involving columns and joins, no transactions or atomicity over an arbitrary number of updates.

      This section might interest you: https://www.nayuki.io/page/designing-better-file-organization-around-tags-not-hierarchies#externalized-relational-database

      Externalized relational database

      As hinted in a few places, this entire proposed scheme revolving around files, tags, and queries can be viewed in the light of the relational data model. A file in the proposed system corresponds to a database row. The schema for a file corresponds to which database table the row belongs in. The hash of a file is a de facto primary key. Hash references correspond to foreign key fields. Tag-based queries make heavy use of joins, thus we need table indexes to efficiently find the desired data.

      This proposal does necessarily differ somewhat from the relational model. The query domain is flexible, because you can search for files over multiple storage devices instead of just one database table. As a corollary, it means you can temporarily read and query over foreign data without permanently importing it into your local collection. Database rows can be updated in place but hash-addressed files cannot. Every file has a globally unique identity, whereas database rows out in the open cannot use their small integers as primary or foreign keys. And every file can be a reference target, whereas not every database row can easily be referenced (e.g. multi-attribute keys).

      1. 1

        I think relational vs key-value is a red herring here. You could implement the tagging scheme in a document database or in a relational one.

        The big things that bug me about filesystems are:

        • They don’t provide durability/integrity of application data, only their own metadata. As recounted elsewhere, the simple act of reliably updating a file is remarkably hard to do, requires platform-specific steps, and almost no one seems to do it correctly except database engines. Saving also requires copying the entire file, unless you use complex file formats and algorithms that allow safe update-in-place, which again only databases do.
        • They provide almost no structure for data, just a couple of attributes like filename and last-modified. (Linux apparently doesn’t even ensure the name is any kind of valid string.) OK, some filesystems have extended attributes but they’re limited and vary between file systems and platforms.
    1. 2

      Incidentally, the _ExtInt proposal, if accepted, fixes promotion rules.


      when converting to a signed type […] if it doesn’t fit, then the behavior is implementation-defined, and could raise an exception (e.g. overflow trap).

      The behaviour is undefined, not implementation-defined, and therefore free to invoke eldritch nasal demons.

      Misconceptions […] Converting a pointer to an int and back to a pointer will be lossless.

      This is guaranteed for (u)intptr_t.

      Misconceptions […] Converting {a pointer to one integer type} to {a pointer to another integer type} is safe. (e.g. int *p = (...); long *q = (long*)p;.) (See type punning and strict aliasing.)

      This is permitted for char pointers.

      1. 1

        if the target type is signed, the behavior is implementation-defined (which may include raising a signal)

        https://en.cppreference.com/w/c/language/conversion#Integer_conversions

        If the destination type is signed, the value does not change if the source integer can be represented in the destination type. Otherwise the result is {implementation-defined (until C++20)} / {the unique value of the destination type equal to the source value modulo 2^n where n is the number of bits used to represent the destination type. (since C++20)}. (Note that this is different from signed integer arithmetic overflow, which is undefined).

        https://en.cppreference.com/w/cpp/language/implicit_conversion#Integral_conversions

        Speaking meta, I wrote the article because many people don’t understand the rules in C/C++.

        This is guaranteed for (u)intptr_t.

        I’m not sure about this one… There might be problems with fat function pointers?

        1. 2

          implementation-defined

          Ah, right, the clause I was thinking of has a different context (conversion of floats to ints). Though there is UB for signed shifts out of range (c11§6.5.7p2):

          [for E1 << E2] If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.


          I’m not sure about [(u)intptr_t]… There might be problems with fat function pointers?

          Ahhh, good point. intptr_t can round-trip any void pointer, which can in turn round-trip any object pointer, but a function is not an object.

      1. 6

        I strongly recommend the Clang Undefined Behavior Sanitizer — it adds runtime checks to your code that detect and flag (among many other things) integer overflows. I always enable this in my dev/debug builds.

        Some of the weirder rules only apply to long-obsolete hardware. I think you hav to go back to the 1960s to find CPUs with 36- or 18-bit words, or that don’t use 2s-complement arithmetic, and to the 1970s to find ones whose memory isn’t byte-addressable (i.e. “char” is bigger than 1 byte.) And EBCDIC never made it out of IBM mainframes.

        Bu I guess if you ignore those factors you’ll get outraged bug reports from the retro-computing folks complaining that your code breaks on OS\360, TENEX or TOPS-20…

        1. 6

          Some of the weirder rules only apply to long-obsolete hardware. I think you hav to go back to the 1960s to find CPUs with 36- or 18-bit words, or that don’t use 2s-complement arithmetic, and to the 1970s to find ones whose memory isn’t byte-addressable (i.e. “char” is bigger than 1 byte.) And EBCDIC never made it out of IBM mainframes.

          You need to look no further than the SHARC, still one of the most popular lines of DSP lines at the moment, to find an architecture where sizeof(char) = 4.

          (Edit: FWIW, I think a better approach to these would be to make the standard stricter, even if that means more non-compliant implementations for “special” architectures like the SHARC. I’m pretty sure AD’s C compiler isn’t (or wasn’t, thankfully for my mental sanity I haven’t touched it in like 5 years) standards-compliant anyway because sizeof(int) is also 4. We’re at a stage where you expect inconsistencies and bugs in vendor-supplied compilers anyway. There’s no need for anyone else to suffer just so a bunch of vendors with really particular requirements to be able to claim compliance, when most of them don’t care about it anyway.)

          1. 5

            You need to look no further than the SHARC, still one of the most popular lines of DSP lines at the moment, to find an architecture where sizeof(char) = 4.

            Indeed. It’s the little, low-level, specialized devices where all the weirdness shows up. Basically, if it’s not a 32-bit device there is probably something the violates your expectations. Don’t assume the world is made up of the computers you’re used to using.

            1. 4

              Ok, mind blown! I did not know that. But I can see how a DSP platform wouldn’t see byte-addressibility as necessary.

              1. 1

                Wait, isn’t sizeof(char) = 1 by definition? I suspect that what you meant to say is that for the C implementation that runs on Analog Devices’ SHARC DSP, char is 32 bits wide, int is also 32 bits wide, and actually sizeof(int) = 1.

                1. 1

                  You may be right, I don’t have the compiler at hand anymore. I’m sure (it caused me a lot of headaches when porting some code from elsewhere) that sizeof(char) and sizeof(int) were equal but I really don’t remember which of these two puzzling results it yielded.

              2. 3

                And EBCDIC never made it out of IBM mainframes.

                This is true, but there’s still a lot of code running (and being maintained!) on mainframes, so it can’t be ignored.

                1. 1

                  Didn’t ICL mainframes also use EBCDIC?

              1. 14

                Pretty good article overall, but a few things to fix:

                • Please spell UTF-8 and UTF-16 consistently with a hyphen.
                • Each plane has 65536 code points, not 65535.
                • Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec<char> which is in UTF-32.
                • CJK characters are 3 bytes characters long in UTF-8, not 2 bytes as you stated. This is because the 2-byte limit is U+07FF, and all CJK characters are above U+3000.
                • You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.
                • Python 2 is a can of worms (which depends on compile-time flags), but Python 3 is not. In Python 3, all strings behave as if they are in UTF-32 (i.e. direct access to every code point). But internally, it can store strings as UTF-32, UTF-16, or UTF-8, depending on the actual content of each string (e.g. a pure ASCII string can be stored as UTF-8 and still allow fast O(1) indexing).
                1. 6

                  Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec which is in UTF-32.

                  Stringactually doesn’t support byte indexing either, it just has range indexing to get a slice of the string, but you can use the as_bytes method to get a slice of bytes instead.

                  1. 3

                    This is correct and this is a really important point I think for Rust string indexing, I’ll modify the post to reflect this.

                  2. 3

                    You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.

                    I don’t understand this criticism. The entire article is about the special case where UTF-16 is variable length - the case where you need to use surrogate pairs to represent a single unicode code point.

                    1. 2

                      Thanks for the feedback! I’m really glad the article initiated some interesting comments here. There are a few points which need clarifying and correcting which I’ll do in an update.

                      You’re right about the Rust string indexing requiring .chars() iterator, I think this is a very important point so I’ll edit the post to explain this.

                      I tried to point out how UTF-16 isn’t fixed length but I definitely think it could be made more clear.

                    1. 2

                      The themes of these questions remind me of Java Puzzlers (book, video).

                      1. 2

                        Excellent article; it explains the underlying concepts well. Note that using a B-tree to store a long string is like using many little gap buffers, because each tree node has some slack space.

                        I wrote about the same topic (a list structure with fast insertion and reading) with a different focus. My description is shorter and shallower, but code is included. https://www.nayuki.io/page/avl-tree-list

                        It’s true that the modern solution is to use ropes in a balanced tree.

                        1. 1

                          The default implimentation of “Set” in F# is an immutable AVL Tree. Is there a requirement for uniqueness of elements for AVL trees? Also are AVL’s unordered? Or is this just a coincidence that F# decided to use AVL Trees for unique unordered collections.

                          1. 1

                            In an imperative programming language like Java (C++, Python, etc.), it is customary to use a self-balancing binary tree data structure to implement ordered sets and ordered maps/dictionaries. If I recall correctly, the AVL balanced tree was invented first, but the later red-black tree became the more popular choice in practice. I’m not sure why though, because the RB code is longer, there are more cases to handle, and it’s not clear to me that it’s faster than AVL (though I haven’t implemented or benchmarked it). Anyways, these mutable binary trees can be adapted to be immutable - just throw away a node and replace it with a new one if you need to insert/update/delete a tree.

                            Onto your questions. Any tree (plain binary, AVL, RB, B-tree) doesn’t require unique elements or ordering of the values. But you need certain rules about where to store and retrieve an element. For ordered sets, the ordering forces the position within the tree. For the hex editor article and my “AVL tree list”, the subtree weight forces the position within the tree.

                            Sorry, because I’m unfamiliar with F#, I can’t answer your specific question about why they decided to use AVL trees for unique unordered collections. It seems that “unique” and “unordered” are features that are mutually incompatible for a tree structure. I expect you’ll have better luck on a larger community forum like Stack Overflow.

                          2. 1

                            That’s a great article and site. I’ll probably dig into those write-ups at some point.

                            1. 3

                              Be careful when XOR-swapping two integers that are addressed by pointers/references. If they both point to the same memory location, then the value gets erased to zero. In C it would be good to add the restrict keyword to the two pointers, to let the human know about this pitfall (it doesn’t stop the compiler).

                              1. 6

                                Be careful when using coding tricks in production code.

                                I doubt whether there’s still a compiler used “for real” which doesn’t optimize the normal swap algorithm, using a temporary variable, into whatever’s most efficient for the system, which usually works out to being a single opcode. This is decades-old peephole optimization stuff. The intricacies of restrict and pointer aliasing are useful to keep in mind in general, but if you’re using this trick in code where correctness matters, just use a temporary variable and have some faith in the people who wrote the optimizer.

                              1. 4

                                Rust is the first popular language to propose an alternative — automatic memory management and safety without garbage collection 

                                This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII. But C++ comes with a lot of baggage from the past, unsafe default behaviors, and subtle pitfalls. Whereas Rust is a clean start that takes the good parts of C++ and none of the bad parts.

                                But overall, a well-written, information-packed article.

                                1. 1

                                  This statement is inaccurate. C++ has automatic memory management without garbage collection, thanks to RAII.

                                  I knew about RAII in C++, but (please correct me if I’m mistaken) it is not enforced by the compiler. By the same metric, C with Boehm’s GC has automatic memory collection. Sure, in Rust a developer can use unsafe and escape the compiler, but it is an explicit act and, when using an open-source 3rd party lib, pretty easy to grep.

                                  1. 1

                                    RAII in C++ is basically an add-on.

                                    After fighting with random SIGSEGVs on some large C++ codebase I was asked to maintain, I disciplined myself to either use RAII and pass-by-reference, or having a mandatory public ok() function returning false if something went wrong in the constructor. Years later, I found Rust compiler enforced what to me was the best practice.

                                1. 12

                                  This is a bit of a wild card, I am not experienced at all with haskell, but my impression is it may not be well suited to imperitive operations like read buffers from a stream and write buffers to a stateful database. I could be totally wrong or ignorant and would like to one day address my ignorance.

                                  These are things that I use Haskell for!! As Tackling the Awkward Squad says:

                                  In short, Haskell is the world’s finest imperative programming language.

                                  1. 3

                                    exactly the type of thing I love to read, thanks :)

                                    1. 1

                                      This PDF crashed my browser.

                                      1. 0

                                        Why was my comment about marking PDFs downvoted?

                                        1. 0

                                          The article is an HTML page. There is not a PDF in sight. Why do you want to mark a PDF?

                                          1. 1

                                            It’s quite obviously a PDF. I honestly don’t know what you’re talking about.

                                        2. -2

                                          Can you mark your PDF? Thanks.