1. 10
  1. 8

    This is a pretty inspirational article, even though I disagree about some specifics.

    It was posted here 5 months ago in a comment on a related story, and there was a good discussion, worth rereading.

    In general, I think the filesystem as we know it is one of those OS concepts from the 1960s that really needs to be rethought. The more I work with and on databases, the more I realize that a filesystem is basically just a shitty database, with a fixed and limited schema and very poor durability for client data.

    1. 1

      Agreed with your last paragraph. A typical hierarchical file system is functionally like a NoSQL key-value document store. There is no schema, no concept of fields/columns, no relations or foreign keys, no complicated queries involving columns and joins, no transactions or atomicity over an arbitrary number of updates.

      This section might interest you: https://www.nayuki.io/page/designing-better-file-organization-around-tags-not-hierarchies#externalized-relational-database

      Externalized relational database

      As hinted in a few places, this entire proposed scheme revolving around files, tags, and queries can be viewed in the light of the relational data model. A file in the proposed system corresponds to a database row. The schema for a file corresponds to which database table the row belongs in. The hash of a file is a de facto primary key. Hash references correspond to foreign key fields. Tag-based queries make heavy use of joins, thus we need table indexes to efficiently find the desired data.

      This proposal does necessarily differ somewhat from the relational model. The query domain is flexible, because you can search for files over multiple storage devices instead of just one database table. As a corollary, it means you can temporarily read and query over foreign data without permanently importing it into your local collection. Database rows can be updated in place but hash-addressed files cannot. Every file has a globally unique identity, whereas database rows out in the open cannot use their small integers as primary or foreign keys. And every file can be a reference target, whereas not every database row can easily be referenced (e.g. multi-attribute keys).
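
      To make the quoted mapping concrete, here is a rough sketch using SQLite from Python. It is only an illustration of "hash as primary key, hash reference as foreign key, tag query as join"; the table names, column names, and helper functions are my own assumptions and are not taken from the article.

```python
import hashlib
import sqlite3

def content_hash(data: bytes) -> str:
    """Hex SHA-256 of the file contents, used here as the primary key."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical schema: one row per file, one row per (file, tag) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (hash TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tags (
        hash TEXT NOT NULL REFERENCES files(hash),  -- hash reference = foreign key
        tag  TEXT NOT NULL,
        PRIMARY KEY (hash, tag)
    );
    CREATE INDEX tags_by_tag ON tags(tag);  -- index so tag lookups avoid full scans
""")

def add_file(name: str, data: bytes, tags: list[str]) -> str:
    """Store a file row keyed by its content hash, plus its tag rows."""
    h = content_hash(data)
    conn.execute("INSERT OR IGNORE INTO files VALUES (?, ?)", (h, name))
    conn.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)", [(h, t) for t in tags])
    return h

def files_with_all_tags(*wanted: str) -> list[str]:
    """Names of files carrying every tag in `wanted` (a join plus GROUP BY)."""
    marks = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""SELECT f.name FROM files f JOIN tags t ON t.hash = f.hash
            WHERE t.tag IN ({marks})
            GROUP BY f.hash HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY f.name""",
        (*wanted, len(wanted)),
    )
    return [name for (name,) in rows]

# Made-up sample data.
add_file("trip.jpg", b"jpeg bytes", ["photo", "2019"])
add_file("notes.txt", b"some notes", ["2019"])
add_file("cat.jpg", b"other jpeg", ["photo"])
```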

      1. 1

        I think relational vs key-value is a red herring here. You could implement the tagging scheme in a document database or in a relational one.

        The big things that bug me about filesystems are:

        • They don’t provide durability/integrity of application data, only their own metadata. As recounted elsewhere, the simple act of reliably updating a file is remarkably hard to do, requires platform-specific steps, and almost no one seems to do it correctly except database engines. Saving also requires copying the entire file, unless you use complex file formats and algorithms that allow safe update-in-place, which again only databases do.
        • They provide almost no structure for data, just a couple of attributes like filename and last-modified. (Linux apparently doesn’t even ensure the name is any kind of valid string.) OK, some filesystems have extended attributes but they’re limited and vary between file systems and platforms.
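
        For reference, this is roughly what the "reliable update" dance looks like on a POSIX system, sketched in Python. `atomic_write` is a name I made up, and this is a minimal sketch, not a complete recipe: it does not preserve the old file's permissions or ownership, and on some platforms `fsync` alone may not force data out of the drive cache (macOS has a separate `F_FULLFSYNC` fcntl for that).

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the final rename cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the file contents to stable storage
        os.replace(tmp, path)     # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # Persist the rename itself by syncing the directory (POSIX only).
    if os.name == "posix":
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```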
    2. 4

      I’m going to post my Hacker News comment about this article here.

      This post has good ideas, but there are a few things wrong with this.

      First, we forget that filesystems are not hierarchies; they are graphs, whether DAGs or not.

      Second, and this follows from the first, both tags and hierarchy are possible with filesystems as they currently are.

      Here’s how you do it:

      1. Organize your files in the hierarchy you want them in.
      2. Create a directory in a well-known place called tags/ or whatever you want.
      3. For every tag <name>, create a directory tags/<name>/
      4. Hard-link each file you want to tag into every tag directory that applies to it.
      5. For extra credit, create a soft link pointing to the same file, but with a well-known name.
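
      The steps above could be scripted along these lines. This is a sketch in Python, not a finished tool: `tag_file` is a hypothetical helper, and the `.where` suffix is just one guess at what a "well-known name" for the soft link might be.

```python
import os
import tempfile

def tag_file(root: str, rel_path: str, tag: str) -> None:
    """Hard-link root/rel_path into root/tags/<tag>/ and add a symlink beside it."""
    src = os.path.join(root, rel_path)
    tag_dir = os.path.join(root, "tags", tag)
    os.makedirs(tag_dir, exist_ok=True)  # steps 2 and 3
    name = os.path.basename(rel_path)
    # Step 4: the hard link keeps the data reachable even if the original is moved.
    os.link(src, os.path.join(tag_dir, name))
    # Step 5: the symlink records where the file lived in the hierarchy.
    # The ".where" suffix is an assumed naming convention.
    os.symlink(os.path.abspath(src), os.path.join(tag_dir, name + ".where"))

# Step 1: a throwaway hierarchy with one file (names are made up).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "docs"))
with open(os.path.join(root, "docs", "report.txt"), "w") as f:
    f.write("quarterly numbers\n")

tag_file(root, "docs/report.txt", "work")
tag_file(root, "docs/report.txt", "2019")
```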

      This allows you to use the standard filesystem tools to get all files under a specific tag. For example,

      find tags/<name> -type f

      (The find on my machine does not follow symbolic links and does not print them if you use the above command.) If you want to find where the file actually is in the hierarchy, use

      find -L tags/ -xtype l

      Having both hard and soft links means that 1) you cannot lose the actual file if it’s moved in the hierarchy (the hard link will always refer to it), and 2) you can either find the file in the hierarchy from the tag or you know that the file has been moved in the hierarchy.

      Also, if you want to find files under multiple tags, I found that the following command works:

      find -L tags/tag1 tags/tag2 -xtype l | xargs readlink -f | sort | uniq -d

      I have not figured out how to find files under more than one tag without following the links, but it could probably be done by taking the link name and prepending where the link points to plus a space, then sorting on where the link points.
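
      Here is one way that idea might look in Python: pair each link with the target it stores, group by target, and keep the targets that appear under every tag. `os.readlink` only returns the stored target text and never resolves it. The function name and the demo tree are assumptions made up for illustration.

```python
import os
import tempfile
from collections import defaultdict

def files_in_all_tags(tags_root: str, tags: list[str]) -> list[str]:
    """Targets that have a symlink in every listed tag directory."""
    seen = defaultdict(set)  # target path -> set of tags that link to it
    for tag in tags:
        tag_dir = os.path.join(tags_root, tag)
        for entry in os.scandir(tag_dir):
            if entry.is_symlink():
                # readlink returns the stored target text; it does not resolve it.
                target = os.readlink(entry.path)
                # Make relative targets comparable across tag directories.
                target = os.path.normpath(os.path.join(tag_dir, target))
                seen[target].add(tag)
    return sorted(t for t, found in seen.items() if len(found) == len(set(tags)))

# Tiny demo tree (all names here are made up for illustration).
root = tempfile.mkdtemp()
for tag in ("tag1", "tag2"):
    os.makedirs(os.path.join(root, "tags", tag))
a = os.path.join(root, "a.txt")
b = os.path.join(root, "b.txt")
os.symlink(a, os.path.join(root, "tags", "tag1", "a"))
os.symlink(a, os.path.join(root, "tags", "tag2", "a"))
os.symlink(b, os.path.join(root, "tags", "tag1", "b"))
```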

      Of course, I’m no filesystem expert, so I probably got a few things wrong. I welcome smarter people to tell me how I am wrong.

      1. 6

        The hard/soft link scheme has some problems. The way most applications save files breaks hard links, because for safety you have to write to a new file and then rename the new file over the old one. A symlink will survive that, but if you both save and move/rename a file between two checks of the tag, both of your links are broken.
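
        A small Python experiment shows the effect. It assumes a filesystem that supports both hard links and symlinks, and the file names are made up.

```python
import os
import tempfile

d = tempfile.mkdtemp()
doc = os.path.join(d, "doc.txt")
hard = os.path.join(d, "hard-link.txt")
soft = os.path.join(d, "soft-link.txt")

with open(doc, "w") as f:
    f.write("version 1")
os.link(doc, hard)     # hard link: a second name for the same inode
os.symlink(doc, soft)  # symlink: stores the path, not the inode

# Safe save: write the new contents to a temp file, then rename it over the original.
tmp = doc + ".tmp"
with open(tmp, "w") as f:
    f.write("version 2")
os.replace(tmp, doc)

def read(path: str) -> str:
    with open(path) as f:
        return f.read()

hard_content = read(hard)                # the hard link still names the old inode
soft_content = read(soft)                # the symlink follows the path to the new file
same_file = os.path.samefile(doc, hard)  # False once the save has replaced the inode
```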

        In a way you’re trying to reinvent the file alias, which has existed on macOS since 1991. An alias is like a symlink but also contains the original’s fileID (like a hard link), and if the file’s on a remote volume it has metadata allowing the filesystem to be remounted. macOS got around the safe-save problem with an FSExchangeFiles system call that preserves the original file’s fileID during the rename.

        At a higher level, though, I think your argument is similar to saying “you can already do X in language Y, because Y is Turing-complete.” Which is true, but irrelevant if doing X is too awkward or slow, or incompatible with the way everyone uses Y. Apple’s Spotlight metadata/search system represents this approach applied to a normal filesystem, but it’s still pretty limited.

        As an example of how things could be really, fundamentally different, my favorite mind-opening example is NewtonOS’s “soup”.

        1. 5

          It’s worth noting that this is more or less what BFS did. It provided four high-level features:

          • Storage entities that contained key-value pairs.
          • Storage for small values.
          • Storage for large values.
          • Maps from queries to storage entities.

          Every file is an entity and the ‘contents’ is typically either a large or small value (depending on the contents of the file) with a well-known key. HFS-style forks / NTFS alternative data streams could be implemented as other key-value pairs. Arbitrary metadata could also be stored with any file (the BeOS Tracker had some things to grab ID3 tags from MP3s and store them in metadata, for example).

          BeOS provided a search function that would crawl metadata and generate a set of files that matched a specific query. This could be stored in BFS and any update to the metadata of any file could update the query. Directories were just a special case of this: they were saved queries on a key-value pair identifying a parent-child relationship.
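
          As a toy illustration of that model (emphatically not the BFS API or its on-disk design), here is an in-memory Python sketch where entities carry key-value attributes, saved queries are predicates, and every metadata update refreshes the saved results. `ToyStore` and all of its method names are invented for this example.

```python
from typing import Callable, Dict, Set

Attrs = Dict[str, object]
Predicate = Callable[[Attrs], bool]

class ToyStore:
    """In-memory stand-in for entities with attributes plus saved ("live") queries."""

    def __init__(self) -> None:
        self.entities: Dict[int, Attrs] = {}
        self.queries: Dict[str, Predicate] = {}
        self.results: Dict[str, Set[int]] = {}

    def save_query(self, name: str, predicate: Predicate) -> None:
        """Store a query and compute its initial result set."""
        self.queries[name] = predicate
        self.results[name] = {i for i, a in self.entities.items() if predicate(a)}

    def set_attr(self, entity_id: int, key: str, value: object) -> None:
        """Update one attribute, then refresh every saved query for this entity."""
        attrs = self.entities.setdefault(entity_id, {})
        attrs[key] = value
        for name, predicate in self.queries.items():
            if predicate(attrs):
                self.results[name].add(entity_id)
            else:
                self.results[name].discard(entity_id)

store = ToyStore()
# A directory is just a saved query on the (assumed) "parent" attribute.
store.save_query("dir:/music", lambda a: a.get("parent") == "/music")
store.save_query("artist:Bowie", lambda a: a.get("artist") == "Bowie")

store.set_attr(1, "parent", "/music")
store.set_attr(1, "artist", "Bowie")
store.set_attr(2, "parent", "/docs")
store.set_attr(2, "parent", "/music")  # "moving" a file is only a metadata update
```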

          The problem is not that filesystems can’t represent these structures; it’s that:

          • Filesystems other than BFS don’t have a consistent way of representing them (doubly true for networked filesystems), and
          • UIs don’t expose this kind of abstraction at the system level, so if it exists it’s inconsistent from one application to another.
          1. 2

            (I think the correct spelling is “BeFS”.) BeFS designer Dominic Giampaolo went on to Apple and applied a lot of these concepts in Spotlight. It’s not as deeply wired into the filesystem itself, but provides a lot of the same functionality.

            1. 6

              > I think the correct spelling is “BeFS”

              I take Dominic’s book describing the FS as the canonical source for the name, and it uses BFS, though I personally prefer BeFS.

              > BeFS designer Dominic Giampaolo went on to Apple and applied a lot of these concepts in Spotlight. It’s not as deeply wired into the filesystem itself, but provides a lot of the same functionality.

              Spotlight is very nice in a lot of ways, but it is far less ambitious. In particular, Spotlight had the design requirement that it should work with SMB2 shares with the same abstractions. Because Spotlight maintains the indexes in userspace, it is possible for them to get out of sync (and actually quite easy, which then makes the machine really slow for a bit while Spotlight tries to reindex everything, and things like Mail.app search just don’t work until it’s finished). Spotlight also relies on plugins to parse files, rather than providing structured metadata storage, which means that the same file on two different machines may appear differently in searches (saved or otherwise). For example, if you put a Word document on an external disk and then search for a keyword stored in its metadata, it will be found. If you then plug this disk into a machine that doesn’t have Word installed, it won’t be. In contrast, with the BFS model Word would have been responsible for storing the metadata, and it would then have been preserved everywhere.

          2. 2

            I like your idea. It made me realize that one little system of my own is, in fact, tagging: I have a folder called to-read that contains symlinks to documents I’ve saved.

            Tangentially: I want rich Save dialogs. Current ones only let you save a file. I would love it if I could

            • Save a file
            • Set custom file attributes like ‘downloaded-from’ or ‘see-also’ or ‘note-to-self’
            • Create symlinks or hardlinks in other directories
            • Check or select programs/scripts to run on the newly-created file?
            • All in one dialog
            1. 2

              Then somebody will edit their file with an editor that uses atomic rename, and your hard links will all be busted.

              1. 2

                TBH, this sounds like a fragile system that only addresses the very shallow benefits of a more db-like object store.

              2. 1

                You want hierarchies, but hierarchies in the tags, not the filesystem.