1. 52

  2. 3


    We analyze the I/O behavior of iBench, a new collection of productivity and multimedia application workloads. Our analysis reveals a number of differences between iBench and typical file-system workload studies, including the complex organization of modern files, the lack of pure sequential access, the influence of underlying frameworks on I/O patterns, the widespread use of file synchronization and atomic operations, and the prevalence of threads. Our results have strong ramifications for the design of next generation local and cloud-based storage systems.

    The iBench tasks also illustrate that file systems are now being treated as repositories of highly-structured “databases” managed by the applications themselves. In some cases, data is stored in a literal database (e.g., iPhoto uses SQLite), but in most cases, data is organized in complex directory hierarchies or within a single file (e.g., a .doc file is basically a mini-FAT file system). One option is that the file system could become more application-aware, tuned to understand important structures and to better allocate and access these structures on disk. For example, a smarter file system could improve its allocation and prefetching of “files” within a .doc file: seemingly non-sequential patterns in a complex file are easily deconstructed into accesses to metadata followed by streaming sequential access to data.

    1. 2

      One thing I’ve been thinking about lately: could SQLite be used as the configuration file as well? That avoids many problems with parsing, but makes editing by end users considerably more painful.

      1. 15

        I’ve had a bunch of people suggest this to me over the years in response to my YAML post, and I think both editing and reading it will be painful.

        With a simple config file (e.g. key = value # comment) it’s quite easy to make even extensive configs quite readable, add explanation/context where needed, etc. But with SQLite that seems a lot harder to me, even if you add a comments column for this.

        Want to update a few words in that comments column or remove a sentence? Not so easy.

        If your application is intended to be primarily modified through a GUI or some such then it’s probably a reasonable choice, but even there a simple config file would do just as well in most cases (it’s not that hard to parse a config file).

        1. 8

          Reading a config file is easy, but robustly reading and writing them might be hard.

          • When serializing after your software changed its config, do you make sure to keep the original order (which the user would expect)? Does your config file format library make this easy?
          • When serializing after your software changes its config, do you keep comments? Does your config file format library make this easy?
          • When writing your new config file, do you do that atomically? I’ve both written and encountered programs which would just corrupt their config file when saving while the disk is full.
          • What happens if a user changes the config while the software is running? Do you attach an inotify listener to re-read the file when it changes? (What if the user writes an invalid config file which they didn’t intend to take effect yet?) Or do you just not care? (Do you make sure not to overwrite the user’s config if they change it? How will your config file conflict resolution UI work?)
          • What happens when a user writes an invalid config file? You can’t just crash a graphical program; does your config file parsing library make it easy to read the data that’s not corrupt while informing the user of which parts are corrupt?
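
The last question in the list above (reading the valid parts of a config while reporting the broken ones) can be sketched as follows. This is a minimal illustration for a `key = value # comment` format; the function name `parse_config` and the sample keys are made up for the example, not taken from any particular library.

```python
def parse_config(text):
    """Parse `key = value # comment` lines; return (settings, errors)."""
    settings, errors = {}, []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop a trailing comment
        if not line:
            continue  # blank or comment-only line
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            # Keep going, but remember which line could not be read.
            errors.append((lineno, raw))
            continue
        settings[key] = value
    return settings, errors


sample = "theme = dark  # UI theme\nthis line is broken\nfont_size = 12\n"
settings, errors = parse_config(sample)
```

A GUI program could apply `settings` and show `errors` to the user instead of crashing. Note that this toy format cannot express a `#` inside a value.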

          I’m not against using human-read/writable config files for all kinds of software. It’s probably what I would do to be honest. There are good (or at least good enough) answers to all of these questions. However, it’s not as straightforward as you’re implying to use a simple text config file format as a program’s configuration file if you want the config to be also modified through a GUI.

          1. 5

            All valid points as such, but I’m not sure if this is really a big problem in most cases. If you allow editing manually and have an interface to edit it then a # DO NOT EDIT WHILE THE APPLICATION IS RUNNING comment at the top is an easy way to solve most confusion/problems. It’s simplistic and crude perhaps, but generally speaking people mostly want text config files so they can bootstrap their machines and/or share extensive configurations, rather than fiddle with their config while the app is running.

            Another solution is to use two files: one that’s generated by the GUI, and one that the user can edit, which overrides the settings from the automatically generated one. This is what Firefox does with prefs.js and user.js for example.
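
The two-file idea can be sketched roughly as below. This only illustrates the layering, assuming a simple `key = value` format; it is not how Firefox itself parses prefs.js and user.js, and the names `parse` and `effective_config` are hypothetical.

```python
def parse(text):
    """Parse simple `key = value` lines, ignoring blanks and # comments."""
    result = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def effective_config(generated_text, user_text):
    """Layer the user's file on top of the program-generated one."""
    merged = parse(generated_text)   # written only by the program/GUI
    merged.update(parse(user_text))  # written only by the user; wins on conflict
    return merged


generated = "theme = light\nfont_size = 12\n"
user = "# my overrides\ntheme = dark\n"
config = effective_config(generated, user)
```

Because each file has exactly one writer, the program never has to preserve the user's comments or ordering.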

            I think there’s a place for SQLite configuration, but you’re making sharp trade-offs with it and it’s probably only worth it in fairly specialized cases.

            1. 2

              I know that the point of your comment isn’t really for those questions to be answered, because you do say that you know answers exist, but some of the answers really are pretty trivial.

              • I don’t think that software should change its own configuration file. If it does though then it probably should maintain the contents of the original file other than the bit it changed, which is actually not that hard if you put a little thought into the parser and parse to a CST that retains the spans of input text on the syntax tree nodes. But probably you should have two separate files, one of which is only touched by the programme and the other only by the user, where the user’s preferences override the programme’s. This also fixes the problem of the developer changing the defaults in an update, resetting the preferences of users.

              • If your software can’t robustly read and write files to the disk then sort that out, because the whole programme is unreliable if that is true, not just the configuration. Sort out your primitives. If you have good primitives they’ll work fine on config files.

              • If the user wants to reload the config file they run reload.

              • If you try to reload with malformed configuration, do nothing and print an error. If you try to start with malformed configuration, do nothing and print an error. What you should not do is try to ‘DWIM’.
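
As a rough illustration of the first point (changing one setting while leaving the rest of the file untouched), here is a sketch that uses a targeted line-level substitution rather than a full CST with spans. It assumes a `key = value # comment` format, and `update_value` is a hypothetical helper name.

```python
import re


def update_value(text, key, new_value):
    """Return `text` with the value of `key` replaced; comments, order and spacing stay as they were."""
    pattern = re.compile(
        r"^(?P<head>[ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*)"
        r"(?P<value>[^#\n]*?)"
        r"(?P<tail>[ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash-escape surprises in new_value.
    new_text, count = pattern.subn(
        lambda m: m.group("head") + new_value + m.group("tail"), text, count=1
    )
    if count == 0:
        raise KeyError(key)
    return new_text


original = "# Display settings\ntheme = light  # try 'dark' at night\nfont_size = 12\n"
updated = update_value(original, "theme", "dark")
```

A real format with quoting, sections or multi-line values would need the span-preserving parser the comment describes; this shortcut only works because the toy format is line-oriented.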

              1. 3

                All those answers are valid for some kinds of applications, but they’re not universal.

                • Most graphical software written for non-developers (and honestly most graphical software written for developers too) wants a way to change preferences through a GUI. You’re probably right that it’s fairly easy to write a parser from scratch which preserves the spans of text, but do existing config parsing deserialization/serialization libraries do that? My experience is that they don’t, meaning you have to write an entire parser/serializer yourself just for your config file. If you wanna use a standard like YAML or TOML, writing a correct parser isn’t a small amount of work.
                • My experience is that most software uses the POSIX open/write/close/rename interfaces directly (or thin wrappers), and does atomic writes manually by creating a temp file and renaming it. Maybe an atomic file write function would be better, but you’re probably going to use your file format library’s writeToFile method anyway, which probably isn’t going to be atomic, meaning you still have to do the temp file + move dance manually.
                • The next two points are OK.
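
For reference, the temp file + rename dance mentioned above looks roughly like this in Python. It is a sketch, not a hardened implementation: `write_atomically` is a made-up name, and syncing the containing directory after the rename is left out.

```python
import contextlib
import os
import tempfile


def write_atomically(path, data):
    """Write `data` to `path` so readers see either the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic replacement of the target
    except BaseException:
        # On any failure (e.g. disk full) the original file is left untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Calling `write_atomically("app.conf", text)` on a full disk would raise and leave the previous `app.conf` intact, which is the failure mode the second bullet is about.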

                Curiously, the top comment on one of the posts on the front page discusses how Firefox fails to make sure it writes JSON data atomically, leaving you with a corrupt or empty file: https://lobste.rs/s/xt82a0/performance_avoid_sqlite_your_next#c_eslys1

          2. 10

            I’ve worked with applications that did this before (or used similar formats), and honestly it’s a pain because it makes configuration management incredibly difficult. You have to either (a) keep the binary db file in your repo, making it harder to track changes; or (b) use scripts to execute commands against the db. And then build a bunch of error handling into your scripts for the tool you’re calling to configure the app…

            For single-user desktop apps, where you don’t do as much configuration management, db-based config can be less painful. Until one of your users decides to manage a fleet of laptops with the app; or hack it into some automated pipeline and run it in a VM in a data center, so now it’s actually a prod app.

            … and as I’m typing, I realize I’ve actually been paged for production incidents involving all of those scenarios except the fleet of laptops one. Sigh. So yeah, I still strongly prefer text configuration.

            1. 2

              Exactly this: the review process + git makes this incredibly awkward.

              pseudo-code example:

              case App.env() do
                :dev -> Config.find(api: "dev")   # -> dev.api.com
                :prod -> Config.find(api: "prod") # -> prod.ap.com (typo lives in the db, not in the diff)
                other -> raise "unknown App environment #{inspect(other)}, refusing to boot"
              end

              This would make the fact that ap.com was committed non-trivial to catch in review, and it would not show up during dev/staging testing.

              1. 1

                I don’t disagree, but there are solutions, such as a --csv option that will read a CSV file of the config, or a --sql option that will read a SQL file/run a command. Then it’s no big deal to store the config(s) in plain text in a VCS, etc.

                We don’t have a --csv option, but we do have a --sql option.

                there are def. trade-offs, It’s not magically delicious, but then nothing in tech usually is.

              2. 3

                I guess you could, but I don’t really love the idea. Configurations are usually small and simple and seldom modified. That doesn’t seem to fit the RDBMS paradigm well or play to SQLite’s advantages. Maybe if you already had SQLite in your codebase for something else and/or you have a really complex configuration for some reason. I’d prefer JSON/YAML/TOML or something most of the time.

                1. 2

                  I do exactly this. It’s not overly hard to configure, we give a --get/set key value CLI to the config table, plus encourage them to use sqlite3 themselves. For GUIs it’s just a table view. The config table holds the default values as well, so there is never any doubt as to what the value is.
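
A config table of the kind described here might look something like the sketch below. The schema, column names and helper functions are assumptions made for illustration; the comment does not show the actual layout.

```python
import sqlite3


def open_config(path=":memory:"):
    """Open (or create) a database holding a key/value config table with defaults."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS config ("
        " key TEXT PRIMARY KEY,"
        " value TEXT NOT NULL,"
        " default_value TEXT NOT NULL)"
    )
    return db


def define(db, key, default):
    """Register a setting with its default, keeping any value already stored."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO config (key, value, default_value) VALUES (?, ?, ?)",
            (key, default, default),
        )


def set_value(db, key, value):
    with db:
        cur = db.execute("UPDATE config SET value = ? WHERE key = ?", (value, key))
    if cur.rowcount == 0:
        raise KeyError(key)  # refuse to create settings the program does not know


def get_value(db, key):
    row = db.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


db = open_config()
define(db, "theme", "light")
set_value(db, "theme", "dark")
```

A `--get`/`--set` style CLI would be a thin wrapper around `get_value` and `set_value`, and the same table stays readable with the sqlite3 shell.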

                  1. 1

                    It’s not overly hard to configure, we give a --get/set key value CLI to the config table, plus encourage them to use sqlite3 themselves.

                    You say not hard to configure, but that sounds a lot harder to configure to me than just editing a simple text file. It seems okay for things where the configuration variables are all simple booleans, strings or integers. But what if they’re more complicated? Imagine trying to configure Postfix with a CLI:

                    $ sudo postfix config --set smtpd.recipient_restrictions 'permit_mynetworks, permit_sasl_authenticated, reject_unknown_client_hostname, reject_unknown_sender_domain, reject_unknown_recipient_domain, reject_invalid_hostname, reject_non_fqdn_sender'
                    $ sudo postfix config --get smtpd.helo_restrictions
                    $ sudo postfix config --set smtpd.relay_restrictions '

                    The cases where sqlite configuration isn’t an issue are the cases where text configuration is trivial and the cases where it does create big issues are the cases where text configuration is necessary.

                    1. 1

                      That’s pretty much http://www.postfix.org/postconf.1.html.

                      But yes, I too prefer editing the configuration file with an editor. (And postconf has more options, and is presumably more intended as a scripting target than as an interactive command?)

                      1. 1

                        Your example is a touch annoying, but it’s not difficult. What becomes difficult is when your config has complex relationships between config items. It’s easy to store, YAY FKs, but it’s not obvious how to expose that easily with a CLI.

                        In those cases, if we can’t figure out a nice CLI, we generally just use the GUI, or allow them to execute SQL directly against the config table:

                        myprog --sql 'insert …'

                        But we try to avoid those sorts of config options if possible.

                        The upside is that it’s very easy when debug time comes around: they just ship us the .db file, since we store the last-run args, etc., and we store logs in an audit table as well (we still do the standard stdout logging too). Also, we have none of the issues where IO nightmares come home to roost, like files being half-written or corrupted. It happened often enough in our deployments before we switched that it was annoying for sure.

                        there are def. trade-offs, It’s not magically delicious, but then nothing in tech usually is.

                    2. 2

                      Arcan (desktop framework) does this: https://github.com/letoram/arcan/wiki/Configuration-Support

                      It provides a FUSE layer into the database, too.

                      1. 2

                        If you provide nice GUI and CLI tools or an API to manage such configuration, it might be a great option, except for one detail: version control. Many people manage their configuration using a VCS (Mercurial, Git, Fossil etc.) and want to see what has changed. Classic diff is quite useless on binary files and databases. Text formats (e.g. XML) are much more VCS-friendly. You might version SQL dumps, which are also text. Or you can provide a diff tool for databases… It depends on the situation: if your users are consumers, they would probably never tweak the config files by hand or manage them in a VCS.

                        1. 3

                          Version control should be easier to do on a database, as we have much more granular data. You can say this field changed, which is better than saying this whole line changed. It’s just that our tools for text diffs are better “right now”.
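
To make "this field changed" concrete, here is a small sketch of a field-level diff between two snapshots of a key/value config. The snapshots are plain dictionaries (as you might read them out of a config table), and `diff_config` is a hypothetical helper, not an existing tool.

```python
def diff_config(old, new):
    """Return a list of (change, key, old_value, new_value) tuples, sorted by key."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            changes.append(("removed", key, old[key], None))
        elif key not in old:
            changes.append(("added", key, None, new[key]))
        elif old[key] != new[key]:
            changes.append(("changed", key, old[key], new[key]))
    return changes


before = {"api": "prod.api.com", "log_level": "info", "retries": "3"}
after = {"api": "prod.ap.com", "log_level": "info", "timeout": "30"}
changes = diff_config(before, after)
```

Each entry names the exact setting that moved, which is the granularity the comment argues a database makes possible.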

                          1. 1

                            Yes, you’re right, that is one of the pretty important points: most configurations now sit in version control systems, and seeing the differences between deployments is quite important for ops teams.

                            True, I could write a script that populates such a database when I’m deploying a service, and that would be the new config. However, that would just add more complexity to our already complex systems.

                            1. 2

                              Actually, this is not uncommon and has been used in the real world for a long time, e.g. in Postfix you have pairs of files like:

                              $ file /etc/postfix/client_checks*
                              /etc/postfix/client_checks:    ASCII text
                              /etc/postfix/client_checks.db: Berkeley DB (Hash, version 9, native byte-order)

                              You edit the first one, the text file. And during runtime, the database is created and used. The text form is converted to the more efficient database form. In your VCS you just ignore the .db files and diff the text ones.

                              Regarding the complexity: the question is whether the features provided by a DBMS (or any other library) outweigh its complexity. This differs from project to project; there is no universal answer to this question.
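
The text-source, database-at-runtime split can be imitated in a few lines. This sketch uses SQLite instead of Berkeley DB and is not how Postfix builds its .db files; the `key value` line format, the sample entries and the name `compile_lookup` are assumptions for illustration.

```python
import sqlite3


def compile_lookup(text):
    """Build an in-memory lookup table from `key value` lines; the text stays the source of truth."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lookup (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)  # key, then the rest of the line
        rows.append((parts[0], parts[1] if len(parts) == 2 else ""))
    with db:
        db.executemany("INSERT OR REPLACE INTO lookup VALUES (?, ?)", rows)
    return db


source = "# hypothetical client checks\n192.0.2.10 REJECT\nexample.com OK\n"
db = compile_lookup(source)
```

Only the text goes into version control; the database is a derived artifact that can be rebuilt at any time.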

                            2. 1

                              if your users are consumers, they would probably never tweak the config files by hand or manage them in a VCS.

                              Some will want to though. If your users are consumers then they’re many and varied. Now you might not want to put a lot of effort into letting your users control their configuration with git but it doesn’t hurt anyone or force you to make negative tradeoffs so why not?

                              1. 1

                                We solved this with a --sql option. So you store the plaintext .sql file in VCS/etc. and just read it in. Since we store more than just config, we can do a lot of end-to-end testing this way as well, since we can set up our state to be whatever we want with a given SQL file, and then we just have to test the output.
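
Reading a plain-text SQL file into a database, and dumping it back out as text for a VCS, can be sketched with Python's sqlite3 module as below. The table layout and values are invented for the example, and this says nothing about how the commenter's --sql option is actually implemented.

```python
import sqlite3

# Stand-in for the contents of a .sql file kept in version control.
CONFIG_SQL = """
CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT NOT NULL);
INSERT INTO config VALUES ('api', 'dev.api.com');
INSERT INTO config VALUES ('log_level', 'debug');
"""


def load_sql(sql_text):
    """Create a fresh in-memory database and run the SQL text against it."""
    db = sqlite3.connect(":memory:")
    db.executescript(sql_text)
    return db


def dump_sql(db):
    """Serialize the database back to SQL text that can be diffed and committed."""
    return "\n".join(db.iterdump())


db = load_sql(CONFIG_SQL)
dumped = dump_sql(db)
```

Loading `dumped` into a new database reproduces the same rows, so the text form can act as the reviewed artifact while the program works against the database.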