1. 47
  1. 23

    Former Mozilla employee here.

    This wiki page is ancient. There was a strong anti SQLite phase several years ago. I’m not sure where things are today, but by the time I left, people had warmed back up to SQLite a bit. Although many points in the wiki page are still accurate.

    I think a large part of the anti SQLite mindset in Firefox was driven by a combination of lack of understanding on how to optimally use SQLite and actually achieving an optimal configuration through all consumers of it. In its default configuration, SQLite will fsync() after every transaction. And SQLite implicitly creates a transaction for every statement performing a mutation. This can suck performance like no other. You have to issue a bunch of PRAGMA statements when opening a database to get things to behave a bit better. This sacrifices durability. But many database writes aren’t critical and can be safely lost. e.g. do you care if your browser history for number of times you visited a page is off by 1?

    The performance impact of continuous SQLite writes with high durability guarantees is demonstrated in my favorite commit which I authored while at Mozilla: https://hg.mozilla.org/mozilla-central/rev/6b064f9f6e10b25bf2edc916b1b9a8b78cff0465. tl;dr a test harness loading thousands of pages and triggering writes to the “places” database (which holds site visit information) was incurring ~50 GB of I/O during a single test run.

    Compounding matters was Firefox’s code historically performed I/O on the main thread. So if there was a read()/write() performed on the main thread (potentially via SQLite APIs), the UI would freeze until it was serviced. Hopefully I don’t have to explain the problem with this model.

    One of my contributions to Firefox was SQLite.jsm (https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Sqlite.jsm), a JavaScript module providing access to SQLite APIs with batteries included so consumers would get reasonable behavior by default. The introduction of this module and subsequent adoption helped fix a lot of the performance issues attributed to SQLite.

    IMO the biggest issue with Firefox’s data storage was there were literally dozens of components each rolling their own storage format and persistence model. This ranged from SQLite databases, JSON files, plain text files, bespoke file formats, etc. This represented 10+ years of accumulated technical debt. There have been various efforts to unify storage. But I’m unsure where they are these days.

    1. 16

      I’ve been bitten once by the modern use of JSON files in firefox. I don’t have a bug report at hand, but the issue is that since the whole file gets rewritten on save, a crash at the wrong time can make you end up with an empty (0 bytes) file. This happened to me with the password database, and it was quite hard to fix. I ended up creating a new profile. SQLite databases tend to resist damage a lot better.

      1. 5

        That seems like a fixable problem. The trick would be to write the file with a temporary name, then fsync, then rename it over the top of the old one. Depending on when it crashes, you might lose the changes, but you shouldn’t get torn writes.

        1. 1

          rename() should be atomic on any sane posix filesystem. might be able to skip the whole fsync() call altogether.

          1. 3

            But then the atomic rename may be of an incomplete file…

            1. 1

              Someone correct me if I’m wrong, but for my own understanding it sounds like the process is:

              • Write output to temp file
              • One way or another ensure temp file gets committed to disk
              • Rename temp file to real file

              And then in the case of a crash, start over from the beginning? That may lose you data in a crash but will always be consistent. Otherwise you need to verify that the temp file is valid before doing stage 3, or detect a crash and attempt to recover something, or something along those lines…

              1. 1

                Perhaps I’m mistaken and doing dumb things, but I’ve intentionally induced power faults testing my temp to rename over method without fsync and it seems to work, vs opening and editing or just overwriting the file underlying.

                1. 3

                  iirc ext4 filesystem actually inserts an fsync for you in this case because it is such a common mistake.

              2. 1

                When you’re dealing with crashes, all bets are off

              3. 1

                Funny you say that, this is exactly what FF does for downloads.

                1. 0

                  To encourage less SQLlite usage, providing an API to manage this (compressed JSON files, loaded+saved via tmpfile/fsync/rename) seems a sensible approach, and probably not much more effort than writing this guide.

                    1. 1

                      Thanks, interesting, the non-default version (if you provide a tmpFile and ask for ‘flush’) seems to do this safely. I wonder if a higher-level API would be of use (save/load a json object, no need to provide tmpfile name etc)

                      The description of the ‘flush’ argument suggests it is doing an fsync(), rather than using stdio and doing ‘fflush’, but if so the terminology is confusing.

                      I would check, but I don’t want to check out all of firefox and but I’m having difficulty searching up “NativeOSFileInternals.cpp” to look.

                        1. 2

                          Try searchfox.org

                2. 2

                  A similar thing can happen if you run out of disk space. During writing the file will first get pruned, then some program will use up all the available space in the meantime and your write will fail with the end result being an empty file. This can be avoided by creating a temporary file in the same location as the destination file and then renaming it.

                3. 6

                  Note that this “vacuuming” only deals with internal fragmentation, not the external filesystem fragmentation

                  From the sqlite docs this does not seem accurate, as it recreates the database file. https://sqlite.org/lang_vacuum.html

                  Surely removing the old file does address filesystem fragmentation…

                  1. 1

                    But the new file might be re-fragmented by the OS. Applications have very little control whether the OS chooses to allocate a continuous chunk for their data

                    1. 3

                      But why is that relevant? if the OS is fragmenting fresh files, why is that being blamed on sqlite?

                  2. 5

                    Although I think it’s worth recognizing the mention that this is an old document at the start of it, I think that it’s a bit ridiculous that they talk about all of these problems they’ve seen with SQLite and assume that the developers who have had issues recognizing these things in SQLite will suddenly not run into most of them in JSON blobs.

                    1. 4

                      I’m surprised LMDB is not mentioned in the alternatives, I think Firefox does use it (Mozilla published a Rust binding..)

                      The more I look at LMDB, the more I want all software ever to access memory-mapped data directly and avoid copies as much as possible. Is this a ridiculous obsession or am I correct? :D

                        1. 2

                          As part of some internal libraries building, I’ve researched LMDB as a potential application file format a few years ago. The biggest critique I have is that once the size of the LMDB set, it cannot grow (at least cannot grow while the app is active). This is problematic for mobile apps or any app want to be memory conscious. You basically have to mmap large enough upfront with limited ways to grow / shrink dynamically. That coupled with some weird accounting on mobile OSes make it a no-go (I think certain iOS versions simply don’t really evict pages that was once dirty even if you msync back to SSD).

                          1. 4

                            Your criticism is correct in many ways, but still requires significant clarification.

                            We should distinguish between the limitations of an approach (using a single flat memory-mapped file) and the disadvantages of implementing it in the LMDB.

                            POSIX mmap API supported by (almost?) all OSes, including iOS and Android. This allows to dynamically increase and shrink a mapped file with insignificant restrictions:

                            • The mapping size (but not the file size) should be selected with a margin. This consumes PTE, but not RAM or disk space.
                            • Subsequent mapping extension (not the file, but memory-mapping itself) require either mremap() (with free address space behind the current mapping region), or a re-creation of the mapping is required (with data access suspended during this time).
                            • Reducing the size of the mapped data is easy to control by using posix_madvise(POSIX_MADV_DONTNEED) and reducing the file size. However, this requires defragmentation inside the database or calling madvise() for each page to be released (which is expensive for performance).

                            The problem with LMDB is that none of the above is done, or the corresponding API is not provided. So I would like to pay attention to libmdbx, where this problem was resolved:

                            About page evicting in mobile OSes I should clarify even more:

                            • Basically (for iOS) a used page in RAM could be in one of the statuses: clean (could be unloaded and then re-loaded from file), dirty (modified and should be written to file or swap), compressed (compressed by OS dirty page).
                            • It is no any problem to application to have a lot of clean pages, but it should avoid extra dirty and/or compressed pages.
                            • In case of libmdbx (and LMDB) this could be easy avoided just not using MDBX_WRITEMAP flag (MDB_WRITEMAP for LMDB).

                            P.S. As the main developer of libmdbx, I will be happy to answer any related questions.

                            1. 2

                              Thanks for the detailed answer. Looks interesting. I may be able to pick it up later sometime this year to re-evaluate. It was based on observations done a few years ago and certainly the landscape would change. It is also the first time I heard of libmdbx which sounds like to address some of the shortcomings.

                              Ignoring the fact that I haven’t read your code yet, just out of curiosity (I am always interested in mmap techniques), if we don’t mmap it as readable, what are the techniques you use for writes? Do you just fseek and fwrite or use a journal and consolidating all writes back during compactification?

                              1. 2

                                if we don’t mmap it as readable, what are the techniques you use for writes?

                                Thanks to the Unified Page Cache (aka Unified Virtual Memory) this is easy. I.e. if you map a local file read-only into RAM and pwrite() into this file such changes becomes immediately visible in the mapped memory region for all local processes which have mmap‘ed this file. Since 2000’x the Unified Page Cache implemented for almost all OSes, except OpenBSD, QNX and a few other embedded platforms. Without the unified page cache, a database file must be mapped read-write by each process that wants to perform write-transactions. See description of the MDBX_WRITEMAP option.

                                … consolidating all writes back during compactification?

                                The internals of MDBX are complex enough to be briefly explained. However, the detailed information about the internal mechanisms of MDBX and LMDB is easily found on the web, starting with https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database

                                Nevertheless you should keep in mind that MDBX is a descendant of LMDB and all improvements are described mostly just in the README (https://erthink.github.io/libmdbx/#improvements-beyond-lmdb) and in the API description (https://github.com/erthink/libmdbx/blob/master/mdbx.h).

                            2. 3

                              Maybe mdbx is better about this?

                              1. 2

                                Sure :)

                                See my answer below.

                          2. 6

                            It is bit funny if about „hidden complexity“ talks someone who’s codebase is over 22 millions lines of code (BTW: Chromium is even worse). Yes, one must start somewhere and there might be small purges on many places. But it is questionable, whether SQLite used for reliable storage of structured data is the right place to reduce the complexity.

                            However I am quite pessimistic about the Web as a whole – this is ill technology and I see not much chances to make it simple again (like in its beginnings).