1. 5

    That’s not how you spell ZFS.

    Bcachefs is too new and unproven. It will eat your data if you give it the chance by using it.

    1. 10

      Bcachefs is intended to succeed the current go-to options: Btrfs and ZFS. All filesystems have to start somewhere.

      1. 4

        Is there a guide to the planned improvements over ZFS?

        1. 9

          Apparently it has a much smaller codebase than ZFS or btrfs (and in fact, it’s slightly smaller than ext4) while being cleaner and more flexible.

          1. 6

            Not a technical improvement, but if all goes right, it would be included in the mainline kernel with a known-valid licence - which I don’t think ZFS will ever achieve.

            1. 4

              Not a technical improvement, but if all goes right, it would be included in the mainline kernel with a known-valid licence - which I don’t think ZFS will ever achieve.

              …on Linux, at least ;)

              1. 3

                At the same time it seems to be GPLv2 licensed, which means that a merge into the BSDs is also something we probably will not see. So almost all current users of ZFS will not bother with it.

                1. 2

                  Maybe that’s for the best. Both worlds can have great filesystems, and the two worlds don’t need to use the same great filesystems. I’m getting more and more convinced that the less BSD people and Linux people have to interact, the better.

                  1. 1

                    I’m still hopeful once HAMMER2 is more mature, it’ll get ported to other systems including Linux.

            2. 3

              To me, two things stand out compared to ZFS (at least as promised):

              1. True tiered storage support, unlike current L2ARC / slog in ZFS;

              2. Better array expansion than RAIDZ* disks allow.

              1. 1

                Number one is 100% correct, and no one should use an slog thinking it’s tiered storage. The only tiered storage ZFS does is tier-0 in memory (ARC, not L2ARC) and tier-1 to your pool, rate-limited by the slowest vdev. The ZFS devs have turned down requests for true tiered storage many times, so I don’t think it’s going to happen anytime in the next 5 years. (You can get it working by using ZFS with a dm-writecache device as the underlying “disk” but I think all bets are probably off as to the integrity of the underlying data.)

                But for number two, draid is probably the correct answer. I think it’s superior to what bcachefs is offering.

            3. 2

              But why not use what’s there, ZFS, or port HAMMER instead? If your focus is “reliability and robustness”, code that is used in production seems better than creating new code with new bugs.

              1. 3

                But why not use what’s there, ZFS, or port HAMMER instead? If your focus is “reliability and robustness”, code that is used in production seems better than creating new code with new bugs.

                Because licenses, to start somewhere (the Linux kernel can’t include BSD code, for example). It’s also hard to argue that HAMMER is proven due to its extremely small user base, even if it might be interesting from a technical standpoint.

                1. 6

                  Linux can include BSD licensed code - as long as there’s no advertising clause, it’s compatible with GPL.

                  See an old thread: http://lkml.iu.edu/hypermail/linux/kernel/0003.0/1322.html

                  1. 1

                    Linux can include BSD licensed code - as long as there’s no advertising clause, it’s compatible with GPL.

                    I stand corrected!

                    1. 1

                      This is true, but note that ZFS is CDDL, not BSD. See https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/

                      …but HAMMER would be fine from a license perspective.

                    2. 4

                      the Linux kernel can’t include BSD code, for example

                      What makes you think so? There’s quite a bit of code in the Linux kernel that started out BSD licensed.

                      1. 2

                        It’s totally legal to go either direction (BSD code mixed with GPL code or GPL code mixed with BSD code). There is a cultural stigma against the other license, depending on which license camp one is in. I.e. most BSD people dislike the GPL license and go out of their way to replace GPL’d code with BSD-licensed code (and the opposite for BSD code in Linux land).

                        As a more recent example, GUIX wasn’t invented in a vacuum; it’s partly (mostly?) GPL people unhappy with the Nix license.

                        One extra gotcha as @singpolyma points out, GPL code mixed with BSD code enforces the GPL license on the binaries, usually making BSD proponents unhappy.

                        1. 3

                          This is very misleading because if you add GPL to BSD then the combined work/a derived work can no longer be distributed as BSD.

                      2. 2

                        the Linux kernel can’t include BSD code

                        BSD is GPL-compatible, but the reverse isn’t true: BSD code can’t include GPL’d code.

                        1. 2

                          Technically BSD codebases can include all the GPL’d code they want. They can even keep the BSD license on the parts that don’t depend on the GPL part. The binaries would be GPL though and there are other reasons purists might avoid this.

                        2. 1

                          due to its extremely small user base

                          So, like, bcachefs? or nilfs2 (which is in mainline…).

                  1. 4

                    I’ve been enjoying all the discussion recently about methods people are using to be faster at programming, so I decided I’d contribute a list of my own.

                    I’d be eager to hear what everyone else does that I haven’t listed!

                    One area I know I’m quite lacking in is using good keyboard shortcuts and keyboard-based navigation as a whole. I still do a good amount of clicking around the editor to select stuff and move the cursor, and I’ve seen some vim users who take pride in never taking their hands off the keyboard.

                    1. 3

                      Honest question: Is it really the tooling that does it…? Or is it just icing on the cake? Compared to knowledge of algorithms, choosing the right language for the job, and stuff like that.

                      1. 1

                        Having to pick up the mouse from time to time is an annoying distraction though. When I moved from Vim to VSCode, not having learned the keyboard shortcuts felt quite disruptive to the tight modify-compile-run loop. Learning how to switch between the editor window and the terminal helped a lot.

                        1. 1

                          In case you haven’t seen it yet, the Vim extension for VSCode is really good. Less relearning that way.

                      2. 3

                        Long ago Bruce Tognazzini did some experiments showing that choosing a command from a GUI menu was measurably faster than pressing the keyboard shortcut, even though the people doing it felt that the keyboard was faster. (The exceptions were super common shortcuts like Copy and Paste that were burned into muscle memory.) He hypothesized that subjective experience of time was different when performing a physical task like moving a mouse, than for a memory based task like remembering a key command.

                        I’m sure an expert can get the full range of keyboard navigation commands burned into muscle memory to where it’s faster than mousing, but personally, I’m waiting for eye-tracking interfaces that can move the cursor to where I’m looking.

                        1. 12

                          Dan Luu did a rebuttal to Bruce’s claims: https://danluu.com/keyboard-v-mouse/

                          1. 3

                            The fundamental flaw with mousing is that you have to look (at things) to use it.

                            The operational cycle is: 1) actuate muscle motor skills, 2) watch mouse pointer movement, and other on-screen UI changes, 3) loop (1 -> 2) until target is reached, 4) complete task (click, hover, release drag)

                            With keyboard, assuming you are zeroed in on your home row position, you can execute the vast majority of your tasks without looking. The cycle is: 1) actuate muscles, 2) complete task

                            1. 3

                              Exactly. The point is not to minimize the amount of milliseconds that the operation takes, but to minimize the amount of distraction or context switching. Ideally one can edit the code without paying special attention to the editing process itself, so your mind can devote all space to solving the actual problem you are taking on.

                              I’d say this goal is easier to reach with keyboard shortcuts than with the mouse. But perhaps one can do this with the mouse too after enough training.

                            2. 1

                              Interesting finding! In my case, I opened a really old Delphi project recently (I hadn’t had Delphi installed in years), but the shortcuts were immediately in my muscle memory. What I used often stayed there, just like riding a bike. There are so many commands today that it’s nearly impossible to memorize ’em all, but what you use most frequently is worth the effort to memorize (which is like using it 6-8 times to learn?), since those will actually speed up your daily routine.

                            3. 1

                              I have this in my .spacemacs:

                                ;; Don't use the mouse in emacs. It's just annoying.
                                (load-file "~/.emacs.d/private/disable-mouse.el")
                              

                              https://github.com/purcell/disable-mouse/blob/master/disable-mouse.el

                            1. 2

                              Did I miss something or does the car not actually know where the target is? It just looks like it has overfitted to keeping some distance to the left and centering forwards and backwards?

                              1. 2

                                Yep. It also basically uses Genetic Programming to find the coefficient for a linear regression.

                                To do this with RL, you’d probably need a network a few layers deep, passing the target as an input (as well as the sensor input). Should be pretty simple to modify the author’s code to do so.
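
                                For the curious, here’s a toy sketch (my own illustration, not the author’s code) of what a genetic-style search for the coefficients of y ≈ a*x + b can look like: keep a population of candidate pairs, keep the fittest, and refill with mutated copies.

                                  import Foundation

                                  // Noisy samples of y = 3x + 1.
                                  let data: [(x: Double, y: Double)] = (0..<20).map { i in
                                      let x = Double(i)
                                      return (x, 3 * x + 1 + .random(in: -0.5...0.5))
                                  }

                                  // Fitness = negative squared error, so bigger is better.
                                  func fitness(_ c: (a: Double, b: Double)) -> Double {
                                      -data.reduce(0) { $0 + pow($1.y - (c.a * $1.x + c.b), 2) }
                                  }

                                  var population = (0..<50).map { _ in (a: Double.random(in: -10...10),
                                                                        b: Double.random(in: -10...10)) }
                                  for _ in 0..<200 {
                                      population.sort { fitness($0) > fitness($1) }
                                      let parents = Array(population.prefix(10))
                                      // Survivors plus four mutated children each: 10 + 40 = 50 again.
                                      population = parents + parents.flatMap { p in
                                          (0..<4).map { _ in (a: p.a + .random(in: -0.1...0.1),
                                                              b: p.b + .random(in: -0.1...0.1)) }
                                      }
                                  }
                                  print(population[0]) // should land near (a: 3.0, b: 1.0)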

                              1. 4

                                Read both the rant and the author’s response. I have no problem with weird GUI APIs, that is how it is. Some prefer it one way, some prefer it another. Consistency helps, but not all GUI APIs are consistent with each other.

                                I do have some comments on the object system based on the author’s response and the rant. There may be some historical baggage, and it is indeed difficult to do object systems within C’s constraints, but it can be done better.

                                1. On accepting everything as the root class: it is typical to have void * and typedef it as the root class for an object system, but nothing prevents you from only accepting “FilePanel *” if that is the only thing it accepts. The upcast to your root class is transparent, and if you ever need downcasting, it has to be explicit anyway, so having a method like evas_as_file_panel won’t be the end of the world with IDE assistance. C does force a flatter object hierarchy on you in this case, which may be a good thing (or may not).

                                2. On opaque identifiers and managing your own object pool: I am slowly coming around to saying that, unless the life-cycles of these objects are all the same, it is often more trouble than it is worth to maintain an opaque identifier rather than just use pointers. Tools such as valgrind / address sanitizers interact much better with raw pointers, so you don’t need to implement your own null-pointer detection or use-after-free detection logic. The only place you may need an opaque identifier is probably to implement an Entity-Component-System. You may be able to design a more efficient GUI API with ECS, but even in that case, a robust entity management system goes a long way (i.e., a generic auto-grown dense hashmap).

                                3. On manual memory management: it is true that in C, you have to document that some places need to free memory and some places don’t, and that is a fact of life. However, a consistent API helps. In my C API design, I normally couple every _new method with a _free method, and anything that doesn’t have this suffix has its memory managed automatically (maybe it is stack-allocated, or its life-cycle is bound to another object; many times it is apparent why). Having such consistency makes understanding the memory management idiom within your library really easy.

                                1. 8

                                  Working in a mobile-and-otherwise dev company of many projects and teams, the anecdote about Swift protocol-based code rings very true. Many enthusiastic Swift developers are tempted to use the most exotic corner of the language to accomplish anything. I’ve observed the same happening in TypeScript. It’s a struggle to get people to consider all their options and pick the simplest one (“Could this just be a function?”). What those interesting features exist to do is solve the problems that plain functions and objects don’t already solve just as well.

                                  There was once a WWDC talk that included the advice, “start with a protocol.” Later, when it was clear that advice had created a mess, there was another WWDC talk exhorting Swift developers not to take that line so literally.

                                  1. 2

                                    Yeah, I was coming here to say the same thing. Coming from a recently-mostly-Swift perspective, the part about Swift sounded familiar. I don’t think protocols are quite a feature of last resort, but most of the time you get by just fine without them, and at the same time you get to not worry about a bunch of surprises and design traps.

                                    1. 3

                                      Protocols are indeed useful at least as frequently as interfaces in regular OO languages. Just, not every function needs to be in a protocol extension, or a system framework type extension for that matter. Often all that does is change which argument gets to be self. Many discussions in our Slack have begun with a question about how to handle some difficult corner of protocols with associated types, and we’ve then discovered it wasn’t necessary to go down the road that led there. I sometimes challenge people to make it a free function, get it working, and only then determine which type would most idiomatically contain it as a method.

                                      1. 1

                                        I often find this rule useful: a protocol is only needed if you have two things that want to conform to it.

                                        Swift tries to orient you toward struct / protocol based programming, but mostly, it is still quite OO, and thinking through a protocol / struct lens can be counter-productive.

                                        At the end of the day, if you need to use an open class with generics, just go for it, write it down, and if you are not happy, you can always improve from there with fancier things (protocols, PATs, property wrappers, function builders …).
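
                                        To illustrate the rule of thumb, a minimal sketch (type names made up): start concrete, and only extract a protocol once a second conformer actually shows up.

                                          // While only one type renders, no protocol is needed.
                                          struct ImageCell {
                                              func render() { print("image") }
                                          }

                                          // Once a second cell type appears, a protocol starts paying for itself.
                                          protocol Renderable {
                                              func render()
                                          }

                                          struct TextCell: Renderable {
                                              func render() { print("text") }
                                          }

                                          // Retroactive conformance: ImageCell already has the right shape.
                                          extension ImageCell: Renderable {}

                                          let cells: [Renderable] = [ImageCell(), TextCell()]
                                          cells.forEach { $0.render() }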

                                  1. 1

                                    Predicting the next word lets the machine understand the relationships between words, potentially even over long ranges. Predicting images from text and vice versa lets the machine understand the relationships between words and pixels. Neither of these so far lets the machine learn physics, or how to interact with the world.

                                    The tools are here: we have some deep reinforcement learning algorithms, we have some mixture-of-experts + transformer networks. But it appears that we haven’t figured out a reasonable reward function or training architecture to formulate this well, so that a computer can start interacting with the real world and begin the world-concept building process.

                                    I don’t believe it is just about scaling, but I do believe we are getting closer.

                                    1. 2

                                      As the post itself says, this is an MoE (Mixture of Experts) model, so its parameter count is not really comparable to the parameter count of dense models like GPT-3 (175B). Google already did GShard in 2020, a 600B MoE model.

                                      It is a bit surprising to me that after more than a year since GPT-3, apparently no one(!) trained a larger dense model. What’s going on?

                                      1. 1

                                        It is a bit surprising to me that after more than a year since GPT-3, apparently no one(!) trained a larger dense model. What’s going on?

                                        It’s computationally expensive and conceptually quite difficult to train such a large, dense model. Difficult enough that GPT-3 still hasn’t been replicated reliably.

                                        1. 1

                                          I understand it is expensive and difficult, but there are lots of players with more money and more talent than OpenAI. My conclusion is no one is willing to invest, which in itself is surprising.

                                          1. 1

                                            It is not that surprising. Academia at large is not interested in replication; it doesn’t help publication. Industry replicates if it has real-world usage or helps promote something. At the moment, GPT-3 is in a “jack of all trades, master of none” situation in real-world usage. Google hasn’t disclosed the parameter count of its conversation bot (LaMDA, not the previous-gen Meena) at the moment; it is probably closer to GPT-3 in parameter count and far better at conversations.

                                            The best bet is that in the next 2 years, some hardware players will tout this to show off their hardware capabilities. It is happening later than it could, probably due to dataset issues (I don’t think OpenAI has made a ready-to-download dataset that trains GPT-3?). A rack of NVIDIA’s latest GPU servers can probably train GPT-3 in a month. It will probably show up once we can train such models in under a week on a rack or two.

                                      1. 4

                                        A solution to the problem said to exist with composition is to make variables context-local; e.g. in Godot, position refers to the position relative to the parent node, so it is clear where to handle it. Also, I think it’s worth pointing out that ECS doesn’t remove dependencies, it just obscures them: you will still inevitably have some systems depending on the behavior of other systems, but now the dependency isn’t represented in the type system.

                                        1. 1

                                          Can you elaborate? The parent post doesn’t make it clear, but an entity in ECS is normally just an int64, I guess that’s what you mean? But the dependencies (and types) should be expressed as dependencies between systems in ECS, not on either entities or components?
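
                                          For reference, here is a minimal sketch (names made up) of the “entity is just an id” idea: components live in per-type stores keyed by entity id, and a system is just a function over the stores it cares about.

                                            typealias Entity = Int64

                                            struct Position { var x, y: Double }
                                            struct Velocity { var dx, dy: Double }

                                            // Component stores: the entity itself holds no data.
                                            var positions: [Entity: Position] = [:]
                                            var velocities: [Entity: Velocity] = [:]

                                            // A "system" iterates over the components it cares about.
                                            func movementSystem(dt: Double) {
                                                for (entity, velocity) in velocities {
                                                    positions[entity]?.x += velocity.dx * dt
                                                    positions[entity]?.y += velocity.dy * dt
                                                }
                                            }

                                            let player: Entity = 1
                                            positions[player] = Position(x: 0, y: 0)
                                            velocities[player] = Velocity(dx: 1, dy: 0)
                                            movementSystem(dt: 0.016)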

                                          1. 1

                                            Hmmm, if the entity is just an id, then how does one keep track of which components are registered to an entity? Or is it usually the other way around, a component knows which entities it owns?

                                            EDIT: Looks like the answer to my second question is yes, based on http://bitsquid.blogspot.com/2014/08/building-data-oriented-entity-system.html

                                            1. 1

                                              With ECS there isn’t a good way to express that one system depends on the fact that the same entities must be processed by another system. The dependency is still there, but now it has no representation in the type system.

                                          1. 5

                                            I wanted to do the same thing, but then I realized it isn’t worth it. My internet is slower than 1 Gbit/s, and the disks in my NAS are about 1 Gbit/s, so my bottleneck isn’t really Ethernet at the moment. Also, these 10 Gbit/s switches require way more power, which means a higher electricity bill.

                                            1. 5

                                              Put a 1TB NVMe SSD in as L2ARC for your NAS and 10Gbps can make a meaningful difference. Especially for SMB, which doesn’t seem to have local caching support in either macOS or Windows. Seek-and-play will be much faster than over a 1Gbps connection because the content will be cached on the NVMe drive and can comfortably be served faster from that cache.

                                              My current problem is pretty stupid: zbackup doesn’t seem to work well with sshfs and cannot reach any meaningful speed when doing backups, never mind 10Gbps …

                                              1. 3

                                                Which is true for 99% of lobsters: 10G is still overkill unless you do bulk transfers daily or have access to a better uplink.

                                              1. 19

                                                Great article! Just want to call out one thing: rewrite decisions are mostly social, not technical. A rewrite is a great exercise for passing business-logic knowledge down from one developer to another, and for keeping the new developers engaged through bug-fixes and active ownership in general.

                                                1. 6

                                                  …and the reason why they fail so often is that the new developers fail to follow the business logic through the implementation details of the old system, or fail to understand the reasons behind it, and the authors of the original system often aren’t around to answer their questions.

                                                  1. 3

                                                    I wonder what people mean by failure in this context. Is it that the new implementation isn’t immediately exactly like the old one or that in the end the new implementation gets canned and all the work that went to it with it?

                                                    I’ve been a part of a few rewrite gigs and they all sucked at the MVP stage compared to the old system, but most of them eventually became better than the old one, some even massively better.

                                                    1. 3

                                                      in the end the new implementation gets canned and all the work that went to it with it

                                                      I meant these, usually in a context when the thing to be rewritten is old and big.

                                                1. 6

                                                  I don’t think Swift is ever going to be a real competitor as an app language on non-Apple platforms, given those platforms all have well-developed existing ecosystems for applications (.NET, Java/Kotlin, etc).

                                                  (But I don’t know what I don’t know. I’m talking as almost a complete outsider here.)

                                                  That leaves Swift as a really nice systems/server language in the same spaces as Go or Rust (which I realize have overlapping but very different spaces; I put Swift somewhere between the two on that continuum).

                                                  I’d really like to use Swift as a systems language on Linux but it just doesn’t seem ready for prime time there. The biggest problem for me is the lack of C interop the other way: using C from Swift is trivial, but using a Swift library from C is not officially supported.

                                                  (Go suffers from this problem too, but in a different way.)

                                                  Writing a “universal” library (callable from other languages easily) basically requires that it be written in C, C++, or Rust. Zig has good support for this, but the language is pre-1.0. D I think can do it but it seems moribund (correct me if I’m wrong).

                                                  Anyway, long story short, I want to see Swift as a viable competitor to Rust in the stdlib range of Rust’s use cases, but I don’t think it’s gonna happen. I just think that would be a place where Swift could thrive.

                                                  I asked a question here on lobste.rs a few weeks ago to see if anyone was using Swift on Linux, and all but one of the answers, sadly, were no. I would love to hear of people using Swift on Linux.

                                                  1. 3

                                                    I think that’s a fair treatment. As I understand it, server work is the current expansion goal, and (waves hands) all this concurrency work is on that critical path. Systems programming features, like a memory ownership model, come later. It remains primarily an application-level language and there’s a long way to go. I’m optimistic, but it will take time.

                                                    1. 2

                                                      Definitely. I am excited about the Actor proposal being accepted, and about maybe implementing a process-based executor and supervisors for Actors. That would be very interesting!

                                                    2. 3

                                                      To be honest, I think it’s equally likely that Java will make a break-through in the Go/Rust space as Swift being successful there.

                                                      1. 1

                                                        Even if you’re interested and want to try to use it on anything other than a mac, it ends up being far from trivial. Apple apparently do internal CI builds on one release of Ubuntu so it basically works on Linux. But you’re not going to find packages; building from source is non-trivial and if you’re on something slightly unusual you can forget it.

                                                        At first glance, the language seems to have syntax rather like Rust and some nice features. But it seems every feature they ever considered got included. It was in production use early so had no period where things that didn’t work could be thrown out again. With Rust by contrast, there’s not much left of the language as it was when I first read up about it and played with it because lots of things got thrown out again. It’s reached a stable basis now but didn’t for some time.

                                                        1. 3

                                                          I’d temper that a little; Swift has thrown out some features like currying syntax and basically renamed all functions in 3.0. The early years saw a lot of churn on the way to source and binary stability. And before that there were years of development before Apple announced the project at all. (I would guess that Rust was public very early in life.) The additions for SwiftUI go overboard if you ask some of us, but that’s all version 5.

                                                          I’d say Swift is maturing on Apple platforms but is barely past experimental elsewhere. Its feature set probably looks broad for how incompletely deployed it is.

                                                      1. 7

                                                        It seems everyone has their own take on what a “better Jupyter” looks like. First Observable: https://observablehq.com/, now this.

                                                        Let me paraphrase the problems with Jupyter notebooks that these are trying to solve:

                                                        1. The on-disk format is not diff-friendly, and not code-review friendly. Without launching Jupyter, it is hard to make sense of what these JSON blobs do;
                                                        2. There is no reproducibility for a given notebook. You can enforce it at code-review level, or have some pre-commit hooks to clean up the output, run the notebook in order, etc., but these are not built in. You can always evaluate cell 10 before cell 8, and next time do the reverse. Over time, only you know the right order to run your notebook;
                                                        3. Jupyter is not really meant to be collaborative: you definitely cannot have two people work on one notebook synchronously. Even asynchronously, you need piles of out-of-band protocols to make sure the resulting notebook is in a sane state.

                                                        The solutions to these problems seem obvious, but:

                                                        1. A text format cannot encode graphs and images into the notebook. Thus, you either need an external data source, or encoded base64 blobs, or to make sure the people who open it have the environment and can re-evaluate instantly upon opening;
                                                        2. Many languages cannot extract evaluation order from the source directly (observablehq went to great lengths to build a reactive language before building their notebook), and specifying order beyond that seems to be another chore nobody really cares about at write time.

                                                        After working closely with some language kernels in Jupyter, I’ve grown much more interested in alternatives. It is a big ecosystem now, but many extensions outside of what the core team provides are not well-maintained. Some of the more interesting features (to me), such as much better type-ahead and code analysis (close to what VSCode can offer), still have seen no movement for years: https://github.com/jupyter/jupyter_client/issues/51

                                                        One last thing: many extensions in the Jupyter ecosystem are “scratch my own itch” kinds of things. I have no objection to that. It is just that if I am going to introduce these extensions to everyone on my team, I would rather maintain them “in-tree” in our code repo than in their OS directories. Because it is “scratching your own itch”, I expect those of us who use it to fix any bugs if there are any, and it is light-years easier to do so with “in-tree” extensions. However, JupyterLab (last time I checked) doesn’t make this easy.

                                                        1. 3

                                                          I have worked on and off helping people deploy Jupyter notebooks for a while now; I completely agree with your takes, and have secretly hoped for a Jupyter alternative in Elixir.

                                                          Elixir’s new Mix.install will really help with your last point. Jose has been very resistant to allowing global dependencies in Elixir due to pain points from the Python and Ruby ecosystems, so you can be sure that this was designed very carefully, and I imagine people will push to public or private hex repos to accomplish out-of-tree extensions.

                                                          I think the real killer feature of Livebook is the last mode (not demoed in the video): connecting to a running node. You can connect the notebook to a running Elixir or Erlang node and execute functions in that node. So I imagine a system where you can create an analytics notebook, hook it up to prod, run a quick ETL, and prepare a report for management.

                                                          Or, let’s say you need to patch some data in a prod database. You can wrap your notebook in an SQL transaction, play with the changes and verify that they look good, roll back the tx if they don’t, and when you are happy with your script, commit it and save the notebook for record-keeping purposes.

                                                          Or, because concurrency is fantastic, run a machine learning training job on a node and, midway through, interact with the model at that point in training…

                                                        1. 4

                                                          Are there implementations of persistent/immutable collections in Swift, similar to what we have natively in Clojure or via libraries like Immutable.js in JavaScript or Paguro in Java?

                                                          1. 5

                                                            If I take your meaning, yes. But the implementation is different from the way a Java class would have to enforce its own immutability. Collections in Swift are generally aimed at preventing shared mutable state. They have a way to provide mutability, but those mutations aren’t shared with other users of the collection.

                                                            Swift collections like Array are generally not classes but struct types, with value semantics. Assignment copies the value, not a reference to it. Mutations are disallowed through any constant variable, which is the idiomatic normal way to make any local variable or property. Behind the scenes there is a reference to a backing store, so that copies are cheap and mutations can grow the collection, and the value can remain a compile-time constant size. But mutations are still possible through a mutable variable, and in that case the backing store will be copied on write when it isn’t uniquely owned by just one copy already. So it’s a private mutation, not a shared one. The Array type didn’t have to assert immutability; rather it had to implement a copy on write pattern.

                                                            Finally, in the upcoming Swift Concurrency implementation of the actor model, only copyable data types can pass across actor isolation boundaries.
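
                                                            A small example of that behavior with a plain Array:

                                                              var a = [1, 2, 3]
                                                              var b = a    // copies the struct value; the backing store is shared
                                                              b.append(4)  // store isn't uniquely owned, so it's copied on write here
                                                              print(a)     // [1, 2, 3] - a's value is unaffected
                                                              print(b)     // [1, 2, 3, 4]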

                                                            1. 3

                                                              I don’t think this is what was being asked, and I think the correct answer is “no, Swift doesn’t have persistent collections”:

                                                              The core property of persistent collections is that “mutating” operations are asymptotically as cheap as mutable collections, by retaining the unchanged parts instead of copying everything.

                                                              So if your COW collection (like array) copies everything on a change, it isn’t persistent.

                                                                1. 1

                                                                  Yes, please read it.

                                                                  1. 3

                                                                    So if your COW collection (like array) copies everything on a change, it isn’t persistent.

                                                                  Your claim is that “if your collection copies everything on a change, it isn’t persistent”. This is directly disputed by the Wikipedia article, where it states COW is a method for creating Persistent Data Structures. Inefficient, yes. But a method nevertheless.

                                                                    1. 0

                                                                      [citation needed]

                                                                      1. 1

                                                                        There is a naive scheme to make any data structure persistent. This scheme performs the operations exactly as they would have been performed in an ephemeral setting but before each update operation it makes new copies of all input versions. Then it performs the update on the new copies.

                                                                        Kaplan, Haim (2001). “Persistent data structures”. Handbook on Data Structures and Applications https://cs.uwaterloo.ca/~imunro/cs840/Kaplan_persistent-survey.pdf

                                                                    2. 1

                                                                      Reading it, I think you’re pointing to this part:

                                                                      This is an inefficient technique because the entire backing data structure must be copied for each write

                                                                      But I’m not sure how being labeled an inefficient technique for persistent data structures makes it not a persistent data structure at all.

                                                                      In Swift’s case, there’s an attempt to mitigate that cost, which is that copying only happens on write when the backing data structure is not yet uniquely owned, after which the new copy is uniquely owned, and further mutations do not copy again. In effect, what the programmer sees as a collection is a history snapshot of the collection. The backing structure of items is preserved not for every step of history, but for every step that some collection still has a handle to.

                                                                      1. 1

                                                                        Because it’s as if you called cutting the power cord to your workstation “garbage collection”.

                                                                        Yes, technically all memory has been cleaned up, but everyone knows that this is not what people have in mind when using the term.

                                                                        Wholesale copying the complete array on change is certainly not what people have in mind when talking about persistent data structures.

                                                                        1. 1

                                                                          Ha, ok. Well I don’t doubt you. It just seems like there are strict and loose interpretations of the term at play here. I’ll encourage you to contribute changes to that Wikipedia article since you’re familiar with the research.

                                                                  2. 1

                                                                    If I understand what you mean, in Swift there is a copy of the buffer, but it’s a shallow copy, not a deep copy. In that sense it doesn’t copy everything on a change. But there are two cases, and I wonder whether you’d call each case a persistent collection. Before a mutation of a collection whose backing store is not uniquely owned:

                                                                    If the items are class objects, then the backing store of many pointers is copied. The objects themselves are shared.

                                                                    If the items are value types, like integers, strings, collections, or algebraic data type values, then they’re stored inline in the backing store, and indeed they’re all copied up to but not past any reference property they contain. So mutating an array of sets, for instance, will copy the array’s backing store of many sets, but not the sets’ backing stores. Again those are shared.

                                                                    1. 1

                                                                      No, that’s not what is meant by persistent data structures. Here is an introduction to the topic with a few examples.

                                                                      1. 2

                                                                        I think I’m reading here that under lazy evaluation, the work entailed by preserving history across many mutations is reduced by evaluating once at the end. Retaining the unchanged parts is a matter of adding interstitial thunks through which you’d observe the result of changes, including undisturbed original items. I can see how that doesn’t involve a copy across the length of the structure, and I imagine the work is instead on the order of the number of mutations.

                                                                        In Swift, the work entailed by preserving history across many mutations is reduced by copying once for the first local change, after which point the structure is uniquely owned and can be mutated in place. Swift also idiomatically discourages mutation in the first place. When the need arises, though, it will have to copy those N items.

                                                                        Swift doesn’t pretend to be a functional language, it just points us that direction from an imperative starting line.

                                                                        1. 1

                                                                        Persistent data structures do not require lazy evaluation or a functional language; see HAMTs and their usage as an example of that.

                                                                          I think it’s fair to say that Swift simply does not offer anything in this regard.

                                                                  3. 2

                                                                    Interesting!

                                                                    Could you explain what happens in Swift in terms of mutations and copies when we write something like this pseudocode:

                                                                    var m = bigNestedMap
                                                                    m["b"]["c"] = 2
                                                                    
                                                                    1. 2

                                                                    This is more complicated than it seems :) First, for the map (dictionary) type in Swift, subscript returns an optional, thus m["b"]["c"] won’t compile. You can, however, do m["b"]!["c"] = 2. In this case, if “b” exists, it will modify m such that m["b"]! contains the new “c” key. If “b” doesn’t exist, it will panic. This works because m is mutable, and m["b"]!["c"] makes a mutating func subscript(key: String) { set { } } call, which can mutate m["b"]!.

                                                                    To make this easier when “b” doesn’t exist, Swift supports this syntax: m["b", default: [:]]["c"] = 2; it will create an empty dictionary in place if it cannot find the key “b”.

                                                                      Note that this behavior is different if you do:

                                                                      var m = bigNestedMap
                                                                      var mb = m["b", default: [:]]
                                                                      mb["c"] = 2
                                                                      

                                                                    In the above case, you are not making an in-place mutating func call, hence m won’t change.

                                                                    (Actually, thinking about it again, I don’t think I explained well why it will mutate m in your case; this does a better job of it, I think: https://github.com/apple/swift-evolution/blob/main/proposals/0165-dict.md)

                                                                      1. 1

                                                                        Happy to. First, Swift subscripts are like computed properties in many languages: They have get and set functions that you implement. The difference from a computed property is that a subscript’s two functions take an index argument too. So a single subscript assignment x[1] = 2 is equivalent to a hypothetical x.put(1, 2). A subscript read x[1] is like x.get(1).

                                                                        Subscript assignment on a struct is like any other mutating function on a struct: it’s allowed only if the struct value is in a mutable variable. That applies to both the major and minor levels here.

                                                                        We can think of your example case as:

                                                                        var m = bigNestedMap
                                                                        
                                                                        // Make a copy of m["b"], sharing its backing store using the copied reference within
                                                                        var n = m.get("b") 
                                                                        
                                                                        // Replace n's backing store with a shallow copy if it is not uniquely owned by n, then replace its item "c" with 2
                                                                        n.put("c", 2)
                                                                        
                                                                        // Replace m's backing store with a shallow copy if it is not uniquely owned by m, then replace its item "b" with n
                                                                        m.put("b", n)
                                                                        

                                                                        In the end, the number of collection backing stores has either increased by zero, one, or two depending on whether anybody else still has a copy of the originals’ backing stores.

                                                                        If m’s backing store was copied, then its items were shallow copied. Being collections, they each share their backing stores with the items in the original backing store of m.

                                                                        I’ve pretended the optional nil case doesn’t exist just to keep it simple. Dictionaries use optionals for the result of a read, but arrays just bounds-check and crash. Maybe this example is more accurate if you imagine it’s the Array type instead.

                                                                        As you would guess, this is just a naive implementation and optimizations are possible, such as the detection of unique ownership (perhaps bigNestedMap is a local that’s never used again). But in general, mutation should only affect your own view of the collection.
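
                                                                      You can build the same pattern yourself. A minimal sketch (my own toy type, not the standard library’s actual implementation) using isKnownUniquelyReferenced:

                                                                        // Class reference used as the heap-allocated backing store.
                                                                        final class Buffer {
                                                                            var items: [Int]
                                                                            init(items: [Int]) { self.items = items }
                                                                        }

                                                                        struct COWArray {
                                                                            private var buffer: Buffer

                                                                            init(_ items: [Int]) { buffer = Buffer(items: items) }

                                                                            subscript(i: Int) -> Int {
                                                                                get { buffer.items[i] }
                                                                                set {
                                                                                    // Copy the backing store only if someone else shares it.
                                                                                    if !isKnownUniquelyReferenced(&buffer) {
                                                                                        buffer = Buffer(items: buffer.items)
                                                                                    }
                                                                                    buffer.items[i] = newValue
                                                                                }
                                                                            }
                                                                        }

                                                                        var x = COWArray([1, 2, 3])
                                                                        var y = x // shares the buffer
                                                                        y[0] = 99 // buffer is copied here; x still sees [1, 2, 3]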

                                                                        1. 2

                                                                        The behaviour you describe reminds me of structural sharing via path copying (see here).

                                                                          A couple of questions:

                                                                          1. Is it exactly the same as path copying?
                                                                          2. What do you mean by “backing store”?
                                                                        3. You describe how structs work. Do dictionaries (aka hash maps) work in the same way?
                                                                          1. 2

                                                                            Answering out of order:

                                                                            1: Yes, I’d call it pretty similar to that!

                                                                            3: Yes. Dictionaries are also structs in Swift, just like arrays and sets and other basic collections. All of them use the pattern I’m attempting to describe.

                                                                            2: A collection type, being a struct (a value type) and not a class (not reference type), is allocated inline with its owner, or on the stack if it’s a local variable. That allocation must be a fixed size regardless of adding items to the collection. So the struct will have a property that is a reference (a fixed size pointer) to a buffer on the heap where items are stored. That buffer is what I’m calling a backing store. Each array or dictionary or set will have one of those containing its items, and when the collection is passed around, the struct value containing the reference is copied but the buffer is therefore shared. Only when a mutation would alter a buffer that’s referenced by more than one struct value is the buffer copied. That’s a shallow copy, so if it’s for instance an array of ints, the ints will be copied then. If it’s an array of arrays, the child arrays will be copied but their buffers are again shared.

                                                                          2. 1

                                                                            Swift subscripts are like computed properties in many languages: They have get and set functions that you implement. The difference from a computed property is that a subscript’s two functions take an index argument too.

                                                                            Rather weird btw, that they still use the [], unlike more modern languages…

                                                                            1. 2

                                                                              Do you have a language in mind you’re comparing to?

                                                                    1. 15

                                                                      I feel like a lot of the sqlite3 complaints w.r.t. concurrency stem from the fact that most people do not understand how to use immediate write transactions to avoid ‘database busy’ errors.

                                                                        The key points are that you must set the busy timeout as well as use ‘begin immediate’ for write transactions, to tell sqlite3 to take a write lock immediately. This means it won’t have to roll back halfway through a transaction when it fails to upgrade to a write lock.
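
                                                                        As a concrete illustration, here’s a minimal sketch of that pattern using the C API from Swift (assuming sqlite3 is exposed to Swift, e.g. via import SQLite3 on Apple platforms; the schema is made up):

                                                                          import SQLite3

                                                                          var db: OpaquePointer?
                                                                          guard sqlite3_open("app.db", &db) == SQLITE_OK else { fatalError() }

                                                                          // Wait up to 5 seconds for a lock instead of failing immediately.
                                                                          sqlite3_busy_timeout(db, 5000)

                                                                          // Take the write lock up front, so the transaction can never fail
                                                                          // halfway through on a read-to-write lock upgrade.
                                                                          sqlite3_exec(db, "BEGIN IMMEDIATE", nil, nil, nil)
                                                                          sqlite3_exec(db, "UPDATE accounts SET balance = balance - 10", nil, nil, nil)
                                                                          sqlite3_exec(db, "COMMIT", nil, nil, nil)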

                                                                      1. 3

                                                                        This is an interesting point. As I understand it, you basically have 2 options:

                                                                        1. handle the wait yourself:
                                                                        db = sqlite3_open ...;
                                                                        sqlite3_busy_handler(db, myBusyHandler, (void*)300);
                                                                        

                                                                          and myBusyHandler will be called every time a write lock can’t happen. The handler can return 1 to try for a lock again, or 0 to give up and not try (and SQLite will then return SQLITE_BUSY).

                                                                        i.e. you define when to try for a lock by:

                                                                          Return 1 = SQLite will immediately try to get a lock.
                                                                          Return 0 = SQLite gives up and returns SQLITE_BUSY for the statement.
                                                                          While in the function (so sleeping, etc.) SQLite is waiting for a return value and does nothing.
                                                                        
                                                                          2. let sqlite use the default handler, by just setting a timeout.
                                                                        sqlite3_busy_timeout(sqlite3*, int ms);
                                                                        

                                                                          where it will keep trying for up to ms milliseconds to get a lock, and return SQLITE_BUSY if that limit is reached without getting one.

                                                                          Basically the 1st option is for when you want to update a GUI or whatever and don’t want to block waiting on a lock, like the default handler will do (which by default never runs unless you set a timeout).

                                                                          So the default behaviour is to try for a lock once; if it doesn’t happen, return SQLITE_BUSY and give up. Waiting/retrying you have to ask for via the above method(s).

                                                                        1. 3

                                                                            I’m saying sometimes sqlite3 must give up and is unable to wait even if you set a busy wait. Sometimes transactions get interleaved in ways that are not serializable. To prevent this you must use begin immediate, which avoids any non-serializable interleavings.

                                                                          1. 1

                                                                            agreed. Though in theory you could set your timeout to be max int ms and never punt, or always return 1. I’m sure a user or three would get very upset with you eventually though :)

                                                                            1. 1

                                                                                No, I think you have misunderstood: I’m saying there are situations where the busy wait won’t help no matter how high it is; sqlite3 must abort unless you structure your transactions correctly.

                                                                              1. 1

                                                                                  I think SQLite will detect cases such as a read lock upgrading to a write lock, and in those cases it won’t trigger the busy handler at all and will just return SQLITE_BUSY. That said, I don’t know whether there are any cases that SQLite cannot detect, where it will busy-wait forever.

                                                                                1. 1

                                                                                  I understood that. Sorry I was trying to make a joke, clearly I failed.

                                                                                  I’ve never seen a documented list of the places where it can fail in such situations, but I believe you that they exist, when you have multiple writers and don’t use BEGIN IMMEDIATE;.

                                                                                  1. 1

                                                                                      Yeah, it seems really under-documented for such a common cause of confusion.

                                                                              2. 1

                                                                                  It only happens if you grab a read lock in a transaction before upgrading to a write lock. In many cases, it is fine to read without BEGIN on your writer thread, and to only start a transaction when you need to write. That way, you don’t need BEGIN IMMEDIATE from the get-go and you won’t get the failed-to-upgrade error.

                                                                              3. 1

                                                                                Similar to your second option, you can also set the busy_timeout as a regular statement by using:

                                                                                PRAGMA busy_timeout = 5000;

                                                                                1. 1

                                                                                  Yes, a lot of options, like WAL (another basically must-do default setting), are available as PRAGMAs and not only via the SQLite API.
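
                                                                                  For example, a minimal sketch applying both settings discussed above at connection setup (error handling omitted):

                                                                                  #include <sqlite3.h>
                                                                                  
                                                                                  /* Apply the PRAGMAs as plain SQL instead of calling the dedicated
                                                                                     C APIs such as sqlite3_busy_timeout(). */
                                                                                  void apply_pragmas(sqlite3 *db) {
                                                                                      sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
                                                                                      sqlite3_exec(db, "PRAGMA busy_timeout = 5000;", NULL, NULL, NULL);
                                                                                  }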

                                                                            1. 2

                                                                              There is a popular opinion among developers that SQLite is not suitable for the web because it doesn’t support concurrent access. This is a myth. In write-ahead log mode (available for a long time now), there can be as many concurrent readers as you want. There can be only one concurrent writer, but often one is enough.

                                                                              I wonder why the SQLite3 developers chose to allow only one simultaneous write transaction. Does supporting multiple concurrent writers complicate the codebase or have other undesirable implications?

                                                                              1. 6

                                                                                For many applications, what you want to do is actually take the WAL recommendation seriously and enable it:

                                                                                PRAGMA journal_mode=WAL;
                                                                                

                                                                                So much of “SQLite is slow” could be avoided by doing that. If you think you’d benefit from concurrent writes, that’s most likely what you are looking for.

                                                                                1. 1

                                                                                  The inability to have concurrent writers is also a usability problem: since SQLite does not handle it for you, you need to architect the application so that there are never concurrent writes. An alternative would be to have a lock, as sketched below.
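
                                                                                  A minimal sketch of the lock approach for a single multi-threaded process (the mutex and helper are hypothetical application code, not a SQLite API):

                                                                                  #include <pthread.h>
                                                                                  #include <sqlite3.h>
                                                                                  
                                                                                  /* Serialize all writing threads in this process through one mutex,
                                                                                     so SQLite never sees two concurrent write attempts from us. */
                                                                                  static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
                                                                                  
                                                                                  void serialized_write(sqlite3 *db, const char *sql) {
                                                                                      pthread_mutex_lock(&write_lock);
                                                                                      sqlite3_exec(db, sql, NULL, NULL, NULL);  /* error handling omitted */
                                                                                      pthread_mutex_unlock(&write_lock);
                                                                                  }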

                                                                                2. 5

                                                                                  Yes. If you have two writers each in a transaction and they conflict, one of them has to be able to undo the partial bits of its transaction. That complicates the design of the data structures and will make performance worse if you never have more than one writer. For most consumers of an embedded database there is at most one thing writing (sometimes zero: I’ve seen sqlite used to provide an indexed data source that’s easy to update independently of the application build but never actually modified through the database).

                                                                                  1. 1

                                                                                    almost read-only database :)

                                                                                    Seems like a good fit for things like RDF HDT

                                                                                  2. 3

                                                                                    As soon as you let multiple processes change database state, you need to worry about changes that affect the same row in different ways. You end up in eventual-consistency land, or you do as sqlite3 does and ensure strong consistency with a single writer.

                                                                                    1. 2

                                                                                      My understanding is that concurrent writes require an independent rollback journal file per transaction to implement correctly. However, SQLite’s rollback journal doesn’t support concurrency at all (single reader / writer). In SQLite’s WAL mode, concurrent readers are possible, but because there is only one write-ahead log file, there cannot be multiple write transactions (you cannot interleave write transactions in the WAL file). I wrote a little about this speculation at https://dflat.io/notes/mwmr/

                                                                                      1. 1

                                                                                        Hey, I am sort of an okvs expert. If you need help with that kind of programming, let me know. Mind the fact that mdbx works primarily from memory; for bigger-than-memory datasets it might perform less well (check that claim). Also, apparently the API is not so good (check that claim too). Ping via lobsters messages :)

                                                                                        What about rocksdb?

                                                                                    2. 1

                                                                                      I wonder why the SQLite3 developers chose to allow only one simultaneous write transaction.

                                                                                      It is much easier to implement isolation: there are never any write conflicts, hence snapshot isolation is trivial, and among other things MVCC is not even necessary.

                                                                                      Something that is unclear is how quickly a write transaction becomes visible to other transactions; in theory, according to ACID rules, it should be immediate… In any case, there seems to be a performance-optimization opportunity in declaring a transaction read-only. There is also an opportunity to allow reads from the data file that lag a little behind the WAL. It is also a simplification: once you accept that you might read data that is stale with respect to the WAL, you do not need to keep the WAL data in memory (or read the WAL at each transaction). LMDB has a read-only flag.
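
                                                                                      For reference, a minimal sketch of that LMDB flag (assumes an already-initialized MDB_env; error handling omitted):

                                                                                      #include <lmdb.h>
                                                                                      
                                                                                      /* A read-only transaction: cheap to begin, gives a stable snapshot,
                                                                                         and is discarded with abort rather than committed. */
                                                                                      void read_snapshot(MDB_env *env) {
                                                                                          MDB_txn *txn;
                                                                                          mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
                                                                                          /* ... mdb_get() calls against a consistent view ... */
                                                                                          mdb_txn_abort(txn);
                                                                                      }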

                                                                                      Last time I checked, SQLite has no built-in support to enforce a single writer, so the user needs to handle that. Since SQLite supports multiple processes, synchronizing several separate processes, each of which may have several writers, gets complicated; a single-process database has it easier and would be more performant. Mind the fact that PostgreSQL relies on multiple processes too, but they are all forked from the same parent, and my guess is that shared memory and the like are easier to do in that setup.

                                                                                      ref: about isolation in SQLite https://www.sqlite.org/isolation.html

                                                                                    1. 4

                                                                                      If you are looking for good generic data structures in C, klib is a good alternative as well: https://github.com/attractivechaos/klib

                                                                                      One thing missing there is a good rbtree implementation. I found that the one from jemalloc is dependency-free and easy to use: https://github.com/jemalloc/jemalloc/blob/dev/include/jemalloc/internal/rb.h
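
                                                                                      For anyone who hasn’t tried klib, a minimal sketch of its khash map (header-only; the i2i type name is an arbitrary instantiation):

                                                                                      #include <stdio.h>
                                                                                      #include "khash.h"
                                                                                      
                                                                                      /* Instantiate a hash map from int keys to int values. */
                                                                                      KHASH_MAP_INIT_INT(i2i, int)
                                                                                      
                                                                                      int main(void) {
                                                                                          int ret;
                                                                                          khash_t(i2i) *h = kh_init(i2i);
                                                                                      
                                                                                          khiter_t k = kh_put(i2i, h, 42, &ret);  /* insert key 42 */
                                                                                          kh_value(h, k) = 7;                     /* set its value */
                                                                                      
                                                                                          k = kh_get(i2i, h, 42);                 /* look it up */
                                                                                          if (k != kh_end(h))
                                                                                              printf("42 -> %d\n", kh_value(h, k));
                                                                                      
                                                                                          kh_destroy(i2i, h);
                                                                                          return 0;
                                                                                      }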

                                                                                      1. 8

                                                                                        @rui314 has been doing some pretty interesting projects recently (also mold: https://github.com/rui314/mold). Is this some fun retirement period they are enjoying?

                                                                                        1. 23

                                                                                          Yes. After working for Google for 12 years, I decided to leave to pursue my personal interest.

                                                                                          1. 7

                                                                                            Your READMEs are excellent. I’ve never read any so captivating.

                                                                                            1. 2

                                                                                              ruiu, I’d love to be able to pre-order your compiler book or sign up for a mailing list (or follow an RSS feed) to let me know when it’s ready.

                                                                                              1. 3

                                                                                                The chibicc README has instructions on signing up for notifications.

                                                                                              2. 2

                                                                                                Hi @ruiu - quick question: will your compiler send each phase to a separate process, or will the compiler be a single process?

                                                                                                Just curious - no strong opinion on this.

                                                                                                1. 2

                                                                                                  It’s a single process just like other compilers are.

                                                                                                2. 2

                                                                                                  Remarkable, thanks for sharing your research

                                                                                              1. 2

                                                                                                What size would the sqlite file be if it were also gzipped?

                                                                                                1. 7

                                                                                                  Parquet can read directly from the compressed file and decompress on the fly, though.

                                                                                                  1. 3

                                                                                                    471MB zipped.

                                                                                                    1. 1

                                                                                                      Or has anyone seriously investigated this? https://sqlite.org/zipvfs/doc/trunk/www/readme.wiki

                                                                                                      1. 1

                                                                                                        It’s not very popular due to the license and fees. Here is an interesting project that should work really well with zstd dictionary compression. Since there are tons of repeated values in some columns, or JSON text, zstd’s dictionary-based compression works really well.

                                                                                                        [2020-12-23T21:14:06Z INFO sqlite_zstd::transparent] Compressed 5228 rows with dictid=111. Total size of entries before: 91.97MB, afterwards: 1.41MB, (average: before=17.59kB, after=268B)

                                                                                                        1. 1

                                                                                                          I would want to investigate using ZSTD (with a dictionary) and Parquet as well. Columnar formats are better for OLAP.

                                                                                                        2. 1

                                                                                                          Why would anybody investigate this when you have Parquet and ORC, which are specifically designed for OLAP use cases? What does SQLite add for a data scientist who wants to process the GitHub export data?

                                                                                                        3. 1

                                                                                                          Apples to oranges.

                                                                                                        1. 6

                                                                                                          After ~24 hours, the answer would appear to be “no”.

                                                                                                          That’s a shame. I know it’s blasphemous, but I think Swift is a more elegant and usable language than Rust. I’m sad that it’s probably going to be Apple-only for the foreseeable future.

                                                                                                          1. 2

                                                                                                            Swift, for the most part, is an “application” language. To me, a good application language can express interactions with end users in a succinct manner. When we speak of non-Apple languages in production environments, what most people refer to is either Android apps, Windows apps, web backends, or micro-services (maybe with the exception of games).

                                                                                                            For Android apps / Windows apps it would be a good fit, but apparently not. Swift is too immature, and people won’t want to shoot themselves in the foot with a language that is not officially blessed by the platform. Windows is probably more open in that regard, but it takes time to mature on that platform.

                                                                                                            Web backends: these are dominated by Go / JavaScript at the moment. Swift has some open-source frameworks, but it is going to be niche. A web backend for the most part aggregates data from micro-services and does some basic business logic, and Swift simply doesn’t yet have the basic building blocks, such as coroutines, that make for efficient web backend logic.

                                                                                                            Micro-services: this is a big category with all kinds of different requirements. Swift is actually not a bad language for these, but it doesn’t have as nice a deployment story as Go / Rust (static binaries were only supported toward the end of last year; otherwise it requires the Swift runtime). Not to mention the lack of coroutines, which would be useful for some simple micro-services.

                                                                                                            Games: I think Swift would excel at that. But we are in a world where only two game engines matter, and they have their own supported languages.

                                                                                                            I personally use Swift for all kinds of things on Linux, from Jupyter notebooks to command-line tools. The language is nice, but its ecosystem will take another 3 to 5 years to mature.

                                                                                                            Swift does mature more slowly than Rust or Go, though, considering the language itself is already 7 years old.

                                                                                                          1. 33

                                                                                                            This is nuts. Who would have expected sscanf() to call strlen() on the /input string/ every time it was called? Not me!

                                                                                                            Turns out that the vast majority of libc implementations implement sscanf() by converting the input string to a fake FILE* object & then passing that to their fscanf() implementation. (The FILE* object needs the length of the input string so that it can signal EOF correctly, I presume.)

                                                                                                            Pretty sure that I would have made exactly the same mistake in the same circumstances: how else are you going to convert input strings into floats / ints? Write your own float or decimal parser? Are you mad? No, obviously you call sscanf(). And here we are.

                                                                                                            What a footgun.
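
                                                                                                            A sketch of the failure mode (illustrative code, not from the affected program): each sscanf() call does a hidden strlen() of everything left in the buffer, so a parsing loop goes quadratic, while strtod() reports where it stopped and avoids the rescans.

                                                                                                            #include <stdio.h>
                                                                                                            #include <stdlib.h>
                                                                                                            
                                                                                                            /* Quadratic: every sscanf() call strlen()s the whole remaining
                                                                                                               buffer, even though %n lets us advance the pointer ourselves. */
                                                                                                            void parse_slow(const char *buf) {
                                                                                                                double v;
                                                                                                                int consumed;
                                                                                                                while (sscanf(buf, "%lf%n", &v, &consumed) == 1) {
                                                                                                                    /* ... use v ... */
                                                                                                                    buf += consumed;
                                                                                                                }
                                                                                                            }
                                                                                                            
                                                                                                            /* Linear: strtod() stops at the first non-numeric character and
                                                                                                               reports where, so nothing rescans the buffer. */
                                                                                                            void parse_fast(const char *buf) {
                                                                                                                char *end;
                                                                                                                for (;;) {
                                                                                                                    double v = strtod(buf, &end);
                                                                                                                    if (end == buf) break;  /* no more numbers */
                                                                                                                    /* ... use v ... */
                                                                                                                    buf = end;
                                                                                                                }
                                                                                                            }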

                                                                                                            1. 13

                                                                                                              It’s a good reminder that writing in C / using the stdlib doesn’t make your code fast. I’ve been guilty of choosing libraries written in C because I believed that meant they’d be fast. Often they are, but I need to remember that performance should always be tested when it matters, and shouldn’t be assumed to be as good as it could or should be.

                                                                                                              1. 10

                                                                                                                Or just use an existing JSON parser.

                                                                                                                1. 3

                                                                                                                  Typically you should use strtod, strtol, and the rest of the strto* family for conversion after strtok (or strsep if you have it).
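
                                                                                                                  A minimal sketch of that combination (strsep is a BSD extension, also in glibc; strtok_r is the portable POSIX fallback):

                                                                                                                  #include <stdlib.h>
                                                                                                                  #include <string.h>
                                                                                                                  
                                                                                                                  /* Tokenize a mutable comma-separated string, then convert each
                                                                                                                     field with strtol(); only the short token is ever scanned. */
                                                                                                                  void parse_fields(char *line) {
                                                                                                                      char *field;
                                                                                                                      while ((field = strsep(&line, ",")) != NULL) {
                                                                                                                          long v = strtol(field, NULL, 10);
                                                                                                                          /* ... use v ... */
                                                                                                                          (void)v;
                                                                                                                      }
                                                                                                                  }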

                                                                                                                  1. 3

                                                                                                                    Oh sure, but if you’re ad-hoc parsing little chunks of string encoded data that happens to have a regular structure, sscanf is right there, begging to be used so you can blow your feet off with it.

                                                                                                                    1. 1

                                                                                                                      If you strtok it first, this is probably not a big deal, because strtok null-terminates the returned token, so you are not strlen()ing the whole input.

                                                                                                                    2. 3

                                                                                                                      Turns out that the vast majority of libc implementations implement sscanf() by converting the input string to a fake FILE* object & then passing that to their fscanf() implementation. (The FILE* object needs the length of the input string so that it can signal EOF correctly, I presume.)

                                                                                                                      That is … bizarre. Not implementing sscanf in terms of fscanf—that’s fine, and sprintf usually does it too—but precalculating the length! It’s trivial to signal EOF: just look for the null terminator.