Threads for walleye

    1. 4

      Since we are a platform-as-a-service provider, we think that we can contribute the most by showing you how to build a small web service in both languages.

      In 99% of cases, everything else being equal, Go works best in this scenario. The main benefit of Rust is memory safety without GC overhead, and it’s almost never worth it if network latencies are present.

      1. 5

        I disagree. To me, the main benefit of Rust is that it’s much more unlikely I’ll have to deal with an essential service having gone down at 3AM.

        That and the ability to easily expose bindings to tons of languages so I can reuse code are why I use Rust in place of “shell scripts” (actually Python these days) so often. Rust, like Haskell, surfaces as much as possible at compile time but, unlike Haskell, has an ecosystem leaning more toward “fearless upgrades” and less toward pushing the state of the art at the expense of a little API breakage.

      2. 2

        My personal experience between the two languages was that CPU/latency differences were small (10-20%) but memory consumption differences were huge (4-5x). Even with the cost of network hops, at a certain scale the OpEx differences can be huge.

        1. 1

          How did you measure memory consumption?

          1. 1

            It’s been long enough that I’ve forgotten the details, and it was a past job. I’m pretty sure I used one of the Datadog memory metrics.

    2. 4

      Cloud will not get simpler. Infrastructure as code will not get simpler.

      I (unfortunately) agree. There’s always a constant thread in conversations about simplicity as an ideal. It’s a great ideal, but it always folds under the pressure of practical needs.

      As far as the actual idea here, I do think it’s nice. I wouldn’t turn down the opportunity to use it, as YAML configuration is so constantly painful that any amount of code reuse is welcome.

      If we’re thinking “from scratch” though, I think infrastructure as code should support… code. A real programming language. Pulumi is already going in this direction, and that’s where I think we should all go.

      Funnily, as was hinted at in the post, this is the direction web frontend went as well, with React. Instead of adding layers and layers on top of HTML, React embeds HTML inside of a real, actual host programming language (JS). That’s the most flexible and usable approach long-term.

      1. 3

        Definitely agree re: needing real code. I’ve seen a few systems now start off as pure declarative and end up with a poorly thought out language bolted on the side. If you admit you’ll need code eventually early on, you can build your language/code ecosystem more intentionally.

    3. 9

      It seems we are roughly converging on error handling. Midori, Go, Rust, Swift, Zig follow a similar-ish design, which isn’t quite checked exceptions, but is surprisingly close.

      • there’s a way to mark functions which can fail. Often, this is a property of the return type, rather than a property of the function (Result in Rust, error pair in Go, ! types in Zig, and bare throws decl in Midori and Swift)
      • we’ve cranked-up annotation burden and we mark not only throwing function declarations, but call-sites as well (try in Midori, Swift, Zig, ? in Rust, if err != nil in Go).
      • Default is existentially-typed AnyError (error in Go, Error in Swift, anyhow in Rust, anyerror in Zig). In general, the value is shifted to distinguishing between zero and one error, rather than exhaustively specifying the set of errors.
      1. 13

        I would not put Go’s “error handling” in the same category as either checked exceptions or things like Rust.

        Both checked exceptions, and Rust’s fallibility-encoded-in-types, provide the ability for a compiler to force you to acknowledge a potential error at the call site where it might occur. Go’s approach is nowhere close to that – as far as I’m aware it happily lets you just not write an error-handling block and potentially bubble the error up, unhandled and without any notice to the caller.

        So, honestly, I prefer unchecked exceptions to Go – at least there if you don’t catch, you get a crash that tells you why you crashed and information to let you track it down.

        1. 3

          Aye Go’s is woefully insufficient out of the box, but there are linters which force you to ack errors (errcheck?). Past that I’m not sure any language can force you to properly handle error, Rust certainly does not e.g.

          _ = function_which_returns_a_result()

          Will pass muster in the default configuration (clippy may have a lint for that, I’ve not checked).

          The problem is really the Go community which is woefully inconsistent and will simultaneously balk at the statement that OOTB Go is very much lacking, and tell you that the compiler being half-assed is not an issue because you should have golangci-lint or whatever.

          1. 4

            It also has the problem that you can get both a return value and an error. While it is generally not advised to return a meaningful return value in the error case, it is not without example.

            It should be a proper sum type, either a return value, or an error. Exceptions (both unchecked and checked) fulfill this property.

          2. 2

            Past that I’m not sure any language can force you to properly handle error, Rust certainly does not e.g.

            The thing with Rust is that the compiler knows that a Result<SomeType, &str> isn’t a SomeType .
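
            A minimal sketch of that point, with hypothetical names (SomeType, make, double are made up for illustration). The Result has to be taken apart before the value inside can be used, although an explicit discard is still accepted:

```rust
// Hypothetical type and functions, for illustration only.
#[derive(Debug, PartialEq)]
struct SomeType(u32);

fn make(input: &str) -> Result<SomeType, &'static str> {
    match input.parse::<u32>() {
        Ok(n) => Ok(SomeType(n)),
        Err(_) => Err("not a number"),
    }
}

fn double(value: SomeType) -> u32 {
    value.0 * 2
}

fn main() {
    // `double(make("21"))` would be rejected: the function expects a
    // `SomeType`, and a `Result<SomeType, &str>` is a different type.
    match make("21") {
        Ok(value) => println!("doubled: {}", double(value)),
        Err(msg) => println!("error: {msg}"),
    }

    // Explicitly discarding the Result is still allowed.
    let _ = make("oops");
}
```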

            1. 3

              That’s awkward to say aloud. :-) (I agree, though.)

              (Edit for non-native English users: Result<SomeType, &str> isn’t a SomeType, but it is a sum type, and “some” and “sum” have the same pronunciation.)

          3. 2

            The linters help some, but there are edge cases that they miss (or at least used to). The last time I was writing Go in anger, variations on this bug came up over and over:

            foo, err := f()
            if err != nil {
              return err
            }
            bar, err := g(foo)

            (I may be getting the exact details wrong, but the weird/subtle semantics around := with a pre-existing err variable caused pain in a number of dimensions.)

            1. 2

              Yeah, the compiler only catches dead variables. Here err is reassigned, so there are no dead variables, just a dead store, and the compiler will not say anything.

              Staticcheck should catch it, from what I’ve been told.

        2. 3

          What Go and Rust both have is explicit control flow around errors (though you have to get used to looking for those ?s in rust). That is to say, you aren’t longjmping half way across the project because of an error state.

          That makes things so much easier to reason about than what I’ve seen in Java/C++/JS (the latter two having the distinction of being able to throw things that aren’t even error objects).

          Call me lucky, but I’ve never had things go wrong because an error was actually unhandled.

          Things going wrong because they were in a different state than I thought? That one crops up quite often, and is worse with implicit control flow.

        3. 2

          TIL, thanks! I was sure that go vet lints unused results, just like Rust does, but that doesn’t seem to be the case. The following program passes vet:

          package main
          import (
          	"errors"
          )
          func e() error {
          	return errors.New("")
          }
          func main() {
          	e() // result silently discarded
          }

          1. 2

            go vet does very little, for this you need errcheck, probably.

            Likely also staticcheck for e.g. dead stores, as the Go compiler only checks for dead variables.

          2. 1

            That’s not an error, but this is because err is unused:

            func main() { err := e() }

            It does happen sometimes that you move a code block and an err ends up without an if err != nil after it, and it doesn’t trigger the “no unused values” rule, but it’s sort of unusual.

      2. 2

        This is exactly what I was thinking as I read the interview, so I found it really interesting that Anders Hejlsberg was so opposed to this style of error handling. He’s not someone whose opinion I’d dismiss easily!

        When Hejlsberg argued against checked exceptions on the grounds that adding a new type of exception to a function broke the interface, I completely agreed with the interviewer:

        But aren’t you breaking their code in that case anyway, even in a language without checked exceptions? If the new version of foo is going to throw a new exception that clients should think about handling, isn’t their code broken just by the fact that they didn’t expect that exception when they wrote the code?

        But Hejlsberg disagrees:

        No, because in a lot of cases, people don’t care. They’re not going to handle any of these exceptions. There’s a bottom level exception handler around their message loop. That handler is just going to bring up a dialog that says what went wrong and continue.

        Surely in any kind of event-driven application like any kind of modern UI, you typically put an exception handler around your main message pump, and you just handle exceptions as they fall out that way.

        You don’t want a program where in 100 different places you handle exceptions and pop up error dialogs. What if you want to change the way you put up that dialog box? That’s just terrible. The exception handling should be centralized, and you should just protect yourself as the exceptions propagate out to the handler.

        That’s when I realized that error handling in the software he’s talking about - GUI desktop applications - might be very different from the software that I’m used to writing. The scalability argument at the end also kind of rang true.

        I wonder if the Rust/Zig/Go style of error handling actually works well for GUI desktop applications. Do you have experience of that sort of thing?

        Is it your final point - that we default to existentially-typed AnyError - that prevents the problems Hejlsberg describes?

        1. 4

          If the new version of foo is going to throw a new exception that clients should think about handling, isn’t their code broken just by the fact that they didn’t expect that exception when they wrote the code?

          Note that this is a litmus test for the difference between checked exceptions and the approach taken by newer languages.

          In Java, throwing a new kind of thing requires updating both the functions that just propagate the errors, and the functions that handle the errors.

          With newer languages, you don’t have to change the propagation path. If a new variant is added to a Rust error enum, only the code that matches on it needs to change; the ? paths stay the same.

          Separately, it turns out that pretty often you don’t want to update the match site as well (eg, because you don’t actually match exhaustively, so paying for allowing exhaustive match is worthless). The single Exception type allows for this.
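
          A small sketch of that difference, with made-up names (ConfigError, parse_port, load, describe are hypothetical). Treat TooLarge as the newly added variant: the function that only propagates with ? is untouched, and only the function that matches has a new arm.

```rust
// Hypothetical error enum; `TooLarge` plays the role of the new variant.
#[derive(Debug, PartialEq)]
enum ConfigError {
    Missing,
    NotANumber,
    TooLarge(u32),
}

fn parse_port(raw: Option<&str>) -> Result<u32, ConfigError> {
    let text = raw.ok_or(ConfigError::Missing)?;
    let port: u32 = text.parse().map_err(|_| ConfigError::NotANumber)?;
    if port > 65_535 {
        return Err(ConfigError::TooLarge(port));
    }
    Ok(port)
}

// Propagation path: nothing here names a variant, so adding one
// does not require any change.
fn load(raw: Option<&str>) -> Result<String, ConfigError> {
    let port = parse_port(raw)?;
    Ok(format!("listening on {port}"))
}

// Handling site: the only place that matches on the variants.
fn describe(raw: Option<&str>) -> String {
    match load(raw) {
        Ok(msg) => msg,
        Err(ConfigError::Missing) => "no port configured".to_string(),
        Err(ConfigError::NotANumber) => "port is not a number".to_string(),
        Err(ConfigError::TooLarge(p)) => format!("port {p} is out of range"),
    }
}

fn main() {
    println!("{}", describe(Some("8080")));
    println!("{}", describe(Some("70000")));
    println!("{}", describe(None));
}
```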

          might be very different from the software that I’m used to writing

          I would say that that strategy applies to a vast majority of apps, which fall into one of the two categories:

          • there’s some sort of top-level event loop, and the underlying transactional data-store, where, if a single event loop turn goes wrong, you just alert the operator and move on (sometimes there’s an outer service manager loop which restarts your service on a crash)
          • it’s a “run-to-completion” program, like a CLI, where you print the error and exit (maybe not even unwinding the stack)

          Some notable exceptions here are:

          • localized retries, where you exponentially back off a network request, or do things like “if file does not exist, create file”
          • something high-reliability/embedded, where you precisely know every syscall you make and trace each and every error path

          Not an exception, but a situation which can often be misunderstood as one, is when your target domain has a notion of error. Eg, if you are writing a compiler, syntax errors in the target language are not host-language errors! A missing semicolon in the file you are compiling, and inability to read the file due to permissions issues are two different, orthogonal kinds of things. This also comes up when implementing databases: “transaction fails due to would be constraint violation” is not an exception, it’s a normal value of database’s domain model.
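
          To sketch the compiler case under made-up names (Diagnostic, HostError, check are all hypothetical, and the "missing semicolon" rule is a toy): problems in the input are ordinary values in the Ok result, and Err is reserved for the host-level failure of not getting the input at all.

```rust
// Hypothetical names throughout; the semicolon check is a toy rule.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    line: usize,
    message: String,
}

#[derive(Debug, PartialEq)]
enum HostError {
    Unreadable(String),
}

// Domain-level problems (bad syntax) are part of the normal output;
// only host-level failures (source could not be obtained) use `Err`.
fn check(source: Result<&str, HostError>) -> Result<Vec<Diagnostic>, HostError> {
    let text = source?;
    let diagnostics: Vec<Diagnostic> = text
        .lines()
        .enumerate()
        .filter(|(_, line)| !line.trim().is_empty() && !line.trim_end().ends_with(';'))
        .map(|(i, _)| Diagnostic {
            line: i + 1,
            message: "missing semicolon".to_string(),
        })
        .collect();
    Ok(diagnostics)
}

fn main() {
    match check(Ok("let x = 1;\nlet y = 2")) {
        Ok(diags) => {
            for d in diags {
                println!("line {}: {}", d.line, d.message);
            }
        }
        Err(e) => println!("could not compile: {e:?}"),
    }
}
```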

          1. 1

            In Java, throwing a new kind of thing requires updating both the functions that just propagate the errors, and the functions that handle the errors. With newer languages, you don’t have to change the propagation path

            That’s not true. If your n-level deep nested method had a return type of String, and you change it to Result<String, SomeErr>, then in all those languages you will need to refactor recursively until you hit the part where you want to handle the error condition. If it already returned a Result type Result<T, SomeEnum> as in your example, then analogously you can also just add more variants to it in Java by subclassing a checked exception.

            Currently no mainstream language is polymorphic this way — koka and similar experimental languages can do this by having first class Effect types.

            1. 2

              When I say a new variant, I mean specifically going from n, n > 0 to n+1, not about 0 to 1. If we restrict throws declaration to just throws Exception we get Swift/Midori/Go semantics, yes.

              If it already returned a Result type Result<T, SomeEnum> as in your example, then analogously you can also just add more variants to it in Java by subclassing a checked exception.

              This doesn’t work quite analogously. With enums, you can combine unrelated ErrorA and ErrorB into ErrorAorB. With subtyping and single inheritance, you either need to go all the way up to throws Exception, use composition (and break subtyping-based catch), or, well, throw ErrorA, ErrorB.
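
              Roughly what I mean by ErrorAorB, as a sketch (step_a, step_b and both are hypothetical functions). The two error types share no supertype; the enum wraps each one, and the handler gets them back by pattern matching:

```rust
// Two unrelated error types, standing in for errors from different libraries.
#[derive(Debug, PartialEq)]
struct ErrorA;
#[derive(Debug, PartialEq)]
struct ErrorB(i32);

// The combined type is a sum of the two.
#[derive(Debug, PartialEq)]
enum ErrorAorB {
    A(ErrorA),
    B(ErrorB),
}

// `From` impls let `?` convert each underlying error automatically.
impl From<ErrorA> for ErrorAorB {
    fn from(e: ErrorA) -> Self {
        ErrorAorB::A(e)
    }
}
impl From<ErrorB> for ErrorAorB {
    fn from(e: ErrorB) -> Self {
        ErrorAorB::B(e)
    }
}

fn step_a(ok: bool) -> Result<i32, ErrorA> {
    if ok { Ok(1) } else { Err(ErrorA) }
}

fn step_b(n: i32) -> Result<i32, ErrorB> {
    if n >= 0 { Ok(n + 1) } else { Err(ErrorB(n)) }
}

fn both(ok: bool, n: i32) -> Result<i32, ErrorAorB> {
    let a = step_a(ok)?;
    let b = step_b(n)?;
    Ok(a + b)
}

fn main() {
    // "Catching" is a match on the wrapper, so composition keeps working.
    match both(true, -3) {
        Ok(total) => println!("total: {total}"),
        Err(ErrorAorB::A(_)) => println!("step A failed"),
        Err(ErrorAorB::B(ErrorB(n))) => println!("step B rejected {n}"),
    }
}
```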

              1. 1

                Correct me if I’m wrong but Rust’s enums are regular old sum types like you would have in Haskell. Then you can’t have

                struct ErrorC {}
                enum MyErr {
                  ErrorA, ErrorB
                + ErrorC
                }

                Your ErrorC would need to wrap the outside defined ErrorC, so ErrorC(ErrorC). Composition works in a completely analogous fashion with subtyping as well. (The real difference between sum types and inheritance from this aspect is their boundedness/sealedness. That’s why Java’s sum types are named that way.)

                To actually do what you want, you would need union types, which are similar in concept but not completely the same. TypeScript and Scala 3 do have these, for example, denoted as A | B

                1. 1

                  Composition works the same, but catching breaks. Java catching is based on subtyping, and doesn’t work with composition. Rust catching is pattern matching, it works with composition.

                  1. 1

                    Java’s checked exceptions surely leave much to be desired, especially in ergonomics, but that can also be reasonably emulated by throwing/catching GenericException with a single composited object (perhaps of YourSealedInterface type since a few years), on which you can do pattern matching with a switch expression. Not beautiful by any means, but not weaker in expressivity.

                    As I wrote, I believe you have to go to Effect types to actually raise that level (so that for example a map function can throw a checked exception if its lambda does)

          2. 1

            With newer languages, you don’t have to change the propagation path. If a new variant is added to a Rust error enum, only the code that matches on it needs to change; the ? paths stay the same.

            Ah, OK. That seems a lot like unchecked exceptions. I don’t have much experience with Rust - and I last tried it years ago - but I remembered error handling being fiddlier than that.

            IIRC, the annoying part was when I was writing a function for a library - i.e. to be used by others - but in turn I was using multiple other libraries, each of which returned their own error types. I remember writing a lot of boilerplate to create my own error type that wrapped all of the possible underlying errors. Is it possible to pass them along up the stack without doing that?

            1. 1

              This already is less work than in the world of checked exceptions. If an underlying library foo adds a new error variant to its foo::Error type, then your code already doesn’t need to change. The work is proportional to the number of error types, not to the number of call-sites to fallible functions.

              It’s still a lot of work though! That’s why usually languages try to bless some sort of “any-error” type, like Swift or Go. It is quite a bit simpler if you don’t try to be precise with the types of errors, and just downcast at the catch site, if you need that. In Rust, this is available via the anyhow crate.
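
              For a feel of the “any-error plus downcast” shape without any extra crates, here is a sketch using the standard library’s Box<dyn Error>. This is a std-only stand-in, not anyhow’s API, and parse_and_halve / report are hypothetical names:

```rust
use std::error::Error;
use std::num::ParseIntError;

// `Box<dyn Error>` accepts any error type through `?`.
fn parse_and_halve(text: &str) -> Result<i32, Box<dyn Error>> {
    let n: i32 = text.trim().parse()?; // ParseIntError is boxed by `?`
    if n % 2 != 0 {
        return Err("odd numbers cannot be halved".into()); // &str is boxed too
    }
    Ok(n / 2)
}

// The catch site downcasts only when it cares about a specific error type.
fn report(text: &str) -> String {
    match parse_and_halve(text) {
        Ok(n) => format!("ok: {n}"),
        Err(e) => match e.downcast_ref::<ParseIntError>() {
            Some(parse_err) => format!("bad input: {parse_err}"),
            None => format!("other error: {e}"),
        },
    }
}

fn main() {
    println!("{}", report("10"));
    println!("{}", report("abc"));
    println!("{}", report("7"));
}
```

              The propagating function never lists which errors can occur; only report decides whether the concrete type matters.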

              One thing in this space I’ve realized recently is that anyhow’s approach is usually recommended for applications, but, because in the apps you don’t actually care about semver, just making a giant error enum with thiserror might also be fine, and even better along some dimensions.

              1. 1

                I just checked and the first commits to both the anyhow and thiserror repositories came about a fortnight after I last tried writing anything in Rust: they look like they would have helped with the boilerplate tedium.

                I really dislike exceptions because they obscure control flow: it makes code review very difficult. Most of my career has tended towards what you called “something high-reliability/embedded”, so I like to know exactly what errors can occur at any point in the program.

                It’s always interesting to learn about other domains and how practices, such as error handling, have different costs and benefits.

                1. 2

                  I just checked and the first commits to both the anyhow and thiserror repositories came about a fortnight after I last tried writing anything in Rust: they look like they would have helped with the boilerplate tedium.

                  anyhow and thiserror are two surviving siblings in a family tree of boilerplate-reducing Rust error libraries. Before they were released, there was their uncle snafu, which still survives and is more boilerplate-reducing than thiserror, and which I prefer, and there were fehler (2019–2020), failure (2017–2020), error-chain (2016–2020), and quick-error (2015–2021), all of which I think have been abandoned by now — except, I’m surprised to see, quick-error, the oldest of these but still getting nearly 2 million downloads per month tracked by

                  1. 1

                    That’s good to know, thanks. I was only tinkering with Rust for a couple of personal projects so wasn’t aware of these libraries.

        2. 2

          That’s when I realized that error handling in the software he’s talking about - GUI desktop applications - might be very different from the software that I’m used to writing. The scalability argument at the end also kind of rang true.

          The pattern of a “main loop” that also includes the last-resort error handler – which in many cases is also the only error handler, with everything else doing try/finally as mentioned in the article – is not some sort of weird rare fringe thing, nor is it exclusive to “GUI desktop applications”. Many types of long-running programs, especially ones that may receive external input as they run (say, network daemons, web applications, etc. etc.), are built that way.

          1. 3

            Is it really that common for the only error handling to be done at the outermost event loop?

            1. 1

              In many networked services, yes.

              There might be the occasional case where, say, a blog app knows it might not find a blog entry matching the request URL and the underlying DB tool surfaces that as an exception that the app catches at that spot and transforms into a 404 response, but in general “use a finally to release any resources you were using, and otherwise let it bubble up” is a pretty standard practice, so much so that in many web frameworks the outermost main loop has code to transform various exceptions into specific HTTP response codes (example).

              1. 1

                Fair enough. GUI desktop applications and web apps then :)

                I guess these are both situations in which someone else - either the human using the desktop application or the human using the web app (or, at least, the API client) - is best placed to decide what to do about an error. In both cases the program itself only handles the error by turning it into an appropriate representation for reporting externally: either a popup error dialog or an HTTP response code.

                Presumably there’s also some logging or telemetry to help whoever is responsible for the program to deal with the underlying issue: again, this is external to the program itself.

            2. 1

              It’s a stupid version of what Erlang does well, so probably :)

              1. 1

                Actually, an Erlang supervision tree is quite a good counter-example. I guess for a really simple system you might just handle things at the very root of the tree but for most “real world” systems you’ll have multiple levels of supervisors with multiple supervision strategies.

          2. 1

            I would say it’s more common than the “run main do some work and exit” paradigm we’ve been stuck with as if everything is a unix command line program from the 70s.

            1. 3

              That’s probably true but those are hardly the only two options. Most of the software I’ve written over the last twenty years has been of the long-running event loop type, but I’ve never just slapped a single generic error handler over the outermost loop. I can see how it might make sense for a GUI desktop application, where you just want to pop up some sort of error dialog to the human user, but otherwise there’s usually some context-dependent error handling to be done nearer the point of failure.

    4. 8

      I have to say, since sharing the article here, I have had some time to think about it and really don’t agree with the author on many of their points. The article feels like a mixture of the author’s actual frustrations with some light-hearted japes, but it’s hard to differentiate them.

      A lot of these ‘complexities’ the author complains about are features of the language that allow it to be so safe. What little time you spend making use of Option, Result or puzzling out lifetimes is potentially an order of magnitude less time than would be spent debugging errors down the line.

      See that? No return statement. It’s like playing hide-and-seek with semicolons. The value of the last expression is returned automatically, whether you like it or not.

      This statement suggests the author doesn’t fully understand the language features either, else he would have known that the value is only returned when omitting a semicolon at the end.
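
      A quick sketch of the rule (function names are made up): the tail expression is only the function’s value when it has no trailing semicolon, and an explicit return is always available.

```rust
// No trailing semicolon: the final expression is the function's value.
fn add_one(x: i32) -> i32 {
    x + 1
}

// Same behaviour, written with an explicit `return`.
fn add_one_explicit(x: i32) -> i32 {
    return x + 1;
}

// With a trailing semicolon the body evaluates to `()`, so nothing is
// returned implicitly; declaring `-> i32` here would be a type error.
fn print_plus_one(x: i32) {
    println!("{}", x + 1);
}

fn main() {
    println!("{}", add_one(41));
    println!("{}", add_one_explicit(41));
    print_plus_one(41);
}
```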

      Having said that, I did find the article thought provoking. It has ultimately served to only increase my appreciation of Rust though.

      1. 3

        I agree with you that most (but not all) of the complexity the author is complaining about is necessary/inherent complexity, but that doesn’t make it any more fun to learn. The article is more venting than making a technical point.

    5. 2

      As much effort has been spent trying to speed up Python as has been spent on JavaScript, and that seems odd. With JavaScript, until WASM adoption/features, there is no alternative for the browser platforms, but we have always had alternatives on every platform where Python exists.

      1. 4

        It’s all down to sunk costs - the ecosystem is large and of decent quality, which acts as a gravity well.

        1. 7

          It’s a really nice pairing of syntax and semantics, on top of a mostly-sane standard library. Choosing a language one enjoys using over all the other ones isn’t a sunk cost, it’s a privilege.

          1. 4

            There are certainly worse choices for a language, but my point was more about the performance. Python was never a performance beast, and this means you’re always limited to extensions. People just keep throwing more and more manpower at it to make it go faster, even where switching languages might actually be the wiser option.

      2. 2

        There are a lot of programs in Python that fall into the category where 98% of the code is fine in Python, but for the other 2%, Python is too slow. Being able to rewrite only that 2% in Rust is really valuable.

    6. 4

      I find this very unconvincing. The fact that you can’t get perfect coverage isn’t a good reason to not even try. Also, if you avoid integrated tests, huge sets of combinations of states can’t be tested at all. There are good arguments for limiting your use of integrated tests, but this article doesn’t have any.

      1. 1

        I should have added the rant tag, too, really. Unfortunately the message gets overshadowed by the clear irritation he feels.

    7. 2

      The one thing I would add here is that most active-passive HA systems receiving continuous writes lose data on failover. Because replication is asynchronous, any data that hasn’t been replicated yet is lost whenever the primary fails. For many scenarios, that’s ok, but avoiding data loss is the primary reason I’m aware of for using active-active HA.

      1. 2

        The article specifically mentioned that the active-passive replication is synchronous.

        Even with an active-active HA, a non-replicated write can be lost. Which is why you have EACH_QUORUM in Cassandra, for example.

        1. 2

          I don’t see any mention of active-passive replication being synchronous, and that would seem to contradict the definition. etcd is called out as an example of active-active because it uses synchronous replication.

          The post uses the term “sync” to cover both synchronous and asynchronous replication. The only example of active-passive I see is OpenVPN where the fact that replication is asynchronous is specifically highlighted.

          1. 1

            Maybe you are right. The post doesn’t mention async explicitly but somehow I didn’t pay attention to the quotes around “sync”.

            But regarding your original point, both active-active and active-passive HAs would require some level of non-async operations to be durable. In active-active systems, you can use quorum to avoid data loss with a high confidence (unless you use consistency level of ALL, to continue using the Cassandra terminology). But the active-passive systems only allow you two switches: ONE or ALL.

            To summarize, I would rather say that the ability to use a consistency level between ONE or ALL (such as EACH_QUORUM, QUORUM, or LOCAL_QUORUM) is the primary reason for using active-active HA.

    8. 3

      I get their point about always measuring, but most code bases should ~always use maps with transparent string hashing/comparisons. You can trivially create a wrapper that defaults to a reasonable transparent hash function. Similar to how you would never wait until a benchmark to stop passing a string by value. These death by a thousand cuts performance problems can be handled with good hygiene.

      1. 3

        Agreed. It’s one of those things that probably doesn’t show up in profiling because no single use of this pattern is likely to add much overhead, but you can easily have a hundred places where it adds 0.1% overhead.

        I was quite surprised that these were introduced as far back as C++14. I remember not using them because they weren’t in the version of C++ I had to support, but all of my C++ projects are now C++17 / C++20, with occasional bits of C++23. LLVM has a custom StringMap implementation, in part, because C++98 didn’t allow heterogeneous lookups on maps. Coming from Objective-C / Smalltalk, where the map type just requires the thing you look for to have a compatible hash and comparison method, this always struck me as a strange omission in C++.

        1. 1

          Similar to LLVM, heterogeneous string lookups were one of the motivating factors in the development of Abseil’s Swiss Table.

        2. 1

          you can easily have a hundred places where it adds 0.1% overhead

          I always call this “the death by 1,000 cuts” and it’s something I try to pay attention to in the early stages.

    9. 2

      FDB defaults to strictly serializable transactions but allows relaxing these semantics for applications that don’t require them with flexible, fine-grained controls over conflicts.

      This interested me. I found this:

      It doesn’t seem like it does exactly what I want though.

      Does anyone know if FoundationDB (or any db) will handle this conflict? I’ll just use a made up syntax: (assume a table of key/values)

      update value to 5 if value < 5 where key=='somekey'
      update value to 15 if value < 15 where key=='somekey'

      What I want to happen is for the value to be 15 since, no matter what order these writes would happen in, that is the result. So if the db gets both at the same time, I’d like it to just throw the first away.

      I assume a system that manages the transactions could perform this optimization?

      1. 2

        It seems to me that you are confusing two things. You only need a fairly weak consistency model to ensure that the key ends up set to 15. A database’s consistency model doesn’t imply anything about which optimizations will happen.

        1. 1

          Well… doesn’t it? Like, if a database enforces serializable reads and writes it would potentially reject one of the two transactions. The transaction manager would have to inspect the contents of the query to know that it can actually drop one of them. That sounds like an optimization to me.

          1. 1

            Serializable simply implies that the transactions will appear as if they occurred in some, unspecified order. Assuming the if/then logic is embedded in the query and not a full read/modify/write, the DB is well within its right to receive them concurrently, execute one, and retry the other. It could also do the optimization you described as there’s no guarantee about the linear order of reads and writes in time.

            For FoundationDB specifically, a read/modify/write would fail one of the transactions. If you implemented that logic as an atomic op, both transactions would succeed and both states would be visible at their respective commit versions. FDB could, but doesn’t yet, coalesce those transactions by not making every version readable. AFAIU, having a strict serializable version history where only a subset of versions are readable is still acceptable and is something I’ve advocated for when I was a part of the FDB community.
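
            As an in-process analogy only (this is Rust’s std AtomicI64, not the FoundationDB API, and apply_concurrently is a made-up helper): a “raise to at least N” operation commutes, so concurrent writers don’t conflict and the final value doesn’t depend on the order they land in.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::thread;

// Applies "set to `candidate` if the current value is smaller" from
// several threads at once and returns the final value.
fn apply_concurrently(initial: i64, candidates: &[i64]) -> i64 {
    let value = Arc::new(AtomicI64::new(initial));
    let handles: Vec<_> = candidates
        .iter()
        .map(|&c| {
            let value = Arc::clone(&value);
            thread::spawn(move || {
                // Atomic max: no read/modify/write race, no retry loop.
                value.fetch_max(c, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker thread panicked");
    }
    value.load(Ordering::SeqCst)
}

fn main() {
    // Whichever update runs first, the result is the same.
    println!("{}", apply_concurrently(0, &[5, 15]));
}
```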

            1. 1

              Thanks, that sounds like it accurately describes what my ideal behavior is. I don’t need (nor do I want to pay any cost for) those intermediary versions. Atomic operations sound interesting though, I may check that out.

                1. 1

                  Yeah, this looks perfect. I wonder if this is something available in other dbs.

    10. 1

      The Rust equivalent would be the Borrow trait.
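
      A minimal sketch, assuming the comparison is with looking up a string-keyed map without building a temporary owned string (build and lookup are hypothetical helpers). HashMap::get accepts any key type Q where the stored key implements Borrow<Q>, and String implements Borrow<str>:

```rust
use std::collections::HashMap;

fn build() -> HashMap<String, u32> {
    let mut ports = HashMap::new();
    ports.insert(String::from("http"), 80);
    ports.insert(String::from("https"), 443);
    ports
}

// `String: Borrow<str>`, so a `&str` works as the lookup key and no
// temporary `String` is allocated for the query.
fn lookup(ports: &HashMap<String, u32>, name: &str) -> Option<u32> {
    ports.get(name).copied()
}

fn main() {
    let ports = build();
    println!("{:?}", lookup(&ports, "https"));
    println!("{:?}", lookup(&ports, "gopher"));
}
```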

      1. 1

        I would have thought it was more similar to the raw entry API.

    11. 7

      This feels tedious to read: overly defensive and like an ad for AWS at times. Maybe I’m not the intended audience.

      1. 5

        it is tedious, but it makes it clear it’s a little more nuanced than just “monolith vs microservices”

        1. 2

          I would argue it’s mostly no true Scotsman. There’s no good definition of microservices nor monoliths, so a successful project must have been the side I was already defending.

      2. 5

        It’s defensive because the initial attack by DHH was overly aggressive and completely wrong, so they need to make this article tedious and watertight. Discussion on DHH’s blog post.

    12. 1

      I’m surprised Pacemaker didn’t have any logging about unresolvable constraints. Even if the solver couldn’t figure out the particular constraints that broke the system, that would have been a great hint to avoid the WAL debugging, which turned out to be a bit of a non sequitur.

      1. 1

        At least at the time, we didn’t find anything useful in the logs. It’s possible that’s changed in later versions or would have been present with (e.g.) debug logging enabled.

        There is a CLI tool for simulating specific situations in advance, but that requires foresight into edge cases that might happen, rather than telling you what it’s currently doing in production.

        Lack of introspectability was ultimately a big reason we left Pacemaker behind and started using Stolon.

    13. 1

      Is there a Rust equivalent to this? I wrote a horrible hack at a former job that caused a bunch of pain.

      1. 1

        In addition to the sibling comment there is also

    14. 1

      I’ll add a seventh:

      Letting healthchecks take out most or all of a cluster automagically. Sometimes the problem really isn’t in one failing node, but instead is an outage that’s being triggered by incoming traffic and that same traffic will hurt any node it lands on (due to overwhelming volume, or due to a specific bug or performance issue with a specific query). The point of healthchecks is to isolate singular nodes failing for local reasons, but in these sorts of cases, the loadbalancer will shoot down a node which is merely unlucky (first to fail due to the impactful traffic) rather than a truly unhealthy one. As the traffic keeps coming in (especially if it’s a high-volume problem) your balancer will eventually shoot down most or all of your nodes at an accelerating pace, making everything about the situation worse than it had to be.

      One reasonably simple mitigation strategy is to oversize the cluster on load capacity and set some kind of depool threshold: for example, if you /need/ 8 nodes alive to handle intended capacity, deploy 11 of them and don’t allow the healthchecking balancer to remove more than 3 at a time from healthcheck failures and/or manual maintenance depools. It’s not perfect, but it’s a good first step!

      Addendum: another strategy I’ve also pursued, which is also not perfect: when the threshold above is reached, consider the check itself to be at fault and put all nodes back in service that aren’t manually-depooled for maintenance, at least until some of them recover back over the threshold.
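
      A minimal Go sketch of both ideas, the depool cap and the fail-open fallback. `Node`, `InService`, and `maxDepooled` are made-up names for illustration, not any particular balancer's API, and it treats "threshold reached" as "more nodes would be out than the cap allows".

```go
package main

// Node is a hypothetical view of one backend as seen by the balancer.
type Node struct {
	Name        string
	Healthy     bool // result of the latest healthcheck
	Maintenance bool // manually depooled by an operator
}

// InService returns the names of the nodes that should receive traffic.
// maxDepooled caps how many nodes may be out of service at once
// (3 in the 11-node example above).
func InService(nodes []Node, maxDepooled int) []string {
	depooled := 0
	for _, n := range nodes {
		if n.Maintenance || !n.Healthy {
			depooled++
		}
	}

	// If honoring every failed check would exceed the cap, assume the
	// check (or the traffic) is at fault and stop trusting it.
	failOpen := depooled > maxDepooled

	var out []string
	for _, n := range nodes {
		if n.Maintenance {
			continue // manual depools are always respected
		}
		if n.Healthy || failOpen {
			out = append(out, n.Name)
		}
	}
	return out
}
```

      With 11 nodes and a cap of 3, three failing checks leave 8 nodes in service, while five failing checks put all 11 back in rotation.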

      1. 2

        Envoy has a similar strategy where you configure a threshold of percentage endpoints being healthy. Below that threshold, health checks are ignored, and all the endpoints will receive traffic regardless of health check status.

    15. 1

      How much is this a problem in practice for enterprise accounts with 2FA enabled? Most of the complaints I saw were from consumers without 2FA turned on, but I also didn’t look too hard.

      1. 3

        If you mean for Google Workspaces, it seems like it’s less of a problem because you can supposedly disable risk-based authentication in it. I’m not sure this entirely mitigates the other issues like the browser discrimination, and the inscrutability of the system remains frustrating.

        Moreover it does seem like enabling 2FA acts as a kind of undocumented cheat code to disabling risk-based authentication in many systems. I actually often enable 2FA on some accounts I don’t care about the security of (and pointlessly store the TOTP secret in a password safe same as the password).

        Why? Because a lot of services seem to treat it as a flag to a) disable non-deterministic authentication, and b) disable password recovery (or at least make it moot if you can’t also recover the TOTP secret). I consider the former actively desirable in all circumstances, and I consider the latter desirable because I have enough confidence in my own credential storage practices that I’m more than happy to accept the risk and responsibility of being locked out if I lose that token.

      2. 3

        Okta calls this “Adaptive MFA” and it’s an endless source of confusion for our customers.

    16. 7

      What I’d like to see most is support for efficient pagination. If you have a few billion rows, it should be as fast to scroll to the last page as the second page. Maybe a new table type is needed to enable this. FoxPro and Clipper were able to do this decades ago.

      1. 3

        How did pagination work for FoxPro and Clipper? I can’t imagine how a database table could support O(1) insertions and deletions + O(1) access to the nth element + concurrent transactional reads and writes. So they must have been compromising on one or more of these properties. Depending on what exactly they compromise, I’m wondering whether it would be possible to implement the same kind of access pattern in Postgres as a set of tables, indexes and triggers.

        1. 5

          The big challenge would be cache management, since you’d basically be materializing the results of every query that needed to support fast pagination, and you’d need to know when it was safe to discard those cached results. “When is the user done looking at this?” is always a tricky question in a stateless web app.

          If you were willing to pin user sessions to database connections, you could probably get pretty far with temporary tables or with keeping cursors open across user interactions. That’s how database-backed desktop apps worked back in the day when I was writing them.

          My hunch is that to the extent fast pagination is harder in PostgreSQL than it used to be in old-school database systems, it’s more because the client side is radically different than because the database is less capable.

          1. 3

            You’re not wrong, but unfortunately Postgres has a limit on the number of simultaneous connections, usually on the order of hundreds or thousands.

            The solution would be to materialize the results to either another store or unlogged tables that are periodically cleaned up.

            1. 1

              That is good news. For the type of CRUD applications I’m thinking of, thousands of users is more than enough. If more users are needed, fast scrolling of tables with billions of records will need to be rethought, maybe eliminated from the design.

        2. 2

          Doesn’t postgres have O(log N) insertion/deletion/access because it’s a b-tree underneath? You could augment the b-tree with order statistics (a la an order statistic tree), but that would come at a cost for a feature many users wouldn’t benefit from. I don’t know enough about postgres internals to know if there would be a way to enable it as an optional table feature in a low cost way.

        3. 1

          Clipper and FoxPro are from the days before networking; there was no concurrency. Also, FoxPro wasn’t O(1), but it was very very fast at what it did. A little language that made noSQL databases and tables super easy to work with for making CRUD applications.

          1. 3

            I used Clipper and then FoxPro as part of two technical support call center jobs for case management. Maybe the application didn’t handle networking directly, but it was definitely used in a networking environment. In our case this was provided by Novell NetWare under MS-DOS. It was fast, and when customers called in we would always have to ask for a “customer number” (which allowed us to probably get O(1) back then). When customers didn’t know that, we had ways to look it up by first name and last name (which was also pretty quick). Now, we never had a billion customers, but I’m pretty sure we had thousands of customers, and the same application was able to look up invoices as well. It’s funny to think about how fast this solution was back then compared to how slow Salesforce is now with their own case management solutions.

            1. 1

              Yeah. With the database backend secured, I’d like to make a lisp package to do all the Foxpro convenience functions.

      2. 1

        Using cursors is an option that works pretty well. Also, with the right indices, and if you don’t need to support arbitrary pagination, it’s possible to query without LIMIT and OFFSET but with WHERE order_col(id) > order_col(last_id_of_prev_pagination).

        1. 1

          Thank you, that is O(1)?

          1. 1

            More O(your where clause), it’s difficult to extricate that, but yes this technique (“keyset pagination”) has roughly constant performance compared to offset pagination’s proportional increase with page distance.
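
            As a rough illustration, here is keyset pagination over a sorted in-memory slice in Go. The slice stands in for an index on the ordering column and the binary search stands in for the index seek; `NextPage` and its parameters are made up for this sketch. In SQL the same idea is roughly `WHERE id > $1 ORDER BY id LIMIT $2`.

```go
package main

import "sort"

// NextPage returns up to pageSize ids strictly greater than lastID.
// ids must be sorted ascending, standing in for an index on the
// ordering column.
func NextPage(ids []int, lastID, pageSize int) []int {
	// The binary search plays the role of the index seek: its cost
	// depends on the table size, not on how deep the page is.
	start := sort.SearchInts(ids, lastID+1)
	end := start + pageSize
	if end > len(ids) {
		end = len(ids)
	}
	return ids[start:end]
}
```

            The caller passes the last id of the previous page to get the next one, which is also why jumping straight to an arbitrary page number isn’t supported.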

            1. 1

              I’ll give it a whirl, thanks

    17. 2

      What’s with the weird political turn at the end? I’ll try not to editorialize too much, but it seemed very unnecessary.

    18. 1

      Incidentally, it occurs to me that playing badly with other languages is probably a design choice: it makes it much easier for google to rebuild the entire library supply chain from scratch if it’s all in one language.

      This has the benefit that you can patch the source of a dependency and have all your software benefit from it, without needing to make dynamic linking work, and makes it easy to know if your deployed artifact came from a build that used the patched version.

      1. 2

        Playing badly with other languages has been a big blocker to internal adoption at Google; there are tons of libraries/systems that just don’t work in Go.

    19. 5

      I’m actually surprised. Completely did not expect the smaller allocation to make any difference. Since the only difference should be the initialisation time… that sounds like a lot of time for simple clears. Unless there’s more page faulting than I would expect.

      The next improvement then would be to preallocate that slice and clear instead of making a new one.

      So the accounting of “alloc”s seems weird to me. Here are 4 extra variants:

      Benchmark1          1224            899297 ns/op           30092 B/op        218 allocs/op
      Benchmark2           640           1803190 ns/op           18012 B/op         24 allocs/op
      Benchmark3         24681             46804 ns/op           17416 B/op          8 allocs/op
      Benchmark4          2342            440008 ns/op           17418 B/op          8 allocs/op

      1 is original, 2 is original with clear instead of make, 3 is slice, 4 is slice with clear instead of make.

      I feel like this will need some deep dive to understand how make is better optimised than a loop clearing the values. And why does benchmark 4 have the same number of allocs as 3.

      And why is the difference so large for clearing? Clearing 256 bytes of data should be trivial, unless go doesn’t optimise that loop at all…

      1. 3

        Looks like go is silly at (not) optimising trivial loops. make for slice becomes DUFFZERO $276 which I assume is unrolled. Clearing the same slice becomes a really standard, basic, byte-at-a-time loop, which… why are you like this, Go?

        The funny thing is that this actually ends up going against the title of the post. Go can’t optimise a basic slice clear and a (potentially) memory-wasteful version stays 10x faster, because it’s got known optimisations hardcoded.

        The alloc count turns out to not include stack allocations, which makes sense, so the count is equivalent for both the make variant and for a reused global.

        1. 5

          I thought one of the points of keeping Go a simple language was that they can make exactly these kinds of compiler optimizations?

          Also explains why my dumbest thing I could think of in Rust was 3 times faster :-/

          1. 3

            Compile times. The compiler team has been up front that they are willing to trade performance for faster compile times.

            As of a couple of years ago, the rules for what kinds of functions could be inlined were a great example of this. Only the simplest of functions were eligible.

          2. 2

            They kept everything simple. For example, only adopting a register-based calling convention two years ago.


        2. 2

          That hasn’t been my experience. I’m having difficulty writing code that clears a slice without having it be optimized to a runtime call that does vectorized clearing, etc. See What code did you use to clear the slice?

          1. 1

            So both a basic for i:=0... and for i:=range... with slice[i]=false are slower (10x and 2x respectively) than the DUFFZERO that go uses for clearing a new local.
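
            For anyone who wants to reproduce this, the variants being compared probably look something like the sketch below. This is a guess at the shape, not the code from the post: the function names are made up and the size is assumed from the "256 bytes" mentioned earlier.

```go
package main

// size is assumed from the "256 bytes" mentioned above.
const size = 256

// FreshSlice allocates a new zeroed slice on every call (the "make" variant).
func FreshSlice() []bool {
	return make([]bool, size)
}

// ClearIndexLoop zeroes an existing slice with a classic index loop.
func ClearIndexLoop(s []bool) {
	for i := 0; i < len(s); i++ {
		s[i] = false
	}
}

// ClearRangeLoop zeroes an existing slice with a range loop.
func ClearRangeLoop(s []bool) {
	for i := range s {
		s[i] = false
	}
}
```

            Wrapping each one in a `testing.B` benchmark and running `go test -bench . -benchmem` is one way to compare them, and `go build -gcflags=-S` prints the generated assembly if you want to see what each loop compiles to.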

    20. 1

      Really enjoyed this, but I do want to stand up for “treating the symptoms” as a useful approach. You should always try to fix the underlying cause, but I’ve never seen a large, distributed system without finding new, surprising failure modes. Systems that can’t self heal via back pressure, incremental progress, or other similar techniques are just one weird edge case away from a really bad day.