Makesure seems like exactly what I was looking for, and the example they give is just what I needed it for: check if there are new files, download, check signature, process, upload processed files. Now, I need to do that for a bunch of files, and it would have been great to have an option to do it in parallel from makesure. How could one do it in the simplest way? Wrap it under xargs and loop until all instances succeed?
If you have a bunch of files that can be fetched in parallel, I guess that would mean that they represent a single “node” in the dep graph, and so probably should be modeled as a single makesure goal. So probably you’d define a single goal that would get all of those files in parallel via whatever shell pattern you prefer.
Where is the boundary between process and system? Is it the host? Well, for systemd services, maybe, but what about Docker containers? In that case, is it the host, or the container? And what is a container, actually? Is it something more than just the cgroup constraints applied to a (set of) processes? And what about for Kubernetes pods? In that case, is it the container or the pod?
There are three kinds of observability I know about
There are generally three categories of telemetry data, yes, but (as you note) all telemetry data is fundamentally the same — so the categorization is based on access patterns, rather than any intrinsic property of the data itself. Those categories are metrics (counters and gauges, yes, but also histograms), traces (hierarchical spans), and logs (plain ol’ key/val events).
My summary of blameless culture is: when there is an outage, incident, or escaped bug in your service, assume the individuals involved had the best of intentions, and either they did not have the correct information to make a better decision, or the tools allowed them to make a mistake.
I agree that blameless culture assumes good faith effort by all individuals, but I think it goes further than simply absolving any mistakes, and removes individuals from root cause analysis altogether. The only thing that’s relevant is the code/tools involved in the incident. I think this is an important, if subtle, distinction – software has to accommodate human limitations, after all.

I definitely agree, though I’d add “processes” to “code/tools”.
It’s always struck me as very ingrained in American culture, and very counterproductive, to assign blame based on outcome. If you fire someone who made a production system crash by making a typo, you’re essentially saying “making typos can get you fired at our company”.
If you crashed a production system, you need to communicate this to everyone involved and everyone who can potentially help you fix it. Blaming is exactly the thing that makes you prefer “it was down for an hour but no-one really noticed” over “after crashing the thing I made some noise, now the whole company is aware but we got it up and running again in half an hour”.
Assigning blame is very common across cultures because it is the simplest corrective action: X screwed up, get rid of X and everyone else will be more careful and we’ll just find a replacement for X in a few hours.
This works well when it is cheap to replace personnel. The blameless philosophy is used in organizations across the world, throughout time, where it would be expensive to replace personnel, and it makes more economic sense to adopt safety practices (training tools) that reduce the chance for mistakes instead.
Blameless culture says that if a production system crashed, it is never because you or I or anyone screwed up, it is always because the relevant tools/code failed in some way. It’s a subtle but important distinction.
I work with a high-self-accountability team. With the way we work, the article points out one of the most important lessons we’ve learned: “do not stop the root cause analysis when someone admits blame”
We had an incident where a small UAS had a very bad takeoff that ended up being a near miss. People would have been injured if they’d been standing a couple feet to the left of where they happened to be standing. The pilot immediately took responsibility and told us how he’d screwed up. I was tasked with putting together a slide deck outlining the event.
Instead of just letting him take the blame, though, I got ahold of the onboard telemetry logs and discovered something that shook me pretty hard: while the pilot had provided a very reasonable explanation of how he believed he’d screwed up, the telemetry told a very different story. Yes, he had made a mistake, but not the mistake he thought he had. And further, it was significantly exacerbated by two (unknown at the time) behavioural defects in the autopilot. His explanation made sense given how we all believed the system worked, but the system itself didn’t work that way!
The opening slide was something like “Jim has taken responsibility for this incident, and he is incorrect…”
The outcome was that we all understood the system better, identified the training deficiencies that had led us all to misunderstand what to do in that scenario, updated checklists and procedures, and filed two tickets with the vendor. If we had just stopped when he took responsibility I am certain the same chain of events would have happened again, potentially with a worse outcome than a near miss.
I work with a high-self-accountability team. With the way we work, the article points out one of the most important lessons we’ve learned: “do not stop the root cause analysis when someone admits blame”
But that’s still assigning blame! It should be more factual, like “X lost control of the UAS”. Then, the outcome can be “X has had insufficient training to fly in terrain near people” or even “X loses control of a UAS significantly more often than other pilots”. But these outcomes are not likely in my experience; it is more often something like “in circumstance Z, during move Y, pilots will sometimes lose control” or even “we should never do test flights near people, as flight is inherently dangerous”.
You should have not one, but multiple safeguards in place and “someone screwed up” should never be a final conclusion of a postmortem. It essentially means “We have created a system which can do something terrible if someone makes a mistake. We saw this happening and we’re fine with it. We will not introduce any safeguards.”
I think we’re saying the same thing. I guess a different way to put it: “Someone self-assigning blame is not a sufficient root cause. Do not stop there. Find the systemic issues that resulted in the screw-up.”
You might get to the point where you can leave it there. Even people with good judgement sometimes make mistakes that weren’t really preventable. “A bird shit on my head and it got in my eye, I dropped the remote, the lanyard caught it but the throttle snagged on my shirt.” is probably a sufficiently rare event that it’s not worth mitigating or attributing to a systemic issue, but you gotta tug on the thread until you get to that point.
Even more, nobody should ever self-assign blame in a postmortem, and it’s important that if anyone tries to, it’s quickly shut down by the IC or postmortem director. Blameless culture is at least as much about psychological safety as it is about technical efficacy.
Blameless culture is at least as much about psychological safety as it is about technical efficacy.
My team (I was lead at the time) had been blamed for an outage/issue which caused downtime for our clients. The blame actually happened in a private Slack channel between our data team’s project manager and our upper management. I was informed privately that this occurred. I was so pissed/irritated, not just that my team was blamed but that it was done in private without us being able to defend ourselves. I had to literally take a walk to calm down. Later it was found to be the data team’s “feature”. Eventually that data team’s project manager was let go (at least partially over this issue). There was absolutely no need for any of that to have transpired had we just debugged the issue. I have since never blamed any person nor team for any issue, even if I knew exactly who had pressed the button.
Comments on the previous ID generator thread have convinced me that it’s better not to store the exact millisecond an ID was generated, but instead to fuzz it to the nearest second and add a counter. That’s what I’m doing now in SLID: https://code.lag.net/robey/nightcoral/src/branch/main/src/slid.ts
Your base32 doesn’t include 2 or v? (It also doesn’t include l, but I get that one.) What makes this alphabet “proper” where e.g. Crockford’s base32 alphabet is not?

That was a long time ago but I think it was pulled from a spec. Looks like a lowercase version from around 2002: https://en.wikipedia.org/wiki/Base32#z-base-32

Thanks for the link! Might be worth a comment, too :)
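For illustration, here is a minimal Go sketch of the general fuzzed-timestamp-plus-counter idea; this is not SLID’s format, alphabet, or code, and the ID layout below is made up:

package main

import (
	"fmt"
	"sync"
	"time"
)

// gen hands out IDs as "<unix-seconds>-<counter>": the timestamp is
// truncated to whole seconds, and a counter disambiguates IDs created
// within the same second.
type gen struct {
	mu      sync.Mutex
	lastSec int64
	counter uint32
}

func (g *gen) next() string {
	g.mu.Lock()
	defer g.mu.Unlock()
	sec := time.Now().Unix()
	if sec != g.lastSec {
		g.lastSec, g.counter = sec, 0
	} else {
		g.counter++
	}
	return fmt.Sprintf("%d-%04d", sec, g.counter)
}

func main() {
	var g gen
	fmt.Println(g.next(), g.next())
}

The point is just that IDs generated within the same second stay distinct and roughly ordered without recording the exact millisecond.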
I’ve used SQL in projects but in most cases only as a key-value store (only occasionally using simple queries like select * from tbl where prop>3). But the explanations I’d seen before of joins always seemed a little off to me.
The explanation in this article makes sense, though!
I have more experience with array languages and frameworks like pandas, and I gotta say the nested loop thing is really stupid. In my mind, databases have this mystique about them, like only really clever folks can work on them or use them properly. Is my perception based on a fundamental lie? Are databases only “hard” because they’re stupidly organized by default?
as a key-value store (only occasionally using simple queries like select * from tbl where prop>3).
Understanding that application code should never SELECT * and should always SELECT specific, column, names is a canon event for many/most software engineers 😇
First you learn to use it, then you learn that Real Programmers never use it, and then (much later) you realize there are cases where using anything else is a mistake. It’s the SQL version of goto.

I’m not sure SELECT * is comparable to goto in any meaningful sense, except that neither of them should ever be present in application code.
I posted a comment above. Databases are based on mathematical concepts rather than being based on delivering a specific productivity tool. This math is very convenient if you understand it and can make for a very productive tool, but many intuitions are only valid if you understand the deeper “what is going on here in theory” that many other tools don’t require. The mathematical model of relational databases predates actual databases.
So, it’s not you, it’s the tool.
It’s not terribly difficult to learn, but it’s often not intuitive either, and many places suck at explaining.

I made this many years ago as a layman explanation of table relationships and joins: https://m.youtube.com/watch?v=EcrO7hz-nfM
Since you bring up history I’d like to challenge what you wrote:
The mathematical model of relational databases predates actual databases.
E F Codd was working at IBM and wanted to come up with a better and more practical type of database system than the ones that were current. That’s why he developed the mathematical theory behind the relational model. True, that theory is connected to prior mathematics and formal logic, but that’s natural since all mathematics is connected somehow. The point is that the aim was to create practical database systems that would be convenient and flexible and stand the test of time because they were based on sound mathematical principles.
Not sure if that contradicts what you wrote or not, but I think there’s an important difference in flavor.
Totally agree that understanding the theory (which isn’t all that complicated) is key to unlocking the power of relational databases (and avoiding jumping on bandwagons that get going because people haven’t had the chance or taken the time to understand the relational model).
What I meant by that is that it predates RDBMS like MySQL or PostgreSQL. Not that it predates the concept of a database.
What I’m saying is the model came first. It was designed on paper before it was implemented. As opposed to being an iterative evolution of a product.
Note that the relational model is a theoretical model, whereas an RDBMS is an implementation of a database that leans on that theoretical relational model.
come up with a better and more practical type of database
In this case a database doesn’t need to be relational. In a broad and general sense I understand a database to be some abstraction that persists and manages data for storage and retrieval. One major problem databases help to solve is decoupling storage implementation from language memory models. Otherwise imagine reordering your C struct fields and then no longer being able to pull in information from your data store.
Rule lists like this wouldn’t be any good if none of them felt at least a little bit controversial.
Rule #5 I disagree with. I hate APIs that return clear map structures with “key”s inside, but I cannot lookup by the key because it’s a list and not a map. Just use a map when the data fits into that pattern.
Rule #8. 404 is fine for an API because all the invalid endpoints can just return 5xx. If the client needs further information, you can return an error code that never changes along with the response body.
I think another way to say Rule 5 might be: return objects with well-defined schemas. Or: given a specific key, its corresponding value should always be the same type. So if you return {"user":"John"} in one situation, it’s not cool to return {"user":123} in a different situation.
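One way to make that concrete in Go (field names made up): define the response shape once, so a given key can only ever carry one type.

package api

// UserResponse pins the schema down: "user" is always a string and
// "id" is always a string, never sometimes a number.
type UserResponse struct {
	User string `json:"user"`
	ID   string `json:"id"`
}

If all marshalling goes through a type like this, {"user":"John"} in one place and {"user":123} in another simply can’t happen.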
Rule #5 I disagree with. I hate APIs that return clear map structures with “key”s inside, but I cannot lookup by the key because it’s a list and not a map. Just use a map when the data fits into that pattern.
I’ve heard that list returns also require breaking changes to make any improvements. It’s a “Probably Gonna Need It” to use a map.
Was there also something that broke if you returned a list instead of a map, or was that just a vague memory?
I like a lot of this advice, parts like “always return objects” and “strings for all identifiers” ring with experience. I’m puzzled that the only justification for plural names is convention when it’s otherwise not at all shy of overriding other conventions like 404s and .json URLs. It’s especially odd because my (unresearched) understanding is that all three have a common source in Rails API defaults.
The difficulty with 404 is that it expresses that an HTTP-level resource is not found, and that concept often doesn’t map precisely to application-level resources.
As a concrete example, GET /userz/123 should (correctly!) 404 because the route doesn’t exist, I typoed users. But if I do a GET /users/999 where 999 isn’t a valid user, and your API returns 404 there as well, how do I know that this means there’s no user 999, instead of that I maybe requested a bogus path?

From solely the status code, you don’t.

Fortunately, though, HTTP has a thing called the response body, which is allowed to supply additional context and information.

Of course, but I shouldn’t need to parse a response body to get this basic level of semantic information, right?
Yeah, you should, because if we require that there be enough fine-grained status codes to resolve all this stuff we’re going to need way more status codes and they’re going to stop being useful. For example, suppose I hit the URL /users/123/posts/456 and get a “Route exists but referenced resource does not” response; how do I tell from that status code whether it was the user or the post that was missing? I guess now we need status codes for
Route does not exist
Route exists but first referenced resource does not
Route exists but second referenced resource does not
And on and on we go.

Or we can use a single “not found” status code and put extra context in the response body. There’s even work going on to standardize this.

Remember: REST is not about encoding absolutely every single piece of information into the verb and status code, it’s about leveraging the capabilities of HTTP as a protocol and hypertext as a medium.
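A rough Go sketch of that single-status-code-plus-body approach; the handler, the in-memory store, and the error body shape are made up, and the body is only loosely in the spirit of the problem-details style of error response mentioned above:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// users stands in for a real data store (made up for the example).
var users = map[string]string{"123": "John"}

func getUser(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id") // however you happen to extract the id
	w.Header().Set("Content-Type", "application/json")
	name, ok := users[id]
	if !ok {
		// Still a plain 404; the body carries the application-level detail.
		w.WriteHeader(http.StatusNotFound)
		json.NewEncoder(w).Encode(map[string]string{
			"error":  "user_not_found",
			"detail": "no user with id " + id,
		})
		return
	}
	json.NewEncoder(w).Encode(map[string]string{"user": name})
}

func main() {
	http.HandleFunc("/users", getUser)
	log.Fatal(http.ListenAndServe(":8080", nil))
}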
There’s yet another interpretation too, that’s particularly valid in the wild! I may send you a 404 because the path is valid, the user is valid, but you are not authorized to know this.
Purists will screech in such cases about using 403, but that opens the door to enumeration attacks and whatnot, so the pragmatic thing to do is just 404.

Perhaps a “204 No Content”?

That doesn’t convey the message “yeah, you got the URL right, but the thing isn’t there”.

I think it basically does.

Well, it says “OK, no content”. Not “the thing is not here, no content”. To me these would be different messages.
The last time this came up, I said “status codes are for the clients you don’t control.” Using a 404 makes sense if you want to tell a spider to bug off, but if your API is behind auth anyway, it doesn’t really matter what you use.

https://lobste.rs/s/czlmyn/how_how_not_design_rest_apis#c_yltriz

You never control the clients to an HTTP API. That’s one of the major selling points! Someone can always curl.
I actually totally disagree with this approach. These kinds of helpers aren’t how you write normal code (I hope 😬), so why do you need them for tests? I think good test code is as simple as possible even at the cost of a little verbosity. The main thing is to make it so the test harness is never the problem.

+1 to this. Lots of scaffolding code here, which, like, feel free to do if it makes sense in your context — but as a general pattern? No thanks.

– Go isn’t an object-based language, AFAIU?

– Why not t.Fatalf here?
I agree. This complicates the test input IMO. Table driven tests work well for simple inputs, outputs, and setups, and for more complex situations data driven tests, which I’m very fond of, allow test authors and readers to focus on the input and output without parsing code: https://lobste.rs/s/n16aps/what_if_writing_tests_was_joyful#c_on6nkh
I have a very simple helper library that just reads/writes files, checks for equality, and spins up sub-tests based on a file glob. It’s been very helpful for me in cutting boilerplate with golden file testing. https://pkg.go.dev/github.com/carlmjohnson/be/testfile
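For anyone curious what golden-file testing boils down to, here is a hand-rolled sketch. This is not the API of the testfile package linked above; process, the testdata layout, and the -update flag are all made up:

package mypkg

import (
	"flag"
	"os"
	"path/filepath"
	"testing"
)

var update = flag.Bool("update", false, "rewrite golden files instead of comparing")

// process stands in for the real function under test (hypothetical).
func process(in []byte) []byte { return in }

func TestGolden(t *testing.T) {
	inputs, err := filepath.Glob("testdata/*.input")
	if err != nil {
		t.Fatal(err)
	}
	for _, in := range inputs {
		in := in
		t.Run(filepath.Base(in), func(t *testing.T) {
			src, err := os.ReadFile(in)
			if err != nil {
				t.Fatal(err)
			}
			got := process(src)
			golden := in + ".golden"
			if *update {
				if err := os.WriteFile(golden, got, 0o644); err != nil {
					t.Fatal(err)
				}
				return
			}
			want, err := os.ReadFile(golden)
			if err != nil {
				t.Fatal(err)
			}
			if string(got) != string(want) {
				t.Errorf("%s: got %q, want %q", in, got, want)
			}
		})
	}
}

Run go test to compare against the golden files, or go test -update to regenerate them.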
The inverse is also pretty useful: given a port, see the process(es) listening on it.

I can’t say I’ve ever seen a business service prototyped (and deployed!) as a shell script. I’d be really curious to hear some examples of this kind of thing!

I worked at a medium sized ad-tech company whose main event processing “server” was a big shell script running awk and rsync.

For bonus points it also ran out of a Director’s home directory on a single non-redundant server.

Woof. Okay.
This article contains example code, I get that, but please please please never require callers to manually lock/unlock a mutex in order to safely use a value. There are just too many ways for this approach to go wrong, and they’re all basically out of your control as an author, which subverts any kind of guarantees that you can provide to your consumers. Instead, encapsulate the value in a type, and only allow access through methods that do the synchronization work correctly.
// No
var mtx sync.Mutex
var val int
func caller() { mtx.Lock(); val = 123; mtx.Unlock() }
// Yes
type syncVal struct { mtx sync.Mutex; val int }
func (v *syncVal) set(i int) { v.mtx.Lock(); defer v.mtx.Unlock(); v.val = i }
func caller() { var v syncVal; v.set(123) }
This is far from a yes, it’s only applicable to “atomic” semantics where operations are independent and non-composed. And while you can sometimes provide “precomposed” operations it remains not generally applicable.
A more general pattern is to wrap the thing into a gatekeeping context manager, but then while it protects against misuses of the mutex itself it does not protect against mutex related misuses (e.g. leaking sub-resources outside of the managed context) and it’s limited to lexical operations.
This … [is] only applicable to “atomic” semantics where operations are independent and non-composed.
Yes, that’s more or less the assumption of synchronization in Go, at the level we’re talking about here. There is no expectation of “general applicability” in the sense you describe, at least not by default.
In any case, my “yes” isn’t meant to represent a fully generalized pattern, it’s just a response to the “no”.

Using a buffered chan as a semaphore is another pattern. We have used it to create a fixed pool of workers or to throttle processing using goroutines.

Yep, under-rated pattern IMO.

I personally like SizedWaitGroup for that (very similar, but .Wait is nice syntax).

Why Wait when you can stream? 😉

But I get your point, of course; all good.
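A minimal sketch of that buffered-channel-as-semaphore pattern (the limit of 4 and the “work” are made up):

package main

import (
	"fmt"
	"sync"
)

func main() {
	sem := make(chan struct{}, 4) // capacity = max tasks in flight
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		i := i
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while 4 tasks are already running
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			fmt.Println("processing item", i) // stand-in for real work
		}()
	}
	wg.Wait()
}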
after programming Go for a long time I write all my code in Rust now, and now that I have Result types and the ? operator, my code is consistently shorter and nicer to look at and harder to debug when things go wrong. I dunno. I get that people think that typing fewer characters matters. I just don’t think that. When I program Rust I miss fmt.Errorf and errors.Is quite a lot. Yeah, sure, you can use anyhow and thiserror but still, it’s not the same. You start out with some functionality in an app and so you use anyhow and anyhow::Context liberally and do some nice error message wrapping, but inspecting a value to see if it matches some condition is annoying. Now you keep going and your project matures and you put that code into a crate. Whoops, anyhow inside of a crate is rude, so you switch the error handling strategy up to thiserror to make it more easily consumed by other people. Now you’ve got to redo a bunch of error propagating things maybe you’ve got some boilerplate enum stuff to write, whatever; you can’t just move the functionality into a library and call it a day.
Take a random error in Rust and ask “does any error in this sequence match some condition” and it’s a lot more work than doing the equivalent in Go. There are things that make me happy to be programming Rust now, but the degree to which people dislike Go’s error system has always seemed very odd to me, I’ve always really liked it. Invariably, people who complain about Go’s error system consistently sprinkle this everywhere:
if err != nil {
	return nil, err
}

Which is … bad. This form:

if err != nil {
	return nil, fmt.Errorf("some extra info: %w", err)
}
goes a long way, and errors.Is and errors.As compose with it so nicely that I don’t think I’ve seen an error system that I like as much as Go’s. Sure, I’ve seen lots of error systems that have fewer characters and take less ceremony, I just don’t think that’s particularly important. And I’m sure someone is going to come along and tell me I’m doing it wrong or I’m stupid or whatever, and I just don’t care any more. I quite like the ergonomics of error handling in Go and miss them when I’m using other languages.
Isn’t wrapping the errors with contextual information at each function that returns them just equivalent to manually constructing a traceback?
The problem of errors in rust and Haskell lacking information comes from syntax that makes it easy to return them without adding that contextual information.
Seems to me all three chafe from not having (or preferring) exceptions.
Isn’t wrapping the errors with contextual information at each function that returns them just equivalent to manually constructing a traceback?
no, because a traceback is a map of how to navigate the code, and it requires having the code to be able to interpret it, whereas an error chain is constructed from the vantage point that the error chain alone should contain all of the information needed to understand the fault. An error chain can be understood by an operator who lacks access to the source code or lacks familiarity with the language the program is written in; you can’t say the same thing about a stack trace.
The problem of errors in rust and Haskell lacking information comes from syntax that makes it easy to return them without adding that contextual information.
not sure about Haskell but Rust doesn’t really have an error type, it has Result. Result is not an error type, it is an enum with an error variant, and the value in the error variant may be of any type. There’s the std::error::Error trait, but it’s not used directly; a Result<T, E> may be defined by some E that implements std::error::Error or not, but it’s inconsistent. It may or may not have some kind of ErrorKind situation like std::io does to help you classify different types of related errors, it might not.
Go’s error type is only comparable to Rust’s Result type if you are confused about one language or the other. Go’s error type is less like Result and more like Vec<Box<dyn std::error::Error>>, because the error type in Go is an Interface and it defines a set of unwrap semantics that lets you unwrap an error to the error that caused it. (it’s actually more like a node in a tree of errors these days so even that is a simplification).
Unlike a trait, a Go interface value is sized, it’s essentially a two-tuple with a pointer to the memory of the value and another pointer to a vtable. Every error in the Go standard library and third party ecosystem is always a thing with a fixed memory size and at least one layer of indirection. Go’s error system is more like if the entirety of Rust’s ecosystem used Result<T> to mean Result<T, Box<dyn std::error::Error>>. Since that’s not universal, even if you want to do it in your own projects, you run into friction when combining with projects that don’t follow that convention. Even that is an oversimplification, because downcasting a trait object in Rust is much more fiddly than type switching on an interface value in Go.
There are a lot of subtle differences between the two systems that turn out to cause enormous differences in the quality of error handling across the ecosystems. Getting a whole team of developers to have clean error handling practices in Go is straightforward; you could explain the practices to an intermediate programmer in a single sitting and they’d basically never forget them and do it correctly forever. It’s significantly more challenging to get that level of consistency and quality with error propagation, reporting, and handling in Rust.
a traceback is a map of how to navigate the code, and it requires having the code to be able to interpret it, whereas an error chain is constructed from the vantage point that the error chain alone should contain all of the information needed to understand the fault
Extremely interesting. You’re both right: it is manually constructing (a projection of) a continuation, but the nifty point scraps makes is that the error chain is a domain-specific representation of the continuation while the traceback is implementation-specific. Super cool. This connects to a line of thinking rooted in [1] that I’ve been mulling over for ages and would love to follow up on.
Here’s a “fantasy abstract” I wrote years ago on this topic:
Felleisen et al[1] showed that an abstract algebraic treatment of evaluation contexts was possible, interesting and useful. However, algebras capturing implementation-level control structure are unsuitable for reflective use, precisely because they capture an implementation- rather than a specification-level description of the evaluation context. We introduce a language for arbitrary specification-level algebras describing high-level perspectives on program tasks, and a mechanism for ensuring that implementation-level control context refines specification-level control context. We then allow programs to not only capture and reinstate but to analyse and synthesise these structures at runtime. Tuning of the specification algebra allows precise control over the level of detail exposed in reflective access, and has practical benefits for orthogonal persistence, for code upgrade, for network mobility, and for program execution visualisation and debugging, as well as offering the promise of new approaches to program verification.
[1] Felleisen, Matthias, et al. “Abstract Continuations: A Mathematical Semantics for Handling Full Functional Jumps.” ACM Conf. on LISP and Functional Programming, 1988, pp. 52–62.
creating a stack trace isn’t free. It’s fair to assume creating a stack trace will require a heap allocation. Performing a series of possibly unbounded heap allocations is a meaningful tax; whether that cost is affordable or not is highly context-dependent. This is one of the reasons why there is a whole category of people who use C++ that consciously avoid using C++ exceptions. One of the benefits of Rust’s system and the Result type is that the Result values themselves may be stack allocated. E.g., if you were inclined to do so, you could define a function that returns Result<(), u8>, and that’s just a single stack-allocated value, not an unbounded series of heap allocations. In Go, you could write a function that returns an error value, which is an interface, and that error value could be some type whose memory representation is a uint8. That would be either a single heap allocation or it would be stack-allocated; I’m not sure offhand, but regardless, it’s not an unbounded series of heap allocations. The performance implications are pretty far-reaching. Those costs are likely irrelevant in the case of something like an i/o bounded http server, but in something like a game engine or when doing embedded programming, they matter.
The value proposition of stack traces is also different when you have many concurrently running stacks, as is typical in a Go program. Do you want just the current goroutine’s stack, or do you want the stacks of all running goroutines? goroutines are also not hierarchical in the scheduler, so “I want the stack of this goroutine and all of its parents” is not a viable option, because the scheduler doesn’t track which goroutines spawned which other goroutine (plus you could have goroutine A spawn goroutine B that spawns goroutine C and goroutine B is long gone and then C hits an error, which stacks do you want a trace of? B is gone, how are you gonna stack trace a stack that no longer exists?). If you want all the stacks, now you’re doing … what, pausing all goroutines every time you create an error value? That would be a very large tax.
Overall, the original post is about the ergonomics of error handling; I think the ergonomics of Go’s system are indeed verbose, but my experience has been that the design of Go’s error handling system makes it reasonably straightforward to produce systems that are relatively easy to debug when things break, so generally speaking I think the tradeoff is worthwhile. Sure, you can just yeet errors up the chain, but Go doesn’t make that significantly more ergonomic than wrapping the error. I think where the original post misses the point by a wide margin is that it just yeets errors up the chain, which makes them hard to debug. Making the code shorter but the errors less-useful for debugging is not a worthwhile tradeoff, and is a misunderstanding of the design of Go’s error system.
I think this is the point you may have missed further up asking about the analogy to exceptions and tracebacks. The usual argument I see in favor of something like Go’s error “handling” is that it avoids the allocations associated with exceptions and tracebacks. But then the standard advice for writing Go is that basically every level of the stack ought to be using error-wrapping tools that likely allocate anyway. So it’s a sort of false economy – in the end, anyone who’s programming Go the “right”/recommended way is paying a resource and performance price comparable to exceptions and tracebacks.
Creating a stack trace means calling both runtime.Callers (to get the initial slice of program counter uintptrs) and runtime.CallersFrames (to transform those uintptrs to meaningful frame data). Those calls are not cheap, and (I’m not 100% confident but I’m pretty sure that) at least CallersFrames will allocate.
To create a stacktrace up from your current location, sure. But the unwind code you’re emitting knows what its actual location is, so as the error is handed off through each function, each site can just push its pc or even just a location string onto the backtrace list in the error value. And you can gather the other half of the stacktrace when you actually go to print it (if you even want it), which means until then it’s basically free: one compare, one load, two store or thereabouts. And hence, errors that aren’t logged don’t cost you anything.
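As a rough illustration of that append-a-location-as-it-propagates idea, here is a Go sketch using runtime.Caller; it is not how any particular language or library actually implements this:

package main

import (
	"errors"
	"fmt"
	"runtime"
)

// traced wraps an error together with the location where the wrap happened.
type traced struct {
	err error
	at  string
}

func (t *traced) Error() string { return t.err.Error() + " (at " + t.at + ")" }
func (t *traced) Unwrap() error { return t.err }

// trace is called at each propagation site; it costs one runtime.Caller
// lookup and a small allocation, and only when an error actually occurs.
func trace(err error) error {
	if err == nil {
		return nil
	}
	_, file, line, _ := runtime.Caller(1)
	return &traced{err: err, at: fmt.Sprintf("%s:%d", file, line)}
}

func inner() error { return errors.New("boom") }
func outer() error { return trace(inner()) }

func main() {
	fmt.Println(outer()) // boom (at .../main.go:NN)
}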
The thing that bit me recently was that Go uses the pattern of returning an error and a value so much that it is now encoded in tooling. One of the linters throws an error if you check that an error is not nil, but then return a valid result and no error. Unfortunately, there is nothing in the type system to say ‘I have handled this error’ and so this raised a false positive that can be addressed only with an annotation where you have code of the form ‘try this, if it raises an error then do this instead’. I don’t know how you’re supposed to do that in idiomatic Go.
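For context, the shape being described is roughly this (a made-up example): the error is genuinely handled by falling back to a default, but the err != nil check followed by a nil-error return is exactly the pattern that kind of lint flags.

package main

import (
	"fmt"
	"os"
)

// loadConfig handles a missing file by falling back to a default, so the
// nil error on the fallback path is intentional.
func loadConfig(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return []byte("default: true"), nil // handled: use the built-in default
	}
	return b, nil
}

func main() {
	b, _ := loadConfig("does-not-exist.yaml")
	fmt.Println(string(b))
}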
basically every person I work with thinks it’s an issue, because we write production Rust code in a team setting and we’re on call for the stuff we write. A lot of the dialogue about Rust here and elsewhere on the web is driven by people that aren’t using Rust in a production/team setting, they’re using it primarily for side projects or solo projects or research or are looking at it from a PL theory standpoint; these problems don’t come up that much in those settings. And that’s not to dismiss those settings; I personally do my side projects in Rust instead of Go right now because I find writing Rust really fun and satisfying, and this problem doesn’t bother me much on my personal projects. Where it really comes up is when you’re on call and you get paged and you’re trying to understand the nature of an error in a system that you didn’t write, while it is on fire.
Yeah okay, but in the meantime people are writing dozens of blog posts about how async Rust is broken (which I really don’t think it is) and this issue barely gets any attention.
How exactly is thiserror and pattern matching on an error variant not strictly superior, if not, at worst, the exact same as fmt.Errorf and errors.Is? The whole point of thiserror is allowing you to write format strings for the error variants you return or are wrapping.
errors.Is is not simply an equality check, it recursively unwraps the error, finding if any error in the chain (or tree, if an unwrap returns a slice) matches, so you can always safely add a layer of wrapping without affecting any errors.Is checks on that error. https://pkg.go.dev/errors#Is
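A small example of what that buys you (the sentinel and wrapping messages here are arbitrary): the errors.Is check keeps working no matter how many layers of fmt.Errorf wrapping get added in between.

package main

import (
	"errors"
	"fmt"
)

var ErrNotFound = errors.New("not found")

func find() error    { return fmt.Errorf("looking up user 42: %w", ErrNotFound) }
func handler() error { return fmt.Errorf("GET /users/42: %w", find()) }

func main() {
	err := handler()
	fmt.Println(err)                          // GET /users/42: looking up user 42: not found
	fmt.Println(errors.Is(err, ErrNotFound)) // true, through both layers of wrapping
}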
I wonder whether Rust’s not having this is due to Rust’s choice to have a more minimal standard library or simply to no one’s having proposed it yet. I think something like this would be very simple to implement in the Rust standard library (and only slightly more verbose to implement outside it).
The tradeoff here is that it breaks the abstraction of a function. For example, changing the underlying HTTP library may change the error type. Your options are to break the error chain (and the implementation-specific diagnostics) or expose that to callers.
Rust cares a lot about being explicit about API breakage, so it doesn’t surprise me that this isn’t a part of the standard library.
It definitely is convenient for within your own code but across stable API boundaries it is a much worse tradeoff.
What does this function even do if there’s an error? I have to read so much code to find out.
The function is 14 lines long. The longest line is 29 characters (plus a tab indent). Each line contains a single conditional or expression, built from a very small set of keywords, tokens, and identifiers. It is as easy to read as code gets.
Conway’s Game of Life can be expressed in a single line of APL.
+100. There’s succinct, and then there’s cramming things onto fewer lines, and they aren’t the same. Succinct doesn’t just mean short, it also means clearly explained.
I joined a big international company about 10+ years ago and when we had our first informal new-hires gathering, this mastodon of a principal developer remarked: you were hired to read the code, not to write it. As funny as it might sound, in retrospect it actually made a lot of sense (and not only because they had very little documentation, haha). Now every time I come across a new innovative soft- or hardware approach to put code into a computer more efficiently™ I remember that guy and chuckle.
The primary metric for “good code” is how quickly and easily it can be read and understood by a new engineer with basic domain knowledge. The real costs are borne by the reader, and not the writer — and so you always want to optimize for the reader, and not the writer.
I’ve been using the sample implementation repo in my project and it works fine. The big thing I miss from the standard library now is some form of middleware chaining.
For sure! Middlewares are decorators, and it’s intuitive (to me, at least) that the outer-most decorator is executed first. But I’ve no doubt this isn’t universal among programmers.
The issue seems sort of dead. I’ve written a library with RoundTripperFunc myself, so that could be worth pulling out into its own proposal. I don’t like the proposed Middleware API in that issue. It’s overly complicated. The Then method is fluff.
I agree that it’s overly complicated. I just have a hope that someone comes up with a better API and something that can be made in a stdlib compatible way.
With one half (or more) of why I’d use a non-stdlib router gone, it would be so nice if middleware support could find its way into net/http.
type Middleware func(next http.Handler) http.Handler

// ComposeMiddleware applies the middlewares so that the first one in mws
// ends up outermost, i.e. it runs first on each request.
func ComposeMiddleware(mws ...Middleware) Middleware {
	return func(next http.Handler) http.Handler {
		h := next
		for i := range mws {
			mw := mws[len(mws)-1-i]
			h = mw(h)
		}
		return h
	}
}
I mean, middleware is already supported in the stdlib, just by virtue of the fact that http.Handler is an interface, which any caller can decorate as they please.
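For example, a logging middleware is just a function that takes an http.Handler and returns another one (a sketch with made-up names):

package main

import (
	"log"
	"net/http"
	"time"
)

// withLogging decorates any http.Handler; no extra framework support needed.
func withLogging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("%s %s took %s", r.Method, r.URL.Path, time.Since(start))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })
	log.Fatal(http.ListenAndServe(":8080", withLogging(mux)))
}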
I don’t know much about Kafka or its typical workloads, but looking at object storage costs has shown me that they tend to be about 10x the cost of block storage. As long as you’re putting this big data-abstraction layer in your pipeline anyway, why would you want to build it atop object storage instead of something else?
WarpStream’s read path seems like it should be straightforward. The WarpStream Agents are stateless, and all the data is in S3. Naively, it seems like each Agent should be able to just issue a GET request directly to object storage every time it receives a request to Fetch data for a specific partition. Unfortunately that would once again be extremely cost prohibitive in terms of object storage API requests.
Like… this just seems to be a hugely artificial problem that this then solves by dint of throwing away the advantages that object storage has. Then they appear to… build a filesystem atop object storage by spreading their actual data across multiple objects, if I’m understanding properly??? I very well might not be, but that’s kinda what it looks like.
Cloud providers don’t let you share block storage across availability zones, but object storage is shared by all availability zones in a region. As they say in the post, object storage also acts as their networking layer. It’s a neat way of building a multi-datacenter shared memory abstraction. Durable replication is a hard problem, and I like the way they’ve sidestepped dealing with it here. They could end up with a product more reliable than Kafka with significantly less effort and lines of code.
Also many use cases for message brokers don’t involve permanent long term storage of messages. So storage costs may be less of a factor.
It’s built on top of object storage to reduce costs and operations compared to traditional Apache Kafka. Building it on top of S3 makes it ~10x cheaper and eliminates virtually all the hard parts of operating Kafka.
Building a streaming abstraction on top of object storage turns out to be really difficult, but we did do it, and there are a lot of benefits to doing so. You’re right though that effectively what we created was a virtual filesystem.
You’re caching S3 data in your application memory. Is application memory cheaper than reading from S3? If so, is S3 the right tool for the job, so to speak?
The cache is only necessary for the “live edge” of the last few seconds of data, and therefore is not significant to the overall cost of the system.
For example, a workload writing 1GB/s of data would need roughly 15GB of cache per AZ. That is not significant compared to the other costs such as S3 API operations or CPU cores.
In terms of whether S3 is the right tool for the job… I think so. WarpStream is roughly 5-10x cheaper per GiB transferred than Apache Kafka, we wrote 0 code to deal with replication, and users get to operate a completely stateless system for the data plane. I think that’s pretty good evidence it was a good choice.
I’m curious about the assumptions this system makes re: S3 access costs. For example, do you treat S3 as more or less equivalent to local disk in terms of stuff like read latency?
I think this idea is fascinating. I think our data engineer is about to deploy MSK so I’ll share this with him.
Regarding this quote from the “Kafka is dead…” article:
we never acknowledge data until it has been durably persisted in S3 and committed to our cloud control plane
How are you checking that data has been durably persisted? Does it mean “The PutObject call returned a success code”, because the core assumption of WarpStream is that S3 just always works? Or is there some extra verification?
Yeah, we just assume that if PutObject() succeeded then the data is durable. Which, to be fair, is not a perfect assumption, but AWS (and other cloud providers) have a pretty good track record here, and cloud object storage is arguably more reliable and battle tested than any other storage technology on the market.
I checked up on Vultr, Digital Ocean, and a few other places. Apparently I’m just wrong. Which feels weird ’cause I did a bunch of research on this a few months ago and recall object storage being always ridiculously more expensive, even not counting for IOPS/egress.
Makesure seems exactly what I was looking for, and the example they give is what I needed it for: check if there are new files, download, check signature, process, upload processed files. Now, I need to do that for a bunch of files, and would have been great to have an option to do it in parallel from makesure. How could one do it in the simplest way? Wrap it under xargs and loop until all instances succeed?
If you have a bunch of files that can be fetched in parallel, I guess that would mean that they represent a single “node” in the dep graph, and so probably should be modeled as a single makesure goal. So probably you’d define a single goal that would get all of those files in parallel via whatever shell pattern you prefer.
Where is the boundary between process and system? Is it the host? Well, for systemd services, maybe, but what about Docker containers? In that case, is it the host, or the container? And what is a container, actually? Is it something more than just the cgroup constraints applied to a (set of) processes? And what about for Kubernetes pods? In that case, is it the container or the pod?
There are generally three categories of telemetry data, yes, but (as you note) all telemetry data is fundamentally the same — so the categorization is based on access patterns, rather than kind of property of the data itself. Those categories are metrics (counter and gauges, yes, but also histograms), traces (hierarchical spans), and logs (plain ol’ key/val events).
I agree that blameless culture assumes good faith effort by all individuals, but I think it goes further than simply absolving any mistakes, and removes individuals from root cause analysis altogether. The only thing that’s relevant is the code/tools involved in the incident. I think this is an important, if subtle, distinction – software has to accommodate human limitations, after all.
I definitely agree, though I’d add “processes” to “code/tools”.
It’s always striked me as very ingrained in American culture, and very counterproductive, to assign blame based on outcome. If you fire someone who made a production system crash by making a typo, you’re essentially saying “making typo’s can get you fired at our company”.
If you crashed a production system, you need to communicate this to everyone involved and everyone who can potentially help you fix it. Blaming is exactly the thing that makes you prefer “it was down for an hour but no-one really noticed” over “after crashing the thing I made some noise, now the whole company is aware but we got it up and running again in half an hour”.
Assigning blame is very common across cultures because it is the simplest corrective action: X screwed up, get rid of X and everyone else will be more careful and we’ll just find a replacement for X in a few hours.
This works well when it is cheap to replace personnel. The blameless philosophy is used in organizations across the world, throughout time, where it would be expensive to replace personnel, and it makes more economic sense to adopt safety practices (training tools) that reduce the chance for mistakes instead.
Blameless culture says that if a production system crashed, it is never because you or I or anyone screwed up, it is always because the relevant tools/code failed in some way. It’s a subtle but important distinction.
I work with a high-self-accountability team. With the way we work, the article points out one of the most important lessons we’ve learned: “do not stop the root cause analysis when someone admits blame”
We had an incident where a small UAS had a very bad takeoff that ended up being a near miss. People would have been injured if they’d been standing a couple feet to the left of where they happened to be standing. The pilot immediately took responsibility and told us how he’d screwed up. I was tasked with putting together a slide deck outlining the event.
Instead of just letting him take the blame, though, I got ahold of the onboard telemetry logs and discovered something that shook me pretty hard: while the pilot had provided a very reasonable explanation of how he believed he’d screwed up, the telemetry told a very different story. Yes, he had made a mistake, but not the mistake he thought he had. And further, it was significantly exacerbated by two (unknown at the time) behavioural defects in the autopilot. His explanation made sense given how we all believed the system worked, but the system itself didn’t work that way!
The opening slide was something like “Jim has taken responsibility for this incident, and he is incorrect…”
The outcome was that we all understood the system better, the training deficiencies that led to us all misunderstanding what to do in that scenario, updated checklists and procedures, and two tickets filed with the vendor. If we had just stopped when he took responsibility I am certain the same chain of events would have happened again, potentially with a worse outcome than a near miss.
But that’s still assigning blame! It should be more factual like “X lost control of of the UAS”. Then, the outcome can be “X has had insufficient training to fly in terrain near people” or even “X loses control of an UAS significatly more often than other pilots”. But these outcomes are not likely in my experience, it is more often something like “in circumstance Z, during move Y, pilots will sometimes lose control” or even “we should never do test flights near people, as flight is inherently dangerous”.
You should have not one, but multiple safeguards in place and “someone screwed up” should never be a final conclusion of a postmortem. It essentially means “We have created a system which can do something terrible if someone makes a mistake. We saw this happening and we’re fine with it. We will not introduce any safeguards.”
I think we’re saying the same thing. I guess a different way to put it: “Someone self-assigning blame is not a sufficient root cause. Do not stop there. Find the systemic issues that resulted in the screw-up.”
You might get to the point where you can leave it there. Even people with good judgement sometimes make mistakes that weren’t really preventable. “A bird shit on my head and it got in my eye, I dropped the remote, the lanyard caught it but the throttle snagged on my shirt.” is probably a sufficiently rare event that it’s not worth mitigating or attributing to a systemic issue, but you gotta tug on the thread until you get to that point.
Even more, nobody should ever self-assign blame in a postmortem, and it’s important that if anyone tries to, it’s quickly shut down by the IC or postmortem director. Blameless culture is at least as much about psychological safety as it is about technical efficacy.
My team (I was lead at the time) had been blamed for an outage/issue which caused downtime for our clients. The blame actually happened in a private channel in slack between our data team’s project manager and our upper management. I was informed privately that this occurred. I was so pissed/irritated, not just that my team was blamed but that it was done in private without us being able to defend ourselves. I had to literally take a walk to calm down. Later it was found to be the data team’s “feature”. Eventually that data team’s project manager was let go (at least partially because over this issue). There was absolutely no need for any of that to have transpired had we just debugged the issue. I have since never blamed any person nor team for any issue, even if I knew exactly whom had pressed the button.
Comments on the previous ID generator thread have convinced me that it’s better not to store the exact millisecond an ID was generated, but instead fuzz the nearest second and add a counter. That’s what I’m doing now in SLID: https://code.lag.net/robey/nightcoral/src/branch/main/src/slid.ts
Your base32 doesn’t include
2
orv
? (It also doesn’t includel
, but I get that one.) What makes this alphabet “proper” where e.g. Crockford’s base32 alphabet is not?That was a long time ago but I think it was pulled from a spec. Looks like a lowercase version from around 2002: https://en.wikipedia.org/wiki/Base32#z-base-32
Thanks for the link! Might be worth a comment, too :)
I’ve used SQL in projects but in most cases only as a key-value store (only occasionally using simple queries like
select * from tbl where prop>3
). But the explanations I’d seen before of joins always seemed a little off to me.The explanation in this article makes sense, though!
I have more experience with array languages and frameworks like pandas, and I gotta say the nested loop thing is really stupid. In my mind, databases have this mystique about them, like only really clever folks can work on them or use them properly. Is my perception based on a fundamental lie? Are databases only “hard” because they’re stupidly organized by default?
Understanding that application code should never
SELECT *
and should alwaysSELECT specific, column, names
is a canon event for many/most software engineers 😇First you learn to use it, then you learn that Real Programmers never use it, and then (much later) you realize there are cases where using anything else is a mistake. It’s the sql version of
goto
.I’m not sure
SELECT *
is comparable togoto
in any meaningful sense, except that neither of them should ever be present in application code.I posted a comment above. Databases are based on mathematical concepts rather than being based on delivering a specific productivity tool. This math is very convenient if you understand it and can make for a very productive tool, but many intuitions are only valid if you understand the deeper “what is going on here in theory” that many other tools don’t require. The mathematical model of relational databases predates actual databases.
So, it’s not you, it’s the tool.
Its not terribly difficult to learn, but it’s often not intuitive either, and many places suck at explaining.
I made this many years ago as a layman explanation of table relationships and joins https://m.youtube.com/watch?v=EcrO7hz-nfM
Since you bring up history I’d like to challenge what you wrote:
E F Codd was working at IBM and wanted to come up with a better and more practical type of database system than the ones that were current. That’s why he developed the mathematical theory behind the relational model. True, that theory is connected to prior mathematics and formal logic, but that’s natural since all mathematics is connected somehow. The point is that the aim was to create practical database systems that would be convenient and flexible and stand the test of time because they were based on sound mathematical principles.
Not sure if that contradicts what you wrote or not, but I think there’s an important difference in flavor.
Totally agree that understanding the theory (which isn’t all that complicated) is key to unlocking the power of relational databases (and avoiding jumping on bandwagons that get going because people had the chance or taken the time to understand the relational model).
What I meant by that is that it predates RDBMS like MySQL or PostgreSQL. Not that it predates the concept of a database.
What I’m saying is the model came first. It was designed on paper before it was implemented. As opposed to being an iterative evolution of a product.
Note that the relational model is a theoretical model. Versus the RBDMS is an implementation of a database that leans on that theoretical relational model.
In this case a database doesn’t need to be relational. In a broad and general sense I understand a database to be some abstraction that persists and manages data for storage and retrievable. One major problem databases help to solve is decoupling storage implementation from language memory models. Otherwise imagine reordering your C struct fields and then no longer being able to pull in information from your data store.
Rule lists like this wouldn’t be any good if none of them felt at least a little bit controversial.
Rule #5 I disagree with. I hate APIs that return clear map structures with “key”s inside, but I cannot lookup by the key because it’s a list and not a map. Just use a map when the data fits into that pattern.
Rule #8. 404 is fine for an API because all the invalid endpoints can just return 5xx. If the client needs further information, you can return an error code that never changes along with the response body.
I think another way to say Rule 5 might be: return objects with well-defined schemas. Or: given a specific key, its corresponding value should always be the same type. So if you return
{"user":"John"}
in one situation, it’s not cool to return{"user":123}
in a different situation.I’ve heard that list returns also require breaking changes to make any improvements. It’s a “Probably Gonna Need It” to use a map.
Was there also something that broke if you returned a list instead of a map, or was that just a vague memory?
I like a lot of this advice, parts like “always return objects” and “strings for all identifiers” ring with experience. I’m puzzled that the only justification for plural names is convention when it’s otherwise not at all shy of overriding other conventions like 404s and .json URLs. It’s especially odd because my (unresearched) understanding is that all three have a common source in Rails API defaults.
The difficulty with 404 is that it expresses that an HTTP-level resource is not found, and that concept often doesn’t map precisely to application-level resources.
As a concrete example,
GET /userz/123
should (correctly!) 404 because the route doesn’t exist, I typoed users. But if I do aGET /users/999
where 999 isn’t a valid user, and your API returns 404 there as well, how do I know that this means there’s no user 999, instead of that I maybe requested a bogus path?From solely the status code, you don’t.
Fortunately, though, HTTP has a thing called the response body, which is allowed to supply additional context and information.
Of course, but I shouldn’t need to parse a response body to get this basic level of semantic information, right?
Yeah, you should, because if we require that there be enough fine-grained status codes to resolve all this stuff we’re going to need way more status codes and they’re going to stop being useful. For example, suppose I hit the URL
/users/123/posts/456
and get a “Route exists but referenced resource does not” response; how do I tell from that status code whether it was the user or the post that was missing? I guess now we need status codes forAnd on and on we go.
Or we can use a single “not found” status code and put extra context in the response body. There’s even work going on to standardize this.
Remember: REST is not about encoding absolutely every single piece of information into the verb and status code, it’s about leveraging the capabilities of HTTP as a protocol and hypertext as a medium.
There’s yet another interpretation too, that’s particularly valid in the wild! I may send you a 404 because the path is valid, the user is valid, but you are not authorized to know this.
Purists will screech in such cases about using 403, but that opens to enumeration attacks and whatnot so the pragmatic thing to do is just 404.
Perhaps a “204 No Content”?
That doesn’t convey the message “yeah, you got the URL right, but the thing isn’t there”
I think it basically does.
Well, it says “OK, no content”. Not, “the thing is not here, no content”. To me these would be different messages.
The last time this came up, I said “status codes are for the clients you don’t control.” Using a 404 makes sense if you want to tell a spider to bug off, but if your API is behind auth anyway, it doesn’t really matter what you use.
https://lobste.rs/s/czlmyn/how_how_not_design_rest_apis#c_yltriz
You never control the clients to an HTTP API. That’s one of the major selling points! Someone can always
curl
.I actually totally disagree with this approach. These kinds of helpers aren’t how you write normal code (I hope 😬), so why do you need them for tests? I think good test code is as simple as possible even at the cost of a little verbosity. The main thing is to make it so the test harness is never the problem.
+1 to this. Lots of scaffolding code here, which, like, feel free to do if it makes sense in your context — but as a general pattern? No thanks.
–
Go isn’t an object-based language, AFAIU?
–
Why not
t.Fatalf
here?I agree. This complicates the test input IMO. Table driven tests work well for simple inputs, outputs, and setups, and for more complex situations data driven tests, which I’m very fond of, allow test authors and readers to focus on the input and output without parsing code: https://lobste.rs/s/n16aps/what_if_writing_tests_was_joyful#c_on6nkh
I have a very simple helper library that just reads/writes files, checks for equality, and spins up sub-tests based on a file glob. It’s been very helpful for me in cutting boilerplate with golden file testing. https://pkg.go.dev/github.com/carlmjohnson/be/testfile
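(Not that library specifically, but a rough sketch of the general golden-file pattern it describes: glob for inputs, spin up a sub-test per file, and compare output against a stored .golden file. The Transform function and the testdata layout are hypothetical.)
```go
package mypkg

import (
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// Transform is the hypothetical function under test.
func Transform(s string) string { return strings.ToUpper(s) }

func TestGolden(t *testing.T) {
	inputs, err := filepath.Glob("testdata/*.input")
	if err != nil {
		t.Fatal(err)
	}
	for _, in := range inputs {
		in := in
		t.Run(filepath.Base(in), func(t *testing.T) {
			src, err := os.ReadFile(in)
			if err != nil {
				t.Fatal(err)
			}
			got := Transform(string(src))

			// The expected output lives next to the input as a .golden file.
			golden := strings.TrimSuffix(in, ".input") + ".golden"
			want, err := os.ReadFile(golden)
			if err != nil {
				t.Fatal(err)
			}
			if got != string(want) {
				t.Errorf("mismatch for %s:\ngot:  %q\nwant: %q", in, got, want)
			}
		})
	}
}
```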
The inverse is also pretty useful: given a port, see the process(es) listening on it.
I can’t say I’ve ever seen a business service prototyped (and deployed!) as a shell script. I’d be really curious to hear some examples of this kind of thing!
I worked at a medium-sized ad-tech company whose main event-processing “server” was a big shell script running awk and rsync.
For bonus points it also ran out of a Director’s home directory on a single non-redundant server.
Woof. Okay.
This article contains example code, I get that, but please please please never require callers to manually lock/unlock a mutex in order to safely use a value. There are just too many ways for this approach to go wrong, and they’re all basically out of your control as an author, which subverts any kind of guarantees that you can provide to your consumers. Instead, encapsulate the value in a type, and only allow access through methods that do the synchronization work correctly.
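(A minimal sketch of that kind of encapsulation, with hypothetical names: the mutex and the value are unexported, so callers can only reach the value through methods that do the locking correctly.)
```go
package main

import (
	"fmt"
	"sync"
)

// Counter encapsulates its state; callers never touch the mutex directly.
type Counter struct {
	mu sync.Mutex
	n  int
}

func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()
	fmt.Println(c.Value()) // 100
}
```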
This is far from a yes; it’s only applicable to “atomic” semantics where operations are independent and non-composed. And while you can sometimes provide “precomposed” operations, it remains not generally applicable.
A more general pattern is to wrap the thing into a gatekeeping context manager, but while that protects against misuses of the mutex itself, it does not protect against mutex-related misuses (e.g. leaking sub-resources outside of the managed context), and it’s limited to lexical operations.
Yes, that’s more or less the assumption of synchronization in Go, at the level we’re talking about here. There is no expectation of “general applicability” in the sense you describe, at least not by default.
In any case, my “yes” isn’t meant to represent a fully generalized pattern, it’s just a response to the “no”.
Using buffered chan as a semaphore is another pattern. We have used it to create a fixed pool of workers or throttling processing using goroutines.
Yep, under-rated pattern IMO.
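(A hedged sketch of the buffered-channel-as-semaphore pattern described above; the capacity and the “work” are made up. The channel’s capacity bounds how many goroutines are in flight at any moment.)
```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const maxConcurrent = 3
	sem := make(chan struct{}, maxConcurrent) // buffered chan used as a counting semaphore
	var wg sync.WaitGroup

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks when maxConcurrent are already running
			defer func() { <-sem }() // release the slot
			fmt.Println("processing item", i)
		}(i)
	}
	wg.Wait()
}
```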
I personally like SizedWaitGroup for that (very similar, but .Wait is nice syntax).
Why Wait when you can stream? 😉
But I get your point, of course; all good.
after programming Go for a long time I write all my code in Rust now, and now that I have Result types and the ? operator, my code is consistently shorter and nicer to look at and harder to debug when things go wrong. I dunno. I get that people think that typing fewer characters matters. I just don’t think that. When I program Rust I miss fmt.Errorf and errors.Is quite a lot. Yeah, sure, you can use anyhow and thiserror but still, it’s not the same. You start out with some functionality in an app and so you use anyhow and anyhow::Context liberally and do some nice error message wrapping, but inspecting a value to see if it matches some condition is annoying. Now you keep going and your project matures and you put that code into a crate. Whoops, anyhow inside of a crate is rude, so you switch the error handling strategy up to thiserror to make it more easily consumed by other people. Now you’ve got to redo a bunch of error propagating things, maybe you’ve got some boilerplate enum stuff to write, whatever; you can’t just move the functionality into a library and call it a day.
Take a random error in Rust and ask “does any error in this sequence match some condition” and it’s a lot more work than doing the equivalent in Go. There are things that make me happy to be programming Rust now, but the degree to which people dislike Go’s error system has always seemed very odd to me, I’ve always really liked it. Invariably, people who complain about Go’s error system consistently sprinkle this everywhere:
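(The exact snippet isn’t preserved in this extract; presumably it’s the bare pass-through that adds no context. A minimal sketch under that assumption, with a hypothetical loadConfig:)
```go
package main

import (
	"fmt"
	"os"
)

// loadConfig returns the error unchanged, so the caller learns nothing
// about what loadConfig was trying to do when it failed.
func loadConfig(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return b, nil
}

func main() {
	_, err := loadConfig("/no/such/file")
	fmt.Println(err) // open /no/such/file: no such file or directory
}
```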
Which is … bad. The alternative form, the one that wraps each returned error with context (sketched just below), goes a long way, and errors.Is and errors.As compose with it so nicely that I don’t think I’ve seen an error system that I like as much as Go’s. Sure, I’ve seen lots of error systems that have fewer characters and take less ceremony, I just don’t think that’s particularly important. And I’m sure someone is going to come along and tell me I’m doing it wrong or I’m stupid or whatever, and I just don’t care any more. I quite like the ergonomics of error handling in Go and miss them when I’m using other languages.
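(A hedged sketch of that wrapping form and of how errors.Is composes with it; the function names and the sentinel error are hypothetical.)
```go
package main

import (
	"errors"
	"fmt"
)

var ErrNotFound = errors.New("not found") // hypothetical sentinel error

func findUser(id string) error {
	return fmt.Errorf("finding user %s: %w", id, ErrNotFound)
}

func handleRequest(id string) error {
	if err := findUser(id); err != nil {
		// Each layer adds its own context but keeps the chain intact via %w.
		return fmt.Errorf("handling request: %w", err)
	}
	return nil
}

func main() {
	err := handleRequest("123")
	fmt.Println(err)                         // handling request: finding user 123: not found
	fmt.Println(errors.Is(err, ErrNotFound)) // true, no matter how many layers of wrapping were added
}
```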
Isn’t wrapping the errors with contextual information at each function that returns them just equivalent to manually constructing a traceback? The problem of errors in Rust and Haskell lacking information comes from syntax that makes it easy to return them without adding that contextual information.
Seems to me all three chafe from not having (or preferring) exceptions.
no, because a traceback is a map of how to navigate the code, and it requires having the code to be able to interpret it, whereas an error chain is constructed from the vantage point that the error chain alone should contain all of the information needed to understand the fault. An error chain can be understood by an operator who lacks access to the source code or lacks familiarity with the language the program is written in; you can’t say the same thing about a stack trace.
not sure about Haskell but Rust doesn’t really have an error type, it has Result. Result is not an error type, it is an enum with an error variant, and the value in the error variant may be of any type. There’s the std::error::Error trait, but it’s not used directly; a Result<T, E> may be defined by some E that implements std::error::Error or not, but it’s inconsistent. It may or may not have some kind of ErrorKind situation like std::io does to help you classify different types of related errors, it might not.
Go’s error type is only comparable to Rust’s Result type if you are confused about one language or the other. Go’s error type is less like Result and more like Vec<Box<dyn std::error::Error>>, because the error type in Go is an interface and it defines a set of unwrap semantics that lets you unwrap an error to the error that caused it. (It’s actually more like a node in a tree of errors these days, so even that is a simplification.)
Unlike a trait, a Go interface value is sized; it’s essentially a two-tuple with a pointer to the memory of the value and another pointer to a vtable. Every error in the Go standard library and third-party ecosystem is always a thing with a fixed memory size and at least one layer of indirection. Go’s error system is more like if the entirety of Rust’s ecosystem used Result<T> to mean Result<T, Box<dyn std::error::Error>>. Since that’s not universal, even if you want to do it in your own projects, you run into friction when combining with projects that don’t follow that convention. Even that is an oversimplification, because downcasting a trait object in Rust is much more fiddly than type switching on an interface value in Go.
There are a lot of subtle differences between the two systems that turn out to cause enormous differences in the quality of error handling across the ecosystems. Getting a whole team of developers to have clean error handling practices in Go is straightforward; you could explain the practices to an intermediate programmer in a single sitting and they’d basically never forget them and do it correctly forever. It’s significantly more challenging to get that level of consistency and quality with error propagation, reporting, and handling in Rust.
Extremely interesting. You’re both right: it is manually constructing (a projection of) a continuation, but the nifty point scraps makes is that the error chain is a domain-specific representation of the continuation while the traceback is implementation-specific. Super cool. This connects to a line of thinking rooted in [1] that I’ve been mulling over for ages and would love to follow up on.
Here’s a “fantasy abstract” I wrote years ago on this topic:
[1] Felleisen, Matthias, et al. “Abstract Continuations: A Mathematical Semantics for Handling Full Functional Jumps.” ACM Conf. on LISP and Functional Programming, 1988, pp. 52–62.
Or just give error types a protocol that propagation operators can call to append a call-stack level.
That’s still only convenient syntax for manually constructing an error traceback, no?
Sure, but it gives feature parity with exceptions when using the simple mechanism.
creating a stack trace isn’t free. It’s fair to assume creating a stack trace will require a heap allocation. Performing a series of possibly unbounded heap allocations is a meaningful tax; whether that cost is affordable or not is highly context-dependent. This is one of the reasons why there is a whole category of people who use C++ that consciously avoid using C++ exceptions. One of the benefits of Rust’s system and the Result type is that the Result values themselves may be stack allocated. E.g., if you were inclined to do so, you could define a function that returns Result<(), u8>, and that’s just a single stack-allocated value, not an unbounded series of heap allocations. In Go, you could write a function that returns an error value, which is an interface, and that error value could be some type whose memory representation is a uint8. That would be either a single heap allocation or it would be stack-allocated; I’m not sure offhand, but regardless, it’s not an unbounded series of heap allocations. The performance implications are pretty far-reaching. Those costs are likely irrelevant in the case of something like an I/O-bound HTTP server, but in something like a game engine or when doing embedded programming, they matter.
The value proposition of stack traces is also different when you have many concurrently running stacks, as is typical in a Go program. Do you want just the current goroutine’s stack, or do you want the stacks of all running goroutines? Goroutines are also not hierarchical in the scheduler, so “I want the stack of this goroutine and all of its parents” is not a viable option, because the scheduler doesn’t track which goroutines spawned which other goroutines. (Plus, you could have goroutine A spawn goroutine B that spawns goroutine C, and goroutine B is long gone when C hits an error; which stacks do you want a trace of? B is gone, how are you gonna stack trace a stack that no longer exists?) If you want all the stacks, now you’re doing … what, pausing all goroutines every time you create an error value? That would be a very large tax.
Overall, the original post is about the ergonomics of error handling; I think the ergonomics of Go’s system are indeed verbose, but my experience has been that the design of Go’s error handling system makes it reasonably straightforward to produce systems that are relatively easy to debug when things break, so generally speaking I think the tradeoff is worthwhile. Sure, you can just yeet errors up the chain, but Go doesn’t make that significantly more ergonomic than wrapping the error. I think where the original post misses the point by a wide margin is that it just yeets errors up the chain, which makes them hard to debug. Making the code shorter but the errors less-useful for debugging is not a worthwhile tradeoff, and is a misunderstanding of the design of Go’s error system.
I think this is the point you may have missed further up asking about the analogy to exceptions and tracebacks. The usual argument I see in favor of something like Go’s error “handling” is that it avoids the allocations associated with exceptions and tracebacks. But then the standard advice for writing Go is that basically every level of the stack ought to be using error-wrapping tools that likely allocate anyway. So it’s a sort of false economy – in the end, anyone who’s programming Go the “right”/recommended way is paying a resource and performance price comparable to exceptions and tracebacks.
Not necessarily! This is pretty much the optimal case for per-thread freelists. Most errors are freed quickly and don’t overlap.
Creating a stack trace means calling both runtime.Callers (to get the initial slice of program counter uintptrs) and runtime.CallersFrames (to transform those uintptrs to meaningful frame data). Those calls are not cheap, and (I’m not 100% confident but I’m pretty sure that) at least CallersFrames will allocate.
It’s definitely way way cheaper to fmt.Errorf.
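(For reference, a minimal sketch of what capturing a trace with those two calls involves; the captureStack helper is hypothetical, and the allocations noted in the comments are the cost being discussed.)
```go
package main

import (
	"fmt"
	"runtime"
)

// captureStack does the two-step dance: Callers fills a slice of program
// counters, CallersFrames resolves them into function/file/line data.
func captureStack() []string {
	pc := make([]uintptr, 32)   // allocation for the program counters
	n := runtime.Callers(2, pc) // skip runtime.Callers itself and this function
	frames := runtime.CallersFrames(pc[:n])

	var out []string // further allocations as the formatted trace grows
	for {
		frame, more := frames.Next()
		out = append(out, fmt.Sprintf("%s (%s:%d)", frame.Function, frame.File, frame.Line))
		if !more {
			break
		}
	}
	return out
}

func main() {
	for _, line := range captureStack() {
		fmt.Println(line)
	}
}
```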
To create a stacktrace up from your current location, sure. But the unwind code you’re emitting knows what its actual location is, so as the error is handed off through each function, each site can just push its pc or even just a location string onto the backtrace list in the error value. And you can gather the other half of the stacktrace when you actually go to print it (if you even want it), which means until then it’s basically free: one compare, one load, two store or thereabouts. And hence, errors that aren’t logged don’t cost you anything.
I see your point
The thing that bit me recently was that Go uses the pattern of returning an error and a value so much that it is now encoded in tooling. One of the linters throws an error if you check that an error is not nil but then return a valid result and no error. Unfortunately, there is nothing in the type system to say ‘I have handled this error’, so this raised a false positive, which could be addressed only with an annotation, in code of the form ‘try this; if it raises an error, then do this instead’. I don’t know how you’re supposed to do that in idiomatic Go.
yeah but that sounds like a problem with the linter, not a problem with the language. That’s not reasonable linting logic at all.
That whole part of the Rust error handling experience is such a mess. But then again nobody seems to think it’s an issue, really?
basically every person I work with thinks it’s an issue, because we write production Rust code in a team setting and we’re on call for the stuff we write. A lot of the dialogue about Rust here and elsewhere on the web is driven by people that aren’t using Rust in a production/team setting, they’re using it primarily for side projects or solo projects or research or are looking at it from a PL theory standpoint; these problems don’t come up that much in those settings. And that’s not to dismiss those settings; I personally do my side projects in Rust instead of Go right now because I find writing Rust really fun and satisfying, and this problem doesn’t bother me much on my personal projects. Where it really comes up is when you’re on call and you get paged and you’re trying to understand the nature of an error in a system that you didn’t write, while it is on fire.
Yeah, okay, but meanwhile people are writing dozens of blog posts about how async Rust is broken (which really I don’t think it is) and this issue barely gets any attention.
How exactly are thiserror and pattern matching on an error variant not strictly superior to, or at worst the exact same as, fmt.Errorf and errors.Is? The whole point of thiserror is allowing you to write format strings for the error variants you return or are wrapping.
errors.Is is not simply an equality check; it recursively unwraps the error, finding whether any error in the chain (or tree, if an unwrap returns a slice) matches, so you can always safely add a layer of wrapping without affecting any errors.Is checks on that error. https://pkg.go.dev/errors#Is
I wonder whether Rust’s not having this is due to Rust’s choice to have a more minimal standard library or simply to no one’s having proposed it yet. I think something like this would be very simple to implement in the Rust standard library (and only slightly more verbose to implement outside it).
The tradeoff here is that it breaks the abstraction of a function. For example, changing the underlying HTTP library may change the error type. Your options are to break the error chain and the implementation-specific diagnostics, or to expose that to callers.
Rust cares a lot about being explicit about API breakage, so it doesn’t surprise me that this isn’t a part of the standard library.
It definitely is convenient within your own code, but across stable API boundaries it is a much worse tradeoff.
The function is 14 lines long. The longest line is 29 characters (plus a tab indent). Each line contains a single conditional or expression, built from a very small set of keywords, tokens, and identifiers. It is as easy to read as code gets.
Conway’s Game of Life can be expressed in a single line of APL.
Pithiness is, broadly, not correlated to coherence.
edit
This is cleaner? Huh. Different strokes, I guess.
I am pretty sure this is meant as a joke, to illustrate whacky things you can do with AST manipulation, and not a serious take about error checking.
The “new code” is definitely less verbose and less tedious compared to the “old code.”
I see it exactly oppositely, so I guess this just makes it a subjective judgment.
+100. There’s succinct, and then there’s cramming things onto fewer lines, and they aren’t the same. Succinct doesn’t just mean short, it also means clearly explained.
I joined a big international company about 10+ years ago, and when we had our first informal new-hires gathering, this mastodon of a principal developer remarked: you were hired to read the code, not to write it. As funny as it might sound, in retrospect it actually made a lot of sense (and not only because they had very little documentation, haha). Now every time I come across a new innovative soft- or hardware approach to put code into a computer more efficiently™ I remember that guy and chuckle.
+100
The primary metric for “good code” is how quickly and easily it can be read and understood by a new engineer with basic domain knowledge. The real costs are borne by the reader, and not the writer — and so you always want to optimize for the reader, and not the writer.
Tell that to my (ex) coworkers. They saw “Lua” and were like “not CV material! TILT!” Come on! Lua is simpler than Javascript, for crying out loud!
I’ve been using the sample implementation repo in my project and it works fine. The big thing I miss from the standard library now is some form of middleware chaining.
I wonder if they will go as far as adding something simple like alice to the standard library.
Works fine, no?
In that example, other applies first and first applies second. It’s confusing if you have more than a small number of middleware and handlers.
For sure! Middlewares are decorators, and it’s intuitive (to me, at least) that the outer-most decorator is executed first. But I’ve no doubt this isn’t universal among programmers.
I think a(b(c(h))) is clear but h = c(h); h = b(h); h = a(h) is confusing. In general, you want the code to read in abc order.
Sure! I personally find it easier to understand the other way around. But potato potato, you know.
Maybe this leads to something…
https://github.com/golang/go/issues/38479
The issue seems sort of dead. I’ve written a library with RoundTripperFunc myself, so that could be worth pulling out into its own proposal. I don’t like the proposed Middleware API in that issue. It’s overly complicated. The Then method is fluff.
I agree that it’s overly complicated. I just have a hope that someone comes up with a better API and something that can be made in a stdlib compatible way.
With one half (or more) of why I’d use a non-stdlib option gone, it would be so nice if middleware support could find its way into net/http.
Maybe I should propose just adding this:
It’s pretty minimal, but it gets the job done.
I mean, middleware is already supported in the stdlib, just by virtue of the fact that http.Handler is an interface, which any caller can decorate as they please.
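(A minimal sketch of that stdlib-only approach, with hypothetical middleware: each middleware takes an http.Handler and returns a decorated http.Handler, and the outermost wrapper runs first.)
```go
package main

import (
	"log"
	"net/http"
)

// logging decorates an http.Handler with request logging.
func logging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("%s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

// requireHeader rejects requests that are missing the named header.
func requireHeader(name string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.Header.Get(name) == "" {
				http.Error(w, "missing "+name, http.StatusBadRequest)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})

	// logging runs first, then the header check, then the handler.
	var h http.Handler = logging(requireHeader("X-Request-ID")(hello))

	http.ListenAndServe(":8080", h)
}
```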
I don’t know much about Kafka or its type of workloads, but looking at object storage costs has shown me that they tend to be about 10x the cost of block storage. If you’re putting this big data-abstraction layer in your pipeline anyway, why would you want to build it atop object storage instead of something else?
Like… this just seems to be a hugely artificial problem that this then solves by dint of throwing away the advantages that object storage has. Then they appear to… build a filesystem atop object storage by spreading their actual data across multiple objects, if I’m understanding properly??? I very well might not be, but that’s kinda what it looks like.
I am so bewildered.
Cloud providers don’t let you share block storage across availability zones, but object storage is shared by all availability zones in a region. As they say in the post, object storage also acts as their networking layer. It’s a neat way of building a multi-datacenter shared memory abstraction. Durable replication is a hard problem, and I like the way they’ve sidestepped dealing with it here. They could end up with a product more reliable than Kafka with significantly less effort and lines of code.
Also many use cases for message brokers don’t involve permanent long term storage of messages. So storage costs may be less of a factor.
I think it will make a lot more sense to you if you read our original blog post: https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka
It’s built on top of object storage to reduce costs and operations compared to traditional Apache Kafka. Building it on top of S3 makes it ~10x cheaper and eliminates virtually all the hard parts of operating Kafka.
Building a streaming abstraction on top of object storage turns out to be really difficult, but we did do it, and there are a lot of benefits to doing so. You’re right, though, that effectively what we created was a virtual filesystem.
You’re caching S3 data in your application memory. Is application memory cheaper than reading from S3? If so, is S3 the right tool for the job, so to speak?
The cache is only necessary for the “live edge” of the last few seconds of data, and therefore is not significant to the overall cost of the system.
For example, a workload writing 1GB/s of data would need roughly 15GB of cache per AZ. That is not significant compared to the other costs such as S3 API operations or CPU cores.
In terms of whether S3 is the right tool for the job… I think so. WarpStream is roughly 5-10x cheaper per GiB transferred than Apache Kafka, we wrote 0 code to deal with replication, and users get to operate a completely stateless system for the data plane. I think that’s pretty good evidence it was a good choice.
I’m curious about the assumptions this system makes re: S3 access costs. For example, do you treat S3 as more or less equivalent to local disk in terms of stuff like read latency?
I think this idea is fascinating. I think our data engineer is about to deploy MSK so I’ll share this with him.
Regarding this quote from the “Kafka is dead…” article:
How are you checking that data has been durably persisted? Does it mean “The PutObject call returned a success code”, because the core assumption of WarpStream is that S3 just always works? Or is there some extra verification?
Yeah, we just assume that if PutObject() succeeded then the data is durable. To be fair, while it’s not a perfect assumption, AWS (and other cloud providers) have a pretty good track record here, and cloud object storage is arguably more reliable and battle-tested than any other storage technology on the market.
Yeah assuming that PutObject is durable is the assumption everyone makes with all their S3 data, so can’t fault you for that.
How is block storage 10x cheaper than object storage? In my AWS region, EBS is 4x the cost of S3 per GB-month.
I think because you have to pay for all your GETs and PUTs with object storage, but EBS comes with at least some IOPS built into the price.
Don’t forget paying for triple replication, provisioned IOPS, and interzone networking (for replication).
If you design the storage system for it, object storage can be much cheaper than EBS.
Can you elaborate? I’m not understanding how you’re going to get much cheaper.
I checked up on Vultr, Digital Ocean, and a few other places. Apparently I’m just wrong. Which feels weird ’cause I did a bunch of research on this a few months ago and recall object storage being always ridiculously more expensive, even not counting for IOPS/egress.
why go1.19, when 1.21 is available and 1.19 isn’t supported anymore?
And why not fasthttp if they’re going to put their finger on the scales with httpbeast?
fasthttp isn’t spec-compliant and doesn’t support HTTP/2. Is httpbeast doing the same kind of shortcuts?
httpbeast seems to only implement HTTP 1.1.
From their GitHub page: