if you are a startup that wants to disrupt anything
There is the hint. This won't scale very far. You want error aggregation, otherwise your Slack channel is just a fire hose that ends up benchmarking Slack and your Electron client when anything goes south.
I’m really curious whether this is true. If compiler error messages are analogous then it might not be: if you start your project with lots of warnings enabled and you have a policy of fixing them as they appear, the overhead of reading every one is low. If you start with a mature project and then turn on warnings, you’ll see millions of them and need some triage to get them down to a manageable level (and you might never get there if you’re writing new features at the same time). On the other hand, if hardware failures are analogous, then small systems will see few and each can be individually fixed, but large systems will see errors scale with system size (or a bit faster, due to correlated failures). I have no idea which side of the line API failures like this fall on.
Friends of mine had the honor of writing such error aggregation. The previous system would spit out a message per error. The problem was that they were working with third-party APIs a lot (for example payment processors). So when one of the other companies they were talking to had problems (which could just mean they changed their API), they would receive a ton of such errors. Imagine PayPal getting errors for every transaction or onboarding when a bigger bank decides to screw up its API.
And this kind of “external influence creates a burst of identical errors” is the thing I want to point out here. Most systems aren’t closed in the sense that you can just mail yourself every error once you have more than 100 users. They have to account for external influences where you start getting problems that you couldn’t have caught earlier, and which you may only be able to fix after talking to another person. This could just be a DNS misconfiguration which creates 500 hits for every customer that’s trying to perform a certain request.
The point is that they want to fix every kind of error, but not get an alert for every instance of it. Otherwise you’re drowning in messages.
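As a rough sketch of what “alert per kind, not per instance” could look like (Python, with invented names like record_error and notify; none of this is from the article): every occurrence still gets logged, but a notification only goes out once per error fingerprint within a cooldown window.

import time
from collections import defaultdict

# Cooldown between notifications for the same kind of error (assumed value).
ALERT_COOLDOWN_SECONDS = 15 * 60

_counts = defaultdict(int)   # fingerprint -> occurrences since the last alert
_last_alerted = {}           # fingerprint -> timestamp of the last notification

def fingerprint(error_type: str, endpoint: str) -> str:
    # What counts as "the same kind" of error; real systems often hash
    # the stack trace or an error code instead of these two fields.
    return f"{error_type}:{endpoint}"

def record_error(error_type: str, endpoint: str, detail: str) -> None:
    key = fingerprint(error_type, endpoint)
    _counts[key] += 1
    log(key, detail)  # every instance is still recorded for debugging

    now = time.time()
    if now - _last_alerted.get(key, 0.0) >= ALERT_COOLDOWN_SECONDS:
        notify(f"{key}: {_counts[key]} occurrence(s) since last alert, latest: {detail}")
        _last_alerted[key] = now
        _counts[key] = 0

def log(key: str, detail: str) -> None:
    print(f"[log] {key} {detail}")      # stand-in for a real log pipeline

def notify(message: str) -> None:
    print(f"[slack] {message}")         # stand-in for a Slack webhook call

# A burst of identical failures produces one notification, not a thousand:
for _ in range(1000):
    record_error("TimeoutError", "/charge", "payment processor did not respond")

The fingerprint is the design knob: group too coarsely and distinct bugs hide behind one alert, group too finely and you’re back to a fire hose.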
To me that isn’t much different from a compiler giving you tons of errors when you refactor something, probably with warnings about unused variables piled on top, just because your core state struct has a renamed field that propagates throughout the whole codebase.
The previous system would spit out a message per error. The problem was that they were working with third-party APIs a lot (for example payment processors). So when one of the other companies they were talking to had problems (which could just mean they changed their API), they would receive a ton of such errors. Imagine PayPal getting errors for every transaction or onboarding when a bigger bank decides to screw up its API.
Well, the kind of errors that would be reported in the way described here would necessarily be errors as observed by users of the overall system: high-level errors defined and returned by that system’s API, not the intermediate errors encountered by the components that happen to implement it.
Even internal errors can stem from parts of your network having problems, a disk for one service running full, etc. You still want that information, just not that volume of notifications.
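To illustrate the split the previous two comments describe, between the high-level errors an API returns to users and the internal causes behind them, here is a minimal hypothetical sketch (the names PaymentFailed and _call_payment_processor are made up for illustration):

# Hypothetical names throughout; the point is the shape, not a real API.
class PaymentFailed(Exception):
    """The high-level error the system's API actually returns to callers."""
    def __init__(self, public_message: str, cause: Exception):
        super().__init__(public_message)
        self.cause = cause  # never shown to the user, but logged and aggregated

def _call_payment_processor(card_token: str, amount_cents: int) -> str:
    # Stand-in for a real third-party call; fails here so the wrapping is visible.
    raise TimeoutError("payment processor did not respond")

def charge_card(card_token: str, amount_cents: int) -> str:
    try:
        return _call_payment_processor(card_token, amount_cents)
    except (ConnectionError, TimeoutError, OSError) as internal_error:
        # The user sees one stable, high-level error; monitoring keeps the cause
        # (DNS trouble, a full disk, a flaky third party) attached to it.
        raise PaymentFailed("Payment could not be processed right now.", internal_error)

Aggregation can then key on the high-level error, while the attached cause still tells you whether the culprit was the disk, the network, or a third party.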
Yes, I am a bit stung by the “Google bad” framing, but I can’t say for sure what I’ve learned from this article, except the warm fuzzy feeling of their artisanal self-satisfaction at the fact that they care to read every error.
Customer happiness does not stem from squashing every possible exception in your code. Hyperfocus on that can actually be detrimental to putting yourself in the user’s shoes and seeing the big picture.
What does customer happiness stem from? Reliable software? Smoother UX? Deeper empathy for your customers? That’s what they say they got from doing this:
However, we noticed a powerful second-order effect emerge over time. The team began obsessing over the user experience. Following this process builds a visceral understanding of how your system behaves. You know when a new feature has a bug before the first support tickets get opened. You know which customer workloads will require scaling investments in the next 90 days. You know what features see heavy usage and which ones customers ignore. You’re forced to confront every wart in your application. Slowly, your team builds a better understanding of your customers and this trickles down into every aspect of product development. You begin to hold yourselves to higher standards.