I don’t speak for Amazon, but my experience has been that this kind of analysis and architectural refitting is essentially constant, and that’s a good thing. As volume and scope change, different approaches are needed. Monoliths are great for some problems and for some length of time; same with SOA and microservices. The lifetime of the problem usually sees one or all of those approaches as conditions change.
Completely agree with you. It’s all about the trade-offs, and sometimes a problem is not understood well enough at the start to properly analyze those trade-offs.
Given how flexible and malleable software is, it always amazes me how reluctant people are to refactor at scale, especially architecture. Electrical and mechanical engineering (let alone civil!) have massive up-front manufacturing costs, and existing stock needs to be used up or trashed, yet revisions are common. Software has none of that overhead, and yet everyone overreacts to revisions…
Software does have an up-front cost: testing that the replacement is equivalent. If you can’t fully simulate production without risking production, is it any surprise it’s hard to get rework through?
If we’re using other engineering disciplines as our comparison point, the testing requirements exist there too (and are much higher for electrical/mechanical/civil). We can get far closer to simulating a true production use case with software; it’s an unfortunate part of our industry that integration testing is mostly an afterthought.
For video processing, no less.
Confirmation Bias, the blog post.
Snark aside, I like that DHH and folks challenge the popular choices. Diversity is good. But if you really read the post, you see that the service was not a good fit for microservices/Lambda/Step Functions. I have worked on a project that used StepFunctions for event based big data processing and it worked really well for us. A lot better than having to run your own event platform or, worse, poll things all the time. Engineering is about trade-offs. Sometimes you are wrong, like the Amazon team was. That does not mean the approach is wrong for all use cases.
Not speaking for the Amazon team or for Amazon, but I’d wager the original approach wasn’t even wrong; it was the situation that changed. It’s super common at Amazon to build something that works great for some period of time, but then 1000x the users show up, or the service gets popular and now there are 150 peer teams that need to put code in your app, or whatever. If you hadn’t launched with something simple and easy like serverless, maybe you wouldn’t have survived to have the great new problems that force a rearchitecture; who’s to say?
It brings to mind the old, and 100% true, saw that “a legacy system is one that people actually use.”
I think DHH would agree. There are definitely use cases for microservices and serverless, but not everything fits this pattern. It reminds me of the idea that what works for Google surely works for me, like MapReduce: https://www.cloudcity.io/blog/2018/11/08/parsing-logs-230x-faster-with-rust/
The conclusions in Amazon’s article are that service and egress charges were too high, and that account limits for resource usage were too low. Wow. “Amazon can’t make sense of serverless!” DHH cries out, missing the point entirely, about a problem that would disappear on DHH’s “sovereign cloud” only because there are no per-request fees to read from local disk instead of S3. Run your own microservices and avoid these scaling limits, sure, but that means you deal with your own infrastructure. That’s the argument to be made, not a stupid gotcha-rant.
The overall story is: they originally designed this for substantially lower capacity than they’ve since decided to expand it to, and the serverless microservice architecture let them build the initial solution quickly. They’ve made fairly straightforward changes to the code to run siloed, vertically separated copies of a monolith, with each copy running only a configured subset of functionality. So it’s still kind of microservices, but they no longer spin up an entirely new service for each quality check.
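To make “each copy running only a configured subset of functionality” concrete, here’s a minimal sketch of the kind of thing I imagine (the detector names and env var are made up, not from Amazon’s post):

```python
import os

# Hypothetical sketch: every copy of the monolith ships all the quality-check
# detectors, but each deployment only runs the subset named in its configuration.
# (Detector names and ENABLED_DETECTORS are assumptions for illustration.)

def detect_black_frames(frame):
    return False  # placeholder

def detect_av_drift(frame):
    return False  # placeholder

def detect_blocking_artifacts(frame):
    return False  # placeholder

ALL_DETECTORS = {
    "black_frames": detect_black_frames,
    "av_drift": detect_av_drift,
    "blocking": detect_blocking_artifacts,
}

# e.g. ENABLED_DETECTORS="black_frames,av_drift" on one fleet, "blocking" on another.
enabled = set(os.environ.get("ENABLED_DETECTORS", "").split(","))
active = {name: fn for name, fn in ALL_DETECTORS.items() if name in enabled}

def analyze(frame):
    # Same binary everywhere; deployment config decides which checks this copy runs.
    return {name: fn(frame) for name, fn in active.items()}
```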
This is a case study in properly leveraging the cloud, not simply saying “serverless is too expensive at large scale, so it is a poor choice.”
They prototyped something, put it in production without premature optimization, and it worked so well they wanted to expand its usage to the point where it made more sense to spend more energy on their infrastructure. Microservice architecture has the advantage that it’s easier to scale individual functions. That flexibility, like all abstractions, does introduce some complexity, latency, etc.
They’re still using AWS services, and showing that building something with serverless is not a forever trap. You can just change things. :)
Every time I hear an argument in this area, it boils down to this:
Monolith to microservices: painful, generally worthwhile
Microservices to monolith: generally never happens, but manageable and not painful at all
That is, starting with and metastasizing a monolith puts you in more trouble architecturally than an expensive microservices architecture that can be streamlined and inlined when the abstractions become too expensive.
I used to work at AWS and nothing here surprises me (up to and including DHH beating his chest about how his decisions have always been right). I like Werner Vogels’ response to this: https://www.allthingsdistributed.com/2023/05/monoliths-are-not-dinosaurs.html
“Microservices” at Amazon also may not mean the same thing to everyone who looks at them. Each one is commonly maintained by an entire team, with operational tooling specific to it – dashboards, alerting, etc. – and there is deep technical support for some kinds of them (if you work at Amazon, don’t @ me about the variability of that support; until half a year ago I was in Builder Tools and worked closely with the teams that own all the “new” stuff). There are also “many tiny lambda” services; it’s a spectrum.
Anyway, yeah, Amazon can and does “make sense of serverless”. As a company they also change direction when the one they’ve got isn’t working.
This should be merged into the main article.
“I have worked on a project that used StepFunctions for event based big data processing and it worked really well for us.”
I’m just about to embark on building a data pipeline for work, and we’re probably going with Apache Pulsar. Expecting up to a few million documents per day, max. We need to send them through a variety of different machine learning models, merge the results, and put the final enriched data into Elasticsearch. In your experience, would Step Functions be something we should consider?
It’s not even the Lambdas here (which people often suspect) but the architecture Lambda nudges you towards. Step Functions are awkward and expensive, and if you can’t avoid them, then at least do your homework when it comes to pricing.
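To put a rough number on “do your homework”: for the volume mentioned in the pipeline question above, a back-of-the-envelope with Standard Workflows pricing (roughly $0.025 per 1,000 state transitions; check current pricing, and note every other number here is an assumption) looks something like this:

```python
# Rough back-of-the-envelope for Step Functions Standard Workflows.
# Assumes ~$0.025 per 1,000 state transitions (verify against current AWS pricing)
# and ~10 transitions per document (fan-out to a few models, a merge step,
# an Elasticsearch write, etc.); both figures are illustrative assumptions.
docs_per_day = 3_000_000            # "a few million documents per day, max"
transitions_per_doc = 10            # depends entirely on your state machine
price_per_transition = 0.025 / 1000

monthly_cost = docs_per_day * 30 * transitions_per_doc * price_per_transition
print(f"~${monthly_cost:,.0f}/month just in state transitions")
# ~$22,500/month before Lambda, data transfer, or Elasticsearch costs.
# Express Workflows or a streaming consumer (e.g. Pulsar) price very differently.
```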
That Prime Video team made one poor choice about transferring a lot of data between services, and now everyone proclaims the end of the microservice civilization.
I think that’s a harsh conclusion to draw from the outcome of a single team out of many at Amazon.