We’ve adopted a microservices architecture. We run over 2000 services and counting!
This seems… a bit much.
They had 1551 employees in 2019, so more microservices than employees!
Out of curiosity, what is your specific concern with running this many? There’s quite a few pros/cons, so if I know your specific concern then I can have a go at responding to it :)
I wouldn’t necessarily call it a concern. I would probably have gone for a much simpler architecture, but I assume there are plenty of smart people at Monzo and you have weighed the pros and cons in much better detail than I can possibly infer from a blog post.
I’m quite curious how you split the business logic between so many microservices and what the thought process behind creating so many is. Maybe write about this on the company blog?
Yes there are pros/cons, and it does warrant a whole separate blog post to go into them in more detail. The best I can give you is a talk from a couple of years ago Modern Banking in 1500 Microservices. There’s also a summary of this in The Register.
I’m not aware of internal guidance on how granular our microservices should be. It’s more of a cultural thing based on the existing system, and it has shifted over time. For example, I think the services being written a couple of years ago were more micro than they are today.
I’ve found this presentation from 2020 that talks about your approach to writing microservices. It’s quite interesting to see the amount of inherent complexity that an online bank has to deal with just to provide what we think are basic services.
I’m a little late to this discussion but I have these questions: Does it make code organization and testing difficult? Do you have one repo per service? How do you test interactions between so many services?
We have a monorepo for all our services. On the whole, code organisation isn’t too tricky. Each service is in its own directory, and you can easily “jump to definition”/“find references” on the RPC protobuf definitions to figure out what the upstream and downstream references are.
With the sheer number of services the totally flat hierarchy has been getting tricky to make sense of, so we’ve been putting effort into grouping services into “systems” to make the system easier to reason about at a high level.
As for testing, most services will have unit tests that stub out dependencies. We also have some integration tests that depend on some real network dependencies running (e.g. a database). In addition we also have acceptance tests that run in our staging environment and test many services from the user’s point of view.
One of the problems with our unit tests is that our mocks can make incorrect assumptions about the dependencies they are mocking. For that reason we are currently experimenting with consumer-driven contract testing to mitigate this. Stay tuned - we’re writing a blog post about this right now.
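For anyone unfamiliar with the idea, here is a minimal sketch of what a consumer-driven contract check can look like. It is not Monzo's tooling: every name below (`CONSUMER_CONTRACT`, `stub_balance_service`, `real_balance_handler`, `drifted_balance_handler`, `contract_violations`) is hypothetical, and real setups usually use a dedicated framework rather than hand-rolled checks. The point is only that the consumer writes down what it relies on, and the same contract is checked against both the consumer's stub and the provider's handler, so a stub that drifts from reality gets caught.

```python
# Hypothetical example: these names are illustrative only, not real services.

# The consumer records what it relies on: a sample request and the
# response fields (with types) that it actually reads.
CONSUMER_CONTRACT = {
    "request": {"account_id": "acc_123"},
    "response_fields": {"balance": int, "currency": str},
}


def stub_balance_service(request):
    """Stub the consumer uses in its own unit tests."""
    return {"balance": 1000, "currency": "GBP"}


def real_balance_handler(request):
    """Stand-in for the provider's real handler (extra fields are fine)."""
    return {"balance": 1000, "currency": "GBP", "updated_at": "2021-01-01"}


def drifted_balance_handler(request):
    """A provider that changed a field's type without telling consumers."""
    return {"balance": "10.00", "currency": "GBP"}


def contract_violations(handler, contract):
    """Return a list of ways the handler's response breaks the contract."""
    response = handler(contract["request"])
    problems = []
    for field, expected_type in contract["response_fields"].items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems


if __name__ == "__main__":
    for handler in (stub_balance_service, real_balance_handler,
                    drifted_balance_handler):
        print(handler.__name__, contract_violations(handler, CONSUMER_CONTRACT))
```

Running the same check on the provider side (in the provider's CI, say) is what makes it "consumer-driven": the provider finds out it is about to break a consumer before it deploys.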
It was around that number in their materials from 2 years ago. Either the number is old, or they have started replacing instead of building in new places.
The rate of increase has slowed significantly over the past 2 years. I’m not sure why this is off the top of my head, but my feeling is it’s partly due to a cultural shift within engineering, and partly because a lot of the broad features have already been implemented, so a lot of the development work goes into iterating on existing systems.
Great article, the deployment CLI sounds amazing and would love to hear more about it!
We have two key CLI tools. Both are opinionated, and I’d say they follow more of a “swiss army knife” design than the unix philosophy.
If by any chance you are able to watch QCon videos, I’d highly recommend this talk by Suhail that goes into detail on our tooling.
Thanks! Hopefully the talk will be made public sooner or later 😊
So… no QA?
QA is a much broader topic than “gating deployments on manual testing” which is what you seem to imply. If you do many of the other things right, you might find that gating on manual testing no longer catches enough defects to be worth it.
In fact, gating on manual testing actually works counter to many other good QA practices, some of which are described in TFA, like small batches.
This. Sometimes you need to gate this way, but it’s a sign of a problem, not a feature to bring everywhere.
QA is integrated into development in the form of full tests.
Having QA be an extra few month process is a sign of less than ideal development practices.
Hopefully more companies take this approach, it is easier to compete with companies that don’t have QA.
Honestly I’ve never worked anywhere where I felt like a QA organization added much value. They would automate tests for product requirements which were usually covered by product-dev test suites anyway, except the QA tests would take longer to run, they’d be far flakier, and they would take much longer to identify the cause of failure. Moreover, the additional manual testing they would perform in-band of releases rarely turns up issues that would reasonably block a release (they aren’t usually the ones finding the security issues, for example). It really feels like QA is something that needs to be absorbed into product/dev much like the devops philosophy regarding infrastructure–basically, developers should just learn to write automated tests and think about failure modes during development (perhaps via a design review process if necessary).
It seems like the thinking behind QA orgs is something like “these well-paid devs spend all of this time writing tests! let’s just hire some dev bootcamp grads for a fraction of the cost of a developer and have them write the tests!”. Unfortunately, writing good tests depends on an intimate understanding of the product requirements and the source code under test as well as strong automation skills, and the QA orgs I’ve worked with rarely tick any of these boxes (once in a while you get a QA engineer who ticks one or more of these boxes, but they’re quickly promoted out of QA).
I have had an experience where the QA team found a showstopper “go back and fix before our customers get broken” bug.
Sometimes a long QA cycle can really help, but it’s not something you want to need.
I don’t doubt that these happen, but plausibly a company without a QA org might’ve done more diligence and caught it early rather than throwing it over the wall to QA. Moreover, even if the QA org caught this one, it must be weighed against the overall inefficiency that the org introduces. Is one show stopper really worth being consistently late to market? In some orgs the answer might be “yes” but in most it probably isn’t (unless “show stopper” means “a bug so serious it sinks the company” by definition).
As it turns out, not always (e.g. https://spectrum.ieee.org/yahoos-engineers-move-to-coding-without-a-net)
I have worked at companies with long QA cycles for good reasons - among other things we shipped software on physical devices, and things like iOS apps where you cannot fix a mistake instantly because of validation. But at a Web-based software company I would also say it is probably not always a good trade-off.
FWIW, Monzo is mostly accessed through apps on your phone. When they launched the bank you couldn’t even login on their website. Their apps do seem like they’re mostly just browsers accessing some private site, though.
Fast deploys and QA aren’t incompatible. The classic way of handling this is to write code behind feature flags, so that the code can be integrated into production-esque environments (like a staging env) and handled at various paces.
This has the added benefit of making it easy to roll out refactors and other pieces of code without having long-running branches that accumulate a lot of stuff. And you can do stuff like release features for a subset of users, do betas, roll back feature releases, all mostly decoupled from the code writing.
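To make that concrete, here is a rough sketch of a percentage-based feature flag. It assumes a simple in-memory flag table; the names (`FLAGS`, `bucket`, `is_enabled`, `pay`, the flag names) are all made up for illustration, and a real system would load flags from a config service so they can change without a deploy.

```python
import hashlib

# Hypothetical flag table: flag name -> percentage of users who get the feature.
FLAGS = {"new_payments_flow": 10, "dark_mode": 100, "legacy_export": 0}


def bucket(flag, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, flags=FLAGS):
    """True if the flag is on for this user; unknown flags are treated as off."""
    return bucket(flag, user_id) < flags.get(flag, 0)


def pay(user_id):
    """Both code paths are deployed; the flag decides which one runs."""
    if is_enabled("new_payments_flow", user_id):
        return "new flow"
    return "old flow"


if __name__ == "__main__":
    users = [f"user_{i}" for i in range(1000)]
    on = sum(is_enabled("new_payments_flow", u) for u in users)
    print(f"{on} of {len(users)} users see the new payments flow")
```

Because the bucket is a hash of the flag and user ID, a given user gets the same answer on every request, and rolling a feature back is just setting its percentage to 0 rather than reverting a deploy.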
And of course 100 deploys to prod per day in a full CI/CD system is likely a lot of tiny deploys that are just like… version bumps of libraries for some random service and the like.
We do QA! This is not something I’m very involved with, so I’ll struggle to give you details. I’d recommend you check out this recent blog post on our approach to QA.
Is it me, or does this seem like an excessive amount of deployments?
How would any amount be excessive? 100 is honestly not that big
I’d like to hear more about the automated tests that presumably are what really enable this kind of quick deployment
I wrote about this in a bit more detail in this reddit comment.