1. 91

  2. 28

    Have done this, can confirm it works. If you have trouble with the political side of things, introduce the strangler as a diagnostic tool or a happy path and start shipping new features under its guise instead of under the legacy system. Arguing for two concurrent systems with initially disjoint use cases is easier than a stop-the-world cutover.

    1. 3

      Strongly seconding. I’ve seen countless person-hours wasted on trying to replace legacy systems wholesale. IT systems are like cities: They grow organically and need to change organically if you want to avoid displacing whole swaths of the population and causing more harm than good.

      1. 1

        How do you handle downstream consumers of functionality as you strangle the original piece of code? Can’t always force everyone to move to a new API route or use a new class in a library.

        1. 8

          The best case is not to force them to change anything. In my case, we did exactly as the article mentioned, and slowly transitioned to a transparent proxy. Then slowly we turned on features where API requests were handled by the new code, rather than being proxied to the old code.
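
          A minimal sketch of that kind of proxy, under assumptions of my own: the handler names (`legacy_handler`, `new_orders_handler`) and the in-memory toggle set are hypothetical, and a real proxy would forward HTTP requests rather than call functions.

          ```python
          from typing import Callable, Dict, Set

          Handler = Callable[[str], str]


          def legacy_handler(path: str) -> str:
              # Stand-in for forwarding the request to the old system.
              return f"legacy:{path}"


          def new_orders_handler(path: str) -> str:
              # Stand-in for the reimplemented endpoint.
              return f"new:{path}"


          class StranglerProxy:
              """Serve a route from new code only once it is migrated AND enabled."""

              def __init__(self, legacy: Handler) -> None:
                  self._legacy = legacy
                  self._migrated: Dict[str, Handler] = {}
                  self._enabled: Set[str] = set()

              def register(self, route: str, handler: Handler) -> None:
                  self._migrated[route] = handler

              def enable(self, route: str) -> None:
                  self._enabled.add(route)

              def disable(self, route: str) -> None:
                  # Rolling back is just turning the toggle off again.
                  self._enabled.discard(route)

              def handle(self, path: str) -> str:
                  if path in self._enabled and path in self._migrated:
                      return self._migrated[path](path)
                  # Everything else is passed through untouched.
                  return self._legacy(path)


          proxy = StranglerProxy(legacy_handler)
          proxy.register("/orders", new_orders_handler)
          before = proxy.handle("/orders")  # still proxied to the old code
          proxy.enable("/orders")
          after = proxy.handle("/orders")   # now handled by the new code
          ```

          Consumers keep calling the same paths throughout; only the toggle changes which code answers.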

          1. 4

            It’s obviously harder if your API has multiple consumers (some of which you don’t control). One option is to have the proxy expose the same endpoints as the legacy system, though that’s not without its own complications (especially if the technologies are particularly divergent).

            1. 3

              That’s a political problem, not a technical one. You solve it by building political power in the organisation.

              1. 2

                Only if the consumers of your API are within your organisation…

                1. 3

                  For this you need a separate gateway service that hosts the API and then forwards to either the new service or the legacy service. It’s also often workable to use the legacy service as the API gateway, and route through it to the new external service.

                  Be mindful of added latencies for remote systems.

                  1. 2

                    If people are paying you for a working API, I’d struggle to imagine a viable business case for rebuilding it differently and asking customers to change.

                    1. 3

                      It happens all the time. That’s one of the reasons that REST APIs are generally versioned.

                      1. 1

                        That still doesn’t solve the problem. The customers still need to transition to the newer version.

                        1. 3

                          I think our wires are crossed. I was using multiple versions of REST APIs as a counterpoint to the idea that there’s no “viable business case for rebuilding it differently and asking customers to change.”

                          That change may even be driven by customers asking for a more consistent/functional API.

                2. 3

                  I’ve normally handled this by making all consumers use a service discovery system to find the initial endpoint, and then using that system to shift traffic “transparently” from the old system to the new one.

                  This is admittedly a lot easier if your consumers are already using service discovery, otherwise you still have a transition to force. But at least it becomes a one-time cost rather than every migration.
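
                  A toy illustration of the traffic shift, not tied to any real discovery product: the `Registry` class, the service name, and the backend addresses are all made up for the example. Consumers only ever ask for the logical name; the weights decide who answers.

                  ```python
                  import random
                  from typing import Dict, Optional


                  class Registry:
                      """Toy discovery registry: a name resolves to a backend by weight."""

                      def __init__(self, rng: Optional[random.Random] = None) -> None:
                          self._weights: Dict[str, Dict[str, float]] = {}
                          self._rng = rng or random.Random()

                      def set_weights(self, service: str, weights: Dict[str, float]) -> None:
                          if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
                              raise ValueError("weights must be non-negative and not all zero")
                          self._weights[service] = dict(weights)

                      def resolve(self, service: str) -> str:
                          backends = self._weights[service]
                          names = list(backends)
                          weights = [backends[name] for name in names]
                          return self._rng.choices(names, weights=weights, k=1)[0]


                  registry = Registry()
                  # Start with all traffic on the old system (hypothetical addresses).
                  registry.set_weights("billing", {"legacy.internal:8080": 1.0, "new.internal:8080": 0.0})
                  # Later: send roughly a tenth of lookups to the new system.
                  registry.set_weights("billing", {"legacy.internal:8080": 0.9, "new.internal:8080": 0.1})
                  endpoint = registry.resolve("billing")
                  ```

                  Finishing the migration is then a matter of moving the weight to the new backend, with no change on the consumer side.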

              2. 19

                Here’s the referenced article from Martin Fowler:


                1. 8

                  This is one of those things that seems so obvious now that I’ve read it, but for some reason have never thought of or seen before.

                  One idea from the linked article by Martin Fowler is that we should design systems so that they’re easier to “strangle” in the future. What kinds of design decisions support that?

                  1. 9

                    Narrower interfaces, explicit versioning.

                    I’ve found narrower interfaces to really help. Specifically, making it difficult for users to rely on unintentional effects, side effects or postconditions. Eliminating accidental postconditions is often quite simple: document what the intended interface is, then look at the output/result/…, see if there’s more, and remove it, randomise it, or think about how rare or unusual behaviour can be made more usual, so that the documented behaviour is simple and the rest appears complicated.
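
                    A small sketch of the “remove it or randomise it” idea. The function and the record layout are hypothetical; the point is only that the result carries nothing beyond what is documented, and that the one thing callers might accidentally rely on (ordering) is deliberately unstable.

                    ```python
                    import random
                    from typing import Iterable, List, Optional


                    def find_active_users(
                        users: Iterable[dict], rng: Optional[random.Random] = None
                    ) -> List[str]:
                        """Return the names of active users, in no particular order."""
                        rng = rng or random.Random()
                        # Remove: expose only the documented field, not the whole record.
                        names = [u["name"] for u in users if u.get("active")]
                        # Randomise: nobody can come to depend on storage order.
                        rng.shuffle(names)
                        return names


                    users = [
                        {"name": "ada", "active": True, "internal_id": 1},
                        {"name": "bob", "active": False, "internal_id": 2},
                        {"name": "cy", "active": True, "internal_id": 3},
                    ]
                    active = find_active_users(users)
                    ```

                    A replacement implementation then only has to honour “the set of active names”, not the old system’s row order or internal fields.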

                    1. 4

                      Anything I can come up with in that direction will also reduce the necessity of a rewrite in the first place.

                    2. 7

                      I’m a big fan of this pattern, and have seen it used pretty successfully. It’s definitely the approach I’d take for most live service migrations.

                      I will say I’ve experienced a few problems when using this pattern, though they’re relatively minor:

                      • This is a good way to expose any clients that are relying on undocumented behavior! :P

                      • When the functionality is mostly migrated and only a small fraction of traffic is getting forwarded, the will to finish the migration can stagnate. This can lead to the old service becoming a long-lived black box that no one understands but that can still bite you in production.

                      • Changes to dependencies need to be carefully controlled. If the old service and the new one aren’t being upgraded in lock-step, but both rely on some of the same dependencies, it can bite you. (Example: if your new service pulls in a new version of a common RPC lib, and the changes are not quite fully backward compatible…)

                      • If the whole system is sufficiently resource-intensive, the new service can require a substantial increase in total resource needs before it finally takes over. That’s fine but needs to be explicitly planned for. (Though this only matters if you’re sufficiently big that capacity planning is a thing for you.)

                      1. 6

                        I’ve attempted my fair share of rewrites and had to work with a few legacy systems and yeah, this is definitely good advice. I think it’s best to keep the application running at all times, not only for business reasons but for psychological ones too: not being able to ship the stuff you’re working on for a long while takes a big toll on morale, and having to deal with the legacy bits keeps you grounded in reality and stops you from chasing the pie in the sky of the perfect system.

                        1. 2

                          At one job, I could only watch as my first 4-6 months of work on their total rewrite project was shelved in one go. Needless to say, my personal investment plummeted and I left for another job as quickly as I could. Their loss.

                        2. 5

                          Honestly, I have to say: It depends. This technique is easily applied when the system is entirely written in one language, and you can write wrapper methods and classes in almost every case. But here’s an example of a situation where you can’t do that, or where doing it would probably introduce too much complexity relative to the value of the technique.

                          • The old stack has a frontend written in $old_tech (e.g. Rails views + jQuery; or perhaps AngularJS), and a backend written as a REST API
                          • The new stack is to have its frontend written in $new_tech (say, Vue, or React, or Angular), and is supposed to interact with the backend via GraphQL
                          • The database gets to stay the same, serving both old and new stacks

                          What can be done in such situations is to replace pages one by one (whether in a classic web app or an SPA), each self-contained, so that developing one doesn’t impact the others.

                          1. 3

                            One step in that process would be for you to “implement” your REST API through GraphQL, internally.

                            For the frontend, you can have Angular or Vue side by side, so you can write nice helpers that make things easier to deal with, while just basically calling into jQuery plugins or the like.

                            The DB remaining the same seems like a decent strategy overall.

                            Now this is just from your data points and they might be harder or easier (I’m of the opinion that some old tech can stick around forever if it’s not busted), but you can get incremental value add with a bit of work here

                            1. 1

                              Not 100% the same, but when we did it we just served different /route to different backends. In our case the rewrite was PHP like the original application, but that was coincidence. It wasn’t very frontend-heavy in the sense that I read from your question, but I don’t see a huge difference. Also maybe you can just refactor the old system enough that it lends itself to interacting with the new, or deliberately is out of the way. E.g. put a very basic GraphQL mock interface on the old app.

                            2. 5

                              I used this approach to learn Rust! I already knew Ruby, so I prototyped a game project in it. Then, using native extensions via helix and rutie, I replaced the Ruby code section by section until the entire game was written in Rust. Learning the new language and its unfamiliar paradigms proved easier for me, compared to academic study, when I already had the concrete context of familiar code to guide me.

                              1. 4

                                Did this at my previous job, and absolutely it worked well. It did cause some incidents, but was the best test to verify that our replacement could handle the load and all the weird quirks that the old system implemented. We did still have political problems getting certain consumers to buy in, but I’m happy to say that we did indeed sunset the old system, and avoid having to implement features twice.

                                1. 3

                                  In my previous job, we made the mistake of trying to rewrite our huge legacy app that served customers day after day. Every day we were flooded by bug-related tickets, so we decided it was time to rewrite something new from a sane ground. We were naive, because it just ended as the post described: two apps to maintain and deploy.

                                  Eventually, we overcame our fear of the legacy code and we took the bull by the horns, by dealing with that legacy code! It worked better. Much better.

                                  We used that technique on the API level. The legacy app was a (mostly) SSR web app. We implemented new features in an API-only manner in a special new namespace for endpoints: /api/new/.../....
                                  The UI was client-side-rendered, but the final user couldn’t tell the difference between legacy and new pages!

                                  When we had trouble understanding a feature in the original code, we simply wrapped our new API around it, so we could buy some time to understand and re-implement or clean that part, if needed. Again, in a fully transparent manner for the final user (except that the app was faster or less buggy after the clean/rewrite of a feature).
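
                                  A rough sketch of that wrapping step, with invented names throughout (`legacy_compute_discount`, `DiscountApi`, the tier values): the new API is given a clean contract from day one, delegates to the untouched legacy routine at first, and can later be pointed at a reimplementation without callers noticing.

                                  ```python
                                  from typing import Callable, Dict


                                  def legacy_compute_discount(customer_row: Dict[str, object]) -> str:
                                      # Stand-in for an opaque legacy routine with an awkward
                                      # contract: raw row in, percentage string such as "15%" out.
                                      return "15%" if customer_row.get("tier") == "gold" else "0%"


                                  def wrapped_legacy_rate(tier: str) -> float:
                                      # Adapter: translate the clean call into the legacy one.
                                      raw = legacy_compute_discount({"tier": tier})
                                      return float(raw.rstrip("%")) / 100.0


                                  def reimplemented_rate(tier: str) -> float:
                                      # Written later, once the legacy behaviour is understood.
                                      return {"gold": 0.15}.get(tier, 0.0)


                                  class DiscountApi:
                                      """New-style API; backed by legacy code until told otherwise."""

                                      def __init__(
                                          self, impl: Callable[[str], float] = wrapped_legacy_rate
                                      ) -> None:
                                          self._impl = impl

                                      def discount_rate(self, tier: str) -> float:
                                          return self._impl(tier)


                                  api_before = DiscountApi()                   # wraps legacy
                                  api_after = DiscountApi(reimplemented_rate)  # after the rewrite
                                  ```

                                  Running both implementations against the same inputs is also a cheap way to check the rewrite before switching over.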

                                  1. 3

                                    As a corollary, my personal recommendation is to start the strangling from a subsystem that is the smallest one and/or has the smallest total “interaction surface” to other subsystems (i.e. interface/API) that you can find; ideally, both. Those future “snipping points” may be cunningly hidden, but once you can find them, you at least have some interface you can work with.

                                    Try to also find all the places where this interface is used by its clients; ideally, you should be able to find all of them. Then, try to replicate this interface, however crazy it is; if you found all its uses, you should ideally only implement the functionalities that are actually used (this helps make your time spent rewriting this subsystem as short as possible).

                                    Importantly, don’t try to fix the actual interface yet. You’ll come to this in the future, but it’s just not the time for that now. Doing this would make you try and fix the whole system, and that will crush your soul. Still, keep in mind that the rewrite will take some time anyway.

                                    The good part is that, after you’re done with it, you have a fragment of the code that is conquered. This gives you an area you know is sane. This is your foothold, your first safe haven. The green pasture where you can recover your sanity and get back to when tired. And you know its borders, and you protect them fiercely.

                                    After you’ve celebrated with your friends and regained some strength with your favourite beverage, it’s time to bring peace and prosperity to some next orc-infested area. In other words, rinse and repeat. In the not-yet-rewritten code, try to find a subsystem that is the smallest one and/or has the smallest total “interaction surface” to other subsystems… however, now the already conquered subsystem does not count towards this total “interaction surface”. You can also now consider fixing the part of the interface between the “new grounds” you’re conquering and the ones already redeemed. It’s not mandatory, but you can do it if you think you have time. If not, leave it till later, maybe after you’re finished conquering the new subsystem. Godspeed!

                                    At the time I was doing this, I used to call it “isolating the tumor” and then “cutting it away”.
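
                                    The “smallest interaction surface first” heuristic can be sketched roughly as below. The subsystem names, sizes and call edges are invented, and counting one edge per pair of interacting subsystems is a simplification of my own; real surfaces would be measured in endpoints, call sites or shared tables.

                                    ```python
                                    from typing import Dict, List, Optional, Sequence, Set, Tuple

                                    Edge = Tuple[str, str]


                                    def interaction_surface(
                                        name: str, edges: Sequence[Edge], conquered: Set[str]
                                    ) -> int:
                                        """Count links from `name` to subsystems not yet rewritten."""
                                        count = 0
                                        for a, b in edges:
                                            if a == b or name not in (a, b):
                                                continue
                                            other = b if a == name else a
                                            if other not in conquered:
                                                count += 1
                                        return count


                                    def next_target(
                                        sizes: Dict[str, int], edges: Sequence[Edge], conquered: Set[str]
                                    ) -> Optional[str]:
                                        """Pick the smallest surface; break ties by smallest size."""
                                        remaining = sorted(set(sizes) - conquered)
                                        if not remaining:
                                            return None
                                        return min(
                                            remaining,
                                            key=lambda s: (interaction_surface(s, edges, conquered), sizes[s]),
                                        )


                                    def strangling_order(sizes: Dict[str, int], edges: Sequence[Edge]) -> List[str]:
                                        conquered: Set[str] = set()
                                        order: List[str] = []
                                        while True:
                                            target = next_target(sizes, edges, conquered)
                                            if target is None:
                                                return order
                                            order.append(target)
                                            conquered.add(target)


                                    # Hypothetical system: sizes in lines of code, edges are interactions.
                                    sizes = {"auth": 900, "billing": 4000, "reports": 1200, "mailer": 300}
                                    edges = [
                                        ("auth", "billing"),
                                        ("billing", "reports"),
                                        ("billing", "mailer"),
                                        ("auth", "reports"),
                                        ("reports", "mailer"),
                                    ]
                                    order = strangling_order(sizes, edges)
                                    ```

                                    Links to already conquered subsystems drop out of the count on each round, which is the “does not count towards the total” rule from the description.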

                                    1. 2

                                      “Strangling” is much catchier than the term for this I’m more familiar with, “progressive refactoring”.

                                      Outside of computing, this is also the strategy used by some insurrectionary anarchists, & PM notably outlines a very detailed & software-engineer-ish plan for slowly replacing all governments with small co-ops in his book bolo’bolo.

                                      Of course, when you don’t have a current user-base that’ll be displaced, it’s almost always better to rewrite it wholesale (because nobody will experience the churn except you).

                                      1. 1

                                        For anyone in or around Leeds, UK, I’m demonstrating a Strangler example this Friday.