1. 35

  2. 17

    I’ve seen component rewrites be hugely successful (e.g. replace a part). I’ve seen system rewrites fail miserably (replace an entire system of parts). Some of my colleagues have seen both work, others have seen both fail.

    I think the key is not to add features while you’re in the middle of the rewrite, and not to over-architect the new thing. It also helps if the same team (or part of it) that wrote the first version participates in the rewrite. New people will often just make new mistakes, because they don’t understand past decisions that were driven by business processes. Not rushing it out probably helps too.

    1. 17

      Seems like the wrong takeaway to me. You had great intentions, someone came in and told you to throw them away and rush things. You decided to go ahead and do that, and suddenly rewrites and future-proofing things are “bad”?

      1. 8

        The “future-proofing” was almost entirely ways to increase scope.

        1. 8

          The “future-proofing” was almost entirely ways to increase scope.

          My interpretation here is that “future-proofing” was not a set of well-described features that addressed written scenarios, were added to design documents, and were then estimated and prioritized in cooperation with the product team and management. My guess is that in this situation, as I have seen many times, “future-proofing” was stated as an overall requirement that all code should follow.

          And that, I believe, makes future-proofing as a pitfall the most important takeaway from your article. The problem with this kind of future-proofing is larger than just the risk of increasing scope. It’s about the mentality that silently increasing scope is OK, without it being vetted by the expertise of the rest of the company.

          I think code should always be looked at alongside a design. If there is anything in code, be it features, infrastructure, an elaborate data model or even quality-driven version control practices, that is not needed by any requirement in the functional design, something is wrong and should be changed, either to the code or to the design.

          As soon as engineers start introducing ‘silent’ features or requirements, i.e. requirements that are only seen by engineers and not by the product team and management, things start to slide. For example, if a requirement such as “it works with short response times for more than 10,000 simultaneous users” is not in a design document, it should not be supported by the code. Obviously, if someone brings an engineer a design document that misses this requirement, that should be pointed out and the consequences discussed. But then, at least, there will be an official estimate of this feature, and management can make a conscious decision and compare trade-offs, which often means de-prioritizing other requirements, because development hours are not unlimited and the estimate of maintenance hours increases with every feature.

          I have come to the conclusion that every line of code in a commit that is not directly justified by a written and verified requirement should, in fact, be flagged as a mistake, even if there is an expectation that the requirement will be added in the next version.

        2. 7

          Not only that….

          The heart of the task matching code was this monstrosity of a cross join that took the other people on the team a few sheets of graph paper to break down and understand.

          That’s a huge red flag too… it says: “You know our DB structure, our indices, the super cunning query optimizer in the DB engine and all that Good Stuff?

          Eh. None of that applies to this problem.”

          BIG RED FLAG.
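
          To make the smell concrete: purely as a hypothetical sketch (the article doesn’t show the actual query), task-matching cross joins tend to have this shape:

          ```python
          # Hypothetical reconstruction -- not the query from the article.
          # A CROSS JOIN materializes every (worker, task) pair before filtering,
          # so indices and the query planner have little to narrow down up front.
          MATCH_TASKS = """
          SELECT w.id AS worker_id, t.id AS task_id
          FROM workers w
          CROSS JOIN tasks t
          WHERE w.region = t.region
            -- ...pages of further conditions, applied to the full cartesian product
          """
          ```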

          1. 8

            You had great intentions, someone came in and told you to throw them away and rush things. You decided to go ahead and do that, and suddenly rewrites and future-proofing things are “bad”?

            The problem is that runaway scope can push the delivery time of the “non-rushed” system out to infinity. The people building the system are usually very reluctant to realize this – they tend to use the rationale you described to defend their working pace, when the real issue is the scope of the work involved.

            As the author described, “future-proofing” can be an extremely tempting way to inflate scope – many scenarios can happen in the future, programmers love to show how clever they are in thinking up new ones, and nothing is ever obviously wrong. Limiting scope to what is happening now (or what will happen in the very near future) is a conservative, but effective, way to limit this problem.

            1. 2

              Keeping within scope and future-proofing are two different things. If the problem was a runaway scope, then that should be the takeaway. If you don’t future-proof your code, it will not live very long. Future-proofing is about designing a system that lasts, and in some cases it’s about simplification (read: reducing scope, not increasing it). That takes time to do well.

              1. 2

                Why is building systems that last worth it? When the facts change, perhaps the software should change.

                1. 1

                  Why is building systems that last worth it?

                  No one is stopping you from building something that breaks shortly after encountering the real world if that’s what you want to do.

            2. 2

                I think the message was that what happened (scope creep, overengineering) is almost inevitable. Even if you know exactly that those are the risks, you cannot sufficiently divert the forces pushing toward them. Only if everyone is on board, in writing, and your leads and senior engineers can be relied on to scrupulously guard against overengineering, and your management can be relied on to have your back against other parts of the company, might it be evitable.

              1. 2

                  I agree most of it doesn’t seem to be about future-proofing. I’ve seen that work when it focused on data formats and interfaces. The 64-bit portability layer that carried the System/38 through its AS/400 and IBM i phases is an example. Amusing and relevant: its key ideas came out of the abandoned “Future Systems” project.

                This is mostly about the risk of rewrites of legacy systems and a product team trying to use a legacy system for what it wasn’t designed for.

                1. 4

                    I’ve found that real future-proofing is usually about removing or abstaining from things rather than adding them. Before adding a feature in a certain way, one should stop and think about what it may prevent them from doing in the future.

                  You can’t anticipate future requirements, but you can predict how your current decisions will interfere with them.

                  1. 2

                      Supporting your point, there’s also the fact that small, well-documented things are easier to rewrite. The cost goes up with the complexity. Once it’s past a certain point, there isn’t going to be a rewrite at all, or at least not a successful one.

              2. 12

                This project would be built with a microservices architecture

                Why? I don’t see a rationale for this anywhere in the article.

                  A “microservices architecture” is almost never a good idea. What possesses programmers to believe that if maintaining one system is hard, maintaining several would be easier? Or that network calls are less complex than function calls? Even among people who actually like this approach for some strange reason, anecdotally I’ve only heard tales along the lines of “oh yeah, I just use some microservices for simple things, like postcode validation”.

                …Because that’s somehow easier than using a library? 😐
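
                  A minimal sketch of the difference (the validation rule and the service endpoint are both made up):

                  ```python
                  import re

                  import requests  # only needed for the microservice version

                  # As a library: one pure function, no network, trivially testable.
                  def is_valid_postcode(postcode: str) -> bool:
                      return re.fullmatch(r"\d{4,5}", postcode) is not None

                  # As a microservice: the same check now needs an endpoint, a timeout,
                  # and error handling for a machine that might simply not be there.
                  def is_valid_postcode_remote(postcode: str) -> bool:
                      resp = requests.get(
                          "https://postcode-svc.internal/validate",  # hypothetical service
                          params={"postcode": postcode},
                          timeout=2,  # and what do we do when this expires?
                      )
                      resp.raise_for_status()
                      return resp.json()["valid"]
                  ```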

                1. 12

                    Looking back, we thought the task-running part was a logical place to split the code. It was also the “new idea” that others had had success with internally. We got nerd-sniped.

                  1. 4

                    The main (only?) valid rationale I can see for microservices is when the services need to be reused and composed in different ways to solve different problems (particularly when the services manage their own state). Splitting a logically singular app into services just for fun is, in practice, no fun at all.

                    1. 2

                      That’s interesting. Genuinely curious — because I’m drawing a blank personally — could you give an example of when a part of a system should maintain its own state, and when that state should not be adjacent to, i.e. in the same database as, the core system state? And why would this be different from a monolith collaborating with multiple databases?

                      1. 4

                        I can’t go into specifics of the main example I’ve seen in practice, but the way I (perhaps incorrectly) think of microservices is as islands of independent functionality. There is no ‘core system state’; if you want to know something, you have to ask the service responsible for it. The services can then be part of different independent business flows with minimal spillover effects.

                          Let’s say you have an interactive business flow that involves a user going through a series of processes (A, B, C, D) in a mobile app. A ‘gateway’ service provides an API for the app and orchestrates calls to internal microservices A, B, C and D that provide specific functionality. The gateway might know where a user is up to in the flow, but will rely on the services to manage the state related to their part of the process.

                        Another gateway API could be written for a batch process that progresses through processes B, D and E. Changes to either flow are less likely to impact on each other, particularly if the service boundaries are well thought through.
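
                          A minimal sketch of that shape (service names and client interfaces are hypothetical):

                          ```python
                          # Hypothetical gateway for the interactive flow: it orchestrates calls
                          # and remembers where each user is up to, but owns no domain state.
                          class MobileGateway:
                              def __init__(self, a, b, c, d):
                                  self.steps = [("A", a), ("B", b), ("C", c), ("D", d)]
                                  self.progress = {}  # user_id -> index of the next step

                              def advance(self, user_id, payload):
                                  idx = self.progress.get(user_id, 0)
                                  name, service = self.steps[idx]
                                  service.process(user_id, payload)  # service manages its own state
                                  self.progress[user_id] = idx + 1
                                  return name

                          # A second gateway composes a different subset (B, D, E) of the same
                          # services for the batch flow, with minimal spillover between the two.
                          class BatchGateway:
                              def __init__(self, b, d, e):
                                  self.services = [b, d, e]

                              def run(self, record):
                                  for service in self.services:
                                      service.process(record["id"], record)
                          ```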

                          Of course you can do this in a monolith with different endpoints for different clients/flows, but that can get messy unless you’re quite strict about how you structure your code and data. It’s easy to tweak something in your ‘user’ flow and accidentally break something seemingly unrelated in your ‘batch’ flow.

                        I’ve worked on another system where microservices used essentially the same data store, with some logical partitioning. Over time, the partitions got thinner and thinner and the boundary between services less and less clear. We ended up collapsing it back into a monolith because there was no gain (and a lot of pain).

                        1. 3

                            In my department, we ended up with a microservice-based design kind of by accident (it was how the work was initially divided up by different departments, only to end up in one department). It worked out for us because it separated the front end (accepting requests for service) from the backend (the business logic), so when we needed to support a new front end (the original front end talks SS7, the new one talks SIP) it was easy enough to write. It also allows us to make updates to the business logic without disrupting existing connections with our customers (very important with SS7).

                          1. 2

                              The natural example is something like Auth0 before Auth0 itself existed: centralizing your authentication and authorization so that users can have a single login to all your stuff. Then you can have public APIs, websites, or native apps and have authorization work identically.

                            1. 1

                              Right, but why does that need to be a separate system? Why would that not work as part of a monolithic system?

                              1. 2

                                  The need arises when you have multiple products that use some common functionality, and the products could have independent scaling needs. For instance, imagine GitLab with VCS, CI, and Issue Tracking. You’d want someone to be able to log in from each into the same account, you’d want to be able to scale or improve your login system independently of Issue Tracking, etc.

                                It’s just separation of concerns in a way that gives you independent scalability, updatability, reusability, etc.
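
                                  A minimal sketch, with a made-up endpoint: each product ships its own code, but they all answer “who is this?” the same way, and the auth service can be scaled or upgraded on its own:

                                  ```python
                                  import requests

                                  AUTH_URL = "https://auth.internal/verify"  # hypothetical shared auth service

                                  def current_user(token: str) -> dict:
                                      """Called identically from the VCS, CI, and issue-tracking products:
                                      one login system, deployed and scaled independently of all of them."""
                                      resp = requests.get(
                                          AUTH_URL,
                                          headers={"Authorization": f"Bearer {token}"},
                                          timeout=2,
                                      )
                                      resp.raise_for_status()
                                      return resp.json()  # e.g. {"user_id": "...", "scopes": [...]}
                                  ```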

                                1. 2

                                  That makes sense, but I think the discourse in this area typically forgets what cost this approach has, and whether or not it’s genuinely financially appropriate for the business adopting it.

                        2. 3

                          Why? I don’t see a rationale for this anywhere in the article.

                          I don’t think the author was defending the decisions per se, hence

                          Those of you who have been down this road before probably have massive alarm bells going off in your head.

                          and

                          The second lesson is that making something microservices out of the gate is a terrible idea. Microservices architectures are not planned. They are an evolutionary result, not a fully anticipated feature.

                          1. 1

                            Sorry, I don’t mean for this at all to be an attack on the author. But to address the quotes you pulled:

                            Those of you who have been down this road before probably have massive alarm bells going off in your head.

                            This does not suggest the author has herself been down this road before.

                            The second lesson is that making something microservices out of the gate is a terrible idea. Microservices architectures are not planned. They are an evolutionary result, not a fully anticipated feature.

                            Again, this does not suggest that the author was not aware of this being a terrible idea before embarking on it (and I recognise she was part of a team). I believe my perspective is reasonable, given the use of the word “lesson”, suggesting she learned from this experience.

                            1. 5

                              I, for one, would like to read more posts which describe a project’s failure points and lessons learned. There’s too much focus on “look at how awesome we are” and “we implemented this shiny new thing and it’s great” in tech.

                              1. 3

                                I had not been down that road before the failed project in question, no.

                          2. 8

                            Aside from the main point, am I alone in wincing at

                            This project would have 75% test coverage as reported by CI

                              especially if that means “line coverage”? You’ll end up with tons and tons of untested error-handling paths, in a product that was apparently Rather Important…
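
                              A contrived illustration (made-up code) of how that happens:

                              ```python
                              # The happy-path test below executes every statement except the raise,
                              # so a line-coverage tool reports roughly 75-80% -- while literally
                              # none of the error handling has ever been exercised.
                              def charge(account, amount):
                                  if amount <= 0:                     # executed (condition is false)
                                      raise ValueError("bad amount")  # never executed
                                  account.balance -= amount           # executed
                                  return account.balance              # executed

                              def test_charge_happy_path():
                                  class Account:
                                      balance = 100
                                  assert charge(Account(), 30) == 70
                              ```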

                            (But I’m working in a very different domain and have mostly been doing non-coding work for a while, so I may be out of touch?)

                            1. 4

                              Yeah, they ain’t like us. It’s more Move Fast and Break Things or Keep Legacy Systems Running and Growing Forever out there with minimal QA. It’s why I switched focus to automated methods for QA that get results quickly and with minimal effort by developers. Until regulation or liability hits, most won’t adopt anything that requires much effort.

                              1. 1

                                What would be your greatest hits version of “automated methods for QA that get results quickly and with minimal effort by developers”?

                            2. 8

                              The lesson here is threefold. First, the Big Rewrite is almost a sure-fire way to ensure a project fails. Avoid that temptation.

                                Often true, but equally often unavoidable. I’m writing the replacement (and vast enhancement) of a 20-year-old system that simply is unsustainable. Previous failed attempts at the rewrite tried to do it more piecemeal, building on or at least interoperating with the old system. They all failed.

                                Our Big Rewrite has had a very well-received phase 1 release, and is about to release phase 2, which, fingers crossed, will be just as well received.

                              The second lesson is that making something microservices out of the gate is a terrible idea. Microservices architectures are not planned. They are an evolutionary result, not a fully anticipated feature.

                              I don’t think this is true. The best use case for microservices is (IMO) when the services need to be composed in different ways to meet different needs. I’ve worked on a system like that, and I don’t think it would have been possible without the architect doing a ton of up front work.

                                I’ve never seen a microservices architecture evolve. I’ve seen (and argued for) a microservices architecture devolve into a single service, though.

                              1. 8

                                My post on “big rewrites” and “the second system effect” can be found here, in “Shipping the Second System”:

                                https://amontalenti.com/2019/04/03/shipping-the-second-system

                                The most relevant lesson, IMO, is “Big Rewrites are New Products”. That is, every engineer, happening upon an “old” codebase, thinks to him or herself, “This is a big pile and I could have done it better.” But the senior engineers resist the temptation to rewrite – and instead, they start reading and understanding why the system is the way it is.

                                The question is whether there are real limitations of the existing system that can be solved to deliver instantaneous new customer value, while preserving the old value the system used to provide. In that way, the only successful rewrites are new products, with some level of backwards compatibility with old use cases.

                                A big system rewrite that adds no new customer value, but simply redesigns the code and architecture, is almost guaranteed to fail. The new design will work less well than you think, implementing it will take longer than you think, and verifying that it works as well as the old system might take you as long as understanding the old system would have taken you.

                                So, after you pour weeks, months, or (in some awful cases) years into that rewrite, the business question will be, “What customer value do you have to show for it?” If the answer is, “None – we successfully implemented 100% of the old system and the customers didn’t notice”, then, guess what – it’s still a failure, thanks to opportunity cost! And if the customers do notice, due to new instability, new bugs, or new limitations, then you’re even worse off than when you started.

                                1. 5

                                  I’m curious, at some point OP writes:

                                  The task execution layer worked perfectly in testing, but almost never in production.

                                  Why not testing with production data first?

                                  1. 7

                                    We did! It worked beautifully in testing. Even with the things that failed in production!

                                    1. 4

                                        More often, the issue is with differences in production infrastructure rather than just the data.

                                    2. 5

                                      I’m surprised nobody has mentioned the second system effect yet, it seems to crop up everywhere.

                                      1. 5

                                        The lesson I take from this is to tell Product they’re going to have to wait.

                                        Then the product team came in and noticed fresh meat. They soon realized that this could be a Big Thing to customers, and they wanted to get in on it as soon as possible. So we suddenly had our deadlines pushed forward and needed to get the whole thing into testing yesterday.

                                        Why did you agree to the revised deadlines?

                                        1. 3

                                          Product owned the purse.

                                        2. 3

                                          This is why people advocate refactoring instead, right? Refactor to increase test coverage. Refactor to bring more and more parts of the system under CI. Refactor subsystems out so they can be independently worked on.

                                          I feel that if there are systems you want to rewrite, you should also accept that the new system will do way less than the old one, but do those select tasks better. If you can be OK with that, then a rewrite could work out.

                                          1. 3

                                            Rewrites are OK, but they should not be larger than what the current team can handle. This is quite fundamental, I think. Each team can handle a different project size depending on their skills and how well they work with each other. It’s important to know your limits.

                                            Instead of doing a full rewrite, try to cut the existing product down into manageable chunks, ideally defined by interfaces, and then replace them one by one (a minimal sketch of what I mean follows below). It might seem like it takes longer, but each chunk is something that you can deliver, and none of that work is lost.
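
                                            Something like this shape (names are made up):

                                            ```python
                                            from abc import ABC, abstractmethod

                                            # Carve one chunk of the legacy product out behind an interface...
                                            class TaskMatcher(ABC):
                                                @abstractmethod
                                                def match(self, task): ...

                                            class LegacyTaskMatcher(TaskMatcher):
                                                def match(self, task):
                                                    ...  # delegates to the old code path, untouched

                                            class NewTaskMatcher(TaskMatcher):
                                                def match(self, task):
                                                    ...  # the rewritten chunk, deliverable on its own

                                            # ...then swap implementations chunk by chunk, e.g. behind a flag,
                                            # so each replaced piece ships (and can be rolled back) independently.
                                            def make_matcher(use_new: bool) -> TaskMatcher:
                                                return NewTaskMatcher() if use_new else LegacyTaskMatcher()
                                            ```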

                                            1. [Comment removed by author]

                                              1. 12

                                                Attacking someone’s persona like this after they’ve freely shared a story with the community is in extremely poor taste. I hope you can reconsider your approach.

                                                1. 3

                                                  Holy shit, what the fuck.

                                                  Edit: oh, “Status: Active user with invites and story submissions disabled”. Just a troll.