1. 108
  1. 25

    Don’t forget that there’s a halfway compromise for YAGNI: instead of actually implementing the anticipated requirement, you may merely ensure it’s reasonably easy to change the code to add it later. (For the record, I agree with the article about things that you can’t go back and change, like logging non-PII metadata.)

    For example, if you expect to have more than one address per user, you don’t need to implement multi-address storage and management right away, but make sure your API has IDs for addresses and returns them as an array (which initially can be an array with just one element).
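    A minimal sketch of that API shape (the function, field names, and values here are hypothetical, just to illustrate the idea):

```python
# Hypothetical handler: addresses are an array of objects with stable IDs
# from day one, even though each user has exactly one address today.
def get_user(user_id):
    # A real app would query a database; hardcoded here for illustration.
    return {
        "id": user_id,
        "name": "Ada",
        "addresses": [  # already an array, so adding a second address later
            {           # changes no client-visible contract
                "id": "addr-1",
                "street": "1 Main St",
                "city": "Springfield",
            },
        ],
    }

user = get_user("u-42")
first_address = user["addresses"][0]
```

    Clients iterate over the array from the start, so multi-address support later is purely additive.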

    Another suggestion for YAGNI is to make it accountable: file a feature request for everything you anticipate, and put // TODO: #issuenumber next to the code that makes an incompatible assumption. If you later actually get to implement it, it’ll be easier to track what needs to be changed. If you don’t, you’ll have in writing how much waste you’ve avoided :)

    1. 18

      A nice response with many additions from Simon Willison here: https://simonwillison.net/2021/Jul/1/pagnis/

      1. 12

        This is a good list of concrete exceptions.

        We can afford to do YAGNI only when the systems we are working with are malleable and featureful.

        This generalization, however, strikes me as only partially true.

        Yes, YAGNI is about embracing change and recognizing that you can’t predict the future.

        But, maybe more so, YAGNI is a psychological strategy to avoid an enticing bad habit. It is an admission that programmers love to be clever by “preparing for everything” and designing beautiful abstractions that will save the day… tomorrow. It is a recognition of how costly this is.

        For this reason, it applies outside programming too (business planning, organizing your house, etc) – it’s a stop-loss mechanism for certain types of behavior for a certain type of personality.

        1. 6

          I take issue with the statement that you need a relational database, and not just because my day job is at a document-database company (Couchbase). Saying “most data is naturally relational” is misleading. Most data includes relationships, links, between records, yes. That does not mean the same thing as the specific mathematical formalism of relations implemented in relational databases.

          For example, the linked-to article about switching from MongoDB talks about the social network Diaspora. Social network data sets are practically poster children for graph databases, another type of non-relational DB. The key reason Diaspora switched from MongoDB turns out to be:

          What’s missing from MongoDB is a SQL-style join operation, which is the ability to write one query that mashes together the activity stream and all the users that the stream references. Because MongoDB doesn’t have this ability, you end up manually doing that mashup in your application code, instead.

          Ouch. That is a problem with MongoDB, not with document databases themselves. As a counterexample, Couchbase’s N1QL query language definitely has joins (it’s roughly a superset of SQL) and there are other document DBs I’m less familiar with that do joins too. Joins are not something limited to relational databases. (And they’re of course the bread and butter of graph DBs.)

          In my own projects I’ve found document databases very useful during prototyping and development because their schemas are much more flexible. You can more easily apply YAGNI to your schema because, when you do need to add a property/column/relation, you don’t have to build migrations or upgrade databases or throw out existing data. You just start using the new property where you need it. (This is an even bigger boon in a distributed system where migrating every instance in lockstep can be infeasible.)
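          A tiny sketch of what that flexibility looks like, using plain Python dicts as stand-ins for documents in a document DB (the field names are made up):

```python
# Documents written before a field existed simply lack it; readers that
# tolerate its absence mean no migration is needed when it's introduced.
old_doc = {"_id": "u1", "name": "Ada"}                          # written last year
new_doc = {"_id": "u2", "name": "Grace", "nickname": "Amazing"} # new property

def display_name(doc):
    # .get() falls back gracefully for documents predating the field
    return doc.get("nickname", doc["name"])

assert display_name(old_doc) == "Ada"
assert display_name(new_doc) == "Amazing"
```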

          1. 7

            very useful during prototyping and development because their schemas are much more flexible

            Prototypes always become production systems. That flexibility makes them hell to work with (ask me how I know). MongoDB seems to be the document DB for most devs, and I’ll share that my experience with it has been miserable in every. single. instance.

            1. 1

              My experience differs; I think the way the data is structured is what brings a lot of pain - or leeway - later on.

              The problems I’ve often seen with both SQL and document-based solutions are not always obviously similar, but they often fall into the same general category.

              But yeah, working with low-level SQL or low-level MongoDB usually has more pitfalls.

          2. 4

            More generally, instead of a boolean flag, e.g. completed, a nullable timestamp of when the state was entered, completed_at, can be much more useful.
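            For instance, a minimal sketch in SQLite (table and column names are illustrative):

```python
import sqlite3

# A nullable timestamp column instead of a boolean "completed" flag.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT, completed_at TEXT)"
)
con.execute("INSERT INTO task (title) VALUES ('write report')")

# Marking the task complete records *when* it happened, not just that it did.
con.execute("UPDATE task SET completed_at = datetime('now') WHERE id = 1")

# The boolean view is still trivially available:
done = con.execute(
    "SELECT completed_at IS NOT NULL FROM task WHERE id = 1"
).fetchone()[0]
assert done == 1
```

    You get the flag for free, plus an audit trail you cannot retroactively reconstruct if you only stored a boolean.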

            In the DB we have at work, the only update to any row we allow is to take a field which is currently null and set it to the current time. All other new data is recorded by inserting a new row and setting the stop date on the previous row which represents the record in question.

            It’s a little fuzzier than “purely immutable” rows, but it still ensures that no information is ever lost: the old version from any given point in time can always be reconstructed.
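            A sketch of that insert-and-close pattern in SQLite (the schema is my guess at the shape, not the actual one described):

```python
import sqlite3

# Rows are never updated except to set a previously-NULL stopped_at;
# all new data arrives as freshly inserted rows.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE price (
    id         INTEGER PRIMARY KEY,
    product    TEXT,
    amount     INTEGER,
    started_at TEXT DEFAULT CURRENT_TIMESTAMP,
    stopped_at TEXT
)""")
con.execute("INSERT INTO price (product, amount) VALUES ('widget', 100)")

# A "price change" closes the old row and inserts a new one.
con.execute(
    "UPDATE price SET stopped_at = datetime('now') "
    "WHERE product = 'widget' AND stopped_at IS NULL"
)
con.execute("INSERT INTO price (product, amount) VALUES ('widget', 120)")

# Current price: the row whose stopped_at is still NULL.
current = con.execute(
    "SELECT amount FROM price WHERE product = 'widget' AND stopped_at IS NULL"
).fetchone()[0]
assert current == 120
# History is preserved: both rows still exist.
assert con.execute("SELECT COUNT(*) FROM price").fetchone()[0] == 2
```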

            1. 3

              Is there a reason you use this pattern instead of the pattern of copying all inserts/updates to an audit table?

              1. 1

                What would be the rationale for introducing a separate table here? Not sure I understand. What you’re describing sounds more complicated.

                1. 4

                  Well, it would greatly reduce the row count in the (non-audit) table & would accomplish the “… can always be reconstructed” bit at the same time.

                  It seems likely the audit table would be read less, but I don’t know exactly which queries you’re trying to serve best.

                  1. 1

                    Hm; well, I could imagine some circumstances where this would be useful but none of the advantages really apply to our situation, so there’s no need to complicate things for us.

                    1. 2

                      You can also do one audit table for all the other tables across the whole app.

              2. 2

                Do the “stopped” records sit in the same table?

                I’ve not tried that, but I’ve read that it can cause problems since you need to ensure that all application queries contain AND stopped_at IS NULL, otherwise you can have the old data surfacing in the app (or reporting tools etc).

                I’d appreciate your thoughts on whether this is a problem in practice. I’m interested in effective patterns on modelling dynamic data and preserving history.

                A proposed alternative is to also move the record to an archive table.
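                One mitigation (an assumption on my part, not something from this thread) is a view that pre-filters to live rows, so application queries can’t forget the NULL check:

```python
import sqlite3

# Point the application at a view of live rows; only reporting/history
# code touches the underlying table directly.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, email TEXT, stopped_at TEXT)"
)
con.execute(
    "INSERT INTO account (email, stopped_at) "
    "VALUES ('old@example.com', datetime('now'))"
)
con.execute("INSERT INTO account (email) VALUES ('new@example.com')")

# The view bakes in the stopped_at IS NULL filter once.
con.execute(
    "CREATE VIEW live_account AS SELECT * FROM account WHERE stopped_at IS NULL"
)

rows = con.execute("SELECT email FROM live_account").fetchall()
assert rows == [("new@example.com",)]
```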

                1. 7

                  I’m not the person you asked, but I’ve worked with similar patterns before. There are a lot of use cases for keeping all the rows and marking to know which date ranges particular rows are valid for, or at least which ones are valid now and which used to be.

                  Most of my experience is with setups which used two timestamps per row, to indicate the beginning and end of the row’s period of validity, which is such a common pattern that some DBs support it as a built-in feature (for example, MS SQL Server’s “temporal” tables). A lot of heavily-regulated fields rely on setups like this, and merely deleting old rows and keeping around an archive/audit table is generally not enough.
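                  A minimal sketch of the two-timestamp validity pattern, in SQLite (column names are illustrative; this is not MS SQL Server’s actual temporal-table syntax):

```python
import sqlite3

# Each row carries the period it was valid for, so the state "as of"
# any past moment can be reconstructed.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE plan (member TEXT, tier TEXT, valid_from TEXT, valid_to TEXT)"
)
con.execute("INSERT INTO plan VALUES ('m1', 'basic',   '2020-01-01', '2021-01-01')")
con.execute("INSERT INTO plan VALUES ('m1', 'premium', '2021-01-01', '9999-12-31')")

def tier_as_of(day):
    # ISO-8601 strings compare correctly as text
    return con.execute(
        "SELECT tier FROM plan WHERE member = 'm1' "
        "AND valid_from <= ? AND ? < valid_to",
        (day, day),
    ).fetchone()[0]

assert tier_as_of('2020-06-15') == 'basic'
assert tier_as_of('2021-06-15') == 'premium'
```

                  Nothing is ever deleted, so queries against either timeline remain answerable.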

                  To give an example, I used to work for a company that needed to process (US) health-care claims. Very often there were rules about how they had to be processed which caused different claims to interact with/impact each other.

                  Say, for example, that a particular procedure can be done 6 times with no questions asked, but beyond that needs some sort of additional authorization or documentation. So you process them as they come in: 1, 2, 3, 4, 5, 6. Then one day a seventh claim comes in, but the date on it falls between the claims you previously thought were “2” and “3” (this is a valid and common thing, because health-care providers often get extended periods in which to submit claims, usually a year from the time the treatment was provided). So the new one has to become “3”, the old “3” becomes “4”, and so on until the former “6” becomes “7” and needs to be reprocessed under the rule for claims beyond the sixth.

                  But you also need the ability to keep and reconstruct/replay your source data, processing and decisions for two timelines: the original timeline with the first six claims you received, and the new timeline where you know there are seven and that they go in a different order. But both timelines agree on the sequencing of claims “1” and “2”, and both also depend on the existence and sequencing of those claims. And they in turn depend on other claims/sequencing in other rows. So there’s no clear point where you can just say “OK, drop all these rows and move them to the archive/audit table”, because the potential interactions mean you need the whole set available to be able to recalculate when new information comes in.

                  1. 1

                    Most of my experience is with setups which used two timestamps per row, to indicate the beginning and end of the row’s period of validity

                    Yes, I didn’t mention this because it’s much less frequently used, but we store the start date too. We haven’t gotten to this yet, but one big advantage of this is that you can insert rows with a start date in the future. This complicates insertions because you need to check to see if there’s a future-dated row and invalidate it if so, but it leads to some really interesting properties. It’s possible that will turn out to be error-prone and not worth the headache. We’ll see.

                    This was an easier sell on my team because we store billing data, but it has worked so well that I’d use the same pattern in any context where the performance/storage overhead was not prohibitively expensive.

                    I’ve read that it can cause problems since you need to ensure that all application queries contain AND stopped_at IS NULL, otherwise you can have the old data surfacing in the app (or reporting tools etc). I’d appreciate your thoughts on whether this is a problem in practice.

                    Yes, if a query is only interested in the most recent row, it needs to indicate that, and if it’s interested in the historical data it can access that as well. It has not proven to be a problem in practice because the database has always worked this way. I could imagine it being an issue if you introduced this pattern in a system that used more conventional patterns on other tables.

              3. 3

                Thank you for writing this.

                I feel like sometimes YAGNI is taken to mean “whatever, we’ll just do something”. YAGNI does not mean you should not think ahead or design a little bit.

                Another one to add to the already great list: enums. You already hint at it under Timestamps, but I always wince at code that defines a boolean field, then another, etc. Enums over booleans, where appropriate.

                1. 2

                  I have one weird opinion about enums, though: use strings that are constrained to a set of values, rather than numeric values or native DB enums. There are a few ways to do this, it’s not much trouble, and I have run into issues with changing enums too many times.
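                  One way to do that (my reading of “strings constrained to a set of values”) is a TEXT column with a CHECK constraint, sketched here in SQLite:

```python
import sqlite3

# Enum-like safety without a native DB enum or a numeric mapping:
# a TEXT column constrained to a known set of values.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL
        CHECK (status IN ('pending', 'shipped', 'delivered'))
)""")
con.execute("INSERT INTO orders (status) VALUES ('pending')")

# Values outside the set are rejected at the database layer.
try:
    con.execute("INSERT INTO orders (status) VALUES ('lost')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
assert rejected
```

                  Adding a new status later means widening the constraint, not renumbering anything, and the stored values stay human-readable in ad-hoc queries.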

                2. 1

                  This was already posted as a reply in this topic: https://lobste.rs/s/quywfp/yagni_exceptions#c_fwijse