1. 19

As a basis for discussion:

  • DB validating the format of an email address: No
  • DB validating the price of a product can’t be negative: Yes

  Q1: How would you define the line between them?
  Q2: What are your favorite readings/resources on this topic?
  1. 11

    For me, the line depends on a few things:

    1. Is this a constraint/property that only the database can ensure? Uniqueness of given data would be my go-to here. If you need something to be unique in a database, only the database is going to be able to achieve that reliably without data races. In general, if only the database can enforce a given property, then enforce it there. This does require knowing what databases can enforce, and having spent some time thinking about data races.
    2. What is the expertise balance on the team? If you have a lot of people who can work in the app layer, and only a few who can work in the data layer, prefer to enforce business logic in the app layer. If your balance is the other way, then prefer to enforce it in the data layer.
    3. A lot of business logic should be distributed-systems (though not necessarily distributed-consensus) correct at a basic level, since that also helps it handle all of the edge cases of the real world. This is an argument for using the app layer.
    4. Different databases have different tools for programmability, which gives things different flavors. Postgres has things like PostgREST, and can be programmed in a wide variety of languages. SQL Server has basically four: T-SQL, C#, VB.NET, and C++. MySQL/MariaDB has support for stored procedures, and can also be extended in at least two different ways, though I don’t know the details. All of these are tools one can bring to a given data problem.
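The uniqueness point above (item 1) can be sketched with SQLite from Python, purely as an illustration: an app-layer check-then-insert can race, while a UNIQUE constraint rejects the duplicate atomically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT NOT NULL UNIQUE)")

def register(email: str) -> bool:
    # An app-layer "SELECT first, then INSERT" check can race: two
    # concurrent requests may both see no existing row and both insert.
    # The UNIQUE constraint makes the losing insert fail atomically.
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False

print(register("a@example.com"))  # True
print(register("a@example.com"))  # False: rejected by the constraint
```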

    I don’t know that I have a good reading on this particular tradeoff. I do know that databases tend to become the bottleneck in a lot of web apps, so enforcing any rules that aren’t of the form “X data can’t relate to Y data in Z way” outside the database will reduce the load on the database. If a particular bit of data integrity logic Must Never Be Broken, then enforcing it at the database level is the strongest tool you have to accomplish that.

    1. 8

      Why can’t a db validate the format of an email address? The pattern is known, even if the RFC is crazy compared to real email addresses.

      1. 10

        I recommend using the regex from HTML5 spec. Much simpler and very relevant as that’s what browsers use for input=email

        edit: link: https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address
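That WHATWG pattern transcribed to Python (a sketch; the spec publishes it as a JavaScript regex, and the transcription here is mine):

```python
import re

# The WHATWG HTML5 "valid e-mail address" pattern, transcribed from the
# JavaScript regex in the spec.
HTML5_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)

def is_valid_email(s: str) -> bool:
    # fullmatch anchors the pattern at both ends, like ^...$ in the spec.
    return HTML5_EMAIL.fullmatch(s) is not None

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("not an email"))      # False
```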

        1. 9

          For the most part, the validity of the email address itself is not an issue, it’s whether the user can be contacted by it. Knowing this requires a dance of sending the email and verifying that the user got it, which is way beyond the purview of the database. I usually don’t see a point in going beyond NOT NULL on a database field for email.

          1. 2

            I think it makes sense to (perhaps in the UI) just do a sanity check, like looking for an @ sign, just to catch any big fat-finger mistakes. Before the TLD explosion, it was pretty easy to check the TLD as well, but that’s over (yes, purposely ignoring UUCP/bang-path type addresses).

            Otherwise I basically agree with you. NOT NULL UNIQUE forces the email field to do the one thing you need from it that you can actually control: be a unique identifier. If it happens to also work for communication, bonus! :)

            1. 1

              Add a “validated” field and it’s good to go.

            2. 1

              Why can’t a db validate the format of an email address?

              You make a good point; you can validate an email address at the DB layer for sure. I looked into this a bit more now that you mentioned the RFC.

              There is a really good response by a DBA here suggesting a similar idea of using the HTML5 email regex spec (“if it’s good for HTML5/browsers, it’s good for the rest of us”): https://dba.stackexchange.com/a/165923

            3. 7

              I would argue that the DB is responsible for data integrity and thus should validate the format of an email address if possible.

              IMHO the DB is responsible for data integrity, data formatting, data storage, and data extraction and transformation (if possible).

              That last point is something that I’ve seen ignored a lot. I’ve seen countless apps pull a huge amount of data from the DB, do a trivial transformation, and store it all back when the DB could have done it faster, safer, and more clearly in situ.
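A minimal sketch of the in-situ point, with SQLite from Python as a stand-in: the commented-out round trip pulls every row into the app and writes it back, while a single UPDATE does the same transformation inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("Alice@Example.COM",), ("bob@example.com",)])

# The round-trip version: pull every row out, transform in the app,
# then write every row back.
# rows = conn.execute("SELECT rowid, email FROM users").fetchall()
# for rowid, email in rows:
#     conn.execute("UPDATE users SET email = ? WHERE rowid = ?",
#                  (email.lower(), rowid))

# The in-situ version: one statement, no data leaves the database.
conn.execute("UPDATE users SET email = lower(email)")

print(conn.execute("SELECT email FROM users ORDER BY email").fetchall())
# [('alice@example.com',), ('bob@example.com',)]
```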

              1. 1

                I would argue that the DB is responsible for data integrity and thus should validate the format of an email address if possible.

                I like this perspective. Adding on to the “if possible” statement: if it’s not easy to enforce in the DB, or it’s hard to modify/contribute/maintain there, it might be a good idea to handle it in the code, if your language of choice can do a better job. Interestingly, after reading so many of the responses in the thread, I can see the benefit of putting the logic at the DB layer and/or in app code.

              2. 6

                Most times the argument “code is easier to put in version control” wins for me. So I do as little in the DB as possible. But then, for the last few years I have been working mostly with Mongo, a document database, so the idea that a DB can give you any useful guarantees is starting to feel like a relic from the past for me anyway.

                1. 14

                  Checking in migrations should do the trick?

                  1. 1

                    the last years I have been working mostly with Mongo

                    You could go back to using a DBMS ;)

                  2. 5

                    Validation in the DB has the benefit that even if the application layer changes significantly it will still be checked. Just make sure you have proper migrations.

                    It is also possible to check in both of course.

                    I also like to think of the app doing input validation vs the database doing data validation. I don’t think the database should ever allow inconsistent data.

                    Applications tend to be faster moving than data models. Sometimes you also want to decouple application and database. Sometimes views are used for that, but especially for input of new data, sometimes a bit more logic is needed, because for application reasons and various use cases, different types of input can lead to the same kind of data being added.

                    Usually for simple CRUD applications it’s fine to just throw some form-validated input in there, but as soon as things grow a bit it can be a hassle, especially when relations come into play. Then oftentimes the logic in the application is too coupled with the application, and maybe even with the ORM, which can get messy, because it tends not to be fun when you think of the same data in a different context, and actually as relational data. Then having a good central point for validation can also become hard.

                    Another thing is that even if an ORM is used, validation can be bad if you want to use some more specific features or types. Take for example arrays or enum types in Postgres. Validation here can be hard or effectively impossible without just resorting to “let me write actual SQL”.

                    So to sum it up: consistency checks one really should do in the database, others you may do elsewhere, and sometimes one might want to do both. Broken data tends to be a lot more of a headache than maintaining both of these, and doing both gives you extra certainty, especially when the data model changes, let’s say for business requirements, or simply because of initially missing knowledge.
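For engines without a native enum type, a CHECK ... IN constraint is a common approximation (sketched here with SQLite; Postgres would use a real ENUM type instead, and the table and values below are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no enum type; a CHECK constraint approximates one.
conn.execute("""
    CREATE TABLE orders (
        status TEXT NOT NULL
            CHECK (status IN ('pending', 'shipped', 'delivered'))
    )
""")

conn.execute("INSERT INTO orders VALUES ('pending')")  # accepted
try:
    conn.execute("INSERT INTO orders VALUES ('teleported')")
except sqlite3.IntegrityError:
    print("unknown status rejected by the CHECK constraint")
```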

                    For the email address one might do both. Or essentially none because the verification might actually be if a link gets clicked in a confirmation email.

                    We live in a time where we want guarantees from programming languages, etc. I think having a database schema that does its best to guarantee correct data is a worthwhile investment.

                    1. 4

                      That seems to rhyme with “What would you use/declare a type for”?

                      • signed/unsigned? Yes
                      • email address? No.

                      What I find interesting here is that signed/unsigned are absolute data, where the number speaks for itself¹. Similarly for length-based validations. Email addresses, however, may need verification in your application. The more logic you’ll find within this validation, the more likely it is that it’s error prone, which warrants automated tests. I’d put most validation logic in code, so that it is visible and reusable in version control and covered in the test suite. (Besides, many regexes for validating email addresses are wrong, as apparently almost anything goes in front of the @.)

                      ¹ There are some extreme cases with floats and positive/negative Infinity. Let’s pretend that doesn’t exist for the purpose of this discussion.

                      1. 9

                        It feels like the question that you’re asking is really ‘what can the database guarantee about the data?’ and you should make the database enforce anything that you may later depend on being true. This insulates you from bugs in the application logic: if an invariant doesn’t hold, the insert fails and the database remains in a consistent state.

                        The email address example is interesting because the property that you might depend on is not the property that you can enforce. I have never been in a situation where I’ve depended on the fact that an email address is syntactically valid. It might be useful for limiting the testing surface of whatever parses data structures containing an email address, since you can use the regex in your fuzzing pipeline, but that seems somewhat niche.

                        I have depended on an email address being an identifier that can be used to establish a communication channel. This is a temporal property. Even if it holds when you insert the email address into the database, it may not be true at any given point because of either transient or fatal errors. As such, there’s no useful validation that you can do in the database. You might try connecting to the relevant SMTP server before you do an insert, which would guarantee that, at the very least, the email address was working at some point in the past. You might even try sending an email and leave the email address in some transient storage until the recipient clicks on a link to validate that they’ve received it. None of this validation makes sense to put in the database because it doesn’t validate a property of the data that clients of the database can usefully depend on.

                        1. 1

                          This is a temporal property

                          I think this is actually the real issue here. The DB can’t verify this initially, and even if it could, it won’t ever be able to verify this in the future for every user.

                        2. 1

                          I’d choose to declare an Email type. Does it ever make sense to assign an Email to a Phone number? But they’re both “just” strings…
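In application code this is the classic newtype pattern; a sketch in Python using typing.NewType (a static-checker-only distinction: at runtime both values are still plain strings, and the function names here are made up):

```python
from typing import NewType

# Distinct nominal types over the same runtime representation. A static
# checker such as mypy flags passing an Email where a PhoneNumber is
# expected, even though both are "just" strings when the program runs.
Email = NewType("Email", str)
PhoneNumber = NewType("PhoneNumber", str)

def send_receipt(to: Email) -> str:
    return f"receipt sent to {to}"

email = Email("user@example.com")
phone = PhoneNumber("+1-555-0100")

print(send_receipt(email))   # fine
# send_receipt(phone)        # mypy: incompatible type "PhoneNumber"
```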

                          1. 1

                            There are some extreme cases with floats and positive/negative Infinity. Let’s pretend that doesn’t exist for the purpose of this discussion.

                            Why? I think the float is a great example of why it’s not really like that.

                            Also one might argue that an email address is simply a new type introduced by a class or equivalent mechanism depending on the programming language one uses. It’s just a pretty messy type to interpret, but float also is not just random numbers

                            There doesn’t really seem to be a clear cut that’s not using historical context or for example how CPUs work and what features they provide.

                            1. 1

                              That’s what I’m saying: it’s not clear cut. But for the sake of the argument, let’s pretend that numbers (or at least integers) can be strictly on one side. Without this assumption, I’d lack an example for my in-database versus in-source-code distinction.

                              In hindsight, I should have chosen “bool” for my example, that would make this side-thread moot :)

                              1. 1

                                Hehe, I actually wanted to bring up boolean as another example, because that is sometimes 0 and 1 integers (or bytes), so it’s numbers with special meanings rather than text with special meanings.

                                Still though, we build up on relatively primitive types, how complex built in types get is as you say not clear cut and could incorporate a lot. After all many programming languages have things like URLs, IP addresses, etc. as types. That some are more special seems to mostly be related to how things happen to be implemented in C.

                                At least if one looks at C++ and especially Java which seem to be early examples that were often copied.

                                But to be more on topic again: URIs or maybe IPs are a better example than both floats and email addresses, because email addresses are often deemed to be very “special”.

                                Postgres for example has inet/cidr/mac address types. And unlike others they aren’t even part of contrib, so really built in.

                          2. 3

                            One of the nice things about Django’s model layer is that you can often get both db level and application level for the price of one. If you can arrange something similar then that is a winner.

                            1. 3

                              I would argue, if you have a sane database(like SQLite, PostgreSQL), then just put all the checks in the DB to start with. You move stuff out sparingly, only when it makes sense to do so. Other comments basically cover these exceptions, mostly limited to things the DB can’t easily check.

                              There is absolutely a case to do UI checks and database checks. The DB checks matter, the UI checks are just to make the user experience better, and should in no way be relied upon for data integrity(there are many ways to change data outside of a programmed UI, that happen all the time).
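The thread’s opening example (“price can’t be negative”) is exactly the kind of rule that survives any bypass of the UI; a sketch with SQLite (table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        name  TEXT NOT NULL,
        price INTEGER NOT NULL CHECK (price >= 0)
    )
""")

conn.execute("INSERT INTO products VALUES ('widget', 199)")  # fine

# UI validation can be bypassed (ad-hoc scripts, other services, someone
# in a DB shell "just fixing a few things"); the constraint cannot.
try:
    conn.execute("INSERT INTO products VALUES ('widget', -5)")
except sqlite3.IntegrityError:
    print("negative price rejected at the DB layer")
```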

                              If your DB is more like MongoDB, you have no choice but to do it in the application code. Hence I recommend never using databases like these unless you happen to fit a very niche case, in which case no discussion here will be useful.

                              Most applications can happily live with SQLite’s capabilities, and for those that can’t there is a 99%[0] chance PostgreSQL can handle it. All the other databases are either for very niche use cases or for fun, and can be ignored.

                              0: For the pedantic, I guessed @ this number :)

                              1. 2

                                The answer to that depends on the database and whether every access goes through some shared application code or not. Which in turn depends strongly on what kind of application we’re talking about.

                                In general I think there should be a “service responsible for the model” that has a well-defined API and validates the coherence of everything. Whether the code that validates runs inside the database, outside, or a bit of both is irrelevant.

                                1. 2

                                  Constraints in database also protect against someone connecting to the database server “just to fix a few things…”

                                2. 2

                                  Sometimes it is useful or even required from the performance point of view to have computation near the data just to not transfer it through the network - not talking about data validation but, for example, reporting. There are other solutions to reporting problems but most straightforward and requiring often less code is to just run a stored procedure. With all of the drawbacks of stored procedures.

                                  1. 2

                                    Sometimes it is useful or even required from the performance point of view to have computation near the data just to not transfer it through the network

                                    This is a really good point on data locality. If your database server’s hardware specs are significantly better than your app servers’, then this could help improve performance a lot.

                                  2. 2

                                    Q1: hard question. Basically use heuristics to find the best fracture planes depending on team organization/ownership + understanding of the solution + maintenance burden + coupling + technical constraints.

                                    Depending on what you decide to implement in the database, it will be able (or not) to scale computations over data. Having remote stateless applications allows you to scale computations at the expense of bandwidth. My 2 cents in 2 minutes.

                                    Q2: https://rob.conery.io/2015/02/21/its-time-to-get-over-that-stored-procedure-aversion-you-have/

                                    1. 1

                                      Basically use heuristics to find the best fracture planes depending on team organization/ownership + understanding of the solution + maintenance burden + coupling + technical constraints.

                                      A well-articulated and succinct outline of many of the variables involved. Thanks for the stored procedure article.

                                    2. 2

                                      Q1: It’s a fine line indeed, and depends on your tooling, team knowledge and willingness to tolerate data inconsistencies.

                                      For example, you could validate the format of an email address, either with a stored procedure (in general these are somewhat painful because every update of the code requires adding a migration) or with a regex CHECK constraint. But then, are you able to set it up so that this results in a nice error message for your users? Generally, the database returns only the first error it encounters, and often this is tricky to catch nicely because all user-defined errors are returned with the same code.

                                      Typically you want to validate forms/entities all at once, returning all errors. This means you’d have duplicate code if you also want to enforce it in the database. This means you can’t really have a single point of truth (unless you opt to use stored procedures in your programming language of choice, or generate code and DB validations from some sort of DSL, but for many teams that’s a bridge too far), which means the definitions might diverge. If they diverge, either the frontend validation is too aggressive or the database validation is, which means your users may face nasty error pages.
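A sketch of that duplication, with SQLite standing in for the database (table names and messages are made up): the app-layer validator can return every error in friendly wording, while the CHECK constraint acts as the backstop and can only yield a generic message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Crude backstop constraint: at least one character on each side of '@'.
conn.execute("""
    CREATE TABLE users (
        email TEXT NOT NULL CHECK (email LIKE '%_@_%')
    )
""")

# App-layer copy of the rule: checks every field and phrases errors
# for humans, all at once.
def validate(email: str) -> list[str]:
    errors = []
    if "@" not in email:
        errors.append("Email must contain an '@' sign.")
    return errors

def create_user(email: str) -> list[str]:
    errors = validate(email)
    if errors:
        return errors
    try:
        conn.execute("INSERT INTO users VALUES (?)", (email,))
        return []
    except sqlite3.IntegrityError:
        # The DB reports only the first violated constraint, with no
        # field-level detail, so all we can offer is a generic message.
        return ["Something was wrong with that record."]

print(create_user("user@example.com"))  # []
print(create_user("nope"))              # ["Email must contain an '@' sign."]
```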

                                      In general, the more important it is that a certain type of data is consistent, the more necessary it is to move this validation to the database. If you deal with a particularly tricky piece of business logic that is easy to mess up, or if the cost of messing up is particularly high, you want to make sure there are some safety “guardrails” in the database.

                                      Of course, performance may also be a concern: in the system I’m working on at $DAYJOB, we originally had a lot of safety cross-checks as constraint triggers, to ensure we wouldn’t get any dangling records (there’s a main table that has to have a record, so if a side table had a record but the main table didn’t, we want to get notified of it). This was extremely helpful in the beginning, as we were laying the groundwork / infrastructure of the codebase. Initially we’d mess up quite often. But later, we almost never saw these errors anymore, so we had more faith in the code. After the initial phases, when the system went live, it turned out that these triggers were a big bottleneck, and we decided to get rid of them.

                                      One sort of “pattern” I’ve found is helpful in case of performance problems: instead of simply dropping the guardrails, turn them into a daily/hourly/whatever task which performs a “consistency sanity check” and alerts the relevant team when an inconsistency is detected. This is a nice compromise: you keep the checks and get the performance back. Unfortunately, because they’re delayed, you won’t get a nice stack trace and extra context about how the situation came about. So it can be good if you run the task often enough that you have a fighting chance of figuring out which action created the mess.
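The periodic sanity check might look like this (a sketch with SQLite; the table names are hypothetical): a LEFT JOIN that surfaces side-table rows whose main-table row is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_records (id INTEGER PRIMARY KEY);
    CREATE TABLE side_records (id INTEGER PRIMARY KEY, main_id INTEGER);
    INSERT INTO main_records VALUES (1);
    INSERT INTO side_records VALUES (10, 1);  -- fine
    INSERT INTO side_records VALUES (11, 2);  -- dangling: no main record
""")

# Run from cron/CI instead of as a per-write trigger: find side rows
# whose main row is missing, and alert on any hits.
dangling = conn.execute("""
    SELECT s.id FROM side_records s
    LEFT JOIN main_records m ON m.id = s.main_id
    WHERE m.id IS NULL
""").fetchall()

print(dangling)  # [(11,)]
```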

                                      In the system I described above, we didn’t only create such a task, but also wrapped our unit tests in a post-check to validate the internal consistency. This means if someone broke something or wrote a new feature which caused an inconsistency to occur, they’d know when the CI pipeline failed, so we wouldn’t even be merging such broken code.

                                      Finally, when in doubt, put it in the database. If it turns out to be a hassle for whatever reason, you can always move it back into the application. If you have it only in the application, and the validation somehow gets bypassed (through manual DB manipulation, faulty queries or the code somehow gets disabled), you’re hosed.

                                      Q2: I don’t remember any useful musings on this topic. Maybe I’ll write up something if I manage to summon up enough round tuits.

                                      1. 1

                                        My primary resource is my experience as an SQL developer, DB architect and a DBA. From that I developed the opinion that you should constrain each datum on all levels, for example:

                                        1. on the interface level, for better user experience (JS checking a zip code)
                                        2. on the application code level, to prevent maliciously crafted application-database communication
                                        3. on the database level, because the developed application is not the only thing talking to your database.

                                        Only the first two points never required any explanation. I think the unconvinced always take the single point of view that those are a stack of layers and if you guarantee something on one layer, you don’t have to and you shouldn’t repeat the logic on the layer below, because it will trickle down. This is obviously not true for something like UI code and application code of a web application, but harder to argue about when talking about something like an API gateway. So maybe we shouldn’t be calling them layers.

                                        1. 1

                                          I always err on the side of more in the DB. For example, I just rolled out a pay-for-overages system at work (a phone provider) which is implemented completely in the DB. This is great because it means the systems that handle calls can just do their job and not know about transactions, user settings, etc. They just insert a call log and the DB makes sure they get billed.

                                          1. 1

                                            The mental distinction I use is: is the database acting as the persistence layer for one service, or is it a shared service in its own right that people and programs access?

                                            The rich constraint language in SQL and many of the features like access control were designed with the second case in mind, when you would have a database at a company that many users would log into and various programs would access.

                                            If you’re using a database as a persistence layer for a single program, I would do the vast majority of your validation beyond basic schema (fixed columns and types) in the application layer.

                                            1. 1

                                              My view seems a bit different than what’s already here so: choose one or the other. There’s nothing wrong about putting all the logic in the app. There’s nothing wrong about having the app interface with the database through views and stored procedures that do all the business logic. The pain arrives with trying to find some threshold / splitting the responsibility in an arbitrary way that’s more justified to one person than another. Choose one, be consistent and you’ll be fine.

                                              1. 1

                                                I think you can’t go terribly wrong with either choice, as you mentioned; some other responses in here did mention interesting use cases for putting it in the DB layer (but again, context informs so much).

                                                Choose one, be consistent and you’ll be fine.

                                                This is a good reminder ^

                                              2. 1

                                                Same question everyone seemed to be asking 40 years ago when I was starting in IT. I suggest that problem domain logic (structure and constraints) be separate from both persistence and user interface. At question is, do you know how, without that logic devolving into a big ball of mud as complexity grows?