My advice from painful experience: Do not do this.
Keep code in your git repo in, hopefully, a single language. Keep your data in a database. Try not to mix the two. Then you can definitively answer questions like “What version of the app is running right now?” Yes it’s possible to version the code in the database via migrations, but why? The only true upside I’ve ever seen is performance, which is a valid one, but reserve this for critical sections that are proven to be too slow.
There are (rare) cases where it’s not only faster but also clearer - when you are correlating data at different levels of rollup at the same time.
For instance I have an app that tracks where in a store stock is kept.
When taking an order, you want to know how much stock the whole store has (minus outstanding orders). That’s a horrendous thing to compute application-side (a 7-way join where N+1 queries are very hard to avoid). The equivalent SQL is quite tidy and readable.
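For illustration, a rough sketch of the kind of rollup query meant here; every table and column name below is invented, not taken from the app being described:

```sql
-- Hypothetical schema: stock_locations holds per-shelf quantities,
-- orders/order_items hold what has been promised but not yet picked.
WITH on_hand AS (            -- stock summed across every location in the store
  SELECT store_id, product_id, SUM(quantity) AS qty
  FROM stock_locations
  GROUP BY store_id, product_id
),
reserved AS (                -- stock already promised to outstanding orders
  SELECT o.store_id, oi.product_id, SUM(oi.quantity) AS qty
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.id
  WHERE o.status = 'outstanding'
  GROUP BY o.store_id, oi.product_id
)
SELECT h.store_id,
       h.product_id,
       h.qty - COALESCE(r.qty, 0) AS available
FROM on_hand h
LEFT JOIN reserved r USING (store_id, product_id);
```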
The other upside is to have a single image of code/data in production. Migrations and deployment of new servers are as easy as copying the database to the new server.
In some industries, like payroll, this facilitates migration of client data between multiple providers.
My advice, from someone who thought this was a bad idea over a decade ago but has now been doing it for everything for the last decade (or so): learn what you were doing wrong, because you’re probably still doing it.
I don’t agree with any of your suggestions. This approach is faster, more secure, easier to audit and review, and easier to develop and scale, in every case where you’re not doing it Wrong™. So stop doing it Wrong™ instead of figuring out how to make the Wrong™ thing work.
I agree (I think; this post took me four tries to read and I’m still only 90% sure I understood), with the proviso that there are very real advantages to being able to use existing tools (ex: rails), even if they don’t support the Right way to do some bits.
If you’re a staff engineer at a bigco you can fix the framework (and probably get it merged upstream), but in agency/startup land you definitely do not have time for that.
Since this original post, four years ago, I’ve posted a public example of a simple self-contained shopping cart, all in PostgreSQL:
https://github.com/sivers/store
It answers (in both prose and code) some of the questions here like how I do tests.
Now I also do something (not shown in the Github) that works great:
Put your database tables in one schema (I’ll call it “schema1” here)
Put your functions in another schema (“schema2”)
Be careful with schema1 since it has all your data, but…
“drop schema2 cascade; create schema2; load functions.sql” is a great way to quickly load all of your functions to their most current state as you’re programming.
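As a concrete sketch of that reload cycle, run from psql (this assumes functions.sql creates everything inside schema2, e.g. via a SET search_path TO schema2 at the top):

```sql
-- Data lives in schema1 and is never touched here; schema2 holds only functions,
-- so dropping and re-creating it is cheap and safe.
DROP SCHEMA IF EXISTS schema2 CASCADE;
CREATE SCHEMA schema2;
\i functions.sql   -- psql meta-command: replay the whole functions file
```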
To all the people saying, “We tried this 20 years ago and it hurt!”, I don’t know your situations, but I run on a bare-metal OpenBSD server where I’m the only user, have no collaborators, versions, migrations, or any other complexities that most people working in teams and companies have, so to each his own.
Working this way daily for 5 years has been wonderful with no downsides. I absolutely love the encapsulation of it. It’s simplified my code and maintenance a lot. And I like the freedom it gives me to easily switch all my surrounding code to Elixir, Nim, Racket, or whatever else, because the inner data-logic code is inside the database.
P.S. pushcx posting this four years ago was how I found lobste.rs in the first place. lobste.rs is one of only three bookmarks in my browser, the only ones I read every day. I’m really thankful for you guys.
— Derek
@sivers how do you version things like functions and views and the like? I know you can use database migrations and whatnot, but it always seems like a hassle, and it’s kinda hard to roll back changes unless you’re really careful.
That’s basically the thing that’s stopping me from using too much logic in databases, because in principle I agree with your points in the article and that it makes things easier, but it seems to me that the whole tooling surrounding it makes everything much harder again.
If you are in Rails land and using structure.sql instead of schema.rb, you are keeping track of that anyway. You can also use something like Scenic. It works pretty well: https://github.com/scenic-views/scenic.
I have tried a few options and now include a short hash of the function body in the function name.
Hooking that up took 30-40 lines of Ruby and ensures I can’t accidentally call the wrong version.
I’m not sure I am a fan of 100% business logic in the database, but I definitely am for letting the database do what it is good at: working with and transforming data.
If your app must support multiple databases then moving some logic/functions/using some feature/etc. may not be a good idea. But, if you are all in with your chosen database, then might as well take advantage. Postgres gives you essentially a layer of caching for free (materialized views).
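For example, a minimal materialized-view sketch of that “free caching layer” idea (the view name and underlying query are invented for illustration):

```sql
CREATE MATERIALIZED VIEW daily_sales AS
  SELECT date_trunc('day', ordered_at) AS day,
         SUM(total_cents)              AS revenue_cents
  FROM orders
  GROUP BY 1;

CREATE UNIQUE INDEX ON daily_sales (day);   -- required for CONCURRENTLY below

-- Re-run the underlying query on whatever schedule suits you; CONCURRENTLY
-- avoids blocking readers while the view is rebuilt.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;
```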
If you are stuck with a multi-version problem of executing apps against the DB, where you don’t control the deployment of the app(s) using the DB, then executing code in your DB gets a lot trickier. You can do what @danielrheath does, or you can add _v0, _v1, _v2, etc. to the function names, or use other schemes. The trade-offs are definitely not as much in your favour when you run into this problem.
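A sketch of that suffix-versioning scheme; the cart function and its columns are invented (assume cart_items has numeric price and discount columns):

```sql
CREATE FUNCTION cart_total_v1(p_cart_id bigint) RETURNS numeric
LANGUAGE sql AS $$
  SELECT COALESCE(SUM(price), 0) FROM cart_items WHERE cart_id = p_cart_id
$$;

-- v2 changes behaviour (here: applies the discount) without breaking v1 callers;
-- old app versions keep calling _v1 until they are upgraded, then _v1 is dropped.
CREATE FUNCTION cart_total_v2(p_cart_id bigint) RETURNS numeric
LANGUAGE sql AS $$
  SELECT COALESCE(SUM(price * (1 - discount)), 0) FROM cart_items WHERE cart_id = p_cart_id
$$;
```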
For those of us who do control the apps running against the DB and can deploy in sync with DB changes, it’s way, way easier, and it’s definitely more of a sweet spot for executing code in the DB. This is what we do, and it’s very easy. We do a rolling release schedule, from development to nightly to production, so we have a new production release with any DB schema changes and app changes rolled out every night (and by hand in emergency situations, which are down to maybe one every few years now).
Regardless, you should use a schema version control system so you can handle rollbacks as needed. We use Liquibase (and are happy with it), but there are many solutions out there.
How do you write tests for this?
The same way you write tests for anything else?
I don’t understand why this is a question. It’s a deterministic process with visible inputs and outputs.
Also, because the IO is so well documented and stable in the case of PostgreSQL, it’s arguably EASIER to test, as the input and output are very stable (basically never change) and easy to reason about.
I mean, it is slightly more awkward than plain objects. It means there’s more IPC in your test suite, which isn’t great for speed.
I actually just test everything against the database these days; the entire test suite takes about 1.5 seconds to run (or 19 seconds with PostgreSQL instead of SQLite, mainly because I haven’t optimized that very well; it runs createdb and dropdb on every test whereas SQLite just uses a :memory: DB; not sure how to do something like that in PostgreSQL).
I know some people don’t like this for various reasons (“not true unit tests”), which is perfectly reasonable (I just don’t agree), but performance doesn’t seem like a big issue to me.
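One way to avoid paying for a full createdb plus schema load on every test (an assumption on my part, not something from the thread): build the schema once in a template database and clone it per test, which PostgreSQL does as a cheap file-level copy:

```sql
-- One-time setup: create a template, then load schema.sql / functions.sql into it.
CREATE DATABASE myapp_test_template;

-- Per test (or per test file): cloning the template skips all the DDL.
-- (The template must have no open connections while it is being copied.)
CREATE DATABASE myapp_test_1 TEMPLATE myapp_test_template;
DROP DATABASE myapp_test_1;
```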
I had the same question as above. My thought process:
If the idea is to put all the logic in the database because it is language agnostic then one would think that testing would be language agnostic. Except I don’t know of a testing suite in pure SQL.
Plus, you can’t really do unit tests. Everything is an integration test. (Unless you have an embedded db?)
The distinction between unit tests and integration tests is not well defined enough to be useful in this context.
If the unit being tested is a stored procedure, why does calling it from another language stop it being a unit?
I put logic in the database when it’s heavy on joins and aggregation.
I like the Michael Feathers definition of a unit test. Small and fast; it doesn’t call a database, communicate across the network, touch the file system or change the environment.
From that definition, I could see it being a unit test if it was being called from within the database. But I don’t know how I would want to do that.
You are right, though; there is no reason why it couldn’t be called from another language. Don’t let perfect be the enemy of the good.
To be clear, I work in an environment that uses a lot of stored procedures so I can see the benefit. I just want to know what other people are doing for their testing.
Ours are mostly tested through the API tests, which doesn’t feel direct enough for me. Plus my manager is a SQL guy. He does a bit of nodejs, but having a pure SQL option would be a selling point for him.
One approach I have seen is to have the tests in named stored procedures, kept in the same file. That way you can feed the whole file into psql to create the function, and it’ll fail (roll back) if the tests don’t pass (some code-generation pass may be required to set up the schema, etc.).
I like this idea. I’ll give this a shot. Thanks!
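For reference, a minimal sketch of that “function plus its tests in one file” approach; the function is invented for illustration:

```sql
-- functions_with_tests.sql: definition and test together; run with
--   psql --set ON_ERROR_STOP=1 -f functions_with_tests.sql
-- A failed test raises an exception, so the COMMIT never happens.
BEGIN;

CREATE OR REPLACE FUNCTION add_cents(a int, b int) RETURNS int
LANGUAGE sql AS $$ SELECT a + b $$;

DO $$
BEGIN
  IF add_cents(2, 2) <> 4 THEN
    RAISE EXCEPTION 'add_cents(2, 2) should be 4';
  END IF;
END $$;

COMMIT;
```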
I asked the question because it is not addressed in the blog post. Do you load the JavaScript in units and emulate the inputs or do you write integration tests where postgresql is loaded with the stored procedures?
While I sometimes agree that some stuff is easier to do in the DB, I do not agree with the article.
Constraints are OK, but that does not mean you should not also validate data in the application; they are just an additional check layer. However, your check for the email address is wrong, and such complex checks should IMHO be handled outside the DB.
Instead of lowercasing the email in the DB (which can be wrong, as the user part of an email address can be case sensitive), you should use citext, which handles case-insensitive comparisons for you while preserving the original casing in storage.
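A minimal citext sketch (table and column names assumed for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE users (
  id    bigserial PRIMARY KEY,
  email citext NOT NULL UNIQUE   -- compares case-insensitively, stores original casing
);

-- 'Jane@Example.COM' and 'jane@example.com' now compare equal (and collide on the
-- UNIQUE index), while SELECT still returns whatever casing was inserted.
```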
Query functions have the problem that versioning them is hard and many tools do not support them. You can almost always achieve the same thing within your application, where it will be much clearer and easier to version.
Views are nice, and I agree that these should be used more.
For God’s sake, never, ever use pgcrypto or similar functions for handling passwords. In case of a screwup, it is much easier to update your app than your DB.
I’d love to see more reasoning behind these. For instance:
Constraints do not integrate with the Rails error mechanism, so you need to also validate or you get error pages instead of validation messages.
Mutation functions do not play nice with the Rails query cache.
Switching to citext isn’t an option without downtime; my users table is way too big and busy. A trigger to downcase can be installed instantly.
I think you could create the new column under a different name and make it NULLable, then copy over the old one’s data once.
Then, in one transaction, copy over everything that’s been added in the meantime, add a NOT NULL constraint, drop the old column and rename the new one.
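Roughly, that plan might look like this, assuming a users.email column moving to citext (names invented; batching the backfill and the locks taken on a big, busy table are still the hard part):

```sql
-- Step 1: add the new column as nullable and backfill it (in batches on a big table).
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ADD COLUMN email_ci citext;
UPDATE users SET email_ci = email WHERE email_ci IS NULL;

-- Step 2: the final swap in one short transaction.
BEGIN;
UPDATE users SET email_ci = email
  WHERE email_ci::text IS DISTINCT FROM email;   -- catch rows changed since the backfill
ALTER TABLE users ALTER COLUMN email_ci SET NOT NULL;
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_ci TO email;
COMMIT;
```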
How do you determine this in a large, busy system?
I can think of a couple of options:
Parse the transaction stream (either from disk or set up something that can talk the replication protocol)
Install a trigger to keep a last_modified column up-to-date (a sketch follows below).
Neither is super appealing compared to “leave the working implementation in place”.
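The trigger option from the list above might look something like this (names invented; EXECUTE FUNCTION needs PostgreSQL 11+, older versions use EXECUTE PROCEDURE):

```sql
ALTER TABLE users ADD COLUMN last_modified timestamptz NOT NULL DEFAULT now();

CREATE OR REPLACE FUNCTION touch_last_modified() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  NEW.last_modified := now();   -- stamp every insert/update with the current time
  RETURN NEW;
END $$;

CREATE TRIGGER users_touch_last_modified
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION touch_last_modified();
```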
Not to mention, once you drop the column and rename the new one, you’ll need to do vacuum full analyze users as there won’t be stats for the new column. That action also locks the table for quite a few seconds and slows everything else down.
Constraints do not integrate with the Rails error mechanism, so you need to also validate or you get error pages instead of validation messages.
That is why I said it can be used as an additional layer: some checks are much harder in application code (for example, overlaps), and some are much easier there (emails).
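An example of the “overlaps” kind of check that is easy as a constraint but awkward in application code (table and columns invented for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;   -- needed for the equality part of the index

-- No two bookings for the same room may overlap in time.
CREATE TABLE bookings (
  room_id int       NOT NULL,
  during  tstzrange NOT NULL,
  EXCLUDE USING gist (room_id WITH =, during WITH &&)
);
```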
Switching to citext isn’t an option without downtime; my users table is way too big and busy. A trigger to downcase can be installed instantly.
You can do that gradually, as citext is compatible with the existing string type. Also, I am not sure whether that requires a table rewrite, as I have never done it. A downcasing trigger will have exactly the same problems as citext. If there is any place for triggers, it is the migration period where you are renaming columns and want zero-downtime migrations.
Hmm. I wonder if I could actually integrate them (so that you got a validation error instead of a failure). Would still need to blow up if you called save!.
In Ecto (an Elixir library) you can do so. But still, it is often easier to do a check in application code than in the DB. Also, there is a very important point: it is easier to update a check in the application than in the DB, and updating checks is more important than writing them.
How much easier? Constraints don’t require downtime or anything - is it just that you have to have a db change recorded vs a text file change?
database functions and triggers and constraints are nice, but this post fails to discuss any of the tradeoffs being made.
it’s not really simpler, because the logic is mixed between being in your app layer and your database layer. If you want to, for example, grep around your codebase to see where something is happening, you now have to search across two different compute environments.
keeping your procedures in sync between dev/prod/staging is an added complication.
you still have to figure out how to get these definitions into version control or they’re effectively undocumented.
if you’re unit testing these things, you’re probably doing from the app layer and not from SQL itself, meaning the thing you’re testing and the tests are expressed in different languages.
Those things are pretty navigable. The big hurdle for a lot of people is that burning DB node resources means the DB is likely to become the bottleneck earlier. Scaling a database is harder than scaling a stateless HTTP layer in 90% of projects.
this style was a lot more common years ago, but a lot of people have been burned by it and have turned to using the database just for its indexing, durability, and replication properties, which postgres does extremely well and is very difficult to get right on your own. With the email example, it’s … not really all that tough to check that a string matches a regex in the app layer.
Since a JSON API — a hash/map of strings — is often the eventual interface, it’s even more reason to skip the abstractions and work with values directly.
This is a pretty good way to accidentally leak secrets.
What do you mean here? Shouldn’t you know exactly what a JSON object returns from a deterministic db function?
I mean it’s common for a web application to need to be able to provide both internal and external representations of data, e.g., the password field of a user object. Stuffing all this logic into a database would be both a maintenance and operational nightmare.
I mean it’s common for a web application to need to be able to provide both internal and external representations of data, e.g., the password field of a user object
The real mistake is commingling public and private data together.
Don’t do that.
Just because a pattern is common doesn’t make it right.
Stuffing all this logic into a database would be both a maintenance and operational nightmare.
Funny. I think having global variables that are UPDATE and INSERTed all over your code base is worse.
The best possible scenario is that you don’t have to learn what kind of security and maintenance features your database has and you can just treat it as a serialisation point. That’s the best: that you don’t have to learn what code already exists (so you get to invent it yourself!) and it’s slower.
What usually happens is someone makes a change, adds a private element to a table that was previously completely public and forgets to review every update/insert/select in their codebase (since they’re all over the codebase, and possibly spanning multiple concurrent branches – impossible to get right for big teams!) so you get security vulnerabilities. Almost certainly.
Or they try to implement their own security features (like putting password fields in “user” objects) and get them wrong in subtle ways, then blame their tools.
Argh, yes. password (or encrypted_password) does not belong in the user table. It belongs in an authentication_methods table (a rough sketch follows the list below).
Why? It lets you handle situations like:
Adding (multiple?) API keys that act on behalf of a user (ex: fastmail has ‘app passwords’ which are functionally per-mail-client api keys).
A customer complains that their account was taken over by an attacker who managed to reset their password. They’ve taken it back over and want to know what the attacker did. Your audit trail is linked to the password, and reveals which actions were taken by the hacker and which were taken by the customer as they used different passwords to authenticate.
Was this failed password attempt because they entered their old password? If so, tell them that their password has changed and to contact support if this was not them!
Was this new password the same as any old (now-inactive) password? One that might have been leaked? Pick something else!
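As promised above, a rough sketch of what such a table might look like; every name here is invented for illustration, not taken from anyone’s actual schema:

```sql
-- One row per credential rather than a password column on users.
CREATE TABLE authentication_methods (
  id            bigserial   PRIMARY KEY,
  user_id       bigint      NOT NULL REFERENCES users (id),
  kind          text        NOT NULL,   -- 'password', 'app_password', 'api_key', ...
  secret_digest text        NOT NULL,   -- hash of the password or key, never the plaintext
  label         text,                   -- e.g. which mail client an app password is for
  created_at    timestamptz NOT NULL DEFAULT now(),
  revoked_at    timestamptz             -- NULL while the credential is still active
);

-- An audit trail can then reference the specific credential that was used, which is
-- what makes "which actions were the attacker's?" answerable after a takeover.
```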
Why not just go back to all logic in database Stored Procedures? Have we as software developers learned nothing?
Yes, I’m not sure there’s much new here. Maybe if coming from a less capable db?