The first one is reading the complete, up-to-date database schema. I’d constantly fire up a Postgres GUI (I use TablePlus) to look at the structure of the database, since it was hard to navigate old migrations and piece together what their end result was.
I think people are supposed to look at the models in the application layer to see the end result. And for the people who don’t use ORMs, committing a full schema.sql file to the repository (in addition to any migrations) does the job. This has been my approach for the last couple of years.
The second one revolves around searching through code. Working on the new codebase involved a lot of searching all around the code to understand the structure of the application. Function names, modules, database columns, and what have you. However, database columns stuck with me: I’d always find a bunch of misleading search results in old migrations. For example, I’d see a few results for a column name that was created, then modified, and then dropped.
Searching the migrations is definitely a bad idea. IMO what needs to be searched should either be the models (in case of ORM use) or a full SQL schema committed to the repo. And if neither exists, a database dump could be used. Postgres example: pg_dump --schema-only > schema.sql
So I started wondering: why do we keep old migrations around? Don’t get me wrong, I know why we write migrations in the first place. They’re great, no doubt. But why not throw them away after they’ve done their job? How many times did you roll back more than one migration? I have never done that. It’s hard to imagine rolling back many changes, especially when they involve not only the schema but also the data in the database itself. There must be a better way.
This is a dangerous idea =) Some warnings:
How many times did you roll back more than one migration?
You can’t rely on that kind of argument for a production system. Things do break, and you always need convenient rollback mechanisms. You can’t take chances.
It’s hard to imagine rolling back many changes, especially when they involve not only the schema but also the data in the database itself.
Proper migrations are supposed to migrate the data as well, in addition to the schema. It doesn’t always work well for forward migrations, but it always should for rollbacks.
Migrations help during development as well. Imagine a repository with 3+ active feature branches. If most of them involve database migrations, you will need to migrate back and forth pretty frequently when developing the features in parallel. I’ve been through this scenario many times.
At least in Ecto, the models in the application layer can (and frankly probably should in many cases) differ significantly from the table layout. And by only keeping the relevant migrations in git, you get something really clever: it encodes which migration was necessary for that commit. That means you can create a custom git command that traverses not only versions of the code, but also versions of the database schema. That actually makes the “switching between 3+ active feature branches” process dramatically easier, and it’s definitely something I’d set up in conjunction with a git hook so that each migration only ever lives in one commit.
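To make that concrete, here is a minimal sketch of the post-checkout half of such a setup, assuming a standard Ecto project layout (priv/repo/migrations) and the stock mix tasks; migrations that only exist on the old branch would have to be rolled back before the switch, so a post-checkout hook can only warn about them:

    #!/bin/sh
    # .git/hooks/post-checkout (sketch only); $1 = old HEAD, $2 = new HEAD, $3 = 1 for branch checkouts
    [ "$3" = "1" ] || exit 0
    # Migrations present on the old branch but missing on the new one would need a
    # rollback *before* the checkout; here we can only point them out.
    removed=$(git diff --name-only --diff-filter=D "$1" "$2" -- priv/repo/migrations)
    [ -n "$removed" ] && echo "warning: migrations removed by this checkout:" && echo "$removed"
    # Apply whatever migrations the new branch adds.
    added=$(git diff --name-only --diff-filter=A "$1" "$2" -- priv/repo/migrations)
    [ -n "$added" ] && mix ecto.migrate
    exit 0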
Something I do in the Laravel world is use the $fillable feature. I’ve seen codebases that don’t use it and allow attributes to be filled dynamically, which means you have no clue what attributes a model actually has, so you have to poke at the db to find out.
I’ve not written much Elixir code, but when I do, I really want to do something similar with Ecto, because it makes life so much easier.
The article mentioned Django’s migration “squashing”, and I think it’s a good middle ground. The way it works is:
Gather up a specified list of separate migration files, and extract the actual operations from them
Put all those operations in order, and examine whether any “cancel out” (e.g., one adds a column and a later one removes the same column). Get rid of those.
Write the remaining operations, in order, to a new single migration file containing some metadata to indicate which ones it takes the place of.
Then the original set of files can be deleted (Django doesn’t automatically do this, but encourages you to do it manually after you verify deployment of the squashed migration file to all your systems).
So you wind up with a single file containing all the historical change operations that are still relevant to your current schema, and which produces your current schema when run.
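For reference, the built-in command that performs the squash is squashmigrations; a hypothetical run against an app called “billing” (app and migration names are made up) could look like:

    python manage.py squashmigrations billing 0015
    # Django writes something like billing/migrations/0001_squashed_0015_....py containing a
    # "replaces" list that names the old files. Once the squashed migration has been applied
    # everywhere, you delete the originals and remove the "replaces" list by hand.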
We have a codebase at work where somebody did this before the current team joined. So, of course, we now have both a massive SQL ball and a bunch of migrations.
If you want to make future maintainers’ lives really annoying for code discovery and refactors, by all means do this.
I argue that this is an Ecto antipattern.
At work we have a CI pipeline that checks out the committed SQL file from the “base branch” of a merge request and applies the migrations from the current branch, then dumps the schema to another SQL file. It then diffs the dumped file to ensure the schema matches the previous schema plus migrations.
Another approach could be to simply generate the SQL file in a commit hook.
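A rough sketch of such a check, assuming Postgres and a committed structure.sql (the file, database, and branch names here are made up):

    git show "origin/$BASE_BRANCH:structure.sql" > base_schema.sql
    createdb schema_check
    psql schema_check < base_schema.sql
    # ...apply this branch's migrations against schema_check with your migration tool...
    pg_dump --schema-only schema_check > migrated.sql
    diff migrated.sql structure.sql    # any difference fails the job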
Having an SQL file makes dumping and loading much faster, and if you have a large test suite, resetting the db to the schema’s version is faster too. Nowadays we no longer do that, though; we work with a template db that gets restored at test start, which is even faster.
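In Postgres, for instance, that can be as simple as creating each test database from a pre-migrated template (names made up):

    # one-time setup: createdb myapp_test_template, then run all migrations against it
    createdb --template=myapp_test_template myapp_test
    # ...run the test suite against myapp_test, then throw it away...
    dropdb myapp_test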
An SQL file is also a lot easier to search through when discovering the schema or quickly grepping for existing indexes, triggers and such. Sure, you can also connect to the db and inspect it that way, but navigating a schema file in a good editor can be more efficient than using DB tools, depending on what you’re doing.
In some cases with ORM-based migrations (say, Django), you might have to squash your migrations if at some point you used a library that provided custom column types that are no longer used. If you didn’t squash your migrations, you’d need to keep that library around as a dependency “forever”.
In the past I’ve also seen Rails projects where people loaded the current version of the models inside Ruby migration scripts in order to make complex data migrations easier. Of course, this breaks as soon as the model is changed in any way that the migrations rely on. In those cases, using and committing a DB dump is also probably the only way to go. From that experience I’ve really come to appreciate how Django instantiates a pseudo-model that only has the in-memory representation of the model at the current point in the migration history, so you can still write idiomatic ORM code to do complex data migrations.
In your specific case, if you really dislike the situation you could probably just drop the SQL file and always apply the migrations from scratch. Or mark it as an ignored file for your editor so it doesn’t show up when grepping.
I wrote something similar for SQLAlchemy / Alembic / Flask in June 2019: https://blog.separateconcerns.com/2019-06-01-truncating-alembic.html
However, as I understand it, there is a huge difference. It looks like Ecto works like Rails, where migrations are the source of truth: you write migrations and it infers the schema. In the Python stack I mentioned it’s the other way around: you edit the current version of the schema in your code and the tool infers the migrations (sometimes you have to help it a little).
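In those tools the loop is roughly: edit the model/table definitions, then ask the tool to diff them against the current database and emit a migration. For example (the migration message is made up):

    alembic revision --autogenerate -m "add email column to users"    # SQLAlchemy / Alembic
    python manage.py makemigrations                                    # Django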
Vaguely related: I wish more git GUIs and tools had the option to show deleted files in a directory. After working with a VCS that did this, it’s amazing how much more palatable it makes deleting files. It is so easy to see the latest version or the commit it was deleted in. Also, ideally links would say “deleted at commit 123” instead of just a 404 with no info.
Of course looking this up in Git is a little expensive (you need to walk the whole history). But caching this is trivial so you only need to do it once.
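Plain git can already get you part of the way there from the command line; a couple of sketches (the paths are made up):

    # every file ever deleted under db/migrate/
    git log --diff-filter=D --name-only --pretty=format: -- db/migrate/ | sort -u
    # the last version of one such file, taken from the parent of the commit that deleted it
    git show "$(git rev-list -1 HEAD -- db/migrate/20190101_add_foo.rb)^:db/migrate/20190101_add_foo.rb"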