This is a good start. Things are nice and simple for this outer layer of git.
Once you start looking into packfiles, bitmap indexes, the commit-graph, the multi-pack-index…, you will start to see that the advanced data structures in modern Git look very much like a database. And that's where we will be heading.
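To make that concrete, the commit-graph file really does read like a small database file: a magic number, a fixed header, then a chunk table of indexes. A minimal reader sketch, assuming you are at the root of a repo where `git commit-graph write --reachable` has been run (layout per Git's gitformat-commit-graph documentation):

```python
import struct

# Peek at Git's commit-graph file to see the "database file" shape directly.
path = ".git/objects/info/commit-graph"  # assumes cwd is the repo root

with open(path, "rb") as f:
    data = f.read()

assert data[:4] == b"CGPH", "not a commit-graph file"
version, hash_version, num_chunks, num_base = data[4], data[5], data[6], data[7]
print(f"version={version}, hash={'SHA-1' if hash_version == 1 else 'SHA-256'}, chunks={num_chunks}")

# Header is followed by a chunk lookup table: (num_chunks + 1) rows of
# (4-byte chunk id, 8-byte big-endian offset) -- fanout index, OID lookup,
# commit data, etc. The final row is a terminator. It reads like a page
# directory in a database file.
for i in range(num_chunks + 1):
    entry = data[8 + 12 * i : 8 + 12 * (i + 1)]
    chunk_id = entry[:4]
    (offset,) = struct.unpack(">Q", entry[4:])
    print(chunk_id, offset)
```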
Yes! When I read this, I had this thought sequence:
You could have a git implementation that uses a database. In fact, libgit2 in theory makes this easy.
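libgit2's object database is pluggable (the backend interface in `git2/sys/odb_backend.h` lets you supply your own read/write/exists functions). As a concept-only sketch of what such a backend stores, and not the actual libgit2 API, here is a Git-style content-addressed object store on SQLite in plain Python:

```python
import hashlib
import sqlite3
import zlib

# Concept sketch: a content-addressed object store like Git's, backed by
# SQLite. Not the libgit2 API -- just the idea a custom ODB backend implements.
db = sqlite3.connect("objects.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS objects (oid TEXT PRIMARY KEY, data BLOB)")

def write_object(obj_type: str, payload: bytes) -> str:
    # Git hashes "<type> <size>\0<payload>" and stores it compressed.
    framed = f"{obj_type} {len(payload)}\0".encode() + payload
    oid = hashlib.sha1(framed).hexdigest()
    db.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)", (oid, zlib.compress(framed)))
    db.commit()
    return oid

def read_object(oid: str) -> bytes:
    row = db.execute("SELECT data FROM objects WHERE oid = ?", (oid,)).fetchone()
    return zlib.decompress(row[0])

oid = write_object("blob", b"hello world\n")
print(oid)               # same id `git hash-object` would print for this blob
print(read_object(oid))  # b'blob 12\x00hello world\n'
```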
Fossil makes a number of design decisions in the details of how it stores source in SQLite that I wouldn't have made at the time, but yeah, I'm fully with you. I remember from working on Kiln how many operations (for both Git and Hg) we sped up by storing commit/changeset info in a DB instead of directly using any of the native data structures. In quite a few cases (especially for 2010-era Git), operations that took multiple seconds could be reduced to single-digit milliseconds, and at a pretty minimal amortized cost to keep the data up to date.
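That pattern is easy to sketch: mirror commit metadata into SQLite once, keep it fresh on fetch, and answer history questions with SQL. A minimal version using plain `git log` and the standard-library sqlite3 module; the table layout is made up for illustration:

```python
import sqlite3
import subprocess

# Sketch of the "cache commit metadata in a DB" pattern described above.
# Populate once (then incrementally on fetch); query in milliseconds.
log = subprocess.run(
    ["git", "log", "--pretty=format:%H%x1f%an%x1f%at%x1f%s"],
    capture_output=True, text=True, check=True,
).stdout

db = sqlite3.connect("commits.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS commits
              (sha TEXT PRIMARY KEY, author TEXT, ts INTEGER, subject TEXT)""")
rows = (line.split("\x1f") for line in log.splitlines() if line)
db.executemany("INSERT OR REPLACE INTO commits VALUES (?, ?, ?, ?)", rows)
db.commit()

# Questions that are awkward to ask the raw object store become one query:
for author, n in db.execute(
    "SELECT author, COUNT(*) FROM commits GROUP BY author ORDER BY 2 DESC LIMIT 5"
):
    print(author, n)
```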
And before either Fossil or Git, there was Monotone, which also used SQLite.
And what a database! In terms of speed, size, and reliability, it really is phenomenally good at answering the queries it's designed for.
I’m increasingly leaning on Git for all sorts of purposes beyond storing code and it keeps impressing me with how capable it is - see https://simonwillison.net/2020/Oct/9/git-scraping/ for example.
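The core of git scraping is small enough to sketch in full: fetch a resource, overwrite a tracked file, and commit only if something changed. The URL and filename below are placeholders:

```python
import subprocess
import urllib.request

# Git-scraping sketch: snapshot a changing resource into git history.
# Run from inside a git repository; URL and filename are placeholders.
URL = "https://example.com/data.json"

with urllib.request.urlopen(URL) as resp:
    body = resp.read()

with open("data.json", "wb") as f:
    f.write(body)

subprocess.run(["git", "add", "data.json"], check=True)
# `git diff --cached --quiet` exits non-zero when something is staged.
if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
    subprocess.run(["git", "commit", "-m", "Latest data snapshot"], check=True)
```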
I would say it’s good up until a certain point.
I have seen different use cases where git just does not work that well:

- Game development with artist workflows
- Big monorepos

Some design choices made in Git's early days can be really hard to replace with a better solution. For example: SHA-1 is insecure, so Git has to ship the SHA1DC workaround to detect collision attacks, and the roadmap to migrate to SHA-256 is effectively frozen while many dev tools have already moved on from SHA-256 to BLAKE3. Another example is the inefficiency of the zlib compression used in packfiles… where Meta's Sapling SCM has moved on to Zstd + custom dictionaries.
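One way to see why the hash migration is so hard: an object's ID is the hash of its framed bytes, and commits and trees refer to other objects by ID, so changing the hash function invalidates every ID transitively. A quick illustration of Git's blob framing; a SHA-256 repo (e.g. one created with `git init --object-format=sha256`) frames objects the same way but with the new hash:

```python
import hashlib

# Why swapping the hash is invasive: the object ID *is* the content hash,
# and parents/trees are referenced by ID, so every ID changes transitively.
payload = b"hello world\n"
framed = b"blob %d\x00" % len(payload) + payload

print("SHA-1 id:  ", hashlib.sha1(framed).hexdigest())
print("SHA-256 id:", hashlib.sha256(framed).hexdigest())
```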
Oh and there are better ways for you to query data out of this DB as well. For example: https://github.com/arxanas/git-branchless/wiki/Reference:-Revsets
I know that many people prefer Perforce for version control of large binary files. Is that what you’re talking about here? Or is there something else about artist workflows that doesn’t fit with git?
Oh, I started working through Building Git a couple of years ago and got sidetracked; I should get back into it.