In cases where rewriting history is viable, git filter-repo has a --mailmap flag that can use a .mailmap file to modify past commits.
I’ve found this useful for personal projects and teams small enough that asking people to make a fresh clone of the repo isn’t a huge bother (…and honestly, people’s frustrations with git frequently lead to fresh clones anyway).
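For reference, a minimal sketch of that kind of rewrite (the names and the path to the map file are hypothetical):

# .mailmap: each line maps an old identity to the one you want in history
Correct Name <correct@example.org> Old Name <old@example.org>

# rewrite every past commit using that map (everyone then needs a fresh clone)
git filter-repo --mailmap .mailmap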
It would be nice if there were a way to make the original commit hash a parent of the rewritten one so that anyone with a downstream clone can still pull, merge, and so on.
There’s one other place where mailmaps are important: GDPR and other regulatory compliance. Names and email addresses are PII and so you can be required by law to remove them from your git repo. Ideally, you’d do this by putting a UUID, rather than a name and email address, in each commit and then having a mailmap that turns these into human names, so you just need to edit the mailmap. This, unfortunately, requires storing the mailmap outside of the repo, because otherwise you end up with the same problem (you must rewrite history to edit it in the past). I believe Pijul has a good solution to this.
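A sketch of what that could look like, assuming commits are authored under a throwaway pseudonymous address (all names here are hypothetical):

# commits are authored as:  3f2a9c1e <3f2a9c1e@users.invalid>
# the mailmap, kept outside the repo, resolves the pseudonym to a person:
Jane Doe <jane@example.org> <3f2a9c1e@users.invalid>
# deleting that one line forgets the person without rewriting history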
In Pijul all patches are signed, and authors are identified by their signing key. We still need a system for each repository to map keys to identities, but these maps are authenticated by signatures as well.
If you generate a new key per repo, that should be fine. Unfortunately, if the key is used across repos then it’s linkable data and has all of the same GDPR concerns.
Ideally, Pijul would sign each patch with an ephemeral key that it signed with your personal key. You’d then have a per-repo notion of ‘these commits are all signed by the same person’ but could store the signatures of the signatures separately and, if necessary, re-sign all of the commits with a new key to establish an attestation that the repo guarantees that they all came from the same person, but not in a way that identifies that person.
Is it just the KV store that would need to be replaced to self-host with workerd? Part of me is tempted to try to package this for Sandstorm, but any components that aren’t FOSS would of course be blockers.
Kenton was at one point excited about the idea of workers apps on Sandstorm, having started both projects. But up until now I haven’t seen any workers apps that seemed interesting to package.
Durable objects are really important for the Nest, but if you don’t require any distributed setup, they’re just fundamentally another layer of KV stores.
Their platform and setup is not at all how I would have done things, but the main value of CF Workers for me is not the software, it’s the scalability and reliability.
Are there any papers or posts that describe the current Pijul data model? I’d be interested in reading about the underlying format. I know it’s undergone some revisions over the past few years.
we want to make it open source and easy to contribute to
i’m confused about the choice to run this on cloudflare workers. i understand that the author wants the code to be capable of running on more than a single platform. it just feels inevitable that the code gets tied up around proprietary cloudflare features.
my opinion: it’d have been better off as a monolith that runs on any old linux vps, not a bunch of newfangled stuff that targets a proprietary platform.
i like pijul, but this feels a little shortsighted to me
“Shortsighted” is a strong word here, I’ve been running a “your opinion”-style solution for years and know exactly why I can’t make it work.
“Proprietary Cloudflare Workers features” aren’t so unique that we can’t easily replicate them in a day of work (without the open source runtime, it would probably take a few days of work instead).
cloudflare is possibly the biggest threat to the free & open web, and hitching nest to it & claiming it’s for “ease of contribution” feels off optically. i have no doubt that you have good reasons, you’re much smarter than i - but intuitively, it feels wrong, and i believe this perception could be problematic. actually optimizing for contribution would mean making nest simple to run without a cloudflare account.
can i self-host nest without tying it to a proprietary platform? if the answer is no, it doesn’t feel very “open source”, even if the license says so.
I wouldn’t have done this without workerd and miniflare, two open source runtime environments to run Cloudflare Workers scripts outside of Cloudflare.
At a higher level, the biggest threat to the free and open internet isn’t a company providing useful services and infrastructure, it is the absence of political will, public investment and regulations. Europe could have done that, we have some of the world experts in databases, CRDTs, replication, programming languages… I actually made contributions in some of these fields, mostly out of my own money and on my free time. There was also a time in Europe where we were not afraid of ambitious projects and regulations.
So, until Europe (or someone else: China?) invests in open and public cloud infrastructure and useful projects instead of what they’re doing (I won’t cite any example, but there are many), I’m fine using Cloudflare.
Cloudflare maintains two open source runtimes/simulators for cloudflare workers that you could probably use to self-host. The open source runtimes are workerd (based at least in part on the real code used at cloudflare) and miniflare (a simulator written in typescript).
This feels like an exaggeration. Cloudflare to my knowledge have only ever blocked one site for non-legal reasons, and they seemed to regret it. Unless cloudflare start disabling “DNS only” settings I can’t see the issue, could you enlighten me?
I don’t want to get into an argument about it, but it is untrue that Cloudflare has only blocked one site (except where compelled by law). Cloudflare kicked the fascism-promoting daily stormer and 8chan sites and kicked the hatespeech site kiwi farms.
Cloudflare also kicked switter.at, a mastodon instance for sex workers, which they claimed to be doing because of FOSTA, but they did this without warning and before any lawsuit was filed. In other cases cloudflare has fought censorship demands in the courts and won, and in the years following FOSTA it seems like courts are less willing to prosecute service providers than was originally thought, so cloudflare could probably have kept hosting switter.at, and could certainly have given them notice.
By the way, I am aware of these things with Cloudflare, and I do disagree with their responses (or delays, in the case of KiwiFarms) in 100% of the cases, but I’m willing to believe it is an irresponsible, rather than malevolent, behaviour, as is often the case in large organisations, especially if they’ve grown too fast.
I chose to work with CF anyway, because I don’t think that choice will influence Pijul’s future. If anything bad happens to us (censorship or otherwise) because of CF, it will mean that Pijul is big enough to be a problem, and that’ll prompt us to come up with a solution to keep going. I’m used to working with extremely limited setups and budgets, so that doesn’t scare me much.
Using a privacy-oriented machine setup, either on my real IP (which is not in the West) or through a commercial VPN, I constantly need to solve machine-learning hCAPTCHAs, to the point that I peace out if I don’t feel it’s worth it, because it’s exhausting to do multiple times a day.
There’s also IP-based blocking, and the service itself may have to comply with US embargoes, which can turn the lights out on entire regions even if the folks living there have nothing to do with their ruling government.
There’s also the moral issue of pointing at a proprietary service as the path of least setup resistance in this particular hosting case. Maybe we’ll get lucky and someone determined enough will make the NixOS module so that you can services.pijul.enable = true, but until then I bet most of the resources will recommend signing up with another publicly traded, proprietary service.
Fortunately, no, I’ve never seen it with the current Nest at least. I stopped subscribing to a VPN though because of how many Cloudflare-fronted services prevented me from using them so I’ve not tested with them.
I know that Cloudflare gives me that option of “verifying” users, but that’s totally optional from what I’m seeing. I still dream of a public infrastructure that would provide the same thing, but it doesn’t exist yet.
Not yet, but I’ll probably share them when people start using it.
Without counting any external costs:
My goal with this serverless version is to provide a 100% reliable service, which the previous version could not possibly become from where it was, mostly because of the way PostgreSQL and Pijul have to work together (you can’t do a “join” between a Pijul branch and a PostgreSQL table, so you have to do lots of SQL requests and keep the servers close to your repos, which is hard to replicate). This will in turn allow me to sell “pro” accounts, which you can already buy in the beta version (nest.pijul.org) if you want to host private projects, so instead of having only bills to pay, I hope to stop losing (my own personal) money on that service.
FaaS means you don’t need to fix crashes, downtimes, servers, databases… It’s a nightmare to debug, especially if you’re like me and need to mix JS and WASM: no stack traces, no break points, manual debugging messages. But once you get past that, you can release confidently and sleep well at night. I guess the value of that increases as you age, so I can’t evaluate it properly.
The comparison between the provider bills is probably negligible next to these two parameters.
Pijul used to be double-exponentially faster for applying a patch. With the recent improvements in Darcs, Pijul is now just exponentially faster (Darcs is linear in the size of history, while Pijul is logarithmic).
For some reason, it seems many people assumed you couldn’t do that and were forced to use nest.pijul.com. But this self-hosting solution was always there, starting around the end of 2015, i.e. way before the Nest existed.
oh cool, this looks very simple and straightforward, i might have to set it up! have you ever experimented with nidobyte? i’ve been interested, but haven’t yet played around with it
always exciting to hear about pijul! i think the biggest blocker to me using it more widely is that it isn’t supported as a way to version control Nix flakes, which requires explicit support in Nix. once I have a bit more free time and motivation, I hope to contribute a fetchpijul primitive to Nix to start building that.
This would be nice indeed. There’s a planned feature aiming at making a hybrid system between patches and snapshots. You would have the ease of use of patches (commutativity etc) plus the history navigation of snapshots.
This is relatively easy to do, all the formats are ready for it, but I haven’t found the time to do it yet (nor have I found the money to justify my time on this).
Mostly a bug fixing release, but with an unprecedented number of external contributions. The new identity system is really novel. I should probably write a blog post announcing this release properly, but it would probably be full of minor points.
This is a great example of why I’d really like a “flag: blatantly incorrect”.
On Mercurial, as many others have pointed out, the author didn’t actually read the page they wrote on. Contemporary Mercurial runs great on Python 3, thank you very much, and is well on the way to porting to Rust.
I’m no Fossil fan, but even I know that the error they’re highlighting with Fossil is because the Git repository isn’t valid. [edit: see note below; I’m wrong on the issue on that repo, but the rest of this paragraph is valid.] I hit this type of issue all the time while working on Kiln’s Harmony code. If you’re bored, try setting fsckObjects to true in your .gitconfig. You won’t be able to clone roughly a third of major repos on GitHub. (The ratio improves dramatically for more recent projects until it hits 100% success for stuff that was kicked off in the last couple of years, but anything five or more years old, you’re gonna hit this type of issue.)
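If you want to try it, the settings I mean are these (as far as I remember):

git config --global transfer.fsckObjects true
git config --global fetch.fsckObjects true
# clones and fetches of repos with malformed historical objects will now fail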
Darcs is evicted over its website being HTTP, which has nothing to do with Darcs. Pijul over the author having no clue how to write code in Windows.
The author is trolling. Poorly. And somehow this is highly ranked on the front page. Let’s do better.
Edit: a correction to the Fossil bit: After two seconds of poking with Fossil and Git fast-export on Windows, the actual issue with their Windows repository is that they’re in a command shell that’s piping UTF-16 and/or sending \r\n. This is also something any Windows dev should recognize and know how to deal with. If you’ve set the default shell encoding to UTF-8 or use a shell like yori that does something sane, you won’t hit this issue, but I’d sure hope that anyone using Windows full-time who refuses to set a UTF-8 default is habitually slamming | Out-File -encoding utf8 after commands like that.
For Pijul it’s even worse, it’s over (1) the author relying on contributors to test and fix on platforms other than Linux and (2) WSL not implementing POSIX correctly, but blaming Pijul’s author is easier than blaming Microsoft.
Also, she knows it, has contributed to Pijul after that post, and has never updated it.
I wouldn’t go as far as saying “the author is trolling”. To me it’s an opinion piece where, sure, the author could have put some more effort, but that’s their particular style and how they chose to write it. To be fair I did not expect it to be this controversial. It does mention some points on what a “better git” would need to do, even if it’s a bit snarky. Particularly the suggestions at the end. At the very least this thread has brought a bunch of good alternatives and points 🤷♀️.
I’m glad the post essentially says “no” after trying most contenders. I hope people make better porcelain for git, but moving to another data model makes little sense to me, and I hope the plumbing remains.
I did kind of raise my eyebrows at the first sentence:
Most software development is not like the Linux kernel’s development; as such, Git is not designed for most software development.
It’s been a long time since git was developed only for the needs of Linux! For example, github has a series of detailed blog posts on recent improvements to git.
The problem with Git is not the “better porcelain”, the current one is fine. The problem is that the fundamental data model doesn’t reflect how people actually work: commits are snapshots, yet all UIs present them as diffs, because that’s how people reason about work. The result of my work on code has never been an entire new version of a repository; in every case I can remember, I’ve only ever made changes to existing (or empty) repos.
This is the cause of bad merges, git rerere, poorly-handled conflicts etc. which waste millions of man-hours globally every year.
I don’t see any reason to be “glad” that the author of that post didn’t properly evaluate alternatives (Darcs dismissed over HTTP and Pijul over WSL being broken: easier to blame the author than Microsoft).
In my experience, the more pain and suffering one has spent learning Git, the more fiercely one defends it.
Not my experience. I was there at the transition phase, teaching developers git. Some hadn’t used SCM at all, most knew SVN.
The overall experience was: They could do just as much of their daily work as with SVN or CVS very quickly, and there were a few edge cases. But if you had someone knowledgeable it was SO much easier to fix mistakes or recover lost updates. Also if people put in a little work on top of their “checkout, commit, branch, tag” workflow they were very happy to be able to adapt it to their workflow.
I’m not saying none of the others would do that or that they wouldn’t be better - all I’m saying is that IMHO git doesn’t need fierce defense. It mostly works.
(I tried fossil very briefly and it didn’t click, also it ate our main repo and we had to get the maintainer to help :P and I couldn’t really make sense of darcs. I never had any beef with Mercurial and if it had won over git I would probably be just as happy, although it was a little slow back then… I’ve not used it in a decade I guess)
The underlying data model problem is something that I’ve run across with experienced devs multiple times. It manifests as soon as your VCS has anything other than a single branch with a linear history:
If I create a branch and then add a commit on top, that commit doesn’t have an identity, only the new branch head does. If I cherry pick that commit onto another branch and then try to merge the two, there’s a good chance that they’ll conflict because they’re both changing the same file. This can also happen after merging in anything beyond the trivial cases (try maintaining three branches with frequent merges between pairs of them and you’ll hit situations where a commit causes merge conflicts with itself).
Every large project where I’ve used git has workflows designed to work around this flaw in the underlying data model. I believe Pijul is built to prevent this by design (tracking patches, rather than trees, as the things that have identity) but I’ve never tried it.
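A minimal sketch of the kind of sequence that trips this up (branch and file names are hypothetical):

git switch -c feature
# edit parser.c, commit the fix
git commit -am "handle empty input"   # commit A
git switch main
git cherry-pick feature               # same diff, new hash A'
# more edits land near those lines on main, then later:
git merge feature                     # 3-way merge from the old base; A and its copy A' can conflict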
I don’t understand, is it “not your experience” that:
Snapshots are always shown as diffs, even though they aren’t diffs
Bad merges, git rerere and conflicts happen. A lot.
The author of the original post didn’t evaluate things correctly?
In any case, none of these things is contradicted by your explanation that Git came after SVN (I was there too). That said, SVN, CVS, Git, Fossil and Mercurial have the same underlying model: snapshots + 3-way merge. Git and Mercurial are smarter by doing it in a distributed way, but the fundamentals are the same.
Darcs and Pijul do things differently, using actual algorithms instead of hacks. This is never even hinted at in the article.
Snapshots are always shown as diffs, even though they aren’t diffs
It’s more meaningful (to me at least) to show what changed between two trees as the primary representation of a commit, rather than the tree itself, but there’s a button to switch between the two in most user interfaces, on github and elsewhere.
Bad merges, git rerere and conflicts happen. A lot.
I don’t tend to use workflows that tend to use merges as their primary integration method. Work on a feature, rebase on mainline, run tests, merge clean.
The author of the original post didn’t evaluate things correctly?
The author’s use cases are contradicted by the vast majority that use git successfully regardless of the problems cited. I’d say the only points that I do agree with the author on are:
Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.
Not the author’s quote, but there’s a missing next step there which is to examine what happens if we actually do that part better. Fossil kinda has the right approach there. As do things like BeOS’s BFS filesystem, or WinFS, which both built database-like concepts into a filesystem. Some of the larger Git systems build a backend using databases rather than files, so there’s no real problem that is not being worked on there.
approach history as not just a linear sequence of facts but a story
The one thing I’d like git to have is the idea of correcting / annotating history. Let’s say a junior dev makes 3 commits with such messages as ‘commit’, ‘fixed’, ‘actual fix’. Being able to group and reword those commits into a single commit say ‘implemented foobar feature’ sometime after the fact, without breaking everything else would be a godsend. In effect, git history is the first derivative of your code (dCode/dTime), but there’s a missing second derivative.
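The closest Git gets today is an interactive rebase, which does the grouping but gives every later commit a new hash, i.e. exactly the "breaking everything else" part (hashes below are made up):

git rebase -i HEAD~3
# in the todo list:
#   pick   a1b2c3  commit
#   squash d4e5f6  fixed
#   squash 789abc  actual fix
# Git then asks for a combined message, e.g. "implement foobar feature"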
Snapshots can be trivially converted to diffs and vice-versa, so I don’t see how this would impact merges. Whatever you can store as patches you can store as a sequence of snapshots that differ by the patch you want. Internally git stores snapshots as diffs in pack files anyway. Is there some clever merge algorithm that can’t be ported to git?
What git is missing is ability to preserve “grafts” across network to ensure that rebase and other branch rewrites don’t break old commit references.
I actually thought about the problem a bit (like, for a few years) before writing that comment.
Your comment sounds almost reasonable, but its truth depends heavily on how you define things. As I’m sure you’re aware, isomorphisms between data structures are only relevant if you define the set of operations you’re interested in.
For a good DVCS, my personal favourite set of operations includes:
In Git terms: merge, rebase and cherry-pick.
In Pijul terms: all three are called the same: “apply a patch”.
Also, I want these operations to work even on conflicts.
If you try to convert a Pijul repo to a Git repo, you will lose information about which patch solved which conflict. You’ll only see snapshots. If you try to cherry pick and merge you’ll get odd conflicts and might even need to use git rerere.
The other direction works better: you can convert a Git repo to a Pijul repo without losing anything meaningful for these operations. If you do it naïvely you might lose information about branches.
Honestly I’m of the opinion that git’s underlying data model is actually pretty solid; it’s just the user interface that’s dogshit. Luckily that’s the easiest part to replace, and it doesn’t have any of the unfortunate network effect problems of changing systems altogether.
I’ve been using magit for a decade and a half; if magit (or any other alternate git frontends) had never existed, I would have dumped git ages ago, but … you don’t have to use the awful parts?
Honestly I’m of the opinion that git’s underlying data model is actually pretty solid; it’s just the user interface that’s dogshit.
For what it’s worth, I do disagree, but not in a way relevant to this article. If we’re going to discuss Git’s data model, I’d love to discuss its inability to meaningfully track rebased/edited commits, the fact that heads are not version tracked in any meaningful capacity (yeah, you’ve got the reflog locally, but that’s it), that the data formats were standardized at once too early and too late (meaning that Git’s still struggling to improve its performance on the one hand, and that tools that work with Git have to constantly handle “invalid” repositories on the other), etc. But I absolutely, unquestionably agree that Git’s UI is the first 90% of the problem with Git—and I even agree that magit fixes a lot of those issues.
I’ve come to the conclusion that there’s something wrong with the data model in the sense that any practical use of Git with a team requires linearization of commit history to keep what’s changing when straight. I think a better data model would be able to keep track of the history branches and rebases. A squash or rebase should include some metadata that lets you get back the state before the rebase. In theory, you could just do a merge, but no one does that at scale because they make it too messy to keep track of what changed when.
I don’t think that’s a data model problem. It’s a human problem. Git can store a branching history just fine. It’s just much easier for people to read a linearized list of changes and operate on diffs on a single axis.
Kind of semantic debate whether the problem is the data model per se or not, but the thing I want Git to do—show me a linear rebased history by default but have the ability to also show me the pre-flattened history and the branch names(!) involved—can’t be done by using Git as it is. In theory you could build what I want using Git as the engine and a new UI layer on top, but it wouldn’t be interoperable with other people’s use of Git.
It already has a distinction between git log, git log --graph and git log --decorate (if you don’t delete branches that you care about seeing). And yeah, you can add other UIs on top.
BTW: I never ever want my branch names immortalized in the history. I saw Mercurial do this, and that was the last time I’ve ever used it. IMHO people confuse having record of changes and ability to roll them back precisely with indiscriminately recording how the sausage has been made. These are close, but not the same.
git merge --no-ff (imo the only correct merge for more than a single commit) does use the branch name, but the message is editable if your branch had a useless name
They’re not supposed to! Squashing and amending are important tools for cleaning up unwanted history. This is a very important ability, because it allows committing often, even before each change is final, and then fixing it up into readable changes rather than “wip”, “wip”, “oops, typo”, “final”, “final 2”.
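A sketch of the usual cleanup flow, assuming the branch forked from main and <sha-of-wip> stands for whichever commit the fix belongs to:

git commit --fixup <sha-of-wip>    # record "oops, typo" as a fixup of the earlier commit
git rebase -i --autosquash main    # fold every fixup into its target before pushing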
What I’m saying is, I want Git for Git. I want the ability to get back history that Git gives me for files, for Git itself. Git instead lets you either have one messy history (with a bunch of octopus merges) or one clean history (with rebase/linearization). But I want a clean history that I can see the history of and find out about octopuses (octopi?) behind it.
No. The user interface is one of the best parts of Git, in that it reflects the internals quite transparently. The fundamental storage doesn’t model how people work: Git reasons entirely in terms of commits/snapshots, yet any view of these is 100% of the time presented as diffs.
Git will never allow you to cherry-pick meaningfully, and you’ll always need dirty hacks like rerere to re-solve already-solved conflicts. Not because of porcelain (that would have been solved ten years ago), but because snapshots aren’t the right model for that particular problem.
Sounds interesting, but i definitely don’t grok it. I get the problems it points out, but definitely don’t get how it concretely fixes them. Probably just need to mess with it to understand better.
Also, being able to swap any commit order easily sounds like an anti-feature to me. I think history should not be edited generally speaking. Making that easy sounds concerning.
The commutation feature means that rebase and merge are the same operation, or rather, that it doesn’t matter which one you do. Patches are ordered locally in each Pijul repository, but they are only partially ordered globally, by their dependencies.
What Pijul gives you is a datastructure that is aware of conflicts, and where the order of incomparable patches doesn’t matter: you’ll get the exact same snapshot with different orders.
You can still bisect locally, and you can still get tags/snapshots/version identifiers.
What I meant in my previous reply is that if your project requires a strict ordering (some projects are like that), you can model it in Pijul: you won’t be able to push a new patch without first pushing all the previous ones.
But not all projects are like that, some projects use feature branches and want the ability to merge (and unmerge) them. Or in some cases, your coauthors are working on several things at the same time and haven’t yet found the time to clean their branches, but you want to cherrypick one of their “bugfix” patches now, without (1) waiting for the bugfix to land on main and (2) without dealing with an artificial conflict between your cherry-picking and the landed bugfix in the future.
I haven’t really had the time yet to read through the documentation, but how does Pijul manage its state? A big problem with Git is that it sort of assumes everything is on disk and that you always have direct access to the data. This causes problems for large Git hosts (e.g. it was a big issue at GitLab), as it’s not feasible to give e.g. web workers access to 20+ shared disks (e.g. using NFS, yuck). The result is that at such a scale you basically end up having to write your own Git daemons with clustering support and all that. It would be nice if new VCS systems were better equipped to be used at such a scale.
Good question. Pijul can separate patches into an operational part (which contains only +/- from diffs, but with byte intervals rather than actual contents) and a contents part. Its internal datastructure works on the operational part. It is relatively efficient at the moment, but grows linearly with the size of history. We have plans to make it more efficient in the future, if this becomes a problem for some users.
When downloading a bunch of patches, the larger ones aren’t downloaded entirely if they were superseded by newer versions.
But I have to admit that while we have tested Pijul on large histories (by importing Git repos), we haven’t really used it at scale in practice. And as anyone with problems “at scale” knows, experimental measurements are what matter here.
Not quite the same. Jujutsu, Gitless and Stacked Git are UIs written on top of Git. The UX is better than Git, but they inherit all the issues of not having proper theoretical foundations. Mercurial is in that category too: better UX, same problems.
Pijul solves the most basic problem of version control: all these tools try to simulate patch commutation (using mixtures of merges and rebases), but would never say it like that, while Pijul has actual commutative patches.
Yes. As someone who cares a lot about version control, I have read all I could about Jujutsu when it was first released. I actually find it nice and interesting.
What I meant in my comment above was that while I acknowledge that the snapshot model is dominant and incremental improvements are a very positive thing, I also don’t think this is the right way to look at the problem:
All the algorithms modelling these problems in the context of concurrent datastructures use changes rather than snapshots: CRDTs, OTs… (I’m talking about the “latest” version being a CRDT, here, not the entire DAG of changes).
In all the Git UIs that I know of, the user is always shown commits as patches. Why represent it differently internally?
What’s presented to the user does not need to be the internal representation. Just like users and most of the tools work on the snapshot of the source repo, you can still represent the snapshot as a set of patches internally. That, however, does not necessarily mean either snapshots or sets of patches are superior to the other. Besides, any practical VCS would have both representations available anyway.
Good point, and actually Pijul’s internal representation is far from being as simple as just a list of patches.
However, what I meant there wasn’t about the bytes themselves, but rather about the operations defined on the internal datastructure. When your internals model snapshots (regardless of what bytes are actually written), all your operations will be on snapshots – yet, Git never graphically shows anything as snapshots, all UIs show patches. This has real consequences visible in the real world, for example the lack of associativity (bad merges), unintelligible conflicts (hello, git rerere), endless rebases…
Also, my main interests in this project are mathematical (how to model things properly?) and technical (how to make the implementation really fast?). So, I do claim that patches can simulate snapshots at no performance penalty, whereas the converse isn’t true if you want to do a proper merge and deal with conflicts rigorously.
Yeah, I do think that, like many people have commented, collaboration networks like GitHub are one thing any new DVCS will either need to produce, or need to somehow be compatible with. Even GitHub is famously unusable for the standard kernel workflow and it could be argued that it suffers due to that.
I really like the Jujutsu compromise of allowing interop with all the common git social networks at the same time as allowing more advanced treatment of conflicts and ordering between operations.
There isn’t a document yet on how the transition to the native backend in a git social network world would look.
I also think that the operation log of jujutsu not being sharable is a limitation that would be nice to cram into some hidden data structures in the actual git repo, but then you have a chicken and egg problem of how to store that operation…
So, it seems the phrase “the most basic” was unclear in my comment above: I meant that from a theory point of view, the most basic problem is “what is a version control system?”, and that is what we tried to tackle with Pijul.
Pijul looks very cool. Do you consider in its current state to be a production ready replacement for git? I checked the FAQ but didn’t see that question.
That’s really exciting to hear; the fact that the web UI was proprietary is something that I’d always found disappointing. What prompted the change of heart, may I ask?
My understanding is that the plan has always been to open source it, but that a closed source approach was taken at first so people would focus on contributions to the Pijul protocol rather than to the relatively unimportant CRUD app.
Exactly. Maintaining open source projects takes time, one has to prioritise. In my experience, many of the “future eager contributors” to the Nest didn’t even care to look at Pijul’s source code, which makes me seriously doubt their exact intentions.
No change of heart. Pijul is usable now, so the focus can move on to something else. Also, I find it easier to guarantee the security (and react in case of a problem) of a web server written 100% by me.
Not that I am aware of. The sad thing is @pmeunier remains the cornerstone of this project and, as far as I understand, he lacks the time to work on pijul these days (which is totally okay, don’t get me wrong, the amount of work he has already put into this project is tremendous).
I am curious to see what the future holds for pijul, but I am a bit pessimistic I admit.
I am curious to see what the future holds for pijul, but I am a bit pessimistic I admit.
Any concrete argument?
It is true that I haven’t had much time in the last few weeks, but that doesn’t mean Pijul is unusable, or that I won’t come back to it once my current workload lightens. Others have started contributing at a fast pace, actually.
Others have started contributing at a fast pace, actually.
That is wonderful news! I keep wishing the best for pijul, even though I don’t use it anymore. This remains a particularly inspiring software to me. Sorry if my comment sounded harsh or unjust to you, I should know better but to write about my vague pessimism when the Internet is already a sad enough place.
I seem to recall that there was an announcement from a pijul author that pijul was in maintenance mode and that he was working on a new VCS. I can’t find mention of this VCS now though. Does anyone remember this?
My best guess, is that Git is a local maximum we’re going to be stuck on until we move away from the entire concept of “historic sequence of whole trees of static text” as SCM.
There’s nothing more powerful than people and project adopting it one by one. If you start a new project, using Pijul and the Nest is the best thing you can do to make the project grow.
The interface of Git and its underlying data models are two very different things, that are best treated separately.
The interface is pretty bad. If I wasn’t so used to it I would be fairly desperate for an alternative. I don’t care much for the staging area, I don’t like to have to clean up my working directory every time I need to switch branches, and I don’t like how easy it is to lose commits from a detached HEAD (though there’s always git reflog I guess).
The underlying data model however is pretty good. We can probably ditch the staging area, but apart from that, viewing the history of a repository as a directed graph of snapshots is nice. Captures everything we need. Sure patches have to be derived from those snapshots, but we care less about the patches than we care about the various versions we saved. If there’s one thing we need to get right, it’s those snapshots. You get reproducible builds & test from them, not from patches. So I think Patches are secondary. I used to love DARCS, but I think patch theory was probably the wrong choice.
Now one thing Git really really doesn’t like is large binary files. Especially if we keep changing them. But then that’s just a compression problem. Let the data model pretend there’s a blob for each version of that huge file, even though in fact the software is automatically compressing & decompressing things under the hood.
What’s wrong with the staging area? I use it all the time to break big changes into multiple commits and smaller changes. I’d hate to see it removed just because a few people don’t find it useful.
Absolutely, I would feel like I’m missing a limb without the staging area. I understand that it’s conceptually difficult at first, but imo it’s extremely worth the cost.
Do you actually use it, or do you just do git commit -p, which only happens to use the staging area as an implementation detail?
And how do you test the code you’re committing? How do you make sure that the staged hunks aren’t missing another hunk that, for example, changes the signature of the function you’re calling? It’s a serious slowdown in workflow to need to wait for CI rounds, stash and rebase to get a clean commit, and push again.
I git add -p to the staging area and then diff it before generating the commit. I guess that could be done without a staging area using a different workflow but I don’t see the benefit (even if I have to check git status for the command every time I need to unstage something (-: )
As for testing, since I’m usually using Github I use the PR as the base unit that needs to pass a test (via squash merges, the horror I know). My commits within a branch often don’t pass tests; I use commits to break things up into sections of functionality for my own benefit going back later.
Just to add on, the real place where the staging area shines is with git reset -p. You can reset part of a commit, amend the commit, and then create a new commit with your (original) changes or continue editing. The staging area becomes more useful the more you do commit surgery.
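A minimal sketch of that surgery, assuming the changes you want to split out are in the most recent commit:

git reset -p HEAD~    # pick the hunks to pull back out of the commit (working tree untouched)
git commit --amend    # the last commit no longer contains them
git add -p            # stage the extracted hunks again, shaping them as you like
git commit            # ...and record them as their own commit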
Meh, you don’t need a staging area for that (or anything). hg uncommit -i (for --interactive) does quite the same thing, and because it has no artificial staging/commit split it gets to use the clear verb.
I guess that could be done without a staging area using a different workflow but I don’t see the benefit
I don’t see the cost.
My commits within a branch often don’t pass tests;
If you ever need to git bisect, you may come to regret that. I almost never use git bisect, but for the few times I did need it it was a life saver, and passing tests greatly facilitate it.
I bisect every so often, but on the squashed PR commits on main, not individual commits within a PR branch. I’ve never needed to do that to diagnose a bug. If you have big PRs, don’t squash, or don’t use a PR-based workflow, that’s different of course. I agree with the general sentiment that all commits on main should pass tests for the purposes of bisection.
I use git gui for committing (the built-in git gui command), which lets you pick by line, not just hunks. Normally the things I’m excluding are stuff like enabling debug flags, or just extra logging, so it’s not really difficult to make sure it’s correct. Not saying I never push bad code, but I can’t recall an instance where I pushed bad code because of that. I also use the index to choose parts of my unfinished work to save in a stash (git stash --keep-index), and sometimes if I’m doing something risky and iterative I’ll periodically add things to the staging area as I go so I can have some way to get back to the last known good point without actually making a bunch of commits (I could rebase after, yeah, but meh).
It being just an implementation detail in most of that is a fair point though.
I personally run the regression test (which I wrote) to test changes.
Then I have to wait for the code review (which in my experience has never stopped a bug going through; when I have found bugs, in code reviews, it was always “out of scope for the work, so don’t fix it”) before checking it in. I’m dreading the day when CI is actually implemented as it would slow down an already glacial process [1].
Also, I should mention I don’t work on web stuff at all (thank God I got out of that industry).
[1] Our customer is the Oligarchic Cell Phone Company, which has a sprint of years, not days or weeks, with veto power over when we deploy changes.
I missed the staging area for at most a few weeks after I switched from Git to Mercurial many years ago. Now I miss Mercurial’s tools for splitting commits etc. much more whenever I use Git.
Thanks for the write up. From what I read it seems like with Jujutsu if I have some WIP of which I want to commit half and continue experimenting with the other half I would need to commit it all across two commits. After that my continuing WIP would be split across two places: the second commit and the working file changes. Is that right? If so, is there any way to tag that WIP commit as do-not-push?
Not quite. Every time you run a command, the working copy is snapshotted and becomes a real commit, amending the previous working-copy commit. The changes in the working copy are thus treated just like any other commit. The corresponding thing to git commit -p is jj split, which creates two stacked commits from the previous working-copy commit, and the second commit (the child) is what you continue to edit in the working copy.
Your follow-up question still applies (to both commits instead of the single commit you seemed to imagine). There’s not yet any way of marking the working copy as do-not-push. Maybe we’ll copy Mercurial’s “phase” concept, but we haven’t decided yet.
Way I see it, the staging area is a piece of state needed specifically for a command line interface. I use it too, for the exact reason you do. But I could do the same by committing it directly. Compare the possible workflows. Currently we do:
# most of the time
git add .
git commit
# piecemeal
git add -p .
# review changes
git commit
Without a staging area, we could instead do that:
# most of the time
git commit
# piecemeal
git commit -p
# review changes
git reset HEAD~ # if the changes are no good
And I’m not even talking about a possible GUI for the incremental making of several commits.
Personally I use git add -p all of the time. I’ve simply been burned by the other way too many times. What I want is not to save commands but to have simple commands that work for me in every situation. I enjoy the patch selection phase. More often than not it is what triggers my memory of a TODO item I forgot to jot down, etc. The patch selection is the same as reviewing the diff I’m about to push but it lets me do it incrementally so that when I’m (inevitably) interrupted I don’t have to remember my place.
From your example workflows it seems like you’re interested in avoiding multiple commands. Perhaps you could use git commit -a most of the time? Or maybe add a commit-all alias?
Never got around to write that alias, and if I’m being honest I quite often git diff --cached to see what I’ve added before I actually commit it.
I do need something that feels like a staging area. I was mostly wondering whether that staging area really needed to be implemented differently than an ordinary commit. Originally I believed commits were enough, until someone pointed out pre-commit hooks. Still, I wonder why the staging area isn’t at least a pointer to a tree object. It would have been more orthogonal, and likely require less effort to implement. I’m curious what Linus was thinking.
Very honourable to revise your opinion in the face of new evidence, but I’m curious to know what would happen if you broadened the scope of your challenge with “and what workflow truly requires pre-commit hooks?”!
Hmm, that’s a tough one. Strictly speaking, none. But I can see the benefits.
Take Monocypher for instance: now it’s pretty stable, and though it is very easy for me to type make test every time I modify 3 characters, in practice I may want to make sure I don’t forget to do it before I commit anything. But even then there are 2 alternatives:
Running tests on the server (but it’s better suited to a PR model, and I’m almost the only committer).
Having a pre push hook. That way my local commits don’t need the hook, and I could go back to using the most recent one as a staging area.
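A minimal version of that hook, assuming make test runs the whole suite:

#!/bin/sh
# save as .git/hooks/pre-push and make it executable;
# a non-zero exit status aborts the push
exec make test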
I use git add -p all the time, but only because Magit makes it so easy. If I had an equally easy interface to something like hg split or jj split, I don’t think I’d care about the lack of an index/staging area.
Do you actually add your entire working directory most of the time? Unless I’ve just initialized a repository I essentially never do that.
Here’s something I do do all the time, because my mind doesn’t work in a red-green-refactor way:
Get a bug report
Fix bug in foo_controller
Once the bug is fixed, I finally understand it well enough to write an automated regression test around it, so go do that in foo_controller_spec
Run test suite to ensure I didn’t break anything and that my new test is green
Add foo_controller and foo_controller_spec to staging area
Revert working copy (but not staged copy!) of foo_controller (but not its spec)
Run test suite again and ensure I have exactly one red test (the new regression test). If yes, commit the stage.
If no, debug spec against old controller until I understand why it’s not red, get it red, pull staged controller back to working area, make sure it’s green.
—
Yeah, I could probably simulate this by committing halfway through and then doing some bullshit with cherry-picks from older commits and in some cases reverting the top commit but, like, why? What would I gain from limiting myself to just this awkward commit dance as the only way of working? That’s just leaving me to cobble together a workflow that’s had a powerful abstraction taken away from it, just to satisfy some dogmatic “the commit is the only abstraction I’m willing to allow” instinct.
Yes. And when I get a bug report, I tend to first reproduce the bug, then write a failing test, then fix the code.
Right, but that was just one example. Everything in your working copy should always be committed at all times? I’m almost never in that state. Either I’ve got other edits in progress that I intend to form into later commits, or I’ve got edits on disk that I never intend to commit but in files that should not be git ignored (because I still intend to merge upstream changes into them).
I always want to be intentionally forming every part of a commit, basically.
Sounds useful. How do you do that?
git add foo_controller <other files>; git restore -s HEAD foo_controller
and then
git restore foo_controller will copy the staged version back into the working set.
TBH, I have no idea what “git add -p” does off hand (I use Magit), and I’ve never used staging like that.
I had a great example use of staging come up just yesterday. I’m working in a feature branch, and we’ve given QA a build to test what we have so far. They found a bug with views, and it was an easy fix (we didn’t copy attributes over when copying a view).
So I switched over to views.cpp and made the change. I built, tested that specific view change, and in Magit I staged that specific change in views.cpp. Then I committed, pushed it, and kicked off a pipeline build to give to QA.
I also use staging all the time if I refactor while working on new code or fixing bugs. Say I’m working on “foo()”, but while doing so I refactor “bar()” and “baz()”. With staging, I can isolate the changes to “bar()” and “baz()” in their own commits, which is handy for debugging later, giving the changes to other people without pulling in all of my changes, etc.
Overall, it’s trivial to ignore staging if you don’t want it, but it would be a lot of work to simulate it if it weren’t a feature.
What’s wrong with the staging area? I use it all the time to break big changes into multiple commits and smaller changes.
I’m sure you do – that’s how it was meant to be used. But you might as well use commits as the staging area – it’s easy to commit and squash. This has the benefit that you can work with your whole commit stack at the same time. I don’t know what problem the staging area solves that isn’t better solved with commits. And yet, the mere existence of this unnecessary feature – this implicitly modified invisible state that comes and crashes your next commit – adds cognitive load: Commands like git mv, git rm and git checkout pollute the state, then git diff hides it, and finally, git commit --amend accidentally invites it into the topmost commit.
The combo of being not useful and a constant stumbling block makes it bad.
Using e.g. hg split or jj split. The former has a text-based interface similar to git commit -p as well as a curses-based TUI. The latter lets you use e.g. Meld or vimdiff to edit the diff in a temporary directory and then rewrites the commit and all descendants when you’re done.
That temporary directory sounds a lot like the index – a temporary place where changes to the working copy can be batched. Am I right to infer here that the benefit you find in having a second working copy in a temp directory is that it works better with some other tools that expect to work on files?
The temporary directory is much more temporary than the index - it only exists while you split the commit. For example, if you’re splitting a commit that modifies 5 files, then the temporary directory will have only 2*5 files (for before and after). Does that clarify?
The same solution for selecting part of the changes in a commit is used by jj amend -i (move into parent of specified commit, from working-copy commit by default), jj move -i --from <rev> --to <rev> (move changes between arbitrary commits) etc.
I use git revise. Interactive revise is just like interactive rebase, except that it has a cut subcommand. This can be used to split a commit by selecting and editing hunks, like git commit -p.
Before git-revise, I used to manually undo part of the commit, commit that, then revert it, and then squash the undo-commit into the commit to be split. The revert-commit then contains the split-off changes.
I don’t know, I find it useful. Maybe if git built in mercurials “place changes into commit that isn’t the most recent” amend thing then I might have an easier time doing things but just staging up relevant changes in a patch-based flow is pretty straightforward and helpful IMO
I wonder if this would be as controversial if patching was the default
What purpose does it serve that wouldn’t also be served by first-class rollback and an easier way of collapsing changesets on their way upstream? I find that most of the benefits of smaller changesets disappear when they don’t have commit messages, and when using the staging area for this you can only rollback one step without having to get into the hairy parts of git.
The staging area is difficult to work with until you understand what’s happening under the hood. In most version control systems, an object under version control would be in one of a handful of states: either the object has been cataloged and stored in its current state, or it hasn’t. From a DWIM standpoint, a new git user would expect committing to catalog and store the object in its current state. With the stage, you can stage, and change, stage again, and change again. I’ve used this myself to logically group commits, so I agree with you that it’s useful. But I do see how it breaks people’s DWIM view of how git works.
Also, if I stage, and then change, is there a way to have git restore the file as I staged it if I haven’t committed?
Interesting, that would solve the problem. I’m surprised I’ve not come across that before.
In terms of “what’s wrong with the staging area”, what I was suggesting would work better is to have the whole thing work in reverse. So all untracked files are “staged” by default and you would explicitly un-stage anything you don’t want to commit. Firstly this works better for the 90% use-case, and compared to this workaround it’s a single step rather than 2 steps for the 10% case where you don’t want to commit all your changes yet.
The fundamental problem with the staging area is that it’s an additional, hidden state that the final committed state has to pass through. But that means that your commits do not necessarily represent a state that the filesystem was previously in, which is supposed to be a fundamental guarantee. The fact that you have to explicitly stash anything to put the staging area into a knowable state is a bit of a hack. It solves a problem that shouldn’t exist.
The way I was taught this, the way I’ve taught this to others, and the way it’s represented in at least some guis is not compatible.
I mean, sure, you can have staged and unstaged changes in a file and need to figure it out for testing, or unstage parts, but mostly it’s edit -> stage -> commit -> push.
That feels, to me and to newbies who barely know what version control is, like a logical additive flow. Tons of cases you stage everything and commit so it’s a very small operation.
The biggest gripe may be devs who forget to add files in the proper commit, which makes bisect hard. Your case may solve that for sure, but I find it a special case of bad guis and sloppy devs who do that. Also at some point the fs layout gets fewer new files.
Except that in a completely linear flow the distinction between edit and stage serves no purpose. At best it creates an extra step for no reason and at worst it is confusing and/or dangerous to anyone who doesn’t fully understand the state their working copy is in. You can bypass the middle state with git add .; git commit and a lot of new developers do exactly that, but all that does is pretend the staging state doesn’t exist.
Staging would serve a purpose if it meant something similar to pushing a branch to CI before a merge, where you have isolated the branch state and can be assured that it has passed all required tests before it goes anywhere permanent. But the staging area actually does the opposite of that, by creating a hidden state that cannot be tested directly.
As you say, all it takes is one mistake and you end up with a bad commit that breaks bisect later. That’s not just a problem of developers being forgetful, it’s the bad design of the staging area that makes this likely to happen by default.
I think I sort of agree but do not completely concur.
Glossing over the staging can be fine in some projects and dev sloppiness is IMO a bigger problem than an additive flow for clean commits.
These are societal per-project issues - what’s the practice or policy or mandate - and thus they could be upheld by anything, even using the undo buffer for clean commits like back in the day. Which isn’t to say you never gotta do trickery like that with Git, just that it’s a flow that feels natural and undo trickery less common.
Skimming the other comments, maybe jj is more like your suggestion, and I wouldn’t mind “a better Git”, but I can’t be bothered when eg. gitless iirc dropped the staging and would make clean commits feel like 2003.
Everyone seem to suppose I would like to ditch the workflows enabled by the staging area. I really don’t. I’m quite sure there ways to keep those workflows without using a staging area. If there aren’t well… I can always admit I was wrong.
Well, what I prize being able to do is to build up a commit piecemeal out of some but not all of the changes in my working directory, in an incremental rather than all-in-one-go fashion (ie. I should be able to form the commit over time and I should be able to modify a file, move its state into the “pending commit” and continue to modify the file further without impacting the pending commit). It must be possible for any commit coming out of this workflow to both not contain everything in my working area, and to contain things no longer in my working area. It must be possible to diff my working area against the pending commit and against the last actual commit (separately), and to diff the pending commit against the last actual commit.
You could call it something else if you wanted but a rose by any other name etc. A “staging area” is a supremely natural metaphor for what I want to work with in my workflow, so replacing it hardly seems desirable to me.
How about making the pending commit an actual commit? And then adding the porcelain necessary to treat it like a staging area? Stuff like git commit -p foo if you want to add changes piecemeal.
No. That’s cool too and is what tools like git revise and git absorb enable, but making it an actual commit would have other drawbacks: it would imply it has a commit message and passes pre-commit hooks and things like that. The staging area is useful precisely for what it does now—help you build up the pieces necessary to make a commit. As such it implies you don’t have everything together to make a commit out of it. As soon as I do, I commit, then if necessary --amend, --edit, or git revise later. If you don’t make use of workflows that use staging then feel free to use tooling that bypasses it for you, but don’t try to take it away from the rest of us.
Oh, totally missed that one. Probably because I’ve never used it (instead I rely on CI or manually pushing a button). Still, that’s the strongest argument so far, and I have no good solution that doesn’t involve an actual staging area there. I guess it’s time to change my mind.
I think you missed the point, my argument is that the staging area is useful as a place to stage stuff before things like commit related hooks get run. I don’t want tools like git revise to run precommit hooks. When I use git revise the commit has already been made and presumably passed precommit phase.
For the problem that git revise “bypasses” the commit hook when using it to split a commit, I meant the commit hook (not precommit hook).
I get that the staging area lets you assemble a commit before you can run the commit hook. But if this was possible to do statelessly (which would only be an improvement), you could do without it. And for other reasons, git would be so much better without this footgun:
Normally, you can look at git diff and commit what you see with git commit -a. But if the staging area is clobbered, which you might have forgotten, you also have invisible state that sneaks in!
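A minimal sketch of that footgun (the file name is hypothetical):

    git add new-helper.c            # stage a brand-new file, then forget about it
    # ...later...
    git diff                        # shows only unstaged edits to tracked files,
                                    # so the staged new file is invisible here
    git commit -a -m "fix typo"     # the forgotten file rides along into the commit
    # running git status (or git diff --cached) first would have revealed it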
Normally, you can look at git diff and commit what you see with git commit -a.
Normally I do nothing of the kind. I might have used git commit -a a couple times in the last 5 years (and I make dozens to hundreds of commits per day). The statefulness of the staging area is exactly what benefits my workflow, and not the part I would be trying to eliminate. The majority of the time I stage things I’m working on from my editor, one hunk at a time. The difference between my current buffer and the last git commit is highlighted, and after I make some progress I start adding related hunks and shaping them into commits. I might fiddle around with a couple things in the current file, then when I like it stage up pieces into a couple different commits.
The most aggressive I’d get is occasionally (once a month?) coming up with a use for git commit -u.
A stateless version of staging that “lets you assemble a commit” sounds like an oxymoron to me. I have no idea what you think that would even look like, but a state that is neither the full contents of the current file system nor yet a commit is exactly what I want.
Why deliberately make a mess of things? Why make a discrete concept of a “commit” into something else with multiple possible states? Why not just use staging like it is now? I see no benefit to jury-rigging more states on top of a working one. If the point is to simplify the tooling, you won’t get there by overloading one clean concept with an indefinite state and contextual markers like “if commit message empty then this is not a real commit”.
Sure, you could awkwardly simulate a staging area like this. The porcelain would have to juggle a whole bunch of shit to avoid breaking anytime you merge a bunch of changes after adding something to the fake “stage”, pull in 300 new commits, and then decide you want to unstage something, so the replacement of the dedicated abstraction seems likely to leak and introduce merge conflict resolution where you didn’t previously have to worry about it, but maybe with enough magic you could do it.
But what’s the point? To me it’s like saying that I could awkwardly simulate if, while and for with goto, or simulate basically everything with enough NANDs. You’re not wrong, but what’s in it for me? Why am I supposed to like this any better than having a variety of fit-for-purpose abstractions? It just feels like I’d be tying one hand behind my back so there can be one less abstraction, without explaining why having N-1 abstractions is even more desirable than having N.
Seems more like a “foolish consistency is the hobgoblin of little minds” desire than anything beneficial, really.
Simplicity of implementation. Implementing the staging area like a commit, or at least like a pointer to a tree object, would likely make the underlying data model simpler. I wonder why the staging area was implemented the way it is.
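Incidentally, the staging area’s contents can already be captured as a tree object with stock plumbing commands, which is close to the “pointer to a tree object” idea; a small sketch:

    # write the current index (staging area) out as a tree object
    tree=$(git write-tree)

    # optionally wrap it in a dangling commit, without moving HEAD or touching the index
    git commit-tree -p HEAD -m "snapshot of the index" "$tree"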
At the interface level however I’ve had to change my mind because of pre-commit hooks. When all you have is commits, and some tests are automatically launched every time you commit anything, it’s pretty hard to add stuff piecemeal.
Yes, simplicity of implementation and UI. https://github.com/martinvonz/jj (mentioned in the article) makes the working copy (not the staging area) an actual commit. That does make the implementation quite a lot simpler. You also get backups of the working copy that way.
No offence but, why would I give a shit about this? git is a tool I use to enable me to get other work done, it’s not something I’m reimplementing. If “making the implementation simpler” means my day-to-day workflows get materially more unpleasant, the simplicity of the implementation can take a long walk off a short pier for all I care.
It’s not just pre-commit hooks that get materially worse with this. “Staging” something would then have to have a commit message, I would effectively have to branch off of head before doing every single “staging” commit in order to be able to still merge another branch and then rebase it back on top of everything without fucking about in the reflog to move my now-buried-in-the-past stage commit forward, etc, etc. “It would make the implementation simpler” would be a really poor excuse for a user-hostile change.
If “making the implementation simpler” means my day-to-day workflows get materially more unpleasant, the simplicity of the implementation can take a long walk off a short pier for all I care.
I agree. Users shouldn’t have to care about the implementation (except for minor effects like a simpler implementation resulting in fewer bugs). But I don’t understand why your workflows would be materially more unpleasant. I think they would actually be more pleasant. Mercurial users very rarely miss the staging area. I was a git developer (mostly working on git rebase) a long time ago, so I consider myself a (former) git power user. I never miss the staging area when I use Mercurial.
“Staging” something would then have to have a commit message
Why? I think the topic of this thread is about what can be done differently, so why would the new tool require a commit message? I agree that it’s useful if the tool lets you provide a message, but I don’t think it needs to be required.
I would effectively have to branch off of head before doing every single “staging” commit in order to be able to still merge another branch and then rebase it back on top of everything without fucking about in the reflog to move my now-burried-in-the-past stage commit forward
I don’t follow. Are you saying you’re currently doing the following?
I don’t see why the new tool would bury the staging commit in the past. That’s not what happens with Jujutsu/jj anyway. Since the working copy is just like any other commit there, you can simply merge the other branch with it and then rebase the whole stack onto the other branch after.
Mercurial users very rarely miss the staging area.
Well, I’m not them. As somebody who was forced to use Mercurial for a bit and hated every second of it, I missed the hell out of it, personally (and if memory serves, there was later at least one inevitably-nonstandard Mercurial plugin to paper over this weakness, so I don’t think I was the only person missing it).
I’ve talked about my workflow elsewhere in this thread, I’m not really interested in rehashing it, but suffice to say I lean on the index for all kinds of things.
Are you saying you’re currently doing the following? git add -p, git merge
I’m saying that any number of times I start putting together a commit by staging things on Friday afternoon, come back on Monday, pull in latest from main, and continue working on forming a commit.
I’m honestly not interested in reading it, or in what “Jujutsu” does, as I’m really happy with git and totally uninterested in replacing it. All I was discussing in this thread with Loup-Vaillant was the usefulness of the stage as an abstraction and my disinterest in seeing it removed under an attitude of “well you could just manually make commits when you would want to stage things, instead”.
As a Git power-user, you may think that you need the power of the index to commit only part of the working copy. However, Jujutsu provides commands for more directly achieving most use cases you’re used to using Git’s index for.
I was claiming that the workflows we have with the staging area, we could achieve without. And Jujutsu here has ways to do exactly that. It has everything to do with the scenario you were objecting to.
Also, this page (and what I cited specifically) is not about what jujutsu does under the hood, it’s about its user interface.
And it’s because of developers with their heads stuck so far up their asses that they prioritize their implementation simplicity over the user experience that so much software is actively user-hostile.
Sublime Merge is the ideal git client for me. It doesn’t pretend it’s not git like all other GUI clients I’ve used so you don’t have to learn something new and you don’t unlearn git. It uses simple git commands and shows them to you. Most of git’s day-to-day problems go away if you can just see what you’re doing (including what you’ve mentioned).
CLI doesn’t cut it for projects of today’s size. A new git won’t fix that. The state of a repository doesn’t fit in a terminal and it doesn’t fit in my brain. Sublime Merge shows it just right.
I use Fork for the same purpose and the staging area has never been a problem since it is visible and diffable at any time, and that’s how you compose your commits.
Well, on the one hand people could long for a better way to store the conflict resolutions to reuse them better on future merges.
On the other hand, of all approaches to DAG-of-commits, Git’s model is plain worse than the older/parallel ones. Git is basically intended to lose valuable information about intent. The original target branch of the commit often tells as much as the commit message… but it is only available in reflog… auto-GCed and impossible to sync.
Half of my branches are called werwerdsdffsd. I absolutely don’t want them permanently burned in the history. These scars from work-in-progress annoyed me in Mercurial.
Honestly I have completely the opposite feeling. Back in the days before git crushed the world, I used Mercurial quite a lot and I liked that Mercurial had both the ephemeral “throw away after use” model (bookmarks) and the permanent-part-of-your-repository-history model (branches). They serve different purposes, and both are useful and important to have. Git only has one and mostly likes to pretend that the other is awful and horrible and nobody should ever want it, but any long-lived project is going to end up with major refactoring or rewrites or big integrations that they’ll want to keep some kind of “here’s how we did it” record to easily point to, and that’s precisely where the heavyweight branch shines.
This is a very good point.
It would be interesting to tag and attach information to a group of related commits. I’m curious about the Linux kernel workflows. If everything is an emailed patch, maybe features are done one commit at a time.
If you go further, there are many directions to extend what you can store and query in the repository! And of course they are useful. But even the data Git forces you to have (unlike, by the way, many other DVCSes where if you do not want a meaningful name you can just have multiple heads in parallel inside a branch) could be used better.
I can’t imagine a scenario where the original branch point of a feature would ever matter, but I am constantly sifting through untidy merge histories that obscure the intent.
Tending to your commit history with intentionality communicates to reviewers what is important, and removes what isn’t.
It is not about the point a branch started from. It is about which of the recurring branches the commit was in. Was it in quick-fix-train branch or in update-major-dependency-X branch?
The reason why this isn’t common is because of GitHub more than Git. They don’t provide a way to use merge commits that isn’t a nightmare.
When I was release managing by hand, my preferred approach was rebasing the branch off HEAD but retaining the merge commit, so that the branch commits were visually grouped together and the branch name was retained in the history. Git can do this easily.
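For anyone wanting to reproduce that by hand, a rough sketch (the branch name is hypothetical):

    git checkout feature/login
    git rebase main                  # replay the branch on top of current main
    git checkout main
    git merge --no-ff feature/login  # force a merge commit, so the branch's commits
                                     # stay visually grouped and its name is recorded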
I never understood the hate for Git’s CLI. You can learn 99% of what you need to know on a daily basis in a few hours. That’s not a bad time investment for a pivotal tool that you use multiple times every day. I don’t expect a daily driver tool to be intuitive, I expect it to be rock-solid, predictable, and powerful.
This is a false dichotomy: it can be both (as Mercurial is). Moreover, while it’s true that you can learn the basics to get by with in a few hours, it causes constant low-level mental overhead to remember how different commands interact, what the flag is in this command vs. that command, etc.—and never mind that the man pages are all written for people thinking in terms of the internals, instead of for general users. (That this is a common failing of man pages does not make it any less a problem for git!)
One way of saying it: git has effectively zero progressive disclosure of complexity. That makes it a continual source of paper cuts at minimum unless you’ve managed to actually fully internalize not only a correct mental model for it but in many cases the actual implementation mechanics on which it works.
Its predecessors CVS and svn had much more intuitive commands (even if they were clumsy to use in other ways). DARCS has been mentioned many times as being much easier to use as well. People migrating from those tools really had a hard time, especially because git changed the meanings of some commands, like checkout.
Then there were some other tools that came up around the same time or shortly after git but didn’t get the popularity of git like hg and bzr, which were much more pleasant to use as well.
I think the issues people have are less about the CLI itself and more about how it interfaces with the (for some developers) complex and hard to understand concepts at hand.
Take rebase for example. Once you grok what it is, it’s easy, but trying to explain the concept of replaying commits on top of others to someone used to old school tools like CVS or Subversion can be a challenge, especially when they REALLY DEEPLY don’t care and see this as an impediment to getting their work done.
I’m a former release engineer, so I see the value in the magic Git brings to the table, but it can be a harder sell for some :)
I would argue that this is one of the main reasons for git’s success. The CLI is so bad that people were motivated to look for tools to avoid using it. Some of them were motivated to write tools to avoid using it. There’s a much richer set of local GUI and web tools than I’ve seen for any other revision control system and this was true even when git was still quite new.
I never used a GUI with CVS or Subversion, but I wanted to as soon as I started touching the git command line. I wanted features like PRs and web-based code review, because I didn’t want to merge things locally. I’ve subsequently learned a lot about how to use the git CLI and tend to use it for a lot of tasks. If it had been as good as, say, Mercurial’s from the start then I never would have adopted things like gitx / gitg and GitHub and it’s those things that make the git ecosystem a pleasant place to be.
The interface of Git and its underlying data models are two very different things, that are best treated separately.
Yes a thousand times this! :) Git’s data model has been a quantum leap for people who need to manage source code at scale. Speaking as a former release engineer, I used to be the poor schmoe who used to have to conduct Merge Day, where a branch gets merged back to main.
There was exactly one thing you could always guarantee about merge day: There Will Be Blood.
So let’s talk about looking past git’s god awful interface, but keep the amazing nubbins intact and doing the nearly miraculous work they do so well :)
And I don’t just mean throwing a GUI on top either. Let’s rethink the platonic ideal for how developers would want their workflow to look in 2022. Focus on the common case. Let the ascetics floating on a cloud of pure intellect script their perfect custom solutions, but make life better for the “cold dark matter” developers which are legion.
I would say that you simultaneously give credit where it is not due (there were multiple DVCSes before Git, and approximately every one of them had a better data model - and then there are things that Subversion still does better than everyone else, somehow), and ignore the part that actually made your life easier: the effort of pushing Git down people’s throats, done by Linus Torvalds, who spent orders of magnitude more of his time on that than on getting things right beyond basic workability in Git.
Well, Bazaar is technically earlier. Monotone is significantly earlier. Monotone has a quite interesting and nicely decoupled data model where the commit DAG is just one thing; changelog, author — and branches get the same treatment — are not parts of a commit, but separately stored claims about a commit, and this claim system is extensible and queryable. And of course Git was about Linus Torvalds speedrunning an implementation of the parts of BitKeeper he really, really needed.
It might be that in the old days running on Python limited the speed of both Mercurial and Bazaar. Rumour has it that the Monotone version Torvalds found too slow was indeed a performance regression (they had one particularly slow release at around that time; Monotone is not written in Python).
Note that one part of what makes Git fast is that it bakes in some optimisations that systems like Monotone make optional (it is quite optimistic about how quickly you can decide that a file must not have been modified, for example). Another is that it was originally only intended to be FS-safe on ext3… and then everyone forgot to care, so now it is quite likely to break the repository in case of an unclean shutdown mid-operation. Yes, I have damaged repositories that way, to a state where I could not find advice on how to get even a partially working repository without re-cloning.
As for Subversion, it has narrow checkouts, which are a great feature; DVCSes could also have them, but I don’t think any of them does it properly. You can kind of hack something together with remote-automate in Monotone, but probably flakily.
Let the data model pretend there’s a blob for each version of that huge file, even though in fact the software is automatically compressing & decompressing things under the hood.
Ironically, that’s part of the performance problem – compressing the packfiles tends to be where things hurt.
Hmm, knowing you I’m sure you’ve tested it to death.
I guess they got rid of the exponential conflict resolution that plagued DARCS? If so perhaps I should give patch theory another go. Git ended up winning the war before I got around to actually study patch theory, maybe it is sounder than I thought.
Pijul is a completely different thing than Darcs, the current state of a repository in Pijul is actually a special instance of a CRDT, which is exactly what you want for a version control system.
Git is also a CRDT, but HEAD isn’t (unlike in Pijul), the CRDT in Git is the entire history, and that is not a very useful property.
Best test suite ever. Thanks again, and again, and again for that. It also helped debug Sanakirja, a database engine used as the foundation of Pijul, but usable in other contexts.
You can’t describe Git without discussing rebase and merge: these are the two most common operations in Git, yet they don’t satisfy any interesting mathematical property such as associativity or symmetry:
Associativity is when you want to merge your commits one by one from a remote branch. This should intuitively be the same as merging the remote HEAD, but Git manages to make it different sometimes. When that happens, your lines can be shuffled around more or less randomly.
Symmetry means that merging A and B is the same as merging B and A. Two coauthors doing the same conflictless merge might end up with different results. This is one of the main benefits of GitHub: merges are never done concurrently when you use a central server.
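Concretely, the associativity complaint is about these two ways of taking in the same remote work occasionally giving different results; a sketch, assuming origin/main is two commits ahead:

    # one by one: merge the remote commits individually
    git merge origin/main~1
    git merge origin/main

    # versus, starting from the same state, all at once:
    git merge origin/main

    # a patch-based model guarantees both end states are identical;
    # Git's 3-way merge usually agrees, but not always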
Well, at least this is not the fault of the data model: if you have all the snapshots, you can deduce all the patches. It’s the operations themselves that need fixing.
My point is that this is a common misconception: no datastructure is ever relevant without considering the common operations we want to run on it.
For Git repos, you can deduce all the patches indeed, but merge and rebase can’t be fixed while keeping a reasonable performance, since the merge problem Git tries to solve is the wrong one (“merge the HEADs, knowing their youngest common ancestor”). That problem cannot have enough information to satisfy basic intuitive properties.
The only way to fix it is to fetch the entire sequence of commits from the common ancestor. This is certainly doable in Git, but merges become O(n) in time complexity, where n is the size of history.
The good news is, this is possible. The price to pay is a slightly more complex datastructure, slightly harder to implement (but manageable). Obviously, the downside is that it can’t be consistent with Git, since we need more information. On the bright side, it’s been implemented: https://pijul.org
no datastructure is ever relevant without considering the common operations we want to run on it.
Agreed. Now, how often do we actually merge stuff, and how far is the common ancestor in practice?
My understanding of the usage of version control is that merging two big branches (with an old common ancestor) is rare. Far more often we merge (or rebase) work units with just a couple commits. Even more often than that we have one commit that’s late, so we just pull in the latest change then merge or rebase that one commit. And there are the checkout operations, which in some cases can occur most frequently. While a patch model would no doubt facilitate merges, it may not be worth the cost of making other, arguably more frequent operations, slower.
(Of course, my argument is moot until we actually measure. But remember that Git won in no small part because of its performance.)
the only proper modelling of conflicts, merges and rebases/cherry-picking I know of (Pijul) can’t rely on common ancestors only, because rebases can make some future merges more complex than a simple 3-way merge problem.
I know many engineers are fascinated by Git’s speed, but the algorithm running on the CPU is almost never the bottleneck: the operator’s brain is usually much slower than the CPU in any modern version control system (even Darcs has fixed its exponential merge). Conflicts do happen, so do cherry-picks and rebases. They aren’t rare in large projects, and can be extremely confusing without proper tools. Making these algorithms fast is IMHO much more important from a cost perspective than gaining 10% on an operation already taking less than 0.1 second. I won’t deny the facts though: if Pijul isn’t used more in industry, it could be partly because that opinion isn’t widely shared.
some common algorithmic operations in Git are slower than in Pijul (pijul credit is much faster than git blame on large instances), and most operations are comparable in speed. One thing where Git is faster is browsing old history: the datastructures are ready in Pijul, but I haven’t implemented the operations yet (I promised I would do that as soon as this is needed by a real project).
In cases where rewriting history is viable, git filter-repo has a
--mailmap
flag that can use a.mailmap
file to modify past commits.I’ve found this useful for personal projects and teams small enough that asking people to make a fresh clone of the repo isn’t a huge bother (…and honestly, people’s frustrations with git frequently lead to fresh clones anyway).
It would be nice if there were a way to make the original commit hash a parent of the rewritten one so that anyone with a downstream clone can still pull, merge, and so on.
There’s one other place where mailmaps are important: GDPR and other regulatory compliance. Names and email addresses are PII and so you can be required by law to remove them from your git repo. Ideally, you’d do this by putting a UUID, rather than a name and email address, in each commit and then having a mailmap that turns these into human names, so you just need to edit the mailmap. This, unfortunately, requires storing the mailmap outside of the repo, because otherwise you end up with the same problem (you must rewrite history to edit it in the past). I believe Pijul has a good solution to this.
In Pijul all patches are signed, and authors are identified by their signing key. We still need a system for each repository to map keys to identities, but these maps are authenticated by signatures as well.
If you generate a new key per repo, that should be fine. Unfortunately, if the key is used across repos then it’s linkable data and has all of the same GDPR concerns.
Ideally, Pijul would sign each patch with an ephemeral key that it signed with your personal key. You’d then have a per-repo notion of ‘these commits are all signed by the same person’ but could store the signatures of the signatures separately and, if necessary, re-sign all of the commits with a new key to establish an attestation that the repo guarantees that they all came from the same person, but not in a way that identifies that person.
Is it just the KV store that would need to be replaced to self-host with workerd? Part of me is tempted to try to package this for Sandstorm, but any components that aren’t FOSS would of course be blockers.
Kenton was at one point excited about the idea of workers apps on Sandstorm, having started both projects. But up until now I haven’t seen any workers apps that seemed interesting to package.
Durable objects are really important for the Nest, but if you don’t require any distributed setup, they’re just fundamentally another layer of KV stores.
Their platform and setup is not at all how I would have done things, but the main value of CF Workers for me is not the software, it’s the scalability and reliability.
Someone pointed out to me that DO actually are in the FOSS version: https://github.com/cloudflare/workerd/pull/302
…so that’s exciting.
Are there any papers or posts that describe the current Pijul data model? I’d be interested in reading about the underlying format. I know it’s undergone some revisions over the past few years.
No, not really. The underlying format is complicated, mostly because it relies on Sanakirja, which is itself complicated.
i’m confused about the choice to run this on cloudflare workers. i understand that the author wants the code to be capable of running on more than a single platform. it just feels inevitable that the code gets tied up around proprietary cloudflare features.
my opinion: it’d have been better off as a monolith that runs on any old linux vps, not a bunch of newfangled stuff that targets a proprietary platform.
i like pijul, but this feels a little shortsighted to me
“Shortsighted” is a strong word here, I’ve been running a “your opinion”-style solution for years and know exactly why I can’t make it work.
“Proprietary Cloudflare Workers features” aren’t so unique that we can’t easily replicate them in a day of work (without the open source runtime, it would probably take a few days of work instead).
cloudflare is possibly the biggest threat to the free & open web, and hitching nest to it & claiming it’s for “ease of contribution” feels off optically. i have no doubt that you have good reasons, you’re much smarter than i - but intuitively, it feels wrong, and i believe this perception could be problematic. actually optimizing for contribution would mean making nest simple to run without a cloudflare account.
can i self-host nest without tying it to a proprietary platform? if the answer is no, it doesn’t feel very “open source”, even if the license says so.
I respectfully disagree, for two reasons:
I wouldn’t have done this without workerd and miniflare, two open source runtime environments to run Cloudflare Workers scripts outside of Cloudflare.
At a higher level, the biggest threat to the free and open internet isn’t a company providing useful services and infrastructure, it is the absence of political will, public investment and regulation. Europe could have done that: we have some of the world’s experts in databases, CRDTs, replication, programming languages… I actually made contributions in some of these fields, mostly with my own money and in my free time. There was also a time in Europe when we were not afraid of ambitious projects and regulations.
So, until Europe (or someone else: China?) invests in open and public cloud infrastructure and useful projects instead of what they’re doing (I won’t cite any example, but there are many), I’m fine using Cloudflare.
Cloudflare maintains two open source runtimes/simulators for cloudflare workers that you could probably use to self-host. The open source runtimes are workerd (based at least in part on the real code used at cloudflare) and miniflare (a simulator written in typescript).
This feels like an exaggeration. Cloudflare to my knowledge have only ever blocked one site for non-legal reasons, and they seemed to regret it. Unless cloudflare start disabling “DNS only” settings I can’t see the issue, could you enlighten me?
I don’t want to get into an argument about it, but it is untrue that Cloudflare has only blocked one site (except where compelled by law). Cloudflare kicked the fascism-promoting daily stormer and 8chan sites and kicked the hatespeech site kiwi farms.
Cloudflare also kicked switter.at, a mastodon instance for sex workers, which they claimed to be doing because of FOSTA, but they did this without warning and before any lawsuit was filed. In other cases cloudflare has fought censorship demands in the courts and won, and in the years following FOSTA it seems like courts are less willing to prosecute service providers than was originally thought, so cloudflare could probably have kept hosting switter.at, and could certainly have given them notice.
By the way, I am aware of these things with Cloudflare, and I do disagree with their responses (or delays, in the case of KiwiFarms) in 100% of the cases, but I’m willing to believe it is an irresponsible, rather than malevolent, behaviour, as is often the case in large organisations, especially if they’ve grown too fast.
I chose to work with CF anyway, because I don’t think that choice will influence Pijul’s future. If anything bad happens to us (censorship or otherwise) because of CF, it will mean that Pijul is big enough to be a problem, and that’ll prompt us to come up with a solution to keep going. I’m used to working with extremely limited setups and budgets, so that doesn’t scare me much.
My apologies, I only remembered them dropping KF.
No apology necessary, but thanks for being polite anyway :)
Using a privacy-oriented machine setup, and either on my real IP (which is not in the West) or on a commercial VPN, I constantly need to solve machine-learning hCAPTCHAs, to the point that I bail out if it doesn’t feel worth it, because it’s exhausting to do multiple times a day.
There is also IP-based blocking, and the service itself may have to comply with US embargoes, which can cut off entire regions even if the folks living there have nothing to do with their ruling government.
There’s also the moral issue of pointing at a proprietary service as the path of least setup resistance in this particular hosting case. Maybe we’ll get lucky and someone determined enough will make the NixOS module so that you can just services.pijul.enable = true, but until then I bet most of the resources will recommend signing up with another publicly-traded, proprietary service.
Is this happening for the Nest though?
Fortunately, no, I’ve never seen it with the current Nest at least. I stopped subscribing to a VPN though because of how many Cloudflare-fronted services prevented me from using them so I’ve not tested with them.
I know that Cloudflare gives me that option of “verifying” users, but that’s totally optional from what I’m seeing. I still dream of a public infrastructure that would provide the same thing, but it doesn’t exist yet.
The answer is yes: https://github.com/cloudflare/workerd
Thanks for sharing! It’s cool that you’re hosting your database on top of the Cloudflare KV store.
You say this should be more financially sustainable for you. Do you have any numbers for that?
Not yet, but I’ll probably share them when people start using it.
Without counting any external costs:
My goal with this serverless version is to provide a 100% reliable service, which the previous version could not possibly become from where it was, mostly because of the way PostgreSQL and Pijul have to work together (you can’t do a “join” between a Pijul branch and a PostgreSQL table, so you have to do lots of SQL requests, and have the servers close to your repos, which is hard to replicate). This will in turn allow me to sell “pro” accounts, which you can already buy in the beta version (nest.pijul.org) if you want to host private projects, so instead of having only bills to pay, I hope to stop losing (my own personal) money on that service.
FaaS means you don’t need to fix crashes, downtimes, servers, databases… It’s a nightmare to debug, especially if you’re like me and need to mix JS and WASM: no stack traces, no break points, manual debugging messages. But once you get past that, you can release confidently and sleep well at night. I guess the value of that increases as you age, so I can’t evaluate it properly.
The comparison between the provider bills is probably negligible next to these two parameters.
Aside: darcs is still kicking. Are the performance issues overblown in comparison to Pijul?
Pijul used to be double-exponentially faster for applying a patch. With the recent improvements in Darcs, Pijul is now just exponentially faster (Darcs is linear in the size of history, while Pijul is logarithmic).
For some reason, it seems many people assumed you couldn’t do that and were forced to use nest.pijul.com. But this self-hosting solution was always there, starting around the end of 2015, i.e. way before the Nest existed.
I had this issue when I attempted to set that up related to key prove: https://nest.pijul.com/pijul/pijul/discussions/742
Didn’t continue setup since then. Do you know if this is still an issue?
Thanks.
Key prove doesn’t work for repos, indeed.
oh cool, this looks very simple and straightforward, i might have to set it up! have you ever experimented with nidobyte? i’ve been interested, but haven’t yet played around with it
I believe that project has stopped and the author now works for a commercial GitHub competitor if I’m not mistaken.
Also, we’re about to release a new, open source version of the Nest with cool new ideas.
looking forward to this! Excited! Thank you.
always exciting to hear about pijul! i think the biggest blocker to me using it more widely is that it isn’t supported as a way to version control Nix flakes, which requires explicit support in Nix. once I have a bit more free time and motivation, I hope to contribute a fetchpijul primitive to Nix to start building that.
This would be nice indeed. There’s a planned feature aiming at making a hybrid system between patches and snapshots. You would have the ease of use of patches (commutativity etc) plus the history navigation of snapshots.
This is relatively easy to do, all the formats are ready for it, but I haven’t found the time to do it yet (nor have I found the money to justify my time on this).
Mostly a bug fixing release, but with an unprecedented number of external contributions. The new identity system is really novel. I should probably write a blog post announcing this release properly, but it would probably be full of minor points.
So where can we go to learn about the identity system?
Do you have a changelog that records the external contributions (in a way that non-experts can understand?). If so, can you point to that?
We do have a changelog, but it hasn’t been updated for this particular release yet.
This is a great example of why I’d really like a “flag: blatantly incorrect”.
On Mercurial, as many others have pointed out, the author didn’t actually read the page they wrote on. Contemporary Mercurial runs great on Python 3, thank you very much, and is well on the way to porting to Rust.
I’m no Fossil fan, but even I know that the error they’re highlighting with Fossil is because the Git repository isn’t valid. [edit: see note below; I’m wrong on the issue on that repo, but the rest of this paragraph is valid.] I hit this type of issue all the time while working on Kiln’s Harmony code. If you’re bored, try setting fsckObjects to true in your .gitconfig.
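If you want to try it, these are the relevant config keys (the transfer one covers both fetch and receive):

    git config --global transfer.fsckObjects true
    # or, per direction:
    git config --global fetch.fsckObjects true
    git config --global receive.fsckObjects true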
You won’t be able to clone roughly a third of major repos on GitHub. (The ratio improves dramatically for more recent projects, until it hits 100% success for stuff that was kicked off in the last couple of years, but anything five or more years old, you’re gonna hit this type of issue.)
Darcs is evicted over its website being HTTP, which has nothing to do with Darcs. Pijul over the author having no clue how to write code in Windows.
The author is trolling. Poorly. And somehow this is highly ranked on the front page. Let’s do better.
Edit: correction on the Fossil bit: after two seconds of poking with Fossil and Git fast-export on Windows, the actual issue with their Windows repository is that they’re in a command shell that’s piping UTF-16 and/or sending \r\n. This is also something any Windows dev should recognize and know how to deal with. If you’ve set the default shell encoding to UTF-8, or use a shell like yori that does something sane, you won’t hit this issue, but I’d sure hope that anyone using Windows full-time who refuses to set a UTF-8 default is habitually slamming | Out-File -encoding utf8 after commands like that.
For Pijul it’s even worse: it’s over (1) the author relying on contributors to test and fix on platforms other than Linux, and (2) WSL not implementing POSIX correctly, but blaming Pijul’s author is easier than blaming Microsoft.
Also, she knows it, has contributed to Pijul after that post, and has never updated it.
I wouldn’t go as far as saying “the author is trolling”. To me it’s an opinion piece where, sure, the author could have put some more effort, but that’s their particular style and how they chose to write it. To be fair I did not expect it to be this controversial. It does mention some points on what a “better git” would need to do, even if it’s a bit snarky. Particularly the suggestions at the end. At the very least this thread has brought a bunch of good alternatives and points 🤷♀️.
I’m glad the post essentially says “no” after trying most contenders. I hope people make better porcelain for git, but moving to another data model makes little sense to me, and I hope the plumbing remains.
I did kind of raise my eyebrows at the first sentence:
It’s been a long time since git was developed only for the needs of Linux! For example github has a series of detailed blog posts on recent improvements to git: 1 2 3 4 5 6.
The problem with Git is not the “better porcelain”, the current one is fine. The problem is that the fundamental data model doesn’t reflect how people actually work: commits are snapshots, yet all UIs present them as diffs, because that’s how people reason about work. The result of my work on code has never been an entire new version of a repository; in all cases I can remember, I’ve only ever made changes to existing (or empty) repos.
This is the cause of bad merges, git rerere, poorly-handled conflicts etc., which waste millions of man-hours globally every year.
I don’t see any reason to be “glad” that the author of that post didn’t properly evaluate alternatives (Darcs dismissed over HTTP and Pijul over WSL being broken: easier to blame the author than Microsoft).
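The split between snapshot storage and diff presentation is easy to see with plumbing versus porcelain:

    git cat-file -p HEAD   # what is stored: a commit pointing at a full tree, plus parents
    git show HEAD          # what every UI presents: a diff against the parent, computed on the fly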
In my experience, the more pain and suffering one has spent learning Git, the more fiercely one defends it.
Not my experience. I was there at the transition phase, teaching developers git. Some hadn’t used SCM at all, most knew SVN.
The overall experience was: They could do just as much of their daily work as with SVN or CVS very quickly, and there were a few edge cases. But if you had someone knowledgeable it was SO much easier to fix mistakes or recover lost updates. Also if people put in a little work on top of their “checkout, commit, branch, tag” workflow they were very happy to be able to adapt it to their workflow.
I’m not saying none of the others would do that or that they wouldn’t be better - all I’m saying is that IMHO git doesn’t need fierce defense. It mostly works.
(I tried fossil very briefly and it didn’t click, also it ate our main repo and we had to get the maintainer to help :P and I couldn’t really make sense of darcs. I never had any beef with Mercurial and if it had won over git I would probably be just as happy, although it was a little slow back then… I’ve not used it in a decade I guess)
The underlying data model problem is something that I’ve run across with experienced devs multiple times. It manifests as soon as your VCS has anything other than a single branch with a linear history:
If I create a branch and then add a commit on top, that commit doesn’t have an identity, only the new branch head does. If I cherry-pick that commit onto another branch and then try to merge the two, there’s a good chance that they’ll conflict because they’re both changing the same file. This can also happen after merging in anything beyond the trivial cases (try maintaining three branches with frequent merges between pairs of them and you’ll hit situations where a commit causes merge conflicts with itself).
Every large project where I’ve used git has workflows designed to work around this flaw in the underlying data model. I believe Pijul is built to prevent this by design (tracking patches, rather than trees, as the things that have identity) but I’ve never tried it.
I don’t understand, is it “not your experience” that:
In any case, none of these things is contradicted by your explanation that Git came after SVN (I was there too). That said, SVN, CVS, Git, Fossil and Mercurial have the same underlying model: snapshots + 3-way merge. Git and Mercurial are smarter by doing it in a distributed way, but the fundamentals are the same.
Darcs and Pijul do things differently, using actual algorithms instead of hacks. This is never even hinted at in the article.
I simply think it matches very well. Yes, we could now spend time arguing if it’s just the same as SVN, but that was not my real point.
Not the OP, but I’ll respond with my experiences:
It’s more meaningful (to me at least) to show what changed between two trees, with that change as the primary representation rather than the tree itself, but I don’t tend to use workflows that rely on merges as their primary integration method. Work on a feature, rebase on mainline, run tests, merge clean.
The author’s use cases are contradicted by the vast majority that use git successfully regardless of the problems cited. I’d say the only points that I do agree with the author on are:
Not the author’s quote, but there’s a missing next step there, which is to examine what happens if we actually do that part better. Fossil kinda has the right approach there. As do things like BeOS’s BFS filesystem, or WinFS, which both built database-like concepts into a filesystem. Some of the larger Git systems build a backend using databases rather than files, so there’s no real problem there that is not being worked on.
The one thing I’d like git to have is the idea of correcting / annotating history. Let’s say a junior dev makes 3 commits with such messages as ‘commit’, ‘fixed’, ‘actual fix’. Being able to group and reword those commits into a single commit say ‘implemented foobar feature’ sometime after the fact, without breaking everything else would be a godsend. In effect, git history is the first derivative of your code (dCode/dTime), but there’s a missing second derivative.
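For comparison, the closest thing today is an interactive rebase, which does rewrite every descendant hash; a sketch:

    # squash 'commit', 'fixed' and 'actual fix' into one commit
    git rebase -i HEAD~3
    # in the todo list: keep the first line as 'pick', change the next two to
    # 'squash' (or 'fixup'), then reword the message to 'implemented foobar feature'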
Snapshots can be trivially converted to diffs and vice-versa, so I don’t see how this would impact merges. Whatever you can store as patches you can store as a sequence of snapshots that differ by the patch you want. Internally git stores snapshots as diffs in pack files anyway. Is there some clever merge algorithm that can’t be ported to git?
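Both conversions are indeed one command each in stock git (a sketch):

    # snapshot -> diff: render a commit as a patch
    git format-patch -1 HEAD --stdout > change.patch

    # diff -> snapshot: apply a patch, producing a new commit (i.e. a new full snapshot)
    git am change.patch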
What git is missing is ability to preserve “grafts” across network to ensure that rebase and other branch rewrites don’t break old commit references.
I actually thought about the problem a bit (like, for a few years) before writing that comment.
Your comment sounds almost reasonable, but its truth depends heavily on how you define things. As I’m sure you’re aware, isomorphisms between data structures are only relevant if you define the set of operations you’re interested in.
For a good DVCS, my personal favourite set of operations includes:
If you try to convert a Pijul repo to a Git repo, you will lose information about which patch solved which conflict. You’ll only see snapshots. If you try to cherry-pick and merge, you’ll get odd conflicts and might even need to use git rerere.
The other direction works better: you can convert a Git repo to a Pijul repo without losing anything meaningful for these operations. If you do it naïvely you might lose information about branches.
Betteridge’s law of headlines strikes again.
Not really, Betteridge’s Law is better applied to headlines like “Will there ever be a better VCS than Git?”
By assuming the answer to the headline in question is the default “No”, you’re basically assuming Git will never be surpassed.
That makes me sad. :-(
Honestly I’m of the opinion that git’s underlying data model is actually pretty solid; it’s just the user interface that’s dogshit. Luckily that’s the easiest part to replace, and it doesn’t have any of the unfortunate network effect problems of changing systems altogether.
I’ve been using magit for a decade and a half; if magit (or any other alternate git frontends) had never existed, I would have dumped git ages ago, but … you don’t have to use the awful parts?
For what it’s worth, I do disagree, but not in a way relevant to this article. If we’re going to discuss Git’s data model, I’d love to discuss its inability to meaningfully track rebased/edited commits, the fact that heads are not version tracked in any meaningful capacity (yeah, you’ve got the reflog locally, but that’s it), that the data formats were standardized at once too early and too late (meaning that Git’s still struggling to improve its performance on the one hand, and that tools that work with Git have to constantly handle “invalid” repositories on the other), etc. But I absolutely, unquestionably agree that Git’s UI is the first 90% of the problem with Git—and I even agree that magit fixes a lot of those issues.
The lack of ability to explicitly store file moves is also frustrating to me.
Don’t forget that fixing capitalization errors with file names is a huge PITA on Mac.
I’ve come to the conclusion that there’s something wrong with the data model in the sense that any practical use of Git with a team requires linearization of commit history to keep what’s changing when straight. I think a better data model would be able to keep track of the history branches and rebases. A squash or rebase should include some metadata that lets you get back the state before the rebase. In theory, you could just do a merge, but no one does that at scale because they make it too messy to keep track of what changed when.
I don’t think that’s a data model problem. It’s a human problem. Git can store a branching history just fine. It’s just much easier for people to read a linearized list of changes and operate on diffs on a single axis.
Kind of semantic debate whether the problem is the data model per se or not, but the thing I want Git to do—show me a linear rebased history by default but have the ability to also show me the pre-flattened history and the branch names(!) involved—can’t be done by using Git as it is. In theory you could build what I want using Git as the engine and a new UI layer on top, but it wouldn’t be interoperable with other people’s use of Git.
It already has a distinction between git log, git log --graph and git log --decorate (if you don’t delete branches that you care about seeing). And yeah, you can add other UIs on top.
BTW: I never ever want my branch names immortalized in the history. I saw Mercurial do this, and that was the last time I ever used it. IMHO people confuse having a record of changes, and the ability to roll them back precisely, with indiscriminately recording how the sausage has been made. These are close, but not the same.
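For reference, the usual combined invocation, if you want all of that at once:

    git log --oneline --graph --decorate --all   # draw the branching structure with branch/tag names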
git merge --no-ff (IMO the only correct merge for more than a single commit) does use the branch name, but the message is editable if your branch had a useless name.
None of those show squashes/rebases.
They’re not supposed to! Squashing and amending are important tools for cleaning up unwanted history. This is a very important ability, because it allows committing often, even before each change is final, and then fixing it up into readable changes rather than “wip”, “wip”, “oops, typo”, “final”, “final 2”.
What I’m saying is, I want Git for Git. I want the ability to get back history that Git gives me for files, for Git itself. Git instead lets you either have one messy history (with a bunch of octopus merges) or one clean history (with rebase/linearization). But I want a clean history that I can see the history of and find out about octopuses (octopi?) behind it.
No. The user interface is one of the best parts of Git, in that it reflects the internals quite transparently. The fundamental storage doesn’t model how people work: Git reasons entirely in terms of commits/snapshots, yet any view of these is 100% of the time presented as diffs.
Git will never allow you to cherry-pick meaningfully, and you’ll always need dirty hacks like rerere to re-solve already-solved conflicts. Not because of porcelain (that would have been solved ten years ago), but because snapshots aren’t the right model for that particular problem.
How many people do all their filesystem work with CLI tools these days? Why should we do it for a content-addressable filesystem with a builtin VCS?
Never heard anyone complain that file managers abstract mv as “rename” either, why can’t git GUIs do the same in peace?
At least one. But I also prefer cables on my headphones.
Oh thank goodness, There’s two of us. I’m not alone!
Sounds interesting, but i definitely don’t grok it. I get the problems it points out, but definitely don’t get how it concretely fixes them. Probably just need to mess with it to understand better.
Also, being able to swap any commit order easily sounds like an anti-feature to me. I think history should not be edited generally speaking. Making that easy sounds concerning.
This is not what Pijul does. If you want a strict linear ordering in Pijul, you can have it.
Again, not claiming to grok Pijul at all, but isn’t that specifically the feature emphasized here: https://pijul.org/manual/why_pijul.html#change-commutation
I get that you don’t have to use a feature even if it exists, but having a feature means someone might use it even if that is a bad idea.
I probably just don’t understand the feature.
The commutation feature means that rebase and merge are the same operation, or rather, that it doesn’t matter which one you do. Patches are ordered locally in each Pijul repository, but they are only partially ordered globally, by their dependencies.
What Pijul gives you is a datastructure that is aware of conflicts, and where the order of incomparable patches doesn’t matter: you’ll get the exact same snapshot with different orders.
You can still bisect locally, and you can still get tags/snapshots/version identifiers.
What I meant in my previous reply is that if your project requires a strict ordering (some projects are like that), you can model it in Pijul: you won’t be able to push a new patch without first pushing all the previous ones.
But not all projects are like that, some projects use feature branches and want the ability to merge (and unmerge) them. Or in some cases, your coauthors are working on several things at the same time and haven’t yet found the time to clean their branches, but you want to cherrypick one of their “bugfix” patches now, without (1) waiting for the bugfix to land on main and (2) without dealing with an artificial conflict between your cherry-picking and the landed bugfix in the future.
That makes a lot of sense. Thanks for the details!
I haven’t really had the time yet to read through the documentation, but how does Pijul manage its state? A big problem with Git is that it sort of assumes everything is on disk and that you always have direct access to the data. This causes problems for large Git hosts (e.g. it was a big issue at GitLab), as it’s not feasible to give e.g. web workers access to 20+ shared disks (e.g. using NFS, yuck). The result is that at such a scale you basically end up having to write your own Git daemons with clustering support and all that. It would be nice if new VCS systems were better equipped to be used at such a scale.
Good question. Pijul can separate patches into an operational part (which contains only +/- from diffs, but with byte intervals rather than actual contents) and a contents part. Its internal datastructure works on the operational part. It is relatively efficient at the moment, but grows linearly with the size of history. We have plans to make it more efficient in the future, if this becomes a problem for some users.
When downloading a bunch of patches, the larger ones aren’t downloaded entirely if they were superseded by newer versions.
But I have to admit that while we have tested Pijul on large histories (by importing Git repos), we haven’t really used it at scale in practice. And as anyone with problems “at scale” knows, experimental measurement is what matters here.
See also: jujutsu:
https://github.com/martinvonz/jj
Not quite the same. Jujutsu, Gitless and Stacked Git are UIs written on top of Git. The UX is better than Git, but they inherit all the issues of not having proper theoretical foundations. Mercurial is in that category too: better UX, same problems.
Pijul solves the most basic problem of version control: all these tools try to simulate patch commutation (using mixtures of merges and rebases), but would never say it like that, while Pijul has actual commutative patches.
Have you read the detailed conflicts document? https://github.com/martinvonz/jj/blob/main/docs/conflicts.md It also links to a more technical description: https://github.com/martinvonz/jj/blob/main/docs/technical/conflicts.md
Yes. As someone who cares a lot about version control, I have read all I could about Jujutsu when it was first released. I actually find it nice and interesting.
What I meant in my comment above was that while I acknowledge that the snapshot model is dominant and incremental improvements are a very positive thing, I also don’t think this is the right way to look at the problem:
All the algorithms modelling these problems in the context of concurrent datastructures use changes rather than snapshots: CRDTs, OTs… (I’m talking about the “latest” version being a CRDT, here, not the entire DAG of changes).
In all the Git UIs that I know of, the user is always shown commits as patches. Why represent it differently internally?
What’s presented to the user does not need to be the internal representation. Users and most tools work on a snapshot of the source repo, yet you can represent that snapshot as a set of patches internally. That, however, does not necessarily mean that either snapshots or sets of patches are superior to the other. Besides, any practical VCS would have both representations available anyway.
Good point, and actually Pijul’s internal representation is far from being as simple as just a list of patches.
However, what I meant there wasn’t about the bytes themselves, but rather about the operations defined on the internal datastructure. When your internals model snapshots (regardless of what bytes are actually written), all your operations will be on snapshots – yet, Git never graphically shows anything as snapshots, all UIs show patches. This has real consequences visible in the real world, for example the lack of associativity (bad merges), unintelligible conflicts (hello, git rerere), endless rebases…
Also, my main interests in this project are mathematical (how to model things properly?) and technical (how to make the implementation really fast?). So, I do claim that patches can simulate snapshots at no performance penalty, whereas the converse isn’t true if you want to do a proper merge and deal with conflicts rigorously.
Yeah, I do think that, as many people have commented, collaboration networks like GitHub are one thing any new DVCS will either need to produce or need to somehow be compatible with. Even GitHub is famously unusable for the standard kernel workflow, and it could be argued that it suffers because of that.
I really like the Jujutsu compromise of allowing interop with all the common git social networks at the same time as allowing more advanced treatment of conflicts and ordering between operations.
There isn’t a document yet on how the transition to the native backend in a git social network world would look.
I also think that the operation log of jujutsu not being sharable is a limitation that would be nice to cram into some hidden data structures in the actual git repo, but then you have a chicken and egg problem of how to store that operation…
So, it seems the phrase “the most basic” was unclear in my comment above: I meant that from a theory point of view, the most basic problem is “what is a version control system?”, and that is what we tried to tackle with Pijul.
Have they open sourced the hosting webapp yet?
I have plans to do it really soon, that’s actually my next project for Pijul.
Is it possible to host Pijul without using Nest? Like with Git creating bare repo available via SSH? I couldn’t find that anywhere in the docs.
A very frequently asked question, yet it is a chapter of the manual; it couldn’t be much clearer: https://pijul.com/manual/working_with_others.html
Pijul looks very cool. Do you consider it, in its current state, to be a production-ready replacement for git? I checked the FAQ but didn’t see that question.
That’s really exciting to hear; the fact that the web UI was proprietary is something that I’d always found disappointing. What prompted the change of heart, may I ask?
My understanding is that the plan has always been to open source it, but that a closed source approach was taken at first so people would focus on contributions to the Pijul protocol rather than to the relatively unimportant CRUD app.
Exactly. Maintaining open source projects takes time; one has to prioritise. In my experience, many of the “future eager contributors” to the Nest didn’t even care to look at Pijul’s source code, which makes me seriously doubt their exact intentions.
No change of heart. Pijul is usable now, so the focus can move on to something else. Also, I find it easier to guarantee the security (and react in case of a problem) of a web server written 100% by me.
Not that I am aware of. The sad thing is @pmeunier remains the cornerstone of this project and, as far as I understand, he lacks time to work on pijul these days (which is totally okay, don’t get me wrong; the amount of work he has already put into this project is tremendous).
I am curious to see what the future holds for pijul, but I am a bit pessimistic I admit.
Any concrete argument?
It is true that I haven’t had much time in the last few weeks, but that doesn’t mean Pijul is unusable, or that I won’t come back to it once my current workload lightens. Others have started contributing at a fast pace, actually.
That is wonderful news! I keep wishing the best for pijul, even though I don’t use it anymore. This remains a particularly inspiring piece of software to me. Sorry if my comment sounded harsh or unjust to you; I should know better than to write about my vague pessimism when the Internet is already a sad enough place. I really wish you all the best for pijul (:
I seem to recall that there was an announcement from a pijul author that pijul was in maintenance mode and that he was working on a new VCS. I can’t find mention of this VCS now though. Does anyone remember this?
I’m the author, this is totally wrong.
Thank you for the clarification. I’m not sure how I ended up remembering something that never happened.
I seriously doubt this is the case. There’s been a huge amount of work on pijul and its related ecosystem.
If I’m wrong, I’d love to know otherwise.
My best guess is that Git is a local maximum we’re going to be stuck on until we move away from the entire concept of “historic sequence of whole trees of static text” as SCM.
darcs/pijul are the move away from fixed sequences of entire blobs. If only there was some powerful force to drive the adoption of pijul…
There’s nothing more powerful than people and projects adopting it one by one. If you start a new project, using Pijul and the Nest is the best thing you can do to make the project grow.
The interface of Git and its underlying data models are two very different things, that are best treated separately.
The interface is pretty bad. If I wasn’t so used to it I would be fairly desperate for an alternative. I don’t care much for the staging area, I don’t like having to clean up my working directory every time I need to switch branches, and I don’t like how easy it is to lose commits from a detached HEAD (though there’s always git reflog, I guess).
The underlying data model, however, is pretty good. We can probably ditch the staging area, but apart from that, viewing the history of a repository as a directed graph of snapshots is nice. It captures everything we need. Sure, patches have to be derived from those snapshots, but we care less about the patches than we care about the various versions we saved. If there’s one thing we need to get right, it’s those snapshots. You get reproducible builds & tests from them, not from patches. So I think patches are secondary. I used to love DARCS, but I think patch theory was probably the wrong choice.
Now one thing Git really really doesn’t like is large binary files. Especially if we keep changing them. But then that’s just a compression problem. Let the data model pretend there’s a blob for each version of that huge file, even though in fact the software is automatically compressing & decompressing things under the hood.
What’s wrong with the staging area? I use it all the time to break big changes into multiple commits and smaller changes. I’d hate to see it removed just because a few people don’t find it useful.
Absolutely, I would feel like I’m missing a limb without the staging area. I understand that it’s conceptually difficult at first, but imo it’s extremely worth the cost.
Do you actually use it, or do you just do git commit -p, which only happens to use the staging area as an implementation detail?
And how do you test the code you’re committing? How do you make sure that the staged hunks aren’t missing another hunk that, for example, changes the signature of the function you’re calling? It’s a serious slowdown in workflow to need to wait for CI rounds, stash and rebase to get a clean commit, and push again.
Yes. rebase with --exec.
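For readers who haven’t used that flag, a minimal sketch of what that looks like; the test command and the base branch name here are assumptions, not taken from the comment above:

    # replay the branch onto main, running the test suite after each replayed commit;
    # the rebase stops at the first commit whose tests fail
    git rebase --exec "make test" main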
I git add -p to the staging area and then diff it before generating the commit. I guess that could be done without a staging area using a different workflow, but I don’t see the benefit (even if I have to check git status for the command every time I need to unstage something (-: ).
As for testing, since I’m usually using GitHub I use the PR as the base unit that needs to pass a test (via squash merges, the horror I know). My commits within a branch often don’t pass tests; I use commits to break things up into sections of functionality for my own benefit going back later.
Just to add on, the real place where the staging area shines is with git reset -p. You can reset part of a commit, amend the commit, and then create a new commit with your (original) changes or continue editing. The staging area becomes more useful the more you do commit surgery.
Meh, you don’t need a staging area for that (or anything). hg uncommit -i (for --interactive) does quite the same thing, and because it has no artificial staging/commit split it gets to use the clear verb.
I don’t see the cost.
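A rough sketch of the git reset -p surgery described a couple of comments up, assuming the most recent commit mixed two unrelated changes:

    git reset -p HEAD^            # interactively move the chosen hunks out of the index (back to the pre-commit state)
    git commit --amend --no-edit  # rewrite the last commit without those hunks
    # the removed hunks are still sitting in the working tree:
    git add -p                    # stage them again (or keep editing first)
    git commit                    # ...and commit them separately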
If you ever need to git bisect, you may come to regret that. I almost never use git bisect, but for the few times I did need it it was a life saver, and passing tests greatly facilitate it.
I bisect every so often, but on the squashed PR commits on main, not individual commits within a PR branch. I’ve never needed to do that to diagnose a bug. If you have big PRs, don’t squash, or don’t use a PR-based workflow, that’s different of course. I agree with the general sentiment that all commits on main should pass tests for the purposes of bisection.
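When the commits being searched do pass their tests, the hunt can be automated end to end; a small sketch, where the test command and the known-good tag are assumptions:

    git bisect start HEAD v1.2.0   # a bad revision, then a known-good one (made-up tag)
    git bisect run make test       # git checks out midpoints and runs the tests for you
    git bisect reset               # return to where you started once the culprit is found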
I use git gui for committing (the built-in git gui command), which lets you pick by line, not just hunks. Normally the things I’m excluding are stuff like enabling debug flags, or just extra logging, so it’s not really difficult to make sure it’s correct. Not saying I never push bad code, but I can’t recall an instance where I pushed bad code because of that, so I use the index to choose parts of my unfinished work to save in a stash (git stash --keep-index), and sometimes if I’m doing something risky and iterative I’ll periodically add things to the staging area as I go so I can have some way to get back to the last known good point without actually making a bunch of commits (I could rebase after, yeah, but meh).
It being just an implementation detail in most of that is a fair point though.
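A sketch of the index-as-checkpoint trick described above; the staged state acts as the last known-good point, and the stash temporarily parks everything else (the test command is a stand-in):

    git add -p                    # checkpoint: stage the last known-good state
    # ...keep experimenting in the working tree...
    git stash push --keep-index   # park the unstaged experiments; the staged checkpoint stays put
    make test                     # the working tree now matches the checkpoint, so it can be tested as-is
    git stash pop                 # bring the experiments back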
I personally run the regression test (which I wrote) to test changes.
Then I have to wait for the code review (which in my experience has never stopped a bug going through; when I have found bugs, in code reviews, it was always “out of scope for the work, so don’t fix it”) before checking it in. I’m dreading the day when CI is actually implemented as it would slow down an already glacial process [1].
Also, I should mention I don’t work on web stuff at all (thank God I got out of that industry).
[1] Our customer is the Oligarchic Cell Phone Company, which has a sprint of years, not days or weeks, with veto power over when we deploy changes.
Author of the Jujutsu VCS mentioned in the article here. I tried to document at https://github.com/martinvonz/jj/blob/main/docs/git-comparison.md#the-index why I think users don’t actually need the index as much as they think.
I missed the staging area for at most a few weeks after I switched from Git to Mercurial many years ago. Now I miss Mercurial’s tools for splitting commits etc. much more whenever I use Git.
Thanks for the write up. From what I read it seems like with Jujutsu if I have some WIP of which I want to commit half and continue experimenting with the other half I would need to commit it all across two commits. After that my continuing WIP would be split across two places: the second commit and the working file changes. Is that right? If so, is there any way to tag that WIP commit as do-not-push?
Not quite. Every time you run a command, the working copy is snapshotted and becomes a real commit, amending the previous working-copy commit. The changes in the working copy are thus treated just like any other commit. The corresponding thing to git commit -p is jj split, which creates two stacked commits from the previous working-copy commit, and the second commit (the child) is what you continue to edit in the working copy.
Your follow-up question still applies (to both commits instead of the single commit you seemed to imagine). There’s not yet any way of marking the working copy as do-not-push. Maybe we’ll copy Mercurial’s “phase” concept, but we haven’t decided yet.
Way I see it, the staging area is a piece of state needed specifically for a command line interface. I use it too, for the exact reason you do. But I could do the same by committing it directly. Compare the possible workflows. Currently we do:
Without a staging area, we could instead do that:
And I’m not even talking about a possible GUI for the incremental making of several commits.
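As a rough illustration of the contrast being drawn here (the exact commands are a guess, not the original listings):

    # with a staging area: build the commit up piece by piece, then commit it
    git add -p
    git add -p
    git commit

    # without one: commit a first piece, then fold further pieces into that commit
    git commit -p -m "WIP"
    git commit -p --amend --no-edit
    git commit -p --amend --no-edit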
Personally I use git add -p all of the time. I’ve simply been burned by the other way too many times. What I want is not to save commands but to have simple commands that work for me in every situation. I enjoy the patch selection phase. More often than not it is what triggers my memory of a TODO item I forgot to jot down, etc. The patch selection is the same as reviewing the diff I’m about to push, but it lets me do it incrementally so that when I’m (inevitably) interrupted I don’t have to remember my place.
From your example workflows it seems like you’re interested in avoiding multiple commands. Perhaps you could use git commit -a most of the time? Or maybe add a commit-all alias?
Never got around to writing that alias, and if I’m being honest I quite often git diff --cached to see what I’ve added before I actually commit it.
I do need something that feels like a staging area. I was mostly wondering whether that staging area really needed to be implemented differently than an ordinary commit. Originally I believed commits were enough, until someone pointed out pre-commit hooks. Still, I wonder why the staging area isn’t at least a pointer to a tree object. It would have been more orthogonal, and likely require less effort to implement. I’m curious what Linus was thinking.
Very honourable to revise your opinion in the face of new evidence, but I’m curious to know what would happen if you broadened the scope of your challenge with “and what workflow truly requires pre-commit hooks?”!
Hmm, that’s a tough one. Strictly speaking, none. But I can see the benefits.
Take Monocypher for instance: now it’s pretty stable, and though it is very easy for me to type make test every time I modify 3 characters, in practice I may want to make sure I don’t forget to do it before I commit anything. But even then there are 2 alternatives:
I use git add -p all the time, but only because Magit makes it so easy. If I had an equally easy interface to something like hg split or jj split, I don’t think I’d care about the lack of an index/staging area.
Do you actually add your entire working directory most of the time? Unless I’ve just initialized a repository I essentially never do that.
Here’s something I do do all the time, because my mind doesn’t work in a red-green-refactor way:
Get a bug report
Fix bug in foo_controller
Once the bug is fixed, I finally understand it well enough to write an automated regression test around it, so go do that in foo_controller_spec
Run test suite to ensure I didn’t break anything and that my new test is green
Add foo_controller and foo_controller_spec to staging area
Revert working copy (but not staged copy!) of foo_controller (but not its spec)
Run test suite again and ensure I have exactly one red test (the new regression test). If yes, commit the stage.
If no, debug spec against old controller until I understand why it’s not red, get it red, pull staged controller back to working area, make sure it’s green.
—
Yeah, I could probably simulate this by committing halfway through and then doing some bullshit with cherry-picks from older commits and in some cases reverting the top commit but, like, why? What would I gain from limiting myself to just this awkward commit dance as the only way of working? That’s just leaving me to cobble together a workflow that’s had a powerful abstraction taken away from it, just to satisfy some dogmatic “the commit is the only abstraction I’m willing to allow” instinct.
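A sketch of the sequence from the list above in plain git, using the same file names (the test-suite invocation is a stand-in):

    git add foo_controller foo_controller_spec  # stage the fix and the new regression test
    git restore -s HEAD foo_controller          # working copy goes back to the unfixed controller; the stage keeps the fix
    make test                                   # expect exactly one red test (the new regression test)
    git restore foo_controller                  # pull the staged, fixed controller back into the working copy
    make test                                   # everything green again
    git commit                                  # commit exactly what was staged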
Yes. And when I get a bug report, I tend to first reproduce the bug, then write a failing test, then fix the code.
Sounds useful. How do you do that?
You can checkout a file into your working copy from any commit.
Right, but that was just one example. Everything in your working copy should always be committed at all times? I’m almost never in that state. Either I’ve got other edits in progress that I intend to form into later commits, or I’ve got edits on disk that I never intend to commit but in files that should not be git ignored (because I still intend to merge upstream changes into them).
I always want to be intentionally forming every part of a commit, basically.
git add foo_controller <other files>; git restore -s HEAD foo_controller, and then git restore foo_controller will copy the staged version back into the working set.
TBH, I have no idea what “git add -p” does off hand (I use Magit), and I’ve never used staging like that.
I had a great example use of staging come up just yesterday. I’m working in a feature branch, and we’ve given QA a build to test what we have so far. They found a bug with views, and it was an easy fix (we didn’t copy attributes over when copying a view).
So I switched over to views.cpp and made the change. I built, tested that specific view change, and in Magit I staged that specific change in views.cpp. Then I committed, pushed it, and kicked off a pipeline build to give to QA.
I also use staging all the time if I refactor while working on new code or fixing bugs. Say I’m working on “foo()”, but while doing so I refactor “bar()” and “baz()”. With staging, I can isolate the changes to “bar()” and “baz()” in their own commits, which is handy for debugging later, giving the changes to other people without pulling in all of my changes, etc.
Overall, it’s trivial to ignore staging if you don’t want it, but it would be a lot of work to simulate it if it weren’t a feature.
I’m sure you do – that’s how it was meant to be used. But you might as well use commits as the staging area – it’s easy to commit and squash. This has the benefit that you can work with your whole commit stack at the same time. I don’t know what problem the staging area solves that isn’t better solved with commits. And yet, the mere existence of this unnecessary feature – this implicitly modified invisible state that comes and crashes your next commit – adds cognitive load: commands like git mv, git rm and git checkout pollute the state, then git diff hides it, and finally, git commit --amend accidentally invites it into the topmost commit.
The combo of being not useful and a constant stumbling block makes it bad.
If I’ve committed too much work in a single commit how would I use commits to split that commit into two commits?
Using e.g. hg split or jj split. The former has a text-based interface similar to git commit -p as well as a curses-based TUI. The latter lets you use e.g. Meld or vimdiff to edit the diff in a temporary directory and then rewrites the commit and all descendants when you’re done.
That temporary directory sounds a lot like the index – a temporary place where changes to the working copy can be batched. Am I right to infer that the benefit you find is in having a second working copy in a temp directory, because it works better with other tools that expect to work on files?
The temporary directory is much more temporary than the index - it only exists while you split the commit. For example, if you’re splitting a commit that modifies 5 files, then the temporary directory will have only 2*5 files (for before and after). Does that clarify?
The same solution for selecting part of the changes in a commit is used by jj amend -i (move into the parent of the specified commit, from the working-copy commit by default), jj move -i --from <rev> --to <rev> (move changes between arbitrary commits), etc.
I use git revise. Interactive revise is just like interactive rebase, except that it has a cut subcommand. This can be used to split a commit by selecting and editing hunks like git commit -p.
Before git-revise, I used to manually undo part of the commit, commit that, then revert it, and then squash the undo-commit into the commit to be split. The revert-commit then contains the split-off changes.
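A sketch of that manual dance, with a made-up file name standing in for the part being split off:

    # HEAD mixes two changes; split the notes.txt part out of it by hand
    git checkout HEAD^ -- notes.txt        # undo that part in the index and working tree
    git commit -m "undo notes.txt part"    # the temporary undo-commit
    git revert --no-edit HEAD              # the revert-commit now carries the split-off changes
    git rebase -i HEAD~3                   # finally, mark the undo-commit as a fixup of the original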
I don’t know, I find it useful. Maybe if git built in Mercurial’s “place changes into a commit that isn’t the most recent” amend thing then I might have an easier time doing things, but just staging up relevant changes in a patch-based flow is pretty straightforward and helpful IMO.
I wonder if this would be as controversial if patching was the default
What purpose does it serve that wouldn’t also be served by first-class rollback and an easier way of collapsing changesets on their way upstream? I find that most of the benefits of smaller changesets disappear when they don’t have commit messages, and when using the staging area for this you can only rollback one step without having to get into the hairy parts of git.
The staging area is difficult to work with until you understand what’s happening under the hood. In most version control systems, an object under version control would be in one of a handful of states: either the object has been cataloged and stored in its current state, or it hasn’t. From a DWIM standpoint for a new git user, git add would catalog and store the object in its current state. With the stage, you can stage, and change, stage again, and change again. I’ve used this myself to logically group commits, so I agree with you that it’s useful. But I do see how it breaks people’s DWIM view of how git works.
Also, If I stage, and then change, is there a way to have git restore the file as I staged it if I haven’t committed?
Git restore .
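More concretely: the index is git restore’s default source, so a per-file version of that answer looks like this (the file name is made up):

    git restore notes.txt            # working copy <- staged (index) version
    git restore -s HEAD notes.txt    # or: working copy <- last commit instead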
I’ve implemented git from scratch. I still find the staging area difficult to use effectively in practice.
Try testing your staged changes atomically before you commit. You can’t.
A better design would have been an easy way to unstage, similar to git stash but with range support.
You mean git stash --keep-index?
Interesting, that would solve the problem. I’m surprised I’ve not come across that before.
In terms of “what’s wrong with the staging area”, what I was suggesting would work better is to have the whole thing work in reverse. So all untracked files are “staged” by default and you would explicitly un-stage anything you don’t want to commit. Firstly this works better for the 90% use-case, and compared to this workaround it’s a single step rather than 2 steps for the 10% case where you don’t want to commit all your changes yet.
The fundamental problem with the staging area is that it’s an additional, hidden state that the final committed state has to pass through. But that means that your commits do not necessarily represent a state that the filesystem was previously in, which is supposed to be a fundamental guarantee. The fact that you have to explicitly stash anything to put the staging area into a knowable state is a bit of a hack. It solves a problem that shouldn’t exist.
The way I was taught this, the way I’ve taught this to others, and the way it’s represented in at least some guis is not compatible.
I mean, sure, you can have staged and unstaged changes in a file and need to figure it out for testing, or unstage parts, but mostly it’s edit -> stage -> commit -> push.
That feels, to me and to newbies who barely know what version control is, like a logical additive flow. Tons of cases you stage everything and commit, so it’s a very small operation.
The biggest gripe may be devs who forget to add files in the proper commit, which makes bisect hard. Your case may solve that for sure, but I find it a special case of bad guis and sloppy devs who do that. Also at some point the fs layout gets fewer new files.
Except that in a completely linear flow the distinction between edit and stage serves no purpose. At best it creates an extra step for no reason and at worst it is confusing and/or dangerous to anyone who doesn’t fully understand the state their working copy is in. You can bypass the middle state with git add .; git commit, and a lot of new developers do exactly that, but all that does is pretend the staging state doesn’t exist.
Staging would serve a purpose if it meant something similar to pushing a branch to CI before a merge, where you have isolated the branch state and can be assured that it has passed all required tests before it goes anywhere permanent. But the staging area actually does the opposite of that, by creating a hidden state that cannot be tested directly.
As you say, all it takes is one mistake and you end up with a bad commit that breaks bisect later. That’s not just a problem of developers being forgetful, it’s the bad design of the staging area that makes this likely to happen by default.
I think I sort of agree but do not completely concur.
Glossing over the staging can be fine in some projects and dev sloppiness is IMO a bigger problem than an additive flow for clean commits.
These are societal per-project issues - what’s the practice or policy or mandate - and thus they could be upheld by anything, even using the undo buffer for clean commits like back in the day. Which isn’t to say you never gotta do trickery like that with Git, just that it’s a flow that feels natural and undo trickery less common.
Skimming the other comments, maybe jj is more like your suggestion, and I wouldn’t mind “a better Git”, but I can’t be bothered when e.g. gitless iirc dropped the staging area and would make clean commits feel like 2003.
If git stash --keep-index doesn’t do what you want then you could help further the conversation by elaborating on what you want. It’s usually not that hard.
https://lobste.rs/s/yi97jn/is_it_time_look_past_git#c_ss5cj3
Absolutely not. The staging area was a godsend coming from Subversion – it’s my favorite part of git bar none.
Everyone seems to suppose I would like to ditch the workflows enabled by the staging area. I really don’t. I’m quite sure there are ways to keep those workflows without using a staging area. If there aren’t, well… I can always admit I was wrong.
Well, what I prize being able to do is to build up a commit piecemeal out of some but not all of the changes in my working directory, in an incremental rather than all-in-one-go fashion (i.e. I should be able to form the commit over time and I should be able to modify a file, move its state into the “pending commit” and continue to modify the file further without impacting the pending commit). It must be possible for any commit coming out of this workflow to both not contain everything in my working area, and to contain things no longer in my working area. It must be possible to diff my working area against the pending commit and against the last actual commit (separately), and to diff the pending commit against the last actual commit.
You could call it something else if you wanted but a rose by any other name etc. A “staging area” is a supremely natural metaphor for what I want to work with in my workflow, so replacing it hardly seems desirable to me.
How about making the pending commit an actual commit? And then adding the porcelain necessary to treat it like a staging area? Stuff like git commit -p foo if you want to add changes piecemeal.
No. That’s cool too and is what tools like git revise and git absorb enable, but making it an actual commit would have other drawbacks: it would imply it has a commit message and passes pre-commit hooks and things like that. The staging area is useful precisely for what it does now: help you build up the pieces necessary to make a commit. As such it implies you don’t have everything together to make a commit out of it. As soon as I do, I commit, then if necessary --amend, --edit, or git revise later. If you don’t make use of workflows that use staging then feel free to use tooling that bypasses it for you, but don’t try to take it away from the rest of us.
Oh, totally missed that one. Probably because I’ve never used it (instead I rely on CI or manually pushing a button). Still, that’s the strongest argument so far, and I have no good solution that doesn’t involve an actual staging area there. I guess it’s time to change my mind.
I don’t think the final word has been said. These tools could also run hooks. It may be that new hooks need to be defined.
Here is one feature request: run git hooks on new commit
I think you missed the point; my argument is that the staging area is useful as a place to stage stuff before things like commit-related hooks get run. I don’t want tools like git revise to run pre-commit hooks. When I use git revise the commit has already been made and has presumably passed the pre-commit phase.
For the problem that git revise “bypasses” the commit hook when using it to split a commit, I meant the commit hook (not the pre-commit hook).
“bypasses” the commit hook when using it to split a commit, I meant the commit hook (not precommit hook).I get that the staging area lets you assemble a commit before you can run the commit hook. But if this was possible to do statelessly (which would only be an improvement), you could do without it. And for other reasons, git would be so much better without this footgun:
Normally, you can look at git diff and commit what you see with git commit -a. But if the staging area is clobbered, which you might have forgotten, you also have invisible state that sneaks in!
Normally I do nothing of the kind. I might have used git commit -a a couple of times in the last 5 years (and I make dozens to hundreds of commits per day). The statefulness of the staging area is exactly what benefits my workflow, not the part I would be trying to eliminate. The majority of the time I stage things I’m working on from my editor one hunk at a time. The difference between my current buffer and the last git commit is highlighted, and after I make some progress I start adding related hunks and shaping them into commits. I might fiddle around with a couple things in the current file, then when I like it stage up pieces into a couple different commits.
The most aggressive I’d get is occasionally (once a month?) coming up with a use for git commit -u.
A stateless version of staging that “lets you assemble a commit” sounds like an oxymoron to me. I have no idea what you think that would even look like, but a state that is neither the full contents of the current file system nor yet a commit is exactly what I want.
Why not allow an empty commit message, and skip the commit hooks if a message hasn’t been set yet?
Why deliberately make a mess of things? Why make a discrete concept of a “commit” into something else with multiple possible states? Why not just use staging like it is now? I see no benefit to jury-rigging more states on top of a working one. If the point is to simplify the tooling, you won’t get there by overloading one clean concept with an indefinite state and contextual markers like “if commit message empty then this is not a real commit”.
Empty commit message is how you abort a commit
With the current UI.
When discussing changes, there’s the possibility of things changing.
Again, what’s the benefit?
Sure, you could awkwardly simulate a staging area like this. The porcelain would have to juggle a whole bunch of shit to avoid breaking anytime you merge a bunch of changes after adding something to the fake “stage”, pull in 300 new commits, and then decide you want to unstage something, so the replacement of the dedicated abstraction seems likely to leak and introduce merge conflict resolution where you didn’t previously have to worry about it, but maybe with enough magic you could do it.
But what’s the point? To me it’s like saying that I could awkwardly simulate if, while and for with goto, or simulate basically everything with enough NANDs. You’re not wrong, but what’s in it for me? Why am I supposed to like this any better than having a variety of fit-for-purpose abstractions? It just feels like I’d be tying one hand behind my back so there can be one less abstraction, without explaining why having N-1 abstractions is even more desirable than having N.
Seems more like an “a foolish consistency is the hobgoblin of little minds” desire than anything beneficial, really.
Simplicity of implementation. Implementing the staging area like a commit, or at least like a pointer to a tree object, would likely make the underlying data model simpler. I wonder why the staging area was implemented the way it is.
At the interface level however I’ve had to change my mind because of pre-commit hooks. When all you have is commits, and some tests are automatically launched every time you commit anything, it’s pretty hard to add stuff piecemeal.
Yes, simplicity of implementation and UI. https://github.com/martinvonz/jj (mentioned in the article) makes the working copy (not the staging area) an actual commit. That does make the implementation quite a lot simpler. You also get backups of the working copy that way.
No offence but, why would I give a shit about this? git is a tool I use to enable me to get other work done, it’s not something I’m reimplementing. If “making the implementation simpler” means my day-to-day workflows get materially more unpleasant, the simplicity of the implementation can take a long walk off a short pier for all I care.
It’s not just pre-commit hooks that get materially worse with this. “Staging” something would then have to have a commit message, I would effectively have to branch off of HEAD before doing every single “staging” commit in order to be able to still merge another branch and then rebase it back on top of everything without fucking about in the reflog to move my now-buried-in-the-past stage commit forward, etc, etc. “It would make the implementation simpler” would be a really poor excuse for a user-hostile change.
I agree. Users shouldn’t have to care about the implementation (except for minor effects like a simpler implementation resulting in fewer bugs). But I don’t understand why your workflows would be materially more unpleasant. I think they would actually be more pleasant. Mercurial users very rarely miss the staging area. I was a git developer (mostly working on git rebase) a long time ago, so I consider myself a (former) git power user. I never miss the staging area when I use Mercurial.
Why? I think the topic of this thread is about what can be done differently, so why would the new tool require a commit message? I agree that it’s useful if the tool lets you provide a message, but I don’t think it needs to be required.
I don’t follow. Are you saying you’re currently doing the following?
I don’t see why the new tool would bury the staging commit in the past. That’s not what happens with Jujutsu/jj anyway. Since the working copy is just like any other commit there, you can simply merge the other branch with it and then rebase the whole stack onto the other branch after.
I’ve tried to explain a bit about this at https://github.com/martinvonz/jj/blob/main/docs/git-comparison.md#the-index. Does that help clarify?
Well, I’m not them. As somebody who was forced to use Mercurial for a bit and hated every second of it, I missed the hell out of it, personally (and if memory serves, there was later at least one inevitably-nonstandard Mercurial plugin to paper over this weakness, so I don’t think I was the only person missing it).
I’ve talked about my workflow elsewhere in this thread, I’m not really interested in rehashing it, but suffice to say I lean on the index for all kinds of things.
I’m saying that any number of times I start putting together a commit by staging things on Friday afternoon, come back on Monday, pull in latest from main, and continue working on forming a commit.
If I had to (manually; we’re discussing, among other things, the assertion that you could eliminate the stage because it’s pointless and “just” commit whenever you want to stage and revert the commit whenever you want to unstage) commit things on Friday, forget I’d done so on Monday, pull in 300 commits from main, and then whoops, I want to revert a commit 301 commits back, so now I get to back out the merge and etc etc, this is all just a giant pain in the ass to even type out.
I’m honestly not interested in reading it, or in what “Jujutsu” does, as I’m really happy with git and totally uninterested in replacing it. All I was discussing in this thread with Loup-Vaillant was the usefulness of the stage as an abstraction and my disinterest in seeing it removed under an attitude of “well you could just manually make commits when you would want to stage things, instead”.
Too bad, this link you’re refusing to read is highly relevant to this thread. Here’s a teaser:
What “jujutsu” does under the hood has nothing whatsoever to do with this asinine claim of yours, which is the scenario I was objecting to: https://lobste.rs/s/yi97jn/is_it_time_look_past_git#c_k6w2ut
At this point I’ve had enough of you showing up in my inbox with these poorly informed, bad faith responses. Enough.
I was claiming that the workflows we have with the staging area, we could achieve without. And Jujutsu here has ways to do exactly that. It has everything to do with the scenario you were objecting to.
Also, this page (and what I cited specifically) is not about what jujutsu does under the hood, it’s about its user interface.
I’ve made it clear that I’m tired of interacting with you. Enough already.
It’s because people don’t give a shit that we have bloated (and often slow) software.
And it’s because of developers with their heads stuck so far up their asses that they prioritize their implementation simplicity over the user experience that so much software is actively user-hostile.
Let’s end this little interaction here, shall we.
Sublime Merge is the ideal git client for me. It doesn’t pretend it’s not git like all other GUI clients I’ve used so you don’t have to learn something new and you don’t unlearn git. It uses simple git commands and shows them to you. Most of git’s day-to-day problems go away if you can just see what you’re doing (including what you’ve mentioned).
CLI doesn’t cut it for projects of today’s size. A new git won’t fix that. The state of a repository doesn’t fit in a terminal and it doesn’t fit in my brain. Sublime Merge shows it just right.
I like GitUp for the same reasons. Just let me see what I’m doing… and Undo! Since it’s free, it’s easy to get coworkers to try it.
I didn’t know about GitUp but I have become a big fan of gitui as of late.
I’ll check that out, thank you!
I use Fork for the same purpose and the staging area has never been a problem since it is visible and diffable at any time, and that’s how you compose your commits.
See Game of Trees for an alternative to the git tool that interacts with normal git repositories.
Have to agree with others about the value of the staging area though! It’s the One Big Thing I missed while using Mercurial.
Well, on the one hand people could long for a better way to store the conflict resolutions to reuse them better on future merges.
On the other hand, of all approaches to DAG-of-commits, Git’s model is plain worse than the older/parallel ones. Git is basically intended to lose valuable information about intent. The original target branch of the commit often tells as much as the commit message… but it is only available in reflog… auto-GCed and impossible to sync.
Half of my branches are called werwerdsdffsd. I absolutely don’t want them permanently burned in the history. These scars from work-in-progress annoyed me in Mercurial.
Honestly I have completely the opposite feeling. Back in the days before git crushed the world, I used Mercurial quite a lot and I liked that Mercurial had both the ephemeral “throw away after use” model (bookmarks) and the permanent-part-of-your-repository-history model (branches). They serve different purposes, and both are useful and important to have. Git only has one and mostly likes to pretend that the other is awful and horrible and nobody should ever want it, but any long-lived project is going to end up with major refactoring or rewrites or big integrations that they’ll want to keep some kind of “here’s how we did it” record to easily point to, and that’s precisely where the heavyweight branch shines.
And apparently I wrote this same argument in more detail around 12 years ago.
ffs_please_stop_refactoring_and_review_this_pr8
This is a very good point. It would be interesting to tag and attach information to a group of related commits. I’m curious about the Linux kernel workflows. If everything is an emailed patch, maybe features are done one commit at a time.
If you go further, there are many directions to extend what you can store and query in the repository! And of course they are useful. But even the data Git forces you to have (unlike, by the way, many other DVCSes where if you do not want a meaningful name you can just have multiple heads in parallel inside a branch) could be used better.
I can’t imagine a scenario where the original branch point of a feature would ever matter, but I am constantly sifting through untidy merge histories that obscure the intent.
Tending to your commit history with intentionality communicates to reviewers what is important, and removes what isn’t.
It is not about the point a branch started from. It is about which of the recurring branches the commit was in. Was it in quick-fix-train branch or in update-major-dependency-X branch?
The reason why this isn’t common is because of GitHub more than Git. They don’t provide a way to use merge commits that isn’t a nightmare.
When I was release managing by hand, my preferred approach was rebasing the branch off HEAD but retaining the merge commit, so that the branch commits were visually grouped together and the branch name was retained in the history. Git can do this easily.
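One way to get that shape with stock git, assuming a feature branch named feature and a mainline named main:

    git checkout feature
    git rebase main              # replay the branch on top of the current tip
    git checkout main
    git merge --no-ff feature    # force a merge commit so the branch commits stay visually grouped
    # the default merge message ("Merge branch 'feature'") is what keeps the branch name in the history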
I never understood the hate for Git’s CLI. You can learn 99% of what you need to know on a daily basis in a few hours. That’s not a bad time investment for a pivotal tool that you use multiple times every day. I don’t expect a daily driver tool to be intuitive, I expect it to be rock-solid, predictable, and powerful.
This is a false dichotomy: it can be both (as Mercurial is). Moreover, while it’s true that you can learn the basics to get by with in a few hours, it causes constant low-level mental overhead to remember how different commands interact, what the flag is in this command vs. that command, etc.—and never mind that the man pages are all written for people thinking in terms of the internals, instead of for general users. (That this is a common failing of man pages does not make it any less a problem for git!)
One way of saying it: git has effectively zero progressive disclosure of complexity. That makes it a continual source of paper cuts at minimum unless you’ve managed to actually fully internalize not only a correct mental model for it but in many cases the actual implementation mechanics on which it works.
Its manpages are worthy of a parody: https://git-man-page-generator.lokaltog.net
Its predecessors CVS and svn had much more intuitive commands (even if they were clumsy to use in other ways). DARCS has been mentioned many times as being much easier to use as well. People migrating from those tools really had a hard time, especially because git changed the meanings of some commands, like checkout.
Then there were some other tools that came up around the same time or shortly after git but didn’t get the popularity of git like hg and bzr, which were much more pleasant to use as well.
I think the issues people have are less about the CLI itself and more about how it interfaces with the (for some developers) complex and hard to understand concepts at hand.
Take rebase for example. Once you grok what it is, it’s easy, but trying to explain the concept of replaying commits on top of others to someone used to old school tools like CVS or Subversion can be a challenge, especially when they REALLY DEEPLY don’t care and see this as an impediment to getting their work done.
I’m a former release engineer, so I see the value in the magic Git brings to the table, but it can be a harder sell for some :)
I would argue that this is one of the main reasons for git’s success. The CLI is so bad that people were motivated to look for tools to avoid using it. Some of them were motivated to write tools to avoid using it. There’s a much richer set of local GUI and web tools than I’ve seen for any other revision control system and this was true even when git was still quite new.
I never used a GUI with CVS or Subversion, but I wanted to as soon as I started touching the git command line. I wanted features like PRs and web-based code review, because I didn’t want to merge things locally. I’ve subsequently learned a lot about how to use the git CLI and tend to use it for a lot of tasks. If it had been as good as, say, Mercurial’s from the start, then I never would have adopted things like gitx/gitg and GitHub, and it’s those things that make the git ecosystem a pleasant place to be.
Yes, a thousand times this! :) Git’s data model has been a quantum leap for people who need to manage source code at scale. Speaking as a former release engineer, I used to be the poor schmoe who had to conduct Merge Day, where a branch gets merged back to main.
There was exactly one thing you could always guarantee about merge day: There Will Be Blood.
So let’s talk about looking past git’s god awful interface, but keep the amazing nubbins intact and doing the nearly miraculous work they do so well :)
And I don’t just mean throwing a GUI on top either. Let’s rethink the platonic ideal of how developers would want their workflow to look in 2022. Focus on the common case. Let the ascetics floating on a cloud of pure intellect script their perfect custom solutions, but make life better for the “cold dark matter” developers, who are legion.
I would say that you simultaneously give credit where it is not due (there were multiple DVCSes before Git, and approximately every one had a better data model, and then there are things that Subversion still does better than everyone else, somehow), and ignore the part that actually made your life easier: the effort of pushing Git down people’s throats, done by Linus Torvalds, who spent orders of magnitude more of his time on this than on getting things right beyond basic workability in Git.
Not a DVCS expert here, so would you please consider enlightening me? Which earlier DVCS were forgotten?
My impressions of Mercurial and Bazaar are that they were SL-O-O-W, but they’re just anecdotal impressions.
Well, Bazaar is technically earlier. Monotone is significantly earlier. Monotone has a quite interesting and nicely decoupled data model where the commit DAG is just one thing; changelog, author — and branches get the same treatment — are not parts of a commit, but separately stored claims about a commit, and this claim system is extensible and queryable. And of course Git was about Linus Torvalds speedrunning an implementation of the parts of BitKeeper he really really needed.
It might be that in the old days running on Python limited speed of both Mercurial and Bazaar. Rumour has it that the Monotone version Torvalds found too slow was indeed a performance regression (they had one particularly slow release at around that time; Monotone is not in Python)
Note that one part of what makes Git fast is that it enables some optimisations that systems like Monotone make optional (it is quite optimistic about how quickly you can decide that a file must not have been modified, for example). Another is that it was originally only intended to be FS-safe on ext3… and then everyone forgot to care, so now it is quite likely to break the repository in the case of an unclean shutdown mid-operation. Yes, I have damaged repositories that way, to a state where I could not find advice on how to avoid re-cloning to get even a partially working repository.
As for Subversion, it has narrow checkouts, which are a great feature, and DVCSes could also have them, but I don’t think anyone properly has them. You kind of can hack something together with remote-automate in Monotone, but probably flakily.
Ironically, that’s part of the performance problem – compressing the packfiles tends to be where things hurt.
Still, this is definitely a solvable problem.
I have created and maintain the official test suite for pijul; I am the happiest user ever.
Hmm, knowing you I’m sure you’ve tested it to death.
I guess they got rid of the exponential conflict resolution that plagued DARCS? If so, perhaps I should give patch theory another go. Git ended up winning the war before I got around to actually studying patch theory; maybe it is sounder than I thought.
Pijul is a completely different thing than Darcs, the current state of a repository in Pijul is actually a special instance of a CRDT, which is exactly what you want for a version control system.
Git is also a CRDT, but HEAD isn’t (unlike in Pijul), the CRDT in Git is the entire history, and that is not a very useful property.
Best test suite ever. Thanks again, and again, and again for that. It also helped debug Sanakirja, a database engine used as the foundation of Pijul, but usable in other contexts.
There are git-compatible alternatives that keep the underlying model and change the interface. The most prominent of these is probably gitless.
I’ve been using git entirely via UI because of that. Much better overview, much more intuitive, less unwanted side effects.
You can’t describe Git without discussing rebase and merge: these are the two most common operations in Git, yet they don’t satisfy any interesting mathematical property such as associativity or symmetry:
Associativity is when you want to merge your commits one by one from a remote branch. This should intuitively be the same as merging the remote HEAD, but Git manages to make it different sometimes. When that happens, your lines can be shuffled around more or less randomly.
Symmetry means that merging A and B is the same as merging B and A. Two coauthors doing the same conflictless merge might end up with different results. This is one of the main benefits of GitHub: merges are never done concurrently when you use a central server.
Well, at least this is not the fault of the data model: if you have all the snapshots, you can deduce all the patches. It’s the operations themselves that need fixing.
My point is that this is a common misconception: no datastructure is ever relevant without considering the common operations we want to run on it.
For Git repos, you can deduce all the patches indeed, but merge and rebase can’t be fixed while keeping a reasonable performance, since the merge problem Git tries to solve is the wrong one (“merge the HEADs, knowing their youngest common ancestor”). That problem cannot have enough information to satisfy basic intuitive properties.
The only way to fix it is to fetch the entire sequence of commits from the common ancestor. This is certainly doable in Git, but merges become O(n) in time complexity, where n is the size of history.
The good news is, this is possible. The price to pay is a slightly more complex datastructure, slightly harder to implement (but manageable). Obviously, the downside is that it can’t be consistent with Git, since we need more information. On the bright side, it’s been implemented: https://pijul.org
Agreed. Now, how often do we actually merge stuff, and how far is the common ancestor in practice?
My understanding of the usage of version control is that merging two big branches (with an old common ancestor) is rare. Far more often we merge (or rebase) work units with just a couple commits. Even more often than that we have one commit that’s late, so we just pull in the latest change then merge or rebase that one commit. And there are the checkout operations, which in some cases can occur most frequently. While a patch model would no doubt facilitate merges, it may not be worth the cost of making other, arguably more frequent operations, slower.
(Of course, my argument is moot until we actually measure. But remember that Git won in no small part because of its performance.)
I agree with all that, except that:
the only proper modelling of conflicts, merges and rebases/cherry-picking I know of (Pijul) can’t rely on common ancestors only, because rebases can make some future merges more complex than a simple 3-way merge problem.
I know many engineers are fascinated by Git’s speed, but the algorithm running on the CPU is almost never the bottleneck: the operator’s brain is usually much slower than the CPU in any modern version control system (even Darcs has fixed its exponential merge). Conflicts do happen, and so do cherry-picks and rebases. They aren’t rare in large projects, and can be extremely confusing without proper tools. Making these algorithms fast is IMHO much more important from a cost perspective than gaining 10% on an operation already taking less than 0.1 second. I won’t deny the facts though: if Pijul isn’t used more in industry, it could be partly because that opinion isn’t widely shared.
some common algorithmic operations in Git are slower than in Pijul (pijul credit is much faster than git blame on large instances), and most operations are comparable in speed. One thing where Git is faster is browsing old history: the datastructures are ready in Pijul, but I haven’t implemented the operations yet (I promised I would do that as soon as this is needed by a real project).