1. 7

    How does git9 support staging only part of the changes to a file? From what I can tell it does not.

    I would describe any version control system which doesn’t allow me to commit only some of the hunks in a file or edit the patch myself as “boneheaded.”

    1. 9

      Can I quote you on the boneheaded bit? It seems like a great endorsement.

      Anyways – this doesn’t fit my workflow. I build and test my code before I commit, and incrementally building commits from hunks that have never been compiled in that configuration is error prone. It’s bad enough committing whole files separately – I’m constantly forgetting files and making broken commits as a result. I’ve been using git since 2006, and every time I’ve tried doing partial stages, I’ve found it more pain than it was worth.

      So, for me (and many others using this tool) this simply isn’t a feature that’s been missed.

      That said, it’s possible to build up a patch set incrementally and commit it: there are tools like divergefs that provide a copy-on-write view of the files, so you can start from the last commit in .git/fs/HEAD/tree and pull in the hunks from your working tree that you want to commit using idiff. That shadowed view will even give you something that you can test before committing.
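
      A rough sketch of that flow in rc; the divergefs flags, mount points, and util.c below are illustrative stand-ins, not exact usage (see divergefs(4) on your system):

      bind .git/fs/HEAD/tree /n/clean                 # clean tree of the last commit
      divergefs -m /n/cow /n/clean                    # copy-on-write view over it (flags illustrative)
      idiff /n/clean/util.c util.c > /n/cow/util.c    # interactively pull in hunks from the working tree
      @{cd /n/cow && mk test}                         # test the shadowed view before committing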

      If someone wanted to provide a patch for automating this, I’d consider it.

      1. 6

        Thanks for this response - it’s a very clear argument for a kind of workflow where staging partial changes to a file doesn’t make sense.

        I work primarily as a data scientist using languages like R and Python, which don’t have a compilation step and in which many features are often developed concurrently and more or less independently (consider that my projects usually have a “utils” file which accumulates mostly independent trivia). In this workflow, I like to make git commits which touch on a single feature at a time, and it’s relatively easy in most cases to select out hunks from individual files which tell that story.

      2. 5

        As somebody who near-exclusively uses hggit, and hence no index, I can answer this from experience. If you want to commit only some of your changes, that’s what you do. No need to go through an index.

        Commit only some of your changes?
        hg commit --interactive
        git commit --patch

        Add more changes to the commit you’re preparing?
        hg amend --interactive
        git commit --amend --patch

        Remove changes from the commit?
        hg uncommit --interactive
        git something-complicated --hopefully-this-flag-is-still-called-patch

        The main advantage this brings: because the commit-you’re-working-on is a normal commit, all the normal verbs apply. No need for special index-flavoured verbs/flags like reset or diff --staged. One less concept.

        If you want to be sure you won’t push it before you’re done, use hg commit --secret on that / those commits; then hg phase --draft when you’re ready.
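
        A sketch of that cycle, combining the verbs above (the commit message is a placeholder):

        hg commit --secret -m 'WIP: feature'   # secret: push and outgoing skip it
        hg amend --interactive                 # fold in more hunks as they land
        hg phase --draft .                     # finished: now it can be pushed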

        1. 2

          Actually sounds pretty good! Anyone know if such a thing is possible with fossil?

        2. 4

          You can do it like hg does with shelve - always commit what is on disk, but allow the user to shelve hunks. These can be restored after the commit is done. Sort of a reverse staging area.
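
          A minimal sketch of that reverse-staging flow with hg’s bundled shelve extension:

          hg shelve --interactive    # set aside the hunks you don't want yet
          hg commit -m 'feature A'   # commit exactly what is left on disk
          hg unshelve                # restore the shelved hunks afterwards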

          1. 3

            I haven’t tried git9, but it should still be possible to support committing parts of files in a world without a staging area. As I imagine it, the --patch option would just be on the commit command (instead of the add command).

            Same with all other functionality of git add/rm/mv – these commands wouldn’t exist. Just make them options of git commit. It doesn’t matter if the user makes a commit for each invocation (or uses --amend to avoid that): If you can squash, you don’t need a staging area for the purpose of accumulating changes.

            Proof of concept: You can already commit parts of files without using the index, and without touching the workspace: Just commit everything first, then split it interactively using git-revise (yes, you can edit the in-between patch too). I even do that quite often. Splitting a commit is something you have to do sometimes anyway, so you might as well learn that instead. When you can do this – edit the commit boundaries after the fact – you no longer need to get it perfect on the first try, which is all that the staging area can help you with.
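
            A minimal sketch of that flow (the message is a placeholder; check the git-revise docs for the exact todo verbs):

            git commit -am 'WIP: two features tangled together'
            git revise -i HEAD~1   # change "pick" to "cut" in the todo list to
                                   # split the commit, editing the patch between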

            Rather than a staging area, I wish I could mark commits as “unfinished” (meaning that I don’t want to push them anywhere), and that I could refer to these unfinished commits by a stable id that didn’t change while I worked on them.

            1. 3

              This fits my mental model much better too. Any time I have files staged and am not in the process of committing, I have probably messed something up. The next step is always to clear the index or to add everything to the index and commit.

              1. 3

                -p is indeed available on commit. And also on stash.
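
                That is:

                git commit -p   # pick hunks at commit time, no separate add step
                git stash -p    # or stash away just the hunks you don't want yet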

              2. 2

                I feel the Plan 9 way would be to use a dedicated tool to help stash away parts of the working directory instead.

                1. 2

                  I would describe any version control system which doesn’t allow me to commit only some of the hunks in a file or edit the patch myself as “boneheaded.”

                  I would describe people wedded to the index in softer but similar terms.

                  Here’s the thing: if you’re committing only part of your working tree, then you are, by definition, committing code that you have never run or even attempted to compile. You cannot have tested it, you cannot have even built it, because you can’t do any of those things against the staged hunks. You’re making a bet that any errors you make are going to be caught either by CI or by a reviewer. If you’ve got a good process, you’ve got good odds, but only that: good odds. Many changes that build, and that a reviewer might approve, still won’t work (e.g., a missed call site that’s only reached via reflection, a forgotten tweak to a data file, etc.).

                  These aren’t hypotheticals; I’ve seen them. Many times. Even in shops with absolute top-tier best-practices.

                  Remove-to-commit models (e.g. hg shelve, fossil stash, etc.) at least permit you not to go there. I can use pre-commit or pre-push hooks to ensure that the code at the very least builds and passes tests. I’ve even used pre-push hooks in this context to verify that the build was up to date (by checking whether a make-like run would be a no-op), rejecting the push if not and telling the submitter to at least do a sanity check. And I have, again, seen this prevent actual issues in real-world usage.
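
                  A minimal sketch of such a hook, assuming a make-based build (installed as .git/hooks/pre-push):

                  #!/bin/sh
                  # reject the push unless a build would be a no-op, i.e. the
                  # tree being pushed has actually been built as-is
                  if ! make -q >/dev/null 2>&1; then
                      echo 'build is not up to date; build and test first' >&2
                      exit 1
                  fi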

                  Neither of these models is perfect, both have mitigations and workarounds, and I will absolutely agree that git add -p is an incredibly seductive tool. But it’s an error-prone tool that by definition must lead to you submitting things you’ve never tested.

                  I don’t think my rejection of that model is boneheaded.

                  1. 6

                    You cannot have tested it, you cannot have even built it, because you can’t do any of those things against the staged hunks.

                    Sure you can, I do this all the time.

                    When developing a feature, I’ll often implement the whole thing (or a big chunk of it) in one go, without really thinking about how to break that up into commits. Then when I have it implemented and working, I’ll go back and stage / commit individual bits of it.

                    You can stage some hunks, stash the unstaged changes, and then run your tests.
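
                    Concretely, something like the “testing partial commits” recipe from the git-stash documentation (make test stands in for whatever your test command is):

                    git add -p                    # stage just the hunks for this commit
                    git stash push --keep-index   # set the unstaged changes aside
                    make test                     # test exactly what will be committed
                    git commit -m 'first part'
                    git stash pop                 # carry on with the rest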

                    1. 5

                      Here’s the thing: if you’re committing only part of your working tree, then you are, by definition, committing code that you have never run or even attempted to compile. You cannot have tested it, you cannot have even built it, because you can’t do any of those things against the staged hunks

                      While this is true, it isn’t quite as clear-cut as you make it seem. The most common case I have for this is fixing typos or other errors in existing comments or documentation, noticed while adding comments / docs for the new feature. I don’t want to include those fixes in an unrelated PR, so I pull them out into a separate commit and raise that as a separate (and trivial-to-review) PR. It doesn’t matter that I’ve never tried to build them because there are no changes in the code, so they won’t change the functionality at all.

                      Second, just because I haven’t compiled them when I commit doesn’t mean that I haven’t compiled them when I push. Again, my typical workflow here is to notice that there are some self-contained bits, commit them, stash everything else, test them, and then push them and raise a PR, before popping the stash and working on the next chunk. The thing that I push is tested locally, then tested by CI, and is then a small self-contained thing that is easy to review before merging.

                      But it’s an error-prone tool that by definition must lead to you submitting things you’ve never tested.

                      And yet, in my workflow, it doesn’t. It allows you to submit things that you’ve never tested, but so does any revision-control system that isn’t set up with pre-push hooks that check for tests (and if you’re relying on that as any kind of useful quality bar, rather than pre-merge CI with a reasonable matrix of targets, then you’re likely to end up with a load of code that ‘works on my machine’).

                      1. 3

                        I mentioned there are “mitigations and workarounds,” some of which you’re highlighting, but you’re not actually disagreeing with my points. Git is the only SCM I’ve ever encountered where make can work, git diff can show nothing, git commit won’t be a no-op, and the resulting commit can’t compile (consider a brand-new file that was never git added: the build sees it, git diff doesn’t show it, and the commit lacks it).

                        And the initial comment I’m responding to is that a position like mine is “boneheaded”. I’m just arguing it isn’t.

                      2. 5

                        Here’s the thing: if you’re committing only part of your working tree, then you are, by definition, committing code that you have never run or even attempted to compile.

                        I mean, sure. And there are many places where this matters.

                        Things like cleanly separating a bunch of changes to my .vimrc into logical commits, and similar sorts of activity, are… Not really among them.

                        1. 1

                          Gotta admit, this is a very solid argument.

                        2. 1

                          That’s a good question. I imagine something like

                          @{rfork n; cp file /tmp && bind /tmp/file file && ed file && git/commit file}
                          

                          should work.

                        1. 4

                          One of the deepest lessons from category theory that computer science has not yet internalized is that logic programming does not line up with classical computation: relations are dagger compact closed, while functions are Cartesian closed. The most important consequence is that a logical program can be reversed or run in multiple directions, while a classical program can only run forwards.

                          This mismatch also underlies modern quantum-computing projects, and it is no accident that those projects are founded on the creation of hardware which efficiently represents qubits. We can only imagine how the fifth-generation projects would have been different, had an efficient relational computer been designed.

                          1. 2

                            I notice that KL1 (guarded Horn clauses) trades away running in both directions for running in parallel. KL1 is the language for fifth-generation computer systems mentioned in the article.

                          1. 2

                            Can anyone explain why NetBSD is investing effort in a kernel implementation of posix_spawn? The API is specifically designed to permit userspace implementations, which is a big part of the reason that it’s so awful and no one wants to use it. The cost of execve is so high that a few extra system calls in a userspace posix_spawn implementation are negligible. With vfork, the cost of the new process creation is negligible, and most of the multithreading concerns don’t apply: in the temporary vfork context, the userspace thread has a new kernel context (file descriptor table and so on) associated with it and so can modify it at will. An in-kernel posix_spawn has to duplicate, inside the spawn implementation, every single system call that modifies the kernel state associated with a process. A userspace implementation can just use chdir here directly.
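
                            A minimal sketch of that userspace approach, assuming the BSD vfork semantics described above (error handling elided; spawn_with_chdir is an illustrative name, not a real API):

                            #include <sys/types.h>
                            #include <unistd.h>

                            /* vfork shares the address space, but the child gets its own
                             * kernel context (fd table, cwd, etc.), so "file actions" such
                             * as chdir are just real system calls made before execve. */
                            pid_t spawn_with_chdir(const char *dir, const char *path,
                                                   char *const argv[], char *const envp[])
                            {
                                pid_t pid = vfork();
                                if (pid == 0) {
                                    if (chdir(dir) == -1)
                                        _exit(127);
                                    execve(path, argv, envp);
                                    _exit(127);   /* execve failed */
                                }
                                return pid;       /* -1 if vfork failed */
                            }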

                            1. 2

                              Was wondering the same. Looks like the goal is to replace fork+exec in sh for performance. Creating an address space from scratch avoids some reference counting and inter-processor interrupts, according to this previous blog entry: https://blog.netbsd.org/tnf/entry/gsoc_reports_make_system_31

                              1. 1

                                Odd. On FreeBSD, at least, vfork is very fast: so much faster than execve that there’s really no point in optimising it. It’s not the easiest API to use, because you are still running in the parent’s address space and so you have to make sure that you undo any changes you make there (release any locks you acquire, free any memory that you allocate). But if you use the libc posix_spawn then it does that for you, and if you want to do something that doesn’t have posix_spawn support (e.g. set a resource accounting limit, enter Capsicum mode, or enter a jail in the child) then you’re using the same kernel interfaces.

                                If I were to try to improve process creation on *NIX, I’d add a variant of pdfork that created an empty address space and then add variants of all of the system calls that modified a process to take a file descriptor. Windows actually does this: you create an empty process and then all of the system calls that can modify your process take a HANDLE to the process that they’re modifying. I’d love to have, for example, something like pdmmap that would let me map memory in another process. This would be a much cleaner way of setting up shared memory segments than passing a file descriptor to an anonymous memory object and having the other process map it.

                              2. 2

                                We’re not investing any effort; NetBSD’s posix_spawn has always been an in-kernel implementation from day one. This project is about extending the implementation to bring it up to spec for upcoming POSIX changes. The actual in-kernel implementation isn’t much code, either.

                                As for why it was originally implemented in the kernel - why is anything implemented in the kernel? Why didn’t you implement sendfile in userspace? It’s not right to assume everything has or should have the exact same performance characteristics as it does on FreeBSD.

                                1. 5

                                  We’re not investing any effort; NetBSD’s posix_spawn has always been an in-kernel implementation from day one

                                  Sorry, that’s my question: why did NetBSD decide to do a kernel implementation?

                                  As for why it was originally implemented in the kernel - why is anything implemented in the kernel?

                                  Generally, for one of three reasons:

                                  • It needs to access some in-kernel data structures that are difficult to expose cleanly to userspace.
                                  • A userspace implementation would be significantly slower (in terms of latency or throughput).
                                  • It needs to perform some privileged operations that require it to be protected from userspace (sometimes this leads to a privileged userspace daemon as the correct choice, though reason 2 can impact this decision).

                                  I don’t see any of these applying for posix_spawn, hence my question. Reasons 1 and 3 don’t apply: the API was specifically designed to allow pure-userspace implementations. Reason 2 may apply, but in a vfork + execve sequence the time is dominated by the execve call and process initialisation, so I doubt this is the reason. Presumably NetBSD had a Reason 4 that I’m missing and I’d like to know what that is / was.

                                  Why didn’t you implement sendfile in userspace?

                                  Sendfile is specifically designed to avoid a copy to or from userspace. An in-kernel implementation can DMA from disk to the kernel buffer cache and then DMA from the buffer cache to the NIC, with no userspace page-table updates or copies. A userspace implementation would either require at least one additional copy (reason 2) or require exposing the buffer cache to userspace, which is hard (reason 1) and would probably be difficult to do securely (reason 3).

                                  It’s not right to assume everything has or should have the exact same performance characteristics as it does on FreeBSD.

                                  Given that vfork was inherited by both from 4BSD and still has similar performance characteristics, I think it’s a fair assumption here.