1. 12
  1. 15

    This needs no fundamental changes to git.

    If you want to implement your own syntax-aware diff/merge tool, this is a git configuration away.
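
    A minimal sketch of that configuration (sexp-diff and sexp-merge are hypothetical executables standing in for your own syntax-aware tool; the attribute and driver mechanics are standard git):

      # .gitattributes
      *.lisp diff=sexp merge=sexp

      # register the drivers: git runs diff.<driver>.command with the old and
      # new versions of the file, and merge.<driver>.driver with %O=ancestor,
      # %A=ours (the driver writes its result here) and %B=theirs
      git config diff.sexp.command sexp-diff
      git config merge.sexp.name "syntax-aware s-expression merge"
      git config merge.sexp.driver "sexp-merge %O %A %B"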

    The fact that there are no great tools for this speaks both to the sufficiency of text and to the difficulty of implementing an acceptable syntax-aware tool.

    1. 4

      Yep! When I worked at Intentional Software a decade ago, where we built a structured editor that edited programs directly as an AST and saved them in a binary database file, I built a custom git integration that did exactly this. Every node in the tree had its own identity, so you could trivially handle cases like “one person renamed a function and changed some code inside it, while another person moved it to a different part of the document”: the function, the node representing its name, and the child nodes representing its definition were all distinct. References elsewhere in the tree used this identity rather than the name, so a rename modified only one place in the document and every reference immediately reflected the change.
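
      A toy sketch of the idea (not Intentional’s implementation; all names made up): give every node a stable identity and make references point at identities rather than names.

        import uuid

        class Node:
            def __init__(self, kind, value=None):
                self.id = uuid.uuid4()   # stable identity; survives moves and renames
                self.kind, self.value, self.children = kind, value, []

        fn_name = Node("name", "frobnicate")
        call_site = Node("call", fn_name.id)  # reference by identity, not by name

        fn_name.value = "frobnicate2"  # a rename edits exactly one node; the call
                                       # site is untouched yet immediately up to date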

      The hard part is shipping a structured editor that someone might actually want to use. I’ve become convinced that text’s one-dimensionality is actually a tremendous, maybe insurmountable, usability advantage, because editing necessarily happens one keystroke at a time. But that’s just, like, my opinion, man.

      1. 4

        the sufficiency of text

        Worse really does seem better, doesn’t it? :/

        1. 3

          If I understand it correctly, git isn’t really text aware. It just provides a way of retrieving whatever content matches a hash.

          Some of that content is metadata.

          An orthogonal part of it stores and packs the data by whichever algorithm works best for the data in hand.

          i.e. not worse-is-better, but more separation of concerns.
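
          A minimal sketch of that separation, using git’s documented blob hashing: content is addressed purely by the hash of its bytes.

            import hashlib

            def git_blob_hash(data: bytes) -> str:
                # a blob's name is the SHA-1 of a tiny header plus the raw bytes;
                # nothing at this layer knows or cares whether the bytes are text
                return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

            assert git_blob_hash(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"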

        2. 2

          I implemented one for our product’s config files. The git configuration is extremely tedious; I assume it’s there for internal experimentation with different diff algorithms, not to support setting up a semantic merge.

          1. 1

            Plastic SCM has a tool they call SemanticMerge that is supposedly more intelligent than a plain diff, but I haven’t used it (yet) so I cannot comment on how useful it is.

            https://semanticmerge.com/

            1. 1

              Oh, those crafty git configuration options. There are so many of them, though! Which one?

            2. 1

              That is what Unison is, essentially: https://www.unisonweb.org/

              1. 1

                A response to some of the points raised in these replies (I am not the author):

                https://github.com/GavinMendelGleason/syntactic_versioning/blob/master/IS_IT_TEXT.md

                1. 1

                  Git is a child of C….

                  While you have the C preprocessor, parsing unpreprocessed C source is a total nightmare.

                  Been there, done that, deeply scarred.

                  That’s one of the reasons I like D.

                  However, D’s mixins are in equal parts the stuff of Genius and the stuff of Nightmares.

                  1. 1

                    Git is a child of C

                    Say not C, but unix. (Even worse.)

                    Parsing unpreprocessed C source is a total nightmare.

                    Tell that to ctags :)

                    D

                    Compiling D is a worse nightmare. It cannot reasonably be done without coroutines (and the reference implementation is riddled with bugs because it does not use coroutines).

                    1. 1

                      The more I use coroutines, the more I think: hey, these things are a good idea. Relax and just use them.

                    2. 1

                      unpreprocessed C source

                      There is a concept of a ‘dual’ combinator. If f and g are functions, and gi is the inverse of g, then (f under g)(x) is the same as gi(f(g(x))). If we would like to apply a function to a value not in its domain, all we have to do is find an invertible procedure which brings the value into the relevant domain.
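
                      A minimal sketch in Python (assuming a well-behaved g with a true inverse):

                        import math

                        def under(f, g, g_inv):
                            # (f under g)(x) = g_inv(f(g(x)))
                            return lambda x: g_inv(f(g(x)))

                        # negate under log: exp(-log(x)) == 1/x
                        reciprocal = under(lambda y: -y, math.log, math.exp)
                        assert abs(reciprocal(4.0) - 0.25) < 1e-12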

                      In this case, we would like to parse code which is not preprocessed, but it is only possible to parse preprocessed code; so ‘all’ we have to do is come up with an ‘unpreprocess’ procedure. If you tag AST nodes with the macros that produced them, I expect it would not even be very hard.

                      And I would rather manually unpreprocess code snippets in cases of ambiguity than manually resolve textual merge conflicts.
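
                      A sketch of that tagging (purely hypothetical; real macro provenance in C is far messier than one label per node):

                        from dataclasses import dataclass, field

                        @dataclass
                        class AstNode:
                            kind: str                        # e.g. "if", "call", "ident"
                            children: list = field(default_factory=list)
                            macro_origin: str | None = None  # macro expansion that produced this node

                        # "unpreprocessing" would fold each maximal subtree sharing a
                        # macro_origin back into a single invocation node for that macro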

                      1. 2

                        In this case, we would like to parse code which is not preprocessed, but it is only possible to parse preprocessed code; so ‘all’ we have to do is come up with an ‘unpreprocess’ procedure. If you tag AST nodes with the macros that produced them, I expect it would not even be very hard.

                        The C preprocessor is Turing complete, so a general solution requires taking the output of an arbitrary program and inverting it, which is trivially not computable. Now, the nice thing about the halting problem is that you generally only need to solve computable subsets of it; unfortunately, that is not the case for the C preprocessor.

                        Macros that expand to tokens are the trivial cases. The biggest problem is #if / #ifdef, because these can remove large chunks from the preprocessed output that are present in the source. If you have something in #ifdef __unix__, some in #ifdef _MSC_VER, and then some specialisations in #ifdef __linux__ and #ifdef __FreeBSD__, then the preprocessed output of any single build will include only a subset of them. Reversing that requires reintroducing information.

                        Add to that platform-specific headers, which bring in types. For the most trivial example, int64_t is either long or long long (or, very occasionally, int) depending on what your platform’s stdint.h said. That is reversible, but it is not obvious, except from knowledge of the standard, that this can change between platforms but not between builds. For arbitrary dependencies, this problem gets much harder.
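
                        A toy model of why #ifdef in particular kills invertibility (hypothetical mini-preprocessor handling only #ifdef/#endif): two different sources collapse to identical output, so no inverse can exist.

                          def preprocess(lines, defined):
                              # keep only the lines inside active branches
                              out, active = [], [True]
                              for line in lines:
                                  if line.startswith("#ifdef "):
                                      active.append(active[-1] and line.split()[1] in defined)
                                  elif line == "#endif":
                                      active.pop()
                                  elif active[-1]:
                                      out.append(line)
                              return out

                          a = ["#ifdef __unix__", "int fd;", "#endif", "int x;"]
                          b = ["#ifdef __unix__", "long handle;", "#endif", "int x;"]
                          # with __unix__ undefined, the dropped branches leave no trace:
                          assert preprocess(a, set()) == preprocess(b, set()) == ["int x;"]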

                        1. 1

                          The core of the notion of “invertible” is that it has to be lossless, and the C preprocessor is very lossy.

                          However, even if you did get AST-aware diffs and merges, the semantics are still too hard.

                          I have said it many times before, but I still think it is true: the Joy papers are really important.

                           https://hypercubed.github.io/joy/html/j04alg.html
                           http://nsl.com/papers/rewritejoy.html
                          

                          Language designers really need to wake up and recognize that something very important is going on there. As programs grow beyond a megaline in size, it becomes ever more important to be able to reason about them automagically AND precisely.

                          Part of the magic is that von Thun conflated the function composition operator with token-list concatenation. I don’t think that is absolutely fundamental, but it does make these things easy to talk about.
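
                          A minimal sketch of that conflation, modelling a program as a list of stack transformers:

                            def run(program, stack):
                                for word in program:  # running a token list = composing functions
                                    stack = word(stack)
                                return stack

                            dup = lambda s: s + [s[-1]]
                            mul = lambda s: s[:-2] + [s[-2] * s[-1]]

                            square = [dup, mul]             # concatenating the token lists ...
                            assert run(square, [3]) == [9]  # ... composes their effects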

                          1. 1

                            The core of the notion of “invertible” is it has to be not lossy

                            Yes, but the same limitation does not apply to ‘dual’.

                            For instance, consider ‘take 5’. Obviously this is not an invertible operation in general. But ‘the dual of negate under take 5’ can take the first 5 elements of a list, negate them, and then affix them to the rest of the list, effectively inverting the ‘take’.
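
                            A sketch (under_take is a made-up helper; the ‘inverse’ needs the suffix that take discarded, which is exactly what makes this a dual rather than a plain inverse):

                              def under_take(n, f):
                                  # dual of f under take n: transform the prefix, reattach the suffix
                                  return lambda xs: [f(x) for x in xs[:n]] + list(xs[n:])

                              negate_first_5 = under_take(5, lambda x: -x)
                              assert negate_first_5([1, 2, 3, 4, 5, 6, 7]) == [-1, -2, -3, -4, -5, 6, 7]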
