1. 2

    Great work, made fruitful by combining Microsoft’s resources with GitHub’s existing ones.

    I have been making use of some of the changes GitHub published upstream to git, in GitLab’s Gitaly. Both GitLab and Gitea are quite behind in supporting bigger-repo use cases.

    1. 1

      I’m surprised that they still keep recompressing pack files to be bigger and bigger. IIUC the main benefit of large packfiles is a large surface for delta compression. However, most of the repo won’t delta-compress with the rest of the repo, so why do you need to stuff it into one file? It seems like once pack files are a GiB or so, there is little benefit to making them larger.

      I guess we will learn more about this when they complete the additional work mentioned at the bottom of the post 🙂

      1. 3

        However, most of the repo won’t delta-compress with the rest of the repo, so why do you need to stuff it into one file?

        Commits are snapshots, not diffs. So multiple versions of a file can be delta-compressed very well.

        Of course, there are always edge cases. But for most git repos, delta compression works quite efficiently.

        1. 2

          It would be ideal if one could tell git what constitutes a component in a large monorepo, so git could produce packfiles per component. But we probably won’t see that for years.

          1. 1

            If you want git to know about the parts, why use a monorepo at all?

            1. 1

              One of the reasons is discovery. Whenever I joined a new company, what helped me the most in figuring out how things work was going through the code. It’s way easier to do in a single repo than in tens/hundreds of them.

        1. 1

          Taking a good break. Cleaning up the house.

          Doing some reading on some basic Bazel rules.

          1. 1

            What are you trying to build with bazel?

          1. 2

            Honestly, I think a Distributed Version Control System (DVCS) is not the correct way to move forward.

            Same with email, docs, Excel sheets… even your compute servers: everything is moving to the cloud. What GitHub and GitLab are proving is that a well-oiled, well-managed centralized version control system is much more desirable:

            • You get more than just version control: you get code review and issue trackers
            • You get CI and CD well integrated
            • You get better scaling for infrastructure that you don’t have to manage

            It’s 2021; everybody who codes does it with an internet connection: Stack Overflow, Google, Hacker News, Reddit, lobste.rs, etc. A connected VCS UX is much more desirable.

            The moment we recognize this fact, we would be much better off building something for an online experience. And that comes with a lot of assumptions you can make about storage, scalability, distribution, latency, etc.

            I personally am keeping a close watch on https://github.com/facebookexperimental/eden/ as it’s built squarely on that philosophy. I think it is the most advanced, best-invested open source VCS solution we have to date.

            1. 3

              I’ve used CVS, SVN, and git. Of the three, I find git the easiest for setting up a new repo: it’s just git init. With CVS, more work was involved (on both ends), and as a result I only ever had two personal projects in CVS. I recall that setting up SVN for my own use required an even more insane setup, and I never bothered with it after getting it installed.

              And not everybody is comfortable with The Cloud(TM).

              1. 1

                Using CVS and SVN as the basis of comparison won’t do centralized VCS justice.

                In my mind, it would be something closer to Gitpod or GitHub Codespaces, where you get a cloud instance provisioned with all the dependencies needed for your development, plus an IDE server. You can then connect to that IDE server using an IDE client (web-based or an actual IDE).

                The idea is that IDE, VCS, CI, CD, monitoring, alerts, and logs should/could all be well-oiled and integrated when they are built together. From the user’s perspective, the IDE frontend should be the only thing they interact with, not Git nor CVS nor Mercurial nor SVN.

              2. 1

                Interesting. Eden is essentially a surrender to staying compatible with Mercurial.

                I also keep an eye on Pijul and wonder how well it will scale.

                1. 1

                  Pijul is too… theoretical in its current state. Same with https://github.com/martinvonz/jj

                  For a VCS to scale, you need a server hosting solution that integrates with different components to enable scaling:

                  • Object Storage for large files
                  • Graph Database (or something similar) to maintain relationships between ‘branches/bookmarks’ and between changesets, …

                  You also want a more mature client solution, with an easy-to-learn UX while having knobs that let power users excel.

                  Finally, you need a migration path for existing code repositories to move to this new solution. I.e., converting from git/svn/mercurial to my-ideal-vcs is a must-have.

              1. 5

                Heh, I personally just implement the thing myself (1, 2) :)

                More practically, I think there are these choices available:

                • use a smart IDE, like IntelliJ
                • use an editor with LSP support (and bug LSP, editor, and server developers to improve support for this, as they don’t let you filter your code vs. library code, or test vs. main code)
                • produce a symbol index as a part of the build process (Google’s Kythe).
                • use an offline regex-based index (ctags). This works well enough, but might not be super precise (as regexes can’t parse programming languages), and has the drawback that index rebuilds must be triggered manually.

                The next two options are straightforward (Christmas-holidays-scale) extensions to the last one, but I don’t think they are available as a popular standalone tool.

                • online indexing – listen for file changes and apply an incremental diff to the index, rather than recomputing it from scratch
                • plug in tree-sitter instead of regular expressions for richer and more correct indexes.
                1. 2

                  Sourcegraph has an open source core, for a standalone tool: https://github.com/sourcegraph/sourcegraph

                  1. 2

                    But does the core actually have indexing capabilities? I don’t know, but I think they are consumers of LSIF (a companion format to LSP) and ctags, rather than producers?

                    GitHub’s semantic might be closer to what I am talking about (but, again, I haven’t looked closely at it).

                    And, naturally, IntelliJ is open source; that would be (well, it is :)) the first place for me to look at how to do language-aware tooling.

                    1. 2

                      I was thinking about this comment more after replying, and I think my first reply is confusing/wrong. I was wondering whether, as an end user, open source Sourcegraph has code search capabilities.

                      You’re right - the core consumes LSIF.

                      https://docs.sourcegraph.com/code_intelligence/references/indexers https://lsif.dev

                      1. 2

                        We’ve run the OSS Sourcegraph indexers internally and they worked quite well. It isn’t currently running (mostly just for lack of use). But unless they changed the OSS offering recently, it definitely works as an indexer.

                    2. 1

                      plug in tree-sitter instead of regular expressions for richer and more correct indexes.

                      I assume you’re referring to this tree-sitter? Looks like an awesome project!

                      @matklad are you aware of any projects using tree-sitter for symbol tagging/indexing? And, with your work on rust-analyzer (thank you for all you’ve put into that), do you have any thoughts on where effort is best placed to improve code indexing outside of big IDEs like IntelliJ? I.e., working more on LSPs to expand support for indexing and searching and to ensure they are robust and work well on large codebases, vs. building a modern ctags-like tool that is perhaps based on tree-sitter.

                      1. 2

                        @matklad are you aware of any projects using tree-sitter for symbol tagging/indexing?

                        I think GitHub’s semantic does that (although I’m not sure), but it tries to go way beyond simple indexing. In terms of making a dent, I think if you want to support a specific language, you really should push on the LSP implementation for that language: making the server more robust & fast, making the client more powerful, and making the protocol more capable and less bad.

                        If you want to improve the baseline support for all languages, I think a tree-sitter based map-reduce indexer could make a meaningful difference. Basically (a rough sketch in code follows the list):

                        • some infra to write mappers, which take tree-sitter’s syntax tree and output a list of (symbol name, symbol kind, symbol span) triples (with maybe some extra attributes)
                        • some fuzzy-search index for symbol names (if using Rust, just take the fst crate (or copy-paste symbol_index.rs from rust-analyzer))
                        • a driver which can:
                          • process a bunch of files offline, embarrassingly parallel
                          • incrementally update the index when files change (that is, if file x changes, remove all of x’s keys from the index, and then re-add the keys after the mapper step).
                        • some client API – the LSP protocol for easy integration, a custom search protocol for more features (streaming & filtering), or just a CLI for the Unix way
                        • an optional persistence layer, to load the index from disk (but you most definitely can live with a purely in-memory index for a long time; don’t implement this from the start, postpone it)
                        • some pluggability, so that it’s not you but your users who maintain grammars & mappers.
                        • a kind of feature creep, but why not – a trigram index on top of the driver/mapper infrastructure, to speed up text-based greps.
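
                        To make the driver and incremental-update parts concrete, here is a minimal, hypothetical Go sketch; toyMapper and all the names are my own assumptions, standing in for a real tree-sitter based mapper:

                        package main

                        // Sketch only: toyMapper stands in for a tree-sitter based mapper
                        // that would emit (symbol name, symbol kind, symbol span) triples.

                        import (
                          "fmt"
                          "strings"
                          "sync"
                        )

                        type Symbol struct {
                          Name, Kind string
                          Line       int
                        }

                        // Index keys symbols by file, so a changed file can be dropped and
                        // re-added without touching other entries.
                        type Index struct {
                          mu     sync.Mutex
                          byFile map[string][]Symbol
                        }

                        func NewIndex() *Index { return &Index{byFile: map[string][]Symbol{}} }

                        // Update is the incremental step: replace all of path's entries
                        // with the freshly mapped ones.
                        func (ix *Index) Update(path, src string, mapper func(string) []Symbol) {
                          ix.mu.Lock()
                          defer ix.mu.Unlock()
                          ix.byFile[path] = mapper(src)
                        }

                        // Lookup is a naive substring match; a real index would use a
                        // fuzzy-search structure (e.g. an fst, as suggested above for Rust).
                        func (ix *Index) Lookup(q string) (out []Symbol) {
                          ix.mu.Lock()
                          defer ix.mu.Unlock()
                          for _, syms := range ix.byFile {
                            for _, s := range syms {
                              if strings.Contains(s.Name, q) {
                                out = append(out, s)
                              }
                            }
                          }
                          return out
                        }

                        // toyMapper: one "symbol" per line that starts with "func ".
                        func toyMapper(src string) (syms []Symbol) {
                          for i, line := range strings.Split(src, "\n") {
                            if rest := strings.TrimPrefix(line, "func "); rest != line {
                              syms = append(syms, Symbol{Name: rest, Kind: "func", Line: i + 1})
                            }
                          }
                          return syms
                        }

                        func main() {
                          ix := NewIndex()
                          files := map[string]string{"a.go": "func Foo() {}", "b.go": "func Bar() {}"}
                          var wg sync.WaitGroup
                          for path, src := range files { // the embarrassingly parallel offline pass
                            wg.Add(1)
                            go func(p, s string) { defer wg.Done(); ix.Update(p, s, toyMapper) }(path, src)
                          }
                          wg.Wait()
                          fmt.Println(ix.Lookup("Foo"))
                        }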
                      2. 1

                        I really wish Kythe had better open source support. The docs are out of date, and it’s hard to get started without a viable frontend.

                      1. 4

                        That “embed” package looks fun!

                        1. 1

                          What is fun about it?

                          1. 15

                            It replaces a whole slew of other packages which do pretty much the same thing.

                            In particular, I’m excited about being able to just embed all the assets for a web-app into the binary. Then you can just distribute the binary and have it serve the assets as if it were from an actual FS.

                            I’m assuming people will end up writing adapters which will allow you to use assets on the FS in dev and the bundled assets in prod. This was hard in the past because there were so many alternatives.
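
                            For what it’s worth, a minimal sketch of such an adapter (the assets directory name and the DEV_MODE switch are my own assumptions, not a standard pattern):

                            package main

                            import (
                              "embed"
                              "io/fs"
                              "log"
                              "net/http"
                              "os"
                            )

                            //go:embed assets
                            var embedded embed.FS

                            // assetFS picks the live directory in dev, the baked-in copy otherwise.
                            func assetFS() fs.FS {
                              if os.Getenv("DEV_MODE") != "" {
                                return os.DirFS("assets") // picks up edits without rebuilding
                              }
                              sub, err := fs.Sub(embedded, "assets") // strip the "assets/" prefix
                              if err != nil {
                                log.Fatal(err)
                              }
                              return sub
                            }

                            func main() {
                              http.Handle("/", http.FileServer(http.FS(assetFS())))
                              log.Fatal(http.ListenAndServe(":8080", nil))
                            }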

                            1. 2

                              I would rather ship the static assets as a separate layer inside my container than bundle them into the binary. But there are use cases outside of containers where embedded static assets would be nice to have, e.g. a desktop UI app.

                              1. 3

                                Not everything runs in a container; not even web apps.

                              2. 1

                                Considering how the Go toolchain has evolved, I imagine (and hope) it will be a native feature of the Go build/test tools, such that you won’t need adapters. It’ll simply be a build flag/annotation.

                          1. 10

                            For a static website you can just use S3 + CloudFront without needing a compute service. Probably a lot cheaper, too.

                            1. 3

                              Less effort and cheaper. Unless a Rube Goldberg award is your goal. ;)

                              1. 3

                                Well, learning is fun, but this tip reminds me that I should get my tech blog up and running again. Using S3 & CloudFront, probably.

                                1. 3

                                  That’s the great thing about static websites: there are so many possible options for building and hosting them.

                                  On the subject of containers, I was pleasantly surprised by Netlify’s approach. It will happily spawn containers for you and let you run any custom script in them. The only fixed part is a TOML file where you tell it what to run and what directory to deploy. How the rest of the build process works is up to you.

                                  [build]
                                    publish = "build/"
                                    command = "./netlify.sh"
                                  

                                  The only annoying part is that it only offers an Ubuntu 16 image.

                                2. 2

                                  Or GitHub Pages/Netlify. Completely free.

                                  1. 2

                                    True. Both GitHub and GitLab Pages are free options that you can just slap a domain on top of.

                                    1. 1

                                      And using existing workflows like peaceiris/actions-gh-pages makes that even easier.

                                      This is what I do for my own website, which is built off this template repository. Click the green “Use this template” button, and you’ve got a static site up and running within a minute.

                                    2. 2

                                      You probably need Route53 as well.

                                      The monthly cost of one of my low-traffic webpages is:

                                      • Domain name: $1.26
                                      • AWS Route53: $0.50
                                      • AWS S3: $0.01
                                      • AWS CloudFront: $0.01

                                      Sum: $1.78. Most of it is domain costs.

                                      I didn’t do much posting this year, so not many S3/CloudFront costs arose. When I was posting more often, and had to invalidate the cache multiple times (corrections, multiple publications in a spree, testing robots processing the RSS feed), the combined S3+CloudFront cost sometimes reached over $0.10!

                                      Also, this setup scales basically infinitely (though then costs also rise) and won’t be slashdotted, unlike nginx running on a potato-tier VM.

                                      1. 1

                                        If you’d like to row against Big *aaS but prefer the pricing model, I’ve been pretty happy with hosting many of my simple (static and dynamic) sites at nearlyfreespeech.net (for over a decade now, I guess!)

                                        (Not being a purist, here; I use Big *aaS in my personal infra where it makes sense.)

                                      1. 7

                                        The fact that GitHub’s PR workflow does not support git diff between force pushes is so mind-bending to me. Their core staff who contribute to git core use rebase regularly.

                                        That GitLab does retain diffs between force pushes is nice to have, but it comes with a long-term performance trade-off.

                                        1. 4

                                          They’ve half-assed this feature now. On the message where it says you force pushed, if you click the text “force pushed,” it will give you a diff.

                                          However, multiple force pushes will be collapsed into one message, and clicking on that link will only show you the latest force push diff.

                                        1. 1

                                          How are you planning to address repacking (with bitmap indexes), garbage collection, commit-graph generations, and backups?

                                          1. 1

                                            How does sccache work? It’s tempting to just plug it in (OK, I’ll admit I already did), but not knowing how or when it would benefit is a little unsettling 🤔

                                            1. 3

                                              The project’s README gives a brief introduction and has additional links. It boils down to caching compiler output, keyed by a hash of the compilation inputs, so repeated compilations can be skipped.

                                              1. 2

                                                There’s a variety of possible setups. E.g., we have a server in the office and about 120 cores’ worth of personal desktop workstations connected, which helps us compile Firefox very fast. For us, besides the caching and distribution, it also makes it less painful that most of our Rust compilation is very serial in nature.

                                                1. 1

                                                  As sccache requires absolute paths to match, how did you manage to get everyone to build at the same location?

                                                  We have a couple of beefy build machines, where people log in and build Rust, but sadly, due to this limitation, we can’t use sccache :(

                                                  1. 1

                                                    I’m merely a user; I’d have to refer you to our sccache experts in our #build:mozilla.org Matrix channel

                                                    1. 1

                                                      You can just configure this variable

                                                      # make cargo invoke rustc through sccache
                                                      export RUSTC_WRAPPER=sccache
                                                      
                                                      1. 1

                                                        Perhaps they implemented some sort of remote build execution and a remote build cache, very similar to the techniques used by Bazel.

                                                        1. 1

                                                          Would a shared chroot do it?

                                                      1. 2

                                                        I use the older codemod tool regularly – didn’t know that a faster Rust version existed. Thanks!

                                                        1. 1

                                                          Is there a specific way to integrate rg with fastmod?

                                                          1. 3

                                                            I don’t think so, but fastmod uses the same regex engine as ripgrep, although it probably doesn’t have all the optimizations that ripgrep has. (Specifically, inner literal optimizations. Those can’t really be added to the regex engine, so they live a layer above it, inside of ripgrep — well, the grep-regex crate.)

                                                        1. 3

                                                          Very nice post. It’s worth mentioning https://kythe.io/ (by Google), which predates LSP and LSIF.

                                                          Essentially, Kythe hooks into Bazel at the compiler level and generates a universal syntax tree that can be serialized for later use. One supported storage backend is Cayley https://kythe.io/examples/#using-cayley-to-explore-a-graphstore (also by Google); another is standard MySQL.

                                                          This is very useful for code intelligence at scale, where you cannot simply load everything at once (like LSIF) but might want to update things incrementally instead.

                                                          1. 3

                                                            It’s worth mentioning https://kythe.io/ (by Google), which predates LSP and LSIF.

                                                            This is mostly tangential, but while LSP and Kythe seem similar, they differ in a fundamental way. Kythe is a semantic-level API: it talks about symbols and usages of symbols. LSP is, by design, a UI-level API: it doesn’t have a concept of a “symbol” per se. Rather, it talks about “when clicking on this offset in this file, the editor should jump to that offset in that other file”. So LSP does not try to build a universal semantic model of a language, and this is one of the reasons why it is successful.

                                                            A Google technology which is much closer to LSP (and which also predates it) would be Dart analysis server protocol: https://htmlpreview.github.io/?https://github.com/dart-lang/sdk/blob/master/pkg/analysis_server/doc/api.html
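
                                                            To make the contrast concrete, here is a rough Go sketch of the shapes the LSP spec uses for textDocument/definition (trimmed down): both request and response speak purely in documents and offsets; no “symbol” entity appears anywhere.

                                                            package lsp

                                                            // Position is a zero-based line/character offset in a document.
                                                            type Position struct {
                                                              Line      int `json:"line"`
                                                              Character int `json:"character"`
                                                            }

                                                            type TextDocumentIdentifier struct {
                                                              URI string `json:"uri"`
                                                            }

                                                            // The request says "the user is at this offset in this file"...
                                                            type DefinitionParams struct {
                                                              TextDocument TextDocumentIdentifier `json:"textDocument"`
                                                              Position     Position               `json:"position"`
                                                            }

                                                            type Range struct {
                                                              Start Position `json:"start"`
                                                              End   Position `json:"end"`
                                                            }

                                                            // ...and the response answers "jump to that range in that other file".
                                                            type Location struct {
                                                              URI   string `json:"uri"`
                                                              Range Range  `json:"range"`
                                                            }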

                                                          1. 4

                                                            How many have more than 1 line of commit message?

                                                            1. 4

                                                              From my testing, I found that there were about 2 million commit messages with only a single present-tense imperative verb, like “Update” or “Commit”. I also noticed a lot of commit messages with multiple lines while testing things out, but I can’t give you an exact count at the moment.

                                                              I would answer your question by writing/running a new query, but I racked up about $500 in charges playing with Google BigQuery yesterday, so I’m trying to get that sorted before I run anything else lol…

                                                              1. 5

                                                                Those aren’t necessarily imperative verbs - “update” and “commit” are both perfectly good English deverbal nouns.

                                                                1. 2

                                                                  Haha very true. I guess at that point it comes down to the intent of the developer, which seems impossible to gauge. Imagining a frustrated developer at his wit’s end commanding his computer to “UPDATE!” or “COMMIT!” is amusing though…

                                                            1. 3

                                                              It’s interesting that they mentioned MSFT’s Scalar. I have been contributing to that project and have made a bash script version: https://github.com/sluongng/git-care

                                                              1. 1

                                                                Wow, pretty neat project! Will check it out on one of our big repos.

                                                              1. 4

                                                                Give my wife a vacation from the kitchen. Also reading through a house purchase agreement.

                                                                  1. 2

                                                                    I think the FUSE stuff I have been reading about from Google’s build system is definitely worth a startup in and of itself.

                                                                    I know that SlothFS is being used for Android repos, as well as Kythe (Grok).

                                                                    But how to use it, and how to extend the implementation (it has hard dependencies on Gerrit), is a bit hard to navigate through Google’s open source documentation as an outsider looking in.

                                                                      1. 1

                                                                        I have seen this doc circulating around, referenced by some of Google’s GitHub repos.

                                                                        Is there a recorded talk about this?

                                                                        1. 1

                                                                          Unfortunately, I don’t know of any recording. However, there is the list of all the slides from that meetup, and it’s a treasure trove :)

                                                                        1. 1

                                                                          Got several Pull Requests / Merge Requests open in open source projects which I started but haven’t yet finished… $DAY_JOB was too busy.

                                                                          Learning how to make kimchi with the wife :)