1. 12
  1.  

  2. 4

    I regularly do manual and automated code-analysis in my day job. Firefox’s main repo is quite big and (rip)grep gets you very far indeed, but I want to re-emphasize one thing:

    If you want to search exhaustively, you need a real code analysis tool for your language.

    Just think of JavaScript code like this[event.message]. Oh, the things I have seen… :)
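To illustrate why a text search cannot be exhaustive here, below is a minimal sketch using a Python analogue of the JavaScript `this[event.message]` pattern. The `Handler` class and its method names are hypothetical, made up for this example. The method is reached through a computed name, so a search for direct call sites finds nothing even though the method clearly runs.

```python
import re

# Hypothetical source file, kept as a string so we can both search and run it.
SOURCE = '''
class Handler:
    def on_resume(self):
        return "resumed"

    def dispatch(self, event):
        # Computed lookup: the name "on_resume" is assembled at runtime.
        return getattr(self, "on_" + event)()
'''

# A grep-style search for direct call sites such as `obj.on_resume(`.
call_sites = re.findall(r"\.on_resume\(", SOURCE)

# Yet the method is reachable: run the code and dispatch an event.
namespace = {}
exec(SOURCE, namespace)
result = namespace["Handler"]().dispatch("resume")

print(call_sites)  # no textual call sites
print(result)      # the method ran anyway
```

A tool that understands the language can at least flag the computed lookup as "unknown target", which a plain text search has no way to express.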

    1. 5

      This brings to mind alternatives to grep: https://blog.burntsushi.net/ripgrep/#code-search-benchmarks

      1. 4

        ripgrep is a really amazing tool for searching through source. The speed is nice, but for me the main feature is its ability to respect .gitignore, and ignore binary and hidden files.
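As a rough sketch of what that kind of filtering involves (this is not ripgrep's implementation), the Python below walks a tree and skips hidden entries, names matching a list of ignore globs, and files that look binary. The helper names, the use of simple `fnmatch` globs in place of real .gitignore semantics, and the NUL-byte check for binary files are all assumptions made for illustration.

```python
import fnmatch
import os
import tempfile


def is_probably_binary(path, sample_size=1024):
    """Heuristic (an assumption here): a NUL byte near the start means binary."""
    with open(path, "rb") as f:
        return b"\0" in f.read(sample_size)


def matches_any(name, patterns):
    return any(fnmatch.fnmatch(name, p) for p in patterns)


def searchable_files(root, ignore_patterns):
    """Yield paths relative to root, skipping hidden, ignored, and binary files."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [
            d for d in dirnames
            if not d.startswith(".") and not matches_any(d, ignore_patterns)
        ]
        for name in filenames:
            if name.startswith(".") or matches_any(name, ignore_patterns):
                continue
            path = os.path.join(dirpath, name)
            if is_probably_binary(path):
                continue
            yield os.path.relpath(path, root)


def demo():
    """Build a small throwaway tree and list what would be searched."""
    with tempfile.TemporaryDirectory() as root:
        for d in ("src", ".git", "node_modules"):
            os.makedirs(os.path.join(root, d))
        files = {
            os.path.join("src", "main.js"): b"console.log('hi');\n",
            os.path.join("src", "logo.png"): b"\x89PNG\0\0",
            os.path.join(".git", "config"): b"[core]\n",
            os.path.join("node_modules", "dep.js"): b"module.exports = 1;\n",
            ".env": b"SECRET=1\n",
            "build.log": b"ok\n",
        }
        for rel, data in files.items():
            with open(os.path.join(root, rel), "wb") as f:
                f.write(data)
        return sorted(searchable_files(root, ["node_modules", "*.log"]))


print(demo())
```

Only the source file under `src` survives the filters; everything else is skipped before any pattern matching would start.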

        1. 2

          I see everyone talk about ripgrep.

          But no one talks about git grep.

          Any dev who has git installed can use git grep. It’s just as fast as ripgrep.

          1. 6

            As the author of ripgrep, I do think this is a somewhat interesting phenomenon. In the common case, it’s quite likely that git grep and rg are going to run at similar speeds. For example, on a checkout of the Linux repo:

            $ time git grep EVENT_PM_RESUME | wc -l
            2
            
            real    0.336
            user    0.774
            sys     1.024
            maxmem  53 MB
            faults  0
            
            $ time rg EVENT_PM_RESUME | wc -l
            2
            
            real    0.136
            user    0.724
            sys     0.674
            maxmem  21 MB
            faults  0
            

            But still, ripgrep is twice as fast here. 0.3s is still pretty fast, so you could reasonably say that the speed difference doesn’t matter too much. However, if you go search an even bigger repo, such as the Chromium repository, then the speed difference becomes much greater:

            $ time git grep Openbox | wc -l
            4
            
            real    3.668
            user    6.343
            sys     6.281
            maxmem  179 MB
            faults  0
            
            $ time rg Openbox | wc -l
            4
            
            real    0.581
            user    3.714
            sys     2.740
            maxmem  112 MB
            faults  0
            

            This gets close to an order of magnitude difference, and it’s definitely noticeable. But even this difference can get much worse if you’re searching anything other than a basic pattern:

            $ time LC_ALL=C git grep -E '[^ ]Openbox' | wc -l
            2
            
            real    16.960
            user    1:13.68
            sys     3.456
            maxmem  180 MB
            faults  0
            
            $ time LC_ALL=en_US.UTF-8 git grep -E '[^ ]Openbox' | wc -l
            2
            
            real    44.895
            user    3:18.59
            sys     2.702
            maxmem  180 MB
            faults  0
            
            $ time rg '[^ ]Openbox' | wc -l
            2
            
            real    0.566
            user    3.353
            sys     2.930
            maxmem  111 MB
            faults  0
            

            (Where the ripgrep search is effectively equivalent to running in the en_US.UTF-8 locale automatically. Also note that you can make git grep go faster with the -P switch, but it’s still slower than ripgrep. And all of this in spite of the fact that ripgrep re-parses and matches .gitignore files (of which there are ~100 in the Chromium repo) on every search, whereas git grep just uses its index directly.)

            In these examples, I think saying git grep is just as fast as ripgrep is not quite right. If a lot of people are working on repositories this big—and maybe that is the case—then this could perhaps explain why git grep doesn’t get as much press. Slight changes to patterns result in performance cliffs.

            And I think the point that @craftyguy brings up is valid too. Sure, they were talking about source specifically, but that doesn’t negate the fact that ripgrep should work well anywhere.

            1. 2

              I am not saying git grep is better than ripgrep. All I’m saying is, in many circumstances, it’s enough. Chromium is a massive project, and there are not many that fall into this category. If you’re a developer who falls into this category I 100% agree you should use ripgrep.

              In fact, if you can install ripgrep from your distro’s repos, I would suggest that anyone do so.

              You’ve done some wicked work, and in no way did I mean to offend. I hope you understand where I’m coming from.

              1. 3

                I wasn’t offended… I was just responding to two points that you raised, effectively: why don’t people talk about git grep and git grep is as fast as ripgrep.

                Myths about the performance of applications spread like wildfire, and they are constantly repeated. I just try to do my best to add more context when I see something unqualified like that. That’s why I spent so much time on benchmarks. I’m not trying to “convince” you to use ripgrep, or even to say that it is better; git grep does some things better than ripgrep.

                Chromium is a massive project, and there are not many that fall into this category.

                Eh. It is. But if you go take a look at monorepos used at companies, they can get quite big quite fast. I work at a tiny startup and it’s already about 10% the size of the Chromium repo. For FOSS projects, sure, there aren’t too many that fall into the same size as Chromium.

            2. 2

              Well, ripgrep works outside of git repos, so there’s some benefit in that you don’t have to learn two ways to grep: one for git source and one for everything else.

              1. 2

                I specifically brought it up because you mentioned source :)

                ripgrep is a really amazing tool for searching through source

                In any case: yes, ripgrep is useful for out-of-git searches. What I’m suggesting is that devs can get the speed of ripgrep today in their projects.

                1. 2

                  I switched from git grep to ripgrep after my workplace switched from git to Mercurial. Since I was already using it at work, I just got used to using it everywhere, even when I was already in a git repo.

        2. 2

          This is a very real topic for me. I’ve worked on multiple legacy web applications with horrific technical debt. One thing I’d like to be able to do is “dead function elimination”. Finding all (JS) functions and their uses is nontrivial with grep and friends. Does anyone have experience or tips on this?

          1. 1

            Generally that’s a linker/compiler internals kind of thing, looking for symbol dependency across source files. There’s also the various versions of late binding, so public functions? Good luck.
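A minimal sketch of that symbol-dependency idea, using Python's standard `ast` module on Python source rather than JavaScript (an assumption made purely to keep the example self-contained). The function `unreferenced_functions` and the sample code are hypothetical. It lists functions whose names are never referenced, and it also shows the late-binding problem: a function reached through a computed name is wrongly reported as dead.

```python
import ast


def unreferenced_functions(source):
    """Return names of functions that are defined but never referenced by name."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)      # plain references such as used()
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)    # attribute references such as obj.used()
    return sorted(defined - referenced)


SAMPLE = '''
def used():
    return 1

def unused():
    return 2

def late_bound():
    return 3

print(used())
print(globals()["late_" + "bound"]())
'''

candidates = unreferenced_functions(SAMPLE)
print(candidates)
```

The output is a list of candidates for human review, not a list of functions that are safe to delete: `late_bound` shows up even though the sample calls it.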

            I’m interested in looking more into the Language Server Protocol stuff that’s coming out. There might be some goodness there to check into. I don’t know.

            ADD: Bottom-up tech debt elimination on huge codebases is a bear. Your best bet is probably to write clean code for the work you do, shim up the work around what you change, and then start a refactoring effort from the outside in. Each of these by themselves might not work, but all together they tend to eventually get the job done.

            1. 1

              A combination of file name and file content search, and a human, is the only for-sure way to know what’s useful and useless.

            2. 2

              At my last job I wrote a shell-based tool much like this for auditing and generating sed-based redaction scripts for large codebases. Mostly it was for finding internal server names, IP addresses, and common encryption-key strings.

              Initially I used the_silver_searcher and mawk, but dropped back to straight grep and awk because the slight speed gain wasn’t worth the time explaining how to install them for non-developer security personnel. GNU parallel was the only concession I made.

              The final speed of the tool was around 5 minutes to search a 2.5 GB (52,370 files) repository on a 2015-era MacBook Pro. (Running the sed redaction scripts was another thing entirely.)
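For readers who want a feel for this kind of audit, here is a small sketch of the IP-address part only, written in Python rather than the grep/awk/sed tooling described above. The regular expression, the `find_ipv4` helper, and the sample text are assumptions for illustration. Candidates found by the pattern are validated with the standard `ipaddress` module so that strings like `999.1.1.1` are not reported.

```python
import ipaddress
import re

# Loose textual candidate: four dot-separated groups of 1 to 3 digits.
CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ipv4(text):
    """Return (line_number, address) pairs for valid IPv4 addresses in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group())
            except ValueError:
                continue  # looks like an address but is out of range
            hits.append((lineno, match.group()))
    return hits


SAMPLE = "db_host = 10.0.12.7\nversion = 1.2.3\nbad = 999.1.1.1\n"
hits = find_ipv4(SAMPLE)
print(hits)
```

A real audit would also have to decide what to do with four-part version strings such as `1.2.3.4`, which pass this validation.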

              1. 1

                My colleagues always ask why I leave PhpStorm for grep. Usually, it is not necessary, but sometimes I might need to awk around a bit and it’s good to already be in the right environment.

                If you haven’t seen it yet, check out ag, “the silver searcher”. It’s the only “grep replacement” that I could get behind after two decades of using grep.

                1. 0

                  ack is even better.