      I built something similar to this a few weeks ago. It has been a trove of insight into how distro packages and ELF binaries work in the wild.


      A little digging through the output turns up the error message xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option, so unsurprisingly a single-quote in a file name somewhere is totally hosing our shell script. This is where a sane person would drop bash like the live grenade it is; a few minutes of trying to make the xargs -0 option cooperate results in it stubbornly saying xargs: argument line too long, which is just so helpful.

      Try xargs -d $'\n' when you have one filename per line. The only special character will be newline, not quotes or spaces or backslashes.
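A minimal sketch of the difference, assuming GNU xargs (the -d flag is a GNU extension). The temporary directory and file names are made up for illustration. GNU xargs also accepts the escape `'\n'` directly, which avoids relying on bash's `$'\n'` quoting:

```shell
# Hypothetical test data: one file name contains a single quote.
tmp=$(mktemp -d)
touch "$tmp/it's here.txt" "$tmp/plain.txt"
printf '%s\n' "$tmp/it's here.txt" "$tmp/plain.txt" > "$tmp/files.txt"

# With -d '\n', newline is the only delimiter, so the quote and the
# space in the first name are passed through to ls untouched.
# (Plain `xargs ls` on the same input stops with "unmatched single quote".)
count=$(xargs -d '\n' ls < "$tmp/files.txt" | wc -l)
echo "files listed: $count"

rm -rf "$tmp"
```

Each line of files.txt reaches ls as exactly one argument, which is what you want for find-style output.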

      while read -r line is another alternative in bash, but I tend to use xargs. (Without -r, bash will mangle backslashes, i.e. it has the wrong default.)
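For comparison, a rough sketch of the read loop. The sample names are invented, and the `IFS=` prefix is an addition not mentioned above; it stops read from trimming leading and trailing whitespace:

```shell
# Hypothetical input: a literal backslash, a quote, and leading spaces.
tmp=$(mktemp -d)
printf '%s\n' 'a\tb.txt' "it's here.txt" '  padded.txt' > "$tmp/files.txt"

n=0
first=
while IFS= read -r line; do
    n=$((n + 1))
    # Remember the first line to show the backslash survived intact.
    if [ "$n" -eq 1 ]; then first=$line; fi
    printf '[%s]\n' "$line"
done < "$tmp/files.txt"

rm -rf "$tmp"
```

Dropping -r would turn `a\tb.txt` into `atb.txt`, because read treats the backslash as an escape character.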

      This post explains why this is a good idea, and why -0 and “word” splitting tend to be less useful: most Unix tools (grep, awk, sed) work on lines. So xargs has the wrong default too!

      I tend to use TSV and cut / awk if I need to do this kind of analysis, and then go to R and tidyverse once bash gets out of hand; see “What is a Data Frame? In Python, R, and SQL”.

      It works very much like Unix pipes on tables; there’s no parsing and splitting once you load the TSV file. Here are some real examples: https://github.com/oilshell/oil/blob/master/benchmarks/report.R#L168

      In other words, you can split the work between data ingestion in bash (and Python) and data analysis in R or SQL.
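As a rough illustration of the TSV step, here is a sketch with an invented two-column file (binary path, library name). The column layout is an assumption for the example, not what the article produces:

```shell
# Hypothetical TSV: column 1 is a binary, column 2 is a library it links.
tmp=$(mktemp -d)
printf 'bin/a\tlibc.so.6\nbin/a\tlibm.so.6\nbin/b\tlibc.so.6\n' > "$tmp/deps.tsv"

# cut splits on tabs by default, so no quoting or word splitting is involved.
top=$(cut -f2 "$tmp/deps.tsv" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "most used library: $top"

# The same count in awk, emitted as TSV again so it can feed R or SQL.
counts=$(awk -F'\t' '{n[$2]++} END {for (l in n) printf "%s\t%d\n", l, n[l]}' "$tmp/deps.tsv" | sort)
printf '%s\n' "$counts"

rm -rf "$tmp"
```

Keeping the output as TSV means the later analysis stage can load it as a table without reparsing anything.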

      TODO: Find that reference about Python DLL internal function call optimization cutting 30% off of runtimes.


        Thank you!

      Ok. Do we have any duplicates?

      sort files.txt | uniq | wc

      368435 ...

      Not sure what the point of this is, unless you expect find to print the same path more than once for some reason?

      Maybe you wanted something like <files.txt xargs -n 1 basename | sort | uniq | wc?
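To show why the two pipelines can differ, a small sketch with a made-up files.txt in which two different paths share a base name:

```shell
# Hypothetical find output: every path is unique, but "foo" appears twice.
tmp=$(mktemp -d)
printf '%s\n' /usr/bin/foo /usr/local/bin/foo /usr/bin/bar > "$tmp/files.txt"

# Deduplicating full paths changes nothing here.
unique_paths=$(sort "$tmp/files.txt" | uniq | wc -l)

# Deduplicating base names collapses the two foo entries.
unique_names=$(xargs -n 1 basename < "$tmp/files.txt" | sort | uniq | wc -l)

echo "paths: $unique_paths, names: $unique_names"
rm -rf "$tmp"
```

These sample paths contain no quotes or spaces, so plain xargs is fine here; real find output may need the -d approach mentioned earlier in the thread.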

        Just to sanity-check, really. “Does this actually look the way I think it does? Ok good.”

          Fair enough.

      I’m not in front of a console so I’m not 100% sure, but I believe library usage is counted multiple times in some cases. Specifically, libraries are executable too, so they get passed to ldd and list their own dependencies. So if you have a dependency chain A-B-C-D, D will be counted as used 3 times, even though only A is a real entry point.
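A toy simulation of that overcount, without running ldd at all. The `deps` text below is a hypothetical stand-in for ldd output on A, B, and C, written on the assumption that ldd reports transitive dependencies as well as direct ones:

```shell
# Hypothetical per-file dependency lists for the chain A -> B -> C -> D.
# Format (invented for this sketch): "file: dep dep dep"
deps=$(printf '%s\n' 'A: B C D' 'B: C D' 'C: D')

# Tally every dependency mention, the way a naive "count ldd lines" would.
d_count=$(printf '%s\n' "$deps" |
    awk '{for (i = 2; i <= NF; i++) n[$i]++} END {print n["D"]}')

echo "D counted $d_count times"
```

Restricting the input to real entry points (only the A line here) would bring the count for D back down to one.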

        Good catch, I’ll see if I can double check this.