A little digging through the output turns up the error message xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option, so unsurprisingly a single-quote in a file name somewhere is totally hosing our shell script. This is where a sane person would drop bash like the live grenade it is; a few minutes of trying to make the xargs -0 option cooperate results in it stubbornly saying xargs: argument line too long, which is just so helpful.
Try xargs -d $'\n' when you have one filename per line. The only special character will be newline, not quotes or spaces or backslashes.
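A minimal sketch of that suggestion, assuming GNU xargs (the -d option is a GNU extension) and a made-up scratch directory with awkward file names. Note that GNU xargs interprets the '\n' escape itself, so the plain-quoted form below passes the same delimiter as $'\n' does in bash:

```shell
# Hypothetical scratch directory holding names with a quote, spaces, and a backslash.
tmpdir=$(mktemp -d)
touch "$tmpdir/it's here.txt" "$tmpdir/two  spaces.txt" "$tmpdir/back\\slash.txt"

# One path per line from find; with -d '\n' only the newline is special,
# so the single quote no longer triggers "unmatched single quote".
count=$(find "$tmpdir" -type f | xargs -d '\n' ls -1 | wc -l)
echo "files seen: $count"

rm -rf "$tmpdir"
```

Dropping the -d option from the same pipeline should reproduce the quoting error from the post, since the default mode treats quotes and backslashes as special.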
while read -r line is another alternative in bash, but I tend to use xargs. (Without -r, bash will mangle backslashes, i.e. it has the wrong default.)
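For comparison, a sketch of the read loop on a made-up list of names. The IFS= prefix is my addition, not part of the comment above; it keeps leading and trailing whitespace intact, while -r keeps backslashes literal:

```shell
# Hypothetical input: one file name per line, the last one containing a backslash.
names=$(printf '%s\n' 'plain.txt' "it's here.txt" 'back\slash.txt')

seen=0
last=
# Feed the loop from a here-document rather than a pipe, so the
# variables set inside the loop are still visible afterwards.
while IFS= read -r line; do
  seen=$((seen + 1))
  last=$line
  printf 'got: %s\n' "$line"
done <<EOF
$names
EOF

echo "lines read: $seen"
```

Without -r, the backslash in the last name would be consumed as an escape character and the name would come out altered.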
This post explains why this is a good idea, and why -0 and “word” splitting tends to be less useful – because most Unix tools work on lines (grep, awk, sed). So xargs has the wrong default too!
I tend to use TSV and cut / awk if I need to do this kind of analysis, and then go to R and tidyverse once bash gets out of hand; see “What is a Data Frame? In Python, R, and SQL”.
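As a rough illustration of the TSV-plus-cut/awk style, here is a sketch on invented data (the binary and library names are placeholders, not results from the post):

```shell
# Hypothetical TSV: binary name in column 1, one linked library in column 2.
tsv=$(printf '%s\t%s\n' ls libc.so.6 ls libselinux.so.1 cat libc.so.6)

# cut selects the library column (tab is its default delimiter),
# sort | uniq -c tallies users per library, awk writes TSV back out.
counts=$(printf '%s\n' "$tsv" | cut -f 2 | sort | uniq -c | awk '{ print $2 "\t" $1 }')
printf '%s\n' "$counts"
```

Every stage reads and writes lines, which is why the line-oriented tools compose without any quoting concerns as long as the fields contain no tabs or newlines.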
I’m not in front of a console so not 100% sure, but I believe the library usage is counted multiple times in some cases. Specifically libraries will be executable too and will list dependencies through ldd. So if you have dependency chain A-B-C-D, D will be counted as used 3 times, even though only A is a real entry point.
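A toy sketch of that concern, using invented ldd-style pairs rather than real ldd output. It assumes ldd reports the full transitive set for each object, so that running it over every file, libraries included, repeats the deeper dependencies:

```shell
# Hypothetical "object<TAB>dependency" pairs for the chain A-B-C-D,
# as a transitive listing would report them for each object.
pairs=$(printf '%s\t%s\n' A B A C A D B C B D C D)

# Tally over every object, libraries included: D shows up three times.
all=$(printf '%s\n' "$pairs" | awk -F '\t' '$2 == "D"' | wc -l)

# Tally only from the real entry point A: D shows up once.
entry=$(printf '%s\n' "$pairs" | awk -F '\t' '$1 == "A" && $2 == "D"' | wc -l)

echo "all objects: $all, entry points only: $entry"
```

Whether the original script is affected depends on whether its file list includes the shared libraries themselves, which would need checking against the real data.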
I built something similar to this a few weeks ago. It has been a trove of insight into how distro packages and ELF binaries work in the wild.
https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-linux-packages-with-linux-package-analyzer/
The tidyverse works very much like Unix pipes on tables; there’s no parsing and splitting once you load the TSV file. Here are some real examples: https://github.com/oilshell/oil/blob/master/benchmarks/report.R#L168
I.e. you can divide the work between data ingestion in bash (and Python) and data analysis in R or SQL.
https://bugs.python.org/issue38980
Thank you!
Not sure what the point of this is, unless you expect find to print the same path more than once for some reason?
Maybe you wanted something like the following?
<files.txt xargs -n 1 basename | sort | uniq | wc
Just to sanity-check, really. “Does this actually look the way I think it does? Ok good.”
Fair enough.
I’m not in front of a console so not 100% sure, but I believe the library usage is counted multiple times in some cases. Specifically libraries will be executable too and will list dependencies through ldd. So if you have dependency chain A-B-C-D, D will be counted as used 3 times, even though only A is a real entry point.
Good catch, I’ll see if I can double check this.