Reading the source code revealed that cp keeps track of which files
have been copied in a hash table, which now and then has to be resized to avoid
too many collisions. Once the RAM has been used up, this becomes a slow operation.
Since -R is a recursive copy, I would have assumed a simple stack would suffice to keep track of the depth-first traversal. Why would your memory need to scale O(n)!?
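To illustrate the point about the stack: a plain depth-first traversal only needs memory proportional to the tree depth (times per-directory fan-out), not the total file count. A minimal sketch, not cp's actual implementation:

```python
import os
import tempfile

def walk_depth_first(root):
    """Iterative depth-first traversal with an explicit stack.

    The stack holds, at most, the not-yet-visited entries along the
    current path, so memory scales with depth times fan-out rather
    than with the total number of files in the tree.
    """
    stack = [root]
    while stack:
        path = stack.pop()
        yield path
        if os.path.isdir(path) and not os.path.islink(path):
            # Push children; each is visited before earlier stack entries.
            with os.scandir(path) as entries:
                stack.extend(entry.path for entry in entries)

# Demo on a small temporary tree: root/a/b/f.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
open(os.path.join(root, "a", "b", "f.txt"), "w").close()
paths = list(walk_depth_first(root))
```

This is all the state a copy needs if it never has to remember which files it has already seen, which is why the hash table below comes as a surprise until you consider hardlinks.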
GNU cp apparently has options to preserve hardlinks instead of creating duplicates. So it builds a hash table mapping every copied file to its source inode, so it can detect links to files it has already seen later in the traversal.
That’s clearly the situation described in the email, but the email doesn’t actually say which options were used, and the man page is crap.
I’m guessing hardlinks. You’ll need to keep track of all inodes you’ve seen to be able to re-create them, right?
I found an issue in cp that caused 350% extra memory usage for the original bug reporter; fixing it would have kept his working set at least within RAM.
Yes, don’t do that!
If you have that many files, you probably need to come up with some saner archival strategy.