Hi Lobsters! Recently, I decided to take care of a task I had been putting off for a while: organizing and de-duping the data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.
There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don’t help with the work that comes after that. With ~500k dupes on our server, that wasn’t workable, so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils: for example, psc ls highlights duplicates, and psc rm deletes files only if they have duplicates elsewhere.
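To make the "delete only if a duplicate exists elsewhere" idea concrete, here is a minimal Python sketch. It is not how psc is implemented; the names `file_digest`, `build_index`, and `safe_rm` are made up for illustration, and it assumes that matching SHA-256 digests mean identical content.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def build_index(root):
    """Map content digest -> set of resolved paths under root (hypothetical helper)."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.realpath(os.path.join(dirpath, name))
            if os.path.isfile(path):
                index[file_digest(path)].add(path)
    return index


def safe_rm(path, index):
    """Delete path only if another existing file in the index has the same content.

    Returns True if the file was removed, False if it was the last known copy.
    """
    path = os.path.realpath(path)
    digest = file_digest(path)
    others = {
        p for p in index.get(digest, set())
        if p != path and os.path.exists(p)
    }
    if not others:
        return False  # no other copy known: refuse to delete
    os.remove(path)
    index[digest].discard(path)
    return True
```

Note that this sketch treats two hard links to the same inode as separate copies, and it trusts the index to be fresh; a real tool would want to re-verify the surviving copy before deleting anything.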
I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I’m curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.
I think that what you’re referring to is called “single-instance storage”, where the data is unique at the file level. Data deduplication is done at the block level, using chunks of data. This means that even if multiple files differ but share a common portion, that portion will only be stored once.
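A toy Python sketch of the block-level idea, in case it helps. It uses fixed-size chunks and an in-memory dict as the chunk store; both are simplifying assumptions (real tools often use content-defined chunking and an on-disk store), and `store_file` / `restore_file` are hypothetical names.

```python
import hashlib


def store_file(data, chunk_store, chunk_size=4):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns the "recipe": the ordered list of chunk digests needed to rebuild data.
    """
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)  # only stored if not seen before
        recipe.append(key)
    return recipe


def restore_file(recipe, chunk_store):
    """Rebuild the original bytes from a recipe of chunk digests."""
    return b"".join(chunk_store[key] for key in recipe)


store = {}
first = store_file(b"AAAABBBBCCCC", store)
second = store_file(b"AAAABBBBDDDD", store)  # shares its first two chunks with `first`
```

The two inputs differ as whole files, so file-level single-instance storage would keep both in full, while the chunk store here holds the shared AAAA and BBBB chunks only once.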
I was a bit confused when reading your article, and kept waiting for the actual “deduplication” to happen. The write-up is pretty cool, and so is the tool! Nice article, apart from the title, which I found confusing ^^
I see how “deduplication” is confusing and isn’t quite the right word to use. I guess I was thinking about it in terms of the layman’s intuitive definition of deduplication rather than the technical term. Is “single-instance storage” correct? It seems that it means something similar to (the technical term) deduplication, retaining multiple ways to access the file while having a single instance on disk, so e.g. replacing all duplicates with hard links would qualify. I’m not sure what technical term describes what I was going for…
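For reference, the hard-link variant mentioned above could look roughly like this in Python. It is only a sketch under a few assumptions: both paths are on the same filesystem, nothing else is writing to the files, and the temporary name `.linktmp` (made up here) is free to use.

```python
import filecmp
import os


def replace_with_hard_link(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if their contents match.

    Returns False if the two paths already refer to the same file.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already the same inode, nothing to do
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("files differ; refusing to link")
    tmp = duplicate + ".linktmp"  # hypothetical temporary name
    os.link(keep, tmp)            # create a second name for keep's data
    os.replace(tmp, duplicate)    # swap it into place over the duplicate
    return True
```

Both names stay usable afterwards, but they now share one copy of the data on disk, so editing the file through one name changes it for the other too.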
To be fair, I don’t know! I pointed out the meaning of deduplication because I worked on such a tool a few months back, and was expecting a similar topic here.
If I had to name what you did, I’d say “remove duplicates”? But then you’d lose the pun with the periscope 😉