1. 10

Hello fellow lobsters,

I need to organize my images and photos and I’d like to ask you for tips and tricks. I’ve stored most of my images on a private Nextcloud instance, and a few years ago I gave my girlfriend her own account where she has uploaded hers.

As time progressed this became a bit messy. Not only are there literal duplicates, there are also similar files, i.e. resized and recompressed images: for example, copies from our Signal caches, and images that were piped through a nifty Android app called Send Reduced Pro before we published them or shared them with friends and family, who don’t need the huge originals.

My plan is to go to the data center and copy both our Nextcloud data directories to a temporary disk, so that I can process them more easily at home on my Linux desktop computer. I know of a program called fdupes that will allow me to find exact duplicates.
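
For the exact-duplicate step, the idea of matching files by content can be sketched in a few lines of Python. This is not how fdupes is implemented, just a minimal illustration that assumes a SHA-256 digest of the file contents is a good enough identity check; `file_digest` and `find_exact_duplicates` are made-up names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` whose contents are byte-for-byte identical."""
    # Files of different sizes cannot be identical, so group by size first
    # and only hash the files that share a size with at least one other file.
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Each returned group is a list of paths with identical contents. Nothing is deleted, so the result can be reviewed before cleaning up.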

But how can I find similar pictures that look the same even though their checksums differ? I would rather not resort to using online tools. I know, for example, that Google supports a ‘similar image’ search, but are there standalone programs that can do this? I’d prefer to keep only the high-res versions.

The end result will be limited to saving pictures in directories. I’m aware that there are (pseudo-)filesystems that support tagging, for example, and while those do seem useful, they aren’t an option as far as I’m aware, because I will move the pictures back to my Nextcloud instance. It might still be an interesting thing to share, though, for others who might benefit from such things.

Looking forward to reading your input!


  2. 3

    Hmmm. It’d be sorta neat to implement something that could use decoded RGB data and a locality-sensitive hash to build a duplicate detector.
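
    One way to read this idea, as a rough sketch only: assume each image has already been decoded and reduced to a 64-bit fingerprint (that step is not shown here), then split every fingerprint into bands and bucket on them, so that near-identical fingerprints collide without comparing every pair. The names `bands` and `candidate_pairs` and the choice of 4 bands of 16 bits are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def bands(fingerprint, num_bands=4, band_bits=16):
    """Split an integer fingerprint into (band index, band value) pairs."""
    mask = (1 << band_bits) - 1
    return [(i, (fingerprint >> (i * band_bits)) & mask) for i in range(num_bands)]


def candidate_pairs(fingerprints, num_bands=4, band_bits=16):
    """Return pairs of names whose fingerprints agree on at least one band.

    `fingerprints` maps a name to an integer of num_bands * band_bits bits.
    """
    buckets = defaultdict(list)
    for name, fingerprint in fingerprints.items():
        for band in bands(fingerprint, num_bands, band_bits):
            buckets[band].append(name)

    # Any two names sharing a bucket become a candidate pair.
    pairs = set()
    for names in buckets.values():
        for left, right in combinations(sorted(names), 2):
            pairs.add((left, right))
    return pairs
```

    With 4 bands, two fingerprints that differ in at most 3 bits always share at least one untouched band, so they are guaranteed to show up as a candidate pair. Candidates still need a real comparison afterwards, since unrelated images can collide on a single band.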

    1. 1

      Something like this has been on my business ideas list for a few years after I watched a semi-pro photographer struggle to keep track of multiple copies of photos (not even getting into versioning).

    2. 2

      With respect to image signatures/hashes (as others have mentioned), ImageMagick offers this via its identify command:


      Here we display the image texture features, moments, perceptual hash, and the number of unique colors in the image:

      identify -verbose -features 1 -moments -unique image.png

      I’ve been using it only for strict comparisons though so I’m not sure how it will handle resizes/minor differences =/

      Also, for reference, I’m on an older version of ImageMagick and was in a tight spot performance-wise, so I had to use the -format variant:

      identify -format '%#' image.png
      # abcdefghash


      %# CALCULATED: ‘signature’ hash of image values

      1. 1

        For checking whether photos are duplicates, you could use something akin to pHash. The technique is called perceptual hashing (see the external links).
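
        To make the idea concrete, here is a minimal sketch of a difference hash (dHash), a simpler relative of the DCT-based pHash algorithm and not pHash itself. It assumes the image has already been decoded into a 2D list of grayscale values by some image library (not shown); `shrink`, `dhash` and `hamming` are illustrative names.

```python
def shrink(gray, width, height):
    """Downscale a 2D grid of grayscale values by averaging rectangular blocks."""
    src_h, src_w = len(gray), len(gray[0])
    out = []
    for row in range(height):
        top = row * src_h // height
        bottom = max((row + 1) * src_h // height, top + 1)
        line = []
        for col in range(width):
            left = col * src_w // width
            right = max((col + 1) * src_w // width, left + 1)
            block = [gray[y][x] for y in range(top, bottom) for x in range(left, right)]
            line.append(sum(block) / len(block))
        out.append(line)
    return out


def dhash(gray, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = shrink(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            # The bit records whether brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

        Two images count as likely duplicates when the Hamming distance between their fingerprints is small. Which threshold works well depends on the collection, so it is worth checking against a few known pairs first.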

        1. 1

          Thanks for the suggestions, everyone! I got a message from outside Lobsters in which someone suggested the program “dupeGuru”. That one seems most apt for my use case (and it does music, too!) but I’ll definitely refer to your suggestions.

          Happy new year.

          1. 1

            You might try doing some simple fingerprinting of the images. You’re probably going to need a little Python and OpenCV, but you’ll be okay. You want to run this as a batch job, right?
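
            As a rough sketch of what such a batch job could look like once the fingerprints exist: the code below assumes every image has already been reduced to an integer fingerprint (for instance with OpenCV, not shown) and that its pixel dimensions are known. The record fields, the function names and the threshold of 6 bits are all assumptions for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def group_similar(records, max_distance=6):
    """Cluster records whose fingerprints are within `max_distance` bits.

    Each record is a dict with "path", "fingerprint", "width" and "height".
    Union-find merges chains of similar images into a single group.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All-pairs comparison: simple, but quadratic in the number of images.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            distance = hamming(records[i]["fingerprint"], records[j]["fingerprint"])
            if distance <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for index, record in enumerate(records):
        groups.setdefault(find(index), []).append(record)
    return list(groups.values())


def plan_cleanup(records, max_distance=6):
    """Return (keep, discard) path lists, keeping the largest image per group."""
    keep, discard = [], []
    for group in group_similar(records, max_distance):
        ranked = sorted(group, key=lambda r: r["width"] * r["height"], reverse=True)
        keep.append(ranked[0]["path"])
        discard.extend(r["path"] for r in ranked[1:])
    return sorted(keep), sorted(discard)
```

            The result is only a plan: it lists which paths to keep and which to discard, so the high-res originals can be double-checked before anything is removed.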

            Some other approaches are discussed here.

            If you want, come swing by the Lobsters IRC around 1600 central tomorrow and maybe we can poke at it together.

            1. 1

              Hey, that’s cool, thanks for the tip and the invite. I’m on daddy duty again, which resumed today after two weeks of vacation. And I’m in the Netherlands, so 1600 central is a bit too late for me. But because it’s on IRC we can use it as a slow chat. So I’ll pop in tomorrow and ping you there. Nice!

            2. 1

              Do not try “Duplicate Photos Fixer Pro” by Systweak Software. It was discounted to less than £1, so I thought it worth trying, but it didn’t work in the slightest. I spent some time trying to make it work, but it found no more than a handful of duplicates (which were not even remotely similar) in my list of 30k+ photos, and it didn’t find any near duplicates among the 12k+ nearly identical baby photos I took as a new parent. (Hint: there are many. Sleeping newborn babies don’t move much, and when you take 300 photos of them in a five-minute window, you get a lot of near-duplicate photos.)