Bear in mind you don’t need to keep an in-memory record of the device:inode pair for every file seen, unless you have an unusual situation: you only need to look at the link count of regular files and remember those whose link count is greater than 1. So it’s “usually fine”, unless you have people hard-linking entire trees.
In an era with bind/loopback mounts, that’s much less likely to be encountered.
Honestly, the bigger issue will be if you’re trying to rsync across a unified presentation view of the filesystem, where the same underlying FS is bind/loopback-mounted into multiple locations. You’ll no longer have the information needed to detect that this has happened, and since the device numbers will differ, the existing deduplication will fail. Fixing that might mean making rsync aware of OS-specific bind/loopback mechanisms and how they present, then both unifying/normalizing device numbers and, instead of relying on the link count, tracking every inode encountered.
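As a sketch of what that OS-specific awareness could look like on Linux: the `/proc/self/mountinfo` format (per proc(5), the 3rd field is the major:minor device, the 4th is the root of the mount within its filesystem, the 5th is the mount point) lets you group mount points backed by the same device and root. The `bind_groups` helper below is hypothetical, takes the file’s text as input, and ignores edge cases like escaped spaces in paths.

```python
def bind_groups(mountinfo_text):
    """Group mount points that expose the same region of the same
    underlying filesystem: identical major:minor device AND identical
    root path within it.  Two mount points in one group are views of
    the same files, even if stat() reports different st_dev values."""
    groups = {}
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        majmin, fs_root, mount_point = fields[2], fields[3], fields[4]
        groups.setdefault((majmin, fs_root), []).append(mount_point)
    # keep only groups with more than one view -- those are the aliases
    return {k: v for k, v in groups.items() if len(v) > 1}
```

The surviving keys give you a normalized identity for each aliased region, which is what you’d fold the per-mount device numbers back into.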
Maybe bind mounts should increment the reported hardlink count of all inodes in both themselves and the filesystem they’re duplicating. (Of course I realise the futility of “maybe everyone should change everything everywhere because of this tiny detail that comes up rarely” :) )
If you stat each directory, you can tell when you’re crossing a mount point: the device changes between the parent and the subdirectory. That gives you a place to look for aliased filesystems. With FreeBSD nullfs mounts the inode numbers are unchanged but the device node is different for each mount; I believe the same is true for bind mounts on Linux. This means that you need to do something OS-specific to parse the mount table and understand the remapping.
It would be nice if stat could be extended with an ‘underlying device’ field, so that you could differentiate between the device providing the mapping and the raw filesystem. That would be simpler than incrementing the link count, because only nullfs and similar things would need to be modified to provide anything other than the default value for this field.
Ah, hardlinks… A decade and a half ago I worked for a company that ran multiple photo-related sites where people could upload photos and manage albums. On one of them photos were stored on a giant file system. New joiners would get a copy of a demo account’s sample album of photos they could play with. (To facilitate demoing its red-eye removal feature, etc.) For reasons lost in the annals of time this copying was done using hardlinks, and every so often new signups would fail when we hit the file system’s limit of 65,000 or so hardlinks to a single file. The solution? Log into the demo account and re-upload the sample album… 😂😭
I remember that rsync in particular did not like this file system—nor did du and a host of other tools. The hardlink mechanism was an utter pain to deal with, particularly around backups. We had big and expensive hardware (for the time) to keep all files in a single file system, because splitting it across multiple disks would break the hardlinks and use much more space.
For a separate photo-related project at the same company my team developed an image store that used a database (instead of hardlinks) to facilitate cheap copies, and would store a configurable number of redundant copies over separate file systems. Sadly we were all made redundant before we could use it to fix the original problem site.