I think “keying” the number and its complement together is nifty. As the author pointed out, using max produces a branch, but that could be avoided by vectorizing the loop.
Thanks for posting; the author’s blog is filled with interesting stuff.
It’s possibly worth noting that you can still be O(n) with two traversals through the array. This means that your hash table just has to tell you, for any number, whether it exists or not and so is a set, rather than a map. You do one scan through the array where you see if the target minus the current value is in the set. If it is, stop and record the current number and index. If not, add the current number to the set. If you reach the end of the array, the problem is not solvable. If you exit from the first loop, do another linear scan to find the number that you now know that you’re looking for. In the worst case, this requires one complete scan of the array with a set insertion on each element, followed by a linear scan of all except one.
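The two-pass approach described above can be sketched like this (a minimal Python sketch; the function name and the decision to return a pair of indices are my own assumptions):

```python
def two_sum_indices(nums, target):
    """Two-pass O(n) two-sum: the first pass needs only a set (not a map),
    the second pass is a plain linear scan to recover the other index."""
    seen = set()
    found = None
    for i, x in enumerate(nums):
        if target - x in seen:
            found = (i, x)      # record current index and value, then stop
            break
        seen.add(x)
    if found is None:
        return None             # reached the end: the problem is not solvable
    i, x = found
    # Second scan: the complement was inserted before index i,
    # so scanning nums[:i] is enough.
    complement = target - x
    for j in range(i):
        if nums[j] == complement:
            return (j, i)


# Example: 2 + 7 == 9, at indices 0 and 1.
print(two_sum_indices([2, 7, 11, 15], 9))
```

The worst case is exactly as described: one full scan with a set insertion per element, then a linear scan of all but one element.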
With that algorithm, I’d be inclined to use a radix tree rather than a hash table. The depth of the tree is still constant and so each lookup or insertion is constant time. This also gives you an efficient way of storing ranges. You can collapse subtrees where all integers in the range have been found, which saves some space if the numbers cluster. You need one bit at the leaf nodes to store whether a value is present. This should scale nicely to 64-bit integers.
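A minimal sketch of such a constant-depth radix tree as a set, here specialized to 32-bit unsigned keys with three 256-way internal levels and a 256-bit leaf bitmap (the class name and level split are my own choices; collapsing fully populated subtrees, as described above, is omitted for brevity):

```python
class RadixSet:
    """Radix-tree set for 32-bit unsigned keys.

    Every insert or lookup walks exactly three internal 256-way nodes and
    then tests one bit in a 256-bit leaf bitmap, so both operations are
    constant time. Only paths that are actually used allocate nodes.
    """

    def __init__(self):
        self.root = {}

    def add(self, key):
        node = self.root
        for shift in (24, 16):              # two 256-way internal levels
            node = node.setdefault((key >> shift) & 0xFF, {})
        leaf = (key >> 8) & 0xFF            # third level: bitmap per leaf
        node[leaf] = node.get(leaf, 0) | (1 << (key & 0xFF))

    def __contains__(self, key):
        node = self.root
        for shift in (24, 16):
            node = node.get((key >> shift) & 0xFF)
            if node is None:
                return False
        return bool(node.get((key >> 8) & 0xFF, 0) >> (key & 0xFF) & 1)


s = RadixSet()
s.add(42)
s.add(2**32 - 1)
print(42 in s, 43 in s)   # membership tests
```

Extending this to 64-bit keys just means more fixed levels, so lookups stay constant time.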
If you are using unsigned integers, you can also avoid reserving any space for storing numbers that are larger than the target, which is likely to significantly reduce the space requirements for arbitrary uint64_t arrays.
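For unsigned integers the pruning is a one-line guard in the first pass: a value larger than the target would need a negative complement, so it never needs to be stored (a hedged sketch; the function name is my own):

```python
def two_sum_unsigned(nums, target):
    """First pass of the set-based scan for unsigned integers.

    Values larger than the target can never be half of a valid pair,
    so they are checked but never inserted, bounding the set's contents.
    Returns the pair of values, or None.
    """
    seen = set()
    for x in nums:
        if target - x in seen:
            return (target - x, x)
        if x <= target:         # prune: x > target cannot be a summand
            seen.add(x)
    return None


print(two_sum_unsigned([10**18, 3, 10**18, 4], 7))
```

With arbitrary uint64_t input this keeps the set (or the radix tree above) from ever holding keys beyond the target.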
That’s an interesting insight: the “indexing” use case is essentially the front half of an index map (Hettinger’s naturally ordered hashmap).