I’m not an algorithms guy, nor do I claim to be, but this paper (or whatever you want to call it) makes some interesting claims.

Sorting in O(n) time sequentially, O(log(n)) in parallel. That’s fast!

“O(m) space where m is the size of: { min, min + 1, min + 2, … , max - 1, max }”

Directly quoted from the article - I’m not entirely sure what it means for the size to be a set of numbers

“…both the parallel and sequential versions sort the array A with a collision probability of 0.00001%.”

This may seem small, but it’s 1/100,000 per membership test. That’s pretty significant if you’re sorting, say, a million items. Correct me if I’m wrong, but I think that means you’d expect around ten collisions in a million-item sort; a clean, collision-free run is actually very unlikely.
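A quick back-of-the-envelope check of that (my own arithmetic, assuming each of ~1,000,000 lookups fails independently at the quoted 0.00001% rate):

```python
# Per-lookup false positive rate quoted in the paper, and roughly
# one lookup per item for a million-item sort (both assumptions).
p_fp = 1e-5
lookups = 1_000_000

# Expected number of collisions across the whole run.
expected_collisions = p_fp * lookups

# Probability that *no* lookup collides (roughly e^-10).
p_clean_run = (1 - p_fp) ** lookups

print(expected_collisions)  # about 10
print(p_clean_run)          # about 0.0000454
```

So the odds aren’t 10:1 in favor of a clean run; they’re more like 22,000:1 against one.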

Also, doing a ctrl-F for “stable” doesn’t turn up anything, so there’s no indication of whether it’s stable or not. I’d lean toward the assumption that it’s unstable, because it’s being done in parallel? But again, I’m not an algorithms guy.

Well… it depends on what you mean by stable. All the numbers in the result will be, by definition, in order… but you may get new numbers that weren’t in the original array. The rate of those new numbers is 1/100000.

From what I can tell, this is kinda how the algorithm works:

```python
lookup = bloomfilter(numbers)
for i in range(min(numbers), max(numbers) + 1):  # + 1 so max itself is included
    if i in lookup:
        yield i
```

It’s important to note that because we’re using bloom filters, this scheme does not handle arrays with duplicate entries. That’s why they mention “unique elements” in the first statement.

It’s the `i in lookup` check that has the 0.00001% error rate. The `O(m)` comes from the fact that we iterate over `range(min(numbers), max(numbers))`… the author has a really strange way of stating that with that set notation.
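To make that concrete, here’s a minimal runnable sketch of the scheme (my own illustration, not code from the paper; the `BloomFilter` below is a toy with arbitrary parameters, not tuned to the paper’s 1e-5 rate):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter; sizes are illustrative, not the paper's."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # one big int as the bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from salted sha256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

def bloom_sort(numbers):
    lookup = BloomFilter()
    for n in numbers:
        lookup.add(n)
    # Walk the whole [min, max] range: this is the O(m) part.
    for i in range(min(numbers), max(numbers) + 1):
        if i in lookup:  # may falsely yield an i that was never inserted
            yield i

print(list(bloom_sort([7, 3, 11, 5])))  # almost certainly [3, 5, 7, 11]
```

Note the output comes out sorted “for free” because we scan the range in order; the only failure mode is the occasional extra number from a false positive.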

One last thing to note is that the O(log(n)) bound is theoretical, assuming an infinite number of available cores.

As this is a probabilistic sorting algorithm, there is a small chance that the given array is not correctly sorted after executing it. Therefore this algorithm shouldn’t be used for critical operations.

This is particularly useful for approximate percentiles. If I have a dataset with 239487283572934 numbers, it’d be hard to store in memory and constantly keep updated and sorted. With this algorithm, all you need to do is maintain: the bloom filter (which is small), the min/max numbers and the total number of records that have been added. This scheme would have O(1) insertion and O(M) query where M is the size of the interval the data is on (ie: how spread apart the min and max are).
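Here’s a rough sketch of that percentile idea (my own illustration; I’ve swapped in a plain `set` where the real thing would use the bloom filter, just to keep it short):

```python
class ApproxPercentiles:
    """Maintain only a membership structure, min/max, and a count."""
    def __init__(self):
        self.seen = set()  # stand-in for a bloom filter, for clarity
        self.lo = None
        self.hi = None
        self.count = 0

    def add(self, x):  # O(1) insertion
        self.seen.add(x)
        self.lo = x if self.lo is None else min(self.lo, x)
        self.hi = x if self.hi is None else max(self.hi, x)
        self.count += 1

    def percentile(self, p):  # O(M) query, M = hi - lo + 1
        target = p * self.count
        running = 0
        # Scan the [lo, hi] range in order, counting members seen so far.
        for i in range(self.lo, self.hi + 1):
            if i in self.seen:
                running += 1
                if running >= target:
                    return i
        return self.hi

ap = ApproxPercentiles()
for x in [2, 4, 6, 8, 10]:
    ap.add(x)
print(ap.percentile(0.5))  # 6, the median
```

With a real bloom filter in place of the `set`, false positives would occasionally shift a percentile by a slot or two, which is exactly the “approximate” trade-off being described.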

The thing that makes this kinda useless though is that it only supports sorting unique items.

Thanks for this. I skimmed through the whole thing several times without figuring out what the actual algorithm was. I’m going to guess it isn’t stable :P

Hah, I definitely missed that. Out of curiosity, do you (or anyone else) know some possible applications of a “mostly sorting” algorithm?

In data science for instance, you rarely need any operation to be perfect, since everything you calculate is statistical anyway.

Perhaps as a first pass, ahead of a proper sorting algorithm?