Here are some things I really like about this article:
You make it clear what your problem domain is, why it matters, and why there’s no easy drop-in tool. It gives enough background to connect everything you say back to the big picture and put all of the pros and cons in context.
You say why the original complex solution failed, and why the bash script was both simpler and faster. It’s a great example for “simple is better than complex.”
Followed by a great example of “sometimes simple just isn’t enough”: details about why performance issues and business considerations meant the bash solution couldn’t scale. It’s neat to see how the problem domain shifted and why it broke your solution.
Pros and cons of each possible new solution, as well as the fact that you actually built each of them. A lot of comparisons are written in a vacuum, so seeing that you’re speaking from experience adds a lot of weight many articles don’t have.
Explanations of the limitations and weaknesses of your final system, plus how you would have done it differently. Again, really cements this as a balanced discussion where everything was carefully considered.
All that and a great writing style and interesting tangents that inform but don’t distract from the main flow of your piece.
This looks like one of those blogs where I’ll be reading every post even if it’s not about something I’m interested in, because the explanations will be good enough to make it interesting. Thanks for sharing!
I appreciate the way you’ve written this as an exploration of the problem space rather than a polemic (“10 reasons you’re a dumb-dumb if you don’t count with HyperLogLog”). It feels far more authentic and useful.
Having said that, I find a disturbing story here: how much time was spent standing up Cassandra, operationalizing it, validating the new process, adjusting the existing billing processes, (…) when you’re sitting on an embarrassingly parallel problem? Do you really want to trust a complex datastore when wc is the perfect solution here?
wc has to count the unique lines. Like many companies, we faced the problem where certain customers received an extreme amount of data compared to their peers. This meant that we had to count billions of items.
I’m curious how the problem is embarrassingly parallel… I was never happy with the Cassandra solution, so if there’s something obvious here I missed, I’d definitely be interested…
Cassandra is quite simple, though expensive, if you’ve already paid down the operational cost (which we had). It solves the high availability problem.
Careful re-reading indicates the problem isn’t quite embarrassingly parallel; the roll-up has to occur monthly, and I misread the gzipped log date format and assumed monthly logs. Oops, my bad :)
There are still a handful of problems present in that pipeline, and you’d still see some very strong gains if you spent a day parallelizing it:
find *.gz | xargs -I{} echo "<(gzcat {} | sort | uniq)" | xargs echo sort -m | bash | uniq | wc -l
Right now you’re sorting each log file and stripping its duplicates, merging the results together, and re-stripping duplicates (gah!). Instead, if you did
find *.gz | xargs -I{} echo "<(gzcat {} | LANG=C sort )" | xargs echo LANG=C sort -mu | bash | wc -l
you’d eliminate duplicates at merge time (-u), which cuts out the n+1 full-file uniq passes and does the deduplication during the k-way merge instead. It’s also critical to set LANG to C, or else you’ll get eaten alive by multibyte comparisons.
If those two changes are still too slow - and seriously, don’t underestimate the boost you could get there - the next likely step would be to fan out the download, gzcat and sort onto multiple servers. parallel can help do that with very little administration:
aws s3 ls | parallel --sshloginfile ssh-keys-go-here --return {.}.sort --cleanup "aws s3 cp {} . && gzcat {} | LANG=C sort > {.}.sort" && LANG=C sort -mu *.sort | wc -l
(this one is from memory, so the syntax may need a tweak!)
Even better, drop the --return option, mount the output directories from your counter nodes onto your invoking node, and, as long as your network connection is fast enough, you’ve got a very easy-to-parallelize counter.
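In case it helps, here’s a rough sketch of that mounted-output variant. The bucket name, node list, and mount points are all invented, and it’s no less “from memory” than the command above:
# hypothetical sketch: counter nodes listed in nodes.txt have the aws CLI configured
# and write into /data/sorted, which is mounted at /mnt/node*/ on the invoking machine
aws s3 ls s3://example-logs/ | awk '{print $4}' \
  | parallel --sshloginfile nodes.txt \
      "aws s3 cp s3://example-logs/{} . && gzcat {} | LANG=C sort > /data/sorted/{.}.sort"
LANG=C sort -mu /mnt/node*/*.sort | wc -l    # final merge and count on the invoking node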
Since we wanted better than linear performance (ie a billion tweets shouldn’t take a billion operations to count), we explored the indexed option.
Can someone explain what this means? I don’t quite understand how you could have sublinear time for counting items. Maybe this is referring to the uniqueness/deduplication part?
Sorry for the confusion. It’s sub-linear at query time. You still have to pay the cost of indexing, but you can do that over the whole month.
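To picture the amortization idea (not the author’s actual indexed datastore), here is a toy sketch with invented file names: fold each day’s logs into a running, sorted, de-duplicated file as they arrive, so the month-end “query” is just reading a pre-computed count:
touch month.sort                                    # once, at the start of the month
gzcat day.gz | LANG=C sort -u > day.sort            # run as each day's log lands
LANG=C sort -mu month.sort day.sort > month.tmp && mv month.tmp month.sort
wc -l < month.sort > month.count                    # keep the running total up to date
cat month.count                                     # the billing-time "query": no re-scan of raw logs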
So, to go full circle, if I were to do it again, I’d probably spend most of my time building something clever with a HyperLogLog, only to eventually cave-in and resort to something inefficient, bland and boring.
Why not HyperLogLog when it’s the most efficient solution?
It’s not 100% accurate, and in the end I don’t think that’d fly with a billing system.
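For a feel of that trade-off, Redis ships a HyperLogLog with a documented standard error of about 0.81%. A rough sketch (the key name is invented, and it assumes the item IDs contain no whitespace so xargs batching is safe):
gzcat *.gz | xargs -n 10000 redis-cli PFADD items:month > /dev/null    # fold IDs into the sketch in batches
redis-cli PFCOUNT items:month                                          # approximate distinct count, ~0.81% standard error
At a billion items, an 0.81% error is still on the order of eight million items, which is a hard number to defend on an invoice.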