A system administrator walks into the bar. She looks to her left and considers what it will take to back up 33GB files atomically and reliably, especially as the users inevitably ask for longer-term data. She looks to her right and considers all the special snowflake app servers and their distributed partial failures.
She orders a stiff drink.
The sysadmin in question, midway through her bourbon, texts the security buddy she met at Defcon for the monthly list she knows his company gathers. As the second bourbon arrives, her phone dings with a hyperlink and an API key.
By the end of the second bourbon, she’s typed a little shell script to curl the link, untar it, awk the results, and copy everything into the devs’ results folder. She then sets this up as a cron job, and returns her battered BlackBerry to her purse.
It will be six months later, during her exit interview, that the devs in question realize a fix has been in place all along, running quietly and without issue.
When the old-timer systems dev decided to use Go, but with a packed binary representation for small data, that’s about where the plausibility problems began to set in.
So what I took away from this article was: people underestimate the tricks employed by the systems they use. Or perhaps the point is that the “modern” developer just doesn’t need to care anymore?
For example, the raw JSON for each host might be 400 bytes, but the data that is actually “indexed” may well be close to the “old timer” solution.
Under the covers Elasticsearch and Lucene use some of the exact tricks that are mentioned in the article. Terms are tracked in inverted indices and referenced by ordinal, so low-cardinality fields (["up", "down"]) are highly compressed. Postings are sorted and compressed using frame-of-reference encoding. Search is done heuristically by leap-frogging the sparsest iterator. Doc values use offset/delta/table encoding. Filters are encoded via Roaring Bitmaps and evaluated with standard bitwise logic. Etc etc.
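To make two of those tricks concrete, here is a rough Python sketch of ordinal encoding for a low-cardinality field and of delta encoding a sorted postings list in the spirit of frame-of-reference. The function names are made up for illustration, and this is not Lucene's actual on-disk format (which works on fixed-size blocks and bit-packs the deltas); it only shows why the numbers get small.

```python
def encode_ordinals(values):
    """Replace each term with a small integer index into a sorted term list."""
    terms = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(terms)}
    return terms, [ordinal_of[v] for v in values]


def for_encode(sorted_ids):
    """Delta-encode sorted doc IDs and report the bits needed per delta.

    Hypothetical helper: a real implementation would bit-pack the deltas
    using `bits` bits each instead of keeping a Python list.
    """
    if not sorted_ids:
        raise ValueError("need at least one doc ID")
    base = sorted_ids[0]
    deltas = [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    bits = max(deltas, default=0).bit_length()
    return base, bits, deltas


def for_decode(base, deltas):
    """Rebuild the original doc IDs from the base and the deltas."""
    ids = [base]
    for d in deltas:
        ids.append(ids[-1] + d)
    return ids


if __name__ == "__main__":
    terms, ordinals = encode_ordinals(["up", "down", "up", "up"])
    print(terms, ordinals)

    base, bits, deltas = for_encode([1000, 1003, 1004, 1010])
    print(base, bits, deltas)
    print(for_decode(base, deltas))
```

With only two distinct terms, each value needs a single bit, and the postings above need 3 bits per delta rather than a full 32-bit integer per doc ID.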
I’m sure this applies to all “modern” systems including relational DBs, other NoSQL, etc. These systems use “old timer” methods so that you don’t have to.
And if/when your data ever grows past a single host’s memory, I imagine the “old timer” methods start to look suspiciously like reinventing the flavor-of-the-month distributed system :)
Heh… a side point that the article didn’t touch on: I once heard an argument that there’s no need to care about floating-point imprecision, because everyone has already done a lot of work to make sure it works out. Specifically, this was an attempt to justify storing currencies as floating-point.
I mean, I suppose it wouldn’t have gone that badly. I eventually determined that the programmer in question believed the IEEE formats were decimal.
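For anyone who hasn't been bitten yet, a minimal Python illustration of why binary floats and currency don't mix. The helper names are hypothetical; the point is only that 0.10 and 0.20 have no exact binary representation, while a decimal type (or integer cents) keeps them exact.

```python
from decimal import Decimal


def sum_as_float(amounts):
    """Add amounts after converting them to binary floating point."""
    return sum(float(a) for a in amounts)


def sum_as_decimal(amounts):
    """Add amounts using base-10 arithmetic, which is exact for these inputs."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


if __name__ == "__main__":
    amounts = ["0.10", "0.20"]
    print(sum_as_float(amounts) == 0.3)                 # False
    print(sum_as_decimal(amounts) == Decimal("0.30"))   # True
```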
(Never have an argument like this with a coworker unless you really want their ire. Live and learn.)
Oh dear, that could have ended very poorly indeed. :) You can escape from knowing how data structures work, but you can never escape from floating points and numerical instability!
I see the bottleneck as running nmap against all 4B IPs. The rest looks simple for a PostgreSQL server with an SSD and 32GB of RAM. Encode the IP in the int32 ID. Scanning a /24 that has a firewall took more than a minute when I just tried it. There are about 16.7 million /24s, so at that rate a single host would need decades to scan the whole internet…
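A quick back-of-the-envelope sketch of both points, in Python. The function names are made up, the one-minute-per-/24 figure is just the anecdote above taken at face value, and the offset trick assumes the ID column is PostgreSQL's signed 32-bit integer, which cannot hold the upper half of the IPv4 range directly.

```python
import ipaddress


def ip_to_int(addr):
    """Dotted-quad IPv4 address to an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))


def int_to_ip(n):
    """Unsigned 32-bit integer back to a dotted-quad string."""
    return str(ipaddress.IPv4Address(n))


def to_signed32(n):
    """Shift an unsigned value into the signed 32-bit range (assumed ID column type)."""
    return n - 2**31


def from_signed32(n):
    """Undo to_signed32."""
    return n + 2**31


def years_to_scan(minutes_per_slash24=1.0, hosts=1):
    """Estimated years to sweep every /24, assuming a flat per-block scan time."""
    slash24s = 2**32 // 256
    minutes = slash24s * minutes_per_slash24 / hosts
    return minutes / (60 * 24 * 365)


if __name__ == "__main__":
    n = ip_to_int("192.168.0.1")
    print(n, to_signed32(n), int_to_ip(from_signed32(to_signed32(n))))
    print(round(years_to_scan(), 1))
    print(round(years_to_scan(hosts=1000) * 365, 1), "days with 1000 scanners")
```

Under those assumptions a lone scanner comes out at roughly three decades, which is why spreading the work over many hosts (or using a much faster scanner) matters more than the storage side.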
Very surprised neither suggested speeding the whole process up with a bunch of AWS instances.