1. 36

  2. 12

    I worked for a while in a comp sci lab that specialized in solving real-world problems with statistical analysis and (generally non-neural-network) machine learning. The PI was quite good at making contacts with interesting people who had interesting data sets. People would come to him with “big data” problems and he would say “this fits on a single hard disk, this isn’t big data. We can work with your data far more easily than these other people with Hadoop pipelines and whatever, you should work with us.”

    We had a compute server with 1 TB of memory, in 2016, and it was not particularly new. Turns out that if you’re counting things that exist in the real world, there’s not that many actual things that require a terabyte of RAM to keep track of. It probably cost low six figures to buy, or rather, about the same as one full-time engineer for a year. (Or two grad students, including tuition.)

    I didn’t do particularly well at that job, but I did learn that 90% of the work of any data-oriented project is getting your data cleaned up and in the right shape/format to be ingested by the code that does the actual analysis. xsv, numpy and similar tools can make the difference between spending a day on it and spending a week on it. That was far more fun for me than the actual analysis was.
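
    That cleanup step is where small tools shine. As a rough sketch (the file contents, column names, and values here are all invented, and plain stdlib Python stands in for xsv/numpy), normalising a messy CSV is only a few lines:

    ```python
    import csv
    import io
    import statistics

    # Hypothetical messy export: stray whitespace, blank fields, bad rows.
    raw = """name, reading
    alpha , 12.5
    beta,
    gamma , 7.25
    ,9.0
    delta, oops
    epsilon , 3.0
    """

    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        if len(rec) != 2:
            continue  # wrong number of columns: drop the row
        name, val = (field.strip() for field in rec)
        if not name or name == "name":
            continue  # blank name or a repeated header row
        try:
            rows.append((name, float(val)))
        except ValueError:
            continue  # non-numeric reading: drop the row

    print(len(rows), statistics.mean(v for _, v in rows))
    ```

    The actual analysis is the one line at the end; everything before it is the shape/format work.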

    1. 11

      By contrast, at a previous gig I watched an entire data science group basically fuck around with Amazon and Google Bigtable and data lakes and lambdas and Kafka to handle ingests of a whopping…three events a day? maybe?

      Their primary deliverable every month seemed to be five-figure AWS bills, occasionally punctuated with presentations of very impressive diagrams of pipelines that didn’t do anything. The junior dev I was working with, by contrast, was whipping together SQL reports and making our product managers happy, a practice we had to hide from the org because it was Through The Wrong Channels.

      And because reasons, it was politically infeasible to call bullshit on it.

      1. 1

        Sounds like your last gig was wasting their time and money really.

        1. 1

          I hope you didn’t spend too long there. It sounds hellish

        2. 4

          Really understanding what problem you’re actually trying to solve is often overlooked in the desire to jump on the latest buzzwordy technologies.

          Every time Bitcoin power consumption comes up, I go and look at the transactions per second that the entire Bitcoin network has done over the last week. I’ve never seen it average more than 7/second. If you want to use a cryptocurrency as a currency (rather than the exciting smart-contract features that Ethereum allows, which may require a bit of compute on each transaction), each transaction is simply atomically subtracting a number from one entry in a key-value store and adding it to another.

          A Raspberry Pi, with a 7 W PSU, could quite happily handle a few orders of magnitude more transactions per second with Postgres or whatever. Three of them in different locations could manage this with a high degree of reliability. You could probably implement a centralised system that outperformed Bitcoin with a power budget of under 50 W. Bitcoin is currently consuming around 6 GW. That’s roughly 8 orders of magnitude more power consumption in exchange for the decentralisation.
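
          That “atomically move a number from one entry to another” core really is tiny. A minimal sketch (sqlite3 standing in for Postgres; the schema and account names are made up):

          ```python
          import sqlite3

          db = sqlite3.connect(":memory:")
          db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
          db.executemany("INSERT INTO balances VALUES (?, ?)",
                         [("alice", 100), ("bob", 50)])

          def transfer(db, src, dst, amount):
              # One transaction: the debit and credit commit together or not at all.
              with db:
                  cur = db.execute(
                      "UPDATE balances SET amount = amount - ? "
                      "WHERE account = ? AND amount >= ?",
                      (amount, src, amount))
                  if cur.rowcount != 1:
                      raise ValueError("unknown account or insufficient funds")
                  db.execute(
                      "UPDATE balances SET amount = amount + ? WHERE account = ?",
                      (amount, dst))

          transfer(db, "alice", "bob", 30)
          print(dict(db.execute("SELECT account, amount FROM balances")))
          # → {'alice': 70, 'bob': 80}
          ```

          Everything beyond this (consensus, proof of work, the 6 GW) is the price of not trusting a single party to run it.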

          1. 2

            Really understanding what problem you’re actually trying to solve is often overlooked in the desire to jump on the latest buzzwordy technologies.

            Perversely, the problem that’s often being solved is “keeping the engineers from getting bored”, “padding your resume to make it easier to jump”, or “making your company sound more important to attract VC dollars”.

            1. 1

              Now that I think about it, “If we use technology $x we’ve got a better chance of nabbing VC money” can often be a sound business decision.

            2. 2

              Every time Bitcoin power consumption comes up, I go and look at the transactions per second that the entire Bitcoin network has done over the last week. I’ve never seen it average more than 7/second. If you want to use a cryptocurrency as a currency (rather than the exciting smart-contract features that Ethereum allows, which may require a bit of compute on each transaction), each transaction is simply atomically subtracting a number from one entry in a key-value store and adding it to another.

              This is part of the design. Every 2016 blocks (roughly every two weeks) the protocol adjusts the mining difficulty to keep the average block rate at roughly one per 10 minutes.
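
              And the ~7/second ceiling falls straight out of those parameters. Back-of-envelope (the average transaction size is an assumption; the 1 MB limit is the pre-SegWit block size cap):

              ```python
              # Why Bitcoin tops out around 7 transactions per second.
              block_size_bytes = 1_000_000   # 1 MB block size limit (pre-SegWit)
              avg_tx_bytes = 250             # assumed average transaction size
              block_interval_s = 600         # difficulty targets one block per 10 min

              tx_per_block = block_size_bytes // avg_tx_bytes
              tx_per_second = tx_per_block / block_interval_s
              print(tx_per_block, tx_per_second)   # 4000 per block, ≈ 6.7/s
              ```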

            3. 3

              It probably cost low six figures

              IIRC, my infrastructure team told me the cost to replace our 1U 40-core IBM server with 1TB RAM was going to be around $50k

              1. 3

                Just checked. 48-core AMD EPYC CPU, 1TB RAM, no disks past boot, 1U: just over $18k. Call it $20k with a 40G Ethernet NIC and a couple TB of NVMe.

                1. 1

                  That’s a pretty affordable kit! The 1TB of RAM (8x128GB) alone from SuperMicro or NewEgg would cost $10k-$15k.

                  1. 2

                    That’s listed price from a SuperMicro VAR, checked this morning.

                    1. 1

                      I was talking with my team again today and this came up, that $50k price tag was actually for a pair of servers.

              2. 1

                Turns out that if you’re counting things that exist in the real world, there’s not that many actual things that require a terabyte of RAM to keep track of.

                I’ve been saying this for years. I haven’t used numpy too much—I’m not a real data analyst—but I’ve gotten the job done with SQLite, or with split and xargs -P piped into mawk.

                1. 5

                  Definitely laughing. They don’t pull any punches:

                  Understanding what these systems did right and how to improve them is more important than re-hashing existing ideas in new domains compared against only the poorest of prior work.

                  1. 3

                    I remember when this paper first came out and I was definitely laughing then! I was just wrapping up my MSc thesis and several members of my research group (distributed systems, natch) were not super impressed when people started asking them questions about how their research would fit in with this framework :D

                  2. 1

                    What is the significance of the image you linked?

                    1. 4

                      It’s a reference to the movie Joker (2019). Robert De Niro (pictured) is speaking to The Joker. The original line is “Two policemen are in critical condition and you’re laughing, you’re laughing!”