1. 35

  2. 24

    Data tech is a massive and intertwined ecosystem with a lot of money riding on it. It’s not just about compute or APIs; those are a fairly small part.

    • What file formats does it support?
    • Does it run against S3/Azure/etc.?
    • How do I onboard my existing data lake?
    • How does it handle real-time vs batch?
    • Does it have some form of transactions?
    • Do I have to operate it myself or is there a Databricks-like option?
    • How do I integrate with data visualization systems like Tableau? (SQL via ODBC is the normal answer to this, which is why it’s so critical)
    • What statistical tools are at my disposal? (Give me an R or Python interface)
    • Can I do image processing? Video? Audio? Tensors?
    • What about machine learning? Does the compute system aid me in distributed model training?

    I could keep going. Giving it a JavaScript interface isn’t even leaning into the right community. It’s a neat idea, for sure, but there are mountains of other things a data tech needs to provide just to be even remotely viable.

    1. 6

      Yeah this is kinda what I was going to write… I worked with “big data” from ~2009 to 2016. The storage systems, storage formats, computation frameworks, and the cluster manager / cloud itself are all tightly coupled.

      You can’t buy into a new computation technology without it affecting a whole lot of things elsewhere in the stack.

      It is probably important to mention my experience was at Google, which is a somewhat unique environment, but I think the “lock in” / ecosystem / framework problems are similar elsewhere. Also, I would bet that even at medium or small companies, an individual engineer can’t just “start using” something like differential dataflow. It’s a decision that would seem to involve an entire team.

      Ironically that is part of the reason I am working on https://www.oilshell.org/ – often the least common denominator between incompatible job schedulers or data formats is a shell script!

      Similarly, I suspect Rust would be a barrier in some places. Google uses C++ and the JVM for big data, and it seems like most companies use the JVM ecosystem (Spark and Hadoop).

      Data tech also can’t be done without operators / SREs, and they (rightly) tend to be more conservative about new tech than engineers. It’s not like downloading something and trying it out on your laptop.

      Another problem is probably a lack of understanding of how inefficient big data systems can be. I frequently refer to McSherry’s COST paper, but I don’t think most people/organizations care… Somehow they don’t get the difference between 4 hours and 4 minutes, or 100 machines and 10 machines. If people are imagining that real data systems are “optimized” in any sense, they’re in for a rude awakening :)

      1. 3

        I believe andy is referring to this paper, if anyone else is curious.

        (And if you weren’t, let me know and I’ll read that one instead. :] )

        1. 3

          Yup that’s it. The key phrases are “parallelizing your overhead”, and the quote “You can have a second computer once you’ve shown you know how to use the first one.” :)

          https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

          The details of the paper are about graph processing frameworks, which most people probably won’t relate to. But it applies to big data in general. It’s similar to experiences like this:

          https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

          I’ve had similar experiences… 32 or 64 cores is a lot, and one good way to use them all is with a shell script. You run into fewer “parallelizing your overhead” problems. The usual suspects are (1) copying code to many machines (containers or huge statically linked binaries), (2) scheduler delay, and (3) getting data to many machines. You can do A LOT of work on one machine in the time it takes a typical cluster to say “hello” on 1000 machines…
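
          A minimal sketch of that pattern (the file names and the 404 filter are made up for illustration): split the logs by file, fan out one worker per core with `xargs -P`, and merge the partial counts with `awk` — no cluster, no scheduler delay:

          ```shell
          # Make up a couple of tiny log files so the pipeline is self-contained.
          mkdir -p /tmp/logs
          printf 'GET /a 200\nGET /b 404\n' > /tmp/logs/day1.log
          printf 'GET /c 404\nGET /d 200\n' > /tmp/logs/day2.log

          # Fan out: one grep per file, up to one worker per core (-P),
          # then merge the per-file 404 counts with a single awk.
          ls /tmp/logs/*.log \
            | xargs -P "$(nproc)" -I{} sh -c 'grep -c " 404$" {}' \
            | awk '{sum += $1} END {print sum}'
          ```

          On the sample data above this prints 2. The point isn’t the specific pipeline — it’s that the fan-out/merge step is the whole “cluster”, so there’s almost no overhead to parallelize.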

        2. 1

          That’s a compelling explanation. If differential dataflow is an improvement on only one component, perhaps that means that we’ll see those ideas in production once the next generation of big systems replaces the old?

          1. 2

            I think if the ideas are good, we’ll see them in production at some point or another… But sometimes it takes a few decades, like algebraic data types or garbage collection… I do think this kind of big data framework (a computation model) is a little bit more like a programming language than it is a “product” like AWS S3 or Lambda.

            That is, it’s hard to sell programming languages, and it’s hard to teach people how to use them!

            I feel like the post is missing a bunch of information: like what kinds of companies or people would you expect to use differential dataflow but are not? I am interested in new computation models, and I’ve heard of it, but I filed it in the category of “things I don’t need because I don’t work on big data anymore” or “things I can’t use unless the company I work for uses it” …

        3. 2

          The above is a great response, so to elaborate on one bit:

          > What statistical tools are at my disposal? (Give me an R or Python interface)

          It’s important for engineers to be aware of how many non-engineers produce important constituent parts of the data ecosystem. When a new paper comes out with code, that code is likely to be in Python or R (and occasionally Julia, or so I’m hearing).

          One of the challenges behind using other great data science languages (e.g. Scala) is that there may be an ongoing and semi-permanent translation overhead for those things.

          1. 1

            all of the above + does it support tight security and data governance?

          2. 9

            I’m particularly excited to see Materialize take off. It was co-founded by Frank McSherry (differential privacy, Naiad, Timely Dataflow); their current system lets you create materialized views using your existing tools (it speaks the postgres protocol), and they are doing well with investors. They have the brains and the means, and I believe these are the folks who may, in fact, make differential dataflow more popular. It’s still early days.

            Remember, Kafka began its life at LinkedIn as a system for facilitating cross-datacenter replication of databases. Jay Kreps spent a lot of time trying to convince people to think of The Log as “not all that different from a file or a table.” Early in Kafka’s growth curve, others emphasized how you could use logs in ways similar to existing tools, as effective workarounds when scalability causes those tools to break down on their own. Martin Kleppmann gave this view a nod that many data engineers at the time paid attention to in his 2014 talk introducing Samza, where he makes the point that you can think of log-based systems the same way you thought of materialized views in traditional databases.

            Materialize is not beginning its life as a generic log for shipping database replication streams across data centers over the hostile open internet, and Frank has shown he has an eye for removing incidental complexity and achieving significantly more efficiency in the process. I won’t be surprised if his team is the one that succeeds in popularizing a system that incidentally brings differential dataflow to prominence — not because it’s a cool algorithm, but because it can actually solve some hard problems at scale in useful ways.

            1. 1

              It sounds interesting, but what are people using it for?

              1. 8

                All of the differential dataflow papers really suffer from the academic style of presentation, which forces the author to focus on absolute novelty rather than being allowed to talk about quality engineering. I think https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-05.md motivates the problem much better.

                1. 1

                  FWIW I looked over some of the recent blog posts, and they were talking about the SQL interface.

                  That is a lot more interesting to me! It looks like I might be able to stream my web logs into a tabular format and then run ANSI SQL over them that is continuously updated? That would be pretty interesting.

                  The post made it sound like the main way to use it was via a Rust library. I’m not really interested in writing Rust for analytics. Go replaced Sawzall several years ago and I wasn’t really a fan of that either. Analytics code can be very tedious, so you want a high-level language.

                  Although I do have some “log spam” use cases and I wonder if SQL can handle that …

                  1. 2

                    > It looks like I might be able to stream my web logs into a tabular format and then run ANSI SQL over them that is continuously updated?

                    Yep, that’s exactly how it works. The SQL interface is a BSL-licensed product being built on top of DD by a company founded by Frank McSherry and Arjun Narayan about two years ago — https://materialize.com/. I think it has a bright future, but I’m also curious about other niches where DD itself seems like it should be useful but currently has no uptake.

                    I updated the post with the feedback I got, FWIW. Notably, people who just wanted to crunch data were more interested in the kind of all-in-one product that Materialize is offering, but a smaller group wanted to use DD as a backend for more specialized tools and bounced off the API/docs.
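
                    For what it’s worth, the “stream web logs in, keep a SQL aggregate continuously updated” workflow described above would look roughly like this. This is generic ANSI-flavored SQL, not Materialize’s exact dialect, and the table/view names (`requests`, `errors_per_path`) are made up for illustration:

                    ```sql
                    -- Hypothetical schema: one row per parsed log line.
                    CREATE TABLE requests (ts TIMESTAMP, path TEXT, status INT);

                    -- A view the system keeps incrementally up to date as new
                    -- rows arrive, instead of recomputing the aggregate from
                    -- scratch on every query.
                    CREATE MATERIALIZED VIEW errors_per_path AS
                        SELECT path, count(*) AS errors
                        FROM requests
                        WHERE status >= 500
                        GROUP BY path;
                    ```

                    The differential dataflow part is invisible here, which is arguably the point: you write an ordinary view, and the engine does the incremental maintenance.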

                2. 2

                  This sounds like incremental computing, also called self-adjusting computation, which is a bit more general than the incremental computation over streams presented here. It also has applications to UI programming.

                3. 1

                  There is a popular use case in the world of databases for this type of approach (persisting data deltas).

                  It’s in the area of streamlining testing/development/debugging environments for large-volume systems, where (a) you do not want to replicate the whole environment and (b) some data must be masked out for compliance reasons.

                  The software that virtualizes Oracle databases is called Delphix (if I am not mistaken). There are probably other solutions.

                  These systems present developers with working database connections that allow read/insert/update/etc. just like a real database, but underneath they store deltas rather than updating the full database.

                  I can see how something like that would be very useful in the world of other data/document storage solutions and data flows.