I spent the last two weeks reading about incremental computing, and how to apply that to incremental data pipelines and somehow I totally missed Differential Dataflows and DDLog. The graph example is spot-on, and this article breaks down every bit in a very clear way. Looking forward to the next one!
I’ve been reading about incremental computing over the past week or two as well! One thing I cannot quite wrap my head around is how the “incremental” part is actually done. The “track which part changed and recompute it” idea is pretty straightforward, but I cannot find good material that explains the “incremental” part to me.
For example, this article talks about scores.average(). In “incremental” fashion, if a new student score is added, the system can somehow derive that it only needs to compute previous_average_scores * previous_count / new_count + new_added_score / new_count, or (previous_total_scores + new_added_score) / new_count. How does the system decide which intermediate values to keep track of and how to do the computation?
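For what it’s worth, systems typically sidestep the choice between those two rewritings: the aggregate operator itself declares which auxiliary state it needs. An average() operator keeps a running sum and count, and derives the average from those on every delta. A toy Python sketch of that idea (names are mine, not from any particular system):

```python
# Hypothetical sketch of an incremental average(): the operator keeps
# a running sum and count as its tracked intermediate state, so each
# new score is an O(1) update with no rescan of previous scores.
class IncrementalAverage:
    def __init__(self):
        self.total = 0.0   # tracked intermediate value
        self.count = 0     # tracked intermediate value

    def add(self, score):
        self.total += score
        self.count += 1
        return self.total / self.count

avg = IncrementalAverage()
avg.add(80)
avg.add(90)
print(avg.add(100))  # 90.0
```

This is equivalent to your (previous_total_scores + new_added_score) / new_count formulation; the system doesn’t “derive” that rewriting, the operator is simply written to maintain the sum.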
In more complicated cases, for example scores.top(10).average(), how can an incremental computing system derive that the minimal work is previous_average_scores - lowest_score_so_far / 10 + new_added_score / 10, and that it should keep track of lowest_score_so_far?
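One common answer (a sketch of the general technique, not how any particular system does it): the top(10) operator itself maintains the current top 10 as a min-heap, so the lowest_score_so_far you mention is just the heap root, and the running sum over the heap’s contents gives the average:

```python
import heapq

# Hypothetical sketch: scores.top(k).average() maintained incrementally.
# Handles insertions only; supporting deletions would need more state
# (e.g. keeping scores just below the top k).
class TopKAverage:
    def __init__(self, k=10):
        self.k = k
        self.heap = []      # min-heap holding the current top-k scores
        self.total = 0.0    # running sum of the heap's contents

    def add(self, score):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, score)
            self.total += score
        elif score > self.heap[0]:
            # new score displaces the lowest of the current top-k:
            # exactly the "- lowest + new" update in your formula
            lowest = heapq.heapreplace(self.heap, score)
            self.total += score - lowest
        return self.total / len(self.heap)
```

So the system doesn’t derive the formula symbolically; whoever implements the top-k operator decides what state (the heap) makes the delta update cheap.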
Somehow I believe these are solved problems in the incremental computing paradigm, but the material on the net is sparse on the details.
I think there’s a variety of degrees to which systems are incremental. Some, like Adapton, are based on having an explicit dataflow and reusing previous computation results, thus avoiding redoing computations unaffected by the change – that’s essentially the same concept as build systems.
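A minimal sketch of that reuse-previous-results style, assuming nothing more than a keyed cache (this is the build-system flavor of incrementality, not Adapton’s actual machinery):

```python
# Recompute a node only when its inputs changed: results are cached
# under (node, inputs), so a repeated call with the same inputs is a
# cache hit and the computation is skipped entirely.
cache = {}

def compute(node, inputs, f):
    key = (node, tuple(inputs))
    if key not in cache:          # inputs changed (or first run)
        cache[key] = f(*inputs)   # ...so actually do the work
    return cache[key]             # otherwise reuse the previous result

compute("double", (21,), lambda x: x * 2)   # computed
compute("double", (21,), lambda x: x * 2)   # reused, not recomputed
```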
A truly incremental system, however, would consume a delta (a change) on the input (starting with ∅) and produce a delta on the output for maximum efficiency. In other words, an incremental system would ideally be Δinput → Δoutput instead of input → output, with the key operations of the pipeline all consuming and producing deltas. That seems to be the idea behind differential dataflow.
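A toy sketch of that Δinput → Δoutput shape, loosely in the style of differential dataflow (representing a delta as (record, multiplicity) pairs, +1 for insert and -1 for delete; the names are mine, not the paper’s):

```python
# A delta-consuming count operator: it never sees the full input
# collection, only the stream of changes, and it emits changes to
# its own (single-value) output.
def delta_count(state, delta):
    old = state["count"]
    state["count"] = old + sum(mult for _, mult in delta)
    # output delta: retract the old count, assert the new one
    if state["count"] == old:
        return []
    return [(old, -1), (state["count"], +1)]

state = {"count": 0}
delta_count(state, [("alice", +1), ("bob", +1)])  # → [(0, -1), (2, +1)]
delta_count(state, [("bob", -1)])                 # → [(2, -1), (1, +1)]
```

Note that even this "truly incremental" operator keeps a bit of state (the current count) – which is exactly the caveat in the next paragraph.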
But as you point out, you sometimes (actually often) need to reuse the previous state to calculate the output (whether it’s a delta or not), and in some cases you need the entire input anyway (like doing a SHA1 sum on the inputs). I haven’t seen this spectrum of being incremental articulated clearly in the literature so far, but the idea of differential is certainly that the computations should operate on deltas (differences) as opposed to complete values.
The How to recalculate a spreadsheet article has a bunch of links that you might find interesting on the topic. The Build systems à la carte paper might be an interesting read for you as well.
Thanks! Equipped with what you said, I re-read the Differential Dataflow paper, and now it is much clearer. Their difference operators work on deltas, and the so-called generic conversion from normal operators to difference operators simply does \delta output = f(input_t + \delta input) - output_t. To make this work efficiently in the LINQ setting, a few difference operators (especially for aggregations such as sum and count) are implemented manually, because those can work with the \delta alone.
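To make the contrast concrete, here is a toy sketch (my own, treating collections as plain Python lists rather than the paper’s multisets) of the generic conversion next to a hand-written difference operator for sum:

```python
def generic_delta(f, input_t, output_t, delta_input):
    # δoutput = f(input_t + δinput) - output_t : always correct,
    # but it re-runs f over the entire updated input.
    return f(input_t + delta_input) - output_t

def sum_delta(delta_input):
    # Hand-written difference operator for sum: works on the delta
    # alone, never touching the full input.
    return sum(delta_input)

input_t, output_t = [1, 2, 3], 6
delta = [10]
assert generic_delta(sum, input_t, output_t, delta) == sum_delta(delta) == 10
```

Both produce the same δoutput, but the generic version is O(|input|) per change while the manual one is O(|δ|).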
It is still very interesting, but much less generic / magic than I initially thought it could be.
The biggest thing for incrementality is tracing data interdependence. If relation B draws in data from A and C, changes to those relations trigger a chain reaction, propagating said changes. For aggregates (the .group_by() clause of the post) things are pretty coarse-grained: any change to the input relation(s) will trigger a recomputation of the aggregate, so changes to Test or Student will cause scores to be recomputed. While in a reasonably trivial example like the scores one this seems stupid and/or wasteful, for more complex rules things get exponentially more difficult, since worst-case joins are commonplace within ddlog code.
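A toy sketch of that dependence tracing, assuming nothing more than a reads-from map (this is illustrative, not actual ddlog machinery):

```python
# Each derived relation records which relations it reads; a change to
# an input marks every transitive dependent dirty for recomputation.
deps = {
    "scores": {"Test", "Student"},   # scores reads Test and Student
    "report": {"scores"},            # report reads scores
}

def dirty_set(changed):
    dirty = set()
    frontier = {changed}
    while frontier:
        # everything that reads something in the frontier is dirty too
        frontier = {rel for rel, reads in deps.items()
                    if reads & frontier} - dirty
        dirty |= frontier
    return dirty

print(dirty_set("Student"))  # {'scores', 'report'}
```

This also shows why aggregates are coarse-grained here: the map only says "scores reads Student", not which rows, so any Student change dirties all of scores.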
My compliments on your writing! I can’t judge it from a newcomer’s perspective, alas: I’ve been obsessively working through Ivan Bratko’s “Prolog Programming for Artificial Intelligence”, so I’ve had a lot of recent Prolog exposure, and I suspect my learning zone is perfectly situated to receive the post you wrote. But even if receptiveness is one thing, teaching is quite another, and I can say: (a) that PPfAI is a book that practically teaches itself, and (b) that your blog post gave me the same delicious “the author is installing new knowledge in my brain” feeling. Which I presumably owe to damn hard writing on your part. Enjoyable, and learnsome, and I look forward to the rest of it.