I’m a huge fan of both Rust and Delta Lake, but my eyebrows shot off my face when I saw “exactly once delivery”
I asked about this years ago and @aphyr and @mjb gave me memorable answers about why exactly once delivery isn’t possible. I highly recommend reading it. More recently, we’ve discovered that there are some cases where exactly once delivery is possible, but the semantics are very difficult to grok, to the point that it’s probably best to only claim “exactly once” when you’re in the presence of extremely knowledgeable people. Any “exactly once” guarantees require a strict protocol with the client, so the semantics don’t bubble up to larger systems.
For example, someone might extrapolate from this title that when I stream the Delta Lake table that I’ll receive the message exactly once. That’s not true, Delta Lake doesn’t give those guarantees, neither does Kafka. Only the connector from Kafka into Delta Lake gives the guarantees.
It’s still useful, to be sure. But be careful about the semantics.
Thanks @kellogh for the links. I fully agree with what you said and what was said in the discussion you linked. Like you said, it all comes down to semantics.The kafka-delta-ingest project is a Kafka to Delta connector. What I meant in the title is we delivery a message from Kafka to Delta Table exactly once. Notice I used the phrase “from Kafka to Delta Lake” in the title, not “From Kafka to Delta Lake to your client” ;) It certainly doesn’t make sense to talk about exactly once delivery to an opaque client in a physical sense. In real world distributed systems, messages get redelivered all the time. The consumer of a Delta Table or Kafka topic will need to have its own atomic progress tracking capability in order to process the message exactly once logically.
I don’t see any mention of delivery sematics in the linked repo. @houqp, perhaps you can expand on this? Right now, the linked repo seems like a kafka connector, but there’s not much in there from what I can see.
Yes, it’s a native kafka delta lake connector. In short, the exactly once message delivery is accomplished by batching the message and kafka offset into a single Delta Table commit so they are written to the table atomically. If a message has been written to a Delta Table, trying to write the same message again will result in transaction conflicts because kafka-delta-ingest only allows the offset to go forward.
Is Delta Lake only for big data, or could it be useful sometimes for smaller projects?
It used to be only for big data because you can only read/write it from Spark. With the new rust implementation, this is not the case anymore: https://github.com/delta-io/delta-rs. You can use it for smaller project as well.