What about partitioning?
I must admit that the example I’ve chosen is not the clearest, because it strongly suggests table partitioning as the solution. I’ve updated the text a bit to make it clearer that the deletion condition could be ad hoc, to keep the focus on replication.
Thank you!
I had this problem and solved it with partitioning as well. I think time-series data is prone to this problem, which is why there are time series databases and plugins for ordinary relational databases. Outside this circumstance, I just don’t seem to run into bulk data removal that often.
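The win is that removing a whole time range becomes a cheap metadata operation instead of a row-by-row DELETE. A minimal sketch of the pattern, assuming PostgreSQL 10+ declarative partitioning and psycopg2 (the `metrics` table and partition names are invented):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")

# Range-partition the time-series table by month (names are illustrative).
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            ts    timestamptz NOT NULL,
            value double precision
        ) PARTITION BY RANGE (ts);

        CREATE TABLE IF NOT EXISTS metrics_2024_01 PARTITION OF metrics
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)

# Later, retiring a whole month is a single cheap DDL statement,
# not millions of row deletions:
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS metrics_2024_01")
```

With physical streaming replication the drop ships as a tiny amount of WAL rather than millions of row-delete records, which is exactly why it sidesteps the propagation problem the article describes.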
I honestly am looking forward to part 2…
thank you! next Tuesday, Lord willing.
It might be good to specify which database engines this is talking about. Some of the fine points about tables and indexes getting locked and not allowing concurrent inserts are pretty engine-dependent, but the article implies that they’re universally true.
Replication lag is certainly a real concern for write operations that touch large numbers of rows. When I ran into a similar situation (with updates rather than deletes in my case, but same concept), I ended up solving it by chunking the work like the article describes, but rather than using fixed-duration pauses between chunks, I monitored replication lag in the application code and only proceeded with the next chunk if lag was below a threshold. That had the added benefit of being resilient against spikes in write activity from other sources while the job was running.
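In code, the loop looked roughly like this (a sketch, assuming PostgreSQL and psycopg2; `foo`, `expired_at`, the chunk size, and the lag threshold are all invented for illustration):

```python
import time
import psycopg2

MAX_LAG_SECONDS = 5.0  # hypothetical threshold; tune for your workload
CHUNK_SIZE = 10_000    # rows per chunk

def replication_lag_seconds(conn):
    """Worst replica replay lag in seconds, as seen from the primary.
    Reading these columns requires the pg_monitor role (or superuser)."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
            FROM pg_stat_replication
        """)
        return float(cur.fetchone()[0])

def delete_in_chunks(conn):
    while True:
        # Back off while any replica is too far behind; this also absorbs
        # write spikes from other sources while the job is running.
        while replication_lag_seconds(conn) > MAX_LAG_SECONDS:
            time.sleep(1.0)
        with conn, conn.cursor() as cur:  # one chunk per transaction
            cur.execute(
                "DELETE FROM foo WHERE id IN "
                "(SELECT id FROM foo WHERE expired_at < now() LIMIT %s)",
                (CHUNK_SIZE,),
            )
            if cur.rowcount == 0:
                return  # nothing left to delete
```

One caveat: `pg_stat_replication` on the primary only sees directly attached replicas, so with cascading replication you’d have to poll the downstream replicas themselves.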
I tried to add a lot of weasel-wording, but maybe not enough. I think what I’m trying to say is not that “they’re universally true”, but that “there is a chance that you will sooner or later deal with that problem so you should be prepared”.
In the end, what if you’re just lucky: you YOLO DELETE whatever you want, your servers manage to do it without anyone noticing or caring, and your replication copes with the traffic? Then you will continue living happily, not aware that you just dodged a bullet. But luck won’t save you some other time, and you will have to find that out the hard way.
So, I guess I try to replicate what happens during technical interviews: keep piling complications onto the initial simple solution to see how the candidate would deal with them.
Some people just want to reveal all the spoilers, don’t they?
So, suppose we create a 10M-row table on not-too-fast hardware, to make it easier to reproduce locking. We run a number of fast clients that perform many random single-row inserts/updates/deletes. Then, in parallel, we run “DELETE FROM foo WHERE id % 2 = 0” (every other row), basically forcing it to work over the entire index.
Are you saying that those fast clients at no point will have to wait while the huge DELETE is running? Which db engine is that? Where is the catch?
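For what it’s worth, here is roughly how one could script that experiment and watch whether the fast clients stall (a sketch assuming PostgreSQL and psycopg2; swap in whichever engine and driver you want to test, plus the hypothetical `foo` table from above with an integer `id` and a `val` column):

```python
import random
import threading
import psycopg2

def fast_client(stop_event):
    """Many random single-row writes, one short transaction each."""
    conn = psycopg2.connect("dbname=test")
    while not stop_event.is_set():
        rid = random.randrange(10_000_000)
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE foo SET val = val + 1 WHERE id = %s", (rid,))
    conn.close()

def bulk_delete():
    """A single statement touching every other row, i.e. half the index."""
    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        cur.execute("DELETE FROM foo WHERE id % 2 = 0")
    conn.close()

stop = threading.Event()
clients = [threading.Thread(target=fast_client, args=(stop,)) for _ in range(8)]
for t in clients:
    t.start()
bulk_delete()  # while this runs, measure the fast clients' latency
stop.set()
for t in clients:
    t.join()
```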
I worked on a carrier-scale “convergent” (i.e., telephony and internet) billing system in the late ’90s and I can vouch for the fact that getting data out of the RDBMS was more difficult than getting it in, at least back then. In the physical world I’ve heard that getting heat out of data centers is more difficult than getting electricity in, too.
Why not cart around a “deleted” column and just do an update on the million rows?
adding a “deleted” column is also an expensive operation, and anyway you have to physically delete data to save space, or even because of data retention policies.
also, “just do an update on the million rows” will often kill your replication just the same if you’re not careful.
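i.e. the soft-delete UPDATE needs the same chunk-and-pause treatment as the DELETE. Under the same assumptions as the sketch above (`deleted`, `expired_at`, and the batch size are illustrative):

```python
import time
import psycopg2

conn = psycopg2.connect("dbname=app")
while True:
    with conn, conn.cursor() as cur:  # one chunk per transaction
        cur.execute("""
            UPDATE foo SET deleted = true
            WHERE id IN (SELECT id FROM foo
                         WHERE NOT deleted AND expired_at < now()
                         LIMIT 10000)
        """)
        if cur.rowcount == 0:
            break
    time.sleep(1.0)  # or gate on replication lag, as shown earlier
```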
What if you have a column-oriented store? That should allow locking “deleted” pretty easily.
yep. The main point of the text is the idea of propagation delay, which could later, in follow-up posts, be applied to updating and inserting data, for example; or to dealing with lagging replicas; or to reducing unintended atomicity; or to backpressure in distributed database systems. And of course, column-oriented storages (an extremely interesting topic) still won’t help with the actual deletion of data in a replicated environment.
Thanks! Ok, this time Substack gets my email, I wanna read part 2.
Suggestion: you could simply take European privacy laws as a reason to actually delete data after some time.
haha thank you.
Yeah, I think that some people don’t realize that data retention works both ways: not only must you retain some data long enough, but you also must not retain some other data after some time. Maybe someday we’ll discuss some regulatory things too.