I read through the DynamoDB problem and was surprised that they had to add special logging to find and disable hot keys like ‘user_id’. If they had set up structured logging, a quick Python script over the logs would have shown the issue.
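For what it's worth, here is a minimal sketch of the kind of script I mean. It assumes the logs are JSON lines and that each record carries the key in a field called `partition_key`; both the format and the field name are my assumptions, not something from the blog.

```python
import json
from collections import Counter


def hottest_keys(log_lines, field="partition_key", top=5):
    """Count how often each value of `field` appears in JSON-lines log records.

    `field` is a hypothetical name; use whatever your log schema calls the key.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and field in record:
            counts[str(record[field])] += 1
    # Most frequent keys first; a hot key should stand out at the top.
    return counts.most_common(top)
```

You would pass it an open log file (or any iterable of lines) and look at the first few entries of the result.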
(I debugged this issue.) This is roughly what we did. I guess it didn’t come across that way in the blog?
From the blog:
“So we dreamt up a simple hack to give us the data we needed: anytime we were throttled by DynamoDB, we logged the key.”
Is there a reason you don’t want to log every single key?
Isn’t that unnecessary? You really only care about the keys when there’s a problem; otherwise your logs will be noisy. If you log every key, it’s hard to tell which keys are fine and which are a problem.
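A rough sketch of the log-only-when-throttled idea from the blog quote, to make the contrast concrete. `ThrottledError`, `put_item`, and the `partition_key` field are hypothetical stand-ins I made up for illustration; with boto3 the throttling case would typically surface as a `ClientError` whose error code is `ProvisionedThroughputExceededException`.

```python
import logging

logger = logging.getLogger("dynamo_throttle")


class ThrottledError(Exception):
    """Hypothetical stand-in for the client's throttling error."""


def write_with_throttle_logging(put_item, key, item):
    """Call put_item(key, item); log the key only if the write is throttled.

    `put_item` is any callable that performs the write (an assumption here,
    not a real SDK signature).
    """
    try:
        return put_item(key, item)
    except ThrottledError:
        # Only throttled writes reach the log, so every logged key is a suspect.
        logger.warning("throttled", extra={"partition_key": key})
        raise
```

Successful writes log nothing, so the log volume scales with the number of throttled requests rather than with total traffic.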
a) It’s sensitive data belonging to third parties; b) at the scale they are dealing with, the logs would impose significant size and time penalties.