This answers some of the questions I have when companies talk about fully encrypting their customer data. But at the same time, it raises a lot of questions about practicality. I’d like to see some sort of survey of how the big (TM) companies actually store their data. How many of them use any of the techniques described here, and for how much of their data (and which parts, e.g. email, home address)? And how many just give up and rely on ~~taping down disks~~ disk encryption and network isolation, hoping that nobody ever breaks into the actual box?
Thanks for giving away so much information for free.
I always shudder in horror when asked to “implement encryption” in a database, then try to argue why it’s a nice idea but riddled with problems. Good to see that there are at least some improvements being made to existing software, but it all has a long way to go before being practically useful, especially if you want to keep using the RDBMS product you’re already using.
This mentions the full disk encryption use case very briefly, though I think it’s actually a common one. Say you have an application that runs on laptops or computers which are portable enough to get stolen. You want the benefits of full disk encryption, i.e. that someone who steals a device probably won’t be able to read the data for your application running on it. You aren’t confident relying on the assumption that all the end user devices have FDE turned on. If you naively encrypt the files containing the tables with a cipher designed for disk encryption, you can get the same mitigation for this very limited threat model as FDE would’ve given you.
Another use case I’ve seen is that you ship an application which stores data on end user devices (esp mobile apps) and you don’t want the owner of that device being able to read all of the data just by grepping the filesystem for strings: you want them to have to at least take the effort to reverse engineer the application first. For this purpose, basically anything will do, so encrypting the table files in the same way as above also suffices. This is an extremely crappy threat model but people do want it.
That’s the ransomware encrypting your devices, right? Then you have a bunch of disks filled with garbage.
But I think for most people disk encryption already goes a long way. Why bother breaking into the database if you can just siphon the keys and decrypted data at the application server?
The problem with these designs is that they have a significant enough leakage that it no longer provides semantic security.
I think the thing is that this might be ok. Like, unless you can reliably discover the cleartext, “close enough” might not really matter. Let’s say I have your social security number in my database but it’s ECB encrypted - how hard would it be to reveal information about your actual SSN? Serious question - I don’t know the answer. Is this ok sometimes?
Or if someone is storing images, like let’s say some naked pictures of themselves, there’s a huge difference between “very fuzzy image with no details, could basically be anyone” vs “the cleartext image”.
The main issue I’ve had with encryption in a database is that it’s going to wreak havoc on compression and any other performance wins that rely on the semantics of the data. If I have tons of JSON documents being stored I can compress them very effectively because their keys are all the same. If I encrypt each document in a way that there’s no leakage I’m fucked, obviously that destroys compression. I imagine that, while ECB leaks information, that leakage may actually make it more amenable to compression?
Encrypting values has never been viable for me for this reason. Some of the approaches mentioned in the blog seem like they may help here though, I’ll probably have to read this a few times. I think I might be willing to do an “encryption that leaks some information but has very little or no performance cost” if it were available, depending on what the data is and what can be leaked. I’ve tried to think about this before because I reallllly want to have more guarantees about how data is stored, but I’ve worked on systems where storage on disk was a critical issue and losing compression altogether would be a no-go.
Thanks for writing such detailed posts, they’ve always been very helpful as someone who’s not an expert.
> Let’s say I have your social security number in my database but it’s ECB encrypted - how hard would it be to reveal information about your actual SSN? Serious question - I don’t know the answer. Is this ok sometimes?
You can use a chosen-plaintext attack of all possible SSNs (and probably narrow it down faster than that if you know the target’s place of birth).
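To make the SSN point concrete, here is a rough sketch of that guessing attack. It assumes the attacker can get chosen values encrypted under the same key (for example by submitting records of their own) and already knows the first five digits of the target. `det_encrypt`, `recover_ssn`, and the sample SSN are hypothetical names for illustration, and HMAC-SHA256 is only a standard-library stand-in for "equal plaintexts produce equal ciphertexts"; it is not ECB itself.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)  # secret key, known only to the server


def det_encrypt(plaintext: str) -> bytes:
    """Stand-in for any deterministic scheme: same input, same output.

    Assumption: HMAC-SHA256 models only the equality leak of ECB-style
    encryption. It is not a cipher and cannot be decrypted.
    """
    return hmac.new(KEY, plaintext.encode(), hashlib.sha256).digest()


def recover_ssn(target_ct: bytes, area: str, group: str):
    """Try every serial number for a known area/group prefix."""
    for serial in range(10_000):
        guess = f"{area}-{group}-{serial:04d}"
        # The attacker never sees KEY; they only compare oracle outputs.
        if hmac.compare_digest(det_encrypt(guess), target_ct):
            return guess
    return None


stored = det_encrypt("123-45-6789")  # the value sitting in the database
recovered = recover_ssn(stored, "123", "45")
print(recovered)
```

Even with no prior knowledge the whole space is only about a billion candidates, so the search stays cheap for anyone who can reach the encryption oracle.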
> Or if someone is storing images, like let’s say some naked pictures of themselves, there’s a huge difference between “very fuzzy image with no details, could basically be anyone” vs “the cleartext image”.
Yes! The commentary on the ECB art is relevant here.
> If I have tons of JSON documents being stored I can compress them very effectively because their keys are all the same. If I encrypt each document in a way that there’s no leakage I’m fucked, obviously that destroys compression.
Note: you don’t have to encrypt the JSON keys. The values are what really need to get encrypted.
You can still get some compression (and indexing, etc.) benefits with the NoSQL approaches + some of the searchable encryption features.
I’m not sure how much this buys you.
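For the JSON point above, a minimal structural sketch of "encrypt the values, leave the keys alone" might look like this. `encrypt_values` and `placeholder_encrypt` are hypothetical names, and the placeholder is deliberately not encryption; in real use you would pass in an AEAD from a vetted library instead.

```python
import base64
import json
from typing import Any, Callable


def encrypt_values(node: Any, encrypt: Callable[[bytes], bytes]) -> Any:
    """Return a copy of `node` with every leaf encrypted and keys untouched."""
    if isinstance(node, dict):
        return {key: encrypt_values(value, encrypt) for key, value in node.items()}
    if isinstance(node, list):
        return [encrypt_values(item, encrypt) for item in node]
    # Serialize the leaf so its JSON type (string, number, bool, null)
    # can be restored after decryption.
    ciphertext = encrypt(json.dumps(node).encode())
    return base64.b64encode(ciphertext).decode("ascii")


def placeholder_encrypt(data: bytes) -> bytes:
    # NOT encryption: reverses the bytes so the sketch runs with no dependencies.
    return data[::-1]


doc = {"process": "sshd", "path": "/usr/sbin/sshd", "tags": ["net", "daemon"], "pid": 4242}
protected = encrypt_values(doc, placeholder_encrypt)
print(json.dumps(protected, indent=2))
```

The document shape and key names stay in the clear, so a compressor or an index on the keys still has something to work with, while whatever the values leak depends entirely on the scheme you plug in.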
> Thanks for writing such detailed posts, they’ve always been very helpful as someone who’s not an expert.
> You can use a chosen-plaintext attack of all possible SSNs (and probably narrow it down faster than that if you know the target’s place of birth).
True! That’s a good point. It’s very data dependent.
> Note: you don’t have to encrypt the JSON keys. The values are what really need to get encrypted.
That’s a good point, although compression of values is really helpful for a lot of data T_T. In my case I have a lot of information like process names and file paths, and distinct values can still compress extremely well.
> I’m not sure how much this buys you.
I think that’s the question, really. How much compression am I willing to spend to buy how much security? ECB is sort of interesting there since it can make some data useless, some data harder to obtain, and does nothing for other bits of data - but presumably it doesn’t have nearly the same impact on compression.
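A quick way to get a feel for that trade-off is to compress the same column three ways. This is only a sketch under assumptions: the sample paths are made up, `det_token` uses a truncated HMAC as a stand-in for a deterministic (ECB-like) scheme, and `rand_token` uses fresh random bytes as a stand-in for the output of a randomized scheme.

```python
import hashlib
import hmac
import os
import zlib

KEY = os.urandom(32)

# Hypothetical column: a handful of distinct values repeated many times.
values = ["/usr/bin/python3", "/usr/sbin/sshd", "/bin/bash", "/usr/bin/curl"] * 500


def det_token(value: str) -> bytes:
    # Deterministic stand-in: equal values always map to equal 16-byte tokens.
    return hmac.new(KEY, value.encode(), hashlib.sha256).digest()[:16]


def rand_token(value: str) -> bytes:
    # Randomized stand-in: the output is unrelated to earlier outputs.
    return os.urandom(16)


blobs = {
    "plaintext": "\n".join(values).encode(),
    "deterministic": b"".join(det_token(v) for v in values),
    "randomized": b"".join(rand_token(v) for v in values),
}

sizes = {name: len(zlib.compress(blob, 9)) for name, blob in blobs.items()}
for name, size in sizes.items():
    print(f"{name:>13}: {len(blobs[name])} bytes -> {size} bytes compressed")
```

The deterministic column compresses for exactly the reason it leaks: repeated values stay visible as repeated ciphertexts. The randomized column gives the compressor nothing to work with.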
You and me both. I imagine the answer will vary wildly, even in FAANG.
For example, AWS released a thing recently called C3R which does cryptographic computation in their Clean Rooms.
Who knows what the other big companies have in the works, though. I’d be delighted to hear some of these motions.
Very nice.
Thanks for pointing out the typo.
I hope it didn’t get across as snarky, I rather found it funny.
It did a little, so thank you for clarifying. :)
The poem about ECB is worth the price of admission alone. Solid article @soatok.
When you said a poem about ECB I wondered if it was going to be something like:
Happy to help!
Does encrypting using the hash of the data as a nonce count as using a static nonce?
Yes. If you want to go this route, you should look at AES-SIV or AES-GCM-SIV rather than rolling your own.
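A small sketch of why the hash-as-nonce idea behaves like a static nonce, and why the SIV modes are the safer version of it. `hash_nonce` and `keyed_nonce` are hypothetical helpers, and the HMAC in `keyed_nonce` only illustrates the idea of deriving the IV with a key, which the SIV modes do with their own constructions rather than with HMAC.

```python
import hashlib
import hmac
import os


def hash_nonce(plaintext: bytes) -> bytes:
    # Roll-your-own idea from the question: the nonce is an unkeyed hash of the data.
    return hashlib.sha256(plaintext).digest()[:12]


def keyed_nonce(key: bytes, plaintext: bytes) -> bytes:
    # Illustration only: derive the nonce with a secret key instead.
    return hmac.new(key, plaintext, hashlib.sha256).digest()[:12]


email = b"alice@example.com"

# Equal plaintexts always get the same nonce, so with a fixed key the
# ciphertext repeats too: the scheme is deterministic.
repeats = hash_nonce(email) == hash_nonce(email)

# Nonces are normally stored next to the ciphertext. With an unkeyed hash,
# anyone who reads the stored nonce can confirm a guess without the key.
stored_nonce = hash_nonce(email)
guess_confirmed = hash_nonce(b"alice@example.com") == stored_nonce

# With a keyed derivation, someone without the key cannot run that check.
k1, k2 = os.urandom(32), os.urandom(32)
keyed_differs = keyed_nonce(k1, email) != keyed_nonce(k2, email)

print(repeats, guess_confirmed, keyed_differs)
```

AES-SIV and AES-GCM-SIV are still deterministic when you feed them the same inputs, so equal plaintexts can still be spotted, but their synthetic IV depends on the key and they come with analysis that a homemade construction lacks.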