I’ve read the article (once) and I’m trying to understand exactly why in practical terms is this a real security problem?
The article mentions:
The extract phase is inappropriately being used for domain separation, when the security goal is only to create an IND KDK.
Then the expand phase creates a new unique key using… the year.
Uhm, a year is not unique!!! Literally, for this use case, the info parameter must be unique.
So, the issue seems more like an improper theoretical usage of the HKDF construct (by misplacing the arguments). However, by dissecting how HKDF is constructed, even with this misplacement of arguments the resulting keys should be good enough. (Minus the canonicalization problem discussed.)
So, I’ve tried to expand the HKDF construct into its basic elements (assuming we don’t also need to expand the output length):
HMAC(k, m) = H( B1(k) || H( B2(k) || m ) )
// NOTE: H is the hash function, say SHA256.
// NOTE: B1 and B2 are plain XORs of the (block-padded) key with a constant (opad and ipad, respectively).
HKDF(salt, ikm, info) = HKDF_expand( HKDF_extract(salt, ikm), info )
// NOTE: The HMAC-key is the HKDF-salt, and the HMAC-data is the HKDF-key.
HKDF_extract(salt, ikm) = HMAC(salt, ikm)
// NOTE: Let's assume we are OK with the output size as is, without actually expanding it (real HKDF-Expand also appends a counter byte to info, omitted here).
HKDF_expand(prk, info) = HMAC(prk, info)
HKDF(salt, ikm, info) = HMAC( HMAC(salt, ikm), info )
HKDF(salt, ikm, info) = HMAC( H( B1(salt) || H( B2(salt) || ikm ) ), info )
HKDF(salt, ikm, info) =
H(
  B1( H( B1(salt) || H( B2(salt) || ikm ) ) )
  || H( B2( H( B1(salt) || H( B2(salt) || ikm ) ) ) || info )
)
Setting the mangling aside, to make it easier to read, it’s more something like:
HKDF_no_mangling(salt, ikm, info) =
H(
  H( salt || H( salt || ikm ) )
  || H( H( salt || H( salt || ikm ) ) || info )
)
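The expansion above can be checked mechanically. Here is a minimal Python sketch, assuming SHA-256 as H and a single output block. B1 and B2 are the usual HMAC opad/ipad maskings (0x5c and 0x36), and the helper names (`hmac_manual`, `hkdf_one_block`) are made up for illustration. Unlike the simplified formula, it appends the 0x01 counter byte that RFC 5869 uses for the first expand block.

```python
import hashlib
import hmac

BLOCK = 64  # SHA-256 block size in bytes


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _mask(key: bytes, const: int) -> bytes:
    # HMAC key preparation: hash over-long keys, zero-pad to the block size,
    # then XOR every byte with the given constant.
    if len(key) > BLOCK:
        key = H(key)
    key = key.ljust(BLOCK, b"\x00")
    return bytes(b ^ const for b in key)


def B1(key: bytes) -> bytes:
    return _mask(key, 0x5C)  # outer pad (opad)


def B2(key: bytes) -> bytes:
    return _mask(key, 0x36)  # inner pad (ipad)


def hmac_manual(key: bytes, msg: bytes) -> bytes:
    # HMAC(k, m) = H( B1(k) || H( B2(k) || m ) )
    return H(B1(key) + H(B2(key) + msg))


def hkdf_one_block(salt: bytes, ikm: bytes, info: bytes) -> bytes:
    prk = hmac_manual(salt, ikm)               # extract
    return hmac_manual(prk, info + b"\x01")    # expand, first block only
```

Comparing `hmac_manual` against the standard library's `hmac.new(key, msg, hashlib.sha256).digest()` is a quick way to confirm the nesting order of B1 and B2.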
So, in the end with HKDF, assuming H is actually one way, in practical terms (if one doesn’t care about optimizations, and provided canonicalization is used), does it matter if the domain separation label is put in the salt or info (or even appended to the key)?
(If one puts the label, or part of the label, in the salt, one can cache the intermediate value and only re-run the later stages of H for each new info. This way the “time” axis of the domain separation could be left inside the info, and the constant (canonicalized) table and column left inside the salt.)
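A rough Python sketch of that caching idea, assuming HMAC-SHA-256 and a single expand block. The key, the label encoding, and the function names are placeholders. It only illustrates the mechanics and does not settle whether a label belongs in the salt.

```python
import hashlib
import hmac


def extract(salt: bytes, ikm: bytes) -> bytes:
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def expand_one_block(prk: bytes, info: bytes) -> bytes:
    # First HKDF-Expand block: HMAC(prk, info || 0x01)
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()


master_key = b"\x00" * 32                      # placeholder, not a real key
column_label = b"customers\x00last_order_id"   # assumed to be canonicalized already

# The extract step runs once per table/column ...
prk = extract(column_label, master_key)

# ... and only the expand step is repeated along the "time" axis.
keys_by_year = {
    year: expand_one_block(prk, str(year).encode())
    for year in (2019, 2020, 2021)
}
```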
Going even further, HMAC is defined as such a complicated construct (two nested hashes plus key mangling) so that it holds up even with somewhat weakened hash functions, and in particular despite the length-extension property of hashes such as SHA2.
But if one goes with something like SHA3, the HMAC could be just written as SHA3 (key || m) (assuming the key is fixed length or canonicalized). Thus the HKDF would be equivalent to SHA3( SHA3( salt || key ) || m ) which (provided the salt is fixed length or canonicalized) could be written as just SHA3( salt || key || m ). And given that, one can safely move around (provided canonicalization) the three parameters.
In fact, if I read correctly, Blake3 supports exactly this use case with derive_key(label, key), which (glossing over a lot of details) seems to be similar to keyed_hash(hash(label), key). This also lends itself to optimizing repeated derivation of different keys for the same domain.
I’ve read the article (once) and I’m trying to understand exactly why in practical terms is this a real security problem?
The real-world security issue in AnonCo’s design isn’t mentioned until the end: Canonicalization.
Given:
Table: customers
Column: last_order_id
Receive:
customers_last_order_id
Given:
Table: customers_last_order
Column: id
Receive:
customers_last_order_id, again
This means that two different columns on two different tables can resolve to the same derived key (since the IKM, salt, and info will be identical).
If you’re calculating your wear-out limits for your derived keys in your threat models, your math might be wrong, and you might trigger a nonce reuse condition simply because you’re under-counting how much data you encrypt with the same derived key (the output of HKDF).
Additionally, you can take ciphertexts from one table/column and replay them in another table/column and they’ll likely decrypt successfully, because there is no stated protection against Confused Deputies.
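The collision is easy to reproduce. Below is a small Python sketch. The underscore join mirrors the example above, while the length-prefixed encoding is just one possible fix, not something taken from the article, and both function names are invented for illustration.

```python
import struct


def naive_label(table: str, column: str) -> bytes:
    # Ambiguous: the separator can also appear inside the names.
    return f"{table}_{column}".encode("utf-8")


def canonical_label(*parts: str) -> bytes:
    # Unambiguous: each part is preceded by its byte length
    # (8 bytes, big-endian), so field boundaries cannot shift.
    out = bytearray()
    for part in parts:
        raw = part.encode("utf-8")
        out += struct.pack(">Q", len(raw)) + raw
    return bytes(out)


a = naive_label("customers", "last_order_id")
b = naive_label("customers_last_order", "id")
# a == b: both are b"customers_last_order_id"

c = canonical_label("customers", "last_order_id")
d = canonical_label("customers_last_order", "id")
# c != d: the length prefixes differ
```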
This means that two different columns on two different tables can resolve to the same derived key (since the IKM, salt, and info will be identical).
I agree. But this isn’t a problem with putting the domain separation label in the salt or the info. It’s actually a problem with canonicalization. Meanwhile, the article puts a lot of emphasis on the salt vs. info issue, and only mentions canonicalization as a side problem in the Last Tweaks section.
All in all, thanks for the article! I’ve learned a few more things today.
And in the end, I fully agree with one of your observations:
I shouldn’t need any additional context for that. And that means I need to know: do I use AES-GCM with this key? Do I use as AES-CTR-HMAC or something with that key? And, this is a fairly simple concept in some aspects, like I just put everything into the key and then I get like a very straightforward API where I just have a function called encrypt that takes a plaintext and some associated data and then just encrypts that. Because the key includes everything else that you need to know.
Today’s cryptographic APIs are perhaps a bit too low-level: key, salt, info? Nope. Just give me a function like the following, and let that function handle the canonicalization (and encoding / decoding) issue:
SimpleEncrypt(
secret = _config.global_secret,
context = [
// NOTE: making sure we don't reuse keys in other parts of the code;
"app-x / field-encryption-use-case / 2020",
// NOTE: "production", "staging-1", etc.;
_config.environment,
// NOTE: making sure we can't copy-paste between columns;
[_database_name, _table_name, _column_name],
// NOTE: making sure we can't copy-paste between rows;
_row.id,
// NOTE: making sure the value can't be rolled back in time;
_row.updated_at,
],
plaintext = _value
)
Or something like CipherSweet, where you call a method on an object that you configure with other objects (which in turn manages the configuration and key derivation).
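As a rough illustration of what the key-derivation half of such a function might do internally, here is a Python sketch. Everything in it is an assumption: the names `encode_context` and `derive_context_key`, the tag-and-length encoding, and the choice of HKDF with HMAC-SHA-256 and an all-zero salt. The encryption step itself is left out, since Python's standard library has no AEAD.

```python
import hashlib
import hmac
import struct


def encode_context(item) -> bytes:
    # Hypothetical canonical encoding: lists are tagged "L" plus an item
    # count, scalars are tagged "S" plus a byte length. Note that str()
    # makes 1 and "1" encode identically; a real design would tag types.
    if isinstance(item, (list, tuple)):
        body = b"".join(encode_context(x) for x in item)
        return b"L" + struct.pack(">Q", len(item)) + body
    raw = str(item).encode("utf-8")
    return b"S" + struct.pack(">Q", len(raw)) + raw


def derive_context_key(secret: bytes, context: list) -> bytes:
    # HKDF-Extract with an all-zero salt, then one HKDF-Expand block
    # with the encoded context as info.
    prk = hmac.new(b"\x00" * 32, secret, hashlib.sha256).digest()
    info = encode_context(context)
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()


secret = b"\x00" * 32  # placeholder, not a real secret
key_a = derive_context_key(secret, ["app-x", ["db", "customers", "last_order_id"], 42])
key_b = derive_context_key(secret, ["app-x", ["db", "customers_last_order", "id"], 42])
```

With this encoding, `key_a` and `key_b` differ even though the table and column names join to the same string. A real implementation might pass parts of the context (such as the row ID) as associated data instead of deriving a key per row.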