Require query parameters to be sorted alphabetically
I understand the benefit here, but this is pretty user-unfriendly, especially because the default dictionary type in many languages is unordered. Why not sort these under the hood, in a client library or similar?
If we provided client libraries we would probably do that. But only two of us have worked on this API, and we have client teams using Perl, Java, Scala, JavaScript & Objective-C. We don’t have the capacity to provide & support good native clients for all of those. And as I pointed out, it might look like a hassle, but it helps ensure that clients get as much benefit from caching as they can, so they are not complaining. (At least not to me ;-)
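For what it’s worth, the client-side burden is small even without a full client library. A minimal sketch in Python (the parameter names here are made up for illustration):

```python
from urllib.parse import urlencode

def build_query(params: dict) -> str:
    """Serialize params with keys in alphabetical order, so that
    logically identical requests produce byte-identical URLs."""
    return urlencode(sorted(params.items()))

# Two call sites passing the same filters in different order
# still produce the same query string:
a = build_query({"size": "m", "colour": "red", "page": "2"})
b = build_query({"page": "2", "colour": "red", "size": "m"})
assert a == b == "colour=red&page=2&size=m"
```

Every language in that list has an equivalent one-liner; it only has to live in one shared helper per client.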
+1 for rejecting unknown params. People don’t like it; I love it. :)
I do param normalization in the app, though. Sometimes similar but slightly different queries have the same result, and Memcache is a good fit for that. E.g. I cache the result of the query (rows 1, 2, 4, 8, 11) together with their final output.
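A rough sketch of that kind of app-level normalization, assuming a Memcache-style keyed cache (the key prefix and the drop-empty-values rule are my assumptions, not a prescription):

```python
import hashlib
from urllib.parse import urlencode

def cache_key(params: dict) -> str:
    """Collapse a query into a stable cache key: drop empty values,
    sort the remaining keys, then hash the canonical form."""
    canonical = urlencode(sorted((k, v) for k, v in params.items() if v))
    return "query:" + hashlib.sha1(canonical.encode()).hexdigest()

# "Similar but slightly different" queries collapse to one key:
assert cache_key({"q": "shoes", "page": "1", "ref": ""}) == \
       cache_key({"page": "1", "q": "shoes"})
```

The nice property is that the normalization lives in one place, so the cache hit rate no longer depends on how callers happened to order their parameters.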
I think “query is well formed” is a good rule. “Query is well formed and like so” is a bit much.
Have rearranged but identical queries been a problem in practice? I’d like to hear more about the circumstances under which you could say “we received X reqs but only X/10 uniques”.
Yes, rearranged but identical queries were (and still are) a problem. We run an e-commerce site, and the main product listing would just append filter options to the end of the URL. Thus two identical queries would have different-looking URLs depending on the order in which people selected the filters. I don’t have numbers on how much it hurt us, but that wasn’t really the point. (Given the number of filter options we have, you can construct trillions of distinct queries.) The point was to keep clients from accidentally getting bad cache performance when they can easily avoid it.
On OpenResty you could do some preprocessing on inbound requests to normalize them before caching. Relying on sorted parameters when the standard specifically allows them to be unsorted means you are effectively breaking the standard.
Sure, using a caching proxy that can do normalisation would be an alternative. (And thanks for the pointer; I wasn’t aware of OpenResty.) But you won’t get maximal benefit from local client or CDN caches that way.
RFC 3986 does indeed mention that you can get unnecessary aliases for the same resource, and has this to say about avoiding the problem:

    False negatives are caused by the production and use of URI aliases. Unnecessary aliases can be reduced, regardless of the comparison method, by consistently providing URI references in an already-normalized form (i.e., a form identical to what would be produced after normalization is applied, as described below).

    Protocols and data formats often limit some URI comparisons to simple string comparison, based on the theory that people and implementations will, in their own best interest, be consistent in providing URI references, or at least consistent enough to negate any efficiency that might be obtained from further normalization.
I haven’t found anything in the RFC saying I can’t enforce this normalisation, however. Did I miss something?
Of course you are free to do what you want. I understand you have less control over intermediaries, but all caching should be done on a normalized URL. My pedantic point is that you could be altering semantics in a bad way. I have seen plenty of web apps that required the parameters in the URL to be in a specific order, and that is bad.
As for the intermediaries you don’t have control over: if they are caching, I would consider them broken if they weren’t doing normalization first.
Your server side should assume random order and normalize.
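The server-side version of that advice can be sketched in a few lines of Python; the function name and the redirect-to-canonical idea are illustrative assumptions, not anything the thread prescribes:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Rewrite a URL so its query parameters are in canonical
    (alphabetical) order; path and everything else are untouched.
    A server can cache on, or redirect to, this canonical form."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(parts._replace(query=query))

# Two filter-selection orders collapse to the same canonical URL:
assert normalize_url("/list?size=m&colour=red") == "/list?colour=red&size=m"
assert normalize_url("/list?colour=red&size=m") == "/list?colour=red&size=m"
```

Normalizing here, rather than rejecting unsorted requests, keeps the API within the letter of RFC 3986 while still giving upstream caches a single key per logical query.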
Normalizing the keys is the correct solution.