This post is as bad as the Twitter one from them a month or so ago. The analysis of this problem is really poor and misses fundamental questions. Specifically: what happens during collisions? The post discusses using a random string or an incremental ID, and then either using those values directly or running them through a hash function to get the tiny URL. But the random string and hash approaches both have a very real possibility of collisions (the IDs in the URL are only 7 characters). The post also proposes that an incremental ID would be more efficient, but does not discuss the memory/lock contention around all requests incrementing the same ID.
I don’t think the author of this post has actually implemented anything like this; this analysis is more or less what we throw out in the first 5 minutes of actually giving this question in interviews.
It’s said that there are ~644 million URLs at the time of this writing.
That’s quite far from reality. The article links to a 2012 piece saying there are 644M websites, not URLs.
I can’t say for 2012, but currently we’re closer to 45B URLs. Which is quite another story than 644M.
Posted this there, waiting for moderation:
There are a few points in your analysis that are incorrect. An interviewer familiar with this test will beat on those:
Neither hashes nor random strings of 7 characters would work in practice. You calculated how many URLs you can store, but that assumes you can use all of that space efficiently. Once you have a few URLs stored in your system, you will have clashes, and avoiding those clashes is non-trivial in a globally distributed system. Even worse, random strings will hit that limit much earlier if the system is popular, since every request generates a new key. Conversely, with actual hashes the likelihood of clashes depends on the number of URLs you store. Since you have chosen a fairly long identifier, you’d be “more or less” safe until you store about 100,000,000 URLs, but that goes against the URL being “tiny”. You could store the 644,000,000 URLs using 5-character identifiers, so your identifiers are currently too big.
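To make the numbers above concrete, here is a quick back-of-the-envelope sketch (assuming a 62-character alphabet of a–z, A–Z, 0–9, and using the standard birthday-problem approximation for random keys; the 644M figure is the one from the article):

```python
import math

ALPHABET_SIZE = 62  # a-z, A-Z, 0-9

def keyspace(length: int) -> int:
    """Number of distinct identifiers of the given length."""
    return ALPHABET_SIZE ** length

def birthday_collision_prob(n_keys: int, space: int) -> float:
    """Approximate probability of at least one collision after
    drawing n_keys uniformly at random from the space."""
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * space))

print(keyspace(5))  # 916,132,832 -- already enough for 644M URLs
print(keyspace(7))  # ~3.5 trillion
# With random 7-char keys, a collision becomes more likely than not
# after roughly sqrt(2 * ln(2) * space) keys:
print(int(math.sqrt(2 * math.log(2) * keyspace(7))))  # ~2.2 million
```

So 7 random characters give you a huge nominal space, but a ~50% chance of a clash after only a couple of million stored URLs, which is why the clash-handling question is unavoidable.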
When you suggest using a “better function” you point to actual hashes, but both functions you suggest are bad examples. CRC is designed for cheap error detection; its distribution is less than ideal, and it is easy to attack (an attacker could very easily generate many different URLs that hash to the same CRC). SHA-1 is better, but also has known attacks.
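For illustration, here is a sketch of the hash-based approach using SHA-256 instead of the hashes the post suggests (truncating to 7 base62 characters; the function name and alphabet here are my own, not from the post). Whatever hash you pick, the truncation shrinks the effective space, so clashes remain possible:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def short_id(url: str, length: int = 7) -> str:
    """Derive a short identifier by truncating a SHA-256 digest
    into base62 characters."""
    n = int.from_bytes(hashlib.sha256(url.encode("utf-8")).digest(), "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)

# Deterministic: the same URL always maps to the same key. But two
# different URLs can still truncate to the same 7 characters, so a
# clash check at store time is still required.
print(short_id("https://example.com/a"))
```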
I doubt there is a cost penalty for the database depending on which type of ID you use. The real cost of random IDs is that you need to check for clashes, which requires a transaction, and that is not really possible in a globally distributed system.
I got confused when you introduced the idea of using GUIDs. GUIDs won’t work; they are simply too large (in order to guarantee uniqueness). But then you move to incremental IDs, by which I assume you mean a counter that increments every time you write an entry. While that could work, there is much more to it: generating a unique sequence of IDs in a distributed system is not trivial either, and you’ll need to find your way around that. However, once you do, why are you adding an extra hash on top of the ID? It is already unique.
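One common way around the single-counter problem is range leasing: each node leases a block of IDs from a coordination point (a database row, ZooKeeper, etc.) and then hands them out locally without per-request coordination. A minimal single-process sketch, with hypothetical names of my own choosing:

```python
class RangeAllocator:
    """Stand-in for the coordination service that hands out ID blocks."""

    def __init__(self, block_size: int = 1000):
        self._next = 0
        self._block_size = block_size

    def lease_block(self) -> range:
        start = self._next
        self._next += self._block_size
        return range(start, start + self._block_size)


class Node:
    """A front-end node that consumes its leased block locally."""

    def __init__(self, allocator: RangeAllocator):
        self._allocator = allocator
        self._ids = iter(())  # empty until the first lease

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            self._ids = iter(self._allocator.lease_block())
            return next(self._ids)


alloc = RangeAllocator()
a, b = Node(alloc), Node(alloc)
print(a.next_id(), b.next_id())  # 0 and 1000: disjoint ranges, no clashes
```

The IDs are unique but not globally ordered, which is usually fine for a URL shortener; the trade-off is that a crashed node wastes the rest of its leased block.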
You don’t need distribution only for scalability reasons (which I doubt would be much of a problem here) but for both availability and latency. If the system is global, you don’t want people in Australia connecting to a datacenter in Ireland, and you need to cope with network and server failures. Distributing the solution is the meat of this problem; if you could solve it with a single computer, it would be pretty trivial.
I found this StackOverflow question, especially the first two answers, very helpful in order to understand how such a service might work.
The second reply is particularly interesting: it suggests using a base36 or base62 encoding of the key in the database….
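That trick can be sketched in a few lines: keep a plain integer primary key in the database and only encode it when building the short URL (base62 shown here; the alphabet ordering is an arbitrary choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode(s: str) -> int:
    """Decode a base62 string back to the integer key."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode(644_000_000))  # fits in 5 characters
```

Since 62^5 is about 916 million, every key up to the article’s 644M figure fits in 5 characters, which matches the earlier point that 7-character identifiers are larger than needed.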
Learning something new every day….
Is the answer “write a 5-line CGI script that catches POSTs and appends RedirectTemp directives to an .htaccess file in the webroot” a hire or no-hire? :)
Shouldn’t 1 byte per char be enough?