Well, I feel dumb now... I reverse-engineered a compression scheme some time ago (by staring at hex dumps until it clicked), and it turns out it’s just a rather simple variant of LZ.
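For anyone curious what "a simple variant of LZ" typically boils down to: the decoder walks a token stream where each token is either a literal byte or an (offset, length) back-reference into the already-decoded output. This is just an illustrative sketch, not the actual scheme from the parent comment — the real bit layout would be whatever fell out of those hex dumps.

```python
def lz_decompress(tokens):
    """Decode a toy LZ token stream: ints are literal bytes,
    (offset, length) tuples copy from already-decoded output."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            # Copy byte-by-byte so overlapping matches (length > offset)
            # work correctly -- this is how LZ encodes run-length repeats.
            for _ in range(length):
                out.append(out[-offset])
        else:
            out.append(tok)
    return bytes(out)

# Three literals, then an overlapping back-reference that
# repeats "abc" twice more: decodes to b"abcabcabc".
tokens = [ord('a'), ord('b'), ord('c'), (3, 6)]
print(lz_decompress(tokens))
```

The overlapping-copy case is the classic gotcha: a naive `memcpy`-style copy breaks when the match length exceeds the offset.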
Apart from the effort you could have saved by just assuming it was some variant of LZ, I think that is still pretty impressive, and I’m sure you learned a lot from it. So no reason to feel dumb.
It is nice to see a data compression example that is both easy to follow (with the blog post in hand) and not trivial.
Also, it is great that he links to the blog posts of ryg and cbloom which contain lots of practical knowledge.
The code is set up to test using enwik8. It is easy to end up optimizing for a specific type of data, which might adversely affect the results on other types, so I would suggest testing with some non-text data sets as well if he isn’t already (perhaps the Silesia corpus, which contains various types of data).
Compressing enwik8 took 22 seconds; a quick hack to compress silesia.tar, which is roughly twice as big, took almost 2 minutes and segfaulted on decompression. A good indication it needs some testing on other types of data.