
This is something a few of my colleagues worked on. Short summary: word + subword embedding tables are very large. For example, if you have 2 million word embeddings and 2 million subword embeddings of 300 dimensions, the embedding table is about 4.4 GiB (assuming 32-bit floats), which is prohibitively large for many applications.
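For reference, a rough back-of-the-envelope calculation of that size (my own, assuming 32-bit floats):

```python
# 2M word vectors + 2M subword vectors, 300 dimensions, 4 bytes per float32
n_vectors = 2_000_000 + 2_000_000
size_gib = n_vectors * 300 * 4 / 2**30
print(f"{size_gib:.2f} GiB")  # ~4.47 GiB
```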

This post describes a simple but effective trick to compress embedding tables, inspired by Bloom filters. Rather than storing and querying (sub)words explicitly, a word and its subwords are hashed. The hashes are used to index into the embedding table, and the final embedding is the average of these ‘hash embeddings’. Even though this leads to collisions, the multiple-hashing scheme makes it possible to use much smaller embedding tables.
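To make the idea concrete, here is a minimal sketch of such a Bloom-style hash embedding lookup in Python/NumPy. The bucket count, the number of hash functions, and the use of character n-grams as subwords are illustrative assumptions, not the actual implementation the post describes:

```python
import hashlib
import numpy as np

N_BUCKETS = 100_000  # hypothetical: far fewer rows than distinct (sub)words
DIMS = 300
N_HASHES = 2         # hypothetical: hash functions per token

# One small, shared embedding table instead of one row per (sub)word.
rng = np.random.default_rng(42)
table = rng.normal(scale=0.1, size=(N_BUCKETS, DIMS)).astype(np.float32)

def bucket_indices(token: str) -> list[int]:
    """Map a token to N_HASHES rows of the table via seeded hashes."""
    return [
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest(),
            "little",
        ) % N_BUCKETS
        for seed in range(N_HASHES)
    ]

def embed(word: str, subwords: list[str]) -> np.ndarray:
    """Average the bucket embeddings of a word and its subwords."""
    rows = [i for token in [word, *subwords] for i in bucket_indices(token)]
    return table[rows].mean(axis=0)

# Example with character n-grams as subwords (an assumption for illustration).
vector = embed("unbelievable", ["<un", "unb", "nbe", "ble", "le>"])
```

Because each token is spread over several rows, two tokens only clash completely when all of their hashes collide, which is what lets the table be much smaller than the vocabulary.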


    Interesting read, but I would appreciate some more detailed analysis of the choices. For example, how does using two hash functions compare to a single hash with more bits? As written, it’s not clear whether these choices were arbitrary or whether there is a sound mathematical rationale behind them.