
      Please excuse the attention-grabbing title. The writing itself is actually quite good.

      This is an article about the paper "Hyena Hierarchy: Towards Larger Convolutional Language Models" (https://arxiv.org/abs/2302.10866), which appears very interesting, as it may improve on current state-of-the-art attention-based LLMs.


        There are some really cool results in here:

        "We measure 5x speedups over dense self-attention at length 8192 – 2x over highly optimized FlashAttention (Dao et al., 2022b) – and 100x speedup over FlashAttention at sequence lengths of 64k, where standard attention implementation in PyTorch runs out of memory."

        They match GPT's perplexity while handling much longer sequence lengths. They also compared it to the recent RWKV, another Transformer alternative, and they smoke everyone when the sequence length is > 30k (Table 4.2, page 10).
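
        The scaling behind those numbers: dense self-attention materializes an N×N score matrix, so at 64k tokens that is roughly 4 billion entries per head, which is why the plain PyTorch implementation runs out of memory, while Hyena's long convolutions can be evaluated with FFTs in O(N log N) time and O(N) memory. Below is a minimal sketch of that contrast, not the paper's implementation; in real Hyena the filters are causal and generated implicitly by a small network, whereas here `h` is just a random stand-in.

        ```python
        import torch

        def dense_attention(q, k, v):
            # Standard attention: builds an (N, N) score matrix -> O(N^2) time and memory.
            scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
            return torch.softmax(scores, dim=-1) @ v

        def fft_long_conv(x, h):
            # Long convolution via FFT: O(N log N) time, O(N) memory.
            # Zero-pad to length 2N so circular convolution matches linear convolution,
            # then keep the first N (causal) outputs.
            n = x.shape[0]
            X = torch.fft.rfft(x, n=2 * n, dim=0)
            H = torch.fft.rfft(h, n=2 * n, dim=0)
            return torch.fft.irfft(X * H, n=2 * n, dim=0)[:n]

        N, d = 8192, 64
        q = k = v = torch.randn(N, d)
        h = torch.randn(N, d)                 # stand-in filter; Hyena generates its filters with a small network

        out_attn = dense_attention(q, k, v)   # allocates an 8192 x 8192 score matrix
        out_conv = fft_long_conv(v, h)        # never forms an N x N matrix
        ```

        The crossover the paper reports (2x at 8192, 100x at 64k) is roughly what you'd expect from N^2 vs N log N once sequences get long enough.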