      Really nice overview of a lot of the practical aspects of this stuff. There’s just so much woo-woo hand-waving around it these days that it’s refreshing to read something concrete.

        Wow, this is amazingly informative! With open source tools and pointers to models / datasets to boot.

        Kinda wish I had more time to play with this stuff, since I have several big pieces of text accumulated over the years …

        The blog tech is very impressive too – glad you are building your own, rather than relying on other people’s services.

        My reaction to the first part was “well, N-grams work pretty well for document similarity”, N-grams being a pretty intuitive way to explain high-dimensional spaces (each dimension corresponds to, say, 3 words).
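
        To make that concrete, here’s a toy sketch of what I mean (my own made-up example, not from the post): treat each distinct word trigram as its own dimension, count trigrams per document, and compare documents with cosine similarity over those counts.

        ```python
        # Toy word-trigram similarity: every distinct trigram is one dimension,
        # documents are sparse count vectors, and we compare them with cosine.
        from collections import Counter
        from math import sqrt

        def trigram_counts(text):
            words = text.lower().split()
            return Counter(zip(words, words[1:], words[2:]))

        def cosine(a, b):
            dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
            norm_a = sqrt(sum(v * v for v in a.values()))
            norm_b = sqrt(sum(v * v for v in b.values()))
            return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

        doc_a = "the cat sat on the mat and then the cat slept on the mat"
        doc_b = "the cat slept on the mat next to the door"
        doc_c = "stock prices fell sharply after the earnings report"

        print(cosine(trigram_counts(doc_a), trigram_counts(doc_b)))  # overlapping trigrams -> noticeably above zero
        print(cosine(trigram_counts(doc_a), trigram_counts(doc_c)))  # no shared trigrams -> 0.0
        ```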

        And then I opened up the 2013 word2vec paper, and the first paragraph motivates neural networks over N-grams

        https://arxiv.org/pdf/1301.3781.pdf

        It also mentions DistBelief at the end, distributed neural network training, which I remember as the project that kinda kicked off the last decade of progress. I’m pretty sure that’s where the GPUs + cluster training architecture started – before that it was just single GPUs. Although I guess GPUs only started to be used a couple years before that!


        I just read Genius Makers, which goes through the last decade of AI history, and it’s quite good. I had a decent awareness of all this stuff, but the book talks about the huge “technology transfer” from Toronto, Montreal, and NYU to Google, DeepMind, Facebook, Microsoft, Baidu, etc. Basically how all the AI labs came together

        https://www.amazon.com/Genius-Makers-Mavericks-Brought-Facebook/dp/1524742678

        It’s encouraging to see all the open source equivalents of this tech!

          Great writeup. Love the “vibes-based search” joke. I am a technical writer (TW) by trade. I have been telling my fellow TWs that, for us docs people, embeddings might be the most important tool to come out of all this LLM hullabaloo, because they make it much more feasible for us to build semantic search systems integrated across all the usual knowledge sources: official docs, unofficial docs, source code, forums, etc.
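
          Roughly what I have in mind, as a sketch (the model name, query, and tiny “corpus” are placeholders I made up, not anything from the post): embed snippets from the different sources once, embed the query, and rank by cosine similarity.

          ```python
          # Rough sketch of embeddings-backed search across mixed doc sources.
          # Assumes the sentence-transformers package; model and corpus are illustrative.
          from sentence_transformers import SentenceTransformer, util

          corpus = [
              "Official docs: how to configure the build cache",        # official docs
              "Forum answer: clearing the build cache fixed my error",  # forum post
              "README: cache settings live in config.toml",             # source repo
          ]

          model = SentenceTransformer("all-MiniLM-L6-v2")
          corpus_emb = model.encode(corpus, convert_to_tensor=True)

          query_emb = model.encode("why is my build cache broken", convert_to_tensor=True)
          scores = util.cos_sim(query_emb, corpus_emb)[0]  # cosine similarity per snippet
          best = int(scores.argmax())
          print(corpus[best], float(scores[best]))
          ```

          Swap the toy list for chunks of real docs, forum threads, and code comments, and the same query loop works across all of them.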

          I will add this to my list of solid embeddings explainers. Cohere’s has been my go-to up until now: https://txt.cohere.com/text-embeddings/

            Oh, good. This is a frequent topic of confusion when ML people need to communicate with non-ML engineers, and it’s really nice to have a resource to point people at.