Really nice overview of a lot of the practical aspects of this stuff. There’s just so much woo woo hand waving around it these days that it’s refreshing to read something concrete.
Wow, this is amazingly informative! With open source tools and pointers to models / data sets to boot
Kinda wish I had more time to play with this stuff, since I have several big pieces of text accumulated over the years …
The blog tech is very impressive too – glad you are building your own rather than relying on other people’s services
My reaction to the first part was “well, N-grams work pretty well for document similarity”, N-grams being a pretty intuitive way to explain high-dimensional spaces (each dimension corresponds to, say, a particular sequence of 3 words)
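Just to make that concrete, here is roughly what I have in mind, as a toy sketch (the documents are made up, nothing from the post): count each 3-word sequence as its own dimension and compare documents with cosine similarity.

    from collections import Counter
    from math import sqrt

    def trigram_counts(text):
        # one dimension per distinct 3-word sequence
        words = text.lower().split()
        return Counter(tuple(words[i:i + 3]) for i in range(len(words) - 2))

    def cosine(a, b):
        # cosine similarity between two sparse count vectors
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    doc1 = "the cat sat on the mat and then the cat slept"
    doc2 = "the cat sat on the mat while the dog barked"
    print(cosine(trigram_counts(doc1), trigram_counts(doc2)))

It works fine for near-duplicate detection, but the dimensions have no notion of meaning, which is exactly the gap embeddings fill.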
And then I opened up the 2013 word2vec paper, and the first paragraph motivates neural networks over N-grams
https://arxiv.org/pdf/1301.3781.pdf
It also mentions DistBelief at the end, Google’s distributed neural network training system, which I remember as the project that kinda kicked off the last decade of progress. I’m pretty sure that’s where the GPUs + cluster training architecture started – before that it was just single GPUs. Although I guess GPUs only started to be used a couple years before that!
I just read Genius Makers, which goes through the last decade of AI history, and it’s quite good. I had a decent awareness of all this stuff, but the book talks about the huge “technology transfer” from Toronto, Montreal, and NYU to Google, DeepMind, Facebook, Microsoft, Baidu, etc. Basically how all the AI labs came together
https://www.amazon.com/Genius-Makers-Mavericks-Brought-Facebook/dp/1524742678
It’s encouraging to see all the open source equivalents of this tech!
Great writeup. Love the “vibes-based search” joke. I am a technical writer (TW) by trade. I have been telling my fellow TWs that, for us docs people, embeddings may be the most important tool to come out of all this LLM hullabaloo, because they make it much more feasible for us to build semantic search systems integrated across all the usual knowledge sources: official docs, unofficial docs, source code, forums, etc.
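The core of it is surprisingly little code, too. Here is a rough sketch of what I mean using the open source sentence-transformers library (the model name and the doc snippets are just placeholders I made up, not anything from the post):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # snippets pulled from different knowledge sources, all embedded the same way
    snippets = [
        "Official docs: how to configure the client retry policy",
        "Forum thread: uploads keep timing out behind a corporate proxy",
        "Code comment: exponential backoff starts at 200ms and doubles",
    ]
    snippet_vecs = model.encode(snippets, normalize_embeddings=True)

    query = "why does my request retry so many times"
    query_vec = model.encode(query, normalize_embeddings=True)

    # cosine similarity between the query and every snippet, best match first
    scores = util.cos_sim(query_vec, snippet_vecs)[0]
    for score, snippet in sorted(zip(scores.tolist(), snippets), reverse=True):
        print(f"{score:.3f}  {snippet}")

In a real docs setup you would chunk the sources and put the vectors in some index, but the ranking idea is exactly this.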
I will add this to my list of solid embeddings explainers. Cohere’s has been my go-to up until now: https://txt.cohere.com/text-embeddings/
oh, good. this is a frequent topic of confusion when ML people need to communicate with non-ML engineers and it’s really nice to have a resource to point people at.