This article seems a bit uninformed.
> and definitely don’t need deep learning – to find them
word2vec does not use deep learning: it is just two matrices (a word matrix and a context matrix), matrix multiplication, and a softmax. Since the number of classes is large (the whole vocabulary), hierarchical softmax or negative sampling is used to approximate the full softmax. There are no non-linear hidden layers, let alone a stack of them, hence word2vec is not deep learning.
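To make that concrete, here is a minimal numpy sketch of a single skip-gram negative-sampling update (the variable names and toy dimensions are mine, and real implementations add subsampling and a smoothed unigram noise distribution): the whole "model" is two matrices and a sigmoid over dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100               # toy vocabulary size, embedding dimension
W = rng.normal(0, 0.1, (V, d))   # word (input) matrix
C = rng.normal(0, 0.1, (V, d))   # context (output) matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context, negatives, lr=0.025):
    """One SGNS update: maximize log sigmoid(w.c) + sum log sigmoid(-w.c_neg).
    Note there is no hidden layer anywhere."""
    w = W[word]
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0
    scores = sigmoid(C[ids] @ w)   # dot products plus a sigmoid, that's it
    grad = scores - labels         # gradient of the logistic loss
    W[word] -= lr * (grad @ C[ids])
    C[ids]  -= lr * np.outer(grad, w)

# e.g. one (word, context) pair with 5 sampled negatives:
sgns_step(word=42, context=7, negatives=rng.integers(0, V, 5))
```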
> It’s a hell of a lot more intuitive & easier to count skipgrams, divide by the word counts to get how ‘associated’ two words are and SVD the result than it is to understand what even a simple neural network is doing.
I don’t know where to start:
> The approach outlined here isn’t exactly equivalent, but it performs about the same as word2vec skipgram negative-sampling SGNS.
word2vec is O(n), where n is the corpus length. A full SVD is O(mn^2) for an m x n matrix. So ‘it performs about the same’ is only true for particular corpus sizes and vocabularies.
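For reference, the pipeline the article proposes boils down to something like this (a minimal sketch, assuming a tokenized corpus and a fixed vocabulary; the PPMI clipping and all names are mine), and note that the matrix handed to the SVD is |V| x |V|, which is exactly where that cost comes from:

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def pmi_svd_vectors(corpus, vocab, window=2, dim=100):
    """Count skip-grams, turn counts into PMI, take a truncated SVD."""
    idx = {w: i for i, w in enumerate(vocab)}
    pair_counts = Counter()
    for sent in corpus:
        ids = [idx[t] for t in sent if t in idx]
        for i, w in enumerate(ids):
            for c in ids[max(0, i - window):i] + ids[i + 1:i + 1 + window]:
                pair_counts[(w, c)] += 1
    # marginal counts derived from the pair counts
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    total = sum(pair_counts.values())
    rows, cols, vals = [], [], []
    for (w, c), n in pair_counts.items():
        # PMI = log p(w,c) / (p(w) p(c)); keeping only positive values
        # (PPMI) so the matrix stays sparse
        pmi = np.log(n * total / (w_counts[w] * c_counts[c]))
        if pmi > 0:
            rows.append(w); cols.append(c); vals.append(pmi)
    M = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(vocab)))
    U, S, _ = svds(M, k=dim)      # truncated SVD of the |V| x |V| PMI matrix
    return U * np.sqrt(S)         # symmetric scaling of the left vectors
```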
> So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.
In the end, the author does not really give much rationale for this. I would argue that PMI-SVD is not much simpler than word2vec, but even if it were, there are good off-the-shelf implementations of word2vec (Mikolov’s, gensim, etc.) that one can use. We don’t switch to Minix in production because it’s simpler to understand than Linux or OpenBSD ;). Also, the training time of word2vec is not really a problem in practice - usually a couple of hours (depending on the corpus size), and you typically only have to train once.
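To illustrate the off-the-shelf point: training skip-gram negative-sampling vectors with gensim takes a few lines (parameter names as of gensim 4.x; the toy corpus is obviously a placeholder for real data):

```python
from gensim.models import Word2Vec

# placeholder corpus: any iterable of token lists will do
sentences = [["stop", "using", "word2vec"],
             ["count", "skipgrams", "and", "svd", "the", "pmi", "matrix"]] * 100

model = Word2Vec(sentences, vector_size=50, window=5,
                 sg=1,          # skip-gram (sg=0 is CBOW)
                 negative=5,    # negative sampling
                 min_count=1, workers=4)

print(model.wv.most_similar("word2vec", topn=3))
```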