and definitely don’t need deep learning – to find them

word2vec does not use deep learning, just two matrices (word/context matrices), matrix multiplication, and softmax. Since there is a large number of classes (the vocabulary), hierarchical softmax or softmax with negative sampling is applied. There are no non-linearities, let alone multiple non-linearities, hence word2vec is not deep learning.

It’s a hell of a lot more intuitive & easier to count skipgrams, divide by the word counts to get how ‘associated’ two words are and SVD the result than it is to understand what even a simple neural network is doing.
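For concreteness, the counting pipeline the article describes can be sketched in a few lines of NumPy — count skip-gram pairs, turn counts into a (positive) PMI matrix, and take the SVD. This is a toy illustration on a made-up corpus, not the article's actual code:

```python
import numpy as np
from collections import Counter

# Toy corpus; a real run would use a large tokenized corpus.
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# 1. Count (word, context) pairs within the window.
pair_counts = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            pair_counts[(w, corpus[j])] += 1

# 2. Build a PPMI matrix: max(0, log P(w,c) / (P(w)P(c))).
total = sum(pair_counts.values())
word_counts, ctx_counts = Counter(), Counter()
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    ctx_counts[c] += n

M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    pmi = np.log((n * total) / (word_counts[w] * ctx_counts[c]))
    M[idx[w], idx[c]] = max(0.0, pmi)

# 3. SVD; rows of U * sqrt(S), truncated to `dim`, are the word vectors.
U, S, Vt = np.linalg.svd(M)
dim = 4
vectors = U[:, :dim] * np.sqrt(S[:dim])
```

That said, "intuitive" cuts both ways — see below.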

I don’t know where to start:

First of all, this is nothing new. SVD on word-word co-occurrence matrices was proposed by Schütze in 1992 ;). There have been many works since then exploring various co-occurrence measures (including PMI) in combination with SVD.

What word2vec is doing is pretty simple to understand: the skip-gram model is a simple linear classifier that predicts the context given a word. In this classifier every word and every context word is represented as a weight vector. The word embeddings are just the trained weight vectors (and/or context vectors) of every word.
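To make "simple linear classifier" concrete, here is a minimal sketch of the skip-gram forward pass with a full softmax (no negative sampling, hypothetical toy sizes): a dot product between a word vector and every context vector, then a softmax — no hidden non-linearity anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 5

W = rng.normal(scale=0.1, size=(vocab_size, dim))  # word (input) vectors
C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context (output) vectors

def context_probs(word_id):
    # Linear scores: the word vector dotted with every context vector,
    # followed by a softmax over the vocabulary.
    scores = C @ W[word_id]
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = context_probs(3)  # distribution over context words for word 3
```

After training, the rows of W (and/or C) are the word embeddings.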

People use word2vec over PMI+SVD because word2vec vectors tend to be better at analogy tasks (see e.g. Levy & Goldberg, 2014).

Levy & Goldberg, 2014 have shown that word2vec (skip-gram) performs a matrix factorization of a shifted PMI matrix.
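Concretely, their result is that SGNS with $k$ negative samples implicitly factorizes a word-context matrix whose cells are the PMI shifted by $\log k$:

```latex
W C^\top \approx M, \qquad M_{ij} = \mathrm{PMI}(w_i, c_j) - \log k
```

where the rows of $W$ are the word vectors and the rows of $C$ the context vectors.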

There are newer co-occurrence based methods, such as GloVe that are more well-founded than PMI-SVD. Moreover, GloVe’s training times are typically shorter than word2vec’s.

The approach outlined here isn’t exactly equivalent, but it performs about the same as word2vec skip-gram with negative sampling (SGNS).

word2vec is O(n) where n is the corpus length. SVD is O(mn^2) for an m x n matrix. So, ‘it performs about the same’ is only true for particular corpus sizes/vocabularies.

So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.

In the end, the author does not really give much rationale for this. I would argue that PMI-SVD is not much simpler than word2vec, but even if it were, there are good off-the-shelf implementations of word2vec (Mikolov’s, gensim, etc.) that one can use. We don’t switch to Minix in production because it’s simpler to understand than Linux or OpenBSD ;). Also, the training time of word2vec is not really a problem in practice - usually a couple of hours (depending on the corpus size) and you typically only have to do the training once.

This article seems a bit uninformed.
