“Cosine Similarity tends to determine how similar two words or sentence are, It can be used for Sentiment Analysis, Text Comparison
and being used by lot of popular packages out there like word2vec.”

Wouldn’t any distance metric do? As long as you choose the right vector space?

I was under the impression cosine similarity makes it easy to batch computation with highly tuned matrix multiplications

Cosine similarity is not actually a metric and I think that is why people use it. Showing it is not a metric is easy because for metric spaces the only points that are zero distance away from another point are the points themselves. Cosine similarity in that sense fails to be a metric because for any given vector there are infinitely many vectors orthogonal to it and hence “zero” distance away. (But I just realized it’s even simpler than that because cosine similarity also gives negative values so it fails the positivity test as well.)

The relation to matrices that you mentioned are about positive definite bilinear forms and those give rise to dot products that are expressible as matrix multiplications and in vector spaces there is a way to define a metric based on dot products by defining the metric to be the dot product of the difference between two vectors with itself. Following through the logic the positive definite condition ends up being what is required to make this construction a metric.

This is not really the problem. People convert cosine similarly into a pseudo distance by taking 1 - cos(u,v). This solves the problems that you mention.

The true problem is that the cosine is a non-linear function of the angle between two vectors, this violates the triangle inequality . Consider the vectors a and c with an angle of 90 degrees. Their cosine pseudo-distance is 1. Now add a vector b that has an angle of 45 degrees of both a and c. The cosine pseudo-distances between a and b, b and c are 0.29 rounded. So, the distance from a to c via b is shorter than the distance from a to c.

Even with the pseudo-distance you still have the same problem with zero distances, as 1 - cos(u,v) is zero whenever the two points are tau radians apart.

In most vector space models, these would be considered to be vectors with the same direction, so 0 would be the correct distance. Put differently, the maximum angle between two vectors is 180 degrees.

The Wikipedia article has a good explanation I think. When working with tf-idf weights the vector coefficients are all positive so cosine similarity ends up being a good enough approximation of what you’d want out of an honest metric that is also easy to compute because it only involves dot products. But I’m no expert so take this with a grain of salt.

So I take back what I said about using it because it’s not a metric. I was thinking it has something to do with clustering by “direction” and there is some of that but it’s also pretty close to being a metric so it seems like a good compromise when trying to do matching and clustering types of work where the points can be embedded into some vector space.

I was under the impression cosine similarity makes it easy to batch computation with highly tuned matrix multiplications

The same applies to Euclidean distance when computed with the law of cosines. Since the dot product is the cosine of the angle between two vectors, scaled by the vector magnitudes, the squared Euclidean distance between two points is:

|u|^2 + |v|^2 - 2u·v

(|u|^2 is the squared L2 norm of u, equations are a bit hard to do here)

Computing the euclidean distance in this way especially pays off when you have to compute the euclidean distance between a vector and a matrix or two matrices. Then the third term is a matrix multiplication UV, which as you say, is very well-optimized in BLAS libraries. The first two terms are negligible (lower order of complexity).

One of the reasons that cosine similarity is popular in information retrieval is that it is not sensitive to vector magnitudes. Vector magnitudes can vary quite a bit with most common metrics (term frequency, TF-IDF) because of varying document lengths.

Consider e.g. two documents A and B, where document B is document A repeated ten times. We would consider these documents to have exactly the same topic. With a term frequency or TF-IDF vector, the vectors of A and B would have the same direction, but different lengths. Since cosine similarity measures just the angle between the vectors, it correctly tell us that the documents have the same topic, whereas a metric such as Euclidean distance would indicate a large difference due to the different vector magnitudes. Of course, Euclidean distance could be made to work by normalizing document (and query) vectors to unit vectors.

I don’t want to bash this article too much, but it is not a very good or accurate description of the vector model of information retrieval. Interested readers are better served by e.g. the vector model chapter from Manning et al.:

“Cosine Similarity tends to determine how similar two words or sentence are, It can be used for Sentiment Analysis, Text Comparison and being used by lot of popular packages out there like word2vec.”

Wouldn’t any distance metric do? As long as you choose the right vector space?

I was under the impression cosine similarity makes it easy to batch computation with highly tuned matrix multiplications

Cosine similarity is not actually a metric and I think that is why people use it. Showing it is not a metric is easy because for metric spaces the only points that are zero distance away from another point are the points themselves. Cosine similarity in that sense fails to be a metric because for any given vector there are infinitely many vectors orthogonal to it and hence “zero” distance away. (

But I just realized it’s even simpler than that because cosine similarity also gives negative values so it fails the positivity test as well.)The relation to matrices that you mentioned are about positive definite bilinear forms and those give rise to dot products that are expressible as matrix multiplications and in vector spaces there is a way to define a metric based on dot products by defining the metric to be the dot product of the difference between two vectors with itself. Following through the logic the positive definite condition ends up being what is required to make this construction a metric.

This is not really the problem. People convert cosine similarly into a pseudo distance by taking 1 - cos(u,v). This solves the problems that you mention.

The true problem is that the cosine is a non-linear function of the angle between two vectors, this violates the triangle inequality . Consider the vectors a and c with an angle of 90 degrees. Their cosine pseudo-distance is 1. Now add a vector b that has an angle of 45 degrees of both a and c. The cosine pseudo-distances between a and b, b and c are 0.29 rounded. So, the distance from a to c via b is shorter than the distance from a to c.

Even with the pseudo-distance you still have the same problem with zero distances, as 1 - cos(u,v) is zero whenever the two points are tau radians apart.

In most vector space models, these would be considered to be vectors with the same direction, so 0 would be the correct distance. Put differently, the maximum angle between two vectors is 180 degrees.

Good point. I didn’t think about the triangle inequality failure.

Thanks! Why would people want that (not actually being a metric)?

The Wikipedia article has a good explanation I think. When working with tf-idf weights the vector coefficients are all positive so cosine similarity ends up being a good enough approximation of what you’d want out of an honest metric that is also easy to compute because it only involves dot products. But I’m no expert so take this with a grain of salt.

So I take back what I said about using it because it’s not a metric. I was thinking it has something to do with clustering by “direction” and there is some of that but it’s also pretty close to being a metric so it seems like a good compromise when trying to do matching and clustering types of work where the points can be embedded into some vector space.

I was under the impression cosine similarity makes it easy to batch computation with highly tuned matrix multiplicationsThe same applies to Euclidean distance when computed with the law of cosines. Since the dot product is the cosine of the angle between two vectors, scaled by the vector magnitudes, the squared Euclidean distance between two points is:

|u|^2 + |v|^2 - 2u·v(

|u|^2is the squared L2 norm ofu, equations are a bit hard to do here)Computing the euclidean distance in this way especially pays off when you have to compute the euclidean distance between a vector and a matrix or two matrices. Then the third term is a matrix multiplication

UV, which as you say, is very well-optimized in BLAS libraries. The first two terms are negligible (lower order of complexity).One of the reasons that cosine similarity is popular in information retrieval is that it is not sensitive to vector magnitudes. Vector magnitudes can vary quite a bit with most common metrics (term frequency, TF-IDF) because of varying document lengths.

Consider e.g. two documents A and B, where document B is document A repeated ten times. We would consider these documents to have exactly the same topic. With a term frequency or TF-IDF vector, the vectors of A and B would have the same direction, but different lengths. Since cosine similarity measures just the angle between the vectors, it correctly tell us that the documents have the same topic, whereas a metric such as Euclidean distance would indicate a large difference due to the different vector magnitudes. Of course, Euclidean distance could be made to work by normalizing document (and query) vectors to unit vectors.

I don’t want to bash this article too much, but it is not a very good or accurate description of the vector model of information retrieval. Interested readers are better served by e.g. the vector model chapter from Manning et al.:

https://nlp.stanford.edu/IR-book/pdf/06vect.pdf

I must have been confusing the two