I was listening to an interesting lecture on Natural Language Processing (NLP) recently. The talk mentioned the cosine similarity of two sentences. I hadn’t used cosine similarity in a long time so I thought I’d refresh my memory.
In general, the cosine similarity of two vectors is a number between -1.0 and +1.0, where a value of +1.0 means the two vectors point in exactly the same direction. A value of -1.0 means the two vectors point in exactly opposite directions.
For example, if vector v0 = (3, 5) and vector v1 = (4, 2) and vector v2 = (-3, -5), then:
CosSim(v0, v2) = -1.0
CosSim(v0, v0) = +1.0
CosSim(v0, v1) = 0.84
Suppose v0 = (x, y) and v1 = (s, t). The cosine similarity is:
((x*s) + (y*t)) / ( sqrt(x^2 + y^2) * sqrt(s^2 + t^2) )
Here I’ve used vectors with two values each for simplicity but the cosine similarity can be applied to two vectors with any number of values.
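The formula translates directly into a few lines of Python. Here's a minimal sketch (the function name cos_sim is mine, not from the lecture) that works for vectors of any length and reproduces the example values above:

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v0 = (3, 5)
v1 = (4, 2)
v2 = (-3, -5)
print(cos_sim(v0, v2))            # approximately -1.0
print(cos_sim(v0, v0))            # approximately +1.0
print(round(cos_sim(v0, v1), 2))  # 0.84
```

Note that floating point rounding means the "same vector" case may print something like 0.9999999999999999 rather than exactly 1.0.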
The cosine similarity can be applied to two sentences where the vector values represent word counts. For example, suppose:

s0 = "This is sentence example sentence one"
s1 = "This example sentence has six words"
The distinct words in both sentences are (this, is, sentence, example, one, has, six, words). The two vectors are the counts of each distinct word per sentence:
v0 = (1, 1, 2, 1, 1, 0, 0, 0)
v1 = (1, 0, 1, 1, 0, 1, 1, 1)
Kind of weird. When used this way, because all the vector values are word counts, there aren't any negative values, so the cosine similarity will be between 0.0 and +1.0.
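The word-count scheme above can be sketched end to end. This is my own illustration (sentence_vectors and cos_sim are hypothetical helper names, and I assume simple lowercased whitespace tokenization), not code from the lecture:

```python
import math
from collections import Counter

def sentence_vectors(s0, s1):
    """Word-count vectors over the combined vocabulary of two sentences."""
    w0 = s0.lower().split()
    w1 = s1.lower().split()
    vocab = list(dict.fromkeys(w0 + w1))  # distinct words, first-seen order
    c0, c1 = Counter(w0), Counter(w1)
    return [c0[w] for w in vocab], [c1[w] for w in vocab]

def cos_sim(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v0, v1 = sentence_vectors("This is sentence example sentence one",
                          "This example sentence has six words")
print(v0)                         # [1, 1, 2, 1, 1, 0, 0, 0]
print(v1)                         # [1, 0, 1, 1, 0, 1, 1, 1]
print(round(cos_sim(v0, v1), 4))  # 0.5774
```

Working the numbers by hand: the dot product is 1 + 0 + 2 + 1 + 0 + 0 + 0 + 0 = 4, and the norms are sqrt(8) and sqrt(6), so the similarity is 4 / sqrt(48), which is about 0.577.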