How Close are Two Strings? Part II

Let’s continue looking at the problem of determining how close two strings are. If the two strings have equal lengths, then the Hamming distance is a good answer. But this technique does not work for the general case. One answer is to compute the Levenshtein distance. The Levenshtein distance between two strings is the minimum number of operations necessary to convert one string into the other, where an operation is an insertion, deletion, or substitution of a single character. For example, if:

 

A = cat

B = cots

 

then the Levenshtein distance is 2. Starting with "cat", you must change the ‘a’ to ‘o’, and then add an ‘s’. There is a very cool visual algorithm to compute the Levenshtein distance. I’ll walk you through it. First construct this matrix:


 

         c   a   t

    0   1   2   3

c  1

o  2

t  3

s  4

 

The short string goes on top, the longer sting goes vertical and each letter gets a 1-based index value.

 

Now, working by columns, from top to bottom, for each cell in the matrix, first assign a 0 if the corresponding characters match, and a 1 if the characters do not match. Then adjust this temp value by assigning the minimum of these three values:

 

1. the value in the cell above plus 1.

2. the value in the cell to the left plus 1.

3. the value in the cell to the upper left plus the temp value.

 

So the first cell in the example above gets a temp value of 0 (because the ‘c’ characters match) then gets modified to 0 (because condition #3 holds). The first column ends up like this:

 

         c   a   t

    0   1   2   3

c  1   0

o  2   1

t  3   2

s  4   3

 

The first cell in the second column gets a temp value of 1 because ‘a’ does not equal ‘c’. The temp value gets modified to 1 because condition #2 holds. The second column ends up like this:

 

         c   a   t

    0   1   2   3

c  1   0   1

o  2   1   1

t  3   2   2

s  4   3   3

 

In the same way, the last column becomes:

 

         c   a   t

    0   1   2   3

c  1   0   1   2

o  2   1   1   2

t  3   2   2   1

s  4   3   3   2

 

Now, the Levenshtein distance is the value in the lower right corner, 2. This is an interesting algorithm and one which can be easily implemented.

This entry was posted in Software Test Automation. Bookmark the permalink.