Monday, May 10, 2010

Longest common subsequence problem

Longest common sub sequence problem is to find the longest
common sub sequence(LCS) common to all sub sequences given. Here
we will discuss how to find the LCS of two sequences (we
will use simple string for our convenience). This can be easily extended to more than
two sequences.

The basic principle is simple.
Suppose we have sequences S1 and S2 and,
1. If their last two characters are the same, then we are
100% sure that the last two characters will be always part
of the LCS. So, the LCS will be = (LCS of the sub sequences
excluding the last two characters) + (last characters which
is same for both the sequence).

2. If the last two characters are not matching, then both of
them cannot be in the LCS of the two sequences. But it is
possible that one of them may be in the LCS. Whichever
character results in maximum length for the common sequence
is taken as then only we will get the LCS. Suppose we are
examining ith and jth characters of the two sequences, then
if LCS of S1[1..j] and S2[1..i - 1] is longer than that of
LCS of S1[1..j-1] and S2[1..i], then S1[j] is added to the
sub sequence.

Let LCS(i,j) denotes the value of LCS length at position i
and j in S1 and S2 respectively. Dynamic programming comes
handy here. We initialize a 2d array of size n1 x n2 to
zero all. At each point i and j in S1 and S2 respectively,
LCS(i,j) = (LCS(i-1,j-1)  added with S1[i])  if S1[i] == S2[i],
else,
LCS(i,j) = max(LCS(i-1,j), LCS(i,j-1))
the array element lcs[i][j] will hold the value for LCS(i,j);
lcs[n1-1, n2 -1] will ultimately hold maximum LCS length.

below is the implementation of a C routine to compute LCS
for two strings str1 and str2 of lengths length1 and length2
respectively.

int compute_lcs(
  char *str1, unsigned int length1,
  char *str2, unsigned int length2
  )
{
  /// No checking for validation of input params, hope if
  // you use this code, you will add them 

  // allocate space for a matrix of size length1 x length2
  // and intialize them to zero
  int i, j, max_lcs;
  int **lcs = (int **) malloc(sizeof(int *) * length1 );
  
  for (i = 0; i < length1; i++) {
    lcs[i] = calloc(1, sizeof(int) * length2);
  }
  for (i = 0; i < length1; i++) {
    for (j = 0; j < length2; j++) {
      if (str1[i] == str2[j]) {
        if (i == 0 || j == 0)
          lcs[i][j] = 1;
        else
          lcs[i][j] = lcs[i-1][j-1] + 1; 
      } else {
        if (i) {
          if (j) {
            lcs[i][j] = lcs[i-1][j];
            if (lcs[i][j-1] >  lcs[i][j]) {
              lcs[i][j] = lcs[i][j-1];
            }
          } else {
            lcs[i][j] = lcs[i-1][j];
          }
        }
        if (j) {
          if (i) {
            lcs[i][j] = lcs[i-1][j];
            if (lcs[i][j-1] >  lcs[i][j]) {
              lcs[i][j] = lcs[i][j-1];
            }
          } else {
            lcs[i][j] = lcs[i][j-1];
          }
        }
      }
    }
  }

  max_lcs = lcs[length1 -1][length2 - 1];
  for (i = 0; i < length1; i++) {
    free(lcs[i]);
  }
  free(lcs);

  return max_lcs;
}

Understanding Levenshtein distance computing algorithm

Levenshtein distance is the amount of differences between two sequences. It is also called edit distance.Levenshtein distance is the minimum number of edits required to convert one string to another string. For example,the edit distance between strings "a" and "b" is 1. If you want to convert string "a" to "ab", the edit distance is 1.
To convert string "ab" to "ax" is also, the edit distance is 1.

The edit operation will include
1. insert of chars
2. delete of chars
3. substitution of chars

The algorithm is well known Wagner-Fisher algorithm.

While converting "a" to "ax" , we do insert of "x" at the end of the source string. So, edit distance is 1.
While converting "ab" to "ax", we substitute "b" with "x" and hence edit distance here is 1.
While converting "aqbc" to "abcx", we 1. delete char "q" and 2. add char "x" to the end. Hence edit distance here is 2.

Now how do we calculate the edit distance?
Suppose we are considering only insertion,deletion and substitution of characters (no transposition).We use dynamic programming to calculate the edit distance.

Let us say, S is the array containing the source string and T is the array where we will put the target string.
Let us say, ns is numbers of characters in S and nt is the number of characters in T.

Let us number the positions of the array from 1 to ns and 1 to nt respectively.
Suppose we are examining T[i] and S[j] chars. We already "know" edit distance for substrings S[1..j] and T[1..i-1], and
for substrings S[1..j-1] and T[1..i-1], and for S[1..j-1] and T[1..i].

1. Suppose, S[j] same as T[i], then edit distance involving S[1..j] and T[1..i] will be same sas with S[1..j-1] and
T[1..i-1].
2. Suppose S[j] is not equal to T[i], then
e1 = edit distance of substrings involving S[1..j-1] and T[1..i]
e2 = edit distance of substrings involving S[1..j] and T[1..i-1]
e3 = edit distance of substrings involving S[1..j-1] and T[1..i-1]

Among e1,e2,e3:
if e1 is the smallest, then edit distance of substrings S[1..j] and T[1..i] = e1 + 1. Here it is equivalent of deletion
of char S[j] in the source string.

if e2 is the smallest, then edit distance of substrings S[1..j] and T[1..i] = e2 + 1. Here it is equivalent of
insertion of char T[i] in the source string to form the target string.

if e3 is the smallest, then edit distance of substrings S[1..j] and T[1..i] = e3 + 1. Here it is equivalent of
substituting S[j] with T[i] to form the target string.


Below, I have given the c code implementation of the above.



int calculate_Levenshteindistance_distance(
char *source_string,
int source_length,
char *target_string,
int target_length
)
{
// No checking, hope for your tests you will
// pass proper
// values
int i, j, e1,e2,e3;
int **matrix_values;

matrix_values = (int **)
calloc(1, sizeof(int *) * (target_length + 1));

for (i = 0; i <= target_length ; i++) {
matrix_values[i] = (int *)
calloc(1, sizeof(int) * (source_length + 1));
}

// Here is the key ... initialize the column 0 of
// the matrix with the maximum possible edit distances
// for target with length
// 0, 1,2,3,......target_length when the source string
// is of length 0.
// Similarly initialize the 0th row of the matrix.
for (i = 0; i <= target_length ; i++)
matrix_values[i][0] = i;
for (j = 0; j <= source_length ; j++)
matrix_values[0][j] = j;

for (i = 1; i <= target_length ; i++) {
for (j=1; j <= source_length; j++) {
if (source_string[j - 1] == target_string[i - 1]) {
matrix_values[i][j] = matrix_values[i-1][j-1];
continue;
}
e1 = matrix_values[i][j-1];
e2 = matrix_values[i-1][j];
e3 = matrix_values[i-1][j-1];
if (e1 > e2 )
e1 = e2;
if (e1 > e3 )
e1 = e3;
matrix_values[i][j] = e1 + 1;
}
}

return matrix_values[target_length][source_length];
}