Edit Distance

Patrick Leaton
Problem Description

Given two strings word1 and word2, return the minimum number of operations required to convert word1 to word2.

You have the following three operations permitted on a word:

  • Insert a character
  • Delete a character
  • Replace a character

 

Example 1:

Input: word1 = "horse", word2 = "ros"
Output: 3
Explanation: 
horse -> rorse (replace 'h' with 'r')
rorse -> rose (remove 'r')
rose -> ros (remove 'e')

Example 2:

Input: word1 = "intention", word2 = "execution"
Output: 5
Explanation: 
intention -> inention (remove 't')
inention -> enention (replace 'i' with 'e')
enention -> exention (replace 'n' with 'x')
exention -> exection (replace 'n' with 'c')
exection -> execution (insert 'u')

 

Constraints:

  • 0 <= word1.length, word2.length <= 500
  • word1 and word2 consist of lowercase English letters.

 

The description was taken from https://leetcode.com/problems/edit-distance/.

Problem Solution

#O(M*N) Time, O(M*N) Space
class Solution:
    def minDistance(self, word1: str, word2: str) -> int:
        if len(word1) * len(word2) == 0:
            return len(word1) or len(word2)
 
        dp = [[0 for col in range(len(word1)+1)] for row in range(len(word2)+1)]
 
        for i in range (0, len(word1) +1):
            dp[0][i] = i
        for i in range(0, len(word2) +1):
            dp[i][0] = i
 
        for i in range(1, len(word2) + 1):
            for j in range(1, len(word1) + 1):
                if word2[i-1] == word1[j-1]:
                    dp[i][j] = dp[i-1][j-1]
                else:
                    dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
 
        return dp[-1][-1]

Problem Explanation


If we are given two inputs and we are asked to find if those two inputs can be correlated, what the minimum amount of iterations it takes for one to be the other, or what amount of one exists in the other, things like that, typically that is an indication that Dynamic Programming can be applied.

What we can do is come up with a bottom-up approach by building a dp lookup table from a 2D array, and with that we will tabulate the distance based on the three choices given in the input.

Note: this is a famous algorithm, the Levenshtein distance.


First off, if one string is empty then the distance would be between the empty string and the nonempty string, that distance would be the length of the nonempty string.

If both strings are empty, the length would be zero.

        if len(word1) * len(word2) == 0:
            return len(word1) or len(word2) or 0

 

If one of the strings is empty, the code above will return the nonempty string, if both strings are empty then it will return zero.

Let's create our dp table.

        dp = [[0 for col in range(len(word1)+1)] for row in range(len(word2)+1)]

 

The code above translates to "we want zeros in each column for as many characters are in the first word, and we will want as many rows as there are characters in the second word".

Next, we will need to satisfy our base case.

Our base case would be the distance from an empty string.

  "_ l   u   n   a   r  
"_"  0                 
           
u            
n            
a            

 

If we have an empty string, then the distance from the empty string to the nonempty string would be the length of the nonempty string.  We will go ahead and fill in those values now.

Let's fill the first row.

  "_ l  
"_"  0   1   2   3   4   5  
           
u            
n            
a            
 
        for i in range (0, len(word1) +1):
            dp[0][i] = i

 

Let's fill the first column.

  "_ l  
"_"  0   1   2   3   4   5  
1             
u 2             
n 3             
a 4             
 
        for i in range(0, len(word2) +1):
            dp[i][0] = i

 

Now here is where we start getting into the subproblems.

The term "subproblems" gets thrown around a lot, but many don't elaborate on what they mean.  If we take, for example, the strings "tuna" and "lunar", if we wanted to find the edit distance between them then we can solve the following subproblems in this order:

  1. "t" vs "lunar",
  2. "tu" vs "lunar",
  3. "tun" vs "lunar",
  4. "tuna" vs "lunar".

 

To achieve this, we will make one loop to traverse the first word with an iterator, i, and an inner loop to traverse the second word with an iterator, j.  This will allow us to visit each cell in the dp table shown in the previous step.

        for i in range(1, len(word2) + 1):
            for j in range(1, len(word1) + 1):

 

Just like other DP problems, within each iteration, we are going to use subproblem answers from previous iterations to justify our current subproblem answer before marking it into our cell and continuing.

If the two characters we are comparing are the same, then we will take the value from the top-left cell because that would be the edit distance of the current substring without the current character that was already calculated during the iteration of the top-left index.  In other words, the edit distance would be zero for these two characters.

                if word2[i-1] == word1[j-1]:
                    dp[i][j] = dp[i-1][j-1]

 

If the two characters we are comparing are not the same, we will be faced with a subproblem that requires a few choices.

The question posed within each subproblem, each substring, is should we either:

  • Insert a character
  • Delete a character
  • Replace a character

The choices will be within the top, top-left, and left indices during each iteration as follows:

Replace    Insert   
  Delete  Current 

 

We will take the best bet from the three neighbors and add one to it to include the current cell.

Let's begin filling in the matrix and we'll point out when each solution would be the best pick, besides when they have the minimum value of course.

Right off the bat, refer to the table below.  Notice that the best option is to replace a character in the top-left index, dp[i-1][j-1].  This makes sense because if we wanted to edit one of the strings to become the other, we would need to swap a "t" for an "l".

  "_ l  
"_" 0    1   2   3   4   5  
1             
u 2             
n 3             
a 4             

 

 

Let's start filling in some more of these boxes. 

We would continue to choose the replace option for the rest of the row since there is no "t" in "lunar".  

  "_ l  
"_" 0    1   2   3   4   5  
1    1   2   3   4   5  
u 2          
n 3             
a 4             

 

 

Afterward, we would move to the "u" row.

  "_ l  
"_" 0    1   2   3   4   5  
1    1 2   3   4 5
u 2    2 1        
n 3             
a 4             
 

In the iteration where we are comparing "lun" to "tu", we would find that the delete option, dp[i][j-1] is the best.  This makes sense because if we deleted the letter "n", then the substrings would be closer to each other, "lu" and "tu".

We would continue to choose the delete option for the rest of the row because from our current position, to get "lunar" closer to "tu", we would need to delete "nar".  

 

 

Eventually, we will get to this position where are comparing "lu" to "tun".  In this position, we would see that the best option would be to delete,  dp[i-1][j].  This makes sense because we would have to add an "n" to "lu" to get it closer to "tun".

  "_ l  
"_" 0    1   2   3   4   5  
1    1 2   3   4 5
u 2    2   1   2 3 4
n 3    3          
a 4             

 

 

We would continue to either take the top-left value if we get two equal characters or adding one to the minimum value of the three left neighboring cells until we get our final answer written in the last index.

To make "lunar" into "tuna", we would have to drop the "r" and replace the "l", so the edit distance would be two.

  "_ l  
"_" 0    1   2   3   4   5  
1    1 2   3   4 5
u 2    2   1   2 3 4
n 3    3   2 1 2 3
a 4    4 3 2 1 2  

 

                else:
                    dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
        return dp[-1][-1]