Longest Common Subsequence

Patrick Leaton

Published: Nov. 27, 2021, 6:57 p.m.

Problem Description

Given two strings text1 and text2, return the length of their longest common subsequence. If there is no common subsequence, return 0.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

For example, "ace" is a subsequence of "abcde".

A common subsequence of two strings is a subsequence that is common to both strings.

Example 1:

Input: text1 = "abcde", text2 = "ace" 
Output: 3  
Explanation: The longest common subsequence is "ace" and its length is 3.

Example 2:

Input: text1 = "abc", text2 = "abc"
Output: 3
Explanation: The longest common subsequence is "abc" and its length is 3.

Example 3:

Input: text1 = "abc", text2 = "def"
Output: 0
Explanation: There is no such common subsequence, so the result is 0.

Constraints:

1 <= text1.length, text2.length <= 1000
text1 and text2 consist of only lowercase English characters.

The description was taken from https://leetcode.com/problems/longest-common-subsequence/.

Problem Solution

#O(M*N) Time, O(M*N) Space

class Solution:

    def longestCommonSubsequence(self, text1: str, text2: str) -> int:

        m, n = len(text1), len(text2)

        if not m or not n:

            return 0

        

        dp = [[0 for col in range(n+1)] for row in range(m+1)]

        

        for row in range(1, m+1):

            for col in range(1, n+1):

                if text1[row-1] == text2[col-1]:

                    dp[row][col] = dp[row-1][col-1] + 1

                else:

                    dp[row][col] = max(dp[row][col-1], dp[row-1][col-1], dp[row-1][col])

        

        return dp[-1][-1]

Problem Explanation

This problem is very similar to Edit Distance and Maximum Length of Repeated Subarray.

Similar to those classic Dynamic Programming problems, if we are given two strings or lists then asked to find the count of a specific relationship within these two inputs, that is typically an indication that DP can be applied.

Another requirement is overlapping subproblems, which this problem has.

In order to find the longest common subsequence between optimally, for example, "abcde" and "abc", we would first need to find the longest common subsequence between:

" " and "abc",

"a" and "abc",

"ab" and "abc",

"abc" and "abc",

"abcd" and "abc",

"abcde" and "abc".

Each subproblem answer from the previous longest common subsequence would be used within the current subproblem to avoid repeating the same calculation.

So how do we solve each subproblem?

If we are given two string or list inputs as mentioned previously, typically a DP cache is built in the form of a grid so that each element can be cross-referenced to distinguish some kind of relationship.

	" "	a	b	c
" "
a
b
c
d
e

The relationship we will identify is if the current character at the current row and current column is the same as we are traversing the cache, then we will be able to place the current subproblem's substring into the LCS carried over from a previous subproblem, incrementing the longest common subsequence by one.

Otherwise, we will carry over the greatest LCS we had calculated up until the current cell.

	" "	a	b	c
" "	0	0	0	0
a	0	1	1	1
b	0	1	2	2
c	0	2	2	3
d	0	2	2	3
e	0	2	2	3

Note: Here we have a contiguous subsequence for illustration purposes but "abcde" and "fazbvc" would yield the same result.

Let's start by saving the length of our two words which will end up being the length of our non-base-case rows and columns for our cache.

m, n = len(text1), len(text2)

If one of the lengths is zero, then there couldn't possibly be a common subsequence between these two strings so we will return zero.

if not m or not n: return 0

Otherwise, we will create our dp cache as shown at the beginning of the explanation.

We'll need an additional zero column and row for the base case subproblem that either one of these strings is null.

dp = [[0 for col in range(n+1)] for row in range(m+1)]

We will then have two pointers, one for each iteration through both arrays, the first will expand the partition of solved sub-problems, the second will solve the subproblem of the current partition.

Each additional row is a new subproblem with the inclusion of an additional character.

for row in range(1, m+1): for col in range(1, n+1):

Here is where we will solve each subproblem.

If the additional character we are adding to the previous subproblem is equal to the current character within the comparison string, we will take the top-left cell's subproblem answer and add one to it.

What is this doing?

	" "	a	b	c
" "	0	0	0	0
a	0	1	1	1
b	0	1	2	2
c	0	2	2	3
d	0	2	2	3
e	0	2	2	3

This is excluding the current subproblem to build on a previous one. If we have two of the same characters, then the answer is going to be what we calculated without these current characters within the previous subproblem, incremented by one.

If we have "abc" and "abc", then the answer is going to be what the answer was for "ab" and "ab", plus the current "c".

If we find two of the same characters, we can increment the previously solved LCS.

if text1[row-1] == text2[col-1]: dp[row][col] = dp[row-1][col-1] + 1

Otherwise, we will take the maximum between the current cells' three previous neighbors.

This part is similar to Edit Distance because, in order for us to transform one subsequence into another, we would either have to delete, insert, or replace the current character if it is mismatched. We would pick these choices from choosing the best option from the three surrounding cells.

else: dp[row][col] = max(dp[row][col-1], dp[row-1][col-1], dp[row-1][col])

Once we have finished filling our cache, our overall answer will be the value that was either carried to the final index or was incremented there if the two suffices were the same.

return dp[-1][-1]