Longest Common Subsequence

Posted by Richard-Burd on November 1, 2020

Until very recently my illustrations have focused on stack architecture and reference guides.  My latest endeavor involves the creation of visual aids that work towards solving the toughest algorithmic problems, that is, I am not looking here to illustrate large code bases involving APIs & multiple stacks, but rather, to illustrate a single algorithm that must solve some sort of difficult problem…and to date, I haven’t really done anything like this.  I am still in the process of creating such a system so I have not yet arrived at a final destination, stay tuned for the final product in a future post. 

Okay, so l only go big or go home, therefore we’re gonna start with a tough coding problem called the longest common subsequence, or ‘LCS’ for short.  Apparently Google uses LCS in hiring interviews so it is an ideal place to start trying to figure out how to graphically visualize complex algorithms.  The basic idea it to take two inputs, what we’ll call string1 and string2, and find the longest sequence of characters that are common to both of these values.  In this example, I will be working in JavaScript and the input values will look like this:

const string1 = "rqvjtweyrztuyio";
const string2 = "qzxwevrtbyrw";

There is a live REPL here and a Git repo here for reference, and both use the aforementioned input values. 

My goal is to create an algorithm that is similar to this Dynamic Programming solution.  I am still not sure exactly how this algorithm works, and I am not looking to replicate its logic, just its output.  Let’s putstring1 and string2 onto a spreadsheet below in figure 1 with string1 along an ‘x’ axis and string2 along a ‘y’ axis: 

fig.1 LCS diagram

Here we can see all of the matches between the two strings (shown in pink above) so now in order to find the longest common subsequence (LCS) we would need to find a series of characters that follow one of the blue arrow paths below in figure 2: 

fig.2 LCS diagram

If you’ve got a good eye, you’ll see that the longest common subsequence (LCS) is qwerty!  Let’s take a look at the LCS highlighted in yellow below in Figure 3: 

fig.3 LCS diagram You can see that each character (or lower-case letter in our example) occurs in both string1 and string2 and each of these characters also occur in the same chronological order.  These are the essential characteristics of the LCS. 

I am not looking for the most efficient solution, just one that is reliable and may need refactoring later on. At this point I am going to create some JavaScript Object Notation (or JSON) code for each of these matches, let’s see what that looks like in Figure 4 below: 

fig.4 LCS diagram So here we are calling each of these matches (shown in pink) a location - and assigning each one a locationOrder in the JSON.  The ordering starts from the top left & ends in the bottom right of the matrix.  If we take a look at the JSON code we also see each location is aware of its own coordinates in the matrix with x and y values.  The location is also cognizant of its descendants - these are the other locations located to the right and bottom of the current location.  So let’s have a look at the t qwerty’ in order to better understand this.  Go ahead and zoom into the t in the lower right-hand corner where you’ll find the following JSON code:

{
  locationOrder: 13,
  x: 11,
  y: 8,
  descendants: [ 14 ],
  progenyCenses: 1,
  character: 't'
}

This is telling us that our t:

  • is the 13th location in our matrix

  • is located in the 11th column on the matrix (x-axis)

  • is located in the 8th row on the matrix (y-axis)

  • has one location to its right and bottom, so that location is a descendant, and that descendant’s locationOrder is 14 

  • has a progenyCenses value of 1 just like every other location.  This is the core engine of my algorithm that figures out how to actually find the LCS which I will discuss later. 

  • has a character value of t

When we speak of a descendant we mean to say that if the LCS were ‘qwerty’, then the ‘r’, ‘t’, and ‘y’ are all descendants of ‘e’ because they all come after ‘e’ in the order of sequence. Likewise, ‘t’ and ‘y’ are descendants of ‘r’ for the same reason, and it goes without saying that everybody in ‘qwerty’ is a descendant of the ‘q’ except the ‘q’ itself which is an ancestor to all of the other characters.  This is a good place to point out that my solution here will not track child-parent relationships, and is only aware of descendant-ancestor relationships. This means there are fewer relationships to track which leads to better efficiency. 

After creating JSON objects for each location my solution calls for a second iteration over the matrix, this time going in reverse order from the iteration we did in Figure 4 above.  As we can see from the zig-zagging fat arrow below in Figure 5, we need to start the iteration from the lower right-hand corner and work towards the upper left-hand corner.  when we arrive at our first location we see it has a locationOrder value of 14 and a character of y.  Now we’re going to get to the mysterious progenyCensus value that I identified earlier as the core engine of this solution.  At this lower right-most location (location 14) we have a progenyCensus value of 1 and we are going to keep it that way.  The reason is that this location has no descendants to call its own, and the progenyCensus is a way to add value to locations who have descendants which in turn also have descendants, but in this case, our location 14 only gets a progenyCensus value of 1 to represent itself. 

fig.5 LCS diagram Once we get to the location 13 above (the one highlighted in green) we see it gets a progenyCensus value of 2 - this is because it gets one point for itself (just like location 14 did) but it also gets another point for location 14 (highlighted in orange) because location 14 is a descendant and it also has a progenyCensus value of 1 so we add those two 1’s together and assign location 13 a final progenyCensus value of 2

Below in Figure 6 we see this pattern repeating once the iteration arrives at location 10 (the one highlighted in green) - there we see a big pink 4 that represents the commensurate progenyCensus value. We get this 4 by looking at the progenyCensus values of all of the descendants (highlighted in orange) - there we have a 1 (at location 14) and a 2 (at location 13) and when we give a 1 for the current location (10) we get (1 + 2 + 1) which equals 4.

fig.6 LCS diagram

We now repeat the pattern until we reach location 8 as shown below in Figure 7; here we have a progenyCensus value of 11 because we took the values of all descendants and added them up so that 4 + 2 + 2 + 1 + 1 = 10 plus, the value of 1 for representation of location 8 itself which yields a total progenyCensus value of 11 at location 8 which is highlighted in green. 

fig.7 LCS diagram

Once we keep going we get the progenyCensus values for all of the locations, even those not part of the LCS as shown below in Figure 8:

fig.8 LCS diagram

Keep in mind that at this point, our algorithm still has no idea where the longest common subsequence (LCS) is or what locations are included in it.  In example, location 1 below (which has a locationOrder value of 1 if you still don’t see the pattern) has a progenyCensus value of 14 and is highlighted in a blue border with a big blue 14 underneath it.  To be sure, this location has a lot of descendants and as far as the algorithm is concerned at this point, this location might be in the LCS. 

fig.9 LCS diagram

Alright, here in Figure 10 we are going to find the first character in the LCS; it is the q in the upper left-hand corner because that location has the highest progenyCensus value (73) in the entire matrix, so the algorithm will at this point decide that this in in fact the first character of the LCS.  Next question: of all the descendants of q, which one has the highest progenyCensus value?  It turns out the answer is the w with the progenyCensus value of 22.  Remember that descendants have to be to the right and below the current location, below they are highlighted with dark-red borders:

fig.10 LCS diagram

In figure 11 below we just repeat the process; we’ve already determined that the q with a progenyCensus value of 73 must be the first character in the LCS and its descendent, the w with a progenyCensus value of 22 must be the second character. Now we’re going to look at the descendants of that w (that are highlighted with dark-red borders) and ask which of them has the highest progenyCensus value?  The answer is of course the e with a progenyCensus value of 11

fig.11 LCS diagram

And then we repeat the process… fig.12 LCS diagram

…until we get to the end! fig.13 LCS diagram

I discovered this algorithm by accident while starring at a diagram I made that looks very similar to Figure 8 above.  I noticed that the LCS had to have some sort of census value that weighed not only descendants, but the descendants of descendants in such a way that gave more credit to descendant branches with a greater number of generations.  Now to be sure, this solution is not as fast as the Dynamic Programming example I was trying to replicate, and it also breaks when the string1 and string2 input values get too large.  The reason is that it is possible to have two progenyCensus values that are exactly the same, with one location having fewer generations than the other. 

To prevent the breaking, I can simply add a multiplier to the progenyCensus calculation such that each number in that algorithm will be multiplied by x wherein x represents a value that must increase as string1 and string2 increase in size.  This multiplier can be found on line 82 of this REPL where it is currently set to a value of 8.  This will produce progenyCensus values that are much larger then the ones shown in the figures above, and thus, the chances of any two locations having the same progenyCensus value will be greatly reduced. 

In the end, the value this algorithm brings to the table is the fact that it is object oriented and can therefore be expanded upon to include extended functionality.  For example, we could add a z axis to this thing and find the LCS between three strings instead of just 2.  We could also extend the functionality to exclude certain characters from the LCS, so for example we could say we want the LCS that does not include a letter ‘q’ for example.  This could have all sorts of use cases for finding things like DNA sequences or searching for signal patterns while defining certain inputs as noise up front that should be excluded from the return value(s). 

Finally, to close up I have one last diagram that gives a comprehensive overview of everything discussed in this blog post.  The code being illustrated here is the exact same code as what is in the REPL Complete LCS diagram