Longest Common Subsequence

The longest common subsequence (LCS) is the most popular dynamic programming problem. It’s the problem of finding the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring problem. LCS for input Sequences “ABCDGH” and “AEDFHR” is “ADH” of length 3.
LCS for input Sequences “AGGTAB” and “GXTXAYB” is “GTAB” of length 4. The LCS is the basis the usual diff tool. The naïve solution is exponential : the complexity is with length of the string. The running time of the dynamic programming approach is with and the lengths of the two sequences. The LCS has an optimal substructure: the LCS(z) contains also the LCS of the prefix of Z. Since the algorithm builds a matrix to store the partial results we can called this algorithm a ‘2D dynamic program’. We can define the LCS algorithm:

  • LCS of 2 empty sequences is the empty sequence.
  • LCS of “{prefix1}A” and “{prefix2}A” is LCS({prefix1}, {prefix2}) + A
  • LCS of “{prefix1}A” and “{prefix2}B” is the longest of LCS({prefix1}A, {prefix2}) and LCS({prefix1}, {prefix2}B)

So it’s the solution iteratively starting from the simple base cases. Note that the algorithm can work like this because the problem has so-called “optimal” structure, meaning that it can be built by reusing previous memoized steps. The algorithm is ‘divided’ in two parts: the first one is focused on find the length of the longest common subsequence.

\begin{array}{|c|c|c|c|c|c|c|} \hline & \boldsymbol{\varnothing} & \mathbf{A} & \mathbf{G} & \mathbf{C} & \mathbf{A} & \mathbf{T} \\ \hline \boldsymbol{\varnothing} & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline \mathbf{G} & 0 & \leftarrow \uparrow0 & \nwarrow 1 & \leftarrow 1 & \leftarrow 1 & \leftarrow 1 \\ \hline \mathbf{A} & 0 & \nwarrow 1 & \leftarrow \uparrow 1 & \leftarrow \uparrow 1 & \nwarrow 2 & \leftarrow 2 \\ \hline \mathbf{C} & 0 & \uparrow 1 & \leftarrow \uparrow 1 & \nwarrow 2 & \leftarrow \uparrow 2 & \leftarrow \uparrow 2 \\ \hline \end{array}$$ This matrix will be used later to reconstruct LCS by tracing backwards:

\begin{array}{|c|c|c|c|c|c|c|} \hline & \varnothing & \mathbf{A} & \mathbf{G} & \mathbf{C} & \mathbf{A} & \mathbf{T} \ \hline \boldsymbol{\varnothing} & \varnothing & \varnothing & \varnothing & \varnothing & \varnothing & \varnothing \ \hline \mathbf{G} & \varnothing & \leftarrow \varnothing & \nwarrow(\mathrm{G}) & \leftarrow(\mathrm{G}) & \leftarrow(\mathrm{G}) & \leftarrow(\mathrm{G}) \ \hline \mathbf{A} & \varnothing & \nwarrow(\mathrm{A}) & \leftarrow\uparrow(\mathrm{A}) &(\mathrm{G}) & \leftarrow\uparrow(\mathrm{A}) &(\mathrm{G}) & \nwarrow(\mathrm{GA}) & \leftarrow(\mathrm{GA}) \ \hline \mathbf{C} & \varnothing & \uparrow(\mathrm{A}) & \leftarrow\uparrow(\mathrm{A}) &(\mathrm{G}) & \nwarrow(\mathrm{AC}) &(\mathrm{GC}) & \leftarrow\uparrow(\mathrm{AC}) &(\mathrm{GC}) &(\mathrm{GA}) & \leftarrow \uparrow(\mathrm{AC}) &(\mathrm{GC}) &(\mathrm{GA}) \ \hline \end{array}$$

Super useful video: Longest common subsequence algorithm