Longest Common Subsequence,最长相同子序列(LCS)

Last Updated: 2024-03-25 14:15:01 Monday

-- TOC --

Subsequence is not Substring.

问题分析

子序列,subsequence,是指元素按照原序列内的顺序出现但不一定连续的序列。

sequence:    ABCDEFG
subsequence: ACD, BEFG, ABCDE

LCS问题,就是寻找两个sequence的最长的common subsequence。

S1 =  ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
S2 =  GTCGTTCGGAATGCCGTTGCTCTGTAAA
LCS = GTCGTCGGAAGCCGGCCGAA

此问题具备optimal substructure和overlapping subproblem两个特性。子问题就是对合适位置的substring计算subsequence问题,取maximum值。

简单递归

def lcs(a, b):
    if not a or not b:
        return 0
    blen = len(b)
    clen = 0
    for i in range(len(a)):
        for j in range(blen):
            if a[i] == b[j]:
                clen = max(1+lcs(a[i+1:],b[j+1:]), clen)
                break
    return clen


a = [1,2,3,2,1,2,3,4,2,3,4,3,2,3,1,3]
b = [4,4,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2]
print(lcs(a,b))  # 12
print(lcs(b,a))  # 12

Top-Down

class LCS:

    def prep(self, a, b):
        self.a = a
        self.b = b
        self.alen = len(a)
        self.blen = len(b)
        self.rec = {}

    def lcs(self, i, j):
        if (i,j) in self.rec:
            return self.rec[i,j]
        if i==self.alen or j==self.blen:
            return 0
        clen = 0
        for ii in range(i,self.alen):
            for jj in range(j,self.blen):
                if self.a[ii] == self.b[jj]:
                    clen = max(self.lcs(ii+1,jj+1)+1, clen)
                    break
        self.rec[i,j] = clen
        return clen

Bottom-Up

前面递归算法采用的suffix substring视角,而这里的bottom-up采用的是prefix substring视角。《CLRS》中一直采用prefix视角,但没有给出递归代码的伪码。

def lcs_bu(a, b):
    alen = len(a)
    blen = len(b)
    rec = [[0]*(blen+1) for _ in range(alen+1)]
    for i in range(alen):
        for j in range(blen):
            if a[i] == b[j]:
                rec[i+1][j+1] = rec[i][j] + 1
            else:
                rec[i+1][j+1] = max(rec[i+1][j], rec[i][j+1])
    return rec[i+1][j+1]


print(lcs_bu([1,2,4],[1,2,3,4]))  # 3
print(lcs_bu(a,b))  # 12
print(lcs_bu(b,a))  # 12

Bottom-Up版本似乎总有可以节省内存的办法,这种特性来自于子问题是一层层从底向上计算的:

def lcs_bu(a, b):
    alen = len(a)
    blen = len(b)
    rec = [0] * (blen+1)
    for i in range(alen):
        c = rec[:]
        # rec = [0] * (blen+1)
        for j in range(blen):
            if a[i] == b[j]:
                rec[j+1] = c[j] + 1
            else:
                rec[j+1] = max(rec[j], c[j+1])
    return rec[blen]

还可以进一步降低内存,rec的长度采用min(alen,blen)。代码略。

找出LCS

在迭代的过程中,做一些标记,最后通过这些标记,把LCS字符串找出来:

def get_lcs_bu(a, b):
    alen = len(a)
    blen = len(b)
    pos = [['^']*blen for _ in range(alen)]
    rec = [0] * (blen+1)
    for i in range(alen):
        c = rec[:]
        for j in range(blen):
            if a[i] == b[j]:
                rec[j+1] = c[j] + 1
                pos[i][j] = '\\'
            else:
                if rec[j] >= c[j+1]:
                    rec[j+1] = rec[j]
                else:
                    rec[j+1] = c[j+1]
                    pos[i][j] = '<'
    #
    r = rec[j+1]
    lcs = ''
    i = alen - 1
    j = blen - 1
    while not (i<0 or j<0):
        if pos[i][j] == '\\':
            lcs += str(a[i])
            i -= 1
            j -= 1
        elif pos[i][j] == '^':
            j -= 1
        else:  # '<'
            i -= 1
    return r, lcs[::-1]


print(get_lcs_bu([1,3,5],[1,2,3,4,5,6]))
print(get_lcs_bu([1,2,3,4,5,6],[1,3,5]))
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
如下图示例:
   1   3   5
1  \   <   <
2  ^   ^   ^
3  ^   \   <
4  ^   ^   ^
5  ^   ^   \

0:表示匹配,要取这个位置的值,同时与上移和左移
1:左移
2:上移

输出如下:

(3, '135')
(3, '135')
# [1, 2, 3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [2, 3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(11, '21234234321')
(11, '21234234321')

不同的问题,通过DP算法,得到结果的方法均不相同,要具体分析,而且还不是很简单。

LCS可能不止一个吗?

可能,如下:

1 2 3 4 5 6 9
1 3 9 5

LCS:135 or 139

如何一口气将所有的LCS都找出来?

本文链接:https://cs.pynote.net/ag/dp/202403147/

-- EOF --

-- MORE --