Last Updated: 2024-03-25 14:15:01 Monday
-- TOC --
Subsequence is not Substring.
子序列,subsequence,是指元素按照原序列内的顺序出现但不一定连续的序列。
sequence: ABCDEFG
subsequence: ACD, BEFG, ABCDE
LCS问题,就是寻找两个sequence的最长的common subsequence。
S1 = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA
LCS = GTCGTCGGAAGCCGGCCGAA
此问题具备optimal substructure和overlapping subproblem两个特性。子问题就是对合适位置的substring计算subsequence问题,取maximum值。
def lcs(a, b):
if not a or not b:
return 0
blen = len(b)
clen = 0
for i in range(len(a)):
for j in range(blen):
if a[i] == b[j]:
clen = max(1+lcs(a[i+1:],b[j+1:]), clen)
break
return clen
a = [1,2,3,2,1,2,3,4,2,3,4,3,2,3,1,3]
b = [4,4,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2]
print(lcs(a,b)) # 12
print(lcs(b,a)) # 12
class LCS:
def prep(self, a, b):
self.a = a
self.b = b
self.alen = len(a)
self.blen = len(b)
self.rec = {}
def lcs(self, i, j):
if (i,j) in self.rec:
return self.rec[i,j]
if i==self.alen or j==self.blen:
return 0
clen = 0
for ii in range(i,self.alen):
for jj in range(j,self.blen):
if self.a[ii] == self.b[jj]:
clen = max(self.lcs(ii+1,jj+1)+1, clen)
break
self.rec[i,j] = clen
return clen
前面递归算法采用的suffix substring视角,而这里的bottom-up采用的是prefix substring视角。《CLRS》中一直采用prefix视角,但没有给出递归代码的伪码。
def lcs_bu(a, b):
alen = len(a)
blen = len(b)
rec = [[0]*(blen+1) for _ in range(alen+1)]
for i in range(alen):
for j in range(blen):
if a[i] == b[j]:
rec[i+1][j+1] = rec[i][j] + 1
else:
rec[i+1][j+1] = max(rec[i+1][j], rec[i][j+1])
return rec[i+1][j+1]
print(lcs_bu([1,2,4],[1,2,3,4])) # 3
print(lcs_bu(a,b)) # 12
print(lcs_bu(b,a)) # 12
Bottom-Up版本似乎总有可以节省内存的办法,这种特性来自于子问题是一层层从底向上计算的:
def lcs_bu(a, b):
alen = len(a)
blen = len(b)
rec = [0] * (blen+1)
for i in range(alen):
c = rec[:]
# rec = [0] * (blen+1)
for j in range(blen):
if a[i] == b[j]:
rec[j+1] = c[j] + 1
else:
rec[j+1] = max(rec[j], c[j+1])
return rec[blen]
还可以进一步降低内存,rec的长度采用min(alen,blen)。代码略。
在迭代的过程中,做一些标记,最后通过这些标记,把LCS字符串找出来:
def get_lcs_bu(a, b):
alen = len(a)
blen = len(b)
pos = [['^']*blen for _ in range(alen)]
rec = [0] * (blen+1)
for i in range(alen):
c = rec[:]
for j in range(blen):
if a[i] == b[j]:
rec[j+1] = c[j] + 1
pos[i][j] = '\\'
else:
if rec[j] >= c[j+1]:
rec[j+1] = rec[j]
else:
rec[j+1] = c[j+1]
pos[i][j] = '<'
#
r = rec[j+1]
lcs = ''
i = alen - 1
j = blen - 1
while not (i<0 or j<0):
if pos[i][j] == '\\':
lcs += str(a[i])
i -= 1
j -= 1
elif pos[i][j] == '^':
j -= 1
else: # '<'
i -= 1
return r, lcs[::-1]
print(get_lcs_bu([1,3,5],[1,2,3,4,5,6]))
print(get_lcs_bu([1,2,3,4,5,6],[1,3,5]))
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
a.pop(0)
print('#', a)
print('#', b)
print(get_lcs_bu(a,b))
print(get_lcs_bu(b,a))
如下图示例:
1 3 5
1 \ < <
2 ^ ^ ^
3 ^ \ <
4 ^ ^ ^
5 ^ ^ \
0:表示匹配,要取这个位置的值,同时与上移和左移
1:左移
2:上移
输出如下:
(3, '135')
(3, '135')
# [1, 2, 3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [2, 3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [3, 2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(12, '321234234321')
(12, '321234234321')
# [2, 1, 2, 3, 4, 2, 3, 4, 3, 2, 3, 1, 3]
# [4, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2]
(11, '21234234321')
(11, '21234234321')
不同的问题,通过DP算法,得到结果的方法均不相同,要具体分析,而且还不是很简单。
LCS可能不止一个吗?
可能,如下:
1 2 3 4 5 6 9
1 3 9 5
LCS:135 or 139
如何一口气将所有的LCS都找出来?
本文链接:https://cs.pynote.net/ag/dp/202403147/
-- EOF --
-- MORE --