Learning Regular Expressions Through Python

Last Updated: 2023-05-17 09:40:49 Wednesday

-- TOC --

Regular expression syntax
Greedy or lazy (reluctant)
Alternation
Groups
(?:...) non-capturing group
Possessive quantifiers (from Python 3.11)
(?>...) atomic group (from Python 3.11)
re.M (re.MULTILINE)
re.S (re.DOTALL)
(?#...) comments
(?=...) lookahead assertion
(?!...) negative lookahead assertion
(?<=...) lookbehind assertion
(?<!...) negative lookbehind assertion
Regular expression performance and optimization
re.compile
re.finditer
re.findall

A regular expression (RE, regex) is a mini-language for matching strings that satisfy a set of rules, so that the matches can then be processed further.

Regular expression syntax

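RE code	Comments
.	Any single character except '\n', unless re.DOTALL is used
^	Start of the string (with re.M, also the start of each line)
\A	Start of the string only; not affected by re.M
$	End of the string, or just before a trailing newline (with re.M, also the end of each line)
\Z	End of the string only; not affected by re.M
[]	A set: matches any one character listed in it; useful for matching special characters literally, and `-` inside a set denotes a range
[^x]	Any character except x; `^` has no special meaning if it is not the first character in the set
\d	A single digit
\D	The opposite of `\d`
\s	A single whitespace character; `[ \t\n\r\f\v]` in ASCII mode
\S	The opposite of `\s`
\w	A single letter, digit, or underscore; `[_a-zA-Z0-9]` in ASCII mode
\W	The opposite of `\w`, i.e. `[^\w]`
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 time; after another quantifier, makes it non-greedy
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times, inclusive
[.*+?$]	Matches any one of the special characters `.*+?$`
`\`	Escape: `\.` matches a literal `.`, and `\\` matches a literal `\`
`\^`	Matches a literal `^` (recommended)
\b	Matches the empty string, but only at the beginning or end of a word (`\w+`), i.e. at a word boundary
\B	The opposite of `\b`
`()`	Group, which can also be the target of a backreference
`\(`	A literal `(`, same as `[(]`
`\)`	A literal `)`, same as `[)]`
`A|B`	Alternation: matches A or B
`\|`	A literal `|`, same as `[|]`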
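
A few rows of the table can be checked directly. The snippet below is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Inside a set, these metacharacters are matched literally.
specials = re.findall(r'[.*+?$]', 'a.b*c+d?e$')

# \b matches at word boundaries, so "cat" inside "concat" is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# In a pattern, \\ matches one literal backslash (index 2 in 'C:\temp').
backslash = re.search(r'\\', 'C:\\temp')

print(specials)          # ['.', '*', '+', '?', '$']
print(whole_words)       # ['cat', 'cat']
print(backslash.span())  # (2, 3)
```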

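Greedy or lazy (reluctant)

Greedy means matching as many characters as possible; this is the default behavior of `*`, `+`, and `?`. Sometimes we want lazy (reluctant) matching instead, and all it takes is one more `?` appended to the quantifier: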

>>> import re
>>> re.search(r'\d+','123')
<re.Match object; span=(0, 3), match='123'>
>>> re.search(r'\d+?','123')
<re.Match object; span=(0, 1), match='1'>

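`{n,m}?` means: between n and m repetitions, match as few as possible (reluctant). On its own that is just n repetitions; more are taken only when the rest of the pattern cannot match otherwise.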
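
To make the `{n,m}?` behavior concrete, here is a minimal sketch with made-up sample strings. It compares the greedy and lazy forms, and shows the lazy form growing past n when the rest of the pattern requires it.

```python
import re

text = 'aaaaa'

# Greedy: takes as many repetitions as the upper bound allows.
greedy = re.search(r'a{2,4}', text).group()

# Lazy: stops at the lower bound when nothing else is required.
lazy = re.search(r'a{2,4}?', text).group()

# Lazy, but the trailing b forces the quantifier to take a third 'a'.
forced = re.search(r'a{2,4}?b', 'aaab').group()

print(greedy)  # aaaa
print(lazy)    # aa
print(forced)  # aaab
```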

Alternation

Several regular (sub)expressions joined by `|` are in an OR relationship:

>>> re.search(r'\d+|[a-z]+','12abcdef')
<re.Match object; span=(0, 2), match='12'>
>>> re.search(r'\d+|[a-z]+','abc123456789')
<re.Match object; span=(0, 3), match='abc'>

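When matching alternatives, the branches are tested from left to right, and once one branch matches, the others are no longer considered (unless the rest of the pattern then fails and forces backtracking). So the most likely branch should be written on the left. (This is just like an `if` with several conditions joined by `or`: put the most likely one first and let short-circuit evaluation skip the rest.)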

This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
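
The "never greedy" behavior of `|` is easy to see with two overlapping branches. This is a small sketch with hypothetical variable names; the last line uses re.fullmatch to show that a later branch is still tried when the overall match would otherwise fail.

```python
import re

# The left branch matches first, so the longer alternative is never tried.
first_wins = re.search(r'a|ab', 'ab').group()

# Putting the longer branch on the left changes the result.
longer_first = re.search(r'ab|a', 'ab').group()

# fullmatch requires the whole string to match: branch 'a' leaves 'b'
# unmatched, so the engine backtracks and tries branch 'ab'.
whole = re.fullmatch(r'a|ab', 'ab').group()

print(first_wins)    # a
print(longer_first)  # ab
print(whole)         # ab
```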

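Groups

Groups make the following possible:

Repeating a complex expression;
Extracting matched content;
Backreferences;
Using `|` inside a group for local alternation.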

>>> a = re.search(r'(\da)+', '1a2a3a4d5a')
>>> a.group()
'1a2a3a'
>>> a.groups()  # only one group
('3a',)

Group numbers for backreferences start at \1:

>>> a = re.search(r'(\d+)([a-z]+)\1\2','123abc123abc')
>>> a
<re.Match object; span=(0, 12), match='123abc123abc'>
>>> a.group()
'123abc123abc'
>>> a.groups()
('123', 'abc')

Named groups:

>>> a = re.search(r'(?P<k1>\d+)([a-z]+)\1\2','123abc123abc')
>>> a.group()
'123abc123abc'
>>> a.group('k1')
'123'
>>> a.groups()
('123', 'abc')
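
Named groups can also be read back all at once. The following sketch assumes a made-up date string and group names (`year`, `month`); it only uses Match.group and Match.groupdict.

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'date: 2023-05')

# A named group is still numbered, so both lookups return the same text.
by_name = m.group('year')
by_number = m.group(1)

# groupdict() returns only the named groups, keyed by name.
named = m.groupdict()

print(by_name, by_number)  # 2023 2023
print(named)               # {'year': '2023', 'month': '05'}
```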

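Extracted groups can also be referenced in the replacement string of re.sub, as \g<name> or \g<number>. (\1 works there too, but \g<1> stays unambiguous when a digit follows the reference.)

>>> re.sub(r'(?P<k>\d+)', r'abc\g<k>kk', '123bc')
'abc123kkbc'
>>> re.sub(r'(?P<k>\d+)', r'abc\g<1>kk', '123bc')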
'abc123kkbc'
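
The difference between `\1` and `\g<1>` in a replacement string only shows up when a digit follows the reference. A minimal sketch, with made-up inputs:

```python
import re

# Plain numbered references are accepted in the replacement string.
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')

# \g<1>0 means "group 1, then a literal 0".
padded = re.sub(r'(\d+)', r'\g<1>0', 'x=12')

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(r'(\d+)', r'\10', 'x=12')
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = exc

print(swapped)  # host@user
print(padded)   # x=120
print(ambiguous_error is not None)  # True
```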

Backreferences by group name are also possible, although I don't like using them:

>>> re.search(r'(?P<num>\d+)==(?P=num)', '123==123')
<re.Match object; span=(0, 8), match='123==123'>

Nested groups (numbered by the order of their opening parentheses):

>>> a = re.search(r'((\d+)[ ]+)a(bc)', '123 abc')
>>> a.groups()
('123 ', '123', 'bc')
>>> a.group(1)
'123 '
>>> a.group(2)
'123'
>>> a.group(3)
'bc'

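`(?:...)` non-capturing group

Group 0 is the whole match; the other groups are numbered from 1, and non-capturing groups do not take up a number.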

>>> a = re.search(r'a(\d)b(?:\d)c(\d)d', 'a1b2c3d4')
>>> a.group()
'a1b2c3d'
>>> a.group(0)
'a1b2c3d'
>>> a.group(1)
'1'
>>> a.group(2)
'3'
>>> a.groups()
('1', '3')
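
One place where the choice between `(...)` and `(?:...)` visibly matters is re.findall, which returns group contents when the pattern has a capturing group. A short sketch with a made-up string:

```python
import re

text = 'ab-ab-ab cd'

# With a capturing group, findall returns the group's last capture.
capturing = re.findall(r'(ab-?)+', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'(?:ab-?)+', text)

print(capturing)      # ['ab']
print(non_capturing)  # ['ab-ab-ab']
```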

Possessive quantifiers (from Python 3.11)

Some Chinese texts translate "possessive" as “独占” (exclusive) mode.

*+, ++, ?+, {m,n}+

Like the '*', '+', and '?' quantifiers, those where '+' is appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow back-tracking when the expression following it fails to match. These are known as possessive quantifiers.

For example, a*a will match 'aaaa' because the a* will match all 4 'a's, but, when the final 'a' is encountered, the expression is backtracked so that in the end the a* ends up matching 3 'a's total, and the fourth 'a' is matched by the final 'a'. However, when a*+a is used to match 'aaaa', the a*+ will match all 4 'a', but when the final 'a' fails to find any more characters to match, the expression cannot be backtracked and will thus fail to match.
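
The example from the quoted paragraph can be run as follows. This is a sketch that assumes it may also be executed on an interpreter older than 3.11, where the possessive syntax is rejected at compile time, so that part sits behind a version check.

```python
import re
import sys

# Greedy: a* first takes all four 'a's, then gives one back to the final a.
greedy = re.search(r'a*a', 'aaaa')

# Placeholder for older interpreters, where a*+ is a syntax error.
possessive = None
if sys.version_info >= (3, 11):
    # a*+ keeps every 'a' it took and never gives one back,
    # so the final a has nothing left to match.
    possessive = re.search(r'a*+a', 'aaaa')

print(greedy.group())  # aaaa
print(possessive)      # None
```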

`(?>...)` atomic group (from Python 3.11)
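
According to the Python documentation, `(?>...)` matches its contents as if they were a separate regular expression and, once that has succeeded, never backtracks into it; `x*+` is equivalent to `(?>x*)`. The sketch below contrasts it with an ordinary non-capturing group. The version check is there because the syntax only exists from Python 3.11 on; the sample string is made up.

```python
import re
import sys

# A non-capturing group still lets the engine backtrack into a*.
plain = re.search(r'(?:a*)a', 'aaaa')

# Placeholder for older interpreters, where (?>...) is rejected.
atomic = None
if sys.version_info >= (3, 11):
    # Once (?>a*) has taken all four 'a's, none of them is given back.
    atomic = re.search(r'(?>a*)a', 'aaaa')

print(plain.group())  # aaaa
print(atomic)         # None
```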

re.M (re.MULTILINE)

re.M is the short form of re.MULTILINE.

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).

With re.M, ^ and $ take effect on every line. Without it, the default is that the whole string has just one start and one end.

>>> a = """\
... abcde12345
... 12345abcde
... xinlin
... 20110407
... """
>>> a
'abcde12345\n12345abcde\nxinlin\n20110407\n'
>>> import re
>>> re.findall(r'\d+', a)
['12345', '12345', '20110407']
>>> re.findall(r'^\d+', a, re.M)
['12345', '20110407']
>>> re.findall('[a-z]+$', a)
[]
>>> re.findall('[a-z]+$', a, re.M)
['abcde', 'xinlin']

\A and \Z are not affected by re.M.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
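
Both notes can be checked with a two-line string. A minimal sketch, using a made-up sample text:

```python
import re

text = 'abc\n123\n'

# match() only ever tries position 0, even in MULTILINE mode.
at_start = re.match(r'^\d+', text, re.M)

# search() may start at the beginning of the second line.
anywhere = re.search(r'^\d+', text, re.M)

# ^ matches at every line start with re.M, while \A matches only once.
caret = re.findall(r'^\w+', text, re.M)
a_anchor = re.findall(r'\A\w+', text, re.M)

print(at_start)          # None
print(anywhere.group())  # 123
print(caret)             # ['abc', '123']
print(a_anchor)          # ['abc']
```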

re.S (re.DOTALL)

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).

>>> re.search(r'.', '\n')
>>> re.search(r'.', '\n', re.S)
<re.Match object; span=(0, 1), match='\n'>
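
The flag and its inline form `(?s)` behave the same way. Here is a small sketch on a made-up two-line snippet, where `.*` has to cross a newline:

```python
import re

snippet = '<p>line one\nline two</p>'

# Without the flag, .* stops at the newline and the match fails.
without_flag = re.search(r'<p>(.*)</p>', snippet)

# With re.S, or with the inline flag (?s), the dot also matches '\n'.
with_flag = re.search(r'<p>(.*)</p>', snippet, re.S)
inline = re.search(r'(?s)<p>(.*)</p>', snippet)

print(without_flag)              # None
print(repr(with_flag.group(1)))  # 'line one\nline two'
print(inline.group(1) == with_flag.group(1))  # True
```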

`(?#...)` comments

A comment; the contents of the parentheses are simply ignored. So a regular expression can carry comments too; it is a mini-language, after all.
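
A minimal sketch of inline comments, with a made-up pattern and input. The commented pattern matches exactly like the comment-free one.

```python
import re

commented = r'(\d{4})(?#year)-(\d{2})(?#month)'
bare = r'(\d{4})-(\d{2})'

text = 'released 2023-05'
m1 = re.search(commented, text)
m2 = re.search(bare, text)

# The comments are ignored and do not create groups.
print(m1.groups())                # ('2023', '05')
print(m1.groups() == m2.groups()) # True
```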

`(?=...)` lookahead assertion

Matches if ... matches next, but doesn’t consume any of the string.

>>> for tm in re.finditer(r'a(?=3a)', 'a3a3a3a4a5a4a3b2a'):
...   print(tm, tm.start())
...
<re.Match object; span=(0, 1), match='a'> 0
<re.Match object; span=(2, 3), match='a'> 2
<re.Match object; span=(4, 5), match='a'> 4

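Some people translate this feature as 零宽前向断言 (zero-width lookahead assertion). Zero-width means that the assertion does not consume any of the string, so the text it checks is not part of the match.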

`(?!...)` negative lookahead assertion

Matches if ... doesn’t match next.
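
For comparison with the lookahead example, the same sample string can be scanned with the negative form. This sketch collects the start positions of every 'a' that is not followed by '3a'.

```python
import re

text = 'a3a3a3a4a5a4a3b2a'

# Every 'a' whose next two characters are NOT '3a'.
starts = [m.start() for m in re.finditer(r'a(?!3a)', text)]

# The assertion consumes nothing, so each match is just the single 'a'.
matches = re.findall(r'a(?!3a)', text)

print(starts)   # [6, 8, 10, 12, 16]
print(matches)  # ['a', 'a', 'a', 'a', 'a']
```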

`(?<=...)` lookbehind assertion

Matches if the current position in the string is preceded by a match for ... that ends at the current position.

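The expression inside a lookbehind must be of fixed length. Also, re.match is not a good fit here, because a match that requires preceding text cannot begin at the start of the string: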

>>> re.search(r'(?<=abc)123', 'abc123')
<re.Match object; span=(3, 6), match='123'>
>>> re.match(r'(?<=abc)123', 'abc123')
>>>

`(?<!...)` negative lookbehind assertion

Matches if the current position in the string is not preceded by a match for ....
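
Here is a short sketch of the negative lookbehind, followed by a check of the fixed-length rule that applies to both lookbehind forms. The sample text and variable names are made up.

```python
import re

text = 'price $30, qty 40, fee $5'

# Numbers not preceded by '$'. The \d in the set stops the engine from
# matching the tail of a priced number, such as the '0' in '$30'.
unpriced = re.findall(r'(?<![$\d])\d+', text)

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a+)b')
    width_error = None
except re.error as exc:
    width_error = exc

print(unpriced)                 # ['40']
print(width_error is not None)  # True
```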

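Regular expression performance and optimization

Greedy matching easily leads to backtracking, which hurts performance; avoid greedy quantifiers when you do not need them;
When writing alternations, put the most likely branch on the left;
Factor out content that is common to several branches, which can speed up matching;
When capturing is not needed, use non-capturing groups, and keep nested groups to a minimum.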
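
As an illustration of factoring out what branches have in common, the sketch below rewrites a two-branch pattern with a shared prefix and a non-capturing group, then checks that both versions accept the same inputs. The word list is made up, and whether the factored form is actually faster depends on the input, so it is worth measuring (for example with timeit) before relying on it.

```python
import re

words = ['foobar', 'foobaz', 'fooqux']

# Both branches repeat the prefix 'fooba'.
unfactored = re.compile(r'foobar|foobaz')

# The shared prefix is written once; the group does not capture.
factored = re.compile(r'fooba(?:r|z)')

results_unfactored = [bool(unfactored.fullmatch(w)) for w in words]
results_factored = [bool(factored.fullmatch(w)) for w in words]

print(results_unfactored)  # [True, True, False]
print(results_factored)    # [True, True, False]
```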

re.compile

Compile the regular expression first, then match with it inside the loop, so the pattern is not processed again on every iteration.

p = re.compile('ab*', re.IGNORECASE)

REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.
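
A compiled pattern is typically created once and then reused. A minimal sketch, reusing the pattern from above on a made-up list of lines:

```python
import re

# Compile once, outside the loop.
p = re.compile('ab*', re.IGNORECASE)

lines = ['ABBB', 'xyz', 'cab']
hits = []
for line in lines:
    m = p.search(line)  # the compiled object has its own search method
    if m:
        hits.append(m.group())

print(hits)  # ['ABBB', 'ab']
```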

re.finditer

Returns an iterator that yields one match object per step. The matches are non-overlapping, and the string is scanned from left to right.

>>> a = 'Do you know? I like cats.'
>>> word_re = re.compile(r'\w+')
>>> for m in re.finditer(word_re,a):
...   print(m)
...
<re.Match object; span=(0, 2), match='Do'>
<re.Match object; span=(3, 6), match='you'>
<re.Match object; span=(7, 11), match='know'>
<re.Match object; span=(13, 14), match='I'>
<re.Match object; span=(15, 19), match='like'>
<re.Match object; span=(20, 24), match='cats'>
>>>
>>> for m in word_re.finditer(a):
...   print(m)
...
<re.Match object; span=(0, 2), match='Do'>
<re.Match object; span=(3, 6), match='you'>
<re.Match object; span=(7, 11), match='know'>
<re.Match object; span=(13, 14), match='I'>
<re.Match object; span=(15, 19), match='like'>
<re.Match object; span=(20, 24), match='cats'>

re.findall

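Returns a list of all non-overlapping matches, scanned left to right. If the pattern contains groups, the list holds the groups instead (as tuples when there is more than one group).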

>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]
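
What findall returns depends on how many groups the pattern has. The sketch below runs three variants on the same sample sentence, then builds a dict with finditer for the case where match objects are more convenient.

```python
import re

text = 'set width=20 and height=10'

# No groups: a list of whole matches.
no_group = re.findall(r'\w+=\d+', text)

# One group: a list of that group's contents.
one_group = re.findall(r'\w+=(\d+)', text)

# Two groups via finditer: each match object exposes both groups.
as_dict = {m.group(1): int(m.group(2))
           for m in re.finditer(r'(\w+)=(\d+)', text)}

print(no_group)   # ['width=20', 'height=10']
print(one_group)  # ['20', '10']
print(as_dict)    # {'width': 20, 'height': 10}
```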

Link to this article: https://cs.pynote.net/sf/python/202212131/

-- EOF --
