Last Updated: 2023-05-17 09:40:49 Wednesday
A regular expression (RE, regex) is a mini language for matching strings that satisfy a set of rules, so that the matches can then be processed further.
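RE code | comments |
---|---|
. | any single character except '\n', unless re.DOTALL is used |
^ | start of the string (with re.M, also the start of each line) |
\A | start of the string only; not affected by re.M |
$ | end of the string, or just before a trailing '\n' (with re.M, also the end of each line) |
\Z | end of the string only; not affected by re.M |
[] | a set: matches any one character in the set; can be used to match special characters, and - inside the set denotes a range |
[^x] | any character except x; ^ has no special meaning if it’s not the first character in the set |
\d | a single digit |
\D | the opposite of \d |
\s | a single whitespace character, [ \t\n\r\f\v] |
\S | the opposite of \s |
\w | a single letter, digit or underscore, [_a-zA-Z0-9] (str patterns also match other Unicode word characters unless re.ASCII is set) |
\W | the opposite of \w, [^\w] |
* | repeat 0 or more times |
+ | repeat 1 or more times |
? | repeat 0 or 1 time; also used to make a quantifier non-greedy |
{n} | repeat exactly n times |
{n,} | repeat n or more times |
{n,m} | repeat n to m times, inclusive |
[.*+?$] | matches any one of the literal characters . * + ? $ |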
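\ | escape: \. matches a literal . and \\ matches a literal \ |
\^ | matches a literal ^ (the recommended form) |
\b | matches the empty string, but only at the beginning or end of a word (\w+), i.e. at a word boundary |
\B | the opposite of \b |
() | group or back reference |
\( | matches a literal (, same as [(] |
\) | matches a literal ), same as [)] |
A|B | alternation: matches A or B |
[|] | matches a literal vertical bar |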
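The difference between the anchors in the table is easiest to see on a string with several lines. The following is a small sketch with a made-up sample string; the variable names are only for illustration.

```python
import re

text = "first line\nsecond line\n"

# '$' also matches just before a trailing newline; '\Z' only at the very end.
dollar = re.search(r"line$", text)        # finds the last 'line', before the final '\n'
strict_end = re.search(r"line\Z", text)   # None: a '\n' still follows the last 'line'

# With re.M, '^' matches at every line start, but '\A' only at position 0.
caret_hits = re.findall(r"^\w+", text, re.M)
a_hits = re.findall(r"\A\w+", text, re.M)

# '\b' keeps 'cat' from matching inside 'concatenate'.
words = re.findall(r"\bcat\b", "cat concatenate cat.")

print(dollar.span(), strict_end, caret_hits, a_hits, words)
```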
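Greedy means matching as many characters as possible; this is the default behavior of *, + and ?. Sometimes we need lazy (reluctant, non-greedy) matching instead, and all it takes is appending a ? to any of these quantifiers: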
>>> import re
>>> re.search(r'\d+','123')
<re.Match object; span=(0, 3), match='123'>
>>> re.search(r'\d+?','123')
<re.Match object; span=(0, 1), match='1'>
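{n,m}? is the lazy (reluctant) form of {n,m}: it matches as few repetitions as possible between n and m, which means exactly n unless the rest of the pattern needs more.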
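A short sketch of how the lazy bounded quantifier behaves, using made-up input strings. On its own it stops at the lower bound; when something must follow, it grows only as far as needed.

```python
import re

# On its own, the lazy form stops at the lower bound.
lazy = re.search(r"\d{2,4}?", "12345").group()
greedy = re.search(r"\d{2,4}", "12345").group()

# When an 'x' must follow, two digits are not enough, so a third is taken.
needs_more = re.search(r"\d{2,4}?x", "123x").group()

print(lazy, greedy, needs_more)
```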
Several (sub)expressions joined by | are alternatives, that is, an "or" relationship:
>>> re.search(r'\d+|[a-z]+','12abcdef')
<re.Match object; span=(0, 2), match='12'>
>>> re.search(r'\d+|[a-z]+','abc123456789')
<re.Match object; span=(0, 3), match='abc'>
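When matching alternatives, the branches are tested from left to right, and as soon as one branch matches, the remaining ones are not tried. So the most likely branch should be written on the left, just as the most likely condition goes first in an if with several or-ed conditions (short-circuit evaluation).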
This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
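The effect of branch order can be checked with two tiny patterns. This is only an illustrative sketch; the inputs are made up.

```python
import re

# The first branch that matches wins, even though 'ab' would be longer.
first_wins = re.search(r"a|ab", "ab").group()

# Putting the longer branch first changes the result.
longer_first = re.search(r"ab|a", "ab").group()

# If the overall match would fail, the engine backtracks into the next branch.
forced = re.fullmatch(r"a|ab", "ab").group()

print(first_wins, longer_first, forced)
```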
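Grouping can be used, among other things, to restrict | to part of a pattern (local alternation). When a group is repeated, it captures only its last repetition:
>>> a = re.search(r'(\da)+', '1a2a3a4d5a')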
>>> a.group()
'1a2a3a'
>>> a.groups() # only one group
('3a',)
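Since a repeated group keeps only its last repetition, one possible way to get every repetition is to scan the matched text again with the inner pattern. This is a sketch of that idea, not the only approach.

```python
import re

text = "1a2a3a4d5a"
m = re.search(r"(\da)+", text)

last_only = m.groups()                      # the group holds its last repetition only
all_reps = re.findall(r"\da", m.group())    # re-scan the matched text for each repetition

print(last_only, all_reps)
```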
Back references to groups are numbered starting from \1:
>>> a = re.search(r'(\d+)([a-z]+)\1\2','123abc123abc')
>>> a
<re.Match object; span=(0, 12), match='123abc123abc'>
>>> a.group()
'123abc123abc'
>>> a.groups()
('123', 'abc')
Named groups, written (?P<name>...), still receive a number and can be referenced by it:
>>> a = re.search(r'(?P<k1>\d+)([a-z]+)\1\2','123abc123abc')
>>> a.group()
'123abc123abc'
>>> a.group('k1')
'123'
>>> a.groups()
('123', 'abc')
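Captured groups can also be used in the replacement string of re.sub. \1 works there too, but \g<1> (or \g<name>) stays unambiguous when a digit follows the reference:
>>> re.sub(r'(?P<k>\d+)', r'abc\g<k>kk', '123bc')
'abc123kkbc'
>>> re.sub(r'(?P<k>\d+)', r'abc\g<1>kk', '123bc')
'abc123kkbc'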
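A quick check suggests that a plain \1 is accepted in the replacement string as well, and that the \g<1> form mainly matters when a literal digit follows the reference. The sketch below uses a made-up input; ambiguous_ok is just an illustrative flag.

```python
import re

pattern = re.compile(r"(\d+)")

plain = pattern.sub(r"<\1>", "7 apples")         # \1 refers to group 1
explicit = pattern.sub(r"\g<1>0", "7 apples")    # group 1 followed by a literal '0'

try:
    pattern.sub(r"\10", "7 apples")              # parsed as group 10, which does not exist
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

print(plain, explicit, ambiguous_ok)
```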
Back references by group name, (?P=name), are also possible, although I prefer not to use them:
>>> re.search(r'(?P<num>\d+)==(?P=num)', '123==123')
<re.Match object; span=(0, 8), match='123==123'>
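Where named groups tend to pay off is in reading results rather than in back references. A minimal sketch, assuming a made-up date-like input, using Match.groupdict():

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2023-05")

by_name = m.groupdict()          # maps each group name to its text
same_group = (m.group("year"), m.group(1))   # name and number reach the same group

print(by_name, same_group)
```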
Nested groups are numbered by the position of their opening parenthesis, from left to right:
>>> a = re.search(r'((\d+)[ ]+)a(bc)', '123 abc')
>>> a.groups()
('123 ', '123', 'bc')
>>> a.group(1)
'123 '
>>> a.group(2)
'123'
>>> a.group(3)
'bc'
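(?:...) is a non-capturing group. Group 0 is the whole match, the other groups are numbered from 1, and non-capturing groups do not take up a number.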
>>> a = re.search(r'a(\d)b(?:\d)c(\d)d', 'a1b2c3d4')
>>> a.group()
'a1b2c3d'
>>> a.group(0)
'a1b2c3d'
>>> a.group(1)
'1'
>>> a.group(2)
'3'
>>> a.groups()
('1', '3')
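Non-capturing groups are handy with re.findall, whose result depends on which groups capture. The sketch below compares the two forms on a made-up string of URL-like tokens.

```python
import re

text = "http://a ftp://b https://c"

# The scheme is grouped only for alternation, so just the host is captured.
hosts = re.findall(r"(?:https?|ftp)://(\w+)", text)

# With a capturing group for the scheme, findall returns tuples instead.
pairs = re.findall(r"(https?|ftp)://(\w+)", text)

print(hosts, pairs)
```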
Possessive quantifiers: *+, ++, ?+, {m,n}+ (new in Python 3.11). Some people translate the term into Chinese as “独占” (exclusive) mode.
Like the '*', '+', and '?' quantifiers, those where '+' is appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow back-tracking when the expression following it fails to match. These are known as possessive quantifiers.
For example, a*a will match 'aaaa' because the a* will match all 4 'a's, but, when the final 'a' is encountered, the expression is backtracked so that in the end the a* ends up matching 3 'a's total, and the fourth 'a' is matched by the final 'a'. However, when a*+a is used to match 'aaaa', the a*+ will match all 4 'a', but when the final 'a' fails to find any more characters to match, the expression cannot be backtracked and will thus fail to match.
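The a*a versus a*+a behavior can be tried directly. This sketch guards the possessive and atomic forms behind a version check, since they need Python 3.11 or later; on older interpreters the two variables are simply left as None without testing anything.

```python
import re
import sys

# Greedy a* gives one 'a' back so that the final 'a' can match.
backtracking = re.fullmatch(r"a*a", "aaaa")

if sys.version_info >= (3, 11):
    # Possessive a*+ keeps all four 'a's, so the final 'a' has nothing left.
    possessive = re.fullmatch(r"a*+a", "aaaa")
    # The atomic group (?>a*) behaves the same way.
    atomic = re.fullmatch(r"(?>a*)a", "aaaa")
else:
    # Syntax not available before 3.11; nothing is tested here.
    possessive = atomic = None

print(backtracking is not None, possessive, atomic)
```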
(?>...) atomic group (from Python 3.11): once the group has matched, the engine does not backtrack into it.
re.M is the short form of re.MULTILINE.
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
With re.M, ^ and $ work on every line. Without re.M, which is the default, the whole string has just one start and one end.
>>> a = """\
... abcde12345
... 12345abcde
... xinlin
... 20110407
... """
>>> a
'abcde12345\n12345abcde\nxinlin\n20110407\n'
>>> import re
>>> re.findall(r'\d+', a)
['12345', '12345', '20110407']
>>> re.findall(r'^\d+', a, re.M)
['12345', '20110407']
>>> re.findall('[a-z]+$', a)
[]
>>> re.findall('[a-z]+$', a, re.M)
['abcde', 'xinlin']
\A and \Z are not affected by re.M. Similarly, re.match() will only match at the beginning of the string and not at the beginning of each line.
re.S (re.DOTALL): Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
>>> re.search(r'.', '\n')
>>> re.search(r'.', '\n', re.S)
<re.Match object; span=(0, 1), match='\n'>
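A typical case for re.S is a pattern that has to span several lines. The sketch below uses a made-up HTML-like snippet and also shows the inline (?s) form.

```python
import re

snippet = "<p>line one\nline two</p>"

without_flag = re.search(r"<p>(.*)</p>", snippet)            # '.' stops at the newline
with_flag = re.search(r"<p>(.*)</p>", snippet, re.S).group(1)
inline_flag = re.search(r"(?s)<p>(.*)</p>", snippet).group(1)

print(without_flag, repr(with_flag), repr(inline_flag))
```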
(?#...) comments
A comment; the contents of the parentheses are simply ignored. A regular expression can carry comments too; it is a mini language, after all.
(?=...) lookahead assertion
Matches if ... matches next, but doesn’t consume any of the string.
>>> for tm in re.finditer(r'a(?=3a)', 'a3a3a3a4a5a4a3b2a'):
... print(tm, tm.start())
...
<re.Match object; span=(0, 1), match='a'> 0
<re.Match object; span=(2, 3), match='a'> 2
<re.Match object; span=(4, 5), match='a'> 4
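Some people call this feature a zero-width positive lookahead assertion. Zero-width means that it consumes none of the string, so the text it checks is not part of the match.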
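Because a lookahead consumes nothing, it can mark positions rather than text. One possible use is inserting thousands separators; add_thousands_separators is a hypothetical helper written for this sketch, and it assumes the input is a plain string of digits.

```python
import re

def add_thousands_separators(digits: str) -> str:
    # \B keeps a comma off the very start of the number; the lookahead
    # requires that the remaining digits form whole groups of three
    # up to the end of the string.
    return re.sub(r"\B(?=(?:\d{3})+$)", ",", digits)

print(add_thousands_separators("1234567"))
```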
(?!...) negative lookahead assertion
Matches if ... doesn’t match next.
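A small sketch of a negative lookahead: picking file names whose extension is not bak. The file names are made up for illustration.

```python
import re

names = "a.py b.bak c.txt"

# After the dot, refuse to continue if the extension is exactly 'bak'.
kept = re.findall(r"\b\w+\.(?!bak\b)\w+", names)

print(kept)
```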
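(?<=...) lookbehind assertion
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
The expression inside a lookbehind must have a fixed length. Also, re.match is a poor fit for a pattern that starts with a lookbehind, because the match can never begin at the start of the string: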
>>> re.search(r'(?<=abc)123', 'abc123')
<re.Match object; span=(3, 6), match='123'>
>>> re.match(r'(?<=abc)123', 'abc123')
>>>
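The fixed-length requirement can be observed directly: a variable-width lookbehind is rejected when the pattern is compiled. The sketch below uses a made-up price string; variable_width_ok is just an illustrative flag.

```python
import re

# Digits that directly follow a dollar sign; the '$' itself is not part of the match.
prices = re.findall(r"(?<=\$)\d+", "cost $30 or 45 or $7")

try:
    re.compile(r"(?<=a+)b")       # 'a+' has no fixed width
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(prices, variable_width_ok)
```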
(?<!...) negative lookbehind assertion
Matches if the current position in the string is not preceded by a match for the ... pattern.
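As a sketch of the negative form, the pattern below picks numbers that are not prices, using a made-up input string. The \b matters: without it, the second digit of a price would match, because it is preceded by a digit rather than by a dollar sign.

```python
import re

# Whole numbers that do not directly follow a dollar sign.
non_prices = re.findall(r"(?<!\$)\b\d+", "cost $30 or 45 or $7")

print(non_prices)
```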
re.compile: compile the regular expression once, then match with it inside the loop, which avoids redundant repeated work.
p = re.compile('ab*', re.IGNORECASE)
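A sketch of the compile-once pattern in a loop. The log format, REQUEST_RE and the sample lines are assumptions made up for this example. The re module also caches recently used patterns, so the gain comes mostly from skipping the lookup and from having a named, reusable object.

```python
import re

# Compiled once, outside the loop.
REQUEST_RE = re.compile(r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})")

log_lines = ["GET /index 200", "POST /login 302", "not a request"]

parsed = []
for line in log_lines:
    m = REQUEST_RE.fullmatch(line)   # reuse the compiled pattern on every line
    if m:
        parsed.append(m.groupdict())

print(parsed)
```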
REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. (There are applications that don’t need REs at all, so there’s no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.
re.finditer returns an iterator that yields one match object per iteration. The matches are non-overlapping, and the string is scanned from left to right.
>>> a = 'Do you know? I like cats.'
>>> word_re = re.compile(r'\w+')
>>> for m in re.finditer(word_re,a):
... print(m)
...
<re.Match object; span=(0, 2), match='Do'>
<re.Match object; span=(3, 6), match='you'>
<re.Match object; span=(7, 11), match='know'>
<re.Match object; span=(13, 14), match='I'>
<re.Match object; span=(15, 19), match='like'>
<re.Match object; span=(20, 24), match='cats'>
>>>
>>> for m in word_re.finditer(a):
... print(m)
...
<re.Match object; span=(0, 2), match='Do'>
<re.Match object; span=(3, 6), match='you'>
<re.Match object; span=(7, 11), match='know'>
<re.Match object; span=(13, 14), match='I'>
<re.Match object; span=(15, 19), match='like'>
<re.Match object; span=(20, 24), match='cats'>
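re.findall returns a list of all non-overlapping matches, scanned from left to right. If the pattern contains groups, it returns the groups instead (tuples when there is more than one group).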
>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]
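The shape of the re.findall result depends on how many groups the pattern has. The sketch below runs three variants of the pattern over the same string as the example above, then turns the pairs into a dict.

```python
import re

text = "set width=20 and height=10"

no_group = re.findall(r"\w+=\d+", text)        # whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # just the single group
two_groups = re.findall(r"(\w+)=(\d+)", text)  # one tuple per match

settings = dict(two_groups)

print(no_group, one_group, two_groups, settings)
```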
Permalink: https://cs.pynote.net/sf/python/202212131/