The str Object and Its Methods in Detail

>>> str.capitalize('abcd')
'Abcd'
>>> str.capitalize('1234abcd')
'1234abcd'
>>> str.capitalize('hello world')
'Hello world'

isascii

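Checks whether every character in the string is an ASCII character. When processing user input, this method can save us the trouble of writing the check ourselves.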

>>> str.isascii('123')
True
>>> str.isascii('abc')
True
>>> str.isascii('!@#$%')
True
>>> str.isascii('麦新杰')
False
>>> str.isascii('麦新杰abc')
False
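As a small illustration of the input-checking use mentioned above, here is a minimal sketch. The helper name `is_plain_ascii_id` and its rules are hypothetical, made up for this example. Note that `str.isascii` needs Python 3.7 or later and returns True for the empty string, so emptiness has to be checked separately.

```python
def is_plain_ascii_id(text):
    """Hypothetical validator: non-empty, ASCII-only, letters/digits/underscore."""
    # isascii() is True for '' as well, so reject empty input explicitly.
    if not text or not text.isascii():
        return False
    # After the ASCII check, isalnum() can only match a-z, A-Z and 0-9.
    return all(ch.isalnum() or ch == '_' for ch in text)


print(is_plain_ascii_id('user_01'))   # True
print(is_plain_ascii_id('麦新杰abc'))  # False
print(''.isascii())                   # True
```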

title

>>> str.title('hello world')
'Hello World'
>>> str.title('hello 123 world 123')
'Hello 123 World 123'
>>> str.title('a b c d e')
'A B C D E'
>>> str.title('a,b,c,d,e')
'A,B,C,D,E'

upper, lower

>>> str.upper('abc')
'ABC'
>>> str.lower('ABC')
'abc'

startswith, endswith

>>> '123abcde'.startswith('123')
True
>>> '123abcde'.startswith('@@')
False
>>> '123abcde'.endswith('cde')
True
>>> '123abcde'.endswith('3a')
False

find, index

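Both methods search from left to right for the position (index) of the first occurrence of a substring. The difference is that when the substring is not found, find returns -1, while index raises ValueError.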

>>> '12345abcde12345'.find('23')
1
>>> '12345abcde12345'.find('67')
-1
>>> '12345abcde12345'.index('23')
1
>>> '12345abcde12345'.index('67')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
>>> '12345abcde12345'.index('23', 5)
11
>>> '12345abcde12345'.find('23', 5)
11
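The optional start argument shown above makes it easy to walk through every match. The following is only a sketch; `find_all` is a hypothetical helper, not a str method. It also shows the two styles of handling a miss: testing for -1 with find, or catching ValueError with index.

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError('sub must not be empty')
    positions = []
    start = text.find(sub)
    while start != -1:          # find signals "not found" with -1
        positions.append(start)
        # Resume the search just past the current match.
        start = text.find(sub, start + len(sub))
    return positions


print(find_all('12345abcde12345', '23'))  # [1, 11]

try:
    '12345abcde12345'.index('67')
except ValueError:                        # index signals "not found" by raising
    print('not found')
```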

center, ljust, rjust

>>> '123'.center(10)
'   123    '
>>> '123'.center(11)
'    123    '
>>> '123'.center(11,'*')
'****123****'
>>> '123'.center(1,'-')
'123'
>>> '123'.rjust(10,'-')
'-------123'
>>> '123'.ljust(10,'-')
'123-------'

strip, lstrip, rstrip

>>> '   123   '.strip()
'123'
>>> '   123   '.lstrip()
'123   '
>>> '   123   '.rstrip()
'   123'

isdigit, isalpha, isalnum

>>> '123'.isdigit()
True
>>> '123abc'.isdigit()
False
>>> 'abc'.isalpha()
True
>>> 'abc123'.isalpha()
False
>>> 'abc123'.isalnum()
True
>>> 'abc123@'.isalnum()
False

count

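Returns the number of non-overlapping occurrences of a substring. (Python has no separate character type; a single character is simply a substring of length 1.)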

>>> '12345abc12345'.count('12')
2
>>> '12345abc12345'.count('12a')
0
>>> '12345abc12345'.count('a')
1
>>> '12345abc12345'.count('abc')
1
>>> '12345abc12345'.count('234')
2
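One detail worth seeing in code: count only counts matches that do not overlap. If overlapping matches are wanted, a loop over find is one way to get them. `count_overlapping` below is a hypothetical helper written for this sketch, and it assumes a non-empty substring.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    total = 0
    start = text.find(sub)
    while start != -1:
        total += 1
        # Advance by one character only, so overlapping matches are seen.
        start = text.find(sub, start + 1)
    return total


print('aaaa'.count('aa'))               # 2
print(count_overlapping('aaaa', 'aa'))  # 3
# count also accepts optional start/end positions, like find and index.
print('12345abc12345'.count('12', 5))   # 1
```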

encode

>>> '123'.encode()
b'123'
>>> '123abc'.encode()
b'123abc'

join

>>> '.'.join(('https://cs','pynote','net'))
'https://cs.pynote.net'

>>> from timeit import repeat
>>> import time
>>> stmt01 = """\
... 'a'+'b'+'c'+'d'+'e'
... """
>>> sum(repeat(stmt01, timer=time.process_time, number=10000000, repeat=10))/10
0.10396588629999996
>>> stmt02 = """\
... ''.join(('a','b','c','d','e'))
... """
>>> sum(repeat(stmt02, timer=time.process_time, number=10000000, repeat=10))/10
1.2185048218000003

>>> stmt03 = """\
... a = [str(i) for i in range(1000000)]
... b = ''
... for it in a:
...   b += it
... """
>>> sum(repeat(stmt03, timer=time.process_time, number=10, repeat=3))/3
3.234182735666669
>>> stmt04 = """\
... a = [str(i) for i in range(1000000)]
... b = ''.join(a)
... """
>>> sum(repeat(stmt04, timer=time.process_time, number=10, repeat=3))/3
2.2477204476666706

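In this kind of test, not using join means writing a loop, and loops are generally slower. Still, the measurements above show no order-of-magnitude difference. The version without a loop also reads much nicer!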
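One practical point when swapping a loop for join: join accepts only strings, so other items have to be converted first. A minimal sketch (the variable names are just for illustration):

```python
numbers = range(5)

try:
    ''.join(numbers)            # the items are ints, not str
except TypeError as exc:
    print('join failed:', exc)

# Convert on the fly with a generator expression or map.
digits = ''.join(str(n) for n in numbers)
csv_line = ','.join(map(str, numbers))
print(digits)    # 01234
print(csv_line)  # 0,1,2,3,4
```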

In many details Python behaves just like C. One example is the automatic concatenation of adjacent string literals, introduced here.

>>> 'a' 'b' 'c'
'abc'
>>> print("abc"
...       "123")
abc123
>>> stmt = """\
... 'a' 'b' 'c' 'd' 'e'
... """
>>> sum(repeat(stmt, timer=time.process_time, number=10000000, repeat=10))/10
0.10400003420000001
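Automatic concatenation is handy for wrapping long strings, but it also hides a classic mistake: a forgotten comma in a list of strings silently merges two items. The sketch below shows both sides; the list contents are made up for illustration.

```python
# Handy: wrap a long literal across lines inside parentheses.
message = ('adjacent literals are joined at compile time, '
           'which helps wrap long strings')
print(message)

# Pitfall: the comma after 'beta' is missing, so two items become one.
names = [
    'alpha',
    'beta'
    'gamma',
]
print(names)       # ['alpha', 'betagamma']
print(len(names))  # 2
```

This only applies to literals; two variables placed next to each other are a syntax error.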

>>> b''.join((b'123',b'abc'))
b'123abc'

format

Personally, I feel the format method is not used much any more and is probably less convenient than f-strings. Here are some examples:

>>> '{}-{}-{}'.format(1,2,3)
'1-2-3'
>>> '{}-{}-{}'.format('a','b','c')
'a-b-c'
>>> '{0}-{1}-{0}'.format(1,2)
'1-2-1'
>>> '{1}-{0}-{1}'.format(1,2)
'2-1-2'
>>> '{1}-{0}-{1}'.format(1,2,3)
'2-1-2'
>>> a
[1, 2, 3]
>>> b
('a', 'b', 'c')
>>> '{} {} {}'.format(*a)
'1 2 3'
>>> '{} {} {}'.format(*b)
'a b c'
>>> '{}'.format(a)
'[1, 2, 3]'
>>> '{0}--{0}--{0}'.format(a)
'[1, 2, 3]--[1, 2, 3]--[1, 2, 3]'
>>> '{0} {0} {0}'.format(*a)
'1 1 1'
>>> '{0[0]} {0[1]} {0[2]}'.format(a)
'1 2 3'
>>> '{0[0]} {1[1]} {1[2]}'.format(a,b)
'1 b c'
>>> d
{'a': 1, 'b': 3, 'c': 3}
>>> '{a},{b},{c}'.format(**d)
'1,3,3'
>>> '{a},{b},{c}'.format(a=1,b=2,c=3)
'1,2,3'

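In addition, when it comes to format specifications, the format method is essentially the same as f-strings; for that part, see the summary on f-strings.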
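To make that equivalence concrete, here is a small sketch that formats the same values both ways and compares the results. The values and variable names are arbitrary.

```python
value = 3.14159
count = 42

# The part after the colon is the same format spec in both styles.
via_format = '{:>8.2f}|{:05d}|{:x}'.format(value, count, 255)
via_fstring = f'{value:>8.2f}|{count:05d}|{255:x}'

print(via_format)                 #     3.14|00042|ff
print(via_format == via_fstring)  # True
```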

zfill

>>> '123'.zfill(10)
'0000000123'
>>> 'abc'.zfill(10)
'0000000abc'
>>> 'abc'.rjust(10,'0')
'0000000abc'

replace

Replaces an existing substring with a new one. The third argument limits the number of replacements:

>>> a
'a b c 1 2 3 '
>>> a.replace(' ','')
'abc123'
>>> a.replace(' ','',2)
'abc 1 2 3 '
>>> a.replace(' ','',3)
'abc1 2 3 '

>>> a = r'abc\n123\nhjk'
>>> print(a)
abc\n123\nhjk
>>> print(a.replace('\\n','\n'))
abc
123
hjk
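Since strings are immutable, replace always returns a new string and leaves the original alone, which also means calls can be chained. A minimal sketch with made-up sample strings:

```python
original = 'a b c 1 2 3 '
compact = original.replace(' ', '')

print(repr(original))  # 'a b c 1 2 3 '  (unchanged)
print(repr(compact))   # 'abc123'

# Chaining: each call works on the result of the previous one.
cleaned = 'a-b_c'.replace('-', ' ').replace('_', ' ')
print(cleaned)         # a b c

# When the old substring does not occur, the text comes back unchanged.
print('abc'.replace('x', 'y'))  # abc
```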

split

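The default value of the separator argument of str.split is None, not a space! None here means splitting on runs of whitespace (the 6 ASCII whitespace characters: space, \t, \n, \r, \x0b and \x0c; for str, other Unicode whitespace counts too), and all empty strings are dropped from the result. In most cases the default works well.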

>>> b = ' a b   c d e     '
>>> b.split()
['a', 'b', 'c', 'd', 'e']  # good
>>> b.split(' ')
['', 'a', 'b', '', '', 'c', 'd', 'e', '', '', '', '', '']  # bad
>>> c = 'a   b   c   d'
>>> c.split()
['a', 'b', 'c', 'd']  # good
>>> c.split(' ')
['a', '', '', 'b', '', '', 'c', '', '', 'd']  # bad
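A few edge cases make the difference between the default and an explicit separator easier to remember. This is just a sketch with made-up inputs:

```python
# Default: any run of whitespace separates, and no empty strings appear.
mixed = ' a\tb \n c  '.split()
print(mixed)            # ['a', 'b', 'c']

# The empty string behaves differently in the two modes.
empty_default = ''.split()
empty_explicit = ''.split(' ')
print(empty_default)    # []
print(empty_explicit)   # ['']

# With an explicit separator, empty fields are kept (useful for CSV-like data).
fields = 'a,b,,c'.split(',')
print(fields)           # ['a', 'b', '', 'c']
```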

str.split actually takes two parameters. The second one (maxsplit) limits the number of splits (unlimited by default), counted from left to right:

>>> a = 'a b c d e'
>>> a.split(None, 1)
['a', 'b c d e']
>>> a.split(None, 2)
['a', 'b', 'c d e']
>>> a.split(None, 3)
['a', 'b', 'c', 'd e']
>>> a.split(None, 4)
['a', 'b', 'c', 'd', 'e']
>>> a.split()
['a', 'b', 'c', 'd', 'e']
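A typical use of the second parameter is cutting a line into a key and "the rest". The sketch below uses made-up sample data, and also shows rsplit, the str method that counts splits from the right instead.

```python
line = 'Content-Type: text/html; charset=utf-8'
# Split once only, so anything after the first ': ' stays in one piece.
key, value = line.split(': ', 1)
print(key)    # Content-Type
print(value)  # text/html; charset=utf-8

# maxsplit can also be passed by keyword, keeping the default separator.
first, rest = 'a b c d e'.split(maxsplit=1)
print(first, '|', rest)  # a | b c d e

# rsplit counts splits from the right.
head, tail = '/usr/local/lib/python3'.rsplit('/', 1)
print(head)   # /usr/local/lib
print(tail)   # python3
```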

shlex is a module in the Python standard library, often seen in scripts on Linux. shlex also provides a split function, aimed mainly at shell command strings.

>>> import shlex
>>> cmd = 'command -t a -b 1 -g "1 2 3" -h'
>>> shlex.split(cmd)
['command', '-t', 'a', '-b', '1', '-g', '1 2 3', '-h']
>>> cmd = 'command -t a -b 1 -g "1 2 3" -h  # a little comments'
>>> shlex.split(cmd, comments=True)
['command', '-t', 'a', '-b', '1', '-g', '1 2 3', '-h']
>>> cmd.split()  # not suitable
['command', '-t', 'a', '-b', '1', '-g', '"1', '2', '3"', '-h', '#', 'a', 'little', 'comments']

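Shell command lines may contain arguments wrapped in quotes and may carry comments. These are exactly the cases shlex.split is good at, whereas str.split is ill-suited and cumbersome here.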
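The reverse direction can be sketched too: shlex.quote escapes a single argument so that a list can be turned back into a command string that shlex.split reads correctly. The argument list below is made up for illustration. On Python 3.8 and later, shlex.join does the same joining in one call.

```python
import shlex

args = ['command', '-g', '1 2 3', '-h']

# Quote each argument, then join with spaces to build the command line.
cmd = ' '.join(shlex.quote(arg) for arg in args)
print(cmd)                       # command -g '1 2 3' -h

# Splitting the result gives back the original argument list.
print(shlex.split(cmd) == args)  # True
```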

swapcase

>>> a = 'aBcD78'
>>> a.swapcase()
'AbCd78'
>>> a.swapcase().swapcase()
'aBcD78'