Advanced Python: Chapter 3 (String Processing)

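Problem 1: How do you split a string that contains several different delimiters?

Problem:
We want to split a string into fields, but the string uses several different delimiters, for example:
s = 'ab;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz'
Here , | ; and \t are all delimiters. How do we handle this?

For a single delimiter: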

In [4]: x = !ps  aux 

In [5]: s = x[-1] 

In [6]: s 
Out[6]: 'root     32487  0.0  0.0      0     0 ?        S    09:44   0:01 [kworker/u24:0]'

In [7]: s.split?
Docstring:
S.split(sep=None, maxsplit=-1) -> list of strings

Return a list of the words in S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are
removed from the result.
Type:      builtin_function_or_method

In [8]: s.split() 
Out[8]: 
['root',
 '32487',
 '0.0',
 '0.0',
 '0',
 '0',
 '?',
 'S',
 '09:44',
 '0:01',
 '[kworker/u24:0]']

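Solution:
Method 1: Call str.split() repeatedly, handling one delimiter per pass.
Method 2: Use re.split() from the regular expression module to split the string in a single pass.

Method 1: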

In [9]: s = 'ab;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz'

In [10]: s.split(';') 
Out[10]: ['ab', 'cd|efg|hi,jkl|mn\topq', 'rst,uvw\txyz']

In [11]: res = s.split(';') 

In [12]: map(lambda x: x.split('|'),res) 
Out[12]: <map at 0x7f2faa12bc50>

In [13]: list(map(lambda x: x.split('|'),res)) 
Out[13]: [['ab'], ['cd', 'efg', 'hi,jkl', 'mn\topq'], ['rst,uvw\txyz']]

In [14]: t = [] 

In [38]: list(map(lambda x: t.extend(x.split('|')),res))
Out[38]: [None, None, None]

In [39]: t
Out[39]: ['ab', 'cd', 'efg', 'hi,jkl', 'mn\topq', 'rst,uvw\txyz']

In [40]: res = t 

In [41]: t = [] 

In [42]: list(map(lambda x: t.extend(x.split(',')),res))
Out[42]: [None, None, None, None, None, None]

In [43]: t
Out[43]: ['ab', 'cd', 'efg', 'hi', 'jkl', 'mn\topq', 'rst', 'uvw\txyz']

Following the same logic, we can write a function:

In [44]: def mySplit(s,ds):
    ...:     res = [s] 
    ...:     for d in ds:
    ...:         t = [] 
    ...:         list(map(lambda x: t.extend(x.split(d)),res))
    ...:         res = t 
    ...:     return res 
    ...: 

In [45]: s = 'ab;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz' 

In [46]: mySplit(s,';,|\t') 
Out[46]: ['ab', 'cd', 'efg', 'hi', 'jkl', 'mn', 'opq', 'rst', 'uvw', 'xyz']

However, two consecutive delimiters produce an empty string:

In [47]: s = 'ab;;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz' 

In [48]: mySplit(s,';,|\t') 
Out[48]: ['ab', '', 'cd', 'efg', 'hi', 'jkl', 'mn', 'opq', 'rst', 'uvw', 'xyz']

We modify the function so that it returns only the non-empty strings:

In [49]: def mySplit(s,ds):
    ...:     res = [s] 
    ...:     for d in ds:
    ...:         t = [] 
    ...:         list(map(lambda x: t.extend(x.split(d)),res))
    ...:         res = t 
    ...:     return [x  for x in res if x ] 
    ...:      
    ...: 

In [50]: s = 'ab;;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz' 

In [51]: mySplit(s,';,|\t') 
Out[51]: ['ab', 'cd', 'efg', 'hi', 'jkl', 'mn', 'opq', 'rst', 'uvw', 'xyz']
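
The map/lambda call above is used only for its side effect on t, so a plain nested loop expresses the same idea more directly. The following is a minimal sketch of that variant; the name split_multi is hypothetical and does not appear in the original session.

```python
def split_multi(s, delimiters):
    """Split s on every character in delimiters, dropping empty fields."""
    parts = [s]
    for d in delimiters:
        next_parts = []
        for part in parts:
            # Split each current piece on this delimiter and collect the results.
            next_parts.extend(part.split(d))
        parts = next_parts
    # Consecutive delimiters produce empty strings; filter them out.
    return [p for p in parts if p]


s = 'ab;;cd|efg|hi,jkl|mn\topq;rst,uvw\txyz'
print(split_multi(s, ';,|\t'))
```

The behavior is intended to match the second mySplit above: one pass over the pieces per delimiter, then a final filter for empty strings.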

Method 2:

In [56]: import re 

In [57]: re.split?
Signature: re.split(pattern, string, maxsplit=0, flags=0)
Docstring:
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.  If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list.  If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
File:      /usr/local/lib/python3.5/re.py
Type:      function

In [58]: re.split(r'[;,|\t]+',s)
Out[58]: ['ab', 'cd', 'efg', 'hi', 'jkl', 'mn', 'opq', 'rst', 'uvw', 'xyz']
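
The + in the pattern collapses runs of delimiters, but a delimiter at the very start or end of the string still yields an empty field. The sketch below wraps re.split() to handle that case and builds the character class with re.escape(), so delimiters such as | or ] stay literal. The helper name split_on_any is an assumption made for illustration.

```python
import re


def split_on_any(s, delimiters):
    """Split s on runs of any character in delimiters (must be non-empty)."""
    # re.escape keeps characters such as '|' or ']' literal inside the class.
    pattern = '[' + re.escape(delimiters) + ']+'
    # A leading or trailing delimiter yields an empty string; drop those.
    return [field for field in re.split(pattern, s) if field]


print(split_on_any(';ab;;cd|efg,hi\tjk;', ';,|\t'))
```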

Problem 2: How do you check whether string a starts or ends with string b?

Problem:
A directory contains a series of files:
qudsn.rc
cadbh.py
ajcn.java
njasd.sh
wneacm.cpp
......
Write a program that adds user-execute permission to all of the .sh and .py files.

Solution:
Use the str.startswith() and str.endswith() methods.
Note: to match against several candidates, pass them as a tuple.

In [1]: ls
a.py  b.sh  c.java  d.h  e.cpp  f.c

In [2]: import  os,stat 

In [3]: os.listdir('.') 
Out[3]: ['a.py', 'b.sh', 'c.java', 'd.h', 'e.cpp', 'f.c']

In [4]: s = 'g.sh' 

In [5]: s.endswith('.sh') 
Out[5]: True

In [6]: s.endswith('.py') 
Out[6]: False

When a tuple is passed, the method returns True if any one of its elements matches:
In [7]: s.endswith(('.sh','.py')) 
Out[7]: True
Also note that the argument must be a tuple; a list is not accepted:
In [8]: s.endswith(['.sh','.py']) 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-2b9d4c4a32cd> in <module>()
----> 1 s.endswith(['.sh','.py'])

TypeError: endswith first arg must be str or a tuple of str, not list
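
If the suffixes already live in a list, converting it with tuple() at the call site avoids the TypeError. A small sketch, where the helper has_suffix and the list exts are hypothetical names:

```python
def has_suffix(name, suffixes):
    # str.endswith accepts a str or a tuple of str, so convert other iterables.
    return name.endswith(tuple(suffixes))


exts = ['.sh', '.py']  # hypothetical list, e.g. loaded from configuration
print(has_suffix('g.sh', exts))
print(has_suffix('g.java', exts))
```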

Now let's grant the permission to the selected files:

In [9]: [ name for name in os.listdir('.') if  name.endswith(('.sh','.py'))]
Out[9]: ['a.py', 'b.sh']

Check the file's current status; st_mode holds the file type and permission bits:
In [11]: os.stat('a.py')
Out[11]: os.stat_result(st_mode=33188, st_ino=3678001, st_dev=2051, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1491533824, st_mtime=1491533824, st_ctime=1491533824)

In [12]: os.stat('a.py').st_mode
Out[12]: 33188

Convert the mode to octal; the last three digits are the familiar permission bits (644 here, in the same notation as 777).
In [13]: oct(os.stat('a.py').st_mode)
Out[13]: '0o100644'

We bitwise-OR the mode with stat.S_IXUSR; X stands for execute and USR for user (the file owner):
In [17]: os.chmod('a.py',os.stat('a.py').st_mode | stat.S_IXUSR) 

In [18]: os.stat('a.py').st_mode
Out[18]: 33252
The file now has execute permission:
In [19]: ls -l 
total 0
-rwxr--r-- 1 root root 0 Apr  7 10:57 a.py*
-rw-r--r-- 1 root root 0 Apr  7 10:57 b.sh
-rw-r--r-- 1 root root 0 Apr  7 10:57 c.java
-rw-r--r-- 1 root root 0 Apr  7 10:57 d.h
-rw-r--r-- 1 root root 0 Apr  7 10:57 e.cpp
-rw-r--r-- 1 root root 0 Apr  7 10:57 f.c
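
The session above changes only a.py. To cover every .sh and .py file, as the problem statement asks, the filter and the chmod call can be combined in one loop. This is a minimal sketch under the assumption of a POSIX file system; the function name add_user_exec is hypothetical, and the demo runs in a temporary directory so it does not touch real files.

```python
import os
import stat
import tempfile


def add_user_exec(directory='.', suffixes=('.sh', '.py')):
    """Set the user-execute bit on regular files whose names end with one of suffixes."""
    changed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith(suffixes) and os.path.isfile(path):
            mode = os.stat(path).st_mode
            # OR-ing with S_IXUSR adds user-execute and leaves the other bits alone.
            os.chmod(path, mode | stat.S_IXUSR)
            changed.append(name)
    return sorted(changed)


# Demo in a throwaway directory with empty files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.py', 'b.sh', 'c.java'):
        open(os.path.join(tmp, name), 'w').close()
    print(add_user_exec(tmp))
```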

Problem 3: How do you change the format of text inside a string?

Problem:
A log file contains dates in the following format:
......
t=2017-04-07T15:47:00+0800 lvl=eror msg="Metrics: GraphitePublisher: Failed to connect to [dial tcp [::1]:2003: getsockopt: connection refused]!"
t=2017-04-07T15:47:10+0800 lvl=eror msg="Metrics: GraphitePublisher: Failed to connect to [dial tcp [::1]:2003: getsockopt: connection refused]!"
t=2017-04-07T15:47:20+0800 lvl=eror msg="Metrics: GraphitePublisher: Failed to connect to [dial tcp [::1]:2003: getsockopt: connection refused]!"
t=2017-04-07T15:47:30+0800 lvl=eror msg="Metrics: GraphitePublisher: Failed to connect to [dial tcp [::1]:2003: getsockopt: connection refused]!"
......

We want to change the dates to the US format 'mm/dd/yyyy', e.g. '2017-04-07' => '04/07/2017'. How do we do this?

Solution:
Use re.sub() to do the substitution: capture each part of the date with a capturing group in the pattern, then reorder the groups in the replacement string.

In [8]: log = open('gdash.log').read()

In [9]: import re 

In [10]: re.sub?
Signature: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the match object and must return
a replacement string to be used.
File:      /usr/local/lib/python3.5/re.py
Type:      function

In [12]: re.sub(r'(\d{4})-(\d{2})-(\d{2})',r'\2/\3/\1',log)

Here \1 refers to the first capturing group. We can also give the groups names:

In [14]: re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',r'\g<month>/\g<day>/\g<year>',log)
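
As the re.sub docstring notes, repl can also be a callable that receives the match object. The sketch below applies that form to the start of one log line from the problem statement; the names DATE and to_us_date are hypothetical, and the line is shortened for readability.

```python
import re

# Same named groups as in the session above, compiled once for reuse.
DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')


def to_us_date(match):
    # Called once per match; the return value replaces the matched text.
    return '{month}/{day}/{year}'.format(**match.groupdict())


line = 't=2017-04-07T15:47:00+0800 lvl=eror'
print(DATE.sub(to_us_date, line))
```

A callable is mainly useful when the replacement needs logic that a template string cannot express, such as validating or converting the captured values.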

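Problem 5: How do you left-align, right-align, and center a string?

A dictionary stores a set of property values:

{
  "lodDISt": 1000.0,
  "ejfmkd": 0.01,
  "ajdmea": 30.0,
  "jrfenim": 322
}

In our program we want to print it in the tidy format below. How do we do this?

lodDISt :1000.0
ejfmkd  :0.01
ajdmea  :30.0
jrfenim :322

Method 1: Use str.ljust(), str.rjust(), and str.center() for left, right, and center alignment.
Method 2: Use format() with a format spec such as '<20', '>20', or '^20' to do the same job.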

In [1]: s = 'abc' 
Look at the method's documentation:
In [2]: s.ljust?
Docstring:
S.ljust(width[, fillchar]) -> str

Return S left-justified in a Unicode string of length width. Padding is
done using the specified fill character (default is a space).
Type:      builtin_function_or_method

In [3]: s.ljust(20)
Out[3]: 'abc                 '

The second argument is the fill character:
In [4]: s.ljust(20,'=')
Out[4]: 'abc================='

In [5]: s.rjust(20)
Out[5]: '                 abc'

In [6]: len(s.rjust(20))
Out[6]: 20

In [7]: s.center(20)
Out[7]: '        abc         '

Using the format() function:

In [8]: format?
Signature: format(value, format_spec='', /)
Docstring:
Return value.__format__(format_spec)

format_spec defaults to the empty string
Type:      builtin_function_or_method
Left-align:
In [9]: format(s,'<20')
Out[9]: 'abc                 '
Right-align:
In [10]: format(s,'>20')
Out[10]: '                 abc'
Center:
In [11]: format(s,'^20') 
Out[11]: '        abc         '

Now back to the original problem:

In [14]: d
Out[14]: {'ajdmea': 30.0, 'ejfmkd': 0.01, 'jrfenim': 322, 'lodDISt': 1000.0}

In [15]: d.keys()
Out[15]: dict_keys(['lodDISt', 'jrfenim', 'ajdmea', 'ejfmkd'])
Get the list of key lengths:
In [16]: map(len,d.keys())
Out[16]: <map at 0x7fe9eb59ea20>

In [17]: list(map(len,d.keys()))
Out[17]: [7, 7, 6, 6]

In [18]: max(map(len,d.keys()))
Out[18]: 7

In [19]: w = max(map(len,d.keys())) 

In [20]: for k in d:
    ...:     print(k.ljust(w),':',d[k])
    ...:     
lodDISt : 1000.0
jrfenim : 322
ajdmea  : 30.0
ejfmkd  : 0.01
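
The same table can be produced with Method 2 by passing the width into the format spec as a nested field. A minimal sketch, assuming the dictionary from the problem statement; the helper name aligned_lines is hypothetical.

```python
def aligned_lines(mapping):
    """Return 'key : value' lines with the keys left-aligned to the longest key."""
    width = max(map(len, mapping))
    # '{:<{w}}' left-aligns the key in a field that is w characters wide.
    return ['{:<{w}} : {}'.format(key, value, w=width)
            for key, value in mapping.items()]


d = {'lodDISt': 1000.0, 'ejfmkd': 0.01, 'ajdmea': 30.0, 'jrfenim': 322}
for line in aligned_lines(d):
    print(line)
```

Swapping '<' for '>' or '^' in the spec gives right-aligned or centered keys without changing anything else.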

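Problem 6: How do you remove unwanted characters from a string?

Problem:
1. Strip extra whitespace from the beginning and end of user input.
2. Remove the '\r' characters from text edited on Windows.
3. Remove Unicode combining marks (tone marks) from text.

Solution:
Method 1: Use strip(), lstrip(), and rstrip() to remove characters from the ends of a string.
Method 2: To delete a single character at a fixed position, use slicing plus concatenation.
Method 3: Use replace() or re.sub() to delete characters at any position.
Method 4: Use translate() to delete several different characters at the same time.

Removing leading and trailing whitespace: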

In [21]: s = '   abc   123   '

In [22]: s.strip?
Docstring:
S.strip([chars]) -> str

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
Type:      builtin_function_or_method
When no characters are specified, whitespace is removed by default:
In [23]: s.strip()
Out[23]: 'abc   123'

In [24]: s.lstrip()
Out[24]: 'abc   123   '

In [25]: s.rstrip()
Out[25]: '   abc   123'

In [26]: s = '---abc+++'
When the characters to strip are specified:
In [27]: s.strip('-+')
Out[27]: 'abc'

In [28]: s.lstrip('-')
Out[28]: 'abc+++'

In [29]: s.rstrip('+')
Out[29]: '---abc'

Deleting a character at a fixed position:

In [30]: s = 'abc:123' 
Remove the : from the string:
In [31]: s[:3] + s[4:] 
Out[31]: 'abc123'

Using replacement:

In [32]: s = '\tabc\t123\t' 

In [33]: s.replace('\t','')
Out[33]: 'abc123'

However, each call to replace() handles only one target substring. To remove several different characters at once, use re.sub():

In [34]: s = '\tabc\t123\txyz\ropq\r' 

In [35]: import  re

In [36]: re.sub('[\t\r]','',s) 
Out[36]: 'abc123xyzopq'

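Using the translate() method

Let's look at the translate() documentation. In Python 3 it takes a single argument, the translation table; characters to delete are the ones the table maps to None (for example, those given as the third argument of str.maketrans()).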
In [37]: str.translate?
Docstring:
S.translate(table) -> str

Return a copy of the string S in which each character has been mapped
through the given translation table. The table must implement
lookup/indexing via __getitem__, for instance a dictionary or list,
mapping Unicode ordinals to Unicode ordinals, strings, or None. If
this operation raises LookupError, the character is left untouched.
Characters mapped to None are deleted.
Type:      method_descriptor

In [38]: s = 'abc123456xyz' 

In [42]: tal = str.maketrans('abcxyz','xyzabc')

In [43]: s.translate(tal)
Out[43]: 'xyz123456abc'

Note: in Python 3, both maketrans() and translate() are methods of str, so there is no need to import the string module to use them.

Using translate() to delete specific characters:

In [54]: s = 'abc\refg\n234\t' 
First build the translation table:
In [55]: table = str.maketrans('','','\r\t\n')

In [56]: s.translate(table)
Out[56]: 'abcefg234'
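
str.maketrans() also accepts a single dictionary, which allows deleting some characters and replacing others in the same pass. The sketch below applies that form to item 2 (Windows line endings); the sample text is made up for illustration.

```python
# Keys are single characters; None means "delete", a string means "replace with".
table = str.maketrans({'\r': None, '\t': ' '})

text = 'name\tvalue\r\nabc\t123\r\n'  # hypothetical Windows-edited text
print(repr(text.translate(table)))
```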

[Screenshot: using the translate() method]
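
For item 3 (removing Unicode combining marks), one possible approach is sketched below. It is an assumption about how this step could be done, not a reconstruction of the original example: the text is first decomposed with unicodedata.normalize('NFD', ...), and every combining code point found is then mapped to None for translate(). The function name strip_combining is hypothetical.

```python
import unicodedata


def strip_combining(text):
    """Remove combining marks (such as tone marks) from text."""
    # NFD splits precomposed letters such as 'ǐ' into a base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # Map every combining code point that occurs in the text to None (delete).
    table = dict.fromkeys(ord(c) for c in decomposed if unicodedata.combining(c))
    return decomposed.translate(table)


print(strip_combining('nǐ hǎo, chī fàn'))
```

Without the normalization step, precomposed characters contain no separate combining code point, so translate() alone would leave them unchanged.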

Last edited: 2017.12.06 09:59:21
