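I was reading a file saved as UTF-8 in Python. The first line of the file was empty, so I used line.strip() == '' to detect blank lines, but the check gave the wrong result.
After line.strip(), the value printed as '', so why was it not equal to ''? len(line.strip()) turned out to be 3! Very strange, and clearly not an empty string. I then passed the result to repr() to see its escaped form and found the value \xef\xbb\xbf. What does this value mean?
EF BB BF is the UTF-8 encoding of the byte order mark (BOM), a marker that some programs write at the start of a file to indicate that the file is UTF-8 encoded.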
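The symptom described above can be reproduced with a short sketch. It assumes Python 3 (where bytes and text are separate types) and uses made-up sample data; the variable names are illustrative only. It shows that strip() removes whitespace but leaves the BOM in place, both on the raw bytes and on the text decoded as plain utf-8.

```python
import codecs

# Assumed sample data: a UTF-8 BOM followed by an empty first line.
raw = codecs.BOM_UTF8 + b"\nhello\n"
first_line_bytes = raw.split(b"\n")[0]

# On raw bytes the BOM is three bytes that strip() does not remove.
print(repr(first_line_bytes.strip()))   # b'\xef\xbb\xbf'
print(len(first_line_bytes.strip()))    # 3

# Decoded as plain utf-8, the BOM becomes the single character U+FEFF,
# which is not whitespace, so strip() still leaves it behind.
first_line_text = first_line_bytes.decode("utf-8")
print(repr(first_line_text.strip()))    # '\ufeff'
print(first_line_text.strip() == "")    # False
```

The length of 3 matches what a Python 2 byte string reports; in Python 3 text the same mark is a single character, but the comparison with '' fails either way.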
For how to handle it, see the first answer to "Reading Unicode file data with BOM chars in Python", quoted below:
There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:
1. # Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
2. # BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it.
So when reading the file, I use the utf-8-sig codec. In Python 2.7, the code looks like this:
import codecs
with codecs.open(file_path, 'r', 'utf-8-sig') as fh:
    for line in fh:
        print(repr(line.strip()))  # the BOM is gone, so the blank first line prints u''
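In Python 3 the same idea can be sketched with the built-in open(), which accepts an encoding argument directly. The helper name read_stripped_lines and the temporary sample file below are assumptions made for illustration, not part of the original code.

```python
import os
import tempfile


def read_stripped_lines(path):
    """Hypothetical helper: read a text file, dropping a leading UTF-8 BOM if present."""
    with open(path, "r", encoding="utf-8-sig") as fh:
        return [line.strip() for line in fh]


# Build an assumed sample file: BOM, an empty first line, then one line of text.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbf\nhello\n")
    sample_path = tmp.name

lines = read_stripped_lines(sample_path)
os.remove(sample_path)

print(lines)             # ['', 'hello']
print(lines[0] == "")    # True
```

With utf-8-sig the blank first line compares equal to '' as expected. In Python 2.7, io.open(file_path, encoding='utf-8-sig') should behave the same way as the codecs.open call above.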