I was stuck for a long time on an error caused by Chinese text encoding: a senior labmate gave me a .txt file, brought over from a Mac, that contained Chinese characters. The file started with a BOM character! Recording this as a reminder to myself…
- The earliest version looked like this:
label_list = []
with open('input/out_classes.txt', 'r', encoding='utf-8') as cf:
    for line in cf:
        label = line.strip('\r\n')
        label_list.append(label)
        print(label)
- The error was:
UnicodeEncodeError: 'gbk' codec can't encode character '\ufeff' in position 0
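Note that the failure is not in reading the file: decoding with plain utf-8 succeeds and keeps the BOM in the string as the character U+FEFF. The exception is raised later, when print() tries to encode that character for a console whose default codec is gbk (common on Chinese-locale Windows). A minimal reproduction of just that encode step, assuming nothing beyond the stdlib:

```python
# utf-8 bytes with a BOM prefix, as a Mac/Windows editor might write them
bom_bytes = b'\xef\xbb\xbfcat'

# plain utf-8 decoding keeps the BOM as a real character in the string
text = bom_bytes.decode('utf-8')
assert text == '\ufeffcat'

# gbk has no mapping for U+FEFF, so encoding it fails with the same
# UnicodeEncodeError as in the traceback above
try:
    text.encode('gbk')
except UnicodeEncodeError as e:
    print(e)
```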
- The explanation from Stack Overflow:
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you.
u = 'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
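The decode direction is the part that matters for the bug above: 'utf-8-sig' strips a leading BOM when one is present and is harmless when none is, so it is a safe default for text files of unknown Mac/Windows origin. A small check of both cases:

```python
with_bom = 'label'.encode('utf-8-sig')   # b'\xef\xbb\xbflabel'
without_bom = 'label'.encode('utf-8')    # b'label'

# plain utf-8 keeps the BOM as a character; utf-8-sig strips it
assert with_bom.decode('utf-8') == '\ufefflabel'
assert with_bom.decode('utf-8-sig') == 'label'

# utf-8-sig is a no-op when there is no BOM to strip
assert without_bom.decode('utf-8-sig') == 'label'
```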
- The final fix:
label_list = []
with open('input/out_classes.txt', 'r', encoding='utf-8-sig') as cf:
    for line in cf:
        label = line.strip('\r\n')
        label_list.append(label)
        print(label)
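An end-to-end check of the fix, using a temporary file in place of input/out_classes.txt (the file contents here are just for illustration):

```python
import os
import tempfile

# write a file the way a Mac/Windows editor might: utf-8 with a BOM
path = os.path.join(tempfile.mkdtemp(), 'out_classes.txt')
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('猫\n狗\n')

# read it back with the fixed reader: the BOM never reaches label_list
label_list = []
with open(path, 'r', encoding='utf-8-sig') as cf:
    for line in cf:
        label_list.append(line.strip('\r\n'))

assert label_list == ['猫', '狗']
```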