Preprocessing in NLP

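I've recently been working on an NLP project, and for a beginner it really feels like reinventing the wheel from scratch...
My mentor started me off with a relevance-analysis task, so I set out full of enthusiasm to train word vectors. A few days later... I realized I was still stuck at data preprocessing... pretty depressing.
NLP preprocessing is tedious, but it does follow a standard flow, roughly: text extraction, text filtering/cleaning, and word segmentation. Environment used for this post: macOS (local machine) and RedHat (server).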

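1. Text extraction

1.1 JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. Files use the .json extension.
Note: JSON keys and strings must use double quotes, and objects are made up of key-value pairs.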
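
A quick check of the double-quote rule, as a minimal sketch with Python's standard json module (the two sample strings are made up for illustration):

```python
import json

valid = '{"content": "hello"}'      # double-quoted key and string: valid JSON
invalid = "{'content': 'hello'}"    # single quotes: not valid JSON

parsed = json.loads(valid)
print(parsed["content"])

try:
    json.loads(invalid)
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```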

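Text to process:

[Image: sample of the JSON text]

When I viewed the file with less, I mistook it for HTML and tried to extract it that way... which was embarrassing. I found two ways to extract the content body text and strip the HTML tags from it.
① JavaScript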

data.content = JSON.parse(JSON.stringify(data.content).replace(/<\/?.+?\/?>/g,""));

② Python

>>> import json
>>> with open(datapath, 'r') as fp:
...   json_data = json.load(fp)
... 
ValueError: Extra data: line 2 column 1 - line 12244872 column 1 (char 840 - 54306796812)
# Cause: json.load() and json.loads() decode a single JSON document and cannot handle several JSON objects in one file. Iterate over the file with jsonlines instead.

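# Full code: for files that contain multiple JSON objects (one per line)
import jsonlines
import re

def deal_json():
    # "with" closes both files even if a line fails to parse
    with open('example_json', 'r', encoding='utf-8') as fp, \
         open('res.txt', 'w', encoding='utf-8') as t_file:
        for item in jsonlines.Reader(fp):
            if 'content' in item:
                tmp = item['content']   # the body text you need; the key is not necessarily 'content'
                tmp = re.sub(r'<[^>]+>', '', tmp)   # strip HTML tags
                t_file.write(tmp + '\n')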
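
If installing jsonlines is not an option, the same idea can be sketched with only the standard library, assuming each line of the file holds exactly one JSON object. The helper name extract_content and the sample lines are hypothetical, not part of the original project:

```python
import json
import re

TAG_RE = re.compile(r'<[^>]+>')   # same tag pattern as in the code above

def extract_content(lines, key='content'):
    """Yield the tag-stripped value of `key` from each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        item = json.loads(line)       # one JSON object per line
        if isinstance(item, dict) and key in item:
            yield TAG_RE.sub('', item[key])

# Made-up sample lines standing in for the real file
sample = [
    '{"content": "<p>nlp is <b>fun</b></p>", "id": 1}',
    '',
    '{"title": "no content here"}',
    '{"content": "plain text"}',
]
result = list(extract_content(sample))
print(result)
```

Because the function takes any iterable of lines, an open file object can be passed in directly, so the file is never loaded into memory as a whole.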

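1.2 HTML

I haven't actually tried this in practice yet, so here is a reference document:
Extracting topic words from the body content of any web page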

# Test code
from bs4 import BeautifulSoup

with open('example.txt', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.title)   # prints the <title> element
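
To go one step further and pull visible text out of a page, here is a rough sketch that uses the standard-library html.parser instead of BeautifulSoup, so it runs without extra packages. The TextExtractor class, html_to_text function, and sample markup are all made up for illustration, and real pages will need more care (navigation bars, ads, comments):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)

sample = ('<html><head><title>Demo</title><style>p {color: red}</style></head>'
          '<body><p>Hello <b>NLP</b></p></body></html>')
text = html_to_text(sample)
print(text)
```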

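1.3 Extracting a column

Normally, the text a company crawls or the queries it generates are not single-column. They may contain several tab-separated columns, such as a date and a department tag (20200502 \t department \t data).

awk -F '\t' '{print $5}' wiki.txt > wiki_col5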
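
The same column extraction can be sketched in Python, which is handy when it has to live inside a larger script. The function name extract_column and the sample rows are hypothetical. Unlike awk, which prints an empty line when the field is missing, this version skips lines that are too short:

```python
def extract_column(lines, col, sep='\t'):
    """Yield field number `col` (1-based, like awk's $col) from each line."""
    for line in lines:
        fields = line.rstrip('\n').split(sep)
        if len(fields) >= col:        # skip lines with too few fields
            yield fields[col - 1]

# Made-up rows in the "date \t department \t data" layout
sample = [
    '20200502\tdepartment\tdata one\n',
    '20200503\tdepartment\tdata two\n',
    'short line\n',
]
third = list(extract_column(sample, 3))
print(third)
```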

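2. Text filtering (emoji, Martian script, Traditional/Simplified conversion, spaces and blank lines, duplicate lines, full-width/half-width)

2.1 Emoji

Emoji are a fairly important feature in sentiment analysis, but most other tasks can simply ignore them.

Turning emoji into text
Install the emoji package with pip install emoji -i https://mirror.baidu.com/pypi/simple, then convert emoji into text labels: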

>>> import emoji
>>> example = "nlp好简单喔😁nlp好难哦😊我爱nlp😬"
>>> res = emoji.demojize(example)
>>> res
'nlp好简单喔:beaming_face_with_smiling_eyes:nlp好难哦:smiling_face_with_smiling_eyes:我爱nlp:grimacing_face:'

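The quick-and-dirty approach
Keep only Chinese characters, English letters, and digits, based on the Unicode ranges of common Chinese characters, letters, and digits: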

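import re

# Matches anything that is not a common Chinese character, a digit, or an ASCII letter
NON_ZH_EN_NUM = re.compile("[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]")

def keep_zh(data):
    with open(data, 'r', encoding='utf-8') as file, \
         open('data_jtzh_a', 'w', encoding='utf-8') as t_file:
        for line in file:
            line_sub = NON_ZH_EN_NUM.sub("", line)   # this also removes the trailing newline
            if line_sub:                              # skip lines that end up empty
                t_file.write(line_sub + '\n')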

>>> example = "^^nlp好简单喔*0*😁nlp好难哦😊我爱nlp😬>,<치앙+▄┻┳═"
>>> re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])","",example)
'nlp好简单喔0nlp好难哦我爱nlp'

Filtering by emoji code points
For code, see this fellow's write-up: Emoji and abnormal symbols
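
As a rough sketch of what code-point filtering can look like (this is not the code from the linked post): match a few Unicode ranges where most emoji live and delete them. The ranges below are illustrative and not exhaustive, and remove_emoji is a made-up name.

```python
import re

# Illustrative, not exhaustive: ranges that cover most common emoji
EMOJI_RE = re.compile(
    '['
    '\U0001F300-\U0001FAFF'   # Misc Symbols and Pictographs through Symbols and Pictographs Extended-A
    '\u2600-\u27BF'           # Miscellaneous Symbols and Dingbats
    '\uFE0F'                  # variation selector-16 (emoji presentation)
    '\u200D'                  # zero-width joiner used in emoji sequences
    ']+'
)

def remove_emoji(text):
    """Delete characters that fall in the emoji ranges above."""
    return EMOJI_RE.sub('', text)

cleaned = remove_emoji('nlp好简单喔😁nlp好难哦😊我爱nlp😬')
print(cleaned)
```

Unlike the keep-only-Chinese-letters-digits approach, this leaves punctuation and other scripts untouched, which matters if later steps rely on sentence boundaries.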

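2.2 Martian script

I looked at a lot of usernames and many of them use Martian script (火星文), and it's 2020 already..
If anyone knows how to handle it.. please leave a comment and teach me..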

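2.3 Traditional/Simplified conversion

Traditional Chinese characters show up on every platform (Weibo, Douban, Kuaishou, Douyin, and so on). Because they have different code points, a Traditional character is treated as a different character from its Simplified counterpart. The share is small, but it should not be ignored.

Open Chinese Convert (OpenCC) is an open-source project for converting between Simplified and Traditional Chinese, dedicated to building high-quality conversion dictionaries based on statistical corpora. It also provides a library (libopencc), a command-line conversion tool, a manual proofreading tool, a dictionary generator, an online conversion service, and a graphical user interface.

OpenCC can be used to convert text between Traditional and Simplified. Ways to install it:
① Download and unpack
Download: OpenCC download page. Pick a suitable version, unpack it, and run the commands from the install directory.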

opencc -i zhft.txt -o zhjt.txt -c t2s.json   # Traditional to Simplified
opencc -i zhjt.txt -o zhft.txt -c s2t.json   # Simplified to Traditional

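② Install the opencc module for Python
The opencc-python package has quite a few pitfalls and dependency problems; pip install opencc-python-reimplemented avoids them. To be honest, it is a bit slow. I've heard the snownlp module is faster and will try it when I get the chance.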

import opencc

def opencc_deal(data):      # data: path to the Traditional Chinese file, e.g. zhft.txt
    cc = opencc.OpenCC('t2s')   # Traditional -> Simplified
    with open(data, 'r', encoding='utf-8') as file, \
         open('zhjt.txt', 'w', encoding='utf-8') as t_file:
        for line in file:
            t_file.write(cc.convert(line))

③ System-wide install: brew install opencc
The commands are the same as in ①.

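2.4 Spaces and blank lines

Handle these directly in the shell; it is quite fast (sed and awk are great tools, worth learning~).

Spaces

sed 's/ //g' example.txt > example_noblank.txt   # g = global: replace every space on each line

Blank lines

sed -i '/^$/d' example.txt   # delete empty lines only, in place (GNU sed; on macOS use: sed -i '' '/^$/d' example.txt)

Remove every line that is empty or consists only of spaces and tabs

sed '/^[[:space:]]*$/d' example.txt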
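
For cases where the cleanup has to happen inside a Python pipeline, the two sed steps can be sketched as below. The function names and sample lines are made up. One difference worth knowing: str.strip() treats the full-width ideographic space (U+3000) as whitespace as well, so a line containing only that character is also dropped.

```python
def strip_spaces(line):
    """Like sed 's/ //g': remove every ASCII space in the line."""
    return line.replace(' ', '')

def drop_blank_lines(lines):
    """Drop lines that are empty or contain only whitespace."""
    return [line for line in lines if line.strip()]

# Made-up sample: one normal line, an empty line, a whitespace-only line, another normal line
sample = ['a b  c\n', '\n', ' \t \n', 'd\n']
kept = drop_blank_lines(sample)
no_spaces = [strip_spaces(line) for line in kept]
print(no_spaces)
```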

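2.5 Duplicate lines

pandas
I know pandas fairly well, so I took this roundabout route.
My text contains no commas, and pd.read_csv uses a comma as its default separator (sep=','), so the resulting pdfile has a single column, which keeps things simple.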

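import pandas as pd

def delet_repeat(filepath):
    # header=None: treat the first line as data instead of as column names
    pdfile = pd.read_csv(filepath, header=None)
    res = pdfile.drop_duplicates()   # keeps the first occurrence of each line
    res.to_csv('./res.txt', index=False, header=False)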

shell: uniq
uniq only compares adjacent lines, so it is usually combined with sort, which changes the order of the text.

sort aa.txt | uniq > bb.txt
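
When the original line order matters, a small Python sketch avoids the sort step altogether. It relies on dict preserving insertion order (Python 3.7+). The function name and sample lines are made up, and every distinct line is held in memory, so very large files may need a different approach.

```python
def dedupe_keep_order(lines):
    """Drop repeated lines, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(lines))

sample = ['b\n', 'a\n', 'b\n', 'c\n', 'a\n']
unique = dedupe_keep_order(sample)
print(unique)
```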

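2.6 Full-width/half-width

A blind spot for me; I haven't dealt with this yet..
Reference:
Converting between full-width and half-width characters in Python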
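
For reference, a minimal sketch of the usual code-point arithmetic (not taken from the linked article): the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above ASCII '!' to '~', and the ideographic space U+3000 corresponds to the ASCII space. The function names are made up.

```python
def full_to_half(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '！' .. '～'
            code -= 0xFEE0                 # shift down to ASCII '!' .. '~'
        out.append(chr(code))
    return ''.join(out)

def half_to_full(text):
    """Inverse mapping: ASCII space and '!' .. '~' to their full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:
            code = 0x3000
        elif 0x21 <= code <= 0x7E:
            code += 0xFEE0
        out.append(chr(code))
    return ''.join(out)

converted = full_to_half('ＮＬＰ　１２３，ｏｋ！')
print(converted)
```

Note that this also turns the full-width comma into an ASCII comma, while punctuation outside that block, such as 。, is left alone. The standard library's unicodedata.normalize('NFKC', text) performs a similar full-width to half-width folding together with other compatibility mappings, which may or may not be what a given task wants.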

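3. Word segmentation

I won't go into segmentation techniques or how they are implemented; I don't understand them either... I'm happy to just call libraries. The main tool I use is jieba, though it's a bit slow... really slow... My company provides another segmentation method, which I can't disclose here~~~
Segmentation involves two considerations: granularity and stop words.

Granularity

Stop words

The dictionary defines stop words as "function words and non-indexed words in computer retrieval". In SEO, to save storage space and improve search efficiency, search engines automatically ignore certain characters or words when indexing pages or handling queries; these are called stop words. They mainly include:
① Words that appear very frequently but carry little meaning, mainly modal particles, adverbs, prepositions, and conjunctions, e.g. “是”, “的”, “在”, “和”.
② Words that appear too frequently, e.g. “我”, “你”.

NLTK does not support Chinese, but several Chinese stop-word lists have been compiled in China: download link. cn_stopwords.txt is the usual choice. After downloading, build the stop-word list: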

def stopwordslist(filepath):
    # A set gives fast membership tests during filtering
    with open(filepath, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}

Segment and remove stop words in one pass:

import jieba

stopwords = stopwordslist('cn_stopwords.txt')   # the set built by stopwordslist() above

def seg(data):
    with open(data, 'r', encoding='utf-8') as file, \
         open('res.txt', 'w', encoding='utf-8') as t_file:
        for line in file:
            line_seg = " ".join(jieba.cut(line))
            # Drop stop words and single-character tokens
            out = [w for w in line_seg.split() if len(w) > 1 and w not in stopwords]
            t_file.write(" ".join(out) + '\n')
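
To see the filtering rule in isolation, here is a small sketch that works on an already-tokenized list, so it runs without jieba. The stop-word set and tokens are hand-written for illustration and are not real jieba output; filter_tokens is a made-up name.

```python
def filter_tokens(tokens, stopwords, min_len=2):
    """Keep tokens that are at least min_len characters long and not stop words."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if len(tok) >= min_len and tok not in stopwords:
            kept.append(tok)
    return kept

demo_stopwords = {'我们', '的', '是'}        # tiny made-up list for illustration
demo_tokens = ['我们', '的', '任务', '是', '相关性', '分析', ' ']
kept = filter_tokens(demo_tokens, demo_stopwords)
print(kept)
```

The length check removes every single-character token, including ones that are not in the stop-word list, so it is worth confirming that this suits the task before relying on it.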

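Stop words don't always have to be removed, especially at a company whose products have their own distinctive ecosystem. Practice is the best teacher: try it both ways and you'll find out~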