Project location: node2:/home/disk1/xukaituo/expriments/ngram-2016-11/
Step 1. Convert the encoding
iconv -f gbk//IGNORE -t utf-8//IGNORE filename > new_format_file
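For cases where `iconv` is not available, the same conversion can be approximated in Python. This is a minimal sketch for Python 3, not part of the original pipeline; the function name `convert_gbk_to_utf8` is hypothetical, and `errors="ignore"` is assumed to be an acceptable stand-in for the `//IGNORE` suffix (undecodable bytes are silently dropped).

```python
def convert_gbk_to_utf8(src_path, dst_path):
    """Re-encode a GBK text file as UTF-8, dropping bytes that cannot be decoded.

    Hypothetical helper; newline="" passes line endings through unchanged.
    """
    with open(src_path, "r", encoding="gbk", errors="ignore", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Usage would look like `convert_gbk_to_utf8("filename", "new_format_file")`, mirroring the arguments of the `iconv` command above.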
Step 2. Remove non-Chinese characters
#!/usr/bin/env python
# coding: utf-8
"""Strip every character outside U+4E00..U+9FA5 from each line of a UTF-8 file."""
import codecs
import re
import sys


def remove_non_Chinese_word(input_file, output_file):
    # u"" instead of ur"" so the pattern parses under Python 2 and Python 3.3+.
    re_non_chinese = re.compile(u"[^\u4e00-\u9fa5]+")
    with codecs.open(input_file, 'r', 'utf-8') as inputf:
        with codecs.open(output_file, 'w', 'utf-8') as outputf:
            for line in inputf:
                # The newline is non-Chinese too, so it is removed here and re-added below.
                re_result = re_non_chinese.sub(u"", line)
                # new_line = " ".join(re_result)  # alternative: a space between characters
                new_line = re_result
                outputf.write(new_line + u'\n')


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: python 0-filter_non_chinese.py input-file output-file")
        sys.exit(1)
    remove_non_Chinese_word(sys.argv[1], sys.argv[2])
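To see what the filter keeps and drops, the same character class can be tried on a few strings. The snippet below is an illustrative sketch for Python 3; the helper `keep_chinese` and the sample strings are made up for the demonstration.

```python
import re

# Same character class as the script above: anything outside U+4E00..U+9FA5.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Return text with every non-Chinese run removed (hypothetical helper)."""
    return NON_CHINESE.sub("", text)


samples = ["Hello,世界!", "2016年11月 n-gram 语料", "no chinese here"]
for s in samples:
    print(repr(keep_chinese(s)))
# prints '世界', then '年月语料', then ''
```

Full-width punctuation, digits, Latin letters, and spaces all fall outside the range and are removed. A line with no Chinese characters becomes empty, which is why the next step deletes blank lines.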
Step 3. Delete blank lines
sed -i '/^$/d' filename
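If the cleanup needs to run from Python rather than `sed`, a rough equivalent is sketched below (Python 3). It is an assumption-laden illustration: the name `drop_empty_lines` is hypothetical, and unlike `sed -i` it writes to a separate output file instead of editing in place. Like `/^$/d`, it removes only lines that are completely empty; lines containing whitespace are kept.

```python
def drop_empty_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that contain nothing but '\\n'.

    Hypothetical helper; newline="\\n" disables newline translation so that
    only '\\n' terminates a line, as in sed.
    """
    with open(src_path, "r", encoding="utf-8", newline="\n") as src, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            if line.rstrip("\n") != "":
                dst.write(line)
```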
Step 4. Word segmentation
Use the LTP word segmentation tool.
[1] GitHub: https://github.com/HIT-SCIR/ltp
[2] Documentation: http://ltp.readthedocs.io/zh_CN/latest/api.html#id2
[3] Models: https://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569
Part of the bash script:
cd /home/disk1/xukaituo/expriments/ngram-2016-11/utils
CWSTOOL=/home/disk1/xukaituo/projects/Chinese-word-segmentation
1-Chinese-word-segmentor/cws ${CWSTOOL}/ltp_data/cws.model $2 $3
The segmentation program that calls the LTP API:
// cws.cc
// Copyright 2016 ASLP(Author: Kaituo Xu)
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "segment_dll.h"

int main(int argc, char *argv[])
{
    try {
        if (argc < 4) {
            std::cerr << "cws [model path] [input file path] [output file path]" << std::endl;
            return 1;
        }
        // Load the LTP segmentation model and open the input and output files.
        void *engine = segmentor_create_segmentor(argv[1]);
        std::ifstream input(argv[2]);
        // Append mode: running again on the same output file adds to it.
        std::ofstream output(argv[3], std::ofstream::app);
        if (!engine || !input || !output) {
            std::cerr << "cws: failed to load the model or open a file" << std::endl;
            if (engine) segmentor_release_segmentor(engine);
            return -1;
        }
        std::string line;
        while (std::getline(input, line)) {
            std::vector<std::string> words;
            int len = segmentor_segment(engine, line, words);
            // Write the words of one input line, separated by spaces.
            for (int i = 0; i < len; ++i) {
                output << words[i] << " ";
            }
            output << std::endl;
        }
        segmentor_release_segmentor(engine);
        return 0;
    } catch(const std::exception &e) {
        std::cerr << e.what();
        return -1;
    }
}
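Step 5. Compress data that is not needed for now to save disk space
# Compress a file with `gzip`
gzip <filename>
# Decompress
gzip -d <filename>.gz
After compression the original file is removed; by default the compressed file is named <filename>.gz. After decompression the .gz file is removed.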
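The same behaviour (replace the file with a `.gz` copy, and the reverse) can be reproduced with Python's standard `gzip` module when the compression has to happen inside a script. This is a minimal sketch for Python 3; the function names `gzip_file` and `gunzip_file` are hypothetical and not part of the original project.

```python
import gzip
import os
import shutil


def gzip_file(path):
    """Compress path to path + '.gz' and delete the original, like `gzip path`."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path


def gunzip_file(gz_path):
    """Decompress a '.gz' file and delete it, like `gzip -d path.gz`."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz: %r" % gz_path)
    path = gz_path[:-len(".gz")]
    with gzip.open(gz_path, "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return path
```

Each function removes its input only after the copy has finished, so a failure midway leaves the source file in place.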