1. Text Preprocessing
Text preprocessing is the task of converting a raw text file, essentially a sequence of digital bits, into a well-defined sequence of linguistically meaningful units.
Text preprocessing is a basic step in NLP. Its main job is to identify characters, words, and sentences. It can be divided into two stages: document triage and text segmentation.
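Document triage converts a file into well-defined text. It consists of three steps:
Step 1: Character encoding identification
Step 2: Language identification
Step 3: Text sectioning: identify the useful body of the text and remove unneeded elements such as figures, tables, links, and HTML tags.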
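The first and third steps can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the candidate encoding list and the helper names `guess_encoding` and `strip_markup` are assumptions made for this example, and language identification (Step 2) is left out because it normally needs a trained model.

```python
from html.parser import HTMLParser

# Candidate encodings tried in order; this list is an assumption for the example.
# latin-1 accepts any byte sequence, so it acts as a last-resort fallback.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "latin-1")


def guess_encoding(raw: bytes) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in CANDIDATE_ENCODINGS:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of script/style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_markup(html: str) -> str:
    """Text sectioning, reduced to its simplest form: drop tags, keep text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Trial decoding is a crude heuristic: several encodings can decode the same bytes without error, so real systems usually combine it with statistical detection.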
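Text segmentation converts the text into words and sentences. It consists of the following parts.
1) Word segmentation, also called tokenization: splitting the text into words.
2) Text normalization: mapping variants such as "Mr.", "Mr", "mister", and "Mister" to a single form.
3) Sentence segmentation: splitting the text into sentences.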
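The three parts above can be sketched with simple rules. The abbreviation set, the normalization table, and the function names below are assumptions chosen for illustration; practical segmenters rely on much larger resources or trained models.

```python
import re

# Hypothetical resources for this example only.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}
NORMALIZATION = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}


def split_sentences(text):
    """End a sentence at '.', '!' or '?' unless the word is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split on whitespace, then separate punctuation from words."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep "Mr." as one token
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def normalize(tokens):
    """Map known variants to one canonical form; leave other tokens unchanged."""
    return [NORMALIZATION.get(token.lower(), token) for token in tokens]
```

For example, `split_sentences("Mr. Claus delivers toys. Mister Claus is busy!")` yields two sentences, because the period after "Mr." is not treated as a sentence boundary.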
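2. Lexical Analysis
A basic task of lexical analysis is to relate morphological variants to their lemma that lies in a lemma dictionary bundled up with its invariant semantic and syntactic information.
In other words, a basic task of lexical analysis is lemmatization based on a lemma dictionary: for example, the variants {delivers, deliver, delivering, delivered} all map to the lemma "deliver".
Part-of-speech tagging is another important application of lexical analysis; its output is commonly used as the input to subsequent syntactic parsing.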
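A lemma dictionary can be pictured as a lookup table. The sketch below is a toy version: the `LemmaEntry` type, the two tables, and the single entry they contain are invented for illustration and do not come from any real lexicon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LemmaEntry:
    """A lemma bundled with the information that stays the same across variants."""
    lemma: str
    pos: str  # part of speech shared by all variants in this toy example


# Hypothetical lemma dictionary with a single entry.
LEMMA_DICTIONARY = {"deliver": LemmaEntry(lemma="deliver", pos="VERB")}

# Morphological variants mapped to their lemma.
VARIANT_TO_LEMMA = {
    "deliver": "deliver",
    "delivers": "deliver",
    "delivering": "deliver",
    "delivered": "deliver",
}


def lemmatize(word: str) -> Optional[LemmaEntry]:
    """Return the dictionary entry for `word`, or None if the word is unknown."""
    lemma = VARIANT_TO_LEMMA.get(word.lower())
    return LEMMA_DICTIONARY[lemma] if lemma is not None else None
```

A table lookup only covers listed forms; real lemmatizers add morphological rules so that unseen variants can still be resolved.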
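3. Syntactic Parsing
Basic techniques for grammar-driven natural language parsing, that is, analyzing a string of words (typically a sentence) to determine its structural description according to a formal grammar.
Syntactic parsing, a grammar-driven analysis of sentences, covers two tasks: phrase structure parsing and dependency parsing.
Phrase structure parsing aims to divide a sentence into its structural units (constituents).
Dependency parsing aims to uncover the grammatical dependency relations between words, such as subject and predicate.
The figure below shows the difference between the two tasks.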
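The two kinds of output can also be compared as data structures. The sketch below writes both analyses of one sentence by hand; the labels are illustrative, the name "Santa Claus" is kept as one token for brevity, and neither structure is produced by a real parser.

```python
# Phrase structure: a nested tuple (label, child, child, ...); strings are words.
TREE = (
    "S",
    ("NP", "Santa Claus"),
    ("VP",
     ("V", "delivers"),
     ("NP", "toy"),
     ("PP", ("P", "to"), ("NP", "Child"))),
)

# Dependency structure: (head, relation, dependent) arcs between words.
ARCS = [
    ("delivers", "nsubj", "Santa Claus"),
    ("delivers", "obj", "toy"),
    ("delivers", "obl", "Child"),
    ("Child", "case", "to"),
]


def leaves(node):
    """Return the words under a tree node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def constituents(node):
    """List (label, phrase) for every labeled node, top-down."""
    if isinstance(node, str):
        return []
    found = [(node[0], " ".join(leaves(node)))]
    for child in node[1:]:
        found.extend(constituents(child))
    return found


def dependents(arcs, head):
    """Return (relation, dependent) pairs attached directly to `head`."""
    return [(rel, dep) for h, rel, dep in arcs if h == head]
```

The tree groups words into nested phrases such as the verb phrase "delivers toy to Child", while the arcs link individual words to the word they depend on.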
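Shallow syntactic parsing identifies the main components of a sentence, such as subject, predicate, and object.
A chunker is a method for dividing a sentence into such parts, here based on dependency parsing.
e.g. Santa Claus delivers toy to Child. This sentence can be divided as follows.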
Action: delivers toy to Child
Initiating Actor: Santa Claus
Business Entity: toy
Responding Actor: Child
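One way such a division could be derived from a dependency parse is sketched below. The parse is written by hand, and the mapping from dependency relations to the four role names is an assumption made for this example, not a description of any particular chunker.

```python
# Sentence tokens in order; "Santa Claus" is treated as one token for brevity.
TOKENS = ["Santa Claus", "delivers", "toy", "to", "Child"]

# Hand-written dependency parse as (dependent, relation, head) triples.
PARSE = [
    ("Santa Claus", "nsubj", "delivers"),
    ("toy", "obj", "delivers"),
    ("Child", "obl", "delivers"),
    ("to", "case", "Child"),
]

# Assumed mapping from dependency relation to role name.
ROLE_BY_RELATION = {
    "nsubj": "Initiating Actor",
    "obj": "Business Entity",
    "obl": "Responding Actor",
}


def subtree(parse, head):
    """Return `head` plus every token that depends on it, directly or indirectly."""
    members = {head}
    changed = True
    while changed:
        changed = False
        for dependent, _, parent in parse:
            if parent in members and dependent not in members:
                members.add(dependent)
                changed = True
    return members


def extract_roles(tokens, parse, verb):
    """Assign role names to the direct dependents of `verb`.

    Tokens are matched by string, so this sketch assumes no repeated words.
    """
    roles = {}
    for dependent, relation, head in parse:
        if head == verb and relation in ROLE_BY_RELATION:
            roles[ROLE_BY_RELATION[relation]] = dependent
    # The action is the verb's subtree without the subject's subtree.
    subject = roles.get("Initiating Actor")
    excluded = subtree(parse, subject) if subject else set()
    kept = subtree(parse, verb) - excluded
    roles["Action"] = " ".join(token for token in tokens if token in kept)
    return roles
```

Calling `extract_roles(TOKENS, PARSE, "delivers")` reproduces the four labels listed above for this one sentence.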
4. Semantic Analysis
In the first edition of the Handbook of Natural Language Processing (2000), Poesio gave the following definition of semantic analysis: The ultimate goal, for humans as well as natural language-processing (NLP) systems, is to understand the utterance, which, depending on the circumstances, may mean incorporating information provided by the utterance into one’s own knowledge base or, more in general performing some action in response to it. ‘Understanding’ an utterance is a complex process, that depends on the results of parsing, as well as on lexical information, context, and commonsense reasoning. . .
To be continued.