NLP: From Getting Started to Getting Buried
What is NLP
Making wheels: reinventing the wheel
In the wheel-building craze of the big front-end era, every company rolled its own wheels, producing a lot of duplicated code.
The ML field is better, but not by much. There are plenty of data and method libraries ready to use: nobody hand-writes deep networks in raw TensorFlow anymore, and the arrival of AutoKeras has even made network tuning nearly foolproof.
The knowledge you need:
Basic Python knowledge
A preliminary understanding of how neural networks work, such as gradient descent.
Recommended courses:
CS224n: Natural Language Processing with Deep Learning. Stanford University's very famous NLP course. Because of Covid-19 it moved online, which even changed the professor's never-changing keynote style 😄. A very detailed and systematic course, from ML basics to the math formulas, with very detailed notes. However, the formulas and algorithms are quite "mathematical" and can be obscure for students without the background.
Deep Learning for Human Language Processing (2020, Spring). Taught by HUNG-YI LEE (Li Hong Yi), National Taiwan University's famous stand-up-comic lecturer; thanks to his humorous and easy-to-understand lectures, the course has a high view count on YouTube.
Machine Learning (2021, Spring). Also taught by Hung-Yi Lee, and includes an introduction to ML. The two courses have overlapping material that can be skipped as appropriate.
Representing words
Representing images
For images, a grayscale image is simply a matrix to us.
An RGB image is a three-channel matrix.
Various kinds of image processing just stack "buffs" on top of this matrix: convolution, filters, Gaussian and Fourier transforms... So how can the words of human language be made understandable to a machine?
How do we have usable meaning in a computer?
Take a simple example: deciding a word's part of speech, verb or noun.
In machine-learning terms, we have a set of samples (x, y), where x is a word and y is its part of speech, and we want to build a mapping f(x) -> y. But the mathematical model f here (a neural network, an SVM, ...) only accepts numeric input.
Words in NLP, however, are human abstractions in symbolic form (Chinese, English, Latin, and so on), so we need to convert them into numeric form, in other words, embed them into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one kind of word embedding.
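To make "embedding words into a mathematical space" concrete, here is a minimal sketch using gensim's Word2Vec on a toy corpus (the corpus and the hyperparameters are illustrative assumptions, not part of these notes):

```python
# Minimal word-embedding sketch with gensim (pip install gensim).
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

# vector_size = embedding dimension, window = context size,
# min_count=1 keeps every word in this tiny corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

print(model.wv["nlp"].shape)          # (50,) -- "nlp" is now a dense numeric vector
print(model.wv.most_similar("nlp"))   # nearest neighbours in the embedding space
```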
- WordNet A thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).
WordNet was developed for two purposes: to produce a combination of dictionary and thesaurus that is more intuitive to use, and to support automatic text analysis and artificial-intelligence applications.
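A quick way to poke at WordNet is through NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# Browsing WordNet synsets and hypernyms via NLTK.
# One-time setup: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good"
for syn in wn.synsets("good")[:3]:
    print(syn.name(), "->", syn.definition())

# Hypernyms, i.e. "is a" relationships: a motel is a kind of hotel
motel = wn.synsets("motel")[0]
print(motel.hypernyms())
```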
Representing words as discrete symbols: one-hot vectors:
```
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
```
Vector dimension = number of words in the vocabulary (e.g., 500,000). That is far too much data.
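A minimal NumPy sketch of one-hot vectors over a toy vocabulary (a real vocabulary would have hundreds of thousands of entries). Note that any two one-hot vectors are orthogonal, so they encode no notion of similarity between "motel" and "hotel":

```python
import numpy as np

vocab = ["motel", "hotel", "cat", "dog"]            # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("motel"))                             # [1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))          # 0.0 -- orthogonal, no similarity
```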
- Word vectors: word vectors are also called word embeddings or (neural) word representations. They are a distributed representation.
How do we create them?
Count word co-occurrences within a window of a pre-specified size, and use the counts of the words co-occurring around a word as that word's vector. Concretely, we define word representations by building a co-occurrence matrix from a large text corpus.
For example, given the corpus:
I like deep learning. I like NLP. I enjoy flying.
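A minimal sketch of the window-based co-occurrence matrix for exactly this corpus (window size 1; the tokenization and indexing scheme are illustrative choices):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [sentence.split() for sentence in corpus]

vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# M[i, j] counts how often word j appears next to word i (window size 1).
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

print(vocab)
print(M)   # the row for "like" counts its neighbours: "I" twice, "deep", "NLP"
```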
NLP Tasks
|  | One Sequence | Multiple Sequences |
| --- | --- | --- |
| One Class | Sentiment Classification, Stance Detection, Veracity Prediction, Intent Classification, Dialogue Policy | NLI, Search Engine, Relation Extraction |
| Class for each Token | POS Tagging, Word Segmentation, Extractive Summarization, Slot Filling, NER |  |
| Copy from Input | Extractive QA |  |
| General Sequence | Abstractive Summarization, Translation, Grammar Correction, NLG | General QA, Task-Oriented Dialogue, Chatbot |
| Other? | Parsing, Coreference Resolution |  |
Part-of-Speech (POS) Tagging
Word Segmentation
This is the problem of how to segment text, which is especially tricky with subordinate clauses in English and stacked attributives in Chinese.
A classic Chinese couplet shows how segmentation flips the meaning. Unsegmented:
酿酒缸缸好造醋坛坛酸
养猪大如山老鼠只只死
Read as a blessing:
酿酒缸缸好，造醋坛坛酸 (every vat of brewed wine turns out well; every jar of vinegar is nicely sour)
养猪大如山，老鼠只只死 (the pigs grow big as mountains; the rats all die)
Read as a curse:
酿酒缸缸好造醋，坛坛酸 (every vat of wine is only good for making vinegar; every jar is sour)
养猪大如山老鼠，只只死 (the pigs grow as big as mountain rats, and die one by one)
News headline: 佟大为妻子产下一女 (actor 佟大为 / Tong Dawei's wife gave birth to a girl)
Comment: 这个佟大是谁?真了不起,太厉害了!! (Who is this "Tong Da"? Truly remarkable, amazing!!) The commenter mis-segmented 佟大为 / 妻子 ("Tong Dawei" / "wife") as 佟大 / 为妻子 ("Tong Da" / "for his wife"), reading it as a man giving birth for his wife.
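For Chinese, an off-the-shelf segmenter such as jieba shows what a good split looks like (jieba and its default dictionary are an assumption here; the exact output can vary by version):

```python
# Chinese word segmentation with jieba (pip install jieba).
import jieba

print(jieba.lcut("佟大为妻子产下一女"))
# A correct reading needs ["佟大为", "妻子", ...] (the actor Tong Dawei),
# not the commenter's parse ["佟大", "为", "妻子", ...] ("Tong Da, for his wife").
```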
Parsing
Summarization
- Extractive summarization
The simplest solution: treat it as a binary classification problem, deciding for each sentence whether it should be added to the summary. This is like the summaries we wrote as kids: the teacher asked us to summarize the text, and we just copied out a couple of sentences.
But this often does not give the best result: if two sentences mean nearly the same thing, feeding them in one at a time is not enough.
With deep learning we can take the whole document into account: input all the sentences together, run a bidirectional LSTM or a Transformer over them, and output a binary label for each sentence, in the summary or not (see the sketch below).
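A minimal PyTorch sketch of that idea: pre-encoded sentence vectors go through a bidirectional LSTM so every sentence sees the whole document, then each sentence gets a keep/drop logit (all names and dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ExtractiveSummarizer(nn.Module):
    def __init__(self, sent_dim: int = 256, hidden: int = 128):
        super().__init__()
        # Contextualize each sentence vector against the rest of the document.
        self.encoder = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, 1)   # one keep/drop logit per sentence

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, sent_dim)
        ctx, _ = self.encoder(sent_embs)
        return self.scorer(ctx).squeeze(-1)      # (batch, n_sentences) logits

doc = torch.randn(1, 10, 256)                    # 10 pre-encoded sentences
probs = torch.sigmoid(ExtractiveSummarizer()(doc))
print(probs)                                     # keep sentences with prob > 0.5
```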
- Abstractive summarization
The machine needs to write the summary in its own words rather than copying the original text directly. Solution: a Seq2Seq problem, long sequence -> short sequence.
One problem: the article may contain technical terms or well-phrased sentences that were perfectly fine as written, yet the model insists on paraphrasing them and garbles the meaning.
When summarizing we usually still want some of the original wording, so we can encourage the network to have a copy ability instead of putting everything into its own words.
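To just try the Seq2Seq formulation, the Hugging Face pipeline wraps a pretrained summarizer (the checkpoint name is an assumption; any seq2seq summarization model works):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Extractive summarization copies sentences from the source document. "
    "Abstractive summarization instead generates new sentences, which lets the "
    "model compress and rephrase, but risks garbling terms that were fine as written."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```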
- Machine Translation
There are about 7,000 languages, each with tens of thousands of words; translating blindly between every pair would take on the order of 7,000 squared models.
Unsupervised learning!
- Grammar Error Correction
Seq2seq: we can simply feed the model data and brute-force train it.
A more advanced formulation works token -> token and computes the difference, for example with 3 options: C for copy, R for replace, A for append (see the sketch below).
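A toy sketch of that token-level edit-tag idea: align source and corrected tokens and emit C/R/A tags. Real taggers (e.g. GECToR) use learned alignments and a richer tag set; this greedy alignment is a deliberately naive assumption:

```python
def edit_tags(source, target):
    """Greedy per-token tags: C = copy, R = replace, A = append.
    (Deletions are ignored in this toy version.)"""
    tags, i = [], 0
    for s in source:
        if i < len(target) and s == target[i]:
            tags.append(("C", s))
        elif i < len(target):
            tags.append(("R", target[i]))       # source token replaced
        i += 1
    for t in target[i:]:                        # leftover target tokens
        tags.append(("A", t))
    return tags

print(edit_tags("he go to school".split(), "he goes to school every day".split()))
# [('C', 'he'), ('R', 'goes'), ('C', 'to'), ('C', 'school'), ('A', 'every'), ('A', 'day')]
```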
Sentiment Classification
Judging sentiment: ad targeting, gauging a film's word of mouth, classifying bullish or bearish stock news, tracking weekly buzz in the crypto community, and so on. A quick demo follows.
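The task itself is a one-sequence, one-class problem, so an off-the-shelf classifier is enough to demo it (the default checkpoint pulled by the pipeline is an assumption):

```python
from transformers import pipeline

clf = pipeline("sentiment-analysis")
print(clf("This movie is a masterpiece, the word of mouth is glowing."))
# -> e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```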
- Stance Detection
A new kind of opinion polling: profiling voters who are unwilling to state their position, and targeting election ads. (Bilibili's "Avalon" system is an example.)
Source: "Trump is a good president." Reply: "He is just a capitalist." What is this commenter's stance? -> Deny. Many systems classify replies with the support, deny, query, and comment labels (the four SDQC classes).
- Natural Language Inference (NLI)
Given a premise, judge whether a hypothesis follows from it. The output is one of three classes:
Contradiction
Entailment
Neutral
Premise: "a green triangle" ->? Hypothesis: "the sum of two sides is greater than the third side". Input: premise + hypothesis -> output: entailment (a minimal sketch follows).
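A minimal sketch of NLI with a pretrained MNLI cross-encoder via the Hugging Face pipeline (the checkpoint name and the exact output format are assumptions and vary by version):

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

# Premise and hypothesis are fed as one paired input.
result = nli({"text": "A green triangle.",
              "text_pair": "The sum of two of its sides is greater than the third."})
print(result)   # label among CONTRADICTION / NEUTRAL / ENTAILMENT, plus a score
```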
- Search engine
A BERT-based search engine can be simplified as:
2 inputs: search query + document content -> model -> relevance (a minimal sketch follows)
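A minimal relevance-scoring sketch in that two-input style, using sentence-transformers' CrossEncoder (the MS MARCO checkpoint is an assumption; any query-document cross-encoder works):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
score = model.predict([(
    "how do word embeddings work",                        # search query
    "Word2vec learns dense word vectors from raw text.",  # candidate document
)])
print(score)   # higher score = query and document judged more relevant
```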
- Question Answering (QA) system
The traditional approach is a huge pipeline that includes simple models such as SVMs. Input: question & knowledge source -> QA model -> answer.
Reading comprehension / extractive QA: general QA is still too hard to realize today; current networks only reach reading-comprehension level, outputting the answer as a span of the original text, e.g. (1_7-11: words 7-11 of paragraph 1). If general QA were ever achieved, it would be the birth of an oracle. A sketch of extractive QA follows.
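A minimal extractive-QA sketch with the Hugging Face pipeline (the checkpoint is an assumption). Note the model literally returns a span of the passage, matching the "(paragraph, word range)" formulation above:

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What does an extractive QA model return?",
    context="An extractive QA model does not write free-form text; "
            "it returns a span of the original passage as the answer.",
)
print(result)   # {'score': ..., 'start': ..., 'end': ..., 'answer': '...'}
```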
Chatting
Awkward chit-chat. It's... just awkward chit-chat.
Task-oriented
A task-oriented dialogue system is typically a pipeline of:
- Natural Language Understanding (NLU)
- Policy & State Tracker
- Natural Language Generation (NLG)
Network
BERT:
Bert is a character from Sesame Street; everyone keeps contriving acronyms for network methods that spell out Sesame Street characters. BERT and RNNs will be covered in the next note.
LSTM: will be introduced in the next sharing session.
Final
Quote from "Statistical approach to speech" by Prof. Keiichi Tokuda at Interspeech 2019 (the line is originally attributed to Frederick Jelinek):
"Every time I fire a linguist, the performance of the speech recognizer goes up."