NLP: From Getting Started to Getting Buried
What is NLP
Making wheels: reinventing the wheel
In the wheel-building craze of the big front-end era, every company rolled its own wheels, producing a lot of duplicated code.
The ML field is better, but not by much. There are plenty of data and method libraries ready to use: nobody hand-writes deep networks in raw TensorFlow anymore, and the arrival of AutoKeras has even made network tuning nearly foolproof.
The knowledge you need:
Basic Python knowledge
A preliminary understanding of how neural networks work, such as gradient descent.
Recommended courses:
CS224n: Natural Language Processing with Deep Learning. Stanford University's very famous NLP course. Because of Covid-19 it moved online, which even changed the professor's never-changing keynote style 😄. A very detailed and systematic course, from ML basics to the math formulas, with very detailed notes. However, the formulas and algorithms are quite "mathematical" and can be obscure for students without the background.
Deep Learning for Human Language Processing (2020, Spring). Taught by HUNG-YI LEE (Li Hong Yi), National Taiwan University's famous stand-up-comic lecturer; thanks to his humorous and easy-to-understand lectures, the course has a high view count on YouTube.
Machine Learning (2021, Spring). Also taught by Hung-Yi Lee, and includes an introduction to ML. The two courses have overlapping material that can be skipped as appropriate.
Representing words
Representing images
For images, a grayscale image is simply a matrix to us.
An RGB image is a three-channel matrix.
Various kinds of image processing just stack "buffs" on top of this matrix: convolution, filters, Gaussian and Fourier transforms... So how can the words of human language be made understandable to a machine?
How do we have usable meaning in a computer?
Take a simple example: deciding a word's part of speech, verb or noun.
In machine-learning terms, we have a set of samples (x, y), where x is a word and y is its part of speech, and we want to build a mapping f(x) -> y. But the mathematical model f here (a neural network, an SVM, ...) only accepts numeric input.
Words in NLP, however, are human abstractions in symbolic form (Chinese, English, Latin, and so on), so we need to convert them into numeric form, in other words, embed them into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one kind of word embedding.
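To make "embedding words into a mathematical space" concrete, here is a minimal sketch using gensim's Word2Vec on a toy corpus (the corpus and the hyperparameters are illustrative assumptions, not part of these notes):

```python
# Minimal word-embedding sketch with gensim (pip install gensim).
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["i", "like", "deep", "learning"],
    ["i", "like", "nlp"],
    ["i", "enjoy", "flying"],
]

# vector_size = embedding dimension, window = context size,
# min_count=1 keeps every word in this tiny corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

print(model.wv["nlp"].shape)          # (50,) -- "nlp" is now a dense numeric vector
print(model.wv.most_similar("nlp"))   # nearest neighbours in the embedding space
```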
- WordNet A thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).
WordNet was developed for two purposes: to produce a combination of dictionary and thesaurus that is more intuitive to use, and to support automatic text analysis and artificial-intelligence applications.
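A quick way to poke at WordNet is through NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# Browsing WordNet synsets and hypernyms via NLTK.
# One-time setup: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for "good"
for syn in wn.synsets("good")[:3]:
    print(syn.name(), "->", syn.definition())

# Hypernyms, i.e. "is a" relationships: a motel is a kind of hotel
motel = wn.synsets("motel")[0]
print(motel.hypernyms())
```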
Representing words as discrete symbols: one-hot vectors:
```
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
```
Vector dimension = number of words in the vocabulary (e.g., 500,000). That is far too much data.
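A minimal NumPy sketch of one-hot vectors over a toy vocabulary (a real vocabulary would have hundreds of thousands of entries). Note that any two one-hot vectors are orthogonal, so they encode no notion of similarity between "motel" and "hotel":

```python
import numpy as np

vocab = ["motel", "hotel", "cat", "dog"]            # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("motel"))                             # [1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))          # 0.0 -- orthogonal, no similarity
```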
- Word vectors: word vectors are also called word embeddings or (neural) word representations. They are a distributed representation.
How do we create them?
Count word co-occurrences within a window of a pre-specified size, and use the counts of the words co-occurring around a word as that word's vector. Concretely, we define word representations by building a co-occurrence matrix from a large text corpus.
For example, given the corpus:
I like deep learning. I like NLP. I enjoy flying.
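A minimal sketch of the window-based co-occurrence matrix for exactly this corpus (window size 1; the tokenization and indexing scheme are illustrative choices):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [sentence.split() for sentence in corpus]

vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# M[i, j] counts how often word j appears next to word i (window size 1).
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

print(vocab)
print(M)   # the row for "like" counts its neighbours: "I" twice, "deep", "NLP"
```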
NLP Tasks
|  | One Sequence | Multiple Sequences |
| --- | --- | --- |
| One Class | Sentiment Classification, Stance Detection, Veracity Prediction, Intent Classification, Dialogue Policy | NLI, Search Engine, Relation Extraction |
| Class for each Token | POS Tagging, Word Segmentation, Extractive Summarization, Slot Filling, NER |  |
| Copy from Input | Extractive QA |  |
| General Sequence | Abstractive Summarization, Translation, Grammar Correction, NLG | General QA, Task-Oriented Dialogue, Chatbot |
| Other? | Parsing, Coreference Resolution |  |
Part-of-Speech (POS) Tagging
Word Segmentation
This is the problem of how to segment text, which is especially tricky with subordinate clauses in English and stacked attributives in Chinese.
A classic Chinese couplet shows how segmentation flips the meaning. Unsegmented:
酿酒缸缸好造醋坛坛酸
养猪大如山老鼠只只死
Read as a blessing:
酿酒缸缸好，造醋坛坛酸 (every vat of brewed wine turns out well; every jar of vinegar is nicely sour)
养猪大如山，老鼠只只死 (the pigs grow big as mountains; the rats all die)
Read as a curse:
酿酒缸缸好造醋，坛坛酸 (every vat of wine is only good for making vinegar; every jar is sour)
养猪大如山老鼠，只只死 (the pigs grow as big as mountain rats, and die one by one)
News headline: 佟大为妻子产下一女 (actor 佟大为 / Tong Dawei's wife gave birth to a girl)
Comment: 这个佟大是谁?真了不起,太厉害了!! (Who is this "Tong Da"? Truly remarkable, amazing!!) The commenter mis-segmented 佟大为 / 妻子 ("Tong Dawei" / "wife") as 佟大 / 为妻子 ("Tong Da" / "for his wife"), reading it as a man giving birth for his wife.
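For Chinese, an off-the-shelf segmenter such as jieba shows what a good split looks like (jieba and its default dictionary are an assumption here; the exact output can vary by version):

```python
# Chinese word segmentation with jieba (pip install jieba).
import jieba

print(jieba.lcut("佟大为妻子产下一女"))
# A correct reading needs ["佟大为", "妻子", ...] (the actor Tong Dawei),
# not the commenter's parse ["佟大", "为", "妻子", ...] ("Tong Da, for his wife").
```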
Parsing
Summarization
- Extractive summarization
The simplest solution: treat it as a binary classification problem, deciding for each sentence whether it should be added to the summary. This is like the summaries we wrote as kids: the teacher asked us to summarize the text, and we just copied out a couple of sentences.
But this often does not give the best result: if two sentences mean nearly the same thing, feeding them in one at a time is not enough.
With deep learning we can take the whole document into account: input all the sentences together, run a bidirectional LSTM or a Transformer over them, and output a binary label for each sentence, in the summary or not (see the sketch below).
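A minimal PyTorch sketch of that idea: pre-encoded sentence vectors go through a bidirectional LSTM so every sentence sees the whole document, then each sentence gets a keep/drop logit (all names and dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ExtractiveSummarizer(nn.Module):
    def __init__(self, sent_dim: int = 256, hidden: int = 128):
        super().__init__()
        # Contextualize each sentence vector against the rest of the document.
        self.encoder = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden, 1)   # one keep/drop logit per sentence

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, sent_dim)
        ctx, _ = self.encoder(sent_embs)
        return self.scorer(ctx).squeeze(-1)      # (batch, n_sentences) logits

doc = torch.randn(1, 10, 256)                    # 10 pre-encoded sentences
probs = torch.sigmoid(ExtractiveSummarizer()(doc))
print(probs)                                     # keep sentences with prob > 0.5
```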
- Abstractive summarization
The machine needs to write the summary in its own words rather than copying the original text directly. Solution: a Seq2Seq problem, long sequence -> short sequence.
One problem: the article may contain technical terms or well-phrased sentences that were perfectly fine as written, yet the model insists on paraphrasing them and garbles the meaning.
When summarizing we usually still want some of the original wording, so we can encourage the network to have a copy ability instead of putting everything into its own words.
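To just try the Seq2Seq formulation, the Hugging Face pipeline wraps a pretrained summarizer (the checkpoint name is an assumption; any seq2seq summarization model works):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Extractive summarization copies sentences from the source document. "
    "Abstractive summarization instead generates new sentences, which lets the "
    "model compress and rephrase, but risks garbling terms that were fine as written."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```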
- Machine Translation
There are about 7,000 languages, each with tens of thousands of words; translating blindly between every pair would take on the order of 7,000 squared models.
Unsupervised learning!
- Grammar Error Correction
Seq2seq: we can simply feed the model data and brute-force train it.
A more advanced formulation works token -> token and computes the difference, for example with 3 options: C for copy, R for replace, A for append (see the sketch below).
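A toy sketch of that token-level edit-tag idea: align source and corrected tokens and emit C/R/A tags. Real taggers (e.g. GECToR) use learned alignments and a richer tag set; this greedy alignment is a deliberately naive assumption:

```python
def edit_tags(source, target):
    """Greedy per-token tags: C = copy, R = replace, A = append.
    (Deletions are ignored in this toy version.)"""
    tags, i = [], 0
    for s in source:
        if i < len(target) and s == target[i]:
            tags.append(("C", s))
        elif i < len(target):
            tags.append(("R", target[i]))       # source token replaced
        i += 1
    for t in target[i:]:                        # leftover target tokens
        tags.append(("A", t))
    return tags

print(edit_tags("he go to school".split(), "he goes to school every day".split()))
# [('C', 'he'), ('R', 'goes'), ('C', 'to'), ('C', 'school'), ('A', 'every'), ('A', 'day')]
```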
Sentiment Classification
Judging sentiment: ad targeting, gauging a film's word of mouth, classifying bullish or bearish stock news, tracking weekly buzz in the crypto community, and so on. A quick demo follows.
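The task itself is a one-sequence, one-class problem, so an off-the-shelf classifier is enough to demo it (the default checkpoint pulled by the pipeline is an assumption):

```python
from transformers import pipeline

clf = pipeline("sentiment-analysis")
print(clf("This movie is a masterpiece, the word of mouth is glowing."))
# -> e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```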
- Stance Detection
A new kind of opinion polling: profiling voters who are unwilling to state their position, and targeting election ads. (Bilibili's "Avalon" system is an example.)
Source: "Trump is a good president." Reply: "He is just a capitalist." What is this commenter's stance? -> Deny. Many systems classify replies with the support, deny, query, and comment labels (the four SDQC classes).
- Natural Language Inference (NLI)
Given a premise, judge whether a hypothesis follows from it. The output is one of three classes:
Contradiction
Entailment
Neutral
Premise: "a green triangle" ->? Hypothesis: "the sum of two sides is greater than the third side". Input: premise + hypothesis -> output: entailment (a minimal sketch follows).
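A minimal sketch of NLI with a pretrained MNLI cross-encoder via the Hugging Face pipeline (the checkpoint name and the exact output format are assumptions and vary by version):

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

# Premise and hypothesis are fed as one paired input.
result = nli({"text": "A green triangle.",
              "text_pair": "The sum of two of its sides is greater than the third."})
print(result)   # label among CONTRADICTION / NEUTRAL / ENTAILMENT, plus a score
```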
- Search engine
A BERT-based search engine can be simplified as:
2 inputs: search query + document content -> model -> relevance (a minimal sketch follows)
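A minimal relevance-scoring sketch in that two-input style, using sentence-transformers' CrossEncoder (the MS MARCO checkpoint is an assumption; any query-document cross-encoder works):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
score = model.predict([(
    "how do word embeddings work",                        # search query
    "Word2vec learns dense word vectors from raw text.",  # candidate document
)])
print(score)   # higher score = query and document judged more relevant
```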
- Question Answering (QA) system
The traditional approach is a huge pipeline that includes simple models such as SVMs. Input: question & knowledge source -> QA model -> answer.
Reading comprehension / extractive QA: general QA is still too hard to realize today; current networks only reach reading-comprehension level, outputting the answer as a span of the original text, e.g. (1_7-11: words 7-11 of paragraph 1). If general QA were ever achieved, it would be the birth of an oracle. A sketch of extractive QA follows.
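A minimal extractive-QA sketch with the Hugging Face pipeline (the checkpoint is an assumption). Note the model literally returns a span of the passage, matching the "(paragraph, word range)" formulation above:

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What does an extractive QA model return?",
    context="An extractive QA model does not write free-form text; "
            "it returns a span of the original passage as the answer.",
)
print(result)   # {'score': ..., 'start': ..., 'end': ..., 'answer': '...'}
```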
Chatting
Awkward chit-chat. It's... just awkward chit-chat.
Task-oriented
A task-oriented dialogue system is typically a pipeline of:
- Natural Language Understanding (NLU)
- Policy & State Tracker
- Natural Language Generation (NLG)
Network
BERT:
Bert is a character from Sesame Street; everyone keeps contriving acronyms for network methods that spell out Sesame Street characters. BERT and RNNs will be covered in the next note.
LSTM: will be introduced in the next sharing session.
Final
Quote from "Statistical approach to speech" by Prof. Keiichi Tokuda at Interspeech 2019 (the line is originally attributed to Frederick Jelinek):
"Every time I fire a linguist, the performance of the speech recognizer goes up."