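A recent project of mine was keyword extraction from short texts (Twitter and LinkedIn posts). I mainly used two algorithms, TextRank and RAKE. The two take very different approaches, but for short-text keyword extraction RAKE gave noticeably better results.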
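Introduction to TextRank
TextRank is a graph-based ranking algorithm for text. Its basic idea comes from Google's PageRank: split the text into units (words or sentences), build a graph over them, and rank the important parts of the text through a voting mechanism. Keyword extraction and summarization can be done using only the information in a single document. Unlike models such as LDA and HMM, TextRank does not need to be trained on a collection of documents beforehand, and it is widely used because it is simple and effective.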
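In its general form, TextRank can be represented as a directed weighted graph G = (V, E), consisting of a vertex set V and an edge set E, where E is a subset of V × V. The weight of the edge between two vertices Vi and Vj is wji. For a given vertex Vi, In(Vi) is the set of vertices that point to it, and Out(Vi) is the set of vertices that Vi points to. The score of vertex Vi is defined as follows:

$$ WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) $$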
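Here d is the damping factor, with a value between 0 and 1. It integrates into the model the probability of jumping from a given vertex to another random vertex in the graph, and is usually set to 0.85. When computing the scores with TextRank, every vertex is given an arbitrary initial value and the scores are recomputed iteratively until convergence, that is, until the error of every vertex falls below a given threshold, usually 0.0001.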
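The iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: it assumes the standard weighted TextRank update, WS(Vi) = (1 - d) + d * sum over Vj in In(Vi) of wji / (sum over Vk in Out(Vj) of wjk) * WS(Vj). The function name `textrank`, the nested-dict graph layout, and the `max_iter` cap are illustrative choices.

```python
def textrank(weights, d=0.85, tol=1e-4, max_iter=100):
    """Iterate weighted TextRank scores until they change by less than tol.

    weights[j][i] is the (positive) weight w_ji of the edge from j to i.
    The dict-of-dicts layout is an assumption made for this sketch.
    """
    nodes = set(weights) | {i for targets in weights.values() for i in targets}
    # Total outgoing weight of each vertex: the denominator of the update.
    out_sum = {j: sum(weights.get(j, {}).values()) for j in nodes}
    scores = {v: 1.0 for v in nodes}  # arbitrary initial value
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in weights
                if i in weights[j]
            )
            new[i] = (1 - d) + d * incoming
        converged = max(abs(new[v] - scores[v]) for v in nodes) < tol
        scores = new
        if converged:
            break
    return scores


# Toy graph: a <-> b <-> c, every edge with weight 1.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
scores = textrank(graph)
```

On this toy graph, `b` receives votes from both neighbours and ends up with the highest score, while `a` and `c` tie.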
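Keyword extraction with TextRank
The task of keyword extraction is to automatically pick out a number of meaningful words or phrases from a given text. TextRank ranks candidate keywords using the relations between nearby words (a co-occurrence window) and extracts them directly from the text itself. The main steps are: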
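(1) Split the given text T into complete sentences, i.e. $T = [S_1, S_2, \ldots, S_m]$.
(2) For each sentence $S_i$, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words with the specified parts of speech, such as nouns, verbs, and adjectives, i.e. $S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,n}]$, where $t_{i,j}$ are the retained candidate keywords.
(3) Build the candidate keyword graph G = (V, E), where V is the node set made up of the candidate keywords produced in step (2). Edges are built from co-occurrence: two nodes are connected only if their words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
(4) Using the formula above, propagate the node weights iteratively until convergence.
(5) Sort the nodes by weight in descending order to obtain the T most important words as candidate keywords (here T is a count, not the text T from step (1)).
(6) Mark the T most important words from step (5) in the original text; where they form adjacent groups, combine them into multi-word keywords. For example, given the sentence "Matlab code for plotting ambiguity function", if both "Matlab" and "code" are candidate keywords, they are combined into "Matlab code" and added to the keyword list.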
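Two of these steps are easy to get subtly wrong: building the co-occurrence edges and merging adjacent keywords. Below is a rough Python sketch of both. It assumes the text has already been tokenized and filtered; the function names `cooccurrence_edges` and `merge_adjacent` are hypothetical, and the graph is treated as undirected and unweighted for simplicity.

```python
def cooccurrence_edges(tokens, window=2):
    """Connect two candidate words when they fall inside the same window.

    A window of size K covers K consecutive tokens, so each token is
    linked to the K - 1 tokens that follow it.
    """
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


def merge_adjacent(tokens, keywords):
    """Join runs of consecutive keywords in the original text into phrases."""
    phrases, run = [], []
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and tok.lower() in keywords:
            run.append(tok)
        else:
            if len(run) > 1:
                phrases.append(" ".join(run))
            run = []
    return phrases


# Candidate words left after stop-word and part-of-speech filtering (assumed).
candidates = ["matlab", "code", "plotting", "ambiguity", "function"]
edges = cooccurrence_edges(candidates, window=3)

# Original sentence, with a hypothetical set of top-ranked words.
sentence = "Matlab code for plotting ambiguity function".split()
merged = merge_adjacent(sentence, {"matlab", "code", "function"})
```

With these inputs, `merged` contains only "Matlab code": "function" is a keyword too, but it has no keyword neighbour, so it stays a single word.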
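TextRank can also be used for automatic summarization. That is not part of this project, so I will not go into detail here.
GitHub repository for the TextRank algorithm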
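Introduction to RAKE (Rapid Automatic Keyword Extraction)
The idea behind RAKE
RAKE is used for keyword extraction, but what it actually extracts are key phrases, with a preference for longer ones. In English, a keyword usually consists of several words but rarely contains punctuation or stop words (such as "and", "the", "of") or other words that carry no semantic information.
RAKE first uses punctuation (such as the ASCII period, question mark, exclamation mark, and comma) to split a document into clauses. Each clause is then split into phrases using stop words as delimiters. These phrases are the candidates for the final extracted keywords.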
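The two-stage splitting can be sketched as follows. This is only an illustration: the stop list here is a tiny hand-picked set (real RAKE implementations load a much larger one), and the function name `candidate_phrases` is hypothetical.

```python
import re

# A tiny illustrative stop list, chosen by hand for this example only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "on", "is", "here"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Split text into clauses at punctuation, then into phrases at stop words."""
    phrases = []
    for clause in re.split(r"[.?!,;:]+", text.lower()):
        current = []
        for word in clause.split():
            if word in stop_words:
                # A stop word ends the current phrase.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


text = (
    "Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss "
    "the future of metal 3D printing. Listen to the interview here"
)
phrases = candidate_phrases(text)
```

Because only stop words and punctuation act as delimiters, "desktop metal ceo ric fulop joined bloomberg radio" survives as a single eight-word candidate.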
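Each phrase can be further split on spaces into individual words. We can assign a score to every word and sum the word scores to get the score of a phrase. The key point is to take the co-occurrence of the words within a phrase into account. The formula that is used in the end is: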
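- wordScore(w) = wordDegree(w) / wordFrequency(w)
That is, the score of a word w is its degree divided by its frequency. The degree is a concept from network theory: it increases by 1 for every word that w co-occurs with in a phrase, counting w itself. The frequency is the total number of times the word appears in the document.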
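Then, for each candidate key phrase, the scores of its words are summed and the phrases are sorted by score. RAKE takes the top third of all candidate phrases as the extracted keywords.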
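One more note on the score computation: wordDegree(w) is in fact the number of times the word co-occurs with other words across all phrases, plus the word's own frequency. For the full algorithm, see the attached paper, "Automatic Keyword Extraction from Individual Documents".
GitHub repository for the RAKE algorithm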
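To make the scoring concrete, here is a small Python sketch of the degree/frequency computation. It assumes the candidate phrases have already been split, as hand-listed below for test text 2 from the experiments; the function name `rake_scores` is hypothetical and this is not the code used in the project.

```python
from collections import defaultdict


def rake_scores(phrases):
    """Score phrases as the sum of wordDegree(w) / wordFrequency(w)."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Co-occurrences within this phrase, counting the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Candidate phrases for test text 2, assumed to be split as shown.
phrases = [
    ["yesterday"],
    ["desktop", "metal", "ceo", "ric", "fulop", "joined", "bloomberg", "radio"],
    ["discuss"],
    ["future"],
    ["metal", "3d", "printing"],
    ["listen"],
    ["interview"],
]
ranked = rake_scores(phrases)
# Keep the top third of the candidates as keywords.
keywords = ranked[: max(1, len(ranked) // 3)]
```

Here "metal" has degree 8 + 3 = 11 and frequency 2, so its score is 5.5. The other seven words of the long phrase each score 8, giving 61.5 for that phrase, and "metal 3d printing" gets 5.5 + 3 + 3 = 11.5. Both values match the scores reported for test text 2.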
Short-text keyword extraction experiments
RAKE experiments
Test text 1:
"Great interview by Gerry Dick with Ball State University's new president, Geoffrey Mearns, who recognizes the need to offer curriculum that meets students' needs. Aidex would welcome the opportunity to introduce our latest learning technologies, including Desktop Metal, metal 3D printing; SynDaver Labs and its lifelike human cadavers; and FANUC America Corporation robotics and CNC technology. These technologies elevate the educational experience and prepare students for fantastic careers. We hope to visit Muncie soon to present these and other STEM technologies."
Result:
[('fanuc america corporation robotics', 16.0), ('ball state university', 9.0), ('lifelike human cadavers', 9.0), ('including desktop metal', 9.0), ('metal 3d printing', 9.0), ('latest learning technologies', 8.333333333333334), ('stem technologies', 4.333333333333334), ('technologies elevate', 4.333333333333334), ('educational experience', 4.0), ('geoffrey mearns', 4.0), ('syndaver labs', 4.0), ('great interview', 4.0), ('prepare students', 4.0), ('visit muncie', 4.0), ('meets students', 4.0), ('cnc technology', 4.0), ('offer curriculum', 4.0), ('gerry dick', 4.0), ('fantastic careers', 4.0), ('aidex', 1.0), ('recognizes', 1.0), ('introduce', 1.0), ('president', 1.0), ('opportunity', 1.0), ('present', 1.0), ('hope', 1.0)]
The results are sorted by score in descending order.
Test text 2:
"Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss the future of metal 3D printing. Listen to the interview here"
Result:
[('desktop metal ceo ric fulop joined bloomberg radio', 61.5), ('metal 3d printing', 11.5), ('yesterday', 1.0), ('interview', 1.0), ('future', 1.0), ('discuss', 1.0), ('listen', 1.0)]
Test text 3:
"3D printing metal on a desktop FDM printer, exclusive interview with The Virtual Foundry founder : Is 2017 going to be the year for 3D printing metal? Recently 3D Printing Industry reported announcements from Markforged about their forthcoming Metal X 3D "
Result:
[('recently 3d printing industry reported announcements', 31.25), ('3d printing metal', 9.916666666666666), ('virtual foundry founder', 9.0), ('desktop fdm printer', 9.0), ('forthcoming metal', 4.666666666666666), ('exclusive interview', 4.0), ('3d', 3.25), ('year', 1.0), ('markforged', 1.0), ('2017', 0)]
Comparison of the two algorithms
Test text:
Desktop Metal is proud to welcome Morris Group, Inc.. as an authorized reseller of its metal 3D printing systems in 30 states. With the addition of Desktop Metal’s Studio System™ to its existing lineup of CNC machine tools, Morris Group’s extensive distributor network provides an end-to-end suite of advanced solutions to manufacturers of precision metal parts.
In the latest episode of podcast, The Digital Factory, Desktop Metal CEO Ric Fulop shares his thoughts on the state of the metal 3D printing industry.
We're excited to announce our Series D Funding with support from our strategic partners NEA, GV, GE Ventures, among others.
Register now for 's Metal 3D Printing webinar featuring Desktop Metal and the Studio System, the world's first office-friendly metal 3D printing system.
The latest issue of examines how recent advances make 3D printing a powerful competitor to conventional mass production. Read the full article here, including commentary from Desktop Metal CEO Ric Fulop.
We're honored to be recognized as one of 's 50 Smartest Companies of 2017.
Desktop Metal is honored to join the prestigious roster of recipients of the World Economic Forum Technology Pioneers program. For the press release, please visit: .
See the full list of Technology Pioneers 2017 here: .
At RAPID+TCT last month, Desktop Metal CTO Jonah Myerberg spoke with about leveraging metal 3D printing for the full product life cycle, from prototyping to mass production.
At RAPID+TCT, Desktop Metal CTO Jonah Myerberg talked to TechCrunch about our metal 3D printing solutions. Check out the video here:
This past weekend, Desktop Metal was honored to be recognized as Startup of the Year by the 3D Printing Industry awards. Thank you to all who voted!
Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss the future of metal 3D printing. Listen to the interview here:
Today in the Wall Street Journal: 3D printing is transforming manufacturing, from prototyping to mass production.
Desktop Metal CEO Ric Flop joined CNBC's Squawk Box to discuss the latest in metal 3D printing--from prototyping to mass production.
Keywords extracted with RAKE:
phrase | score |
---|---|
desktop metal ceo ric fulop joined bloomberg radio | 52.1515151515 |
desktop metal ceo ric flop joined cnbc | 43.8181818182 |
metal 3d printing webinar featuring desktop metal | 36.1090909091 |
desktop metal ceo ric fulop shares | 34.6515151515 |
desktop metal cto jonah myerberg spoke | 33.3181818182 |
desktop metal cto jonah myerberg talked | 33.3181818182 |
world economic forum technology pioneers program | 29.5 |
desktop metal ceo ric fulop | 28.6515151515 |
recent advances make 3d printing | 23.2909090909 |
office-friendly metal 3d printing system | 20.7909090909 |
metal 3d printing industry | 16.7909090909 |
metal 3d printing systems | 16.7909090909 |
leveraging metal 3d printing | 16.7909090909 |
3d printing industry awards | 16.2909090909 |
metal 3d printing solutions | 15.7909090909 |
full product life cycle | 14.6666666667 |
metal 3d printing | 12.7909090909 |
metal 3d printing-- | 11.5909090909 |
precision metal parts | 10.5 |
desktop metal | 9.31818181818 |
desktop metal’ | 9.31818181818 |
strategic partners nea | 9.0 |
extensive distributor network | 9.0 |
wall street journal | 9.0 |
cnc machine tools | 9.0 |
3d printing | 8.29090909091 |
technology pioneers 2017 | 8.0 |
conventional mass production | 7.5 |
studio system | 5.0 |
advanced solutions | 5.0 |
studio system™ | 5.0 |
full list | 4.66666666667 |
full article | 4.66666666667 |
mass production | 4.5 |
50 smartest companies | 4.0 |
transforming manufacturing | 4.0 |
including commentary | 4.0 |
authorized reseller | 4.0 |
ge ventures | 4.0 |
squawk box | 4.0 |
end-to-end suite | 4.0 |
prestigious roster | 4.0 |
digital factory | 4.0 |
morris group | 4.0 |
past weekend | 4.0 |
press release | 4.0 |
existing lineup | 4.0 |
morris group’ | 4.0 |
powerful competitor | 4.0 |
latest episode | 3.66666666667 |
latest issue | 3.66666666667 |
world | 3.5 |
latest | 1.66666666667 |
gv | 1.0 |
month | 1.0 |
voted | 1.0 |
announce | 1.0 |
techcrunch | 1.0 |
recipients | 1.0 |
read | 1.0 |
discuss | 1.0 |
honored | 1.0 |
series | 1.0 |
startup | 1.0 |
prototyping | 1.0 |
year | 1.0 |
funding | 1.0 |
state | 1.0 |
rapid+tct | 1.0 |
recognized | 1.0 |
visit | 1.0 |
addition | 1.0 |
support | 1.0 |
today | 1.0 |
listen | 1.0 |
manufacturers | 1.0 |
30 states | 1.0 |
podcast | 1.0 |
join | 1.0 |
excited | 1.0 |
future | 1.0 |
video | 1.0 |
proud | 1.0 |
examines | 1.0 |
check | 1.0 |
interview | 1.0 |
yesterday | 1.0 |
thoughts | 1.0 |
register | 1.0 |
2017 | 0 |
Keywords extracted with TextRank:
keyword | keyword | keyword | keyword |
---|---|---|---|
desktop | metal | d printing | production |
product | join | joined | morris |
myerberg | latest | advances | advanced solutions |
fulop | distributor network | ric | machine |
partners | ge | pioneers | economic |
As you can see, neither algorithm gives particularly good results, but overall RAKE does somewhat better. So I modified RAKE for this problem, and the results became:
phrase | mean score |
---|---|
desktop metal desktop metal | 49.5083333333 |
desktop metal | 49.5083333333 |
morris | 42.1666666667 |
metal 3d printing tool | 38.9375 |
morris group | 37.0 |
metal 3d printing solutions | 36.8333333333 |
make metal 3d printing | 35.1583333333 |
represent desktop metal | 34.1166666667 |
desktop metal offers | 34.1166666667 |
desktop metal products | 34.1166666667 |
metal additive manufacturing | 33.3388888889 |
office-friendly metal 3d printing system | 32.7966666667 |
end-to-end metal 3d printing solutions | 30.8066666667 |
innovative metal 3d printing systems | 29.9566666667 |
metal cutting manufacturers | 29.6722222222 |
precision metal parts | 29.0611111111 |
morris group distributor | 28.3055555556 |
morris company | 27.25 |
bound metal deposition | 26.8388888889 |
3d printing process | 23.9 |
morris group distribution network | 23.125 |
local morris group distributor | 22.2916666667 |
groundbreaking 3d printing technology | 21.5083333333 |
studio system | 21.3 |
(Only part of the results is shown above.)
The modified code is available on my GitHub.