Principle
Automatic summarization here means automatically extracting the key sentences from a piece of text. What counts as a key sentence? To a human, it is a sentence that captures the main point of the article; a machine can only approximate that judgment by defining a weighting scheme, scoring every sentence against it, and then returning the top-ranked sentences.
TextRank's scoring is again derived from PageRank's iterative idea, as the following formula shows:
WS(V_i) = (1 - d) + d · Σ_{j ≠ i} ( w_ji / Σ_{k ≠ j} w_jk ) · WS(V_j)
The left-hand side WS(V_i) is the weight (score) of sentence i, and the sum on the right is the contribution that each neighbouring sentence makes to it. Unlike keyword extraction, all sentences are treated as adjacent to each other, so no co-occurrence window is used.
The numerator w_ji inside the sum is the similarity between the two sentences, and BM25 is the recommended way to compute it. The denominator is again a sum of similarities (weight_sum in the code below), and WS(V_j) is sentence j's weight from the previous iteration, so the whole formula is evaluated iteratively.
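In the code below, the similarity w_ji is supplied by a BM25 helper whose simall(doc) method scores a tokenized sentence against every sentence in the corpus. Its implementation is not listed in this article, so what follows is only a minimal sketch of what such a class could look like, assuming the standard BM25 scoring function with the common defaults k1 = 1.5 and b = 0.75 and the probabilistic IDF log((D - n + 0.5)/(n + 0.5)); the helper in the linked repository may differ in its details. Note that this form of IDF goes negative for very common terms, which would explain the negative entries in the self.weight dump further down.

import math

class BM25(object):
    """Minimal BM25 sketch; docs is a list of tokenized sentences."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.D = len(docs)
        self.avgdl = sum(len(doc) for doc in docs) / self.D
        self.k1 = k1
        self.b = b
        self.f = []     # term frequencies of each sentence
        self.idf = {}   # inverse document frequency of each term
        df = {}         # number of sentences containing each term
        for doc in docs:
            tf = {}
            for word in doc:
                tf[word] = tf.get(word, 0) + 1
            self.f.append(tf)
            for word in tf:
                df[word] = df.get(word, 0) + 1
        for word, n in df.items():
            # Probabilistic IDF; negative for terms that occur in most sentences.
            self.idf[word] = math.log(self.D - n + 0.5) - math.log(n + 0.5)

    def sim(self, query, index):
        """BM25 score of the token list `query` against sentence `index`."""
        score = 0.0
        d = len(self.docs[index])
        for word in query:
            if word not in self.f[index]:
                continue
            tf = self.f[index][word]
            score += (self.idf[word] * tf * (self.k1 + 1)
                      / (tf + self.k1 * (1 - self.b + self.b * d / self.avgdl)))
        return score

    def simall(self, query):
        """Score `query` against every sentence, returning a list of length D."""
        return [self.sim(query, i) for i in range(self.D)]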
Code implementation
text = '''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''
import jieba                 # optional alternative segmenter (see the main block)
from utils import utils      # sentence splitting and stop-word filtering helpers
from snownlp import seg      # snownlp's word segmenter
# A BM25 class exposing simall() must also be importable here; see the sketch
# above and the full code linked at the end of this article.


class TextRank(object):
    def __init__(self, docs):
        self.docs = docs             # list of tokenized sentences
        self.bm25 = BM25(docs)       # pairwise sentence similarity via BM25
        self.D = len(docs)           # number of sentences
        self.d = 0.85                # damping factor
        self.weight = []             # weight[i][j]: similarity of sentence i to j
        self.weight_sum = []         # row sums of weight, excluding the diagonal
        self.vertex = []             # current WS score of each sentence
        self.max_iter = 200
        self.min_diff = 0.001
        self.top = []                # (index, score) pairs sorted by score

    def text_rank(self):
        # Build the similarity graph.
        for cnt, doc in enumerate(self.docs):
            scores = self.bm25.simall(doc)
            self.weight.append(scores)
            # Exclude the sentence's similarity to itself from the denominator.
            self.weight_sum.append(sum(scores) - scores[cnt])
            self.vertex.append(1.0)
        # Iterate the TextRank formula until the scores stop changing.
        for _ in range(self.max_iter):
            m = []
            max_diff = 0
            for i in range(self.D):
                m.append(1 - self.d)
                for j in range(self.D):
                    if j == i or self.weight_sum[j] == 0:
                        continue
                    # The TextRank formula.
                    m[-1] += (self.d * self.weight[j][i]
                              / self.weight_sum[j] * self.vertex[j])
                if abs(m[-1] - self.vertex[i]) > max_diff:
                    max_diff = abs(m[-1] - self.vertex[i])
            self.vertex = m
            if max_diff <= self.min_diff:
                break
        self.top = list(enumerate(self.vertex))
        self.top = sorted(self.top, key=lambda x: x[1], reverse=True)

    def top_index(self, limit):
        """Indices of the top `limit` sentences."""
        return list(map(lambda x: x[0], self.top))[:limit]

    def top_docs(self, limit):
        """Tokenized top `limit` sentences (named top_docs so it does not
        clash with the self.top attribute)."""
        return list(map(lambda x: self.docs[x[0]], self.top[:limit]))


if __name__ == '__main__':
    sents = utils.get_sentences(text)
    doc = []
    for sent in sents:
        words = seg.seg(sent)
        # words = list(jieba.cut(sent))  # alternative: jieba segmentation
        words = utils.filter_stop(words)
        doc.append(words)
    print(doc)
    rank = TextRank(doc)
    rank.text_rank()
    for index in rank.top_index(3):
        print(sents[index])
self.weight
The BM25 similarity of each sentence to every other sentence (one row per sentence):
[[10.011936342719583, 0.0, 0.9413276860939246, 0, 2.5208967587765487, -0.42128772462816594, 0, 0, -0.41117681923708993, 0.0, 0, 1.7776032315706807], [0.0, 7.362470286473312, -0.1203528298812426, 0, 0.3208550531092889, -0.42128772462816594, 0.8638723295025715, 0, 0.16249889568919007, 1.1941879837414735, 0, 0.42128772462816594], [1.071635908557153, -0.2234656626288532, 7.174478670010185, 0, -0.6108314377461014, -0.8425754492563319, 0, 0, -0.8223536384741799, -0.258091618961588, 0, 3.1339187385131955], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1.256853150903099, 0.23476226867251077, -0.32977059288034255, 0, 3.7397693866902904, -0.42128772462816594, 0.8638723295025715, 0, 0.16249889568919007, -0.258091618961588, 0, 0], [-0.197031608341979, -0.2234656626288532, -0.32977059288034255, 0, -0.3054157188730507, 2.3454371414406934, 0, 0, -0.41117681923708993, -0.258091618961588, 0, 0], [0, 0.45822793130136397, 0, 0, 0.6262707719823396, 0, 3.630597195571431, 0, 0.57367571492628, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 3.1672694847490694, 0, 0, 0, 0], [-0.394063216683958, 0.011296606043657564, -0.6595411857606851, 0, 0.015439334236238222, -0.8425754492563319, 0.8638723295025715, 0, 1.5886339237482168, -0.516183237923176, 0, 0], [0.0, 1.0339739436867168, -0.1203528298812426, 0, -0.3054157188730507, -0.42128772462816594, 0, 0, -0.41117681923708993, 5.778308586740918, 1.730455352739724, 0.42128772462816594], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1.1941879837414735, 6.642686196030381, 0], [0.8313653667915449, 0.2234656626288532, 1.271098278974267, 0, 0, 0, 0, 0, 0, 0.258091618961588, 0, 1.7776032315706807]]
self.weight_sum
The total similarity of each sentence to all the other sentences (its row summed without the diagonal):
[4.407363132575895, 2.421061432161282, 1.4482368400032932, 0, 1.5088367082972747, -1.7249520209229035, 1.658174418209983, 0.0, -1.5217548198416835, 1.9274839284350573, 1.1941879837414735, 2.584020927356253]
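Each weight_sum entry is just the corresponding row of self.weight summed without its self-similarity term, exactly as the sum(scores) - scores[cnt] line computes it. Checking the first sentence against the dump above:

row0 = [10.011936342719583, 0.0, 0.9413276860939246, 0, 2.5208967587765487,
        -0.42128772462816594, 0, 0, -0.41117681923708993, 0.0, 0, 1.7776032315706807]
print(sum(row0) - row0[0])   # ≈ 4.40736..., the first entry of self.weight_sum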
(j, m)
How the scores of the first two sentences evolve during the first iteration: each line shows the inner index j just processed and the running m list. The sentence's own index and indices whose weight_sum is zero (here j = 3 and j = 7) are skipped; once sentence 0 is finished, its first-iteration score stays at m[0] while sentence 1's score accumulates in m[1].
1 [0.15000000000000002]
2 [0.7789651644764877]
4 [1.4870107550912874]
5 [1.584101494462158]
6 [1.584101494462158]
8 [1.8042116789767775]
9 [1.8042116789767775]
10 [1.8042116789767775]
11 [2.0776849137681874]
0 [2.0776849137681874, 0.15000000000000002]
2 [2.0776849137681874, 0.018843404622897686]
4 [2.0776849137681874, 0.15109623706944147]
5 [2.0776849137681874, 0.261212814765845]
6 [2.0776849137681874, 0.4961059221065203]
8 [2.0776849137681874, 0.4897960257868833]
9 [2.0776849137681874, 0.9457675849621023]
10 [2.0776849137681874, 0.9457675849621023]
11 [2.0776849137681874, 1.0192754312893615]
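As a quick check on one step of this trace: when j = 2 is processed for sentence 0, the score jumps from 0.15 to about 0.7790, which should equal (1 - d) plus d · weight[2][0] / weight_sum[2] · vertex[2], with every vertex still at its initial value of 1.0:

d = 0.85
w_20 = 1.071635908557153      # self.weight[2][0] from the dump above
ws_2 = 1.4482368400032932     # self.weight_sum[2]
print((1 - d) + d * w_20 / ws_2 * 1.0)   # ≈ 0.778965, matching the trace line "2 [0.7789...]"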
How m (the score of every sentence) changes from one iteration to the next
With max_iter set to 200, here are the m values from the first five and the last five iterations; by the end the values have clearly converged.
[2.0776849137681874, 1.0192754312893615, 0.999457826141497, 0.15000000000000002, 0.7185396874241888, -0.5261633807600671, 0.4574244460937142, 0.15000000000000002, 0.05200189320790127, 1.6227868709805937, 0.9131124846903355, 2.66587982716429]
[1.9767903479098448, 1.0990295797831187, 1.3128224919934568, 0.15000000000000002, 0.7652761963157931, -1.111371191008174, 0.7837318722726239, 0.15000000000000002, -0.6395683714253901, 1.2720237049753234, 1.3883689212368555, 3.1528964479465524]
[2.131123478696624, 0.9565423086380485, 1.1548328753945554, 0.15000000000000002, 0.6827525917271398, -1.5413388479058974, 1.1643685871601586, 0.15000000000000002, -0.7329978403690465, 1.4226914336015335, 1.1206971700887252, 3.6413282429681653]
[2.0445854067537668, 1.1136053809668183, 1.2961406802982383, 0.15000000000000002, 0.8363765234805878, -1.507050641192699, 1.12607464161547, 0.15000000000000002, -0.6871676552565422, 1.1312077269369323, 1.2356735948433215, 3.4105543415541146]
[2.192542113515565, 0.9600086901987991, 1.1866885268412732, 0.15000000000000002, 0.7930661765454192, -1.553868352553225, 1.2263591164249343, 0.15000000000000002, -0.6769452402058755, 1.2490292917375383, 1.0132387392037487, 3.6098809382918335]
...
[3.0660780434944765, 0.4978862574608699, 1.8170234457076675, 0.15000000000000002, 1.2100915598373658, -1.7210905907446725, 1.273125426875697, 0.15000000000000002, -0.7943466908131793, -0.05874055540868156, 0.10506476396899506, 4.604908339621487]
[3.0670178684613028, 0.49797632644810413, 1.8169978091525545, 0.15000000000000002, 1.2097806282899073, -1.7214471894836905, 1.2732051380734033, 0.15000000000000002, -0.7946293748838906, -0.06010081416007357, 0.10517434880999071, 4.606025259292418]
[3.0669901938072566, 0.4973816702146632, 1.81759950114689, 0.15000000000000002, 1.2104144540568738, -1.721330726750902, 1.2732175406130832, 0.15000000000000002, -0.7945170165565797, -0.05995284501299358, 0.10413631837439419, 4.60606091010734]
[3.0678632028596278, 0.4974716987107603, 1.817569225028541, 0.15000000000000002, 1.2101189112709647, -1.721663112850048, 1.2732914272902789, 0.15000000000000002, -0.7947807217322551, -0.06121755331865397, 0.1042492354778799, 4.607097687262931]
[3.067828117464969, 0.49691854169441674, 1.8181282756563857, 0.15000000000000002, 1.2107106526021962, -1.7215513928316113, 1.2733021487789122, 0.15000000000000002, -0.7946735518644014, -0.06106655593120924, 0.1032841207803389, 4.607119643650029]
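Comparing the last two vectors above supports the convergence claim: the largest element-wise change is roughly 1e-3, i.e. it has dropped to the min_diff = 0.001 threshold that the early-stopping check in text_rank() tests against:

prev = [3.0678632028596278, 0.4974716987107603, 1.817569225028541, 0.15000000000000002,
        1.2101189112709647, -1.721663112850048, 1.2732914272902789, 0.15000000000000002,
        -0.7947807217322551, -0.06121755331865397, 0.1042492354778799, 4.607097687262931]
last = [3.067828117464969, 0.49691854169441674, 1.8181282756563857, 0.15000000000000002,
        1.2107106526021962, -1.7215513928316113, 1.2733021487789122, 0.15000000000000002,
        -0.7946735518644014, -0.06106655593120924, 0.1032841207803389, 4.607119643650029]
print(max(abs(a - b) for a, b in zip(prev, last)))   # ≈ 9.7e-4, just below min_diff = 0.001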
Summary (top three sentences)
This run used jieba for segmentation, i.e. the commented-out jieba.cut line in the main block above instead of seg.seg:
因而它是计算机科学的一部分
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向
自然语言处理是一门融语言学、计算机科学、数学于一体的科学
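The same ranking can also be read back as tokenized sentences through the top_docs helper in the listing above (the tokens are already stop-word filtered, so this is for inspection rather than display):

rank = TextRank(doc)
rank.text_rank()
for words in rank.top_docs(3):
    print('/'.join(words))   # each item is a stop-word-filtered token list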
Full code
https://github.com/jllan/jannlp/blob/master/summary/textrank.py