Chapter 1: Functions Defined in the NLTK Frequency Distribution Class

1. Computing with Language: Texts and Words

1.1 Installation and Getting Started

First, install NLTK, which is a free download from http://www.nltk.org.
Once it is installed, start Python and run the statements below to download the data this series needs:

>>> import nltk
>>> nltk.download()

The download takes a while.

# Load everything from NLTK's book module:
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
# Whenever you want to look at one of these texts, just type its name:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>

Besides simply reading a text, there are many other ways to examine its contents.

# Look up every occurrence of a word in the book
>>> text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
...
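
To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the helper name `concordance`, its `window` parameter, and the toy token list are all assumptions made for illustration.

```python
def concordance(tokens, word, window=4):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        # This sketch matches case-insensitively.
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list, made up for illustration (not taken from text1).
tokens = "the whale was of a most monstrous size and a monstrous bulk".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line shows one match in square brackets with its neighbours on both sides, which is the same information the NLTK output above lays out in aligned columns.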

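A concordance lets us see a word in context. To find out which other words appear in similar contexts, append the method name similar() to the name of the text and put the word of interest inside the parentheses: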

>>> text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

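The common_contexts method lets us examine the contexts shared by two or more words, such as monstrous and very. Pass the words as a list: wrap them in square brackets inside the parentheses and separate them with commas: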

>>> text2.common_contexts(['monstrous','very'])
a_pretty am_glad a_lucky is_pretty be_glad
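
The output format above (left neighbour, underscore, right neighbour) can be mimicked with a short sketch. This is an illustration under simplifying assumptions, not NLTK's code: the helpers `contexts` and `shared_contexts` and the toy sentence are hypothetical, and a context here is just the single token on each side.

```python
def contexts(tokens, word):
    """Collect (previous, next) token pairs around every occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

def shared_contexts(tokens, words):
    """Return the contexts that every word in `words` appears in, as left_right strings."""
    shared = set.intersection(*(contexts(tokens, w) for w in words))
    return sorted(f"{left}_{right}" for left, right in shared)

# Toy sentence, made up for illustration (not taken from text2).
tokens = "she is very pretty and he is monstrous pretty but I am very glad".split()
print(shared_contexts(tokens, ["monstrous", "very"]))
```

Here "very" occurs in two contexts and "monstrous" in one, so only the context they have in common is reported.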

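So far we have automatically found a particular word in a text and displayed other words that appear in the same contexts. We can also determine where a word occurs in the text, measured as the number of words from the beginning. This positional information can be displayed as a dispersion plot: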

>>> text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])

[Figure_1.png: dispersion plot of the five words above in text4]
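
The data behind a dispersion plot is just a list of word offsets. The sketch below computes those offsets for a toy token list; the helper `word_offsets` and the sample sentence are assumptions for illustration, and no plotting is done.

```python
def word_offsets(tokens, word):
    """Return the positions (in tokens from the start of the text) where `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Toy token list, made up for illustration (not taken from text4).
tokens = "freedom and duties of citizens in a democracy that values freedom".split()
for w in ["citizens", "democracy", "freedom", "America"]:
    print(w, word_offsets(tokens, w))
```

In a dispersion plot, each offset becomes one tick mark on that word's row, so a word that never occurs (an empty list here) gets an empty row.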

1.2 Counting Vocabulary

# List the vocabulary items (distinct tokens, sorted)
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah'……]
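# Total number of tokens (words and punctuation)
>>> len(text3)
44764
# Number of distinct tokens
>>> len(set(text3))
2789
# Number of times a given word occurs
>>> text3.count('smote')
5
# Measure the lexical richness of the text (average uses per distinct token)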
>>> from __future__ import division
>>> len(text3) / len(set(text3))
16.050197203298673
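
These calculations are easy to wrap in small reusable functions. The sketch below uses a toy token list so it runs without the NLTK data; the function names `lexical_diversity` and `percentage` are illustrative choices, not part of the NLTK API.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

# Toy token list, made up for illustration (10 tokens, 8 of them distinct).
tokens = "in the beginning God created the heaven and the earth".split()
print(lexical_diversity(tokens))                      # 1.25
print(percentage(tokens.count("the"), len(tokens)))   # 30.0
```

With the book module loaded, the same functions could be called as lexical_diversity(text3) or percentage(text3.count('smote'), len(text3)), since a text supports len(), set() and count() as shown above.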

1.3 Frequency Distributions

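When analyzing a text, we often want to see, say, its 50 most frequent words. A tally of how many times each word occurs is called a frequency distribution, and NLTK's FreqDist class builds one for us: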

>>> fdist1 = FreqDist(text1)
>>> fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
>>> fdist1['whale']
906

All words in the chat corpus that are longer than 7 characters and occur more than 7 times:

>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in fdist5 if len(w) > 7 and fdist5[w] > 7])
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
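
In NLTK 3, FreqDist is built on Python's collections.Counter, so the same filtering can be tried without any corpus data. The sketch below uses a plain Counter as a stand-in; the helper `long_frequent`, its thresholds, and the toy sentence are assumptions for illustration.

```python
from collections import Counter

# Toy token list, made up for illustration (not taken from text5).
tokens = ("something happened and everyone saw something "
          "but everyone said nothing about something").split()
counts = Counter(tokens)

def long_frequent(counts, min_len, min_count):
    """Words longer than `min_len` characters that occur more than `min_count` times."""
    return sorted(w for w in counts if len(w) > min_len and counts[w] > min_count)

print(counts.most_common(2))         # the two most frequent tokens with their counts
print(long_frequent(counts, 7, 1))   # long words that occur more than once
```

The most_common(n) method is also how the "50 most frequent words" of a real text can be listed, for example with most_common(50).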

Functions defined for NLTK frequency distributions

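Example	Description
fdist = FreqDist(samples)	Create a frequency distribution containing the given samples
fdist[sample] += 1	Increment the count for this sample (written fdist.inc(sample) in NLTK 2)
fdist['monstrous']	Number of times the given sample occurred
fdist.freq('monstrous')	Relative frequency of the given sample
fdist.N()	Total number of samples
fdist.most_common(n)	The n most frequent samples and their counts, in decreasing order of frequency (in NLTK 2, fdist.keys() returned the samples in this order)
for sample in fdist:	Iterate over the samples (in decreasing order of frequency in NLTK 2 only)
fdist.max()	Sample with the greatest count
fdist.tabulate()	Tabulate the frequency distribution
fdist.plot()	Plot the frequency distribution
fdist.plot(cumulative=True)	Plot the cumulative frequency distribution
fdist1 < fdist2	Test whether samples in fdist1 occur less frequently than in fdist2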
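
To show what a few of these operations compute, the sketch below reproduces them by hand on a plain collections.Counter. It is a stand-in under stated assumptions, not NLTK itself: the variable names and the toy sentence are made up, and the comments indicate which table entry each line corresponds to.

```python
from collections import Counter

# Toy samples, made up for illustration.
fdist = Counter("the whale and the sea and the ship".split())
fdist["whale"] += 1                # increment the count for one sample

total = sum(fdist.values())        # corresponds to fdist.N()
rel_freq = fdist["the"] / total    # corresponds to fdist.freq('the')
top = max(fdist, key=fdist.get)    # corresponds to fdist.max()

print(total, rel_freq, top)
```

The relative frequency is simply a sample's count divided by the total, which is why the values reported by freq() always lie between 0 and 1.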

Word comparison operators

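Function	Meaning
s.startswith(t)	Test whether s starts with t
s.endswith(t)	Test whether s ends with t
t in s	Test whether s contains t
s.islower()	Test whether s contains cased characters and all of them are lowercase
s.isupper()	Test whether s contains cased characters and all of them are uppercase
s.isalpha()	Test whether s is non-empty and all of its characters are letters
s.isalnum()	Test whether s is non-empty and all of its characters are letters or digits
s.isdigit()	Test whether s is non-empty and all of its characters are digits
s.istitle()	Test whether s is titlecased (each word starts with a capital letter)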
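
These string tests are typically combined with list comprehensions to pick words out of a text. The sketch below applies several of them to a toy token list; the tokens and variable names are made up for illustration.

```python
# Toy token list, made up for illustration.
tokens = ["The", "Whale", "was", "MONSTROUS", "in", "1851", "happiness", "x86", ","]

ends_ness = [w for w in tokens if w.endswith("ness")]
titlecase = [w for w in tokens if w.istitle()]
all_upper = [w for w in tokens if w.isupper()]
numbers = [w for w in tokens if w.isdigit()]
# Alphanumeric tokens that contain at least one non-letter character.
alnum_not_alpha = [w for w in tokens if w.isalnum() and not w.isalpha()]

print(ends_ness, titlecase, all_upper, numbers, alnum_not_alpha)
```

With the book module loaded, the same pattern works on a real text, for example [w for w in set(text1) if w.endswith('ness')].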

To be continued...

Last edited: 2018.03.13 11:31:38
