Python NLTK notes (continuously updated)

1 Basic objects and methods

1.1 nltk.text.Text

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> type(text1)
<class 'nltk.text.Text'>
>>> dir(text1)
['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_context', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'unicode_repr', 'vocab']


>>> text1.concordance("affection")
Displaying 3 of 3 matches:
oyously assented ; for besides the affection I now felt for Queequeg , he was a
e enough , yet he had a particular affection for his own harpoon , because it w
ing cobbling jobs . Lord ! what an affection all old women have for tinkers . I


>>> text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

Text.common_contexts([word1, word2, ...])
Finds the contexts shared by all the words passed in, i.e. the (left word, right word) surroundings in which word1, word2, ... all occur.

>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty a_lucky am_glad be_glad
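
To make "shared context" concrete, here is a rough stdlib-only sketch (Python 3) of the idea. It is not NLTK's implementation: the helper name `common_contexts` and the toy sentence are assumptions made for illustration, and a context here is simply the pair of neighbouring tokens.

```python
from collections import defaultdict

def common_contexts(tokens, words):
    """Return the (left, right) neighbour pairs shared by every word in `words`.

    Hypothetical helper: tokens at the very start or end of the text have
    no full context and are ignored.
    """
    targets = {w.lower() for w in words}
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        word = word.lower()
        if word in targets:
            contexts[word].add((left.lower(), right.lower()))
    # Keep only the contexts that occur with all of the requested words.
    shared = set.intersection(*(contexts[w] for w in targets))
    return sorted(shared)

tokens = "she is a monstrous pretty girl and he is a very pretty boy".split()
shared = common_contexts(tokens, ["monstrous", "very"])
print(" ".join(left + "_" + right for left, right in shared))
```

The `left_right` print format mirrors the output style shown above.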

Text.dispersion_plot([word1, word2, ...])

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

1.2 nltk.probability.FreqDist


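Method                          Description
fdist = FreqDist(samples)       create a frequency distribution from the given samples (an nltk.text.Text, a list of tokens, or any other iterable; a plain string is counted character by character)
fdist[sample] += 1              increment the count for sample
fdist[word]                     number of times word occurs in the samples
fdist.freq(word)                relative frequency of word in the samples
fdist.N()                       total number of samples
fdist.keys()                    list of the distinct samples
for sample in fdist:            iterate over the samples (in decreasing frequency order in NLTK 2; in NLTK 3 use fdist.most_common() to get that order)
fdist.max()                     sample with the highest count
fdist.plot()                    plot the frequency distribution
fdist.plot(cumulative=True)     plot the cumulative frequency distribution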
>>> fdist = FreqDist(text1)
>>> fdist.plot(50, cumulative=True)
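
In NLTK 3, FreqDist is built on top of collections.Counter, so most of the table can be tried without any corpus. The sketch below (Python 3, stdlib only) uses a made-up sentence and plain Counter calls as stand-ins; the variable names are illustrative, not part of NLTK.

```python
from collections import Counter
from itertools import accumulate

tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

total = sum(counts.values())               # corresponds to fdist.N()
freq_the = counts["the"] / total           # corresponds to fdist.freq("the")
top_sample = max(counts, key=counts.get)   # corresponds to fdist.max()
ranked = counts.most_common()              # samples in decreasing order of count

# Running totals over the ranked counts: the values a cumulative plot would draw.
cumulative = list(accumulate(count for _, count in ranked))

print(total, top_sample, ranked[:2], cumulative)
```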

1.3 nltk.util.bigrams


>>> list(bigrams(["more", "is", "said", "than", "done"]))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

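Collocations in English text are essentially bigrams that occur unusually often, with particular attention paid to cases involving rare words. nltk.text.Text provides the method collocations(self, num=20, window_size=2), which extracts the frequently occurring collocations directly from a Text, as follows: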

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
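
As a very rough approximation of the idea, the following stdlib-only sketch (Python 3) counts adjacent word pairs and ignores any pair containing a stopword. This is a simplification: NLTK's collocations() ranks pairs with an association measure rather than raw counts. The function name, the tiny stopword set, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter

def frequent_bigrams(tokens, stopwords, top=3):
    """Return the `top` most frequent adjacent pairs that contain no stopword."""
    pairs = zip(tokens, tokens[1:])          # same pairs that bigrams() yields
    kept = [(a, b) for a, b in pairs
            if a.lower() not in stopwords and b.lower() not in stopwords]
    return Counter(kept).most_common(top)

tokens = ("the United States and the people of the United States "
          "thank the American people").split()
stopwords = {"the", "and", "of"}

best = frequent_bigrams(tokens, stopwords, top=1)
print(best)
```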

2 Corpora and their use


2.1 nltk.corpus.reader


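Method                         Description
fileids()                      the files of the corpus
fileids([categories])          the files of the corpus that belong to the given categories (if the corpus is categorized)
categories()                   the categories of the corpus (if any)
categories([fileids])          the categories of the corpus that the given files belong to
raw()                          the raw content of the corpus
raw(fileids=[f1, f2, f3])      the raw content of the specified files
raw(categories=[c1, c2])       the raw content of the specified categories
words()                        the words of the whole corpus
words(fileids=[f1, f2, f3])    the words of the specified files
words(categories=[c1, c2])     the words of the specified categories
sents()                        the sentences of the whole corpus
sents(fileids=[f1, f2, f3])    the sentences of the specified files
sents(categories=[c1, c2])     the sentences of the specified categories
abspath(fileid)                the location of the given file on disk
encoding(fileid)               the encoding of the file
open(fileid)                   open a stream for reading the given corpus file
root()                         the path to the root directory of the locally installed corpus
readme()                       the contents of the corpus README file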

2.2 A brief introduction to several corpora

>>> from nltk.corpus import gutenberg        # Project Gutenberg corpus
>>> from nltk.corpus import webtext          # web text corpus
>>> from nltk.corpus import nps_chat         # chat text
>>> from nltk.corpus import brown            # Brown corpus
>>> from nltk.corpus import reuters          # Reuters corpus
>>> from nltk.corpus import inaugural        # inaugural address corpus


>>> gutenberg.fileids()                      # list the files in the corpus
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']


>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']


>>> nps_chat.words()                        # all words in the corpus
[u'now', u'im', u'left', u'with', u'this', u'gay', ...]
>>> gutenberg.words(["austen-emma.txt"])    # words in the given file
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
>>> brown.words(categories="news")          # words in the news category
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]


>>> inaugural.sents()      # all sentences in the corpus
[[u'Fellow', u'-', u'Citizens', u'of', u'the', u'Senate', u'and', u'of', u'the', u'House', u'of', u'Representatives', u':'], [u'Among', u'the', u'vicissitudes', u'incident', u'to', u'life', u'no', u'event', u'could', u'have', u'filled', u'me', u'with', u'greater', u'anxieties', u'than', u'that', u'of', u'which', u'the', u'notification', u'was', u'transmitted', u'by', u'your', u'order', u',', u'and', u'received', u'on', u'the', u'14th', u'day', u'of', u'the', u'present', u'month', u'.'], ...]
>>> gutenberg.sents(["shakespeare-macbeth.txt"])
[[u'[', u'The', u'Tragedie', u'of', u'Macbeth', u'by', u'William', u'Shakespeare', u'1603', u']'], [u'Actus', u'Primus', u'.'], ...]
>>> brown.sents(categories=["mystery"])
[[u'There', u'were', u'thirty-eight', u'patients', u'on', u'the', u'bus', u'the', u'morning', u'I', u'left', u'for', u'Hanover', u',', u'most', u'of', u'them', u'disturbed', u'and', u'hallucinating', u'.'], [u'An', u'interne', u',', u'a', u'nurse', u'and', u'two', u'attendants', u'were', u'in', u'charge', u'of', u'us', u'.'], ...]


>>> for fileid in webtext.fileids():
...     print fileid, webtext.raw(fileid)[:65], "..."    # raw content of each file
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

3 Conditional frequency distributions


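Method                                 Description
cfd = ConditionalFreqDist(pairs)       create a conditional frequency distribution from (condition, sample) pairs
cfd.conditions()                       the conditions (not necessarily sorted in NLTK 3; wrap in sorted() if order matters)
cfd[condition]                         the frequency distribution for this condition (a FreqDist)
cfd[condition][sample]                 the count of sample under this condition
cfd.tabulate()                         tabulate the conditional frequency distribution
cfd.tabulate(samples, conditions)      tabulation limited to the specified samples and conditions
cfd.plot()                             plot the conditional frequency distribution
cfd.plot(samples, conditions)          plot limited to the specified samples and conditions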


>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...             (genre, word)
...             for genre in brown.categories()
...             for word in brown.words(categories=genre))


>>> genre_word = [(genre, word) for genre in ["news", "romance"] for word in brown.words(categories=genre)]
>>> len(genre_word)
>>> genre_word[:5]
[('news', u'The'), ('news', u'Fulton'), ('news', u'County'), ('news', u'Grand'), ('news', u'Jury')]
>>> genre_word[-5:]
[('romance', u"I'm"), ('romance', u'afraid'), ('romance', u'not'), ('romance', u"''"), ('romance', u'.')]


>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['romance', 'news']

cfd[condition] & cfd[condition][sample]

>>> cfd["news"]
FreqDist({u'the': 5580, u',': 5188, u'.': 4030, u'of': 2849, u'and': 2146, u'to': 2116, u'a': 1993, u'in': 1893, u'for': 943, u'The': 806, ...})
>>> cfd["romance"]
FreqDist({u',': 3899, u'.': 3736, u'the': 2758, u'and': 1776, u'to': 1502, u'a': 1335, u'of': 1186, u'``': 1045, u"''": 1044, u'was': 993, ...})
>>> cfd["romance"]["love"]


>>> from nltk.corpus import udhr
>>> languages = ["Chickasaw", "English", "German_Deutsch"]
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang+"-Latin1"))
>>> cfd.tabulate()
                 1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  23
     Chickasaw 411  99  41  68  91  89  77  70  49  33  16  28  45  10   6   4   5   3   2   1   1   1
       English 185 340 358 114 169 117 157 118  80  63  50  12  11   6   1   0   0   0   0   0   0   0
German_Deutsch 171  92 351 103 177 119  97 103  62  58  53  32  27  29  15  14   3   7   5   2   1   0
>>> cfd.tabulate(conditions=["English", "German_Deutsch"], samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275
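
Conceptually, a conditional frequency distribution is a mapping from each condition to its own counter. The sketch below (Python 3, stdlib only) imitates that with defaultdict(Counter) and reproduces the kind of running total that tabulate(cumulative=True) prints. The pairs and the sample sentence are made up, and the names are illustrative rather than NLTK's.

```python
from collections import Counter, defaultdict
from itertools import accumulate

pairs = [("news", "the"), ("news", "jury"), ("news", "the"),
         ("romance", "love"), ("romance", "the"), ("romance", "love")]

# One Counter per condition, filled from (condition, sample) pairs.
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1

conditions = sorted(cfd)                 # sorted explicitly, order is not implied
news_the = cfd["news"]["the"]            # like cfd[condition][sample]
romance_love = cfd["romance"]["love"]

# Word-length counts for one condition, then a cumulative row over lengths 1..6.
sentence = "all human beings are born free and equal".split()
lengths = Counter(len(w) for w in sentence)
cumulative_row = list(accumulate(lengths[n] for n in range(1, 7)))

print(conditions, news_the, romance_love, cumulative_row)
```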

4 Wordlist corpora

>>> from nltk.corpus import words
>>> from nltk.corpus import stopwords
>>> from nltk.corpus import names
>>> from nltk.corpus import swadesh


>>> wordlist = words.words()    # separate name, so the imported corpus reader is not rebound
>>> len(wordlist)
>>> wordlist[:20]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron', u'Aaronic', u'Aaronical', u'Aaronite', u'Aaronitic', u'Aaru', u'Ab', u'aba', u'Ababdeh', u'Ababua', u'abac']


>>> stopwords.words("english")
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
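
A typical use of a stopword list is to measure how much of a text is left once the stopwords are filtered out. Here is a small stdlib-only sketch (Python 3); the function name, the three-word stopword set, and the sample sentence are assumptions for illustration, and with NLTK you would pass stopwords.words("english") instead.

```python
def content_fraction(tokens, stopwords):
    """Fraction of tokens that are not in the stopword collection."""
    stopwords = set(stopwords)   # set membership tests are fast
    content = [w for w in tokens if w.lower() not in stopwords]
    return len(content) / len(tokens)

tokens = "The cat sat on the mat".split()
fraction = content_fraction(tokens, ["the", "on", "a"])
print(fraction)
```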


>>> names.words("female.txt")[:10]
[u'Abagael', u'Abagail', u'Abbe', u'Abbey', u'Abbi', u'Abbie', u'Abby', u'Abigael', u'Abigail', u'Abigale']
>>> names.words("male.txt")[:10]
[u'Aamir', u'Aaron', u'Abbey', u'Abbie', u'Abbot', u'Abbott', u'Abby', u'Abdel', u'Abdul', u'Abdulkarim']


>>> swadesh.fileids()
[u'be', u'bg', u'bs', u'ca', u'cs', u'cu', u'de', u'en', u'es', u'fr', u'hr', u'it', u'la', u'mk', u'nl', u'pl', u'pt', u'ro', u'ru', u'sk', u'sl', u'sr', u'sw', u'uk']
>>> swadesh.words("en")[:20]
[u'I', u'you (singular), thou', u'he', u'we', u'you (plural)', u'they', u'this', u'that', u'here', u'there', u'who', u'what', u'where', u'when', u'how', u'not', u'all', u'many', u'some', u'few']


>>> swadesh.entries(["fr", "en"])[:10]
[(u'je', u'I'), (u'tu, vous', u'you (singular), thou'), (u'il', u'he'), (u'nous', u'we'), (u'vous', u'you (plural)'), (u'ils, elles', u'they'), (u'ceci', u'this'), (u'cela', u'that'), (u'ici', u'here'), (u'l\xe0', u'there')]
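
Pairs like the French/English ones above convert directly into a simple lookup dictionary. A minimal sketch (Python 3), using a few of the pairs copied by hand from the output; the variable names are illustrative:

```python
# A handful of (French, English) pairs taken from the output above.
pairs = [("je", "I"), ("il", "he"), ("nous", "we"), ("ici", "here")]

fr2en = dict(pairs)                        # French -> English
en2fr = {en: fr for fr, en in pairs}       # reversed direction

print(fr2en["nous"], en2fr["here"])
```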

5 WordNet


6 Processing raw text

6.1 Tokenizing and converting to a Text object


>>> from urllib import urlopen
>>> url = ""
>>> raw = urlopen(url).read().decode("utf-8")
>>> raw[:100]
u'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the '
>>> import nltk
>>> from nltk import word_tokenize
>>> tokens = word_tokenize(raw)
>>> tokens[:20]
[u'\ufeffThe', u'Project', u'Gutenberg', u'EBook', u'of', u'Crime', u'and', u'Punishment', u',', u'by', u'Fyodor', u'Dostoevsky', u'This', u'eBook', u'is', u'for', u'the', u'use', u'of', u'anyone']
>>> text = nltk.Text(tokens)      # convert to an nltk.Text object so that the Text methods can be used
>>> text
<Text: \ufeffThe Project Gutenberg EBook of Crime and Punishment...>
>>> text[:20]
[u'\ufeffThe', u'Project', u'Gutenberg', u'EBook', u'of', u'Crime', u'and', u'Punishment', u',', u'by', u'Fyodor', u'Dostoevsky', u'This', u'eBook', u'is', u'for', u'the', u'use', u'of', u'anyone']
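
The leading \ufeff in the output above is a UTF-8 byte order mark that survived decoding. One way to avoid it, sketched here with a hand-made byte string (Python 3) rather than the downloaded file, is to decode with the "utf-8-sig" codec, which drops the mark if it is present:

```python
# Bytes as they might arrive from a file that starts with a UTF-8 BOM.
raw_bytes = b"\xef\xbb\xbfThe Project Gutenberg EBook"

with_bom = raw_bytes.decode("utf-8")         # BOM is kept as u'\ufeff'
without_bom = raw_bytes.decode("utf-8-sig")  # BOM is stripped while decoding

print(repr(with_bom[:4]), repr(without_bom[:4]))
```

With the mark removed, the first token becomes 'The' instead of '\ufeffThe'.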

6.2 Processing HTML


>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> url = ""
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> soup = BeautifulSoup(html, "lxml")
>>> raw = soup.text
>>> raw[:60]
u"\n\nBBC NEWS | Health | Blondes 'to die out in 200 years'\n\n\n\n\n"
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("gene")
Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure
er's Polio campaign launched in Iraq Gene defect explains high blood pressure
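
If BeautifulSoup is not available, the tag-stripping step can be approximated with the standard library. The sketch below (Python 3) is a swapped-in alternative, not what the session above uses: the class and function names are hypothetical, the HTML snippet is made up, and the result will not match soup.text exactly (for example, it skips script and style contents and collapses whitespace).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())

page = ("<html><head><title>Blondes</title><style>p {color: red}</style></head>"
        "<body><p>A recessive <b>gene</b>.</p></body></html>")
text = html_to_text(page)
print(text)
```

The resulting string can then be tokenized and wrapped in nltk.Text in the same way as the BeautifulSoup output.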

6.3 Using regular expressions


6.3.1 Simple regular expression examples

>>> import re
>>> import nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]    # match decimal numbers
[u'0.0085', u'0.05', u'0.1', u'0.16', u'0.2', u'0.25', u'0.28', u'0.3', u'0.4', u'0.5', u'0.50', u'0.54', u'0.56', u'0.60', u'0.7', u'0.82', u'0.84', u'0.9', u'0.95', u'0.99', ...]
>>> [w for w in wsj if re.search(r"^[0-9]{4}$", w)]          # match four-digit integers
[u'1614', u'1637', u'1787', u'1901', u'1903', u'1917', u'1925', u'1929', u'1933', u'1934', u'1948', u'1953', u'1955', u'1956', u'1961', u'1965', u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1975', u'1976', u'1977', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2005', u'2009', u'2017', u'2019', u'2029', u'3057', u'8300']
>>> [w for w in wsj if re.search(r"^[0-9]+-[a-z]{3,5}$", w)] # match number-word (word length 3 to 5)
[u'10-day', u'10-lap', u'10-year', u'100-share', u'12-point', u'12-year', u'14-hour', u'15-day', u'150-point', u'190-point', u'20-point', u'20-stock', u'21-month', u'237-seat', u'240-page', u'27-year', u'30-day', u'30-point', u'30-share', u'30-year', u'300-day', u'36-day', u'36-store', u'42-year', u'50-state', u'500-stock', u'52-week', u'69-point', u'84-month', u'87-store', u'90-day']
>>> [w for w in wsj if re.search(r"[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$", w)]
[u'black-and-white', u'bread-and-butter', u'father-in-law', u'machine-gun-toting', u'savings-and-loan']
>>> [w for w in wsj if re.search(r"(ed|ing)$", w)][:20]      # match words ending in ed or ing
[u'62%-owned', u'Absorbed', u'According', u'Adopting', u'Advanced', u'Advancing', u'Alfred', u'Allied', u'Annualized', u'Anything', u'Arbitrage-related', u'Arbitraging', u'Asked', u'Assuming', u'Atlanta-based', u'Baking', u'Banking', u'Beginning', u'Beijing', u'Being', ...]
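
The same kind of filters can be tried without downloading the treebank corpus. The sketch below (Python 3) applies re.search to a short hand-written word list; the list and the variable names are made up for illustration.

```python
import re

words = ["0.25", "1987", "10-day", "30-year", "father-in-law",
         "Banking", "Asked", "8300", "3.5%", "week"]

decimals = [w for w in words if re.search(r"^[0-9]+\.[0-9]+$", w)]        # decimal numbers
four_digit = [w for w in words if re.search(r"^[0-9]{4}$", w)]            # four-digit integers
number_word = [w for w in words if re.search(r"^[0-9]+-[a-z]{3,5}$", w)]  # number-word
ed_or_ing = [w for w in words if re.search(r"(ed|ing)$", w)]              # ends in ed or ing

print(decimals, four_digit, number_word, ed_or_ing)
```

Note that "3.5%" is rejected by the decimal pattern because the `$` anchor requires the match to run to the end of the word.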

6.3.2 Extracting character chunks with re.findall()

  1. Extract sequences of two or more vowels
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                     for vs in re.findall(r"[aeiou]{2,}", word))
>>> fd.items()[:20]
[(u'aa', 3), (u'eo', 39), (u'ei', 86), (u'ee', 217), (u'ea', 476), (u'oui', 6), (u'ao', 6), (u'eu', 18), (u'au', 106), (u'io', 549), (u'ia', 253), (u'ae', 11), (u'ie', 331), (u'iao', 1), (u'iai', 1), (u'uou', 5), (u'ieu', 3), (u'ai', 261), (u'aii', 1), (u'uee', 4)]
  2. Remove word-internal vowels from English words


>>> ptn = re.compile(r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]")
>>> def compress(word):
...     pieces = ptn.findall(word)
...     return "".join(pieces)
>>> english_udhr = nltk.corpus.udhr.words("English-Latin1")
>>> nltk.tokenwrap(compress(w) for w in english_udhr[:100])
u'Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and\nof the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn\nof frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn\nrghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,\nand the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and\nblf and frdm frm fr and wnt hs bn prclmd as the hghst asprtn of the\ncmmn pple , Whrs it is essntl , if'
  3. Extract consonant-vowel sequences


>>> rotokas_words = nltk.corpus.toolbox.words("rotokas.dic")
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r"[ptksvr][aeiou]", w)]
>>> cfg = nltk.ConditionalFreqDist(cvs)
>>> cfg.tabulate()
    a   e   i   o   u
k 418 148  94 420 173
p  83  31 105  34  51
r 187  63  84  89  79
s   0   0 100   2   1
t  47   8   0 148  37
v  93  27 105  48  49
  4. Look up the list of words containing a given consonant-vowel pair
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                     for cv in re.findall(r"[ptksvr][aeiou]", w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index["su"]
>>> cv_index["po"]
[u'kaapo', u'kaapopato', u'kaipori', u'kaiporipie', u'kaiporivira', u'kapo', u'kapoa', u'kapokao', u'kapokapo', u'kapokapo', u'kapokapoa', u'kapokapoa', u'kapokapora', u'kapokapora', u'kapokaporo', u'kapokaporo', u'kapokari', u'kapokarito', u'kapokoa', u'kapoo', u'kapooto', u'kapoovira', u'kapopaa', u'kaporo', u'kaporo', u'kaporopa', u'kaporoto', u'kapoto', u'karokaropo', u'karopo', u'kepo', u'kepoi', u'keposi', u'kepoto']
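
The re.findall() techniques in this subsection can be tried on small hand-written inputs with the standard library alone. In the sketch below (Python 3), Counter stands in for nltk.FreqDist and defaultdict(list) stands in for nltk.Index; the word lists are tiny samples picked for illustration (the Rotokas words are copied from the output above).

```python
import re
from collections import Counter, defaultdict

# Count runs of two or more vowels.
words = ["beautiful", "cooperation", "sea", "idea", "tree", "see"]
vowel_runs = Counter(run for w in words
                     for run in re.findall(r"[aeiou]{2,}", w))

# Drop word-internal vowels: keep leading vowels, trailing vowels,
# and every non-vowel character.
PATTERN = re.compile(r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]")

def compress(word):
    return "".join(PATTERN.findall(word))

compressed = [compress(w) for w in "Universal Declaration of Human Rights".split()]

# Index words by the consonant-vowel pairs they contain.
cv_index = defaultdict(list)
for w in ["kapo", "kepo", "kaipori", "karopo"]:
    for cv in re.findall(r"[ptksvr][aeiou]", w):
        cv_index[cv].append(w)

print(vowel_runs.most_common(2))
print(" ".join(compressed))
print(cv_index["po"], cv_index["ri"])
```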

6.4 Normalizing text


6.4.1 Stemming

NLTK provides several commonly used stemmer interfaces: the Porter stemmer, the Lancaster stemmer, and the Snowball stemmer. Usage examples follow:


>>> from nltk import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
>>> porter_stemmer.stem('presumably')
>>> porter_stemmer.stem('multiply')
>>> porter_stemmer.stem('provision')
>>> porter_stemmer.stem('owed')


>>> from nltk import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
>>> lancaster_stemmer.stem('presumably')
>>> lancaster_stemmer.stem('multiply')
>>> lancaster_stemmer.stem('provision')
>>> lancaster_stemmer.stem('owed')


>>> from nltk import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem("maximum")
>>> snowball_stemmer.stem("presumably")
>>> snowball_stemmer.stem("provision")
>>> snowball_stemmer.stem("owed")
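
To show what a stemmer does in principle, here is a deliberately naive suffix-stripping sketch (Python 3, stdlib only). It is not the Porter, Lancaster, or Snowball algorithm, all of which apply far more elaborate rules; the suffix list, the minimum stem length, and the function name are assumptions chosen for illustration.

```python
SUFFIXES = ("ing", "ly", "ed", "ies", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, unless the remaining stem gets too short."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

for w in ["walking", "presumably", "owed", "maximum"]:
    print(w, "->", naive_stem(w))
```

Even this toy version shows the typical trade-off: stems such as "presumab" are not dictionary words, which is the gap that lemmatization addresses.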


6.4.2 Lemmatization


>>> from nltk import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize("cars")
>>> lmtzr.lemmatize("feet")
>>> lmtzr.lemmatize("people")
>>> lmtzr.lemmatize("fantasized", pos="v")
