4.2.3. Text feature extraction
4.2.3.1. The Bag of Words representation
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
In this scheme, features and samples are defined as follows:
- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
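The tokenizing and counting steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `tokenize` and `bag_of_words` are hypothetical, the token rule (runs of two or more word characters, lowercased) is a simplification, and the normalization step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    tokenized = [tokenize(doc) for doc in documents]
    # Step 1: give every distinct token an integer id (alphabetical order here).
    unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
    vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}
    # Step 2: count the occurrences of each token in each document.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts.get(tok, 0) for tok in unique_tokens])
    return vocabulary, matrix


docs = ["the cat sat on the mat", "the dog sat"]
vocab, X = bag_of_words(docs)
print(vocab)
print(X)
```

Word order is discarded: each row only records how often each vocabulary entry occurs in the corresponding document.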
4.2.3.2. Sparsity
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory, and also to speed up matrix / vector algebraic operations, implementations will typically use a sparse representation such as the ones available in the scipy.sparse package.
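As a rough illustration of why a sparse representation helps, the sketch below stores a small, made-up count matrix in SciPy's CSR format. The matrix values are hypothetical; only the non-zero entries are stored explicitly, and matrix / vector products operate on that compact structure.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy count matrix: 3 documents, 8 vocabulary terms.
dense = np.array([
    [0, 2, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
])
sparse = csr_matrix(dense)

# Fraction of entries that are actually stored (non-zero).
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)

# A matrix / vector product computed directly on the sparse structure.
weights = np.ones(dense.shape[1])
row_sums = sparse.dot(weights)
print(row_sums)
```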
4.2.3.3. Common Vectorizer usage
CountVectorizer implements both tokenization and occurrence counting in a single class:
>>> from sklearn.feature_extraction.text import CountVectorizer
This model has many parameters; however, the default values are quite reasonable (please see the reference documentation for the details):
>>> vectorizer = CountVectorizer()
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
... ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
>>> vectorizer.transform(['Something completely new.']).toarray()
...
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
Note that in the previous corpus, the first and the last documents have exactly the same words and hence are encoded as equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
... token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
... ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)
In particular the interrogative form “Is this” is only present in the last document:
>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)
4.2.3.4. Tf–idf term weighting
In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as
$$\text{idf}(t) = \log \frac{1 + n_d}{1 + \text{df}(d, t)} + 1,$$
where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:
$$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}.$$
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.
The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly and how the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation that defines the idf as
$$\text{idf}(t) = \log \frac{n_d}{1 + \text{df}(d, t)}.$$
In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the “1” count is added to the idf instead of the idf’s denominator:
$$\text{idf}(t) = \log \frac{n_d}{\text{df}(d, t)} + 1$$
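To make the difference between these idf definitions concrete, the following sketch evaluates each of them for one hypothetical term. The values `n_d = 6` and `df = 2` are made up for illustration, and the natural logarithm is assumed.

```python
import math

n_d = 6  # total number of documents (hypothetical value)
df = 2   # number of documents containing the term (hypothetical value)

# Textbook definition: the "1" is added to the denominator.
idf_textbook = math.log(n_d / (1 + df))

# smooth_idf=False: the "1" is added to the idf itself.
idf_unsmoothed = math.log(n_d / df) + 1

# smooth_idf=True: "1" is added to both numerator and denominator,
# and "1" is still added to the idf itself.
idf_smoothed = math.log((1 + n_d) / (1 + df)) + 1

print(round(idf_textbook, 4), round(idf_unsmoothed, 4), round(idf_smoothed, 4))
```

With either scikit-learn variant the idf of a term is at least 1, so a term that occurs in every document still keeps a non-zero weight.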
This normalization is implemented by the TfidfTransformer class:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
use_idf=True)
Again please see the reference documentation for the details on all the parameters.
Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features are present less than 50% of the time, hence probably more representative of the content of the documents:
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.81940995, 0. , 0.57320793],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.47330339, 0.88089948, 0. ],
[ 0.58149261, 0. , 0.81355169]])
Each row is normalized to have unit Euclidean norm:
$$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$$
For example, we can compute the tf-idf of the first term in the first document in the counts array as follows:
$$n_d = 6$$
$$\text{df}(d, t)_{\text{term1}} = 6$$
$$\text{idf}(t)_{\text{term1}} = \log \frac{n_d}{\text{df}(d, t)} + 1 = \log(1) + 1 = 1$$
$$\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3$$
Now, if we repeat this computation for the remaining 2 terms in the document, we get
$$\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1) + 1) = 0$$
$$\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2) + 1) \approx 2.0986$$
and the vector of raw tf-idfs:
$$\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986].$$
Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:
$$\frac{[3, 0, 2.0986]}{\sqrt{3^2 + 0^2 + 2.0986^2}} = [0.819, 0, 0.573].$$
Furthermore, the default parameter smooth_idf=True adds “1” to the numerator and denominator as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions:
$$\text{idf}(t) = \log \frac{1 + n_d}{1 + \text{df}(d, t)} + 1$$
Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:
$$\text{tf-idf}_{\text{term3}} = 1 \times (\log(7/3) + 1) \approx 1.8473$$
And the L2-normalized tf-idf changes to $\frac{[3, 0, 1.8473]}{\sqrt{3^2 + 0^2 + 1.8473^2}} = [0.8515, 0, 0.5243]$:
>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[ 0.85151335, 0. , 0.52433293],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.55422893, 0.83236428, 0. ],
[ 0.63035731, 0. , 0.77630514]])
The weights of each feature computed by the fit method call are stored in a model attribute:
>>> transformer.idf_
array([ 1. ..., 2.25..., 1.84...])
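The smoothed weights and the normalized matrix shown above can be reproduced by hand with NumPy. This is a sketch that follows the formulas of this section (natural logarithm, smoothed idf, L2 row normalization); it is not scikit-learn's actual implementation.

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_d = counts.shape[0]                    # total number of documents
df = (counts > 0).sum(axis=0)            # document frequency of each term
idf = np.log((1 + n_d) / (1 + df)) + 1   # smoothed idf

raw = counts * idf                       # tf * idf, before normalization
tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # unit L2 norm per row

print(idf)
print(tfidf[0])
```

No row of this `counts` array is all zeros, so the division by the row norm is safe here; a general implementation would have to guard against empty documents.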
As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
While the tf–idf normalization is often very useful, there might be cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
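The effect of the binary parameter can be seen on a single document in which one word is repeated. The snippet is a small sketch; the variable names `counts` and `flags` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the second second document."]

# Default behaviour: raw occurrence counts ("second" is counted twice).
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True: every non-zero count is clipped to 1.
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)
print(flags)
```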
As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier:
- Sample pipeline for text feature extraction and evaluation
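A minimal sketch of such a pipeline is shown below. The tiny corpus, its labels, the step names `vect` and `clf`, the choice of MultinomialNB as classifier, and the parameter grid are all assumptions made for illustration; a real search would use a much larger dataset and grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = advertisement, 0 = office mail.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "limited offer buy now",
    "cheap offer click now",
    "meeting agenda for monday",
    "project status report attached",
    "lunch on monday with the team",
    "please review the attached report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

Because the vectorizer is refit inside every cross-validation split, its vocabulary and idf weights are learned only from the training part of that split.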
4.2.3.5. Decoding text files
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
Note
An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).
If you are having trouble decoding text, here are some things to try:
- Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
- You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
- You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
- Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.
- If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
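The decoding options listed above can be tried with the standard library alone. The sketch below uses a Latin-1 encoded byte string and compares strict UTF-8 decoding, UTF-8 decoding with errors='replace', and the latin-1 fallback.

```python
# Latin-1 bytes: 0xFC is "ü" in Latin-1 but is not valid UTF-8.
raw = b"holdselig sind deine Ger\xfcche"

# Strict decoding with the wrong encoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# latin-1 maps every byte to a character, so decoding never fails.
as_latin1 = raw.decode("latin-1")

print(strict_ok)
print(replaced)
print(as_latin1)
```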
For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.
>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v:
...     print(term)
...
(Depending on the version of chardet, it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every Software Developer Must Know About Unicode.
4.2.3.6. Applications and examples
The bag of words representation is quite simplistic but surprisingly useful in practice.
In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:
Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):
4.2.3.7. Limitations of the Bag of Words representation
A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.
N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.
One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.
For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])
In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True
>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while retaining the robustness with regards to misspellings and word derivations.
While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.
In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems, which are currently outside of the scope of scikit-learn.
4.2.3.8. Vectorizing a large text corpus with the hashing trick
The above vectorization scheme is simple, but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
- the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
- fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset,
- building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner,
- pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
- it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.
It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features of the CountVectorizer.
This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
with 16 stored elements in Compressed Sparse ... format>
You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.
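The idea behind the hashing trick, and the reason collisions appear when n_features is small, can be sketched in a few lines of plain Python. This is an illustration only: the helper `hashed_counts` is hypothetical, and `hashlib.md5` is used as a stand-in hash function, so the column indices will not match those produced by scikit-learn.

```python
import hashlib
import re


def hashed_counts(text, n_features=10):
    """Count tokens into a fixed-size vector indexed by a hash of each token."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The column is derived from the token alone: no vocabulary is stored.
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


vec = hashed_counts("This is the first document.")
print(vec)
```

Two different tokens that hash to the same index share a column, which is why a small n_features can yield fewer non-zero entries than there are distinct tokens.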
In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream model size is an issue, selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.
Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).
Let’s try again with the default setting:
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, other terms than the 19 used here might still collide with each other.
The HashingVectorizer also comes with the following limitations:
- it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
- it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer can be appended to it in a pipeline if required.
4.2.3.9. Performing out-of-core scaling with HashingVectorizer
An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer’s main memory.
A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning time is often limited by the CPU time one wants to spend on the task.
For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text documents.
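The mini-batch strategy can be sketched as follows. The generator `iter_minibatches`, its toy texts and labels, the reduced n_features value, and the choice of SGDClassifier as the incremental estimator are assumptions made for illustration; a real setup would stream batches from disk or from the network.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches():
    """Hypothetical stand-in for a data stream that does not fit in memory."""
    yield ["cheap pills buy now", "meeting agenda for monday"], [1, 0]
    yield ["buy cheap watches", "project status report"], [1, 0]
    yield ["limited offer buy now", "lunch with the team"], [1, 0]


# Stateless vectorizer: every batch is mapped to the same 2 ** 10 columns.
vectorizer = HashingVectorizer(n_features=2 ** 10)
classifier = SGDClassifier(random_state=0)
all_classes = [0, 1]  # partial_fit needs the full list of classes up front

for texts, labels in iter_minibatches():
    X_batch = vectorizer.transform(texts)
    classifier.partial_fit(X_batch, labels, classes=all_classes)

prediction = classifier.predict(vectorizer.transform(["buy pills now"]))
print(prediction)
```

Only one mini-batch is held in memory at a time, and no call to fit on the vectorizer is needed between batches.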
4.2.3.10. Customizing the vectorizer classes
It is possible to customize the behavior by passing a callable to the vectorizer constructor:
>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True
In particular we name:
- preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
- tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
- analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
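As an illustration of the preprocessor hook, the sketch below removes HTML-like tags before tokenization. The function `strip_tags_and_lowercase` and its regular expression are hypothetical and deliberately naive; the function lowercases the text itself because a custom preprocessor takes the place of the built-in preprocessing step.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def strip_tags_and_lowercase(doc):
    """Hypothetical preprocessor: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", doc).lower()


vectorizer = CountVectorizer(preprocessor=strip_tags_and_lowercase)
analyze = vectorizer.build_analyzer()

tokens = analyze("<p>Hello <b>World</b></p>")
print(tokens)
```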
(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)
To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
Some tips and tricks:
- If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split
- Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())
(Note that this will not filter out punctuation.)
The following example will, for instance, transform some British spelling to American spelling:
>>> import re
>>> def to_british(tokens):
...     # Despite its name, this maps British spellings to American ones.
...     for t in tokens:
...         t = re.sub(r"(...)our$", r"\1or", t)
...         t = re.sub(r"([bt])re$", r"\1er", t)
...         t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
...         t = re.sub(r"ogue$", "og", t)
...         yield t
...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         tokenize = super(CustomVectorizer, self).build_tokenizer()
...         return lambda doc: list(to_british(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour"))
[...'color', ...'color']
The same approach can be used for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:
- [Biclustering documents with the Spectral Co-clustering algorithm](http://scikit-learn.org/stable/auto_examples/bicluster/plot_bicluster_newsgroups.html#sphx-glr-auto-examples-bicluster-plot-bicluster-newsgroups-py)
Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator such as whitespace.
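One simple option for such languages, sketched below, is to fall back on character n-grams so that no word segmentation is needed. This is an assumption made for illustration and not a recommendation from this guide; a dedicated word segmenter passed as the tokenizer is another possibility. The two short Chinese sentences are made-up sample data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents written without whitespace between words.
docs = ["我爱机器学习", "机器学习很有趣"]

# Character unigrams and bigrams stand in for word-level tokens.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(X.shape)
print("机器" in vectorizer.vocabulary_)
```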