0.document delineation and character sequence decoding
obtain the character sequence in a document
- determine the character encoding
- determine the document file
- assumption: text is a linear sequence of characters.
choose a document unit
- indexing granularity
trade-off between precision(small - sentence) and recall(large - book).
1.determine the vocabulary of terms
tokenization
terminology
- token
an instance of a sequence of characters - type
the class of all tokens containing the same character sequence. - term
A term is a (perhaps normalized) type that is included in the IR system’s dictionary.
to sleep perchance to dream,
5 tokens,4 types (because there are two instances of to), 3 terms ( if to is omitted from the index as a stop word)
tricky cases
- do the exact same tokenization for both the dictionary and query
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
language specific
- unusual specific tokens:email address, URL, IP
Items such as the date of an email, which have a clear semantic
type, are often indexed separately as document metadata - hyphenations, splitting on white space
encourage users to enter hyphens wherever they may be
possible(depends on user' training) - new language, new issue
Chinese:word segmentation
dropping common terms: stop words
- collect frequency, hand-filtered build stop list
- modern IR system abandoned, for its harm to phrase query
normalization
token normalization
the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
equivalence classing
- using mapping rules that remove characters like hyphens
- maintain relations between unnormalized tokens.
expansion lists
- query expansion lists
- perform the expansion during index construction
Although more space for postings seemed to be less efficient, for modern IR system, the increased flexibility is more appealing.
common forms of normalization
- Accents and diacritics.
Spanish
Non-ASCII text is hardly used on many computer systems, it might be best to equate all words to a form without diacritics. - Capitalization/case-folding
Do case-folding by reducing all letters to lower case.(companies, government organization, personal names)
Making every token lowercase is to just make some tokens lowercase.(if your users usually use lowercase regardless of the correct case of words?) - other issues in English:
different version of English. - other languages:
Japanese: multiple alphabets
stemming and lemmatization
- For grammatical reasons, documents are going to use different forms of a
word, such as organize, organizes, and organizing. - stemming
a crude heuristic process that chops off the ends of words - lemmatization
doing things properly with the use of a vocabulary and morphological analysis of words.
2.faster posting list intersection via skip points
skip list: adding skip points
for a postings list of length P, use √P evenly spaced skip pointers.
the optimal encoding for an inverted index
- Traditionally, CPUs were slow, and so highly compressed techniques were not optimal.
- Now CPUs are fast and disk is slow, so reducing disk postings list size dominates.
3.positional postings and phrase queries
Many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without use of double quotes.
Biword indexes
- longer phrase: break them down
- NX*N: a extended biword indexes
i.e.the abolition of slavery - increase the space cost
Positional indexes
<doc1ID,frequency: <�position1, position2, . . . �>;
doc2ID,frequency: <�position1, position2, . . . �>;
...>
Combination schemes
A combination strategy uses a phrase index, or just a biword index, for certain common queries(like Micheal Jackson) and uses a positional index for other phrase queries.