1. Introduction
The main idea behind text mining is to “turn text into numbers”.
1.1 Areas
- Search and information retrieval (IR): Storage and retrieval of text documents, including search engines and keyword search.
- Document clustering: Grouping and categorizing terms, snippets, paragraphs, or documents, using data mining clustering methods.
- Document classification: Grouping and categorizing snippets, paragraphs, or documents, using data mining classification methods, based on models trained on labeled examples.
- Web mining: Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.
- Information extraction (IE): Identification and extraction of relevant facts and relationships from unstructured text; the process of making structured data from unstructured and semi-structured text.
- Natural language processing (NLP): Low-level language processing and understanding tasks (e.g., tagging part of speech); often used synonymously with computational linguistics.
- Concept extraction: Grouping of words and phrases into semantically similar groups.
1.2 Processes
- Planning
- Sets the foundation for the analysis
- Preparing (data assembly) and data pre-processing
- Gathering the relevant documents.
- Information extraction (unearthing information of interest from these documents), including data reduction and preparation involving text pre-processing and term-document representation.
- Data exploration (discovering new associations among the extracted pieces of information)
- Text analysis (building models) using supervised analysis methods (such as classification analysis and sentiment analysis) and unsupervised analysis methods (such as latent semantic analysis, cluster analysis and topic models).
- Reporting
- Interpretation of the findings and their significance.
- Two key elements: storytelling and visualization.
2. Concepts
- Syntax: Specific grammar rules and language conventions govern how language is used, leading to statistical patterns appearing frequently in large amounts of text.
- Semantics: Refers to the meaning of the individual words within the surrounding context.
- The Generalized Vector-Space Model: The most popular structured representation of text is the vector-space model, which represents text as a vector where the elements of the vector indicate the occurrence of words within the text.
- Bag-of-words: A vector-space representation that ignores word order and grammar, keeping only which words occur and how often (see the sketch after this list).
- Homographs: Words that are spelt the same but have different meanings. Homographs do not typically have a large effect on the results of text mining algorithms.
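A minimal sketch of the bag-of-words/vector-space idea in plain Python (the toy corpus, vocabulary ordering, and counting scheme are illustrative assumptions):

```python
from collections import Counter

# Toy corpus: each "document" is already tokenized and case-normalized.
docs = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "slept"],
]

# Fixed vocabulary over the whole corpus (sorted for a stable element order).
vocab = sorted({token for doc in docs for token in doc})

def bag_of_words(doc):
    """Map a token list to a count vector aligned with `vocab`."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

print(vocab)                  # ['cat', 'chased', 'dog', 'slept', 'the']
print(bag_of_words(docs[0]))  # [1, 1, 1, 0, 2]
print(bag_of_words(docs[1]))  # [1, 0, 0, 1, 1]
```

Note that word order is discarded: “the dog chased the cat” and “the cat chased the dog” map to the same vector.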
3. Preprocessing
- Choose the scope of the text to be processed (documents, paragraphs, etc.).
- Tokenize: Break text into discrete words called tokens.
- Remove stopwords (“stopping”): Remove common words such as 'the'.
- Stem: Remove prefixes and suffixes to normalize words – e.g. run, running, and runs would all be stemmed to run.
- Typically, the stemming process includes the identification and removal of prefixes, suffixes, and inappropriate pluralizations.
- E.g., normalize walking, walks, walked, walker, and so on into walk.
- Popular methods: the Snowball stemmer; lemmatization (mapping words to their dictionary base forms) is a common, more linguistically informed alternative. See the pipeline sketch at the end of this section.
- Normalize spelling: Unify misspellings and other spelling variations into a single token.
- Detect sentence boundaries: Mark the ends of sentences.
- Normalize case: Convert the text to either all lower or all upper case.
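Chained together, the steps above form a small preprocessing pipeline. A minimal sketch using NLTK is given below; it assumes the punkt tokenizer models and the stopwords corpus have been downloaded, and any comparable toolkit would serve equally well:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# One-time setup (assumed already done):
# nltk.download("punkt"); nltk.download("stopwords")

def preprocess(text):
    """Tokenize, keep word tokens, normalize case, stop, and stem."""
    stop = set(stopwords.words("english"))
    stemmer = SnowballStemmer("english")
    tokens = nltk.word_tokenize(text.lower())      # tokenize + case normalization
    tokens = [t for t in tokens if t.isalpha()]    # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop]  # stopping
    return [stemmer.stem(t) for t in tokens]       # stemming

print(preprocess("The walkers were walking while their dog walked."))
# -> ['walker', 'walk', 'dog', 'walk']
```

Swapping the stemmer for a lemmatizer (e.g., NLTK's WordNetLemmatizer) would map tokens to dictionary base forms instead.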
4. Creating vectors
After text pre-processing has been completed, the individual word tokens must be transformed into a vector representation suitable for input into text mining algorithms.
This vector representation can take one of three different forms:
- a binary representation,
- an integer count, or
- a float-valued weighted vector.
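For concreteness, here is one toy document encoded in each of the three forms (the vocabulary and the float weights are made-up placeholders):

```python
vocab = ["cat", "dog", "the"]                 # assumed fixed vocabulary
doc = ["the", "dog", "chased", "the", "dog"]  # "chased" is out-of-vocabulary here

counts = [doc.count(t) for t in vocab]        # integer count: [0, 2, 2]
binary = [int(c > 0) for c in counts]         # binary:        [0, 1, 1]
# The float-valued form scales counts by a weight; TF-IDF (below) is the usual choice.
```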
TF-IDF stands for “term frequency-inverse document frequency”.
The assumption behind TF-IDF is that words with high term frequency should receive high weight unless they also have high document frequency.
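One common formulation (one of several variants; the toy corpus below is an assumption for illustration) weights each in-document term count by log(N/df), where N is the number of documents and df is how many documents contain the term, so a word that occurs in every document gets weight zero:

```python
import math
from collections import Counter

docs = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
]
N = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight = raw term frequency * log(N / document frequency)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(N / df[term]) for term in tf}

print(tfidf(docs[0]))
# "dog" occurs in all 3 documents, so log(3/3) = 0 -> weight 0 despite its frequency;
# "bites" and "man" occur in 2 of 3 documents -> weight log(3/2) each.
```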
5. Applications
- Extracting “meaning” from unstructured text.
– This application involves the understanding of core themes and relevant messages in a corpus of text, without actually reading the documents.
– Common use cases: sentiment analysis, trending themes in a stream of text, and summarizing text.
- Automatic text categorization.
– Automatically classifying text is an efficient way to organize text for downstream processing.
- Improving predictive accuracy in predictive modeling or unsupervised learning.
– Combining unstructured text with structured numeric information in predictive modeling or unsupervised learning (clustering) is a powerful method to achieve better accuracy.
- Identifying specific or similar/relevant documents.
– Efficiently extracting from a large corpus of text those documents that are relevant to a particular topic of interest or are similar to a target document (or documents) is a vitally necessary operation in information retrieval.
- Extracting specific information from the text (“entity extraction”).
– Automatically extracting specific information from the text (such as names, geographical locations, and dates) is an efficient method for presenting highly focused information for downstream analytical processing or for direct use by decision makers.