Topic Modeling

Topic Modeling¶

While more and more texts are available online, we simply do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information.

"Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time".

"LDA is a statistical model of document collections that tries to capture this intuition of assigning words to their topics"

Key Insights

Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize the collection according to the discovered themes.

Topic modeling algorithms can be applied to massive collections of documents. Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.

Topic modeling algorithms can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, and social networks.

Objective: We are looking to find groups (similar to clustering)

Model the topics that exist in our data
Find the underlying themes (ie topics)

Model types

Probabilistic Model: Refers to a broad category of models that incorporate probabilities to represent uncertainty. This term doesn't inherently imply mixed membership, as a model could be probabilistic without allowing for mixed membership.
Mixed Membership Model: Refers to a specific type of probabilistic model that explicitly allows for the simultaneous presence of multiple categories in each data point. In the context of topic modeling, mixed membership models acknowledge that documents may be associated with more than one topic.

LDA¶

LDA | Automatically discover topics from sentences

Strengths v weaknesses

Strengh: Good for finding core thematic structure in large archives of text. Adaptable and extendable.
Weakness: Limited by assumptions. Evaluation should be based on interpretability.

Assumes

There is a set distribution of topic for our documents
Order of documents doesn't matter
Bag of words - order doesn't matter

Parameters

topic
per-doc topic distribution
per-doc per-word topic assignment

Notes

Latent - the "groups" that we don't know
K = number of topics
D = number of documents
N = number of unique words total

Distributions

Each document is a distribution of topics
Each column is a topic (multinomial distributions)
Each row is a word
Each of these topics have distributions of words
Looking on a per topic basis - weight words accordingly

How it works

Take documents
Take words in each document
Decide how many topics you would like
Per topic - how likely is each word to belong to us?
Now that we know this, what topics do the words fall in for 1 specific document?
Doc 1 = 50% sports, 40% news

Understand what the Dirichlet and Multinomial Distributions represent

Diri = topic distribution
How are they used in LDA? How could you use LDA in contexts that are not textual?
Genetic data, images, social networks