Animation of the topic detection process in a document-word matrix. Every column corresponds to a document, every row to a word. A cell stores the frequency of a word in a document; dark cells indicate high word frequencies. Topic models group both documents that use similar words and words that occur in a similar set of documents. The resulting patterns are called "topics".[6]
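To make the setup in the figure concrete, a toy document-word count matrix might look like the sketch below. The vocabulary, documents, and counts are invented purely for illustration.

```python
import numpy as np

# Toy document-word matrix: each column is a document, each row a word,
# and each cell holds how often that word occurs in that document.
vocabulary = ["gene", "dna", "brain", "neuron"]   # rows (illustrative words)
documents  = ["doc1", "doc2", "doc3"]             # columns (illustrative documents)

counts = np.array([
    [4, 0, 1],   # "gene"
    [3, 0, 0],   # "dna"
    [0, 5, 2],   # "brain"
    [0, 2, 6],   # "neuron"
])

# Dark cells in the animation correspond to large entries here; topic models
# look for co-occurring blocks of words and documents in such a matrix.
print(counts.shape)   # (len(vocabulary), len(documents)) == (4, 3)
```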
To actually infer the topics in a corpus, we imagine a generative process whereby the documents are created, so that we may infer, or reverse engineer, it. We imagine the generative process as follows. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over all the words. LDA assumes the following generative process for a corpus $D$ consisting of $M$ documents each of length $N_i$:
Choose $ \theta_i \sim \operatorname{Dir}(\alpha) $, where $ i \in \{ 1,\dots,M \} $ and $ \operatorname{Dir}(\alpha) $ is a [[Dirichlet distribution]] with a symmetric parameter $\alpha$ which typically is sparse ($\alpha < 1$).
Choose $ \varphi_k \sim \operatorname{Dir}(\beta) $, where $ k \in \{ 1,\dots,K \} $ and $\beta$ typically is sparse.
For each of the word positions $i, j$, where $ i \in \{ 1,\dots,M \} $ and $ j \in \{ 1,\dots,N_i \} $:
: (a) Choose a topic $z_{i,j} \sim\operatorname{Multinomial}(\theta_i). $
: (b) Choose a word $w_{i,j} \sim\operatorname{Multinomial}( \varphi_{z_{i,j}}). $
(Note that ''multinomial distribution'' here refers to the [[multinomial distribution|multinomial]] with only one trial, which is also known as the [[categorical distribution]].)
The lengths $N_i$ are treated as independent of all the other data generating variables ($w$ and $z$). The subscript is often dropped, as in the plate diagrams shown here.
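The generative process above can be simulated directly. The following is a minimal sketch in Python with NumPy; the values of $K$, $V$, $M$, the Dirichlet parameters, and the Poisson choice for document lengths are illustrative assumptions, not part of the model specification (LDA treats the lengths $N_i$ as given).

```python
import numpy as np

# Illustrative simulation of the LDA generative process described above.
rng = np.random.default_rng(0)
K, V, M = 3, 50, 5          # topics, vocabulary size, documents (example values)
alpha, beta = 0.1, 0.01     # sparse symmetric Dirichlet parameters (example values)

# Choose phi_k ~ Dir(beta) for each topic k: a distribution over all V words.
phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

documents = []
for i in range(M):
    N_i = rng.poisson(20)                          # document length; independent of w and z
    theta_i = rng.dirichlet(np.full(K, alpha))     # theta_i ~ Dir(alpha): topic mixture of document i
    words = []
    for j in range(N_i):
        z_ij = rng.choice(K, p=theta_i)            # (a) topic z_{i,j} ~ Categorical(theta_i)
        w_ij = rng.choice(V, p=phi[z_ij])          # (b) word  w_{i,j} ~ Categorical(phi_{z_{i,j}})
        words.append(w_ij)
    documents.append(words)

print(documents[0])  # word indices sampled for the first document
```

Because $\alpha$ and $\beta$ are below 1, the sampled mixtures are sparse: each simulated document draws most of its words from only a few topics, and each topic concentrates its mass on a small subset of the vocabulary.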
Source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation