Topic identification, topic modeling

IA161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová, Adam Rambousek, Jirka Materna

State of the Art

Topic modeling is a statistical approach to discovering the abstract topics hidden in a collection of text documents. A document usually combines several topics, each with a different weight, and each topic can be described by the words most typical of it. The most frequently used topic modeling methods are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
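
To make the idea of "topics with weights" concrete, the following minimal sketch treats a document as a weighted mixture of topics and each topic as a probability distribution over words, which is the view taken by probabilistic models such as LDA (LSA uses real-valued weights instead of probabilities). The topic names, words, and all numbers are invented for illustration and do not come from any trained model.

```python
# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.6, "bank": 0.3, "game": 0.1},
}

# Hypothetical document: 70 % sports, 30 % finance.
doc_topic_weights = {"sports": 0.7, "finance": 0.3}


def word_probability(word, doc_weights, topic_words):
    """Probability of a word in the document: sum over topics of
    (topic weight in the document) * (word probability in the topic)."""
    return sum(
        weight * topic_words[name].get(word, 0.0)
        for name, weight in doc_weights.items()
    )


def top_words(topic, n=2):
    """The n most probable words of a topic, used to describe it."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


print(round(word_probability("game", doc_topic_weights, topics), 2))  # 0.38
print(top_words(topics["finance"]))  # ['market', 'bank']
```

A topic model works in the opposite direction: it is given only the words of each document and has to estimate both the topic-word table and the per-document topic weights.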

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
Weijie Xu, Xiaoyu Jiang, Srinivasan Sengamedu Hanumantha Rao, Francis Iannacci, and Jinjin Zhao. 2023. vONTSS: vMF based semi-supervised neural topic modeling with optimal transport. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4433–4457, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.findings-acl.271/
Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408. https://dl.acm.org/doi/abs/10.1145/2684822.2685324

Practical Session

In this session, we will use Gensim to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.

Create <YOUR_FILE>, a text file named ia161-UCO-07.txt, where UCO is your university ID.
Using Gensim in Google Colab, train LSA and LDA models of the corpus with various numbers of topics.
Follow the instructions in the Colab notebook.
Save your results to <YOUR_FILE> and submit the file to the homework vault (Odevzdavarna).