Topic identification, topic modelling
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Jirka Materna
State of the Art
Topic modeling is a statistical approach for discovering abstract topics hidden in text documents. A document usually consists of multiple topics with different weights. Each topic can be described by typical words belonging to the topic. The most frequently used methods of topic modeling are Latent Semantic Analysis and Latent Dirichlet Allocation.
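In probabilistic topic models such as Latent Dirichlet Allocation, the idea that a document mixes several weighted topics is commonly written as a mixture over word distributions. The formula below is a sketch of that idea; the symbols (d for a document, w for a word, z for a topic, K for the number of topics) are notation assumed here, not taken from the course materials. Latent Semantic Analysis is not probabilistic and instead obtains its topics from a truncated singular value decomposition of the term-document matrix.

```latex
% Probability of word w in document d as a weighted mixture of K topics.
% p(z = k | d) is the weight of topic k in document d;
% p(w | z = k) is how typical word w is for topic k.
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d),
\qquad \sum_{k=1}^{K} p(z = k \mid d) = 1
```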
References
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
- S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’88, pages 281–285, New York, NY, USA, 1988. ACM. ISBN 0-201-14237-6.
Practical Session
In this session we will use Gensim to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.
Students will also be required to produce results from their work and hand them in as proof that they completed the tasks.
Attachments (2)
- models.py (916 bytes)
- Topic_Modeling_with_gensim.ipynb (6.5 KB)