
Version 21 (modified by Zuzana Nevěřilová, 6 months ago)


Topic identification, topic modeling

IA161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová, Adam Rambousek, Jirka Materna

State of the Art

Topic modeling is a statistical approach for discovering abstract topics hidden in text documents. A document usually consists of multiple topics with different weights. Each topic can be described by typical words belonging to the topic. The most frequently used topic modeling methods are Latent Semantic Analysis and Latent Dirichlet Allocation.
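The idea of a document as a weighted mixture of topics can be illustrated with a minimal sketch. The topic names, words, and all probabilities below are hypothetical values invented for illustration, not the output of any trained model. The mixture formula follows the probabilistic view used by LDA-style models; LSA instead works with real-valued weights from a matrix decomposition.

```python
# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "bank": 0.1},
    "finance": {"bank": 0.6, "money": 0.3, "game": 0.1},
}

# Hypothetical document: a mixture of topics whose weights sum to 1.
doc_topic_weights = {"sports": 0.7, "finance": 0.3}


def word_probability(word):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topic_weights.items()
    )


def top_words(topic, n=2):
    """Return the n most probable words of a topic, i.e. its 'typical words'."""
    distribution = topics[topic]
    return sorted(distribution, key=distribution.get, reverse=True)[:n]


print(round(word_probability("game"), 2))  # 0.7 * 0.5 + 0.3 * 0.1
print(top_words("finance"))
```

A trained model estimates both the per-topic word distributions and the per-document topic weights from the corpus; here they are simply written by hand.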

References

  1. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
  2. Curiskis, S. A., Drake, B., Osborn, T. R., and Kennedy, P. J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2):102034.
  3. Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408.

Practical Session

In this session, we will use Gensim to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.

  1. Create <YOUR_FILE>, a text file named ia161-UCO-07.txt where UCO is your university ID.
  2. Gensim is installed on epimetheus1.fi.muni.cz, where models are processed faster, but you can also use your own installation.
  3. Download and extract the corpus of Wikipedia documents: English wiki corpus.
  4. Train LSA and LDA models of the corpus for various numbers of topics using Gensim. You can use this template: models.py or Google Colab.
  5. Check the coherence for various parameters.
  6. Select the best model for both LSA and LDA (by looking at the data or by coherence).
  7. For each model, select the two most significant topics that make sense to you and compare your choice with the coherence score. Give each topic a name, save the names into <YOUR_FILE>, and submit the file to the homework vault (Odevzdavarna).

You can save the files in your home directory on the NLP computers; the same directory is accessible on the server.
