Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2017/TopicModelling


Timestamp:
Sep 14, 2018, 11:54:49 AM (6 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/TopicModelling

  • en/AdvancedNlpCourse2017/TopicModelling

    v1 v1  
     1= Topic identification, topic modelling =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
     4
     5Prepared by: Jirka Materna
     6
     7== State of the Art ==
     8Topic modelling is a statistical approach for discovering abstract topics hidden in text documents. A document is usually treated as a mixture of several topics, each with its own weight. Each topic can in turn be described by the words most typical of it. The most frequently used topic modelling methods are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
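The mixture view can be illustrated with a small, purely hypothetical example. The sketch below assumes the probabilistic reading used by LDA, where the probability of a word in a document is the topic-weighted sum of the word's probability in each topic. The topic names, words and numbers are invented for illustration and do not come from any trained model.

```python
# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "sport": {"match": 0.5, "goal": 0.4, "vote": 0.1},
    "politics": {"match": 0.1, "goal": 0.1, "vote": 0.8},
}

# Hypothetical document: a mixture of topics with different weights (sums to 1).
doc_weights = {"sport": 0.7, "politics": 0.3}


def word_probability(word, doc_weights, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in doc_weights.items())


def top_words(topic, n):
    """Return the n words with the highest probability in a topic."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


p_goal = word_probability("goal", doc_weights, topics)  # 0.7 * 0.4 + 0.3 * 0.1
print(round(p_goal, 2))
print(top_words(topics["politics"], 2))
```

LSA does not produce probabilities; its topic weights come from a truncated singular value decomposition and may be negative, so this sketch applies to the LDA-style interpretation only.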
     9=== References ===
     10
     11 1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
     12 1. Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
     13 1. S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’88, pages 281–285, New York, NY, USA, 1988. ACM. ISBN 0-201-14237-6.
     14
     15== Practical Session ==
     16
     17In this session we will use [[http://radimrehurek.com/gensim/|Gensim]] to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.
     18
     19 1. Download and extract the corpus of Czech Wikipedia documents:  [[htdocs:bigdata/wiki.tar.bz2|wiki corpus]].
     20 1. Train LSA and LDA models of the corpus for various numbers of topics using Gensim. You can use this template: [raw-attachment:models.py models.py].
     21 1. For both LSA and LDA, select the best model, either by inspecting the resulting topics or, for LDA, by computing the perplexity on a held-out test set.
     22 1. Select the 5 most important topics, each with its 10 most important words, give each topic a name, save the result to a text file and upload it to the odevzdavarna (homework vault).