TopicModelling

-                      v6
+                      v7
 In this session we will use [[http://radimrehurek.com/gensim/|Gensim]] to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.
+Students will also be required to produce some results from their work and hand them in to show that they completed the tasks.
+. Download and extract the corpus of Czech Wikipedia documents:  [[htdocs:bigdata/wiki.tar.bz2|wiki corpus]].
+. Train LSA and LDA models of the corpus for various numbers of topics using Gensim. You can use this template: [raw-attachment:models.py models.py].
+. For both LSA and LDA, select the best model (by looking at the data or, for LDA, by computing the perplexity of a test set).
+. Select the 5 most important topics, each with its 10 most important words, give each topic a name, save the result to a text file, and upload it to the odevzdavarna (homework vault).
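
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of models.py: the function names `train_and_report`, `bound_to_perplexity` and `format_topic`, the default topic counts, and the output file name are all assumptions made for the example. It also assumes the documents have already been read from the extracted corpus and tokenised into lists of words, since the corpus file format is not described here. The topic names it writes are placeholders; real names have to be chosen by reading the words.

```python
def bound_to_perplexity(per_word_bound):
    """Turn the per-word likelihood bound from LdaModel.log_perplexity into a perplexity."""
    return 2.0 ** (-per_word_bound)


def format_topic(name, words):
    """Render one named topic as a single line of text."""
    return "{}: {}".format(name, ", ".join(words))


def train_and_report(train_docs, test_docs, topic_counts=(50, 100, 200),
                     n_topics=5, n_words=10, out_path="topics.txt"):
    """Train LDA for several topic counts plus one LSA model, then save topics.

    train_docs and test_docs are lists of token lists (an assumed input format).
    """
    # Imported here so the two helpers above stay usable without Gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(train_docs)
    train_bow = [dictionary.doc2bow(doc) for doc in train_docs]
    test_bow = [dictionary.doc2bow(doc) for doc in test_docs]

    # LDA: keep the topic count with the lowest perplexity on the test set.
    best_lda, best_ppl = None, float("inf")
    for k in topic_counts:
        lda = models.LdaModel(train_bow, id2word=dictionary,
                              num_topics=k, random_state=0)
        ppl = bound_to_perplexity(lda.log_perplexity(test_bow))
        if ppl < best_ppl:
            best_lda, best_ppl = lda, ppl

    # LSA has no perplexity; train one model and judge its topics by eye.
    lsa = models.LsiModel(train_bow, id2word=dictionary,
                          num_topics=topic_counts[0])

    # Write the first n_topics topics of each model with placeholder names.
    lines = []
    for label, model in (("lda", best_lda), ("lsa", lsa)):
        shown = model.show_topics(num_topics=n_topics, num_words=n_words,
                                  formatted=False)
        for topic_id, pairs in shown:
            placeholder = "{}_topic_{}".format(label, topic_id)
            lines.append(format_topic(placeholder, [word for word, _ in pairs]))
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return best_lda, lsa, best_ppl
```

Calling `train_and_report(train_docs, test_docs)` would write one line per topic to `topics.txt`. The test documents need to share vocabulary with the training documents, otherwise the perplexity cannot be computed. Which topics count as "most important" is a judgment call; the sketch simply takes the ones that `show_topics` returns first.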