Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/TopicModelling


Timestamp: Sep 25, 2023, 2:21:48 PM
Author: Zuzana Nevěřilová
= Topic identification, topic modeling =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Zuzana Nevěřilová, Adam Rambousek, Jirka Materna

== State of the Art ==
Topic modeling is a statistical approach for discovering abstract topics hidden in text documents. A document usually consists of multiple topics with different weights. Each topic can be described by typical words belonging to the topic. The most frequently used topic modeling methods are Latent Semantic Analysis and Latent Dirichlet Allocation.
=== References ===

 1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
 1. Curiskis, S. A., Drake, B., Osborn, T. R., and Kennedy, P. J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2):102034.
 1. Yee W. Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
 1. Castellanos, A., Juan Cigarrán, and Ana García-Serrano. Formal concept analysis for topic detection: a clustering quality experimental analysis. Information Systems, 66 (2017): 24–42.
 1. Xie, Pengtao, and Eric P. Xing. Integrating document clustering and topic modeling. arXiv preprint arXiv:1309.6874 (2013).

== Practical Session ==

In this session, we will use [[http://radimrehurek.com/gensim/|Gensim]] to model latent topics of Wikipedia documents. We will focus on Latent Semantic Analysis and Latent Dirichlet Allocation models.

 1. Create `<YOUR_FILE>`, a text file named ia161-UCO-07.txt, where UCO is your university ID.
 1. Gensim is installed on `epimetheus1.fi.muni.cz` and offers faster model processing, but you can easily use your own installation.
 1. Download and extract the corpus of Wikipedia documents: [[htdocs:bigdata/wiki_en.tar.bz2|English wiki corpus]].
 1. Train LSA and LDA models of the corpus for various numbers of topics using Gensim. You can use this template: [raw-attachment:models.py models.py] or [[https://colab.research.google.com/drive/1nTJaNkwclqBSI6Kk6X_uUHLtViWgbe_S?usp=sharing|Google Colab]].
 1. Check the coherence for various parameter settings.
 1. Select the best model for both LSA and LDA (by looking at the data or by coherence).
 1. For each model, select the two most significant topics that make sense to you and compare them with the coherence score. Give them names, save the names into `<YOUR_FILE>`, and submit the file to the homework vault (Odevzdavarna).
You can save the files in your home directory on the NLP computers; they will be accessible on the server.