LanguageModelling

Timestamp:: Aug 30, 2022, 10:39:35 AM (3 years ago)
Author:: Ales Horak
Comment:: copied from private/NlpInPracticeCourse/LanguageModelling

en/NlpInPracticeCourse/2021/LanguageModelling

+= Language modelling =
+[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák
+Prepared by: Pavel Rychlý
+== State of the Art ==
+The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since WWII. More recently, deep learning has entered language modelling and turned out to be substantially better than Markov n-gram models.
+The current state-of-the-art models are built on neural networks using transformers.
+=== References ===
+. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]]
+. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
+. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
+. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/
+== Practical Session ==
+=== Technical Requirements ===
+The task will be carried out in a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.
+To run the code in a local environment, you need Python 3.6+ and Jupyter Notebook.
+=== Language models from scratch ===
+In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices.
+We generate random text using these models and/or use a model in a real application: diacritics restoration.
+Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later. Leaving the browser will throw away all changes!
+OR
+download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.
+=== Training data ===
+. Small text for fast setup: the book *1984* from Project Gutenberg Australia
+    https://gutenberg.net.au/ebooks01/0100021.txt
+. Samples from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
+    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
+=== Tasks ===
+Choose one of the following tasks.
+==== Task 1 ====
+Use an LM for a diacritics restoration function.
+Write a function that takes text without diacritics as input and returns the same text
+with diacritics added. For example:
+{{{
+>>> add_dia('cerveny kriz')
+'červený kříž'
+}}}
+You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service.
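One possible starting point is a word-level unigram baseline: for every diacritics-free word form, remember the most frequent diacritized form seen in the training text. This is only a sketch under stated assumptions. The helper names `strip_dia` and `train_dia_model`, the extra `model` argument of `add_dia`, and the tiny training string are hypothetical, and the approach ignores context, punctuation and capitalization, which an actual solution based on the notebook's language models should handle better.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_dia(text):
    """Remove combining marks, e.g. turn the Czech word for 'cross' into 'kriz'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def train_dia_model(text):
    """Map each stripped word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in unicodedata.normalize("NFC", text).lower().split():
        counts[strip_dia(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}

def add_dia(text, model):
    """Restore diacritics word by word; unknown words are left unchanged."""
    return " ".join(model.get(word, word) for word in text.split())

# Tiny training text, for illustration only; use the Czech corpus samples instead.
model = train_dia_model("červený kříž má červený kříž")
print(add_dia("cerveny kriz", model))
```

Words that never occurred in the training text, such as `modry` in this toy setting, pass through unchanged, which shows why a larger corpus and a context-aware model are needed.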
+==== Task 2 ====
+Generate text using a neural LM.
+Write a function that generates random text using a neural LM. The start of the text is an optional parameter.
+The text can be printed or returned as the result of the function.
+The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model.
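The sampling loop for such a function could look roughly like the sketch below. It assumes the neural model can be wrapped in a callable that maps the text generated so far to one unnormalized score (logit) per vocabulary symbol. That interface, the names `generate_neural`, `next_logits` and `toy_logits`, and the temperature parameter are hypothetical; the notebook's neural model may expose its predictions differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate_neural(next_logits, vocab, start="", length=50, temperature=1.0, seed=None):
    """Sample `length` symbols one at a time, each conditioned on the text so far."""
    rng = random.Random(seed)
    text = start
    for _ in range(length):
        probs = softmax(next_logits(text), temperature)
        text += rng.choices(vocab, weights=probs, k=1)[0]
    return text

# Stand-in for a trained neural LM (hypothetical): it merely discourages
# repeating the previous character. Replace it with the real model's output.
vocab = list("ab ")

def toy_logits(context):
    last = context[-1:]
    return [-5.0 if ch == last else 5.0 for ch in vocab]

sample = generate_neural(toy_logits, vocab, start="a", length=20, seed=0)
print(sample)
```

Sampling from the predicted distribution, rather than always taking the most probable symbol, keeps the generated text varied; the temperature controls how far the samples stray from the most likely continuation.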
+=== Upload ===
+Upload your modified notebook or Python script with results to the [[https://nlp.fi.muni.cz/en/NlpInPracticeCourse|homework vault (odevzdávárna)]].