Changes between Version 25 and Version 26 of private/NlpInPracticeCourse/LanguageModelling


Timestamp: Sep 23, 2022, 3:14:13 AM
Author: pary

Legend:
  unchanged lines have no prefix
  + added lines
  - removed lines
  … skipped unchanged lines
  • private/NlpInPracticeCourse/LanguageModelling (v25 → v26)

=== References ===
- 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]]
  1. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
  1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
- 1. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/
+ 1. Alammar, Jay (2018). The Illustrated GPT-2 [Blog post]. Retrieved from https://jalammar.github.io/illustrated-gpt2/
+ 1. Brown, Tom, et al. (2020). "Language Models are Few-Shot Learners". [[https://arxiv.org/abs/2005.14165|arXiv:2005.14165]]
+ 1. Sennrich, Rico, et al. (2016). "Neural Machine Translation of Rare Words with Subword Units". In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, [[https://aclanthology.org/P16-1162|ACL 2016]]


…
=== Language models from scratch ===

- In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices.
- 
- We generate random text using these models and/or use the model in a real application of diacritics restoration.
+ In this workshop, we create language models for English and/or any other language from our own texts. The models use only [[https://www.fi.muni.cz/~pary/mingpt.zip|small Python modules]] with the PyTorch framework.


- Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
+ We generate random text using these models. The first model is based only on characters; the second uses subword tokenization with [[https://github.com/rsennrich/subword-nmt|BPE]].
+ 
+ 
+ Access the [[https://colab.research.google.com/drive/1GSS_KlTVkrNNqGBi6MmZ6AMgHcIQDczq?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
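
To make "generate random text" concrete, here is a minimal character-level sampling sketch in PyTorch. It is not taken from the course modules: the model interface (token indices in, per-position logits out) and the `stoi`/`itos` vocabulary mappings are assumptions about what the notebook provides.
{{{
import torch
import torch.nn.functional as F

def generate(model, stoi, itos, prompt="The ", max_new_tokens=200, temperature=1.0):
    """Sample text from a trained character-level LM, one character at a time."""
    model.eval()
    # Encode the prompt as a (1, T) tensor of character indices.
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Assumed interface: model(idx) -> logits of shape (1, T, vocab).
            # A real model may also require cropping idx to its context size.
            logits = model(idx)[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # sample one char
            idx = torch.cat([idx, next_id], dim=1)
    return "".join(itos[i] for i in idx[0].tolist())
}}}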

OR
…
=== Training data ===

- 1. Small text for fast setup: *1984 book* from Project Gutenberg
+ 1. R.U.R., a play by Karel Čapek (155 kB)
+    https://gutenberg.org/files/59112/59112-0.txt
+ 1. Small text for fast setup: *1984 book* from Project Gutenberg (590 kB)
     https://gutenberg.net.au/ebooks01/0100021.txt
- 1. Sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
-    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
+ 1. Shakespeare plays (1.1 MB)
+    https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
+ 1. Any other data, any language (even programming languages)


…


- Choose one of the following tasks.
- 
==== Task 1 ====

- Use an LM for a diacritics restoration function.

+ Generate text using a character-level neural LM.

- Write a function that takes text without diacritics as input and returns the same text
- with added diacritics. For example:
- {{{
- >>> add_dia('cerveny kriz')
- 'červený kříž'
- }}}
- 
- You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service.
- 
+ Use several different hyper-parameters (embedding size, number of layers, number of epochs). Describe the quality of the generated text with regard to the selected parameters.
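
One possible way to organize these experiments is a simple grid over the named hyper-parameters; `train_model` and `sample` below are hypothetical stand-ins for whatever the notebook actually provides, not functions from the course modules.
{{{
# Hypothetical helpers: train_model() fits a char-level LM with the given
# hyper-parameters; sample() draws random text from it.
for emb_size in (64, 128, 256):
    for n_layers in (2, 4, 6):
        for n_epochs in (1, 5):
            model = train_model(emb_size=emb_size, n_layers=n_layers, epochs=n_epochs)
            print(f"--- emb={emb_size} layers={n_layers} epochs={n_epochs} ---")
            print(sample(model, prompt="The ", max_new_tokens=200))
}}}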

==== Task 2 ====

- Generate text using a neural LM.
- 
- Write a function to generate random text using a neural LM. An optional parameter is the start of the text.
- The text can be printed or returned as the result of the function.
- 
- The function could work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model.
+ Implement a new Dataset class that uses subwords (via BPE) instead of characters.
+ Compare the generated text with text generated by the character-level model with the same number of parameters.
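
A minimal sketch of such a Dataset, assuming the `subword_nmt` package with BPE codes already learned by its `learn-bpe` tool; the class name, the block layout, and the vocabulary handling are illustrative choices, not prescribed by the course modules.
{{{
import torch
from torch.utils.data import Dataset
from subword_nmt.apply_bpe import BPE

class BPEDataset(Dataset):
    """Fixed-length blocks of BPE subword ids for next-token prediction."""

    def __init__(self, text, codes_path, block_size):
        with open(codes_path, encoding="utf-8") as f:
            bpe = BPE(f)                      # codes file from subword-nmt learn-bpe
        tokens = []
        for line in text.splitlines():        # segment() expects one line at a time
            tokens.extend(bpe.segment(line).split())
        vocab = sorted(set(tokens))
        self.stoi = {t: i for i, t in enumerate(vocab)}
        self.itos = vocab
        self.data = [self.stoi[t] for t in tokens]
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        chunk = self.data[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input subword ids
        y = torch.tensor(chunk[1:], dtype=torch.long)   # next-subword targets
        return x, y
}}}
The resulting `stoi`/`itos` mappings then replace the character vocabulary of the first model, so the two models can be built with the same number of parameters.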