= Language modelling =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Pavel Rychlý

== State of the Art ==

The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since World War II. Recently, however, deep learning has entered language modelling as well, and it has turned out to be substantially better than Markov n-gram models.

The current state-of-the-art models are built on neural networks using transformers.
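For contrast with the transformer approach, the classic Markov bigram model can be sketched in a few lines of Python. This is an illustrative toy only (our own function names, whitespace tokenization, add-one smoothing), not part of the course code:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams, vocab_size):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp

corpus = ["the cat sat", "the dog sat", "the cat ran"]
uni, bi = train_bigram_lm(corpus)
# An attested word order scores higher than a scrambled one
print(score("the cat sat", uni, bi, len(uni)) > score("sat the cat", uni, bi, len(uni)))
```

Even with smoothing, such count-based models generalize poorly to unseen n-grams; this sparsity is exactly the weakness that neural models address.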


=== References ===
1. Vaswani, Ashish, et al. (2017). "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
1. Alammar, Jay (2018). "The Illustrated Transformer" [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
1. Alammar, Jay (2018). "The Illustrated GPT-2" [Blog post]. Retrieved from https://jalammar.github.io/illustrated-gpt2/
1. Brown, Tom, et al. (2020). "Language Models are Few-Shot Learners". [[https://arxiv.org/abs/2005.14165|arXiv:2005.14165]]
1. Sennrich, Rico, et al. (2016). "Neural Machine Translation of Rare Words with Subword Units". In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, [[https://aclanthology.org/P16-1162|ACL 2016]]


== Practical Session ==

=== Technical Requirements ===

The task will be carried out in a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

To run the code in a local environment instead, you need Python 3.6+ and Jupyter Notebook.



=== Language models from scratch ===

In this workshop, we create language models for English and/or any other language from our own texts. The models use only [[https://www.fi.muni.cz/~pary/mingpt.zip|small Python modules]] built on the PyTorch framework.


We generate random text using these models. The first model is based only on characters; the later one uses subword tokenization with [[https://github.com/rsennrich/subword-nmt|BPE]].
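The BPE training procedure (Sennrich et al., reference 5 above) can be sketched in pure Python: start from single characters and repeatedly merge the most frequent adjacent symbol pair. This is a toy sketch with our own names; in the workshop, the subword-nmt package does the real work:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word is a tuple of symbols; start from single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)  # replace the pair with the merged symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

words = ["low", "lower", "lowest", "low", "low"]
merges, vocab = learn_bpe(words, 2)
print(merges)  # the pair ('l', 'o') is merged first, then ('lo', 'w')
```

On this toy corpus, the first two merges already produce a `low` subword, so frequent stems become single tokens while rare suffixes stay split.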


Access the [[https://colab.research.google.com/drive/1GSS_KlTVkrNNqGBi6MmZ6AMgHcIQDczq?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to keep your changes later: leaving the browser throws away all changes!

OR

download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.




=== Training data ===

1. R.U.R., a play by Karel Čapek (155 kB)
https://gutenberg.org/files/59112/59112-0.txt
1. Small text for fast setup: ''1984'' from Project Gutenberg Australia (590 kB)
https://gutenberg.net.au/ebooks01/0100021.txt
1. Shakespeare plays (1.1 MB)
https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
1. Any other data, any language (even programming languages)


=== Tasks ===


==== Task 1 ====


Generate text using a character-level neural language model.

Use several different hyper-parameters (embedding size, number of layers, number of epochs). Describe the quality of the generated text with regard to the selected parameters.
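A character-level model never sees characters directly, only integer indices. The following minimal sketch shows the vocabulary and the shifted (input, target) pairs a character dataset typically yields; the `block_size` value and all names here are illustrative assumptions, not the notebook's actual API:

```python
text = "hello world"

# Vocabulary: every distinct character gets an integer index
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

block_size = 4  # context length: the model sees 4 characters, predicts the next
# Training pairs (x, y): y is x shifted one character to the right
pairs = [(encode(text[i:i + block_size]), encode(text[i + 1:i + 1 + block_size]))
         for i in range(len(text) - block_size)]

print(decode(encode(text)) == text)  # round-trip check
```

Varying the embedding size and the number of layers changes the model's capacity; this encoding stays the same, so quality differences come purely from the network.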

==== Task 2 ====

Implement a new Dataset class that uses subwords (via BPE) instead of characters.
Compare the generated text with the text generated by the character-level model with the same number of parameters.
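Encoding with an already-learned BPE merge list is straightforward: split a word into characters and apply the merges in the order they were learned. A hedged sketch in pure Python (the merge list below is a made-up example; in practice, subword-nmt's apply stage does this for you):

```python
def bpe_encode(word, merges):
    """Apply a learned merge list to split a word into subword units."""
    symbols = list(word)
    for a, b in merges:  # merges must be applied in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# A toy merge list as BPE training might produce it (assumed for illustration)
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

With subword tokens, the vocabulary is larger than with characters, so at the same parameter budget, the embedding table takes a bigger share of the model; this is worth mentioning in the comparison.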


=== Upload ===
Upload your modified notebook or Python script with the results to the [[https://nlp.fi.muni.cz/en/NlpInPracticeCourse|homework vault (odevzdávárna)]].