| 1 | = Language modelling = |
| 2 | |
| 3 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák |
| 4 | |
| 5 | Prepared by: Pavel Rychlý |
| 6 | |
| 7 | == State of the Art == |
| 8 | |
| 9 | The goal of a language model is to assign a score (typically a probability) to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since WWII. More recently, deep learning has also entered language modelling, and neural models have turned out to be substantially better than Markov n-gram models. |
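To make the idea of scoring a sentence concrete, here is a minimal sketch of a bigram (2-gram) model with add-one smoothing. It is an illustration only, not the model used in the course notebook; the function names `train_bigram` and `log_score` and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        # <s> and </s> mark the sentence boundaries
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_score(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # P(cur | prev) with add-one (Laplace) smoothing
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total += math.log(prob)
    return total

# Toy corpus, for illustration only
corpus = ["the cat sat", "the dog sat", "the cat ran"]
uni, bi = train_bigram(corpus)
seen = log_score("the cat sat", uni, bi)
odd = log_score("sat the cat", uni, bi)
print(seen, odd)
```

A word order that occurs in the training data should receive a higher score than a scrambled one; the smoothing keeps the score finite for bigrams that were never observed.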
| 10 | |
| 11 | The current state-of-the-art models are built on neural networks using the transformer architecture. |
| 12 | |
| 13 | |
| 14 | === References === |
| 15 | 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]] |
| 16 | 1. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]] |
| 17 | 1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/ |
| 18 | 1. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/ |
| 19 | |
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | == Practical Session == |
| 25 | |
| 26 | === Technical Requirements === |
| 27 | |
| 28 | The task uses a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment. |
| 29 | |
| 30 | To run the code in a local environment instead, you need Python 3.6+ and Jupyter Notebook. |
| 31 | |
| 32 | |
| 33 | |
| 34 | === Language models from scratch === |
| 35 | |
| 36 | In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices. |
| 37 | |
| 38 | We then generate random text using these models and/or use a model in a real application: diacritics restoration. |
| 39 | |
| 40 | |
| 41 | Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser discards all unsaved changes! |
| 42 | |
| 43 | OR |
| 44 | |
| 45 | download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment. |
| 46 | |
| 47 | |
| 48 | |
| 49 | === Training data === |
| 50 | |
| 51 | 1. Small text for fast setup: the book *1984* from Project Gutenberg Australia |
| 52 | https://gutenberg.net.au/ebooks01/0100021.txt |
| 53 | 1. Samples from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB) |
| 54 | https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt |
| 55 | |
| 56 | |
| 57 | === Tasks === |
| 58 | |
| 59 | |
| 60 | Choose one of the following tasks. |
| 61 | |
| 62 | ==== Task 1 ==== |
| 63 | |
| 64 | Use an LM to implement a diacritics restoration function. |
| 65 | |
| 66 | |
| 67 | Write a function that takes text without diacritics as input and returns |
| 68 | the same text with diacritics added. For example: |
| 69 | {{{ |
| 70 | >>> add_dia('cerveny kriz') |
| 71 | 'červený kříž' |
| 72 | }}} |
| 73 | |
| 74 | You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service. |
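As a possible starting point, the following sketch restores diacritics with a word-level unigram model: for each stripped word form it picks the accented form seen most often in the training text. This is only a simple baseline under stated assumptions, not the solution expected by the task and not the model from the notebook. The names `strip_dia`, `train_restorer`, and `add_dia_baseline` are hypothetical, and the one-line training text stands in for a real corpus.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_dia(text):
    """Remove combining marks, e.g. 'kříž' -> 'kriz'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def train_restorer(text):
    """Map each stripped word to its most frequent accented form."""
    counts = defaultdict(Counter)
    for word in text.lower().split():
        counts[strip_dia(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}

def add_dia_baseline(text, table):
    """Restore diacritics word by word; unknown words are left unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

# Tiny training text, for illustration only
table = train_restorer("červený kříž a červený kříž")
restored = add_dia_baseline("cerveny kriz", table)
print(restored)
```

A unigram table ignores context, so it cannot choose between forms such as `je` and `jé` depending on the neighbouring words; scoring candidate sequences with a stronger language model is where the task begins.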
| 75 | |
| 76 | |
| 77 | ==== Task 2 ==== |
| 78 | |
| 79 | Generate text using a neural LM. |
| 80 | |
| 81 | Write a function that generates random text using a neural LM. The start of the text is an optional parameter. |
| 82 | The text can be printed or returned as the result of the function. |
| 83 | |
| 84 | The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model. |
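The sampling loop itself can be sketched independently of the model. The example below assumes a hypothetical callable `next_char_probs(context)` that returns a probability vector over a character vocabulary; in a real solution this would wrap the neural model trained in the notebook, whose interface is not shown here. The name `sample_text` and the uniform stand-in model are assumptions for illustration.

```python
import numpy as np

def sample_text(next_char_probs, vocab, start="", length=100, seed=None):
    """Generate text by repeatedly sampling the next character.

    next_char_probs(context) is a hypothetical callable returning one
    non-negative weight per entry of `vocab` for the given context string.
    """
    rng = np.random.default_rng(seed)
    text = start
    for _ in range(length):
        probs = np.asarray(next_char_probs(text), dtype=float)
        probs = probs / probs.sum()  # normalise so the weights sum to 1
        idx = rng.choice(len(vocab), p=probs)
        text += vocab[idx]
    return text

# Stand-in model for demonstration only: uniform over the vocabulary
vocab = list("ab ")

def uniform_model(context):
    return np.ones(len(vocab))

sample = sample_text(uniform_model, vocab, start="a", length=20, seed=0)
print(repr(sample))
```

Sampling from the predicted distribution, rather than always taking the most probable character, is what makes the generated text random; the optional `start` argument plays the role of the optional text prefix from the task.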
| 85 | |
| 86 | |
| 87 | |
| 88 | === Upload === |
| 89 | Upload your modified notebook or Python script, together with the results, to the [[https://nlp.fi.muni.cz/en/NlpInPracticeCourse|homework vault (odevzdávárna)]]. |
| 90 | |