Changes between Version 25 and Version 26 of private/NlpInPracticeCourse/LanguageModelling
Timestamp: Sep 23, 2022, 3:14:13 AM
=== References ===

 1. Vaswani, Ashish, et al. (2017). "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
 1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
 1. Alammar, Jay (2019). The Illustrated GPT-2 [Blog post]. Retrieved from https://jalammar.github.io/illustrated-gpt2/
 1. Brown, Tom, et al. (2020). "Language Models are Few-Shot Learners". [[https://arxiv.org/abs/2005.14165|arXiv:2005.14165]]
 1. Sennrich, Rico, et al. (2016). "Neural Machine Translation of Rare Words with Subword Units". In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, [[https://aclanthology.org/P16-1162|ACL 2016]]

…

=== Language models from scratch ===

In this workshop, we create language models for English and/or any other language from our own texts. The models use only [[https://www.fi.muni.cz/~pary/mingpt.zip|small Python modules]] built on the PyTorch framework.

We generate random text using these models. The first model works only with characters; the second uses subword tokenization with [[https://github.com/rsennrich/subword-nmt|BPE]].

Access the [[https://colab.research.google.com/drive/1GSS_KlTVkrNNqGBi6MmZ6AMgHcIQDczq?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to keep your changes: leaving the browser will throw away all changes!

OR

…

=== Training data ===

 1. R.U.R., a play by Karel Čapek (155 kB): https://gutenberg.org/files/59112/59112-0.txt
 1. Small text for fast setup: the *1984* book from Project Gutenberg (590 kB): https://gutenberg.net.au/ebooks01/0100021.txt
 1. Shakespeare plays (1.1 MB): https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
 1. Any other data, in any language (even programming languages)

…

==== Task 1 ====

Generate text using a character-level neural LM.
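As a rough sketch of the generation step (assuming a trained autoregressive `model` that returns next-character logits, and hypothetical `stoi`/`itos` character/index mappings; these names are illustrative, not the course modules' actual API):

{{{
import torch

@torch.no_grad()
def generate_chars(model, stoi, itos, prompt="The ", steps=200, temperature=1.0):
    """Sample text from a trained autoregressive character-level LM (sketch)."""
    model.eval()
    # Encode the prompt as a (1, seq_len) batch of character indices.
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(steps):
        logits = model(idx)                      # assumed shape: (1, seq_len, vocab_size)
        logits = logits[:, -1, :] / temperature  # keep only the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next character
        idx = torch.cat([idx, next_id], dim=1)
    return "".join(itos[int(i)] for i in idx[0])
}}}

Lower temperatures give more conservative samples; higher temperatures give more diverse (and noisier) text, which is worth keeping in mind when judging the output.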
Use several different hyper-parameters (embedding size, number of layers, number of epochs). Describe the quality of the generated text with regard to the selected parameters.

==== Task 2 ====

Implement a new Dataset class that uses subwords (via BPE) instead of characters.
Compare the generated text with the text generated by a character-level model with the same number of parameters.
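A minimal sketch of such a Dataset class, assuming the BPE merge operations were learned beforehand with subword-nmt (e.g. `subword-nmt learn-bpe -s 2000 < input.txt > codes.txt`); the class name, paths, and block size are illustrative assumptions:

{{{
import torch
from torch.utils.data import Dataset
from subword_nmt.apply_bpe import BPE

class BPEDataset(Dataset):
    """Fixed-length subword-id sequences for training an autoregressive LM (sketch)."""

    def __init__(self, text_path, codes_path, block_size=128):
        with open(codes_path, encoding="utf-8") as f:
            bpe = BPE(f)  # merge operations learned earlier with learn-bpe
        with open(text_path, encoding="utf-8") as f:
            # Segment each line into BPE subwords, e.g. "lower" -> "lo@@ wer".
            tokens = [t for line in f for t in bpe.process_line(line).split()]
        vocab = sorted(set(tokens))
        self.stoi = {t: i for i, t in enumerate(vocab)}
        self.itos = vocab
        self.data = [self.stoi[t] for t in tokens]
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        # x is a window of subword ids; y is the same window shifted by one.
        chunk = self.data[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y
}}}

Note that a subword vocabulary is much larger than a character vocabulary, so the embedding table grows; to keep the total number of parameters equal, you may need to shrink the other layers.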