Changes between Version 23 and Version 24 of private/NlpInPracticeCourse/LanguageModelling
- Timestamp: Sep 23, 2021, 5:41:17 PM
Legend:
- Unmodified
- Added
- Removed
- Modified
private/NlpInPracticeCourse/LanguageModelling
Unmodified:
The task will proceed using a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

Removed (v23):
In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter notebook;
the main module `huggingface/transformers` is installed at the beginning of the notebook.

Added (v24):
In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter notebook.

Added (v24):
=== Language models from scratch ===

In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices (a minimal bigram sketch appears at the end of this page).

We generate random text using these models and/or use the model in a real application of diacritics restoration.

Removed (v23):
=== BERT-like language model from scratch ===

In this workshop, we create a BERT-like language model for Czech from our own texts.
We investigate tokenization of such models and experiment with the ''fill mask'' task
for learning and evaluating neural language models.

Unmodified:
Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!

…

=== Training data ===

Removed (v23):
 1. Small text for fast setup: RUR from Project Gutenberg
    https://www.gutenberg.org/files/13083/13083-0.txt

Added (v24):
 1. Small text for fast setup: the *1984* book from Project Gutenberg
    https://gutenberg.net.au/ebooks01/0100021.txt

Unmodified:
 1. Sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB):
    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt

Removed (v23):
=== Task ===

Change the training data and tune the parameters (vocab size, training args, ...) to get a reasonable answer to simple ''fill mask'' questions, for example:
{{{
fill_mask("směrnice je určena členským <mask>")
}}}

Added (v24):
=== Tasks ===

Choose one of the following tasks.

==== Task 1 ====

Use an LM for a diacritics restoration function.

Write a function that takes text without diacritics as input and returns the same text with diacritics added. For example:
{{{
>>> add_dia('cerveny kriz')
'červený kříž'
}}}

You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service. (A sketch of one possible baseline appears at the end of this page.)

==== Task 2 ====

Generate text using a neural LM.

Write a function that generates random text using the neural LM; an optional parameter is the start of the text. The text can be printed or returned as the result of the function. The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model. (A sampling sketch appears at the end of this page.)

Unmodified:
=== Upload ===
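=== Example sketches ===

The sketches below only illustrate possible approaches to the material above; every function and variable name in them (`train_bigram_lm`, `generate`, `add_dia`, `predict_next`, ...) is an assumption made for illustration, not the notebook's actual API. First, a minimal character-bigram model built with nothing but NumPy, in the spirit of the "language models from scratch" section:

{{{
import numpy as np

def train_bigram_lm(text):
    """Return (character vocabulary, row-normalised bigram probability matrix)."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    counts = np.ones((len(vocab), len(vocab)))           # add-one smoothing
    for prev, nxt in zip(text, text[1:]):
        counts[index[prev], index[nxt]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)   # P(next char | previous char)
    return vocab, probs

def generate(vocab, probs, start="a", length=200):
    """Sample a continuation of `start`, one character at a time."""
    index = {ch: i for i, ch in enumerate(vocab)}
    out = list(start)
    for _ in range(length):
        row = probs[index[out[-1]]]                       # distribution after the last char
        out.append(np.random.choice(vocab, p=row))
    return "".join(out)

# Assuming the 1 MB Europarl sample was downloaded as ces-1m.txt:
# vocab, probs = train_bigram_lm(open("ces-1m.txt", encoding="utf-8").read())
# print(generate(vocab, probs, start="směrnice "))
}}}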
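For Task 1, one very simple (and deliberately weak) baseline is to restore diacritics greedily, character by character, scoring each candidate variant with the bigram model sketched above. The `VARIANTS` table is a simplified assumption, and unlike the `add_dia(text)` asked for in the task, this sketch passes the model in explicitly:

{{{
# ASCII letter -> possible Czech variants (simplified, not exhaustive)
VARIANTS = {
    'a': 'aá', 'c': 'cč', 'd': 'dď', 'e': 'eéě', 'i': 'ií', 'n': 'nň',
    'o': 'oó', 'r': 'rř', 's': 'sš', 't': 'tť', 'u': 'uúů', 'y': 'yý', 'z': 'zž',
}

def add_dia(text, vocab, probs):
    """Greedily pick, for each character, the variant the bigram model prefers."""
    index = {ch: i for i, ch in enumerate(vocab)}
    result = []
    for ch in text:
        candidates = VARIANTS.get(ch, ch)
        if result and result[-1] in index:
            row = probs[index[result[-1]]]
            ch = max(candidates, key=lambda c: row[index[c]] if c in index else 0.0)
        else:
            ch = candidates[0]
        result.append(ch)
    return "".join(result)

# add_dia('cerveny kriz', vocab, probs)   # with a strong enough model this should approach 'červený kříž'
}}}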
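For Task 2, the sampling loop itself does not change; only the next-character distribution has to come from the neural model. The `predict_next` method below is an assumed interface for whatever neural LM the notebook trains, not an existing function:

{{{
import numpy as np

def generate_text_neural(model, vocab, start="", length=100):
    """Sample text from a trained neural LM, optionally continuing `start`."""
    tokens = list(start)
    for _ in range(length):
        # assumed interface: a probability distribution over `vocab` given the context so far
        probs = np.asarray(model.predict_next(tokens), dtype=float)
        probs = probs / probs.sum()                   # guard against rounding drift
        tokens.append(np.random.choice(vocab, p=probs))
    return "".join(tokens)

# Assuming `trained_model` is the neural LM trained in the notebook:
# print(generate_text_neural(trained_model, vocab, start="Dnes "))
}}}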