Changes between Version 23 and Version 24 of private/NlpInPracticeCourse/LanguageModelling
- Timestamp: Sep 23, 2021, 5:41:17 PM
Legend:
- Unmodified
- Added
- Removed
- Modified
private/NlpInPracticeCourse/LanguageModelling
Unmodified:
The task will proceed using a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

Removed (v23):
In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter notebook;
the main module `huggingface/transformers` is installed at the beginning of the notebook.

Added (v24):
In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter notebook.

Added (v24):
=== Language models from scratch ===

In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices (a minimal bigram sketch appears at the end of this page).

We generate random text using these models and/or use the model in a real application of diacritics restoration.

Removed (v23):
=== BERT-like language model from scratch ===

In this workshop, we create a BERT-like language model for Czech from our own texts.
We investigate tokenization of such models and experiment with the ''fill mask'' task
for learning and evaluating neural language models.

Unmodified:
Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!

…

=== Training data ===

Removed (v23):
 1. Small text for fast setup: RUR from Project Gutenberg
    https://www.gutenberg.org/files/13083/13083-0.txt

Added (v24):
 1. Small text for fast setup: the *1984* book from Project Gutenberg
    https://gutenberg.net.au/ebooks01/0100021.txt

Unmodified:
 1. Sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB):
    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt

Removed (v23):
=== Task ===

Change the training data and tune the parameters (vocab size, training args, ...) to get a reasonable answer to simple ''fill mask'' questions, for example:
{{{
fill_mask("směrnice je určena členským <mask>")
}}}

Added (v24):
=== Tasks ===

Choose one of the following tasks.

==== Task 1 ====

Use an LM for a diacritics restoration function.

Write a function that takes text without diacritics as input and returns the same text with diacritics added. For example:
{{{
>>> add_dia('cerveny kriz')
'červený kříž'
}}}

You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service. (A sketch of one possible baseline appears at the end of this page.)

==== Task 2 ====

Generate text using a neural LM.

Write a function that generates random text using the neural LM; an optional parameter is the start of the text. The text can be printed or returned as the result of the function. The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model. (A sampling sketch appears at the end of this page.)

Unmodified:
=== Upload ===
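=== Example sketches ===

The sketches below only illustrate possible approaches to the material above; every function and variable name in them (`train_bigram_lm`, `generate`, `add_dia`, `predict_next`, ...) is an assumption made for illustration, not the notebook's actual API. First, a minimal character-bigram model built with nothing but NumPy, in the spirit of the "language models from scratch" section:

{{{
import numpy as np

def train_bigram_lm(text):
    """Return (character vocabulary, row-normalised bigram probability matrix)."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    counts = np.ones((len(vocab), len(vocab)))           # add-one smoothing
    for prev, nxt in zip(text, text[1:]):
        counts[index[prev], index[nxt]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)   # P(next char | previous char)
    return vocab, probs

def generate(vocab, probs, start="a", length=200):
    """Sample a continuation of `start`, one character at a time."""
    index = {ch: i for i, ch in enumerate(vocab)}
    out = list(start)
    for _ in range(length):
        row = probs[index[out[-1]]]                       # distribution after the last char
        out.append(np.random.choice(vocab, p=row))
    return "".join(out)

# Assuming the 1 MB Europarl sample was downloaded as ces-1m.txt:
# vocab, probs = train_bigram_lm(open("ces-1m.txt", encoding="utf-8").read())
# print(generate(vocab, probs, start="směrnice "))
}}}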
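For Task 1, one very simple (and deliberately weak) baseline is to restore diacritics greedily, character by character, scoring each candidate variant with the bigram model sketched above. The `VARIANTS` table is a simplified assumption, and unlike the `add_dia(text)` asked for in the task, this sketch passes the model in explicitly:

{{{
# ASCII letter -> possible Czech variants (simplified, not exhaustive)
VARIANTS = {
    'a': 'aá', 'c': 'cč', 'd': 'dď', 'e': 'eéě', 'i': 'ií', 'n': 'nň',
    'o': 'oó', 'r': 'rř', 's': 'sš', 't': 'tť', 'u': 'uúů', 'y': 'yý', 'z': 'zž',
}

def add_dia(text, vocab, probs):
    """Greedily pick, for each character, the variant the bigram model prefers."""
    index = {ch: i for i, ch in enumerate(vocab)}
    result = []
    for ch in text:
        candidates = VARIANTS.get(ch, ch)
        if result and result[-1] in index:
            row = probs[index[result[-1]]]
            ch = max(candidates, key=lambda c: row[index[c]] if c in index else 0.0)
        else:
            ch = candidates[0]
        result.append(ch)
    return "".join(result)

# add_dia('cerveny kriz', vocab, probs)   # with a strong enough model this should approach 'červený kříž'
}}}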
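For Task 2, the sampling loop itself does not change; only the next-character distribution has to come from the neural model. The `predict_next` method below is an assumed interface for whatever neural LM the notebook trains, not an existing function:

{{{
import numpy as np

def generate_text_neural(model, vocab, start="", length=100):
    """Sample text from a trained neural LM, optionally continuing `start`."""
    tokens = list(start)
    for _ in range(length):
        # assumed interface: a probability distribution over `vocab` given the context so far
        probs = np.asarray(model.predict_next(tokens), dtype=float)
        probs = probs / probs.sum()                   # guard against rounding drift
        tokens.append(np.random.choice(vocab, p=probs))
    return "".join(tokens)

# Assuming `trained_model` is the neural LM trained in the notebook:
# print(generate_text_neural(trained_model, vocab, start="Dnes "))
}}}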