Changes between Version 23 and Version 24 of private/AdvancedNlpCourse/LanguageModelling


Timestamp: Sep 23, 2021, 5:41:17 PM
Author: pary

  • private/AdvancedNlpCourse/LanguageModelling

The task will proceed using a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

- In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook; the main module `huggingface/transformers` is installed at the beginning of the notebook.
+ In case of running the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook.

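For the local setup, the installation of the dependencies mentioned above might look like this (package names are assumptions based on the requirements listed, not an official install script):

```shell
# Python 3.6+ is assumed to be installed already.
# jupyter runs the notebook; numpy is used by the from-scratch models below.
pip install jupyter numpy
```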
+ === Language models from scratch ===

+ In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices.

- === BERT-like language model from scratch ===
+ We generate random text using these models and/or use the model in a real application of diacritics restoration.

- In this workshop, we create a BERT-like language model for Czech from our own texts. We investigate the tokenization of such models and experiment with the ''fill mask'' task for learning and evaluating neural language models.

Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
     
=== Training data ===

- 1. Small text for fast setup: RUR from Project Gutenberg
-    https://www.gutenberg.org/files/13083/13083-0.txt
+ 1. Small text for fast setup: the *1984 book* from Project Gutenberg
+    https://gutenberg.net.au/ebooks01/0100021.txt
1. A sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
   https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt

- === Task ===
+ === Tasks ===

- Change the training data and tune the parameters (vocab size, training args, ...) to get a reasonable answer to simple ''fill mask'' questions, for example:
+ Choose one of the following tasks.

+ ==== Task 1 ====

+ Use an LM for a diacritics restoration function.

+ Write a function that takes text without diacritics as input and returns the same text with diacritics added. For example:
{{{
- fill_mask("směrnice je určena členským <mask>")
+ >>> add_dia('cerveny kriz')
+ 'červený kříž'
}}}

+ You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service.
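A minimal sketch of how such an `add_dia` function could work. Here a toy frequency lexicon stands in for the trained language model; the `LEXICON`, the `VARIANTS` table, and the word-by-word scoring are illustrative assumptions, not the notebook's implementation, where the score would come from the LM itself:

```python
from itertools import product

# Hypothetical toy frequency lexicon standing in for the language model;
# in the workshop, candidate scores would come from the trained LM instead.
LEXICON = {"červený": 10, "kříž": 8}

# Possible diacritic variants for each ASCII letter (subset of the Czech
# alphabet); the plain letter is listed first so it wins ties below.
VARIANTS = {
    "a": "aá", "c": "cč", "d": "dď", "e": "eéě", "i": "ií",
    "n": "nň", "o": "oó", "r": "rř", "s": "sš", "t": "tť",
    "u": "uúů", "y": "yý", "z": "zž",
}

def add_dia(text):
    """Restore diacritics word by word, picking the highest-scoring variant.

    Lowercase input is assumed; unknown words are returned unchanged,
    because the all-plain candidate is generated first and max() keeps
    the first maximum on ties.
    """
    out = []
    for word in text.split():
        options = [VARIANTS.get(ch, ch) for ch in word]
        candidates = ["".join(c) for c in product(*options)]
        out.append(max(candidates, key=lambda w: LEXICON.get(w, 0)))
    return " ".join(out)

print(add_dia("cerveny kriz"))  # -> červený kříž
```

Enumerating all variants with `itertools.product` is only feasible for short words; a real solution would prune candidates or score positions with the LM directly.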

+ ==== Task 2 ====

+ Generate text using a neural LM.

+ Write a function to generate random text using a neural LM. An optional parameter is the start of the text. The text can either be printed or returned as the result of the function.

+ The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model.
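The sampling loop itself can be sketched independently of the model. In this sketch, a hypothetical `next_char_probs` table stands in for the neural LM's output distribution; in the actual task, it would be replaced by the network's softmax over the vocabulary:

```python
import random

def next_char_probs(prev):
    """Stand-in for the neural LM: a toy distribution over the next
    character given the previous one (an assumption for illustration;
    the real function is a forward pass through the trained network)."""
    table = {
        "a": {"b": 0.6, "a": 0.4},
        "b": {"a": 0.5, " ": 0.5},
        " ": {"a": 1.0},
    }
    return table.get(prev, {"a": 1.0})

def generate_text(length, start="a", seed=0):
    """Sample text character by character from the (stand-in) LM,
    optionally continuing from a given start string."""
    rng = random.Random(seed)  # seeded for reproducible output
    text = start
    while len(text) < length:
        probs = next_char_probs(text[-1])
        chars, weights = zip(*probs.items())
        text += rng.choices(chars, weights=weights)[0]
    return text

print(generate_text(30, start="ab"))
```

The same loop works for a word-level model if `next_char_probs` is swapped for a function returning a distribution over words.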

=== Upload ===