Changes between Version 25 and Version 26 of private/NlpInPracticeCourse/LanguageModelling


Timestamp: Sep 23, 2022, 3:14:13 AM
Author: pary

Legend:
  unchanged lines have no prefix
  + added lines
  - removed lines
  … skipped unchanged lines
  • private/NlpInPracticeCourse/LanguageModelling (v25 → v26)

=== References ===
- 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]]
  1. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
  1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
- 1. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/
+ 1. Alammar, Jay (2018). The Illustrated GPT-2 [Blog post]. Retrieved from https://jalammar.github.io/illustrated-gpt2/
+ 1. Brown, Tom, et al. (2020). "Language Models are Few-Shot Learners". [[https://arxiv.org/abs/2005.14165|arXiv:2005.14165]]
+ 1. Sennrich, Rico, et al. (2016). "Neural Machine Translation of Rare Words with Subword Units". In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, [[https://aclanthology.org/P16-1162|ACL 2016]]


…
=== Language models from scratch ===

- In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices.
- 
- We generate random text using these models and/or use the model in a real application of diacritics restoration.
+ In this workshop, we create language models for English and/or any other language from our own texts. The models use only [[https://www.fi.muni.cz/~pary/mingpt.zip|small Python modules]] with the PyTorch framework.


- Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
+ We generate random text using these models. The first model is based only on characters; the second uses subword tokenization with [[https://github.com/rsennrich/subword-nmt|BPE]].
+ 
+ 
+ Access the [[https://colab.research.google.com/drive/1GSS_KlTVkrNNqGBi6MmZ6AMgHcIQDczq?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
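
To make "generate random text" concrete, here is a minimal character-level sampling sketch in PyTorch. It is not taken from the course modules: the model interface (token indices in, per-position logits out) and the `stoi`/`itos` vocabulary mappings are assumptions about what the notebook provides.
{{{
import torch
import torch.nn.functional as F

def generate(model, stoi, itos, prompt="The ", max_new_tokens=200, temperature=1.0):
    """Sample text from a trained character-level LM, one character at a time."""
    model.eval()
    # Encode the prompt as a (1, T) tensor of character indices.
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Assumed interface: model(idx) -> logits of shape (1, T, vocab).
            # A real model may also require cropping idx to its context size.
            logits = model(idx)[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)  # sample one char
            idx = torch.cat([idx, next_id], dim=1)
    return "".join(itos[i] for i in idx[0].tolist())
}}}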

OR
…
=== Training data ===

- 1. Small text for fast setup: *1984 book* from Project Gutenberg
+ 1. R.U.R., a play by Karel Čapek (155 kB)
+    https://gutenberg.org/files/59112/59112-0.txt
+ 1. Small text for fast setup: *1984 book* from Project Gutenberg (590 kB)
     https://gutenberg.net.au/ebooks01/0100021.txt
- 1. Sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
-    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
+ 1. Shakespeare plays (1.1 MB)
+    https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
+ 1. Any other data, any language (even programming languages)


…


- Choose one of the following tasks.
- 
==== Task 1 ====

- Use an LM for a diacritics restoration function.

+ Generate text using a character-level neural LM.

- Write a function that takes text without diacritics as input and returns the same text
- with added diacritics. For example:
- {{{
- >>> add_dia('cerveny kriz')
- 'červený kříž'
- }}}
- 
- You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service.
- 
+ Use several different hyper-parameters (embedding size, number of layers, number of epochs). Describe the quality of the generated text with regard to the selected parameters.
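
One possible way to organize these experiments is a simple grid over the named hyper-parameters; `train_model` and `sample` below are hypothetical stand-ins for whatever the notebook actually provides, not functions from the course modules.
{{{
# Hypothetical helpers: train_model() fits a char-level LM with the given
# hyper-parameters; sample() draws random text from it.
for emb_size in (64, 128, 256):
    for n_layers in (2, 4, 6):
        for n_epochs in (1, 5):
            model = train_model(emb_size=emb_size, n_layers=n_layers, epochs=n_epochs)
            print(f"--- emb={emb_size} layers={n_layers} epochs={n_epochs} ---")
            print(sample(model, prompt="The ", max_new_tokens=200))
}}}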

==== Task 2 ====

- Generate text using a neural LM.
- 
- Write a function to generate random text using a neural LM. An optional parameter is the start of the text.
- The text can be printed or returned as the result of the function.
- 
- The function could work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model.
+ Implement a new Dataset class that uses subwords (via BPE) instead of characters.
+ Compare the generated text with text generated by the character-level model with the same number of parameters.
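
A minimal sketch of such a Dataset, assuming the `subword_nmt` package with BPE codes already learned by its `learn-bpe` tool; the class name, the block layout, and the vocabulary handling are illustrative choices, not prescribed by the course modules.
{{{
import torch
from torch.utils.data import Dataset
from subword_nmt.apply_bpe import BPE

class BPEDataset(Dataset):
    """Fixed-length blocks of BPE subword ids for next-token prediction."""

    def __init__(self, text, codes_path, block_size):
        with open(codes_path, encoding="utf-8") as f:
            bpe = BPE(f)                      # codes file from subword-nmt learn-bpe
        tokens = []
        for line in text.splitlines():        # segment() expects one line at a time
            tokens.extend(bpe.segment(line).split())
        vocab = sorted(set(tokens))
        self.stoi = {t: i for i, t in enumerate(vocab)}
        self.itos = vocab
        self.data = [self.stoi[t] for t in tokens]
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        chunk = self.data[i : i + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input subword ids
        y = torch.tensor(chunk[1:], dtype=torch.long)   # next-subword targets
        return x, y
}}}
The resulting `stoi`/`itos` mappings then replace the character vocabulary of the first model, so the two models can be built with the same number of parameters.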