Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2020/LanguageModelling


Timestamp: Aug 31, 2021, 2:11:49 PM
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/LanguageModelling

  • en/AdvancedNlpCourse2020/LanguageModelling

= Language modelling =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Pavel Rychlý

== State of the Art ==

The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since World War II. Recently, however, deep learning has made its way into language modelling as well, and it has turned out to be substantially better than Markov n-gram models.

The current state-of-the-art models are built on neural networks using transformers.
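
To illustrate what such scoring looks like in practice, a pretrained masked language model can be asked how probable different words are in a given context. The snippet below is only a sketch using the `transformers` library and the publicly available `bert-base-uncased` checkpoint, which is not part of the course materials:

{{{
from transformers import pipeline

# Load a pretrained masked language model
# (bert-base-uncased is only an illustrative choice, not prescribed by the course).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model scores candidate words for the masked position.
for prediction in fill_mask("The directive is addressed to the member [MASK]."):
    print(prediction["token_str"], prediction["score"])
}}}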

=== References ===
 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]]
 1. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
 1. Alammar, Jay (2018). "The Illustrated Transformer" [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
 1. Alammar, Jay (2018). "The Illustrated BERT, ELMo, and co." [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/

== Practical Session ==

=== Technical Requirements ===

The task will be carried out in a Python notebook running in a web browser, in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

If you run the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook; the main module `huggingface/transformers` is installed at the beginning of the notebook.
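
A sketch of the corresponding setup cell follows (in Colab the module is installed with a shell command inside the notebook; the extra `tokenizers` package is an assumption used by the sketches further below):

{{{
# First notebook cell: install the required modules (Colab / Jupyter).
!pip install transformers tokenizers
}}}

For a local run, the same `pip install` command can be executed in a terminal instead.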

=== BERT-like language model from scratch ===

In this workshop, we create a BERT-like language model for Czech from our own texts.
We investigate the tokenization of such models and experiment with the ''fill mask'' task
for training and evaluating neural language models.
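
The overall shape of such a pipeline is sketched below: train a subword tokenizer on the raw text, build a small RoBERTa/BERT-style model from scratch, and train it with the masked-language-modelling objective. This is only an outline with assumed file names, output directory (`czech-bert`) and hyper-parameters; the notebook linked below is the authoritative version.

{{{
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          LineByLineTextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# 1) Train a byte-level BPE tokenizer on the raw training text
#    ("ces-1m.txt" and the vocabulary size are assumptions, see Training data and Task below).
os.makedirs("czech-bert", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(files=["ces-1m.txt"], vocab_size=30000,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("czech-bert")

# 2) Build a small RoBERTa-style masked language model from scratch.
tokenizer = RobertaTokenizerFast.from_pretrained("czech-bert", model_max_length=512)
config = RobertaConfig(vocab_size=30000, max_position_embeddings=514,
                       num_hidden_layers=6, num_attention_heads=6, type_vocab_size=1)
model = RobertaForMaskedLM(config=config)

# 3) Train with the masked-language-modelling (fill mask) objective.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="ces-1m.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="czech-bert", num_train_epochs=1,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()
model.save_pretrained("czech-bert")
}}}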

Access the [[https://colab.research.google.com/drive/1f0fMlud37ybxDdW1RNo8ZkfQ-rJoSkHv?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to keep your changes: leaving the browser will throw away all changes!

OR

download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.

=== Training data ===

1. Small text for fast setup: RUR from Project Gutenberg
    https://www.gutenberg.org/files/13083/13083-0.txt
1. Samples from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
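
The data can be fetched directly from the notebook; a minimal sketch (the local file names are only illustrative):

{{{
import urllib.request

# Czech Europarl sample, 1 MB variant (use the 10 MB / 150 MB URLs for larger experiments).
urllib.request.urlretrieve("https://corpora.fi.muni.cz/ces-1m.txt", "ces-1m.txt")

# Small text for a quick setup: RUR from Project Gutenberg.
urllib.request.urlretrieve("https://www.gutenberg.org/files/13083/13083-0.txt", "rur.txt")
}}}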

=== Task ===

Change the training data and tune the parameters (vocabulary size, training arguments, ...) to get
reasonable answers to simple ''fill mask'' questions, for example:
{{{
fill_mask("směrnice je určena členským <mask>")
}}}
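
The `fill_mask` callable used above can be obtained from the freshly trained model; a minimal sketch, assuming the model and tokenizer were saved to the hypothetical `czech-bert` directory from the training sketch:

{{{
from transformers import pipeline

# Build a fill-mask pipeline from the trained model and its tokenizer.
fill_mask = pipeline("fill-mask", model="czech-bert", tokenizer="czech-bert")

# Each prediction carries the proposed token and its probability score.
for prediction in fill_mask("směrnice je určena členským <mask>"):
    print(prediction["token_str"], prediction["score"])
}}}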

=== Upload ===
Upload your modified notebook or Python script with the results to the [[https://nlp.fi.muni.cz/en/AdvancedNlpCourse|homework vault (odevzdávárna)]].
     69