Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2021/LanguageModelling


Timestamp:
Aug 30, 2022, 10:39:35 AM (20 months ago)
Author:
Ales Horak
Comment:

copied from private/NlpInPracticeCourse/LanguageModelling

     1= Language modelling =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák
     4
     5Prepared by: Pavel Rychlý
     6
     7== State of the Art ==
     8
     9The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, which have been known since WWII. More recently, deep learning has also entered language modelling and has turned out to be substantially better than Markov n-gram models.
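To make the idea of "assigning a score to a sentence" concrete, here is a minimal sketch of a word bigram model with add-one smoothing. It is only an illustration: the function names `train_bigram` and `score`, the toy corpus, and the choice of smoothing are assumptions made for this example and are not taken from the course notebook.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def score(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        logp += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab)))
    return logp

# Toy corpus, purely illustrative.
toy_model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
```

With this toy model, `score("the cat sat", toy_model)` is higher than `score("sat cat the", toy_model)`, because the first sentence consists of bigrams seen in training.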
     10
     11The current state-of-the-art models are built on neural networks using transformers.
     12
     13
     14=== References ===
     15 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [[https://arxiv.org/abs/1810.04805v2|arXiv:1810.04805v2]]
     16 1. Vaswani, Ashish, et al. "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
     17 1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
     18 1. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/
     19
     20
     21
     22
     23
     24== Practical Session ==
     25
     26=== Technical Requirements ===
     27
     28The task uses a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.
     29
     30To run the code in a local environment, you need Python 3.6+ and Jupyter Notebook.
     31
     32
     33
     34=== Language models from scratch ===
     35
     36In this workshop, we create two language models for English and/or Czech from our own texts. The models do not use any framework or complex library, only NumPy to work with vectors and matrices.
     37
     38We generate random text using these models and/or use the model in a real application of diacritics restoration.
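As a rough illustration of what "from scratch with NumPy" can look like, the sketch below trains a character bigram model as a matrix of smoothed counts and samples random text from it. The names `train_char_bigram` and `sample_text` and the add-one smoothing are assumptions for this example; the models in the notebook may be built differently.

```python
import numpy as np

def train_char_bigram(text):
    """Estimate P(next char | current char) as a row-normalized count matrix."""
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    counts = np.ones((len(chars), len(chars)))   # start at 1: add-one smoothing
    for a, b in zip(text, text[1:]):
        counts[index[a], index[b]] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return chars, index, probs

def sample_text(chars, index, probs, start, length, seed=0):
    """Sample `length` further characters, each conditioned on the previous one."""
    rng = np.random.default_rng(seed)
    out = [start]
    for _ in range(length):
        row = probs[index[out[-1]]]
        out.append(chars[rng.choice(len(chars), p=row)])
    return "".join(out)

chars, index, probs = train_char_bigram("hello world")
sample = sample_text(chars, index, probs, start="h", length=20)
```

On a text as short as this the output is close to noise; the same code becomes more interesting when trained on one of the larger training files.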
     39
     40
     41Access the [[https://colab.research.google.com/drive/1Xhf6i-G3B4nnhn2eSNlg0QcCdLOQwjqH?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser discards all unsaved changes!
     42
     43OR
     44
     45download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.
     46
     47
     48
     49=== Training data ===
     50
     511. Small text for fast setup: the *1984* book from Project Gutenberg Australia
     52    https://gutenberg.net.au/ebooks01/0100021.txt
     531. Samples from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)
     54    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
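One possible way to load a training file and split it into tokens is sketched below. The helper names `fetch_text` and `tokenize` are hypothetical, and the UTF-8 default is an assumption; check the actual encoding of the file you download.

```python
import re
import urllib.request

def fetch_text(url, encoding="utf-8"):
    """Download a text file; the encoding is an assumption, adjust it if needed."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")

def tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())
```

For example, `tokenize(fetch_text("https://corpora.fi.muni.cz/ces-1m.txt"))` would give a token list for the smallest Czech sample, provided the file is reachable from your environment.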
     55
     56
     57=== Tasks ===
     58
     59
     60Choose one of the following tasks.
     61
     62==== Task 1 ====
     63
     64Use an LM to implement a diacritics restoration function.
     65
     66
     67Write a function that takes text without diacritics as input and returns the same text
     68with diacritics added. For example:
     69{{{
     70>>> add_dia('cerveny kriz')
     71'červený kříž'
     72}}}
     73
     74You can compare your results with the [[https://nlp.fi.muni.cz/cz_accent/|czaccent]] service.
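As a starting point, here is a minimal baseline sketch that ignores context entirely: it maps each diacritics-free word to its most frequent accented form seen in training text. The names `strip_dia`, `build_table` and `add_dia_unigram` are hypothetical, and the function takes the lookup table as an extra argument, unlike the `add_dia` signature in the example. A full solution would use the language model to choose among candidate forms in context.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_dia(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_table(corpus_text):
    """Map each stripped word to its most frequent accented form in the corpus."""
    forms = defaultdict(Counter)
    for word in corpus_text.lower().split():
        forms[strip_dia(word)][word] += 1
    return {plain: seen.most_common(1)[0][0] for plain, seen in forms.items()}

def add_dia_unigram(text, table):
    """Restore diacritics word by word; unknown words are left unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

# Toy training text, purely illustrative.
toy_table = build_table("červený kříž a červený kříž")
```

This baseline cannot distinguish words whose stripped forms coincide, which is exactly where scoring whole candidate sentences with a language model helps.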
     75
     76
     77==== Task 2 ====
     78
     79Generate text using a neural LM.
     80
     81Write a function that generates random text using a neural LM. The start of the text is an optional parameter.
     82The text can be printed or returned as the result of the function.
     83
     84The function can work in a similar way to `generate_text` from the notebook, but it has to use the *neural* language model.
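The sampling loop for such a function could look roughly like the sketch below. It assumes the neural model is available as a callable that returns one logit per vocabulary item for a given context; the names `generate_neural`, `next_logits`, `toy_vocab` and `toy_logits` are hypothetical, and `toy_logits` is only a stand-in for a trained network, not the model from the notebook.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate_neural(next_logits, vocab, start=(), length=20, temperature=1.0, seed=0):
    """Sample `length` tokens one at a time, feeding the growing context back in."""
    rng = np.random.default_rng(seed)
    tokens = list(start)
    for _ in range(length):
        probs = softmax(next_logits(tokens), temperature)
        tokens.append(vocab[rng.choice(len(vocab), p=probs)])
    return tokens

# Stand-in for a trained network: it simply favours repeating the last token.
toy_vocab = ["a", "b", "c"]

def toy_logits(tokens):
    logits = np.zeros(len(toy_vocab))
    if tokens:
        logits[toy_vocab.index(tokens[-1])] = 2.0
    return logits

generated = generate_neural(toy_logits, toy_vocab, start=["a"], length=10)
```

To use it with a real model, replace `toy_logits` with a function that runs the network's forward pass on the context and returns its output scores.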
     85
     86
     87
     88=== Upload ===
     89Upload your modified notebook or Python script with results to the [[https://nlp.fi.muni.cz/en/NlpInPracticeCourse|homework vault (odevzdávárna)]].
     90