Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2022/LanguageModelling


Timestamp: Sep 13, 2023, 2:44:39 PM
Author: Ales Horak
Comment: copied from private/NlpInPracticeCourse/LanguageModelling

= Language modelling =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Pavel Rychlý

== State of the Art ==

The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since World War II. Recently, however, deep learning has entered language modelling as well, and it has turned out to be substantially better than Markov n-gram models.
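
As a toy illustration of the scoring idea (the three-sentence corpus and the `score` helper below are made up for this example), a word-bigram model with add-one smoothing gives a plausible word order a higher log-score than an implausible one:

{{{#!python
# Toy word-bigram language model: score = sum of log P(w_i | w_{i-1})
from collections import Counter
import math

corpus = "the cat sat . the dog sat . the cat ran .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(sentence):
    words = sentence.split()
    # add-one smoothing over the unigram vocabulary
    return sum(math.log((bigrams[(a, b)] + 1) /
                        (unigrams[a] + len(unigrams)))
               for a, b in zip(words, words[1:]))

print(score("the cat sat ."))   # higher (less negative) ...
print(score("sat the . cat"))   # ... than an implausible order
}}}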

The current state-of-the-art models are built on neural networks using transformers.


=== References ===
 1. Vaswani, Ashish, et al. (2017) "Attention Is All You Need". [[https://arxiv.org/abs/1706.03762|arXiv:1706.03762]]
 1. Alammar, Jay (2018). "The Illustrated Transformer" [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
 1. Alammar, Jay (2019). "The Illustrated GPT-2" [Blog post]. Retrieved from https://jalammar.github.io/illustrated-gpt2/
 1. Brown, Tom, et al. (2020) "Language Models are Few-Shot Learners". [[https://arxiv.org/abs/2005.14165|arXiv:2005.14165]]
 1. Sennrich, Rico, et al. (2016) "Neural Machine Translation of Rare Words with Subword Units". In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, [[https://aclanthology.org/P16-1162|ACL 2016]]

== Practical Session ==

=== Technical Requirements ===

The task will be carried out in a Python notebook running in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.

If you run the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook.

=== Language models from scratch ===

In this workshop, we create language models for English and/or any other language from our own texts. The models use only [[https://www.fi.muni.cz/~pary/mingpt.zip|small Python modules]] with the PyTorch framework.

We generate random text using these models. The first model is based only on characters; the later one uses subword tokenization with [[https://github.com/rsennrich/subword-nmt|BPE]].
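
A minimal self-contained sketch of the character-level idea, assuming a local `input.txt` with training text: plain bigram counts stand in for the workshop's neural model, just to show how sampling one character at a time yields random text.

{{{#!python
# Toy character-level "model": bigram counts over input.txt
# (the workshop replaces this count table with a neural network).
import torch

text = open('input.txt').read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}

counts = torch.ones(len(chars), len(chars))      # add-one smoothing
for a, b in zip(text, text[1:]):
    counts[stoi[a], stoi[b]] += 1
probs = counts / counts.sum(dim=1, keepdim=True)

# sample 200 characters, each conditioned on the previous one
idx = stoi[text[0]]
out = []
for _ in range(200):
    idx = torch.multinomial(probs[idx], num_samples=1).item()
    out.append(chars[idx])
print(''.join(out))
}}}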

Access the [[https://colab.research.google.com/drive/1GSS_KlTVkrNNqGBi6MmZ6AMgHcIQDczq?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!

OR

download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.

=== Training data ===

 1. R.U.R., a play by Karel Čapek (155 kB)
    https://gutenberg.org/files/59112/59112-0.txt
 1. Small text for fast setup: the book ''1984'' from Project Gutenberg Australia (590 kB)
    https://gutenberg.net.au/ebooks01/0100021.txt
 1. Shakespeare plays (1.1 MB)
    https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
 1. Any other data, in any language (even programming languages)

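Any of the texts above can be fetched into a local `input.txt`, e.g. as follows (shown here with the Shakespeare URL; the others work the same way):

{{{#!python
# Download one of the training texts listed above into input.txt.
import urllib.request

url = ('https://raw.githubusercontent.com/karpathy/char-rnn/'
       'master/data/tinyshakespeare/input.txt')
urllib.request.urlretrieve(url, 'input.txt')
}}}
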
=== Tasks ===

==== Task 1 ====

Generate text using a character-level neural LM.

Use several different hyper-parameters (embedding size, number of layers, number of epochs). Describe the quality of the generated text with regard to the selected parameters. One possible shape of the experiment is sketched below.
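
A sketch of the sweep pattern, with a small self-contained LSTM standing in for the notebook's transformer model (all names below are made up for the illustration; use the notebook's own model and training loop instead):

{{{#!python
# Hyper-parameter sweep for Task 1: train a tiny character-level LM
# for each setting and note the final loss / generated-text quality.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = open('input.txt').read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class CharLM(nn.Module):
    def __init__(self, vocab_size, emb_size, n_layers):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, emb_size, n_layers, batch_first=True)
        self.head = nn.Linear(emb_size, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

block = 64
for emb_size in (32, 128):            # embedding size
    for n_layers in (1, 2):           # number of layers
        model = CharLM(len(chars), emb_size, n_layers)
        opt = torch.optim.Adam(model.parameters(), lr=3e-3)
        for step in range(200):       # stands in for number of epochs
            i = torch.randint(len(data) - block - 1, (1,)).item()
            x = data[i:i + block].unsqueeze(0)
            y = data[i + 1:i + block + 1].unsqueeze(0)
            loss = F.cross_entropy(model(x).transpose(1, 2), y)
            opt.zero_grad(); loss.backward(); opt.step()
        print(f'emb={emb_size} layers={n_layers} loss={loss.item():.3f}')
}}}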

==== Task 2 ====

Implement a new Dataset class to use subwords (via BPE) instead of characters.
Compare the generated text with the text generated by the character-level model with the same number of parameters. A possible starting point for the Dataset class is sketched below.
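
A sketch assuming the [[https://github.com/rsennrich/subword-nmt|subword-nmt]] package (`pip install subword-nmt`); the class name and details are illustrative, not the required solution, and the interface should be adapted to whatever the notebook's training loop expects:

{{{#!python
# Task 2 sketch: learn BPE merges on the training text, then build
# a next-token dataset over subwords instead of characters.
import torch
from torch.utils.data import Dataset
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# learn 1000 merge operations from the training text
with open('input.txt') as infile, open('codes.bpe', 'w') as outfile:
    learn_bpe(infile, outfile, 1000)
with open('codes.bpe') as codes:
    bpe = BPE(codes)

class SubwordDataset(Dataset):
    def __init__(self, text, block_size):
        tokens = []
        for line in text.splitlines():
            tokens.extend(bpe.process_line(line).split())
        vocab = sorted(set(tokens))
        self.stoi = {s: i for i, s in enumerate(vocab)}
        self.itos = {i: s for s, i in self.stoi.items()}
        self.vocab_size = len(vocab)
        self.block_size = block_size
        self.data = [self.stoi[t] for t in tokens]

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        return (torch.tensor(chunk[:-1], dtype=torch.long),
                torch.tensor(chunk[1:], dtype=torch.long))

ds = SubwordDataset(open('input.txt').read(), block_size=32)
}}}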
=== Upload ===
Upload your modified notebook or Python script with results to the [[https://nlp.fi.muni.cz/en/NlpInPracticeCourse|homework vault (odevzdávárna)]].
     81