
Language modelling

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Pavel Rychlý

State of the Art

The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram (Markov) models, which have been in use since the mid-20th century. Recently, however, deep learning has entered language modelling as well, and neural models have turned out to be substantially better than n-gram models.
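
For instance, an n-gram model scores a sentence with the Markov approximation

  $P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$

where each conditional probability is estimated from n-gram counts in a training corpus; a neural language model replaces these count-based estimates with a learned network.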

The current state-of-the-art models are built on neural networks using transformers.

References

  1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2
  2. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia. "Attention Is All You Need". arXiv:1706.03762
  3. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
  4. Alammar, Jay (2018). The Illustrated BERT, ELMo, and co. [Blog post]. Retrieved from https://jalammar.github.io/illustrated-bert/

Practical Session

Technical Requirements

The task will be carried out in a Python notebook running in a web browser in the Google Colaboratory environment.

If you run the code in a local environment instead, the requirements are Python 3.6+ and Jupyter Notebook; the main module, huggingface/transformers, is installed at the beginning of the notebook.
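
In a notebook, this typically means a first cell along the following lines (a sketch only; the exact packages and versions used in the course notebook may differ):

  # install the libraries used in the notebook (package list is illustrative)
  !pip install transformers tokenizers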

BERT-like language model from scratch

In this workshop, we create a BERT-like language model for Czech from our own texts. We investigate the tokenization used by such models and experiment with the fill-mask task for training and evaluating neural language models.
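
The sketch below shows roughly what such a pipeline looks like with huggingface/transformers: train a subword tokenizer on raw text, build a small RoBERTa-style model with a matching vocabulary, and train it with the masked-language-modelling objective. It is only an outline under assumed settings (file names, model size and training arguments are illustrative), not the course notebook itself.

  # a rough sketch: BERT-like (RoBERTa-style) masked LM trained from scratch
  import os
  from tokenizers import ByteLevelBPETokenizer
  from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                            LineByLineTextDataset, DataCollatorForLanguageModeling,
                            Trainer, TrainingArguments)

  # 1. train a subword tokenizer on the raw training text
  os.makedirs("czech-bert", exist_ok=True)
  tokenizer = ByteLevelBPETokenizer()
  tokenizer.train(files=["ces-1m.txt"], vocab_size=30000, min_frequency=2,
                  special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
  tokenizer.save_model("czech-bert")

  # 2. build a small model with a matching vocabulary size
  config = RobertaConfig(vocab_size=30000, max_position_embeddings=514,
                         num_hidden_layers=6, num_attention_heads=12,
                         type_vocab_size=1)
  model = RobertaForMaskedLM(config)
  fast_tokenizer = RobertaTokenizerFast.from_pretrained("czech-bert", max_len=512)

  # 3. train with the masked-language-modelling (fill-mask) objective
  dataset = LineByLineTextDataset(tokenizer=fast_tokenizer,
                                  file_path="ces-1m.txt", block_size=128)
  collator = DataCollatorForLanguageModeling(tokenizer=fast_tokenizer,
                                             mlm=True, mlm_probability=0.15)
  args = TrainingArguments(output_dir="czech-bert", overwrite_output_dir=True,
                           num_train_epochs=1, per_device_train_batch_size=32)
  Trainer(model=model, args=args, data_collator=collator,
          train_dataset=dataset).train()
  model.save_pretrained("czech-bert")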

Access the Python notebook in the Google Colab environment. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!

OR

download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.

Training data

  1. Small text for fast setup: R.U.R. from Project Gutenberg

https://www.gutenberg.org/files/13083/13083-0.txt

  2. Sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB)

https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
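
In Colab, a file can be fetched directly into the runtime, for example (a minimal sketch using the 1 MB sample):

  import urllib.request

  # download the small Czech sample next to the notebook
  urllib.request.urlretrieve("https://corpora.fi.muni.cz/ces-1m.txt", "ces-1m.txt")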

Task

Change the training data and tune the parameters (vocabulary size, training arguments, ...) to get reasonable answers to simple fill-mask questions, for example:

fill_mask("směrnice je určena členským <mask>")
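
Such a fill_mask function is typically obtained from the transformers fill-mask pipeline; a minimal sketch, assuming the model and tokenizer were saved to a directory named "czech-bert" (an illustrative name):

  from transformers import pipeline

  # load the trained model and its tokenizer into a fill-mask pipeline
  fill_mask = pipeline("fill-mask", model="czech-bert", tokenizer="czech-bert")

  # prints the top candidates for the masked token with their scores;
  # a reasonably trained model should rank words such as "státům" highly
  print(fill_mask("směrnice je určena členským <mask>"))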

Upload

Upload your modified notebook or Python script with results to the homework vault (odevzdávárna).