Changes between Version 19 and Version 20 of private/AdvancedNlpCourse/LanguageModelling


Timestamp:
Dec 4, 2020, 6:29:43 PM
Author:
pary

  • private/AdvancedNlpCourse/LanguageModelling

    v19 v20  
    33[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
    44
    5 Prepared by: Vít Baisa
     5Prepared by: Pavel Rychlý
    66
    77== State of the Art ==
    88
    9 The goal of a language model is a) to predict the following word or phrase based on a given history and b) to assign a probability (= score) to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since WWII. But recently, the buzzword deep learning has also penetrated language modelling, and it has turned out to be substantially better than Markov's n-gram models.
     9The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models, known since WWII. But recently, the buzzword deep learning has also penetrated language modelling, and it has turned out to be substantially better than Markov's n-gram models.
     10
     11The current state-of-the-art models are built on neural networks using transformers.
     12
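The n-gram scoring idea mentioned above fits in a few lines. As a minimal sketch (toy code, not part of the course materials), a bigram model with add-alpha smoothing scores a sentence as a product of conditional word probabilities estimated from counts:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def score(sentence, uni, bi, alpha=1.0):
    """Sentence probability under the bigram model with add-alpha smoothing."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(uni)
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bi[(w1, w2)] + alpha) / (uni[w1] + alpha * vocab)
    return p

corpus = ["the cat sat", "the dog sat", "a cat ran"]
uni, bi = train_bigram(corpus)
# a sentence seen in training scores higher than a shuffled one
print(score("the cat sat", uni, bi) > score("sat the cat", uni, bi))
```

Transformer models replace these count-based conditional probabilities with probabilities computed by a neural network conditioned on the whole context.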
    1013
    1114=== References ===
     15 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2
     16 1. Vaswani, Ashish, et al. "Attention Is All You Need". arXiv:1706.03762
     17 1. Alammar, Jay. "The Illustrated Transformer". jalammar.github.io
    1218
    13  1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
    14  1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
    15  1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
    16  1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
    1719
    1820== Practical Session ==
    1921
    20     ''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.'' (a model-generated Czech sentence, roughly: "God requested that the potential execution of construction for the classical use of technology worth hundreds of millions of crowns not take place.")
     22=== Technical Requirements ===
    2123
    22 We will build a simple character-based language model and generate natural-looking sentences. We need a plain text file and a fast suffix sorting algorithm (mksary).
     24The task will be carried out in a Python notebook run in a web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment.
    2325
    24 http://corpora.fi.muni.cz/cblm/generate.cgi
     26If you run the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook;
     27the main module, `huggingface/transformers`, is installed at the beginning of the notebook.
    2528
    26 == Getting necessary data and tools ==
    2729
    28 * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])
    29 * {{{tar xf cblm.tar}}} in your directory
    30 * {{{cd cblm}}}
    31 * get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])
    32   {{{
    33 scp aurora:/corpora/data/cblm/data/cztenten.trie .
    34 }}}
    35  
    36 * get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])
    37   {{{
    38 scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
    39 }}}
    40  
    41 * if you want to work with English, download
    42   [[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]
    4330
    44 === mksary ===
    4531
    46 {{{
    47 git clone https://github.com/lh3/libdivsufsort.git
    48 cd libdivsufsort
    49 cmake -DCMAKE_BUILD_TYPE="Release" \
    50     -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"
    51 make
    52 cd ..
    53 ln -s libdivsufsort/examples/mksary mksary
    54 }}}
    5532
    56 == Training data ==
     33=== BERT-like language model from scratch ===
    5734
    58 To build a new model, we need
    59 * a plain text file (suffix .in) all in lowercase:
    60 {{{
    61 xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in
    62 }}}
    63 * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
    64 * and compute the prefix tree:
    65 {{{
    66 python build_trie.py FILE_PREFIX [MINFREQ]
    67 }}}
     35In this workshop, we create a BERT-like language model for Czech from our own texts.
     36We investigate the tokenization of such models and experiment with the ''fill mask'' task
     37for learning and evaluating neural language models.
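Subword tokenization of this kind can be illustrated with a toy byte-pair-encoding (BPE) loop that repeatedly fuses the most frequent adjacent symbol pair. Real tokenizers (WordPiece for BERT, the trainers in the Hugging Face libraries used by the notebook) are more elaborate, so treat this as a conceptual sketch only:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def bpe_merges(text, num_merges):
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent pair."""
    # each word is a tuple of symbols, starting from single characters
    words = Counter(tuple(w) for w in text.lower().split())
    merges = []
    for _ in range(num_merges):
        a, b = most_frequent_pair(words)
        merges.append((a, b))
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)  # fuse the pair into one symbol
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, words = bpe_merges("low lower lowest low low", 3)
print(merges)
```

Frequent words end up as single tokens while rare words stay split into subword pieces, which is why vocabulary size is one of the parameters worth tuning below.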
    6838
    69 The model will be stored in FILE_PREFIX.trie file.
     39Access the [[https://colab.research.google.com/drive/1f0fMlud37ybxDdW1RNo8ZkfQ-rJoSkHv?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!
    7040
    71 == Generating text ==
     41OR
    7242
    73 To generate a random text, just run
    74 {{{
    75 python alarm.py FILE_PREFIX.trie
    76 }}}
     43download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.
    7744
    78 You may try to generate a random sentence using the large 3G model:
    79 {{{
    80 python alarm.py cztenten.trie
    81 }}}
     45
     46
     47=== Training data ===
     48
     491. A small text for fast setup: RUR from Project Gutenberg:
     50    https://www.gutenberg.org/files/13083/13083-0.txt
     511. A sample from the Czech part of the Europarl corpus (1 MB, 10 MB, 150 MB):
     52    https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt
     53
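Whichever file you pick, it helps to normalize it before training. A small helper along these lines (an illustrative assumption about preprocessing, not a step prescribed by the notebook) lowercases the raw text and writes one sentence per line, the shape tokenizer trainers commonly expect:

```python
import re
from pathlib import Path

def prepare_corpus(raw_text, out_path):
    """Lowercase raw text and write one sentence per line for training."""
    text = raw_text.lower()
    # naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    lines = [s.strip() for s in sentences if s.strip()]
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")
    return len(lines)

n = prepare_corpus("První věta. Druhá věta! Třetí?", "train.txt")
print(n)  # number of lines written
```

For the real corpora above you would pass the downloaded file's contents instead of the inline string.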
    8254
    8355=== Task ===
    8456
    85 Change the training data and process to generate the most natural-looking sentences, either by
    86 * pre-processing the input plain text or
    87 * setting training parameters or
    88 * changing training and generating scripts (advanced).
     57Change the training data and tune the parameters (vocab size, training args, ...) to get
     58reasonable answers to simple ''fill mask'' questions, for example:
     59{{{
     60fill_mask("směrnice je určena členským <mask>")
     61}}}
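Here `fill_mask` is the `transformers` fill-mask pipeline built in the notebook; it returns the top candidates for the masked token, ranked by the model's probability at that position. The ranking step itself can be sketched without a trained model; the candidate words and logits below are made up for illustration:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits.values())
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def fill_mask_rank(sentence, logits, top_k=3):
    """Mimic the pipeline output: top_k candidates with scores and filled sentences."""
    probs = softmax(logits)
    best = sorted(probs, key=probs.get, reverse=True)[:top_k]
    return [{"token_str": tok,
             "score": probs[tok],
             "sequence": sentence.replace("<mask>", tok)}
            for tok in best]

# hypothetical logits a model might assign at the masked position
logits = {"státům": 5.1, "zemím": 3.7, "orgánům": 2.0, "psům": -1.3}
for cand in fill_mask_rank("směrnice je určena členským <mask>", logits):
    print(f'{cand["score"]:.3f}  {cand["sequence"]}')
```

A well-tuned model should put most of the probability mass on a sensible completion such as ''státům'' ("states") rather than an unrelated word.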
    8962
    90 Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tunings in a README file, where you can put some hilarious random-sentence examples.
     63=== Upload ===
     64Upload your modified notebook or Python script with results to the [[https://nlp.fi.muni.cz/en/AdvancedNlpCourse|homework vault (odevzdávárna)]].
     65