Changes between Version 19 and Version 20 of private/NlpInPracticeCourse/LanguageModelling
Timestamp: Dec 4, 2020, 6:29:43 PM
Legend: unchanged lines are shown as-is; lines removed in v20 are prefixed with "-", lines added in v20 are prefixed with "+".
private/NlpInPracticeCourse/LanguageModelling
v19 v20 3 3 [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák 4 4 5 Prepared by: Vít Baisa5 Prepared by: Pavel Rychlý 6 6 7 7 == State of the Art == 8 8 9 The goal of a language model is a) to predict a following word or phrase based on a given history and b) to assign a probability (= score) to any possible input sentence. In the past, this was achieved mainly by n-gram models known since WWII. But recently, the buzzword deep learning penetrated also into language modelling and it turned out to be substantially better than Markov's n-gram models. 9 The goal of a language model is to assign a score to any possible input sentence. In the past, this was achieved mainly by n-gram models known since WWII. But recently, the buzzword deep learning penetrated also into language modelling and it turned out to be substantially better than Markov's n-gram models. 10 11 The current state of the art models are build on neural networks using transformers. 12 10 13 11 14 === References === 15 1. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 16 1. Polosukhin, Illia, et al. "Attention Is All You Need". arXiv:1706.03762 17 1. Alammar, Jay. "The Illustrated Transformer". jalammar.github.io 12 18 13 1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.14 1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).15 1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.16 1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).17 19 18 20 == Practical Session == 19 21 20 ''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.'' 22 === Technical Requirements === 21 23 22 We will build a simple character-based language model and generate naturally-looking sentences. We need a plain text and fast suffix sorting algorithm (mksary).24 The task will proceed using Python notebook run in web browser in the Google [[https://colab.research.google.com/|Colaboratory]] environment. 23 25 24 http://corpora.fi.muni.cz/cblm/generate.cgi 26 In case of running the codes in a local environment, the requirements are Python 3.6+, jupyter notebook, 27 the main module `huggingface/transformers` is installed at the beginning of the notebook. 
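To make the ''score'' from the State of the Art section concrete, the following minimal sketch scores sentences with a pretrained causal language model via `huggingface/transformers`. The model name ("gpt2") and the mean log-likelihood scoring are illustrative assumptions, not part of the course materials:

{{{
# A minimal sketch (hypothetical, not from the course notebook): scoring
# sentences with a pretrained causal language model. The score below is the
# negative mean cross-entropy of the tokens; higher (less negative) = more
# plausible under the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def score(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()

print(score("The cat sat on the mat."))
print(score("Mat the on sat cat the."))  # word salad should score lower
}}}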
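The Technical Requirements note that `huggingface/transformers` is installed at the beginning of the notebook. A first cell along these lines (a sketch; the actual shared notebook may differ) installs the module and smoke-tests it with a pretrained English model:

{{{
# Sketch of a typical first notebook cell. In Colab/Jupyter, the install line
# is run as a shell command:
#   !pip install transformers
from transformers import pipeline

# Quick smoke test of the fill-mask task used later in the session.
# "distilroberta-base" is an illustrative pretrained model; RoBERTa-style
# tokenizers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="distilroberta-base")
for candidate in fill_mask("The capital of France is <mask>."):
    print(candidate["token_str"], round(candidate["score"], 3))
}}}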
25 28 26 == Getting necessary data and tools ==27 29 28 * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])29 * {{{tar xf cblm.tar}}} in your directory30 * {{{cd cblm}}}31 * get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])32 {{{33 scp aurora:/corpora/data/cblm/data/cztenten.trie .34 }}}35 36 * get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])37 {{{38 scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .39 }}}40 41 * if you want to work with English, download42 [[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]43 30 44 === mksary ===45 31 46 {{{47 git clone https://github.com/lh3/libdivsufsort.git48 cd libdivsufsort49 cmake -DCMAKE_BUILD_TYPE="Release" \50 -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"51 make52 cd ..53 ln -s libdivsufsort/examples/mksary mksary54 }}}55 32 56 == Training data==33 === BERT-like language model from scratch === 57 34 58 To build a new model, we need 59 * a plain text file (suffix .in) all in lowercase: 60 {{{ 61 xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in 62 }}} 63 * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}} 64 * and compute the prefix tree: 65 {{{ 66 python build_trie.py FILE_PREFIX [MINFREQ] 67 }}} 35 In this workshop, we create a BERT-like language model for Czech from own texts. 36 We investigate tokenization of such models and experiment with ''fill mask'' task 37 for learning and evaluating neural language models. 68 38 69 The model will be stored in FILE_PREFIX.trie file. 39 Access the [[https://colab.research.google.com/drive/1f0fMlud37ybxDdW1RNo8ZkfQ-rJoSkHv?usp=sharing|Python notebook in the Google Colab environment]]. Do not forget to save your work if you want to see your changes later, leaving the browser will throw away all changes! 70 40 71 == Generating text == 41 OR 72 42 73 To generate a random text, just run 74 {{{ 75 python alarm.py FILE_PREFIX.trie 76 }}} 43 download the notebook or plain python file from the shared notebook (File > Download) and run in your local environment. 77 44 78 You may try to generate a random sentence using the large 3G model: 79 {{{ 80 python alarm.py cztenten.trie 81 }}} 45 46 47 === Training data === 48 49 1. Small text for fast setup: RUR from Project Gutenberg 50 https://www.gutenberg.org/files/13083/13083-0.txt 51 1. Sample from Czech part of the Europarl corpus, (1 MB, 10 MB, 150 MB) 52 https://corpora.fi.muni.cz/ces-1m.txt, https://corpora.fi.muni.cz/ces-10m.txt, https://corpora.fi.muni.cz/ces-150m.txt 53 82 54 83 55 === Task === 84 56 85 Change the training data and process to generate the most naturally-looking sentences. Either by 86 * pre-processing the input plain text or 87 * setting training parameters or 88 * changing training and generating scripts (advanced). 57 Change the training data, tune parameters (vocab size, training args, ...) to get 58 reasonable answer to simple ''fill mask'' questions, for example: 59 {{{ 60 fill_mask("směrnice je určena členským <mask>") 61 }}} 89 62 90 Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tunings in README file where you can put some hilarious random sentence examples. 63 === Upload === 64 Upload your modified notebook or python script with results to the [[https://nlp.fi.muni.cz/en/AdvancedNlpCourse|homework vault (odevzdávárna)]. 65