Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2019/LanguageModelling

Oct 1, 2020, 3:34:10 PM (23 months ago)
Ales Horak

copied from private/AdvancedNlpCourse/LanguageModelling


= Language modelling =

[[|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Vít Baisa

== State of the Art ==
The goal of a language model is (a) to predict the following word or phrase given a history and (b) to assign a probability (a score) to any possible input sentence. In the past, this was achieved mainly by n-gram models, which have been known since WWII. More recently, deep learning has also entered language modelling, where neural models have turned out to be substantially better than Markov n-gram models.
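Both goals can be illustrated with a very small word-level bigram model. The following is only a minimal sketch with add-one smoothing; the function names ({{{train_bigram}}}, {{{prob}}}, {{{sentence_logprob}}}, {{{predict_next}}}) and the whitespace tokenisation are assumptions made for this illustration and are not part of the course tools.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous word, word) pairs
    # Words the model may predict: all seen words plus the end marker
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab


def prob(word, prev, model):
    """P(word | prev) with add-one (Laplace) smoothing."""
    histories, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


def sentence_logprob(sentence, model):
    """Goal (b): score a whole sentence as a sum of log-probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(prob(w, p, model)) for p, w in zip(tokens, tokens[1:]))


def predict_next(prev, model):
    """Goal (a): the most probable word after the one-word history `prev`."""
    vocab = model[2]
    return max(sorted(vocab), key=lambda w: prob(w, prev, model))
```

Trained on a toy corpus such as {{{["the cat sat", "the cat ran", "the dog sat"]}}}, the sketch predicts "cat" after "the" and scores "the cat sat" higher than "sat cat the". A real n-gram model would use longer histories and a better smoothing method.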
=== References ===

 1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
 1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
 1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
 1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
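== Practical Session ==

    ''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.''
    (Roughly: "God asked that there be no eventual carrying out of construction for the classical use of technology amounting to hundreds of millions of crowns.")

We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix sorting tool (mksary).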
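To see why suffix sorting is useful here, consider the sketch below. It builds a naive suffix array in pure Python and uses it to count how often a character sequence occurs and which characters follow it. This is only an illustration: it is far slower than mksary, and the function names are hypothetical and do not come from the cblm scripts.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def build_suffix_array(text):
    """Naive suffix array: start offsets ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _match_range(text, sa, pattern):
    """Slice bounds of `sa` holding the suffixes that start with `pattern`."""
    # Truncated suffixes stay sorted, so binary search applies.
    # (Rebuilding this list per query is wasteful; it keeps the sketch short.)
    prefixes = [text[i:i + len(pattern)] for i in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


def count_occurrences(text, sa, pattern):
    """Number of times `pattern` occurs in `text`."""
    lo, hi = _match_range(text, sa, pattern)
    return hi - lo


def next_char_counts(text, sa, history):
    """Frequencies of the characters that directly follow `history`."""
    lo, hi = _match_range(text, sa, history)
    counts = Counter()
    for i in sa[lo:hi]:
        j = i + len(history)
        if j < len(text):  # an occurrence at the very end has no successor
            counts[text[j]] += 1
    return counts
```

For the text {{{abracadabra}}}, {{{count_occurrences}}} reports 2 for {{{abra}}}, and {{{next_char_counts}}} for the history {{{a}}} gives b twice and c and d once each. These counts are exactly what a character-based model needs to estimate the next character.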
== Getting necessary data and tools ==

* {{{wget}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])
* {{{tar xf cblm.tar}}} in your directory
* {{{cd cblm}}}
* get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])
  {{{
scp aurora:/corpora/data/cblm/data/cztenten.trie .
  }}}
* get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])
  {{{
scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
  }}}
* if you want to work with English, download
  [[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]
=== mksary ===

{{{
git clone
cd libdivsufsort
cmake -DCMAKE_BUILD_TYPE="Release" \
cd ..
ln -s libdivsufsort/examples/mksary mksary
}}}
== Training data ==

To build a new model, we need to:
* prepare a plain text file (suffix .in), all in lowercase:
  {{{
xzcat cztenten_1M_sentences.txt.xz | python >
  }}}
* create a suffix array with {{{./mksary}}}
* compute the prefix tree:
  {{{
python FILE_PREFIX [MINFREQ]
  }}}

The model will be stored in the FILE_PREFIX.trie file.
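The role of a prefix tree with a minimum frequency can be sketched as follows. This is not the cblm implementation or its file format; it is a hypothetical in-memory trie in which {{{max_len}}} and {{{min_freq}}} are assumed parameters, the latter playing a role similar to MINFREQ: character sequences seen fewer than {{{min_freq}}} times are not stored.

```python
from collections import Counter


def build_trie(text, max_len=4, min_freq=2):
    """Store counts of character n-grams (n <= max_len) seen >= min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    root = {"count": len(text), "children": {}}
    # Sorted order visits every prefix before its extensions. A prefix is at
    # least as frequent as its extension, so no kept n-gram lacks a parent.
    for gram, c in sorted(counts.items()):
        if c < min_freq:
            continue  # prune rare sequences to keep the model small
        node = root
        for ch in gram:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
        node["count"] = c
    return root


def lookup(root, gram):
    """Stored count of `gram`, or 0 if it was pruned or never seen."""
    node = root
    for ch in gram:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return node["count"]
```

With the text {{{abracadabra}}} and {{{min_freq=2}}}, the sequence {{{abra}}} is kept with count 2, while {{{c}}}, which occurs only once, is pruned. Raising the threshold gives a smaller model at the cost of losing rare contexts.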
== Generating text ==

To generate random text, just run
  {{{
python FILE_PREFIX.trie
  }}}

You may also try generating a random sentence using the large 3 GB model:
  {{{
python cztenten.trie
  }}}
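How generation from a character-based model can work is sketched below. This is an assumption-laden illustration, not the course's generating script: the model is a plain dictionary of next-character counts, and {{{train_char_model}}}, {{{generate}}}, {{{order}}} and {{{seed}}} are hypothetical names. Each character is sampled in proportion to how often it followed the current history, backing off to a shorter history when the full one was never seen.

```python
import random
from collections import Counter, defaultdict


def train_char_model(text, order=3):
    """Map each history of length 0..order to counts of the next character."""
    model = defaultdict(Counter)
    for i, ch in enumerate(text):
        for n in range(order + 1):
            if i - n >= 0:
                model[text[i - n:i]][ch] += 1
    return model


def generate(model, order=3, length=80, seed=None):
    """Sample `length` characters, one at a time, from the trained counts."""
    if "" not in model:
        raise ValueError("model was trained on empty text")
    rng = random.Random(seed)
    out = ""
    for _ in range(length):
        history = out[-order:] if order else ""
        # Back off to shorter histories until one was seen in training;
        # the empty history is always present, so the loop terminates.
        while history not in model:
            history = history[1:]
        chars, weights = zip(*sorted(model[history].items()))
        out += rng.choices(chars, weights=weights)[0]
    return out
```

A larger {{{order}}} makes the output look more like the training text, while a smaller one makes it more random. Trying several values on a small corpus is a quick way to get a feel for that trade-off.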
=== Task ===

Change the training data and the training process so that the model generates the most natural-looking sentences, either by
* pre-processing the input plain text or
* setting training parameters or
* changing the training and generating scripts (advanced).
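As one possible starting point for the pre-processing option, the sketch below normalises whitespace, lowercases each line, and drops lines that are very short, duplicated, or contain unusual characters. The threshold {{{min_chars}}}, the set {{{ALLOWED_PUNCT}}} and the function names are assumptions chosen for illustration. Whether such filtering actually improves the generated sentences is something to verify experimentally.

```python
import re

# Hypothetical whitelist of non-alphanumeric characters to keep
ALLOWED_PUNCT = set(" .,;:!?-'\"()")


def clean_line(line, min_chars=20):
    """Return a normalised line, or None if the line should be dropped."""
    line = re.sub(r"\s+", " ", line).strip().lower()
    if len(line) < min_chars:
        return None  # too short to be a useful sentence
    for ch in line:
        if not (ch.isalpha() or ch.isdigit() or ch in ALLOWED_PUNCT):
            return None  # likely markup, a URL or other noise
    return line


def clean_corpus(lines, min_chars=20):
    """Yield cleaned lines, skipping dropped lines and exact duplicates."""
    seen = set()
    for line in lines:
        cleaned = clean_line(line, min_chars)
        if cleaned is not None and cleaned not in seen:
            seen.add(cleaned)
            yield cleaned
```

The generator can be applied to any iterable of lines, for example an open text file, and its output written to the .in file used for training.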
Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tuning in a README file, where you can also include some hilarious examples of random sentences.