Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2015/LanguageModelling


Timestamp: Sep 11, 2017, 4:38:26 PM
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/LanguageModelling

= Language modelling =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Vít Baisa

== State of the Art ==

The goal of a language model is (a) to predict the following word or phrase given a history and (b) to assign a probability (a score) to any possible input sentence. In the past, this was achieved mainly by n-gram models, which have been known since WWII. Recently, however, deep learning has made its way into language modelling, and neural models have turned out to be substantially better than Markov n-gram models.
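
To make the two goals concrete, here is a minimal sketch of a word-level bigram (Markov) model. It is an illustration only, not part of the course tools: the function names and the toy corpus are hypothetical, and the estimates are plain maximum-likelihood counts without smoothing, so any unseen bigram gives a sentence probability of zero.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return histories, bigrams


def sentence_probability(sentence, histories, bigrams):
    """Goal (b): P(sentence) as a product of P(word | previous word)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if histories[prev] == 0:
            return 0.0  # unseen history, no estimate without smoothing
        prob *= bigrams[(prev, word)] / histories[prev]
    return prob


def predict_next(word, bigrams):
    """Goal (a): the most frequent word observed after `word`, or None."""
    followers = {w: c for (p, w), c in bigrams.items() if p == word}
    return max(followers, key=followers.get) if followers else None


# Hypothetical toy corpus, for illustration only.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
histories, bigrams = train_bigram(toy_corpus)
print(sentence_probability("the cat sat", histories, bigrams))
print(predict_next("the", bigrams))
```

A real n-gram model would add smoothing or back-off so that unseen word pairs do not zero out the whole sentence.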

=== References ===

 1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155.
 1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
 1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
 1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).

== Practical Session ==

    ''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.'' (roughly: "God requested that there be no possible carrying out of construction for the classical use of technology amounting to hundreds of millions of crowns.")

We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain-text corpus and a fast suffix-sorting tool (mksary).

http://corpora.fi.muni.cz/cblm/generate.cgi

== Getting necessary data and tools ==

* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}}
* {{{tar xf cblm.tar}}} in your directory
* {{{cd cblm}}}
* get a large Czech model (3 GB): [[BR]]
  {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
  (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])

=== mksary ===

* {{{git clone https://github.com/lh3/libdivsufsort.git}}}
* {{{cd libdivsufsort}}}
* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .}}}
* {{{make}}}
* {{{cd ..}}}
* {{{ln -s libdivsufsort/examples/mksary mksary}}}

== Training data ==

To build a new model, we need to
* prepare a plain text (see the {{{data}}} directory; use {{{lower.py}}}),
* create a suffix array: {{{./mksary INPUT.txt OUTPUT.sa}}},
* and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]}}}

The model is stored in the resulting .trie file.
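
The following sketch illustrates why a suffix array is a convenient starting point for counting character n-grams: once all suffixes are sorted, every occurrence of the same n-gram sits in one contiguous run. This is not the code of {{{mksary}}} or {{{build_trie.py}}}, whose internals are not shown here. The function names are hypothetical, the sorting is naive and only suitable for toy inputs, and the {{{min_freq}}} parameter is only assumed to play a role similar to {{{MINFREQ}}}.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_ngrams(text, suffix_array, max_len, min_freq=1):
    """Count character n-grams (n = 1..max_len) by scanning runs of equal
    prefixes in suffix-array order; keep those seen at least min_freq times."""
    counts = {}
    for n in range(1, max_len + 1):
        run, run_len = None, 0
        for start in suffix_array:
            gram = text[start:start + n]
            if len(gram) < n:
                continue  # this suffix is too short to contain an n-gram
            if gram == run:
                run_len += 1
            else:
                if run is not None and run_len >= min_freq:
                    counts[run] = run_len
                run, run_len = gram, 1
        if run is not None and run_len >= min_freq:
            counts[run] = run_len
    return counts


# Toy input, for illustration only.
text = "abracadabra"
suffix_array = build_suffix_array(text)
frequent = count_ngrams(text, suffix_array, max_len=3, min_freq=2)
print(suffix_array)
print(frequent)
```

A prefix tree over these counts would share common prefixes between n-grams; the frequency threshold keeps rare branches from bloating it.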

== Generating text ==

To generate random text, run
{{{python alarm.py FILE.trie}}}
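
As a rough idea of what character-based generation involves, the sketch below samples one character at a time, conditioned on the longest history seen in training, and backs off to shorter histories when needed. It is an assumption-laden illustration, not the algorithm of {{{alarm.py}}}: the function names, the fixed {{{order}}} and the back-off scheme are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_char_model(text, order):
    """Map each history of 0..order characters to a Counter of next characters."""
    model = defaultdict(Counter)
    for i, ch in enumerate(text):
        for k in range(order + 1):
            if i - k < 0:
                break
            model[text[i - k:i]][ch] += 1
    return model


def generate(model, order, length, seed=None):
    """Sample `length` characters, backing off from long to short histories."""
    if "" not in model:
        raise ValueError("model was trained on empty text")
    rng = random.Random(seed)
    out = ""
    while len(out) < length:
        for k in range(min(order, len(out)), -1, -1):
            history = out[len(out) - k:]   # k == 0 gives the empty history
            followers = model.get(history)
            if followers:
                chars = list(followers)
                weights = [followers[c] for c in chars]
                out += rng.choices(chars, weights=weights)[0]
                break
    return out


# Toy training text, for illustration only.
sample = "the cat sat on the mat. the dog sat on the log. "
model = train_char_model(sample, order=3)
print(generate(model, order=3, length=60, seed=1))
```

Longer histories copy the training text more faithfully, while shorter ones produce more novel but less fluent output; that trade-off is one of the knobs worth experimenting with.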

=== Task ===

Change the training process and the generating process so that the generated sentences look as natural as possible, either by
* pre-processing the input plain text,
* setting the training parameters,
* changing the generating process,
* or all of the above.
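
One possible pre-processing step, sketched under assumptions: normalise the Unicode form, lowercase the text and collapse whitespace so that the model does not waste capacity on casing and layout. The function {{{normalise}}} is hypothetical and is not claimed to match what {{{lower.py}}} does.

```python
import re
import unicodedata


def normalise(text):
    """Lowercase, unify the Unicode form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalise("Bůh  požádal,\n\taby nedošlo"))
```

Whether such normalisation helps depends on the goal: a lowercased model cannot generate capitalised sentence starts, so the output may need post-processing.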

Upload 10,000 random sentences to your vault. Describe your changes and tuning in a README, where you can also include some hilarious examples of random sentences.