| 1 | = Language modelling = |
| 2 | |
| 3 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák |
| 4 | |
| 5 | Prepared by: Vít Baisa |
| 6 | |
| 7 | == State of the Art == |
| 8 | |
The goal of a language model is a) to predict the following word or phrase from a given history and b) to assign a probability (= score) to any possible input sentence. In the past this was achieved mainly by n-gram models, in use since the WWII era. Recently, however, deep learning has made its way into language modelling as well, and it has turned out to be substantially better than Markov n-gram models.
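
Both goals can be illustrated with a toy word-bigram model, sketched below in plain Python (a minimal illustration with add-one smoothing and a made-up corpus, not one of the course tools):

{{{
from collections import Counter

# Toy corpus; a real model is estimated from large texts.
words = "the cat sat on the mat . the dog sat on the mat .".split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
V = len(unigrams)  # vocabulary size, used for add-one smoothing

def prob(word, history):
    """P(word | history) with add-one (Laplace) smoothing."""
    return (bigrams[(history, word)] + 1) / (unigrams[history] + V)

def predict(history):
    """Goal a): the most likely next word after `history`."""
    return max(unigrams, key=lambda w: prob(w, history))

def score(sentence):
    """Goal b): the probability (score) of a whole sentence."""
    ws = sentence.split()
    p = 1.0
    for h, w in zip(ws, ws[1:]):
        p *= prob(w, h)
    return p

print(predict("the"))                     # 'mat'
print(score("the dog sat on the mat ."))
}}}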
| 10 | |
| 11 | === References === |
| 12 | |
| 13 | 1. Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155. |
| 14 | 1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). |
| 15 | 1. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013. |
| 16 | 1. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013). |
| 17 | |
| 18 | == Practical Session == |
| 19 | |
''Bůh požádal, aby nedošlo k případnému provádění stavby pro klasické použití techniky ve výši stovek milionů korun.'' [[BR]]
(An example of a generated Czech sentence; roughly: "God requested that the possible construction work for the classical use of technology, worth hundreds of millions of crowns, not take place.")
| 21 | |
We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text corpus and a fast suffix sorting tool (mksary).
| 23 | |
| 24 | http://corpora.fi.muni.cz/cblm/generate.cgi |
| 25 | |
| 26 | == Getting necessary data and tools == |
| 27 | |
| 28 | * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} |
| 29 | * {{{tar xf cblm.tar}}} in your directory |
| 30 | * {{{cd cblm}}} |
| 31 | * get a large Czech model (3 GB): [[BR]] |
| 32 | {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]] |
| 33 | (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]]) |
| 34 | |
| 35 | === mksary === |
| 36 | |
| 37 | * {{{git clone https://github.com/lh3/libdivsufsort.git}}} |
| 38 | * {{{cd libdivsufsort}}} |
* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .}}}
| 40 | * {{{make}}} |
* {{{cd ..}}}
| 42 | * {{{ln -s libdivsufsort/examples/mksary mksary}}} |
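
To check what mksary computes on a small input: a suffix array lists the starting positions of all suffixes of the text, sorted by the lexicographic order of those suffixes. A naive Python equivalent (quadratic, for toy strings only; libdivsufsort does the same efficiently for gigabytes of text):

{{{
def suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
}}}

Because suffixes sharing a prefix end up adjacent in the array, the frequency of any character n-gram can be read off as the length of a contiguous block, which is what makes the array useful for the trie-building step.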
| 43 | |
| 44 | == Training data == |
| 45 | |
To build a new model, we need to
* prepare a plain text (see the {{{data}}} directory; use {{{lower.py}}} to lowercase it),
* create a suffix array: {{{./mksary INPUT.txt OUTPUT.sa}}},
* and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]}}}.
| 50 | |
The resulting model is stored in the {{{.trie}}} file.
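
Conceptually, the model holds counts of frequent character n-grams (prefixes), with n-grams rarer than MINFREQ pruned away. The sketch below only illustrates that idea with a flat dictionary; it is a hypothetical stand-in, not the actual {{{build_trie.py}}}, which reads the counts off the suffix array and stores them in a trie:

{{{
from collections import defaultdict

def ngram_counts(text, max_depth=8, min_freq=2):
    """Count character n-grams up to max_depth; prune rare ones."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for n in range(1, max_depth + 1):
            if i + n > len(text):
                break
            counts[text[i:i + n]] += 1
    # MINFREQ pruning keeps the model small.
    return {g: c for g, c in counts.items() if c >= min_freq}
}}}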
| 52 | |
| 53 | == Generating text == |
| 54 | |
To generate random text, just run
| 56 | {{{python alarm.py FILE.trie}}} |
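
Generation from such a model typically samples the next character in proportion to the counts of the longest known context, backing off to shorter contexts when the current one is unseen. A rough sketch using the hypothetical counts from the previous section, not the actual {{{alarm.py}}}:

{{{
import random

def generate(counts, max_depth=8, length=200):
    """Sample text character by character from n-gram counts."""
    out = " "
    while len(out) < length:
        for n in range(max_depth - 1, -1, -1):  # back off: long to short
            history = out[len(out) - n:]
            cands = {g[-1]: c for g, c in counts.items()
                     if len(g) == n + 1 and g[:-1] == history}
            if cands:
                chars, weights = zip(*cands.items())
                out += random.choices(chars, weights=weights)[0]
                break
        else:
            return out  # no continuation found at any order
    return out
}}}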
| 57 | |
| 58 | === Task === |
| 59 | |
Change the training process and the generating process to generate the most natural-looking sentences, either by
* pre-processing the input plain text (see the sketch after this list),
* setting the training parameters,
* changing the generating process,
* or all of the above.
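
For the pre-processing option, even a trivial clean-up of the training text can help. An illustrative sketch (a hypothetical {{{filter.py}}}, not part of the archive) that keeps only lines which look like complete sentences:

{{{
import re
import sys

for line in sys.stdin:
    line = re.sub(r"\s+", " ", line).strip()
    # Keep lines that start with a capital letter and end with
    # terminal punctuation; drop headings, fragments and noise.
    if len(line) > 20 and line[0].isupper() and line[-1] in ".!?":
        print(line)
}}}

Run it as {{{python filter.py < INPUT.txt > CLEAN.txt}}} and train on the cleaned file.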
| 65 | |
Upload 10,000 random sentences to your vault. Describe your changes and tunings in a README, where you can also include some hilarious examples of the random sentences.