
LanguageModelling

Timestamp: Oct 30, 2017, 7:52:42 AM
Author: Vít Baisa
Comment: --

private/NlpInPracticeCourse/LanguageModelling

-                      v15
+                      v16
 * get a large Czech model (3 GB): [[BR]]
   {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
+  or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
   (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
+* get a large plain text: [[BR]]
+  {{{scp anxur:/tmp/cztenten.txt.xz .}}} [[BR]]
+  or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}} [[BR]]
 === mksary ===
 …
 * {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
 * {{{make}}}
 * {{{cd ...}}}
+* {{{cd ..}}}
 * {{{ln -s libdivsufsort/examples/mksary mksary}}}
 …
 To build a new model, we need
 * a plain text, see {{{data}}} directory, use {{{lower.py}}}
 * to create a suffix array {{{./mksary INPUT.txt OUTPUT.sa}}}
 * and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUPUTFILE]}}}
+* a plain text file (suffix .in), all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
+* to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
+* and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
 In .trie file, the model is stored.
+The model will be stored in the FILE_PREFIX.trie file.
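
The two build steps can be illustrated on a toy string. This is only a minimal sketch of the idea (sorted suffixes, then substring counts pruned by a minimum frequency); it is not the code of {{{mksary}}} or {{{build_trie.py}}}, and the names {{{suffix_array}}}, {{{count_prefixes}}} and {{{max_len}}} are hypothetical.

```python
from collections import Counter

def suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive, toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_prefixes(text, sa, max_len, min_freq):
    """Count substrings of length 1..max_len and keep those seen at least min_freq times.

    Assumption: min_freq plays a role similar to the MINFREQ argument above;
    the real script may prune differently.
    """
    counts = Counter()
    for start in sa:
        for n in range(1, max_len + 1):
            if start + n <= len(text):
                counts[text[start:start + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}

text = "abracadabra".lower()
sa = suffix_array(text)
model = count_prefixes(text, sa, max_len=3, min_freq=2)
print(sa)
print(sorted(model.items()))
```

A higher minimum frequency gives a smaller model but forgets rare contexts.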
 == Generating text ==
 To generate random text, just run
+{{{python alarm.py FILE.trie}}}
+{{{python alarm.py FILE_PREFIX.trie}}}
+You may try to generate a random sentence using the large 3 GB model:
+{{{python alarm.py cztenten.trie}}}
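
For intuition, generation from a character model can be sketched as repeated sampling of the next character given the last few characters. The sketch below is an illustration under assumptions, not the algorithm used by {{{alarm.py}}}: it uses a fixed context length, and the names {{{train}}}, {{{generate}}} and {{{order}}} are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train(text, order):
    """Map each context of `order` characters to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table

def generate(table, order, seed_text, length, rng):
    """Extend seed_text one character at a time, sampling in proportion to observed counts."""
    out = seed_text
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:
            break  # unseen context: stop instead of guessing
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

table = train("abababab", order=2)
sample = generate(table, 2, "ab", 6, random.Random(0))
print(sample)
```

Longer contexts usually give more fluent output but copy more of the training text verbatim.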
 === Task ===
 Change the training process and the generating process to generate the most naturally-looking sentences. Either by
+Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most natural-looking sentences, either by
 * pre-processing the input plain text or
 * setting training parameters or
 * changing generating process
+* changing the generating process
 * or all of the above.
 Upload 10,000 random sentences to your vault. Describe your changes, tunings in README where you can put some hilarious random sentence examples.
+Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tunings in a README file, where you can also put some hilarious random sentence examples.
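
As one possible starting point for the pre-processing option, the input lines could be normalised and filtered before training. This is a sketch under assumptions: the function name {{{clean_line}}}, the {{{min_words}}} threshold and the filtering rules are hypothetical choices, not part of the course scripts.

```python
import re

def clean_line(line, min_words=3):
    """Return a normalised lowercase line, or None if the line should be dropped."""
    line = re.sub(r"\s+", " ", line).strip().lower()
    if len(line.split()) < min_words:
        return None  # too short to be a useful sentence
    if re.search(r"[<>{}\[\]|\\]|https?://", line):
        return None  # markup debris or URLs
    return line

lines = ["  Dobrý   den,  jak se máte? ", "Ahoj", "see http://example.com for more info"]
cleaned = [c for c in map(clean_line, lines) if c is not None]
print(cleaned)
```

Whether such filtering helps depends on the corpus, so it is worth comparing generated samples before and after.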