LanguageModelling

Timestamp: Oct 30, 2017, 11:42:36 AM
Author: Ales Horak
Comment: --

private/NlpInPracticeCourse/LanguageModelling

-                      v18
+                      v19
 == Getting necessary data and tools ==
 * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}}
+* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])
 * {{{tar xf cblm.tar}}} in your directory
 * {{{cd cblm}}}
+* get a large Czech model (3 GB): [[BR]]
+  {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
+  or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
+  (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
+* get a large plain text: [[BR]]
+  {{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
+   or [[BR]]
+  {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
+* if you want to work with English, download {{{ententen_1M_sentences.txt.xz}}}
+* get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])
+  {{{
+scp aurora:/corpora/data/cblm/data/cztenten.trie .
+}}}
+* get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])
+  {{{
+scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
+}}}
+* if you want to work with English, download
+  [[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]
 === mksary ===
+* {{{git clone https://github.com/lh3/libdivsufsort.git}}}
+* {{{cd libdivsufsort}}}
+* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
+* {{{make}}}
+* {{{cd ..}}}
+* {{{ln -s libdivsufsort/examples/mksary mksary}}}
+{{{
+git clone https://github.com/lh3/libdivsufsort.git
+cd libdivsufsort
+cmake -DCMAKE_BUILD_TYPE="Release" \
+    -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"
+make
+cd ..
+ln -s libdivsufsort/examples/mksary mksary
+}}}
 == Training data ==
 To build a new model, we need
+* a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
+* a plain text file (suffix .in) all in lowercase:
+{{{
+xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in
+}}}
 * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
+* and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
+* and compute the prefix tree:
+{{{
+python build_trie.py FILE_PREFIX [MINFREQ]
+}}}
 The model will be stored in the FILE_PREFIX.trie file.
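The three training steps (lowercasing, suffix array, prefix tree) can be illustrated on a toy string. The following is only a minimal sketch: the function names `suffix_array` and `build_trie`, the nested-dict layout, and the `max_len` parameter are assumptions made for illustration, and nothing here is taken from `lower.py`, `mksary` or `build_trie.py`. It assumes a character-level model and a naive suffix sort that is only suitable for very small inputs.

```python
from collections import Counter


def suffix_array(text):
    """Return the start offsets of all suffixes in lexicographic order.

    Naive construction for toy input; a real tool uses a far faster algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_trie(text, sa, max_len=3, min_freq=2):
    """Build a nested-dict prefix tree of substring counts.

    Substrings of up to max_len characters are counted, and those seen
    fewer than min_freq times are dropped (hypothetical analogue of MINFREQ).
    The count of a node is stored under the key None.
    """
    counts = Counter()
    for start in sa:
        for n in range(1, max_len + 1):
            if start + n <= len(text):
                counts[text[start:start + n]] += 1

    trie = {}
    for gram, freq in counts.items():
        if freq < min_freq:
            continue
        node = trie
        for ch in gram:
            node = node.setdefault(ch, {})
        node[None] = freq
    return trie


# Step 1: lowercase the input (the role lower.py plays in the pipeline above).
text = "Abracadabra".lower()
# Step 2: suffix array. Step 3: prefix tree with a minimum frequency of 2.
sa = suffix_array(text)
trie = build_trie(text, sa, max_len=3, min_freq=2)

print(sa)
print(trie["a"]["b"][None])  # how often "ab" occurs in the text
```

In this sketch the suffix array only fixes the iteration order. A real implementation would presumably exploit the sorted order to count repeated prefixes in a single pass instead of using a hash table.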
 …
 To generate a random text, just run
+{{{python alarm.py FILE_PREFIX.trie}}}
+{{{
+python alarm.py FILE_PREFIX.trie
+}}}
 You may try to generate a random sentence using the large 3 GB model:
+{{{python alarm.py cztenten.trie}}}
+{{{
+python alarm.py cztenten.trie
+}}}
 === Task ===