LanguageModelling

Timestamp:: Oct 30, 2017, 8:10:46 AM (8 years ago)
Author:: Vít Baisa
Comment:: --

private/NlpInPracticeCourse/LanguageModelling

-                      v16
+                      v17
   (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
 * get a large plain text: [[BR]]
+  {{{scp anxur:/tmp/cztenten.txt.xz . }}} [[BR]]
+   or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz . }}} [[BR]]
+  {{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
+   or [[BR]]
+  {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
 === mksary ===
 …
 To build a new model, we need
 * a plain text file (suffix .in) all in lowercase: {{{xz cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
+* a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
 * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
 * and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
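The decompress-and-lowercase step above can also be done in a single Python script. The following is only a sketch: `lower.py` itself is not shown on this page, so `lowercase_xz` and `lowercase_stream` are hypothetical stand-ins, and UTF-8 input is an assumption.

```python
import lzma
import sys


def lowercase_stream(src, dst):
    """Copy text from src to dst line by line, lowercased."""
    for line in src:
        dst.write(line.lower())


def lowercase_xz(xz_path, out_path):
    """Decompress an .xz text file and write a lowercased copy.

    Assumption: the corpus is UTF-8 encoded plain text.
    """
    with lzma.open(xz_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        lowercase_stream(src, dst)


# Hypothetical usage: python lower_xz.py cztenten_1M_sentences.txt.xz input.in
if __name__ == "__main__" and len(sys.argv) == 3:
    lowercase_xz(sys.argv[1], sys.argv[2])
```

Reading through `lzma.open` avoids the pitfall of calling plain `xz` on an `.xz` file, which compresses rather than decompresses by default.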
 …
 === Task ===
 Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most natural-looking sentences, either by
+Change the training data and process to generate the most natural-looking sentences, either by
 * pre-processing the input plain text or
 * setting training parameters or
+* changing the generating process
+* or all of the above.
+* changing training and generating scripts (advanced).
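As an illustration of the pre-processing option, the sketch below filters the training text before it is turned into the `.in` file. It assumes one sentence per line; the function names, the word-count thresholds, and the set of allowed punctuation are all hypothetical choices to experiment with, not part of the course scripts.

```python
def is_clean(sentence, min_words=3, max_words=40):
    """Return True for sentences of plausible length made of letters,
    spaces and basic punctuation only (thresholds are assumptions)."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Reject digits, URLs, markup and other symbols.
    return all(ch.isalpha() or ch in " ,.!?;:-'" for ch in sentence)


def clean_lines(lines, **kwargs):
    """Yield normalised, lowercased sentences that pass is_clean."""
    for line in lines:
        sentence = " ".join(line.split())  # collapse runs of whitespace
        if is_clean(sentence, **kwargs):
            yield sentence.lower()


sample = ["Dnes je  hezky venku.\n", "http://example.com 123\n", "Ano.\n"]
print(list(clean_lines(sample)))
```

Stricter filters shrink the training data, so it may be worth comparing generated sentences for a few different settings.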
 Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tuning in a README file, where you can also include some hilarious examples of random sentences.
+Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tuning in a README file, where you can also include some hilarious examples of random sentences.