Changes between Version 15 and Version 16 of private/NlpInPracticeCourse/LanguageModelling


Timestamp:
Oct 30, 2017, 7:52:42 AM
Author:
Vít Baisa
Comment:

--

  • private/NlpInPracticeCourse/LanguageModelling

    v15 v16  
    3131* get a large Czech model (3 GB): [[BR]]
    3232  {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
     33  or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
    3334  (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
     35* get a large plain text: [[BR]]
     36  {{{scp anxur:/tmp/cztenten.txt.xz . }}} [[BR]]
     37   or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz . }}} [[BR]]
    3438
    3539=== mksary ===
     
    3943* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
    4044* {{{make}}}
    41 * {{{cd ...}}}
     45* {{{cd ..}}}
    4246* {{{ln -s libdivsufsort/examples/mksary mksary}}}
    4347
     
    4549
    4650To build a new model, we need
    47 * a plain text, see {{{data}}} directory, use {{{lower.py}}}
    48 * to create a suffix array {{{./mksary INPUT.txt OUTPUT.sa}}}
    49 * and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUPUTFILE]}}}
     51* a plain text file (suffix .in), all in lowercase: {{{xz -dc cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
     52* to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
     53* and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
    5054
    51 In .trie file, the model is stored.
     55The model will be stored in the file FILE_PREFIX.trie.
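
To make the role of the suffix array more concrete, here is a minimal sketch in Python. It is only an illustration and not the code of {{{mksary}}} or {{{build_trie.py}}}: the function names {{{naive_suffix_array}}} and {{{count_occurrences}}} are hypothetical, and the quadratic sorting used here is far too slow for real corpora, which is why the course uses libdivsufsort. The sketch shows that once all suffixes are sorted, every substring occupies one contiguous block, so its frequency can be read off with two binary searches.

```python
import bisect


def naive_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted by suffix.

    Illustrative only: sorting full suffixes is slow on large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count how often pattern occurs in text using the suffix array sa."""
    # Truncating sorted suffixes to len(pattern) keeps them in sorted order,
    # so all occurrences of pattern form one contiguous block.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return hi - lo


text = "banana"
sa = naive_suffix_array(text)
print(sa)
print(count_occurrences(text, sa, "ana"))
```

Counts like these are presumably what a frequency threshold such as MINFREQ is applied to, but that is an assumption about {{{build_trie.py}}}, not something stated on this page.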
    5256
    5357== Generating text ==
    5458
    5559To generate a random text, just run
    56 {{{python alarm.py FILE.trie}}}
     60{{{python alarm.py FILE_PREFIX.trie}}}
     61
     62You may try to generate a random sentence using the large 3 GB model:
     63{{{python alarm.py cztenten.trie}}}
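
As a rough picture of what generating random text from a character-level model means, the following sketch trains a tiny fixed-order model and samples from it. It is not the algorithm of {{{alarm.py}}}, which works on the .trie file; the names {{{train}}} and {{{generate}}}, the fixed {{{order}}} parameter and the toy training string are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train(text, order=2):
    """Map each context of `order` characters to counts of the next character."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model


def generate(model, start, order=2, length=40, rng=None):
    """Extend `start` by up to `length` characters sampled from the model.

    `start` must contain at least `order` characters.
    """
    rng = rng or random.Random()
    out = start
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:
            break  # unseen context: stop instead of guessing
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Sample the next character proportionally to its observed frequency.
        out += rng.choices(chars, weights=weights)[0]
    return out


model = train("ahoj svete, ahoj lidi, ahoj vsichni", order=2)
print(generate(model, "ah", order=2, length=30, rng=random.Random(0)))
```

A longer context usually gives more natural output but copies the training text more closely, which is one of the trade-offs to explore in the task.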
    5764
    5865=== Task ===
    5966
    60 Change the training process and the generating process to generate the most naturally-looking sentences. Either by
     67Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most natural-looking sentences, either by
    6168* pre-processing the input plain text or
    6269* setting training parameters or
    63 * changing generating process
     70* changing the generating process
    6471* or all of the above.
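
For the pre-processing option, one possible starting point is a line filter like the sketch below. It is a hypothetical helper, not part of the course scripts: the name {{{clean_line}}} and the threshold {{{min_letters}}} are assumptions, and whether such filtering actually improves the generated sentences is something to verify experimentally.

```python
import re


def clean_line(line, min_letters=10):
    """Normalize one input line, or return None if it should be dropped."""
    # Collapse runs of whitespace and lowercase the text.
    line = re.sub(r"\s+", " ", line).strip().lower()
    # Drop lines that contain too few letters (e.g. numbers, table residue).
    letters = sum(ch.isalpha() for ch in line)
    if letters < min_letters:
        return None
    return line


print(clean_line("  Ahoj   SVĚTE, jak se máš?\n"))
print(clean_line("12 34"))
```

The function could be applied to each line of the plain text before the suffix array is built, writing only the lines for which it does not return None.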
    6572
    66 Upload 10,000 random sentences to your vault. Describe your changes, tunings in README where you can put some hilarious random sentence examples.
     73Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tunings in a README file, where you can also include some hilarious examples of random sentences.