Changes between Version 16 and Version 17 of private/NlpInPracticeCourse/LanguageModelling


Timestamp:
Oct 30, 2017, 8:10:46 AM
Author:
Vít Baisa
Comment:

--

  • private/NlpInPracticeCourse/LanguageModelling

    v16  v17
    34   34    (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
    35   35   * get a large plain text: [[BR]]
    36         {{{scp anxur:/tmp/cztenten.txt.xz . }}} [[BR]]
    37          or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz . }}} [[BR]]
         36    {{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
         37     or [[BR]]
         38    {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
    38   39
    39   40   === mksary ===
     
    49   50
    50   51   To build a new model, we need
    51         * a plain text file (suffix .in) all in lowercase: {{{xz cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
         52   * a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
    52   53   * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
    53   54   * and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
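
The first step above pipes the decompressed corpus through `lower.py`, whose source is not shown on this page. The following is only a minimal sketch of what such a lowercasing filter might look like; the function name `lowercase_stream` is hypothetical and the real script may differ.

```python
import io


def lowercase_stream(src, dst):
    """Copy text from src to dst line by line, lowercasing each line.

    Hypothetical helper: the course's actual lower.py is not shown here.
    """
    for line in src:
        # str.lower() is Unicode-aware, so Czech letters such as "Ž" or "Ě"
        # are lowercased as well.
        dst.write(line.lower())


# Small in-memory demonstration with Czech text.
demo_out = io.StringIO()
lowercase_stream(io.StringIO("Ahoj SVĚTE\nŽluťoučký Kůň\n"), demo_out)
print(demo_out.getvalue(), end="")
```

To use it as a pipe filter in the way the command above does, the same function could be called as `lowercase_stream(sys.stdin, sys.stdout)`, assuming the terminal and the file share a UTF-8 encoding.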
     
    65   66   === Task ===
    66   67
    67         Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most naturally-looking sentences. Either by
         68   Change the training data and process to generate the most naturally-looking sentences. Either by
    68   69   * pre-processing the input plain text or
    69   70   * setting training parameters or
    70         * changing the generating process
    71         * or all above.
         71   * changing training and generating scripts (advanced).
    72   72
    73         Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tunings in README file where you can put some hilarious random sentence examples.
         73   Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tunings in README file where you can put some hilarious random sentence examples.
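
One of the options listed in the task is pre-processing the input plain text. As a hedged illustration only, a filter along the following lines could drop lines that are unlikely to be well-formed sentences before the text is lowercased and indexed. The heuristics, the default thresholds and the names `keep_sentence` and `filter_sentences` are assumptions made for this sketch and are not part of the course scripts.

```python
def keep_sentence(line, min_tokens=3, max_tokens=40, min_alpha_ratio=0.8):
    """Return True if the line looks like an ordinary sentence.

    All thresholds are illustrative assumptions, not tuned values.
    """
    tokens = line.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    # Share of alphabetic characters among all non-space characters.
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_ratio


def filter_sentences(lines, **kwargs):
    """Yield stripped lines that pass keep_sentence, skipping exact duplicates."""
    seen = set()
    for line in lines:
        line = line.strip()
        if line in seen or not keep_sentence(line, **kwargs):
            continue
        seen.add(line)
        yield line


sample = [
    "Dnes je venku opravdu hezky .",
    "12 345 678 90",                  # no letters: dropped
    "Ahoj",                           # too short: dropped
    "Dnes je venku opravdu hezky .",  # duplicate: dropped
]
for sentence in filter_sentences(sample):
    print(sentence)
```

Whether such filtering actually makes the generated sentences look more natural would have to be checked by rebuilding the model and comparing its output.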