Changes between Version 6 and Version 7 of private/NlpInPracticeCourse/LanguageModelling


Timestamp:
Nov 2, 2015, 8:01:00 AM (8 years ago)
Author:
Vít Baisa

  • private/NlpInPracticeCourse/LanguageModelling

    v6 v7  
    2020We will build a simple character-based language model and use it to generate natural-looking sentences. We need a plain text and a fast suffix-sorting tool (mksary).
    2121
     22== Getting necessary data and tools ==
    2223
     24* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar.gz}}}
     25* {{{tar xvzf cblm.tar.gz}}} in your directory
     26* {{{cd cblm}}}
    2327
     28== Training data ==
     29
     30To build a new model, we need
     31* a plain text (see the {{{data}}} directory; use {{{lower.py}}}),
     32* a suffix array, created with {{{mksary INPUT.txt OUTPUT.sa}}},
     33* and the prefix tree, computed with {{{python build_trie.py FILE.sa [MINFREQ] [OUTPUTFILE]}}}
     34
     35The model is stored in the .trie file.
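
To illustrate why a suffix array is useful here, the following is a minimal sketch, not the code of {{{mksary}}} or {{{build_trie.py}}}. It assumes a naive suffix sort and hypothetical helper names ({{{suffix_array}}}, {{{prefix_counts}}}); the {{{min_freq}}} parameter only mimics what MINFREQ presumably does. In a sorted suffix array, all suffixes sharing a prefix are adjacent, so n-gram frequencies can be read off as run lengths.

```python
from itertools import groupby


def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive O(n^2 log n) sort, for illustration only; a dedicated tool
    is needed for real corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_counts(text, sa, max_len, min_freq=1):
    """Count character n-grams (n = 1..max_len) using the sorted suffixes.

    Suffixes that share a prefix are adjacent in sa, so each n-gram's
    frequency is the length of one contiguous run.
    """
    counts = {}
    for n in range(1, max_len + 1):
        # Skip suffixes shorter than n; the rest keep their sorted order.
        prefixes = (text[i:i + n] for i in sa if i + n <= len(text))
        for prefix, run in groupby(prefixes):
            freq = sum(1 for _ in run)
            if freq >= min_freq:
                counts[prefix] = freq
    return counts


text = "abracadabra"
sa = suffix_array(text)
counts = prefix_counts(text, sa, max_len=3, min_freq=2)
print(sa)
print(counts)
```

The resulting dictionary plays the role of a flattened prefix tree: each key is a path from the root, and its value is the frequency stored at that node.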
     36
     37== Generating text ==
     38
     39To generate random text, run
     40{{{python alarm.py FILE.trie}}}
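
The internals of {{{alarm.py}}} are not shown here, so the following is only a sketch of one common way to generate text from character counts. The function names ({{{train}}}, {{{generate}}}), the {{{order}}} parameter and the back-off strategy are assumptions for illustration: the next character is sampled in proportion to how often it followed the longest known context.

```python
import random


def train(text, order):
    """Map each context of 0..order characters to next-character counts."""
    model = {}
    for i, char in enumerate(text):
        for n in range(min(order, i) + 1):
            context = text[i - n:i]
            followers = model.setdefault(context, {})
            followers[char] = followers.get(char, 0) + 1
    return model


def generate(model, order, length, rng):
    """Generate `length` characters, backing off to shorter contexts."""
    out = ""
    while len(out) < length:
        # Try the longest context first; the empty context always exists.
        for n in range(min(order, len(out)), -1, -1):
            context = out[len(out) - n:]
            if context in model:
                break
        chars, weights = zip(*sorted(model[context].items()))
        out += rng.choices(chars, weights=weights)[0]
    return out


text = "the cat sat on the mat and the rat sat on the cat "
model = train(text, order=3)
sample = generate(model, order=3, length=40, rng=random.Random(0))
print(sample)
```

Changing {{{order}}}, or replacing the proportional sampling with something greedier or more random, is one way to experiment with the generation process.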
    2441
    2542=== Task ===
     
    3047* changing the generation process
    3148* or all of the above.
     49
     50Upload 10,000 random sentences to your vault. Describe your changes and tuning in a README, where you can also include some hilarious examples of random sentences.