Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/LanguageModelling


Timestamp:
Oct 30, 2017, 11:42:36 AM
Author:
Ales Horak
Comment:

--

  • private/NlpInPracticeCourse/LanguageModelling

    v18 v19  
    2626== Getting necessary data and tools ==
    2727
    28 * {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}}
     28* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])
    2929* {{{tar xf cblm.tar}}} in your directory
    3030* {{{cd cblm}}}
    31 * get a large Czech model (3 GB): [[BR]]
    32   {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
    33   or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
    34   (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
    35 * get a large plain text: [[BR]]
    36   {{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
    37    or [[BR]]
    38   {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
    39 * if you want to work with English, download {{{ententen_1M_sentences.txt.xz}}}
     31* get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])
     32  {{{
     33scp aurora:/corpora/data/cblm/data/cztenten.trie .
     34}}}
     35 
     36* get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])
     37  {{{
     38scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
     39}}}
     40 
     41* if you want to work with English, download
     42  [[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]
    4043
    4144=== mksary ===
    4245
    43 * {{{git clone https://github.com/lh3/libdivsufsort.git}}}
    44 * {{{cd libdivsufsort}}}
    45 * {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
    46 * {{{make}}}
    47 * {{{cd ..}}}
    48 * {{{ln -s libdivsufsort/examples/mksary mksary}}}
     46{{{
     47git clone https://github.com/lh3/libdivsufsort.git
     48cd libdivsufsort
     49cmake -DCMAKE_BUILD_TYPE="Release" \
     50    -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT" .
     51make
     52cd ..
     53ln -s libdivsufsort/examples/mksary mksary
     54}}}
    4955
    5056== Training data ==
    5157
    5258To build a new model, we need
    53 * a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
     59* a plain text file (suffix .in) all in lowercase:
     60{{{
     61xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in
     62}}}
    5463* to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
    55 * and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
     64* and compute the prefix tree:
     65{{{
     66python build_trie.py FILE_PREFIX [MINFREQ]
     67}}}
    5668
    5769The model will be stored in the FILE_PREFIX.trie file.
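
For intuition about the suffix-array step above, here is a minimal Python sketch. It does not reproduce how mksary or build_trie.py work internally: the array is built by naive sorting, whereas libdivsufsort uses a much faster algorithm, and the function names `build_suffix_array` and `count_occurrences` are hypothetical. It only illustrates why a suffix array is handy for language modelling: all occurrences of a string form one contiguous block of the array, so a frequency can be counted with two binary searches.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction by sorting the suffixes themselves; fine for toy
    inputs only, far too slow for a real corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count how many times pattern occurs in text, using suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return lo - start


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(count_occurrences(text, sa, "abra"))  # 2
```

Dividing such counts (for example, the count of a string by the count of the same string without its last character) gives the kind of relative-frequency estimate an n-gram model is built on.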
     
    6072
    6173To generate a random text, just run
    62 {{{python alarm.py FILE_PREFIX.trie}}}
     74{{{
     75python alarm.py FILE_PREFIX.trie
     76}}}
    6377
    6478You may try to generate a random sentence using the large 3 GB model:
    65 {{{python alarm.py cztenten.trie}}}
     79{{{
     80python alarm.py cztenten.trie
     81}}}
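
To see what generating random text from a language model means in principle, the following Python sketch trains a toy character n-gram model and samples from it. This is an assumption-laden illustration, not the algorithm used by alarm.py: the functions `train_char_ngrams` and `generate`, the fixed context length `order`, and the in-memory dictionary are all hypothetical stand-ins for the trie-based model.

```python
import random
from collections import Counter, defaultdict


def train_char_ngrams(text, order=3):
    """Map each context of `order` characters to a Counter of next characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def generate(counts, seed, length=40, rng=None):
    """Extend seed by up to `length` characters sampled from the model.

    The seed must be non-empty; its length is used as the context length,
    so it should match the `order` the model was trained with.
    """
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:
            break  # unseen context: stop instead of guessing
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Sample the next character in proportion to its observed frequency.
        out += rng.choices(chars, weights=weights)[0]
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the cat ate the rat. "
    model = train_char_ngrams(corpus, order=3)
    print(generate(model, "the", length=60))
```

A longer context makes the output look more like the training text but leaves more contexts unseen, which is the trade-off that smoothing and back-off techniques address.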
    6682
    6783=== Task ===