Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/LanguageModelling
Timestamp: Oct 30, 2017, 11:42:36 AM
private/NlpInPracticeCourse/LanguageModelling
--- v18
+++ v19
@@ -26,32 +26,44 @@
 == Getting necessary data and tools ==
 
-* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}}
+* {{{wget nlp.fi.muni.cz/~xbaisa/cblm.tar}}} (or download from a [[htdocs:bigdata/cblm.tar|stable copy]])
 * {{{tar xf cblm.tar}}} in your directory
 * {{{cd cblm}}}
-* get a large Czech model (3 GB): [[BR]]
-{{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
-or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
-(or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
-* get a large plain text: [[BR]]
-{{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
-or [[BR]]
-{{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
-* if you want to work with English, download {{{ententen_1M_sentences.txt.xz}}}
+* get a large Czech model (3 GB): ([[htdocs:bigdata/cztenten.trie|stable copy]])
+{{{
+scp aurora:/corpora/data/cblm/data/cztenten.trie .
+}}}
+
+* get a large plain text: ([[htdocs:bigdata/cztenten_1M_sentences.txt.xz|stable copy]])
+{{{
+scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .
+}}}
+
+* if you want to work with English, download
+[[htdocs:bigdata/ententen_1M_sentences.txt.xz|ententen_1M_sentences.txt.xz]]
 
 === mksary ===
 
-* {{{git clone https://github.com/lh3/libdivsufsort.git}}}
-* {{{cd libdivsufsort}}}
-* {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
-* {{{make}}}
-* {{{cd ..}}}
-* {{{ln -s libdivsufsort/examples/mksary mksary}}}
+{{{
+git clone https://github.com/lh3/libdivsufsort.git
+cd libdivsufsort
+cmake -DCMAKE_BUILD_TYPE="Release" \
+    -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"
+make
+cd ..
+ln -s libdivsufsort/examples/mksary mksary
+}}}
 
 == Training data ==
 
 To build a new model, we need
-* a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
+* a plain text file (suffix .in) all in lowercase:
+{{{
+xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in
+}}}
 * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
-* and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
+* and compute the prefix tree:
+{{{
+python build_trie.py FILE_PREFIX [MINFREQ]
+}}}
 
 The model will be stored in FILE_PREFIX.trie file.
@@ -60,8 +72,12 @@
 
 To generate a random text, just run
-{{{python alarm.py FILE_PREFIX.trie}}}
+{{{
+python alarm.py FILE_PREFIX.trie
+}}}
 
 You may try to generate a random sentence using the large 3G model:
-{{{python alarm.py cztenten.trie}}}
+{{{
+python alarm.py cztenten.trie
+}}}
 
 === Task ===