Changes between Version 15 and Version 16 of private/NlpInPracticeCourse/LanguageModelling
- Timestamp: Oct 30, 2017, 7:52:42 AM
private/NlpInPracticeCourse/LanguageModelling
--- v15
+++ v16
@@ -31,5 +31,9 @@
  * get a large Czech model (3 GB): [[BR]]
    {{{scp anxur:/tmp/cztenten.trie .}}} [[BR]]
+   or {{{scp aurora:/corpora/data/cblm/data/cztenten.trie .}}} [[BR]]
    (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
+ * get a large plain text: [[BR]]
+   {{{scp anxur:/tmp/cztenten.txt.xz . }}} [[BR]]
+   or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz . }}} [[BR]]
 
 === mksary ===
…
@@ -39,5 +43,5 @@
  * {{{cmake -DCMAKE_BUILD_TYPE="Release" -DCMAKE_INSTALL_PREFIX="/ABSOLUTE_PATH_TO_LIBDIVSUFSORT"}}}
  * {{{make}}}
- * {{{cd .. .}}}
+ * {{{cd ..}}}
  * {{{ln -s libdivsufsort/examples/mksary mksary}}}
 
…
@@ -45,22 +49,25 @@
 
 To build a new model, we need
- * a plain text, see {{{data}}} directory, use {{{lower.py}}}
- * to create a suffix array {{{./mksary INPUT.txt OUTPUT.sa}}}
- * and compute the prefix tree: {{{python build_trie.py FILE.sa [MINFREQ] [OUPUTFILE]}}}
+ * a plain text file (suffix .in) all in lowercase: {{{xz cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
+ * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
+ * and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
 
-In .trie file, the model is stored.
+The model will be stored in FILE_PREFIX.trie file.
 
 == Generating text ==
 
 To generate a random text, just run
-{{{python alarm.py FILE.trie}}}
+{{{python alarm.py FILE_PREFIX.trie}}}
+
+You may try to generate a random sentence using the large 3G model:
+{{{python alarm.py cztenten.trie}}}
 
 === Task ===
 
-Change the training process and the generating process to generate the most naturally-looking sentences. Either by
+Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most naturally-looking sentences. Either by
  * pre-processing the input plain text or
  * setting training parameters or
- * changing generating process
+ * changing the generating process
  * or all above.
 
-Upload 10,000 random sentences to your vault. Describe your changes, tunings in README where you can put some hilarious random sentence examples.
+Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tunings in README file where you can put some hilarious random sentence examples.
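
For the "pre-processing the input plain text" option of the task, a minimal Python sketch might look like the following. It is not the course's {{{lower.py}}}; the function names {{{clean_sentences}}} and {{{preprocess}}}, the token-count thresholds, the UTF-8 encoding and the one-sentence-per-line layout of the input are all assumptions made for illustration. Note also that plain {{{xz FILE.xz}}} does not write decompressed text to standard output ({{{xz -dc}}} or {{{xzcat}}} would); the sketch sidesteps this by reading the compressed file directly with the standard {{{lzma}}} module.

```python
import lzma


def clean_sentences(lines, min_tokens=3, max_tokens=40):
    """Yield lowercased sentences whose token count lies within the bounds.

    The thresholds are hypothetical tuning knobs, not values from the course.
    """
    for line in lines:
        # Collapse runs of whitespace and lowercase the sentence.
        sentence = " ".join(line.split()).lower()
        n_tokens = len(sentence.split())
        if min_tokens <= n_tokens <= max_tokens:
            yield sentence


def preprocess(xz_path, out_path, **kwargs):
    """Decompress xz_path on the fly and write cleaned sentences to out_path.

    Assumes UTF-8 text with one sentence per line.
    """
    with lzma.open(xz_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for sentence in clean_sentences(src, **kwargs):
            dst.write(sentence + "\n")
```

Calling {{{preprocess("cztenten_1M_sentences.txt.xz", "input.in")}}} would then produce a lowercase {{{input.in}}} that can be passed to {{{mksary}}}. Dropping very short and very long lines is only one possible filter; whether it actually yields more natural-looking output has to be checked on the generated sentences.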