Changes between Version 16 and Version 17 of private/NlpInPracticeCourse/LanguageModelling
Timestamp: Oct 30, 2017, 8:10:46 AM (6 years ago)
private/NlpInPracticeCourse/LanguageModelling
 v16  v17
  34   34     (or download from a [[htdocs:bigdata/cztenten.trie|stable copy]])
  35   35     * get a large plain text: [[BR]]
  36        - {{{scp anxur:/tmp/cztenten.txt.xz . }}} [[BR]]
  37        - or {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz . }}} [[BR]]
       36   + {{{scp anxur:/tmp/cztenten_1M_sentences.txt.xz .}}} [[BR]]
       37   + or [[BR]]
       38   + {{{scp aurora:/corpora/data/cblm/data/cztenten_1M_sentences.txt.xz .}}}
  38   39
  39   40     === mksary ===
   …    …
  49   50
  50   51     To build a new model, we need
  51        - * a plain text file (suffix .in) all in lowercase: {{{xz cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
       52   + * a plain text file (suffix .in) all in lowercase: {{{xzcat cztenten_1M_sentences.txt.xz | python lower.py > input.in}}}
  52   53     * to create a suffix array {{{./mksary INPUT.in OUTPUT.sa}}}
  53   54     * and compute the prefix tree: {{{python build_trie.py FILE_PREFIX [MINFREQ]}}}
   …    …
  65   66     === Task ===
  66   67
  67        - Change the training process (build_trie.py) and the generating process (alarm.py) to generate the most naturally-looking sentences. Either by
       68   + Change the training data and process to generate the most naturally-looking sentences. Either by
  68   69     * pre-processing the input plain text or
  69   70     * setting training parameters or
  70        - * changing the generating process
  71        - * or all above.
       71   + * changing training and generating scripts (advanced).
  72   72
  73        - Upload 10,000 random sentences to your vault together with the amended scripts. Describe your changes and tunings in README file where you can put some hilarious random sentence examples.
       73   + Upload 10,000 random sentences to your vault together with the amended scripts and your models. Describe your changes and tunings in README file where you can put some hilarious random sentence examples.
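
The corrected command on v17 line 52 pipes the decompressed corpus through lower.py, a script this diff does not show. The following is only a minimal sketch of what such a lowercasing filter might look like, assuming Python 3 and a UTF-8 locale. The function names `lowercase_line` and `lowercase_stream` are hypothetical, and the real lower.py may differ.

```python
import sys


def lowercase_line(line: str) -> str:
    """Return the line lowercased with Unicode-aware rules (handles Czech diacritics)."""
    return line.lower()


def lowercase_stream(src, dst) -> int:
    """Copy src to dst line by line, lowercased; return the number of lines written."""
    count = 0
    for line in src:
        dst.write(lowercase_line(line))
        count += 1
    return count


if __name__ == "__main__":
    # Assumption: stdin and stdout are decoded/encoded as UTF-8 by the locale.
    lowercase_stream(sys.stdin, sys.stdout)
```

Saved under a hypothetical name such as my_lower.py, the sketch would slot into the same pipeline: {{{xzcat cztenten_1M_sentences.txt.xz | python3 my_lower.py > input.in}}}. Streaming line by line keeps memory use flat even for a corpus of a million sentences.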