Changes between Version 14 and Version 15 of private/NlpInPracticeCourse/MachineTranslation
Timestamp: Oct 2, 2017, 9:11:09 AM
…

 * download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with scripts and training data
 * unzip with {{{tar xzf ia161_mt.tar.gz}}}
 * a subdirectory {{{ia161_mt}}} will be created

=== Files ===

||czech.words||100,000 sentences from the Czech part of DGT-TM||
…
||english.words||100,000 sentences from the English DGT||
||english.lemmas||100,000 sentences (lemmas) from the English DGT||
||eval.py||a script that evaluates the coverage and precision of a generated dictionary against a small English-Czech dictionary||
||gnudfl.txt||a small English-Czech dictionary containing only one-word entries and words from the training data||
||make_dict.py||a script that generates a translation dictionary based on co-occurrences and frequency lists||
||Makefile||rules for building the dictionary from the training data||
||par2items.py||a script that generates pairs of words (lemmas) from the parallel data||

…

{{{make dict}}}

 * the command uses 1,000 lines of the training data and generates a dictionary based on wordforms (files czech.words and english.words)
 * alternative files with lemmas can be used via the parameters L1DATA and L2DATA
 * the number of lines used for the computation can also be changed (parameter LIMIT)
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}

The default of 1,000 lines is for the sake of speed.

{{{make eval}}}

 * once the dictionary is generated, you can measure its precision and coverage with the script eval.py: {{{make eval}}}
 * if you use the parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them for {{{make eval}}}
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas}}}

…

== Assignment ==

 1. Modify the key places in the scripts {{{par2items.py}}} and {{{make_dict.py}}} to achieve the highest possible f-score (see {{{make eval}}}).
 1. Upload all the scripts into the vault in one archive file.
 1. You can create the archive like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}
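For intuition, the co-occurrence approach behind generating the dictionary can be sketched as follows. This is a minimal illustration only, not the course's actual {{{make_dict.py}}}: the function name, the Dice-coefficient scoring, and the toy data are all assumptions made for the example.

```python
# Sketch of dictionary induction from sentence-aligned parallel data:
# count sentence-level co-occurrences, then pair each source word with
# the target word that maximizes a Dice-style association score.
# (Illustrative only; the real make_dict.py may score differently.)
from collections import Counter
from itertools import product


def build_dict(src_sentences, tgt_sentences, min_count=1):
    """Return a {source_word: best_target_word} dictionary."""
    src_freq, tgt_freq, cooc = Counter(), Counter(), Counter()
    for src, tgt in zip(src_sentences, tgt_sentences):
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # count each word pair once per aligned sentence pair
        cooc.update(product(src_words, tgt_words))

    dictionary = {}
    for s in src_freq:
        best = max(
            (t for t in tgt_freq if cooc[(s, t)] >= min_count),
            key=lambda t: 2 * cooc[(s, t)] / (src_freq[s] + tgt_freq[t]),
            default=None,
        )
        if best is not None:
            dictionary[s] = best
    return dictionary


if __name__ == "__main__":
    en = ["the dog runs", "the cat sleeps", "a dog barks"]
    cz = ["pes bezi", "kocka spi", "pes steka"]
    print(build_dict(en, cz)["dog"])  # -> pes
```

Raising {{{min_count}}} trades coverage for precision, which is exactly the trade-off that {{{make eval}}} measures as f-score.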