MachineTranslation

Timestamp:: Oct 2, 2017, 9:11:09 AM (8 years ago)
Author:: Vít Baisa
Comment:: --

private/NlpInPracticeCourse/MachineTranslation

-                      v14
+                      v15
 * download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with scripts and train data
 * unzip into home directory with {{{tar xzf ia161_mt.tar.gz}}}
 * a new subdir will be created: {{{it161_mt}}}
+* unzip with {{{tar xzf ia161_mt.tar.gz}}}
+* subdir {{{ia161_mt}}} will be created
 === Files in the archive ===
+=== Files ===
 ||czech.words||100,000 sentences from Czech part of DGT-TM||
 …
 ||english.words||100,000 sentences from English DGT||
 ||english.lemmas||100,000 sentences (lemmas) from EN DGT||
 ||eval.py||a script for evaluation of coverage and precision of a generated dictionary in comparison with a small English-Czech dictionary||
 ||gnudfl.txt||a small English-Czech dictionary containing only one-word items and words from the train data||
 ||make_dict.py||a script for generating dictionary based on co-occurrences and frequency lists||
 ||Makefile||a file with rules for building the dictionary based on the train data||
+||eval.py||a script for evaluation of coverage and precision of a generated dictionary using a small English-Czech dictionary||
+||gnudfl.txt||a small English-Czech dictionary containing only one-word items and words from the training data||
+||make_dict.py||a script for generating a translation dictionary based on co-occurrences and frequency lists||
+||Makefile||a file with rules for building the dictionary based on the training data||
 ||par2items.py||a file for generating pairs of words (lemmas) from the parallel data||
 …
 {{{make dict}}}
 * the command uses 1,000 lines from train data and generates a dictionary based on wordforms (files czech.words and english.words)
+* the command uses 1,000 lines from training data and generates a dictionary based on wordforms (files czech.words and english.words)
 * it is possible to use alternative files with lemmas using parameters L1DATA and L2DATA
 * it is also possible to change the number of lines used for the computation (parameter LIMIT)
 …
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}
+The default of 1,000 lines is for the sake of speed.
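To make the idea behind {{{make dict}}} concrete, here is a minimal sketch of a co-occurrence-based dictionary. It is not the code of {{{make_dict.py}}} or {{{par2items.py}}}: the function name {{{build_dictionary}}}, the {{{limit}}} argument (mirroring LIMIT) and the choice of the Dice coefficient as the association score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def build_dictionary(l1_sentences, l2_sentences, limit=1000):
    """Map each L1 word to the L2 word it is most strongly associated with.

    Hypothetical sketch: association is measured with the Dice coefficient
    over sentence-level co-occurrence, which is an assumed scoring choice.
    """
    l1_freq, l2_freq, pair_freq = Counter(), Counter(), Counter()

    # Use only the first `limit` aligned sentence pairs.
    for l1_line, l2_line in islice(zip(l1_sentences, l2_sentences), limit):
        l1_words = set(l1_line.split())
        l2_words = set(l2_line.split())
        l1_freq.update(l1_words)   # in how many sentences each L1 word occurs
        l2_freq.update(l2_words)   # in how many sentences each L2 word occurs
        pair_freq.update((w1, w2) for w1 in l1_words for w2 in l2_words)

    best = {}
    for (w1, w2), together in pair_freq.items():
        score = 2 * together / (l1_freq[w1] + l2_freq[w2])  # Dice coefficient
        if w1 not in best or score > best[w1][1]:
            best[w1] = (w2, score)

    return {w1: w2 for w1, (w2, _) in best.items()}


if __name__ == "__main__":
    english = ["the dog", "the tree", "a dog"]
    czech = ["ten pes", "ten strom", "jeden pes"]
    print(build_dictionary(english, czech))
```

A larger {{{limit}}} gives the frequency counts more evidence to separate true translations from accidental co-occurrences, at the cost of a longer run time.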
 {{{make eval}}}
 * when a dictionary is generated, you can measure its precision and coverage using script eval.py: {{{make eval}}}.
+* when the dictionary is generated, you can measure its precision and coverage using script eval.py: {{{make eval}}}.
 * if you use parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them {{{make eval}}}
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas}}}
 …
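The measures reported by {{{make eval}}} can be illustrated with a small sketch. The exact definitions used by {{{eval.py}}} are not given here, so the ones below are assumptions: precision is the share of checkable entries whose translation is accepted by the reference dictionary, coverage is the share of reference headwords that the generated dictionary contains, and the f-score is their harmonic mean. The function name {{{evaluate}}} and the data layout are hypothetical.

```python
def evaluate(generated, reference):
    """Return (precision, coverage, f_score) of a generated dictionary.

    generated: dict mapping a source word to one proposed translation
    reference: dict mapping a source word to a set of accepted translations
    All three measures follow assumed definitions, not those of eval.py.
    """
    # Only entries whose headword is in the reference can be judged.
    checked = [word for word in generated if word in reference]
    correct = sum(1 for word in checked if generated[word] in reference[word])

    precision = correct / len(checked) if checked else 0.0
    coverage = len(checked) / len(reference) if reference else 0.0

    total = precision + coverage
    f_score = 2 * precision * coverage / total if total else 0.0
    return precision, coverage, f_score


if __name__ == "__main__":
    generated = {"dog": "pes", "tree": "les", "a": "jeden"}
    reference = {"dog": {"pes"}, "tree": {"strom"}, "cat": {"kocka"}, "house": {"dum"}}
    print(evaluate(generated, reference))
```

Under these definitions, a dictionary that only keeps its most confident entries gains precision but loses coverage, which is why a combined score is the more useful target.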
 == Assignment ==
 . Change the key places of scripts {{{par2items.py}}}, {{{make_dict.py}}} so to achieve the highest possible f-score (see {{{make eval}}}).
+. Change the key places of the scripts {{{par2items.py}}} and {{{make_dict.py}}} to achieve the highest possible f-score (see {{{make eval}}}).
 . Upload all the scripts into the vault in one archive file.
 . You can create it like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}