= Machine translation =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Vít Baisa

== State of the Art ==

Statistical machine translation (SMT) consists of two main components: a language model of the target language, which is responsible for fluent, natural-sounding output sentences, and a translation model, which translates source words and phrases into the target language. Both models are probability distributions: the language model is estimated from a monolingual corpus, the translation model from a parallel corpus.
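
In the standard noisy-channel formulation (a textbook decomposition, given here only for orientation), the two models are combined as follows: the best translation e* of a source sentence f is

{{{
e* = argmax_e P(e | f) = argmax_e P(f | e) * P(e)
}}}

where P(e) is the language model and P(f | e) is the translation model.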

=== References ===

Approximately three current papers (preferably from top NLP conferences/journals, e.g. [[https://www.aclweb.org/anthology/|ACL Anthology]]) that will be used as a source for the one-hour lecture:

 1. Koehn, Philipp, et al. "Moses: Open Source Toolkit for Statistical Machine Translation." Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007.
 1. Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical Phrase-Based Translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics, 2003.
 1. Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.

== Workshop: generating a translation dictionary from parallel data ==

=== Basic instructions ===

 * download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with the scripts and training data
 * unpack it with {{{tar xzf ia161_mt.tar.gz}}}
 * a subdirectory {{{ia161_mt}}} will be created

=== Files ===

||czech.words||100,000 sentences from the Czech part of the DGT-TM parallel corpus||
||czech.lemmas||100,000 lemmatized sentences from the Czech part of DGT-TM||
||english.words||100,000 sentences from the English part of DGT-TM||
||english.lemmas||100,000 lemmatized sentences from the English part of DGT-TM||
||eval.py||a script that evaluates the coverage and precision of a generated dictionary against a small English-Czech reference dictionary||
||gnudfl.txt||a small English-Czech dictionary containing only one-word items and only words occurring in the training data||
||make_dict.py||a script that generates a translation dictionary from co-occurrence counts and frequency lists||
||Makefile||rules for building the dictionary from the training data||
||par2items.py||a script that generates pairs of words (lemmas) from the parallel data||

=== Description of make ===

{{{make dict}}}

 * the command uses 1,000 lines of the training data and generates a dictionary based on word forms (files czech.words and english.words)
 * alternative files with lemmas can be used via the parameters {{{L1DATA}}} and {{{L2DATA}}}
 * the number of lines used for the computation can also be changed (parameter {{{LIMIT}}})
 * in general: {{{make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]}}}
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}

The default of 1,000 lines is there for the sake of speed.

{{{make eval}}}

 * once the dictionary has been generated, you can measure its precision and coverage with the script {{{eval.py}}}: {{{make eval}}}
 * if you used the parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them for {{{make eval}}}
 * e.g.: {{{make eval L1DATA=english.lemmas L2DATA=czech.lemmas}}}
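
The evaluation reports an F-score combining the two numbers. The exact computation is in {{{eval.py}}}; as a rough guide, for precision P and coverage C the usual definition is the harmonic mean:

{{{
F = 2 * P * C / (P + C)
}}}

so improving one of the two measures at the expense of the other will not help much.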

{{{make clean}}}

 * after any change to the input files, the scripts, or the parameters, clean the temporary files first: {{{make clean}}}
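
A complete cycle over the lemmatized data, with the data parameters repeated for the evaluation as described above, therefore looks like this:

{{{
make clean
make dict L1DATA=english.lemmas L2DATA=czech.lemmas
make eval L1DATA=english.lemmas L2DATA=czech.lemmas
}}}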

== Detailed description of the scripts and generated data ==

 * Try running the default {{{make dict}}} and look at the results:
   * czech.words.freq
   * english.words.freq
   * english.words-czech.words.cofreq
   * english.words-czech.words.dict (the resulting dictionary)
 * Look at the sizes of the output files (how many lines they contain) and at their contents.
 * Look at the script {{{make_dict.py}}}, which generates the dictionary: at the key places it contains {{{TODO}}} comments.
 * There you can change the script, add heuristics, change conditions etc. so that the final F-score is as high as possible; an illustrative sketch of the general approach follows below.
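
To give an idea of what the pipeline computes, here is a minimal, self-contained sketch of the general approach: count sentence-level co-occurrences from the aligned lines and score candidate pairs, e.g. with the Dice coefficient. It is not the code of {{{par2items.py}}} or {{{make_dict.py}}} (the actual scripts work through the *.freq and *.cofreq intermediate files built by the Makefile); the frequency floor and the scoring formula are only assumptions for the illustration.

{{{
# Illustrative sketch only, NOT the code of par2items.py / make_dict.py.
from collections import Counter
from itertools import islice, product

LIMIT = 1000  # number of parallel lines to use, cf. the LIMIT parameter above

def read_sentences(path, limit):
    """Read the first `limit` lines and split them into lowercased tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.lower().split() for line in islice(f, limit)]

src_sents = read_sentences("english.words", LIMIT)
trg_sents = read_sentences("czech.words", LIMIT)

src_freq, trg_freq, cofreq = Counter(), Counter(), Counter()
for src, trg in zip(src_sents, trg_sents):
    src_types, trg_types = set(src), set(trg)
    src_freq.update(src_types)
    trg_freq.update(trg_types)
    # every source word co-occurs with every target word of the aligned sentence
    cofreq.update(product(src_types, trg_types))

# score candidate pairs with the Dice coefficient and keep the best target word
# for each source word; a frequency floor (assumed value) filters out rare words
best = {}
for (s, t), c in cofreq.items():
    if src_freq[s] < 5 or trg_freq[t] < 5:
        continue
    dice = 2.0 * c / (src_freq[s] + trg_freq[t])
    if s not in best or dice > best[s][1]:
        best[s] = (t, dice)

for s, (t, dice) in sorted(best.items()):
    print(f"{s}\t{t}\t{dice:.3f}")
}}}

In the workshop scripts the same kind of decisions (which pairs to keep, how to score them, where to cut off rare words) are exactly the {{{TODO}}} places worth experimenting with.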

== Assignment ==

 1. Change the key places in the scripts {{{par2items.py}}} and {{{make_dict.py}}} to achieve the highest possible F-score (see {{{make eval}}}).
 1. Upload all the scripts to the vault in one archive file.
 1. You can create it like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}