= Machine translation =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Vít Baisa

== State of the Art ==

Statistical machine translation (SMT) consists of two main components: a language model of the target language, which is responsible for fluent, natural-sounding output sentences, and a translation model, which translates source words and phrases into the target language. Both models are probability distributions: the language model is estimated from a monolingual corpus, the translation model from a parallel corpus.
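
In the standard noisy-channel formulation (a textbook decomposition, given here only for orientation), the two models are combined as follows: the best translation e* of a source sentence f is

{{{
e* = argmax_e P(e | f) = argmax_e P(f | e) * P(e)
}}}

where P(e) is the language model and P(f | e) is the translation model.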

=== References ===

Approximately three current papers (preferably from top NLP conferences/journals, e.g. [[https://www.aclweb.org/anthology/|ACL Anthology]]) that will be used as a source for the one-hour lecture:

 1. Koehn, Philipp, et al. "Moses: Open Source Toolkit for Statistical Machine Translation." Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007.
 1. Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical Phrase-Based Translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. Association for Computational Linguistics, 2003.
 1. Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.

== Workshop: generating a translation dictionary from parallel data ==

=== Basic instructions ===

 * download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with the scripts and training data
 * unpack it with {{{tar xzf ia161_mt.tar.gz}}}
 * a subdirectory {{{ia161_mt}}} will be created

=== Files ===

||czech.words||100,000 sentences from the Czech part of the DGT-TM parallel corpus||
||czech.lemmas||100,000 lemmatized sentences from the Czech part of DGT-TM||
||english.words||100,000 sentences from the English part of DGT-TM||
||english.lemmas||100,000 lemmatized sentences from the English part of DGT-TM||
||eval.py||a script that evaluates the coverage and precision of a generated dictionary against a small English-Czech reference dictionary||
||gnudfl.txt||a small English-Czech dictionary containing only one-word items and only words occurring in the training data||
||make_dict.py||a script that generates a translation dictionary from co-occurrence counts and frequency lists||
||Makefile||rules for building the dictionary from the training data||
||par2items.py||a script that generates pairs of words (lemmas) from the parallel data||

=== Description of make ===

{{{make dict}}}

 * the command uses 1,000 lines of the training data and generates a dictionary based on word forms (files czech.words and english.words)
 * alternative files with lemmas can be used via the parameters {{{L1DATA}}} and {{{L2DATA}}}
 * the number of lines used for the computation can also be changed (parameter {{{LIMIT}}})
 * in general: {{{make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]}}}
 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}

The default of 1,000 lines is there for the sake of speed.

{{{make eval}}}

 * once the dictionary has been generated, you can measure its precision and coverage with the script {{{eval.py}}}: {{{make eval}}}
 * if you used the parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them for {{{make eval}}}
 * e.g.: {{{make eval L1DATA=english.lemmas L2DATA=czech.lemmas}}}
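
The evaluation reports an F-score combining the two numbers. The exact computation is in {{{eval.py}}}; as a rough guide, for precision P and coverage C the usual definition is the harmonic mean:

{{{
F = 2 * P * C / (P + C)
}}}

so improving one of the two measures at the expense of the other will not help much.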

{{{make clean}}}

 * after any change to the input files, the scripts, or the parameters, clean the temporary files first: {{{make clean}}}
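
A complete cycle over the lemmatized data, with the data parameters repeated for the evaluation as described above, therefore looks like this:

{{{
make clean
make dict L1DATA=english.lemmas L2DATA=czech.lemmas
make eval L1DATA=english.lemmas L2DATA=czech.lemmas
}}}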

== Detailed description of the scripts and generated data ==

 * Try running the default {{{make dict}}} and look at the results:
   * czech.words.freq
   * english.words.freq
   * english.words-czech.words.cofreq
   * english.words-czech.words.dict (the resulting dictionary)
 * Look at the sizes of the output files (how many lines they contain) and at their contents.
 * Look at the script {{{make_dict.py}}}, which generates the dictionary: at the key places it contains {{{TODO}}} comments.
 * There you can change the script, add heuristics, change conditions etc. so that the final F-score is as high as possible; an illustrative sketch of the general approach follows below.
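
To give an idea of what the pipeline computes, here is a minimal, self-contained sketch of the general approach: count sentence-level co-occurrences from the aligned lines and score candidate pairs, e.g. with the Dice coefficient. It is not the code of {{{par2items.py}}} or {{{make_dict.py}}} (the actual scripts work through the *.freq and *.cofreq intermediate files built by the Makefile); the frequency floor and the scoring formula are only assumptions for the illustration.

{{{
# Illustrative sketch only, NOT the code of par2items.py / make_dict.py.
from collections import Counter
from itertools import islice, product

LIMIT = 1000  # number of parallel lines to use, cf. the LIMIT parameter above

def read_sentences(path, limit):
    """Read the first `limit` lines and split them into lowercased tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.lower().split() for line in islice(f, limit)]

src_sents = read_sentences("english.words", LIMIT)
trg_sents = read_sentences("czech.words", LIMIT)

src_freq, trg_freq, cofreq = Counter(), Counter(), Counter()
for src, trg in zip(src_sents, trg_sents):
    src_types, trg_types = set(src), set(trg)
    src_freq.update(src_types)
    trg_freq.update(trg_types)
    # every source word co-occurs with every target word of the aligned sentence
    cofreq.update(product(src_types, trg_types))

# score candidate pairs with the Dice coefficient and keep the best target word
# for each source word; a frequency floor (assumed value) filters out rare words
best = {}
for (s, t), c in cofreq.items():
    if src_freq[s] < 5 or trg_freq[t] < 5:
        continue
    dice = 2.0 * c / (src_freq[s] + trg_freq[t])
    if s not in best or dice > best[s][1]:
        best[s] = (t, dice)

for s, (t, dice) in sorted(best.items()):
    print(f"{s}\t{t}\t{dice:.3f}")
}}}

In the workshop scripts the same kind of decisions (which pairs to keep, how to score them, where to cut off rare words) are exactly the {{{TODO}}} places worth experimenting with.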

== Assignment ==

 1. Change the key places in the scripts {{{par2items.py}}} and {{{make_dict.py}}} to achieve the highest possible F-score (see {{{make eval}}}).
 1. Upload all the scripts to the vault in one archive file.
 1. You can create it like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}