Changes between Version 17 and Version 18 of private/NlpInPracticeCourse/MachineTranslation


Timestamp: Nov 10, 2021, 11:32:25 AM (3 years ago)
Author: pary
Comment: --

  • private/NlpInPracticeCourse/MachineTranslation

    v17 v18  
    33[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
    44
    5 Prepared by: Vít Baisa, Pavel Rychlý
     5Prepared by: Pavel Rychlý
    66
    77== State of the Art ==
    88
    9 The Statistical Machine Translation consists of two main parts: a language model for a target language which is responsible for fluency and good-looking output sentences and a translation model which translates source words and phrases into target language. Both models are probability distributions and can be built using a monolingual corpus for language model and a parallel corpus for translation model.
     9Neural Machine Translation systems are structured as an Encoder-Decoder pair.
     10They are trained on parallel corpora; each training example is a pair of a source sentence and a reference translation.
     11Big improvements can be achieved by preparing cleaner data and feeding the network sentences in the right order.
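As a rough orientation, the following is a minimal sketch of the Encoder-Decoder idea and of one training step on a single (source sentence, reference translation) pair. It assumes PyTorch and made-up vocabulary sizes and token IDs; it is an illustration, not the code used in the practical session.

{{{#!python
# Minimal Encoder-Decoder sketch (illustrative only; assumes PyTorch, toy sizes and IDs).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) -> all hidden states and the final hidden state
        outputs, hidden = self.gru(self.embedding(src_ids))
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_ids, hidden):
        # tgt_ids: (batch, tgt_len); hidden: encoder's final state used as the initial state
        outputs, hidden = self.gru(self.embedding(tgt_ids), hidden)
        return self.out(outputs), hidden

# One training step on a single sentence pair (hypothetical token IDs):
encoder, decoder = Encoder(1000, 64), Decoder(1200, 64)
src = torch.tensor([[5, 42, 7]])           # source sentence as token IDs
tgt = torch.tensor([[1, 9, 13, 2]])        # reference translation: BOS ... EOS
_, hidden = encoder(src)
logits, _ = decoder(tgt[:, :-1], hidden)   # predict each next target token
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1200), tgt[:, 1:].reshape(-1))
loss.backward()                            # gradients for a parameter update
}}}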
     12
    1013
    1114=== References ===
    1215
    13 Approx 3 current papers (preferably from best NLP conferences/journals, eg. [[https://www.aclweb.org/anthology/|ACL Anthology]]) that will be used as a source for the one-hour lecture:
     16 1. Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
     17 1. Popel, Martin, et al. "Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals." Nature communications 11.1 (2020): 1-15.
     18 1. Thompson, Brian and Koehn, Philipp. "Vecalign: Improved Sentence Alignment in Linear Time and Space." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
    1419
    15  1. Popel, Martin, et al. "Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals." Nature communications 11.1 (2020): 1-15.
    16  1. Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007.
    17  1. Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical phrase-based translation." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.
    18  1. Denkowski, Michael, and Alon Lavie. "Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.
    1920
    20 == Workshop: generating translation dictionary from parallel data ==
     21== Practical Session ==
    2122
    22 === Basic instructions ===
     23=== Technical Requirements ===
    2324
    24 * download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with scripts and train data
    25 * unzip with {{{tar xzf ia161_mt.tar.gz}}}
    26 * subdir {{{ia161_mt}}} will be created
     25The task will be carried out in a Python notebook running in a web browser in the Google Colaboratory environment.
    2726
    28 e.g. in ssh/terminal:
    29 {{{
    30 wget https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/private/AdvancedNlpCourse/MachineTranslation/ia161_mt.tar.gz
    31 tar xzf ia161_mt.tar.gz
    32 cd ia161_mt
    33 }}}
     27If you run the code in a local environment, the requirements are Python 3.6+ and Jupyter Notebook.
    3428
    35 For file editing in linux terminal, you may use e.g. the `nano` editor.
    3629
    37 === Files ===
     30=== Translation with a Sequence to Sequence Network and Attention ===
    3831
    39 ||czech.words||100,000 sentences from Czech part of DGT-TM||
    40 ||czech.lemmas||100,000 sentences (lemmas) from Czech part of DGT||
    41 ||english.words||100,000 sentences from English DGT||
    42 ||english.lemmas||100,000 sentences (lemmas) from EN DGT||
    43 ||eval.py||a script for evaluation of coverage and precision of a generated dictionary using a small English-Czech dictionary||
    44 ||gnudfl.txt||a small English-Czech dictionary containing only one-word items and words from the training data||
    45 ||make_dict.py||a script for generating a translation dictionary based on co-occurrences and frequency lists||
    46 ||Makefile||a file with rules for building the dictionary based on the training data||
    47 ||par2items.py||a file for generating pairs of words (lemmas) from the parallel data||
     32Access the [[https://colab.research.google.com/drive/1t9y01lL6gPw8f9GU1phC5qlS8AqqdvS0?usp=sharing|Python notebook in the Google Colab environment]].
    4833
    49 === Description of make ===
    5034
    51 {{{make dict}}}
     35OR
    5236
    53 * the command uses 1,000 lines from training data and generates a dictionary based on wordforms (files czech.words and english.words)
    54 * it is possible to use alternative files with lemmas using parameter L1DATA and L2DATA
    55 * it is also possible to change the number of lines used for the computation (parameter LIMIT)
    56 * in general: {{{make dict [L1DATA=<file>] [L2DATA=<file>] [LIMIT=<number of lines>]}}}
    57 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}
     37download the notebook or a plain Python file from the shared notebook (File > Download) and run it in your local environment.
    5838
    59 The 1,000 lines by default are for the sake of speed.
    6039
    61 {{{make eval}}}
     40Follow the notebook. Choose one of the tasks at the end of the notebook.
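The attention mechanism mentioned in the section title lets the decoder look back at all encoder outputs at every step instead of relying on the final encoder state alone. A minimal dot-product attention sketch (again assuming PyTorch; the notebook's own implementation may differ in detail):

{{{#!python
# Minimal dot-product attention sketch (illustrative; not the notebook's exact code).
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    # decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)            # how much each source position matters now
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)       # (batch, hidden)
    return context, weights

# Example: a context vector over 3 source positions
context, weights = attend(torch.randn(1, 64), torch.randn(1, 3, 64))
}}}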
    6241
    63 * when the dictionary is generated, you can measure its precision and coverage using script eval.py: {{{make eval}}}.
    64 * if you use parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them {{{make eval}}}
    65 * e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas}}}
     42==== Upload ====
    6643
    67 {{{make clean}}}
     44Upload your modified notebook or Python script with results to the homework vault (odevzdávárna).
    6845
    69 * after each change to the input files or the scripts or parameters, clean temporary files: {{{make clean}}}
    70 
    71 == Detailed description of the scripts and generated data ==
    72 
    73 * Try to run default {{{make dict}}} and look at the results:
    74   * czech.words.freq
    75   * english.words.freq
    76   * english.words-czech.words.cofreq
    77   * english.words-czech.words.dict (the resulting dictionary)
    78 * Look at sizes of the output files (how many lines they contain) and its contents.
    79 * Look at the script {{{make_dict.py}}}, which generates the dictionary: at key places it contains {{{TODO}}}
    80 * there you can change the script, add heuristics, change conditions etc. so the final f-score is the highest possible
    81 
    82 == Assignment ==
    83 
    84 1. Change the key places of scripts {{{par2items.py}}}, {{{make_dict.py}}} to achieve the highest possible f-score (see {{{make eval}}}).
    85 1. Upload all the scripts into the vault in one archive file.
    86 1. You can create it like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}