Changes between Version 14 and Version 15 of private/NlpInPracticeCourse/MachineTranslation


Timestamp:
Oct 2, 2017, 9:11:09 AM (6 years ago)
Author:
Vít Baisa

  • private/NlpInPracticeCourse/MachineTranslation

    v14 v15  
    2222
    2323* download [raw-attachment:ia161_mt.tar.gz ia161_mt.tar.gz] with scripts and train data
    24 * unzip into home directory with {{{tar xzf ia161_mt.tar.gz}}}
    25 * a new subdir will be created: {{{it161_mt}}}
     24* unzip with {{{tar xzf ia161_mt.tar.gz}}}
     25* subdir {{{ia161_mt}}} will be created
    2626
    27 === Files in the archive ===
     27=== Files ===
    2828
    2929||czech.words||100,000 sentences from Czech part of DGT-TM||
     
    3131||english.words||100,000 sentences from English DGT||
    3232||english.lemmas||100,000 sentences (lemmas) from EN DGT||
    33 ||eval.py||a script for evaluation of coverage and precision of a generated dictionary in comparison with a small English-Czech dictionary||
    34 ||gnudfl.txt||a small English-Czech dictionary containing only one-word items and words from the train data||
    35 ||make_dict.py||a script for generating dictionary based on co-occurrences and frequency lists||
    36 ||Makefile||a file with rules for building the dictionary based on the train data||
     33||eval.py||a script for evaluation of coverage and precision of a generated dictionary using a small English-Czech dictionary||
     34||gnudfl.txt||a small English-Czech dictionary containing only one-word items and words from the training data||
     35||make_dict.py||a script for generating a translation dictionary based on co-occurrences and frequency lists||
     36||Makefile||a file with rules for building the dictionary based on the training data||
    3737||par2items.py||a file for generating pairs of words (lemmas) from the parallel data||
    3838
     
    4141{{{make dict}}}
    4242
    43 * the command uses 1,000 lines from train data and generates a dictionary based on wordforms (files czech.words and english.words)
     43* the command uses 1,000 lines from training data and generates a dictionary based on wordforms (files czech.words and english.words)
    4444* it is possible to use alternative files with lemmas using the parameters L1DATA and L2DATA
    4545* it is also possible to change the number of lines used for the computation (parameter LIMIT)
     
    4747* e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas LIMIT=10000}}}
    4848
     49The default of 1,000 lines is chosen for the sake of speed.
     50
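To illustrate what "a dictionary based on co-occurrences and frequency lists" can mean, here is a minimal sketch. It is not the code of {{{make_dict.py}}}, which is not shown on this page. It assumes sentence-aligned input with one sentence per line and whitespace tokenization, and it scores word pairs with the Dice coefficient, which is only one of several possible association measures. The function name {{{build_dictionary}}} and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import product


def build_dictionary(src_sentences, tgt_sentences, limit=1000):
    """For each source word, pick the target word with the highest Dice score.

    Hypothetical helper for illustration; not taken from make_dict.py.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    # Use only the first `limit` aligned sentence pairs (cf. the LIMIT parameter).
    pairs = list(zip(src_sentences, tgt_sentences))[:limit]
    for src, tgt in pairs:
        # Sets count each word once per sentence, so all counts are
        # sentence frequencies.
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    best = {}  # source word -> (score, target word)
    for (s, t), together in pair_freq.items():
        # Dice coefficient (an assumed choice of association measure).
        dice = 2 * together / (src_freq[s] + tgt_freq[t])
        if s not in best or dice > best[s][0]:
            best[s] = (dice, t)
    return {s: t for s, (_, t) in best.items()}


# Toy data, made up for this example.
english = ["the house is big", "the dog is small", "a big dog"]
czech = ["dům je velký", "pes je malý", "velký pes"]
dictionary = build_dictionary(english, czech)
```

Function words such as "the" tend to receive poor translations under this kind of scoring, which is one reason why experimenting with lemmas and with a larger LIMIT can change the results.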
    4951{{{make eval}}}
    5052
    51 * when a dictionary is generated, you can measure its precision and coverage using script eval.py: {{{make eval}}}.
     53* when the dictionary is generated, you can measure its precision and coverage using script eval.py: {{{make eval}}}.
    5254* if you use parameters {{{L1DATA}}} and {{{L2DATA}}}, you must repeat them for {{{make eval}}}
    5355* e.g.: {{{make dict L1DATA=english.lemmas L2DATA=czech.lemmas}}}
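The relation between precision, coverage and f-score can be sketched as follows. This is not the code of {{{eval.py}}}, and the exact definitions used there are not given on this page. The sketch assumes that precision is the share of judged entries whose translation is accepted by the reference dictionary, that coverage is the share of reference headwords for which an entry was generated, and that the f-score is their harmonic mean. The function {{{evaluate}}} and the toy dictionaries are hypothetical.

```python
def evaluate(generated, reference):
    """Return (precision, coverage, f_score) for a generated dictionary.

    generated: dict mapping a source word to one proposed translation
    reference: dict mapping a source word to a set of accepted translations
    The definitions below are assumptions, not necessarily those of eval.py.
    """
    # Only entries whose source word appears in the reference can be judged.
    checkable = [w for w in generated if w in reference]
    correct = sum(1 for w in checkable if generated[w] in reference[w])

    precision = correct / len(checkable) if checkable else 0.0
    coverage = len(checkable) / len(reference) if reference else 0.0
    if precision + coverage == 0:
        return precision, coverage, 0.0
    # Harmonic mean of precision and coverage.
    f_score = 2 * precision * coverage / (precision + coverage)
    return precision, coverage, f_score


# Toy data, made up for this example.
reference = {
    "dog": {"pes"},
    "house": {"dům", "domek"},
    "big": {"velký"},
    "cat": {"kočka"},
}
generated = {"dog": "pes", "house": "dům", "big": "malý", "the": "je"}
precision, coverage, f_score = evaluate(generated, reference)
```

In this toy case three generated entries can be judged and two of them are correct, so precision is 2/3, coverage is 3/4, and the f-score is 12/17.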
     
    7072== Assignment ==
    7173
    72 1. Change the key places of scripts {{{par2items.py}}}, {{{make_dict.py}}} so to achieve the highest possible f-score (see {{{make eval}}}).
     741. Change the key places of scripts {{{par2items.py}}}, {{{make_dict.py}}} to achieve the highest possible f-score (see {{{make eval}}}).
    73751. Upload all the scripts into the vault in one archive file.
    74761. You can create it like this: {{{tar czf ia161_mt_<uco_or_login>.tar.gz Makefile *.py}}}