Changes between Version 31 and Version 32 of private/NlpInPracticeCourse/ParsingCzech


Timestamp:
Nov 5, 2023, 8:46:00 PM (6 months ago)
Author:
Ales Horak
Comment:

--

  • private/NlpInPracticeCourse/ParsingCzech

    v31 v32  
    1717== Practical Session ==
    1818
    19 We will develop/adjust the grammar of the SET parser.
     19We will develop/adjust the grammar of the SET parser (for English or Czech).
    2020
    21211. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
     
    3131cd ukol_ia161-parsing
    3232}}}
    33 1. Choose the language you want to work with. The default is Czech (`cs`) which can be changed to English (`en`) via editing `Makefile`:
     331. [optional] Choose the language you want to work with. The default is English (`en`) which can be changed to Czech (`cs`) via editing `Makefile`:
    3434{{{
    3535nano Makefile
    3636}}}
    37  change the first line to
     37 if you want to work with Czech, change the first line to
    3838{{{
    39 LANGUAGE=en
     39LANGUAGE=cs
    4040}}}
    41411. Test the prepared program that analyses 100 selected sentences
     
    4646 The output should be
    4747{{{
    48 ./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
    49 UAS =  66.1 %
     48./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
     49UAS =  55.4 %
    5050}}}
    5151 You can see detailed evaluation (sentence by sentence) with
     
    5555 You can watch differences for one tree with
    5656{{{
    57 make diff SENTENCE=00009
     57make diff SENTENCE=academic_librarians-10
    5858}}}
     59 The left window (`ud21_gum_dev/academic_librarians-10`) shows the
     60 expected ground truth; the right window (`set_ud21_gum_dev/academic_librarians-10`) displays the current parsing result (to be improved by you).[[br]]
    5961 Exit the diff by pressing `q`.[[br]]
    6062 You may inspect the tagged vertical text with
    6163 {{{
    62  make vert SENTENCE=00009
     64 make vert SENTENCE=academic_librarians-10
    6365}}}
    6466 You can view the two trees with (`python3-tk` must be installed on the system)
    6567 {{{
    66 make view SENTENCE=00009
     68make view SENTENCE=academic_librarians-10
    6769}}}
    68  For remote tree view, you may run
     70 For remote tree view (i.e. inspecting the trees on a different computer), you may run
    6971 {{{
    70 make html SENTENCE=00009
     72make html SENTENCE=academic_librarians-10
    7173}}}
    7274 And point your browser to the `html/index.html` file. [[br]]
    7375 You can extract the text of the sentence easily with
    7476 {{{
    75 make text SENTENCE=00009
     77make text SENTENCE=academic_librarians-10
    7678}}}
    7779 An English translation of the Czech sentences can be obtained via
    7880 {{{
    79 make texttrans SENTENCE=00009
     81make texttrans SENTENCE=academic_librarians-10
    8082}}}
    81 1. Look at the files:
     831. Look at the files (you may use the `mc` file manager; exit it with `Esc+0`):
    8284 * `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English
    8385 * `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
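The UAS (unlabeled attachment score) reported by `make compare` is the percentage of tokens whose predicted head matches the head in the gold standard tree. The following is a minimal sketch of that idea only, not the code of `compare_dep_trees.py`: the function name `uas`, the representation of a tree as a list of head indices, and the choice to average over all tokens rather than per sentence are assumptions made for illustration.

```python
def uas(gold_trees, predicted_trees):
    """Return the unlabeled attachment score in percent.

    Each tree is a list of head indices, one per token (0 = root).
    This representation is an assumption for the sketch; the course
    data uses its own tree file format.
    """
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold_trees, predicted_trees):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("gold and predicted sentences differ in length")
        total += len(gold_heads)
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold_heads, pred_heads))
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * correct / total


# Toy data: two sentences, 7 tokens in total, 5 heads predicted correctly.
gold = [[2, 0, 2], [0, 1, 2, 1]]
pred = [[2, 0, 1], [0, 1, 1, 1]]
print(f"UAS = {uas(gold, pred):5.1f} %")
```

Every grammar change that attaches one more token to its correct head raises this score by one token's share, which is why inspecting individual sentences with `make diff` is a practical way to find rules worth adding.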
     
    8789== Assignment ==
    8890
    89 1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the Czech grammar are in the [raw-attachment:tagset.pdf Brno tagset].
     911. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the English grammar `grammar-en.set` follow the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset]; those in the Czech grammar `grammar-cs.set` follow the [raw-attachment:tagset.pdf Brno tagset].
    90921. Develop a better grammar by repeating the process:
    9193{{{
    92 nano grammar.set # or use your favourite editor
     94nano grammar-en.set # or use your favourite editor
    9395make set_trees
    9496make compare
     
    97991. Write the final UAS in `grammar-cs.set` or `grammar-en.set`
    98100{{{
    99 # This is the SET grammar for Czech used in IA161 course
     101# This is the SET grammar for English used in IA161 course
    100102#
    101103# ===========   resulting UAS =  66.9 %  ===================