Changes between Version 39 and Version 40 of private/NlpInPracticeCourse/ParsingCzech


Timestamp:
Nov 6, 2024, 8:28:33 PM (7 months ago)
Author:
Ales Horak
Comment:

--

  • private/NlpInPracticeCourse/ParsingCzech

    v39 v40  
    17 17 == Practical Session ==
    18 18
    19 {{{
    20 #!div class="wiki-toc" style="width: 40%"
    21 **Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
    22 }}}
     19 We will develop/adjust the grammar of the SET parser (for English or Czech).
    23 20
    24 We will develop/adjust the grammar of the SET parser (for English or Czech).[[br]][[br]][[br]]
     21 Open [https://colab.research.google.com/drive/1SUtMScLK-6sKsX5eYIUfFjBtgrKCpkRy?usp=sharing Google Colab notebook IA161-ParsingCzech.ipynb] and follow the text and code in it.
    25 22
    26 1. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
    27 {{{
    28 wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
    29 }}}
    30 1. Unzip the downloaded file
    31 {{{
    32 unzip ukol_ia161-parsing.zip
    33 }}}
    34 1. Go to the unzipped folder
    35 {{{
    36 cd ukol_ia161-parsing
    37 }}}
    38 1. [optional] Choose the language you want to work with. The default is English (`en`), which can be changed to Czech (`cs`) by editing `Makefile`:
    39 {{{
    40 nano Makefile
    41 }}}
    42  If you want to work with Czech, change the first line to
    43 {{{
    44 LANGUAGE=cs
    45 }}}
    46 1. Test the prepared program that analyses 100 selected sentences
    47 {{{
    48 make set_trees
    49 make compare
    50 }}}
    51  The output should be
    52 {{{
    53 ./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
    54 UAS =  55.4 %
    55 }}}
    56  You can see detailed evaluation (sentence by sentence) with
    57 {{{
    58 make compare SENTENCES=1
    59 }}}
    60  You can watch differences for one tree with
    61 {{{
    62 make diff SENTENCE=academic_librarians-10
    63 }}}
    64  The left window with `ud21_gum_dev/academic_librarians-10` shows the
    65  expected ground truth; the right window with `set_ud21_gum_dev/academic_librarians-10` displays the current parsing result (to be improved by you).[[br]]
    66  Exit the diff by pressing `q`.[[br]]
    67  You may inspect the tagged vertical text with
    68  {{{
    69  make vert SENTENCE=academic_librarians-10
    70 }}}
    71  You can watch the two trees with (`python3-tk` must be installed in the system)
    72  {{{
    73 make view SENTENCE=academic_librarians-10
    74 }}}
    75  For remote tree view (i.e. inspecting the trees on a different computer), you may run
    76  {{{
    77 make html SENTENCE=academic_librarians-10
    78 }}}
    79  And point your browser to the `html/index.html` file. [[br]]
    80  You can extract the text of the sentence easily with
    81  {{{
    82 make text SENTENCE=academic_librarians-10
    83 }}}
    84  An English translation of the Czech sentences can be obtained via
    85  {{{
    86 make texttrans SENTENCE=academic_librarians-10
    87 }}}
    88 1. Debugging the parsing process can be done using
    89  {{{
    90 make debug SENTENCE=academic_librarians-10
    91 }}}
    92  which will print the final rules used to build the tree. Adding
    93  `DETAIL=1` will show all details of the parsing process, including
    94  the unused rules.
    95  {{{
    96 make debug SENTENCE=academic_librarians-10 DETAIL=1
    97 }}}
    98 1. Look at the files (you may use `mc` file manager, exit it with `Esc+0`):
    99  * `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format.[[br]]
    100   The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English.
    101  * `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
    102  * `data/trees/set_pdt2_etest` or `set_ud21_gum_dev` - 100 trees output from SET by running `make set_trees`
    103  * `grammar-cs.set` or `grammar-en.set` - the grammar used in running SET
    104 
    105 == Assignment ==
    106 
    107 1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the English `grammar-en.set` follow the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html  Penn Treebank tagset] and in the Czech grammar `grammar-cs.set` the [raw-attachment:tagset.pdf Brno tagset].
    108 1. Develop a better grammar - repeat the process:
    109 {{{
    110 nano grammar-en.set # or use your favourite editor
    111 make set_trees
    112 make compare
    113 }}}
    114  to improve the original UAS
    115 1. Write the final UAS in `grammar-cs.set` or `grammar-en.set`
    116 {{{
    117 # This is the SET grammar for English used in IA161 course
    118 #
    119 # ===========   resulting UAS =  66.9 %  ===================
    120 }}}
    121 1. Upload your `grammar-cs.set` or `grammar-en.set` to the homework vault.
     23 Upload the resulting grammar file with the improved UAS to the homework vault.
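
For orientation, the UAS (unlabeled attachment score) mentioned above is the percentage of tokens whose predicted head matches the head in the gold-standard tree. The following is a minimal sketch of that idea only. It is not the code of `compare_dep_trees.py`: the function name `uas`, the list-of-head-indices representation, and the example trees are all hypothetical, and the course script may differ in details such as punctuation handling or how scores are aggregated over sentences.

```python
from typing import Sequence


def uas(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Return the unlabeled attachment score (in percent) for one sentence.

    Each sequence holds, for every token, the index of its head token
    (0 stands for the artificial root). This representation is an
    assumption made for illustration.
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        return 0.0
    # A token counts as correct when its predicted head equals the gold head.
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 5-token sentence: the parser attaches token 4 to the wrong head.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(f"UAS = {uas(gold, pred):5.1f} %")  # prints "UAS =  80.0 %"
```

Under this definition, every grammar change that moves one more token to its correct head raises the score by one token's share, so comparing the two trees sentence by sentence is a direct way to find rules worth adjusting.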