Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2023/ParsingCzech


Timestamp: Sep 3, 2024, 2:51:18 PM
Author: Ales Horak
Comment: copied from private/NlpInPracticeCourse/ParsingCzech

= Parsing of Czech: Between Rules and Stats =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček, Aleš Horák

== State of the Art ==

=== References ===

 1. Fernández-González, D., & Gómez-Rodríguez, C. (2023). Dependency parsing with bottom-up hierarchical pointer networks. Information Fusion, 91, 494–503.
 1. Arps, D., Samih, Y., Kallmeyer, L., & Sajjad, H. (2022). Probing for constituency structure in neural language models. arXiv preprint arXiv:2204.06201.
 1. Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
 1. Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Vetulani, Z., & Mariani, J. (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155–165). Springer International Publishing.


== Practical Session ==

{{{
#!div class="wiki-toc" style="width: 40%"
**Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
}}}

We will develop/adjust the grammar of the SET parser (for English or Czech).[[br]][[br]][[br]]

1. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
{{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
}}}
1. Unzip the downloaded file
{{{
unzip ukol_ia161-parsing.zip
}}}
1. Go to the unzipped folder
{{{
cd ukol_ia161-parsing
}}}
1. [optional] Choose the language you want to work with. The default is English (`en`), which can be changed to Czech (`cs`) by editing `Makefile`:
{{{
nano Makefile
}}}
 To work with Czech, change the first line to
{{{
LANGUAGE=cs
}}}
1. Test the prepared program that analyses 100 selected sentences
{{{
make set_trees
make compare
}}}
 With the default English setting, the output should be
{{{
./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
UAS =  55.4 %
}}}
 You can see a detailed evaluation (sentence by sentence) with
{{{
make compare SENTENCES=1
}}}
 You can view the differences for a single tree with
{{{
make diff SENTENCE=academic_librarians-10
}}}
 The left window with `ud21_gum_dev/academic_librarians-10` shows the
 expected ground truth, the right window with `set_ud21_gum_dev/academic_librarians-10` displays the current parsing result (to be improved by you).[[br]]
 Exit the diff by pressing `q`.[[br]]
 You may inspect the tagged vertical text with
 {{{
make vert SENTENCE=academic_librarians-10
}}}
 You can view the two trees with (`python3-tk` must be installed on the system)
 {{{
make view SENTENCE=academic_librarians-10
}}}
 For a remote tree view (i.e. inspecting the trees on a different computer), you may run
 {{{
make html SENTENCE=academic_librarians-10
}}}
 and point your browser to the `html/index.html` file. [[br]]
 You can extract the text of the sentence easily with
 {{{
make text SENTENCE=academic_librarians-10
}}}
 An English translation of a Czech sentence can be obtained with
 {{{
make texttrans SENTENCE=academic_librarians-10
}}}
1. Debugging the parsing process can be done using
 {{{
make debug SENTENCE=academic_librarians-10
}}}
 which will print the final rules used to build the tree. Adding
 `DETAIL=1` will show all details of the parsing process, including
 the unused rules.
 {{{
make debug SENTENCE=academic_librarians-10 DETAIL=1
}}}
1. Look at the files (you may use the `mc` file manager, exit it with `Esc+0`):
 * `data/vert/pdt2_etest` or `data/vert/ud21_gum_dev` - 100 input sentences in vertical format.[[br]]
  The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English
 * `data/trees/pdt2_etest` or `data/trees/ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
 * `data/trees/set_pdt2_etest` or `data/trees/set_ud21_gum_dev` - 100 trees output from SET by running `make set_trees`
 * `grammar-cs.set` or `grammar-en.set` - the grammar used in running SET
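
The vertical format listed above stores one token per line. The following is a minimal sketch of reading such a file into sentences. It assumes tab-separated `word`, `lemma` and `tag` columns and `<s>`/`</s>` sentence markup; this layout, the function name `read_vertical` and the sample tokens are illustrative assumptions and have not been checked against the course data.

```python
def read_vertical(lines):
    """Group vertical-format lines into sentences.

    Assumed layout (not verified against the course data): one token per
    line with tab-separated word, lemma and tag; lines such as <s> and
    </s> are structural markup, and </s> closes a sentence.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        # A markup line looks like <...> and has no tab-separated columns.
        is_markup = line.startswith("<") and line.endswith(">") and "\t" not in line
        if is_markup:
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        # Pad missing columns so that every token is a (word, lemma, tag) triple.
        word, lemma, tag = (columns + ["", ""])[:3]
        sentence.append((word, lemma, tag))
    if sentence:
        yield sentence  # last sentence without a closing </s>


# Hypothetical sample with Penn Treebank tags, made up for this example.
sample = "<s>\nLibrarians\tlibrarian\tNNS\nwork\twork\tVBP\n.\t.\t.\n</s>\n"
sentences = list(read_vertical(sample.splitlines()))
for word, lemma, tag in sentences[0]:
    print(f"{word}\t{lemma}\t{tag}")
```

Looking at the tags of a badly parsed sentence this way may help when deciding which tag patterns a new grammar rule should match.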

== Assignment ==

1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the English grammar `grammar-en.set` follow the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset]; the Czech grammar `grammar-cs.set` uses the [raw-attachment:tagset.pdf Brno tagset].
1. Develop a better grammar. Repeat the process
{{{
nano grammar-en.set # or use your favourite editor
make set_trees
make compare
}}}
 to improve the original UAS (unlabeled attachment score).
1. Write the final UAS into `grammar-cs.set` or `grammar-en.set`
{{{
# This is the SET grammar for English used in IA161 course
#
# ===========   resulting UAS =  66.9 %  ===================
}}}
1. Upload your `grammar-cs.set` or `grammar-en.set` to the homework vault.
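
The UAS used throughout the assignment is the share of tokens whose predicted head equals the gold-standard head. The sketch below shows one way to compute it and to format the header line from step 3. It assumes that each tree is available as a list of head indices (one per token, `0` for the root) and that the score is averaged over all tokens; the names `uas` and `uas_header_line` are hypothetical, and `compare_dep_trees.py` may read and aggregate the trees differently.

```python
def uas(gold_sentences, parsed_sentences):
    """Return the unlabeled attachment score in percent.

    Each argument is a list of sentences; a sentence is a list of head
    indices, one per token (0 marks the root). This data layout is an
    assumption made for the example.
    """
    total = 0
    correct = 0
    for gold, parsed in zip(gold_sentences, parsed_sentences):
        if len(gold) != len(parsed):
            raise ValueError("gold and parsed sentence differ in length")
        total += len(gold)
        # A token counts as correct when both trees attach it to the same head.
        correct += sum(1 for g, p in zip(gold, parsed) if g == p)
    return 100.0 * correct / total if total else 0.0


def uas_header_line(value):
    """Format the score like the header comment in the grammar file."""
    return f"# ===========   resulting UAS = {value:5.1f} %  ==================="


# Two toy sentences; the third token of the first one has a wrong head.
gold = [[2, 0, 2], [0, 1]]
parsed = [[2, 0, 1], [0, 1]]
print(uas_header_line(uas(gold, parsed)))
```

In the toy data, four of the five tokens have the correct head, so the score is 80 %.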