Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2022/ParsingCzech


Timestamp:
Sep 13, 2023, 2:45:44 PM (23 months ago)
Author:
Ales Horak
Comment:

copied from private/NlpInPracticeCourse/ParsingCzech

  • en/NlpInPracticeCourse/2022/ParsingCzech

    v1 v1  
     1= Parsing of Czech: Between Rules and Stats =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák
     4
     5Prepared by: Miloš Jakubíček
     6
     7== State of the Art ==
     8
     9=== References ===
     10
     11 1. Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
     12 1. Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
     13 1. Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
     14 1. Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Z. Vetulani & J. Mariani (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155–165). Springer International Publishing.
     15
     16
     17== Practical Session ==
     18
     19We will develop/adjust the grammar of the SET parser.
     20
     211. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
     22{{{
     23wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
     24}}}
     251. Unzip the downloaded file
     26{{{
     27unzip ukol_ia161-parsing.zip
     28}}}
     291. Go to the unzipped folder
     30{{{
     31cd ukol_ia161-parsing
     32}}}
     331. Choose the language you want to work with. The default is Czech (`cs`); to switch to English (`en`), edit the `Makefile`:
     34{{{
     35nano Makefile
     36}}}
     37 change the first line to
     38{{{
     39LANGUAGE=en
     40}}}
     411. Test the prepared program that analyses 100 selected sentences
     42{{{
     43make set_trees
     44make compare
     45}}}
     46 The output should be
     47{{{
     48./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
     49UAS =  66.1 %
     50}}}
     51 You can see a detailed, sentence-by-sentence evaluation with
     52{{{
     53make compare SENTENCES=1
     54}}}
     55 You can view the differences for a single tree with
     56{{{
     57make diff SENTENCE=00009
     58}}}
     59 Exit the diff by pressing `q`.[[br]]
     60 You may inspect the tagged vertical text with
     61 {{{
     62 make vert SENTENCE=00009
     63}}}
     64 You can view the two trees side by side with (requires `python3-tk` to be installed on the system)
     65 {{{
     66make view SENTENCE=00009
     67}}}
     68 For remote tree view, you may run
     69 {{{
     70make html SENTENCE=00009
     71}}}
     72 Then open the `html/index.html` file in your browser. [[br]]
     73 You can extract the text of the sentence easily with
     74 {{{
     75make text SENTENCE=00009
     76}}}
     77 An English translation of a Czech sentence can be obtained with
     78 {{{
     79make texttrans SENTENCE=00009
     80}}}
     811. Look at the files:
     82 * `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English
     83 * `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
     84 * `data/trees/set_pdt2_etest` or `set_ud21_gum_dev` - 100 trees output from SET by running `make set_trees`
     85 * `grammar-cs.set` or `grammar-en.set` - the grammar used in running SET
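
The UAS (unlabeled attachment score) reported by `make compare` is the percentage of tokens whose predicted head matches the head in the gold standard tree. The following is a minimal sketch of that metric, not the code of `compare_dep_trees.py`: the function name and the representation of a tree as a list of head indices are assumptions made for illustration, and the real script may aggregate over sentences differently.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Return UAS in percent: the share of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted trees must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 5-token sentence; each entry is the index of the token's head (0 = root).
gold = [2, 0, 2, 5, 3]
predicted = [2, 0, 2, 3, 3]  # the fourth token is attached to the wrong head

print(f"UAS = {unlabeled_attachment_score(gold, predicted):5.1f} %")
```

Dependency labels play no role here; only the head of each token is compared.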
     86
     87== Assignment ==
     88
     891. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the Czech grammar are in the [raw-attachment:tagset.pdf Brno tagset].
     901. Develop a better grammar by repeating the following cycle:
     91{{{
     92nano grammar-cs.set # or grammar-en.set; use your favourite editor
     93make set_trees
     94make compare
     95}}}
     96 until the UAS improves over the original value
     971. Record the final UAS in a comment in `grammar-cs.set` or `grammar-en.set`:
     98{{{
     99# This is the SET grammar for Czech used in IA161 course
     100#
     101# ===========   resulting UAS =  66.9 %  ===================
     102}}}
     1031. Upload your `grammar-cs.set` or `grammar-en.set` to the homework vault.
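
When iterating on the grammar, it can help to compare each new score against the baseline. The sketch below pulls the UAS value out of text shaped like the `make compare` output and the grammar header comment shown above. The helper name `extract_uas` and the regular expression are assumptions for illustration and are not part of the course materials.

```python
import re

# Matches e.g. "UAS =  66.1 %" (assumed format, taken from the sample output above).
UAS_LINE = re.compile(r"UAS\s*=\s*([0-9]+(?:\.[0-9]+)?)\s*%")


def extract_uas(text):
    """Return the first UAS value found in text as a float, or None if there is none."""
    match = UAS_LINE.search(text)
    return float(match.group(1)) if match else None


baseline_output = (
    "./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100\n"
    "UAS =  66.1 %\n"
)
header_comment = "# ===========   resulting UAS =  66.9 %  ==================="

baseline = extract_uas(baseline_output)
final = extract_uas(header_comment)
print(f"improvement: {final - baseline:+.1f} percentage points")
```

The same function can be applied to captured output of `make compare` after each grammar change to check that the score has not dropped.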