Version 2 (modified by Ales Horak, 8 months ago)

Parsing of Czech: Between Rules and Stats

IA161 NLP in Practice Course, Course Guarantee: Aleš Horák

Prepared by: Miloš Jakubíček

State of the Art

References

  1. Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
  2. Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
  3. Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
  4. Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Vetulani, Z., & Mariani, J. (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155-165). Springer International Publishing.

Practical Session

We will develop/adjust the grammar of the SET parser.

  1. Download the SET parser together with the evaluation dataset
    wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
    
  2. Unzip the downloaded file
    unzip ukol_ia161-parsing.zip
    
  3. Go to the unzipped folder
    cd ukol_ia161-parsing
    
  4. Choose the language you want to work with. The default is Czech (cs), which can be changed to English (en) by editing the Makefile:
    nano Makefile
    
    change the first line to
    LANGUAGE=en
    
  5. Test the prepared program that analyses 100 selected sentences
    make set_trees
    make compare
    
    The output should be
    ./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
    UAS =  66.1 %
    
    You can see a detailed evaluation (sentence by sentence) with
    make compare SENTENCES=1
    
    You can view the differences for a single tree with
    make diff SENTENCE=00009
    
    Exit the diff by pressing q.
    You may inspect the tagged vertical text with
    make vert SENTENCE=00009
    
    You can view the two trees side by side with (python3-tk must be installed on the system)
    make view SENTENCE=00009
    
    For remote tree view, you may run
    make html SENTENCE=00009
    
    Then point your browser to the html/index.html file.
    You can extract the text of the sentence easily with
    make text SENTENCE=00009
    
    An English translation of the Czech sentences can be obtained via
    make texttrans SENTENCE=00009
    
  6. Look at the files:
    • data/vert/pdt2_etest or ud21_gum_dev - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank positional tagset for Czech and the Penn Treebank tagset for English
    • data/trees/pdt2_etest or ud21_gum_dev - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
    • data/trees/set_pdt2_etest or set_ud21_gum_dev - 100 trees produced by SET when running make set_trees
    • grammar-cs.set or grammar-en.set - the grammar used in running SET
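
The UAS reported by make compare is the unlabeled attachment score: the share of tokens whose predicted head matches the gold-standard head. The following is a minimal sketch of that computation. It assumes each tree is given as a list of head indices, one per token, with 0 marking the root. The function name uas and this tree representation are illustrative only; they are not the file format of the course data or the code of compare_dep_trees.py. Whether the course script averages per token or per sentence is not stated here, so the sketch simply averages over all tokens.

```python
def uas(gold_trees, predicted_trees):
    """Return the unlabeled attachment score in percent.

    Each tree is a list of head indices, one per token (0 = root).
    This representation is an assumption made for illustration.
    """
    correct = 0
    total = 0
    for gold, predicted in zip(gold_trees, predicted_trees):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted trees differ in length")
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * correct / total


# Two toy sentences (hypothetical data): one head differs in the first tree.
gold = [[2, 0, 2], [0, 1]]
predicted = [[2, 0, 1], [0, 1]]
print(f"UAS = {uas(gold, predicted):5.1f} %")
```

In the toy data four of the five heads match, so the score is 80.0 %. A grammar change that fixes a single attachment in a 100-sentence test set moves the score only slightly, which is why the sentence-by-sentence output of make compare SENTENCES=1 is useful for finding where the errors concentrate.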

Assignment

  1. Study the SET documentation. The tags used in the Czech grammar are in the Brno tagset.
  2. Develop a better grammar - repeat the process:
    nano grammar-cs.set # or grammar-en.set; use your favourite editor
    make set_trees
    make compare
    
    to improve on the original UAS
  3. Write the final UAS as a comment in grammar-cs.set or grammar-en.set
    # This is the SET grammar for Czech used in IA161 course
    # 
    # ===========   resulting UAS =  66.9 %  ===================
    
  4. Upload your grammar-cs.set or grammar-en.set to the homework vault.
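
When repeating the edit, parse, and compare cycle many times, it can help to read the score out of the text programmatically. The sketch below extracts the number from a line such as the one printed by make compare or the one written into the grammar header. The names UAS_PATTERN and extract_uas are hypothetical helpers, not part of the course package, and the sample strings are copied from the examples on this page.

```python
import re

# Matches "UAS =  66.1 %" regardless of the amount of whitespace.
UAS_PATTERN = re.compile(r"UAS\s*=\s*([0-9]+(?:\.[0-9]+)?)\s*%")


def extract_uas(text):
    """Return the first UAS value found in text as a float."""
    match = UAS_PATTERN.search(text)
    if match is None:
        raise ValueError("no UAS line found")
    return float(match.group(1))


# Sample output of `make compare`, as shown in the practical session.
compare_output = (
    "./compare_dep_trees.py data/trees/pdt2_etest "
    "data/trees/set_pdt2_etest-sel100\n"
    "UAS =  66.1 %\n"
)
# Sample header comment from the grammar file.
grammar_header = "# ===========   resulting UAS =  66.9 %  ==================="

baseline = extract_uas(compare_output)
final = extract_uas(grammar_header)
print(f"baseline {baseline} %, final {final} %, gain {final - baseline:.1f}")
```

In a real run, the text passed to extract_uas would be the captured output of make compare (for example via subprocess.run with capture_output=True) rather than a hard-coded string.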