Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2020/ParsingCzech


Timestamp: Aug 31, 2021, 2:11:28 PM
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/ParsingCzech

= Parsing of Czech: Between Rules and Stats =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček

== State of the Art ==

=== References ===

 1. Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
 1. Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
 1. Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
 1. Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Z. Vetulani & J. Mariani (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155-165). Springer International Publishing.


== Practical Session ==

We will develop/adjust the grammar of the SET parser.

1. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
{{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
}}}
1. Unzip the downloaded file
{{{
unzip ukol_ia161-parsing.zip
}}}
1. Go to the unzipped folder
{{{
cd ukol_ia161-parsing
}}}
1. Test the prepared program that analyses 100 selected sentences
{{{
make set_trees
make compare
}}}
 The output should be the following (UAS stands for unlabeled attachment score)
{{{
./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS =  66.1 %
}}}
 You can see a detailed, sentence-by-sentence evaluation with
{{{
make compare SENTENCES=1
}}}
 You can view the differences for one tree with
{{{
make diff SENTENCE=00009
}}}
 Exit the diff by pressing `q`.[[br]]
 You can view the two trees with (`python-qt4` must be installed on the system)
 {{{
make view SENTENCE=00009
}}}
 To view the trees remotely, you may run
 {{{
make html SENTENCE=00009
}}}
 Then point your browser to the `html/index.html` file. [[br]]
 You can easily extract the text of the sentence (e.g. for Google Translate) with
 {{{
make text SENTENCE=00009
}}}
1. Look at the files:
 * `data/vert/pdt2_etest-sel100` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset]
 * `data/trees/pdt2_etest` - 100 gold standard dependency trees from the Prague Dependency Treebank
 * `data/trees/set_pdt2_etest-sel100` - 100 trees output by SET when running `make set_trees`
 * `grammar.set` - the grammar used when running SET
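The positional tags in the vertical file can be hard to read at first. The following is a minimal sketch, not part of the distributed archive, that splits a 15-character PDT positional tag into its named positions as listed in the linked tagset manual. The function name `decode_pdt_tag` and the sample tag are illustrative assumptions; the exact column layout of the vertical file is not shown here, so the sketch works on a single tag string only.

```python
# Names of the 15 positions of a PDT positional tag, in order
# (taken from the PDT 2.0 m-layer manual; decode_pdt_tag is a hypothetical helper).
PDT_POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_pdt_tag(tag):
    """Map a 15-character positional tag to {position name: value}.

    Positions holding '-' (not applicable) are left out of the result.
    """
    if len(tag) != len(PDT_POSITIONS):
        raise ValueError(f"expected 15 characters, got {len(tag)}: {tag!r}")
    return {name: ch for name, ch in zip(PDT_POSITIONS, tag) if ch != "-"}


if __name__ == "__main__":
    # Illustrative tag: noun, feminine, singular, nominative, affirmative.
    print(decode_pdt_tag("NNFS1-----A----"))
```

Note that the grammar itself uses the Brno tagset, so this helper is only meant for inspecting the input data.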

== Assignment ==

1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the grammar are in the [raw-attachment:tagset.pdf Brno tagset].
1. Develop a better grammar by repeating the following cycle until the UAS improves on the original value:
{{{
edit grammar.set # use your favourite editor
make set_trees
make compare
}}}
1. Write the final UAS in `grammar.set`
{{{
# This is the SET grammar for Czech used in IA161 course
#
# ===========   resulting UAS =  66.1 %  ===================
}}}
1. Upload your `grammar.set` to the homework vault.
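For intuition about the number being optimised: the unlabeled attachment score is the percentage of tokens whose head matches the gold-standard tree. The sketch below shows that computation under simplifying assumptions. Each sentence is represented as a plain list of head indices (0 for the root), and the function name `uas` is hypothetical. The bundled `compare_dep_trees.py` reads tree files and may differ in details such as how it treats punctuation.

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score in percent over a list of sentences.

    Each sentence is a list of head indices, one per token (0 = root).
    This in-memory representation is an assumption made for illustration.
    """
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        # Count the tokens whose predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total if total else 0.0


if __name__ == "__main__":
    gold = [[2, 0, 2], [0, 1, 2, 1]]
    pred = [[2, 0, 1], [0, 1, 1, 1]]
    print(f"UAS = {uas(gold, pred):5.1f} %")
```

In the toy data, five of the seven heads agree, so the score is roughly 71.4 %. Because the score is computed per token, a grammar change that fixes attachments in long sentences moves the UAS more than one that only affects short sentences.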