ParsingCzech

Timestamp: Sep 13, 2023, 2:45:44 PM
Author: Ales Horak
Comment: copied from private/NlpInPracticeCourse/ParsingCzech

en/NlpInPracticeCourse/2022/ParsingCzech

                       v1
+= Parsing of Czech: Between Rules and Stats =
+[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák
+Prepared by: Miloš Jakubíček
+== State of the Art ==
+=== References ===
+. Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
+. Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
+. Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
+. Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Z. Vetulani & J. Mariani (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155-165). Springer International Publishing.
+== Practical Session ==
+We will develop/adjust the grammar of the SET parser.
+. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
+{{{
+wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
+}}}
+. Unzip the downloaded file
+{{{
+unzip ukol_ia161-parsing.zip
+}}}
+. Go to the unzipped folder
+{{{
+cd ukol_ia161-parsing
+}}}
+. Choose the language you want to work with. The default is Czech (`cs`); to switch to English (`en`), edit the `Makefile`:
+{{{
+nano Makefile
+}}}
+ change the first line to
+{{{
+LANGUAGE=en
+}}}
+. Test the prepared program that analyses 100 selected sentences
+{{{
+make set_trees
+make compare
+}}}
+ The output should be
+{{{
+./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
+UAS =  66.1 %
+}}}
+ You can see a detailed evaluation (sentence by sentence) with
+{{{
+make compare SENTENCES=1
+}}}
+ You can view the differences for a single tree with
+{{{
+make diff SENTENCE=00009
+}}}
+ Exit the diff by pressing `q`.[[br]]
+ You may inspect the tagged vertical text with
+ {{{
+ make vert SENTENCE=00009
+}}}
+ You can view the two trees with (`python3-tk` must be installed on the system)
+ {{{
+make view SENTENCE=00009
+}}}
+ To view the trees remotely, you may run
+ {{{
+make html SENTENCE=00009
+}}}
+ Then point your browser to the `html/index.html` file. [[br]]
+ You can extract the text of the sentence easily with
+ {{{
+make text SENTENCE=00009
+}}}
+ An English translation of the Czech sentences can be obtained via
+ {{{
+make texttrans SENTENCE=00009
+}}}
+. Look at the files:
+ * `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English
+ * `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
+ * `data/trees/set_pdt2_etest` or `set_ud21_gum_dev` - 100 trees produced by SET when running `make set_trees`
+ * `grammar-cs.set` or `grammar-en.set` - the grammar used in running SET
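The UAS (unlabeled attachment score) printed by `make compare` is the share of tokens whose predicted head matches the head in the gold standard tree. The following Python sketch illustrates the metric on hand-made data. It is not the code of `compare_dep_trees.py`: the list-of-head-indices representation, the function name `uas`, and the averaging over all tokens rather than per sentence are assumptions made for illustration only.

```python
def uas(gold_heads, predicted_heads):
    """Return the unlabeled attachment score in percent.

    Both arguments are lists of sentences; each sentence is a list of
    head indices, one per token (0 stands for the root). This data
    layout is an assumption, not the format used by the course scripts.
    """
    total = 0
    correct = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("sentences must have the same number of tokens")
        total += len(gold)
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold, pred))
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * correct / total


# Two toy sentences: the third token of the first sentence has a wrong head.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
print(f"UAS = {uas(gold, pred):5.1f} %")
```

In the toy data four of the five heads agree, so the score is 80.0 %. A grammar change raises the UAS only if it fixes more head attachments than it breaks.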
+== Assignment ==
+. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the Czech grammar are in the [raw-attachment:tagset.pdf Brno tagset].
+. Develop a better grammar by repeating the following process:
+{{{
+nano grammar.set # or use your favourite editor
+make set_trees
+make compare
+}}}
+ to improve the original UAS
+. Write the final UAS into `grammar-cs.set` or `grammar-en.set`:
+{{{
+# This is the SET grammar for Czech used in IA161 course
+#
+# ===========   resulting UAS =  66.9 %  ===================
+}}}
+. Upload your `grammar-cs.set` or `grammar-en.set` to the homework vault.
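When iterating on the grammar, it can help to read the score out of the `make compare` output automatically and to format the header line with the final value. The sketch below is a hypothetical helper, not part of the assignment package: the names `extract_uas` and `header_line` are made up, and the only things assumed about the formats are the `UAS =  66.1 %` line and the grammar header comment shown above.

```python
import re

# Matches lines such as "UAS =  66.1 %" (format taken from the sample output above).
UAS_PATTERN = re.compile(r"UAS\s*=\s*([0-9]+(?:\.[0-9]+)?)\s*%")


def extract_uas(text):
    """Return the first UAS value found in text as a float, or None."""
    match = UAS_PATTERN.search(text)
    return float(match.group(1)) if match else None


def header_line(uas):
    """Format the comment line that records the final UAS in the grammar file."""
    return f"# ===========   resulting UAS =  {uas:.1f} %  ==================="


sample_output = """./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS =  66.1 %
"""

baseline = extract_uas(sample_output)
print(baseline)
print(header_line(66.9))
```

The text passed to `extract_uas` could be the captured output of `make compare`, for example obtained with `subprocess.run(["make", "compare"], capture_output=True, text=True).stdout` when run from the unzipped folder.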