Parsing of Czech: Between Rules and Stats
Prepared by: Miloš Jakubíček
State of the Art
- Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
- Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
- Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
- Baisa, V. and Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Vetulani, Z. and Mariani, J., editors, Human Language Technology Challenges for Computer Science and Linguistics, pages 155–165. Springer International Publishing.
We will develop/adjust the grammar of the SET parser.
- Download the SET parser with evaluation dataset
- Unzip the downloaded file
- Go to the unzipped folder
- Choose the language you want to work with. The default is Czech (cs), which can be changed to English (en) via editing
    nano Makefile
  change the first line to
- Test the prepared program that analyses 100 selected sentences
    make set_trees
    make compare
  The output should be
    ./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
    UAS = 66.1 %
  You can see a detailed evaluation (sentence by sentence) with
    make compare SENTENCES=1
  You can watch differences for one tree with
    make diff SENTENCE=00009
  Exit the diff by pressing
  You may inspect the tagged vertical text with
    make vert SENTENCE=00009
  You can watch the two trees with (python3-tk must be installed on the system)
    make view SENTENCE=00009
  For remote tree view, you may run
    make html SENTENCE=00009
  And point your browser to the
  You can extract the text of the sentence easily with
    make text SENTENCE=00009
  English translation of the Czech sentences can be obtained via
    make texttrans SENTENCE=00009
- Look at the files:
    ud21_gum_dev - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank positional tagset for Czech and the Penn Treebank tagset for English
    ud21_gum_dev - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
    set_ud21_gum_dev - 100 trees output from SET by running
    grammar-en.set - the grammar used in running SET
- Study the SET documentation. The tags used in the Czech grammar are in the Brno tagset.
- Develop a better grammar by repeating the process:
    nano grammar.set   # or use your favourite editor
    make set_trees
    make compare
  to improve the original UAS
- Write the final UAS in
    # This is the SET grammar for Czech used in IA161 course
    #
    # =========== resulting UAS = 66.9 % ===================
- Upload your grammar-en.set to the homework vault.
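The UAS value printed by make compare in the steps above is the unlabeled attachment score: the percentage of tokens whose predicted head matches the head in the gold standard tree. The following is a minimal sketch of that metric and not the code of compare_dep_trees.py. It assumes each tree is given as a list of head indices (one per token, 0 for the root) and that the score is pooled over all tokens of all sentences. The function name uas and the toy data are hypothetical, and the real script may treat punctuation or averaging differently.

```python
from typing import Sequence


def uas(gold_heads: Sequence[Sequence[int]],
        pred_heads: Sequence[Sequence[int]]) -> float:
    """Return the unlabeled attachment score in percent.

    Each sentence is a sequence of head indices, one per token
    (assumed representation; 0 marks the root).
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("different number of sentences")
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("sentence length mismatch")
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * correct / total


# Toy data (hypothetical): two sentences with 3 and 2 tokens.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]  # one wrong head in the first sentence
score = uas(gold, pred)
print(f"UAS = {score:.1f} %")  # prints "UAS = 80.0 %"
```

Because the score is pooled over tokens, a grammar rule that fixes attachments in long sentences moves the number more than one that only affects short sentences. The per-sentence view from make compare SENTENCES=1 helps to find where the errors concentrate.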