Parsing of Czech: Between Rules and Stats
IA161 NLP in Practice Course, Course Guarantor: Aleš Horák
Prepared by: Miloš Jakubíček, Aleš Horák
State of the Art
References
- Fernández-González, D., & Gómez-Rodríguez, C. (2023). Dependency parsing with bottom-up hierarchical pointer networks. Information Fusion, 91, 494-503.
- Arps, D., Samih, Y., Kallmeyer, L., & Sajjad, H. (2022). Probing for constituency structure in neural language models. arXiv preprint arXiv:2204.06201.
- Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
- Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Z. Vetulani & J. Mariani (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155–165). Springer International Publishing.
Practical Session
Note: If you are new to the command line interface in a terminal window, you may find the tutorial for working in the terminal useful.
We will develop/adjust the grammar of the SET parser (for English or Czech).
- Download the SET parser with the evaluation dataset
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
- Unzip the downloaded file
unzip ukol_ia161-parsing.zip
- Go to the unzipped folder
cd ukol_ia161-parsing
- [optional] Choose the language you want to work with. The default is English (en), which can be changed to Czech (cs) by editing Makefile:
nano Makefile
If you want to work with Czech, change the first line to
LANGUAGE=cs
- Test the prepared program that analyses 100 selected sentences
make set_trees
make compare
The output should be
./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
UAS = 55.4 %
You can see the detailed evaluation (sentence by sentence) with
make compare SENTENCES=1
You can view the differences for one tree with
make diff SENTENCE=academic_librarians-10
The left window, ud21_gum_dev/academic_librarians-10, shows the expected ground truth; the right window, set_ud21_gum_dev/academic_librarians-10, displays the current parsing result (to be improved by you).
Exit the diff by pressing q.
You may inspect the tagged vertical text with
make vert SENTENCE=academic_librarians-10
You can view the two trees with (python3-tk must be installed on the system)
make view SENTENCE=academic_librarians-10
For remote tree viewing (i.e. inspecting the trees on a different computer), you may run
make html SENTENCE=academic_librarians-10
and point your browser to the html/index.html file.
You can extract the text of the sentence easily with
make text SENTENCE=academic_librarians-10
An English translation of the Czech sentences can be obtained via
make texttrans SENTENCE=academic_librarians-10
- Debugging the parsing process can be done using
make debug SENTENCE=academic_librarians-10
which will print the final rules used to build the tree. Adding DETAIL=1 will show all details of the parsing process, including the unused rules.
make debug SENTENCE=academic_librarians-10 DETAIL=1
- Look at the files (you may use the mc file manager, exit it with Esc+0):
  - data/vert/pdt2_etest or ud21_gum_dev - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank positional tagset for Czech and the Penn Treebank tagset for English
  - data/trees/pdt2_etest or ud21_gum_dev - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
  - data/trees/set_pdt2_etest or set_ud21_gum_dev - 100 trees output from SET by running make set_trees
  - grammar-cs.set or grammar-en.set - the grammar used in running SET
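The UAS figure printed by make compare is the unlabeled attachment score: the percentage of tokens whose predicted head matches the head in the gold standard tree. The following is a minimal sketch of that computation, not the code of compare_dep_trees.py. It assumes each tree is given as a list of head indices (one per token, with 0 marking the root); the function name and the sample sentence are hypothetical, and the actual script may differ in details such as how it handles punctuation or how it aggregates over sentences.

```python
from typing import Sequence


def unlabeled_attachment_score(gold_heads: Sequence[int],
                               predicted_heads: Sequence[int]) -> float:
    """Return the percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    # Count tokens attached to the same head in both trees.
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 5-token sentence; head index 0 marks the root.
gold = [2, 0, 2, 5, 3]
predicted = [2, 0, 2, 3, 3]  # the fourth token is attached to the wrong head

score = unlabeled_attachment_score(gold, predicted)
print(f"UAS = {score:.1f} %")  # prints "UAS = 80.0 %"
```

Because only the head of each token is compared, a grammar rule that fixes one frequent attachment error (for example, attaching a determiner to the correct noun) raises the score for every sentence where that pattern occurs.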
Assignment
- Study the SET documentation. The tags used in the English grammar grammar-en.set follow the Penn Treebank tagset, and those in the Czech grammar grammar-cs.set follow the Brno tagset.
- Develop a better grammar. Repeat the following process to improve the original UAS:
nano grammar-en.set # or use your favourite editor
make set_trees
make compare
- Write the final UAS in grammar-cs.set or grammar-en.set
# This is the SET grammar for English used in IA161 course
#
# =========== resulting UAS = 66.9 % ===================
- Upload your grammar-cs.set or grammar-en.set to the homework vault.
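When iterating on the grammar, it can help to record the UAS after each run so that regressions are easy to spot. The sketch below pulls the score out of a line in the "UAS = 55.4 %" form shown above and compares two runs. The helper name parse_uas and the regular expression are assumptions made for illustration; they are not part of the SET distribution.

```python
import re
from typing import Optional

# Matches the score in lines such as "UAS = 55.4 %".
UAS_PATTERN = re.compile(r"UAS\s*=\s*(\d+(?:\.\d+)?)\s*%")


def parse_uas(text: str) -> Optional[float]:
    """Return the first UAS value found in text, or None if there is none."""
    match = UAS_PATTERN.search(text)
    return float(match.group(1)) if match else None


# Sample lines copied from this page: the baseline output of make compare
# and the header comment of a finished grammar file.
baseline_output = "UAS = 55.4 %"
final_header = "# =========== resulting UAS = 66.9 % ==================="

baseline = parse_uas(baseline_output)
final = parse_uas(final_header)
assert baseline is not None and final is not None
print(f"improvement: {final - baseline:+.1f} points")
```

The same function can be applied to the captured output of make compare after each grammar change, for instance by piping it into a small script.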
Last modified on Apr 10, 2022, 9:02:40 PM