Parsing of Czech: Between Rules and Stats
IA161 Advanced NLP Course, Course Guarantee: Aleš Horák
Prepared by: Miloš Jakubíček
State of the Art
References
- Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
- Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
- Straka, M., Straková, J., & Hajič, J. (2019, September). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
Practical Session
We will develop/adjust the grammar of the SET parser.
- Download the SET parser with the evaluation dataset
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
- Unzip the downloaded file
unzip ukol_ia161-parsing.zip
- Go to the unzipped folder
cd ukol_ia161-parsing
- Test the prepared program that analyses 100 selected sentences
make set_trees
make compare
The output should be
./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS = 66.1 %
You can see a detailed evaluation (sentence by sentence) with
make compare SENTENCES=1
You can view the differences for one tree with
make diff SENTENCE=00009
Exit the diff by pressing q.
You can view the two trees with (python-qt4 must be installed on the system)
make view SENTENCE=00009
You can extract the text of the sentence (e.g. for Google Translate) with
make text SENTENCE=00009
- Look at the files:
  - data/vert/pdt2_etest-sel100 - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank positional tagset.
  - data/trees/pdt2_etest - 100 gold standard dependency trees from the Prague Dependency Treebank
  - data/trees/set_pdt2_etest-sel100 - 100 trees output by SET when running make set_trees
  - grammar.set - the grammar used when running SET
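The UAS (unlabeled attachment score) reported by make compare is the share of tokens whose head in the SET output matches the head in the gold standard tree. The following is a minimal sketch of that computation, assuming each tree is given as a plain list of head indices. The function name uas and the toy data are hypothetical. The on-disk format of data/trees and the internals of compare_dep_trees.py are not described here, and the script may aggregate differently (for example per sentence rather than per token).

```python
from typing import Sequence


def uas(gold_trees: Sequence[Sequence[int]],
        pred_trees: Sequence[Sequence[int]]) -> float:
    """Return the percentage of tokens whose predicted head equals the gold head.

    Each tree is a list of head indices, one per token (assumed representation:
    1-based token positions, 0 for the root). Scores are averaged over tokens.
    """
    if len(gold_trees) != len(pred_trees):
        raise ValueError("gold and predicted data must contain the same number of sentences")
    correct = 0
    total = 0
    for gold, pred in zip(gold_trees, pred_trees):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted trees must have the same number of tokens")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * correct / total


# Toy data, made up for illustration: two sentences with 3 and 4 tokens.
gold = [[2, 0, 2], [0, 1, 4, 1]]
pred = [[2, 0, 1], [0, 1, 1, 1]]
score = uas(gold, pred)
print(f"UAS = {score:.1f} %")  # 5 of 7 heads match: UAS = 71.4 %
```

Averaging over tokens means long sentences weigh more than short ones, so fixing a rule that affects many long sentences moves the score more than fixing one short sentence completely.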
Assignment
- Study the SET documentation. The tags used in the grammar are in the Brno tagset.
- Develop a better grammar that improves on the original UAS (unlabeled attachment score) by repeating this process:
edit grammar.set # use your favourite editor
make set_trees
make compare
- Write the final UAS in grammar.set:
# This is the SET grammar for Czech used in IA161 course
#
# =========== resulting UAS = 66.1 % ===================
- Upload your grammar.set to the homework vault.
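The input sentences carry Prague Dependency Treebank positional tags, while the grammar rules use Brno tags, so it helps to be able to read both when comparing a rule with the data. The sketch below decodes one tag of each kind. The function names and the two example tags are illustrative assumptions, and the Brno decoder assumes every attribute is one letter followed by a one-character value; the tagset documentation remains the authority for the full inventory of attributes and values.

```python
# Names of the 15 positions of a PDT positional tag, in order.
PDT_POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_pdt_tag(tag: str) -> dict:
    """Map a 15-character positional tag to {position name: value}.

    Positions holding '-' (not applicable) are left out of the result.
    """
    if len(tag) != len(PDT_POSITIONS):
        raise ValueError(f"expected {len(PDT_POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(PDT_POSITIONS, tag) if value != "-"}


def decode_brno_tag(tag: str) -> dict:
    """Split an attribute-value tag such as 'k1gFnSc1' into {attribute: value}.

    Assumption: each attribute is a single letter followed by a single-character value.
    """
    if len(tag) % 2 != 0:
        raise ValueError("expected an even number of characters (attribute-value pairs)")
    return {tag[i]: tag[i + 1] for i in range(0, len(tag), 2)}


# Both example tags describe a feminine singular noun in the nominative.
pdt = decode_pdt_tag("NNFS1-----A----")
brno = decode_brno_tag("k1gFnSc1")
print(pdt)
print(brno)
```

In the Brno tag, k is the part of speech (1 is a noun), g the gender, n the number and c the case, which correspond to the POS, Gender, Number and Case positions of the positional tag.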