Parsing of Czech: Between Rules and Stats
IA161 Advanced NLP Course, Course Guarantee: Aleš Horák
Prepared by: Miloš Jakubíček
State of the Art
References
- Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
- Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
- Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
- Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Z. Vetulani & J. Mariani (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155–165). Springer International Publishing.
Practical Session
We will develop/adjust the grammar of the SET parser.
- Download the SET parser with the evaluation dataset
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
- Unzip the downloaded file
unzip ukol_ia161-parsing.zip
- Go to the unzipped folder
cd ukol_ia161-parsing
- Test the prepared program that analyses 100 selected sentences
make set_trees
make compare
The output should be:
./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS = 66.1 %
You can see a detailed evaluation (sentence by sentence) with
make compare SENTENCES=1
You can view the differences for one tree with
make diff SENTENCE=00009
Exit the diff by pressing q.
You can view the two trees with (python-qt4 must be installed on the system)
make view SENTENCE=00009
For a remote tree view, you can run
make html SENTENCE=00009
and point your browser to the html/index.html file.
You can easily extract the text of the sentence (e.g. for Google Translate) with
make text SENTENCE=00009
- Look at the files:
data/vert/pdt2_etest-sel100
- 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank positional tagset.
data/trees/pdt2_etest
- 100 gold standard dependency trees from the Prague Dependency Treebank
data/trees/set_pdt2_etest-sel100
- 100 trees output from SET by running make set_trees
grammar.set
- the grammar used in running SET
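The input tags in data/vert/pdt2_etest-sel100 use the Prague Dependency Treebank positional tagset, in which each tag is a fixed-length string of 15 characters and a hyphen marks a category that does not apply. The following is a minimal sketch of how such a tag can be unpacked when inspecting the input. The function name decode_pdt_tag and the example tag are illustrative assumptions and are not part of the downloaded package.

```python
# Names of the 15 positions of a PDT positional tag, in order.
PDT_POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
)


def decode_pdt_tag(tag):
    """Map a 15-character positional tag to {category: value}.

    Positions filled with '-' (category not applicable) are skipped.
    Hypothetical helper, written for illustration only.
    """
    if len(tag) != len(PDT_POSITIONS):
        raise ValueError("expected a 15-character tag, got %r" % tag)
    return {name: value for name, value in zip(PDT_POSITIONS, tag) if value != "-"}


# Illustrative tag: noun, feminine, singular, nominative, affirmative.
example = decode_pdt_tag("NNFS1-----A----")
print(example)
```

Note that this covers only the tags in the input data; the grammar itself uses the Brno tagset, which has a different (attribute-value) structure.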
Assignment
- Study the SET documentation. The tags used in the grammar are in the Brno tagset.
- Develop a better grammar by repeating the following steps to improve the original UAS:
edit grammar.set # use your favourite editor
make set_trees
make compare
- Write the final UAS in grammar.set:
# This is the SET grammar for Czech used in IA161 course
#
# =========== resulting UAS = 66.1 % ===================
- Upload your grammar.set to the homework vault.
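The UAS (unlabeled attachment score) that the assignment asks you to improve is the percentage of tokens whose predicted head matches the head in the gold standard tree. The sketch below shows the idea on plain lists of head indices. It is not the code of compare_dep_trees.py: the function name uas, the list-based tree representation, and the choice to count all tokens across all sentences together are assumptions made for illustration, and the provided script may differ in details such as the handling of punctuation.

```python
def uas(gold_sentences, predicted_sentences):
    """Unlabeled attachment score in percent.

    Each sentence is a list of head indices, one per token
    (0 denotes the artificial root). Tokens are pooled over
    all sentences before the ratio is taken.
    Hypothetical helper, written for illustration only.
    """
    if len(gold_sentences) != len(predicted_sentences):
        raise ValueError("different number of sentences")
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("sentence lengths differ")
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return 100.0 * correct / total if total else 0.0


# Two toy sentences: one wrong head out of five tokens.
gold = [[2, 0, 2], [0, 1]]
predicted = [[2, 0, 1], [0, 1]]
print("UAS = %.1f %%" % uas(gold, predicted))  # prints "UAS = 80.0 %"
```

Because the score is counted per token, a grammar change that fixes one frequent attachment pattern can move the result more than a change that fixes several rare ones.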
Attachments
- tagset.pdf (120.2 KB)