
Parsing of Czech: Between Rules and Stats

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček

Zhang, Y., Zhou, H., & Li, Z. (2020). Fast and Accurate Neural CRF Constituency Parsing. arXiv preprint arXiv:2008.03736.
Qi, P., Dozat, T., Zhang, Y., & Manning, C. D. (2019). Universal dependency parsing from scratch. arXiv preprint arXiv:1901.10457.
Straka, M., Straková, J., & Hajič, J. (2019). Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In International Conference on Text, Speech, and Dialogue (pp. 137-150). Springer, Cham.
Baisa, V., & Kovář, V. (2014). Information extraction for Czech based on syntactic analysis. In Vetulani, Z., & Mariani, J. (Eds.), Human Language Technology Challenges for Computer Science and Linguistics (pp. 155-165). Springer International Publishing.

We will develop/adjust the grammar of the SET parser.

Download the assignment archive
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip

Unzip the downloaded file
```
unzip ukol_ia161-parsing.zip
```
Go to the unzipped folder
```
cd ukol_ia161-parsing
```
Test the prepared program that analyses 100 selected sentences
```
make set_trees
make compare
```
The output should be
```
./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS =  66.1 %
```
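UAS (unlabeled attachment score) is the percentage of tokens whose predicted head matches the head in the gold standard tree. The following is a minimal sketch of that computation for a single sentence. The list-of-head-indices representation and the function name `uas` are assumptions made for illustration; how `compare_dep_trees.py` actually reads the tree files and aggregates over the 100 sentences is not described here and may differ.

```python
from typing import Sequence


def uas(gold_heads: Sequence[int], predicted_heads: Sequence[int]) -> float:
    """Return the percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 5-token sentence: entry i holds the head of token i + 1,
# and head 0 stands for the artificial root node.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]  # the fourth token is attached to the wrong head

score = uas(gold, pred)
print(f"UAS = {score:5.1f} %")
```

Four of the five heads agree in this example, so a single wrong attachment in a short sentence already costs 20 percentage points for that sentence.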
You can see a detailed evaluation (sentence by sentence) with
```
make compare SENTENCES=1
```
You can view the differences for a single tree with
```
make diff SENTENCE=00009
```
Exit the diff by pressing q.
You can view both trees with the following command (python-qt4 must be installed on the system)
```
make view SENTENCE=00009
```
To view the trees remotely, run
```
make html SENTENCE=00009
```
Then point your browser to the html/index.html file.
You can easily extract the text of the sentence (e.g. for Google Translate) with
```
make text SENTENCE=00009
```
Look at the files:
- data/vert/pdt2_etest-sel100 - 100 input sentences in vertical format. The tags use the Prague Dependency Treebank positional tagset.
- data/trees/pdt2_etest - 100 gold standard dependency trees from the Prague Dependency Treebank
- data/trees/set_pdt2_etest-sel100 - 100 trees produced by SET when running make set_trees
- grammar.set - the grammar used when running SET
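When inspecting the input data, it can help to decode the positional tags. A Prague Dependency Treebank positional tag has 15 characters, each encoding one morphological category, with `-` meaning "not applicable". The sketch below reads a vertical text and decodes its tags. The column order (word, lemma, tag, tab-separated) and the `<s>`/`</s>` sentence markup are assumptions about the file layout, and the two-token sample is made up for illustration; check the actual file before relying on this.

```python
# Names of the 15 positions of a PDT positional tag, in order.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]


def decode_tag(tag):
    """Map a 15-character positional tag to its non-empty categories."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


def read_vertical(text):
    """Split vertical text into sentences of (word, lemma, tag) triples.

    Assumes one token per line with tab-separated columns; lines that start
    with '<' and contain no tab are treated as structural markup.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("<") and "\t" not in line:
            if line.startswith("</s") and current:
                sentences.append(current)
                current = []
            continue
        word, lemma, tag = line.split("\t")[:3]
        current.append((word, lemma, tag))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, not taken from the assignment data.
sample = "\n".join([
    "<s>",
    "Praha\tPraha\tNNFS1-----A----",
    "spí\tspát\tVB-S---3P-AA---",
    "</s>",
])

sentences = read_vertical(sample)
for word, lemma, tag in sentences[0]:
    print(word, lemma, decode_tag(tag))
```

Note that the grammar itself uses the Brno tagset, so a decoder like this is only useful for reading the input and gold data, not the grammar rules.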

Study the SET documentation. The tags used in the grammar are in the Brno tagset.

Develop a better grammar by repeating the following cycle until the UAS improves on the original value:

edit grammar.set # use your favourite editor
make set_trees
make compare
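While iterating on the grammar, it may be convenient to check each result against the baseline automatically. The sketch below extracts the score from text in the format shown earlier (`UAS =  66.1 %`) and reports the change. The helper name `parse_uas` and the constant `BASELINE` are hypothetical, and the sample string simply repeats the expected output of the unmodified grammar.

```python
import re

# Matches a line such as "UAS =  66.1 %" and captures the number.
UAS_LINE = re.compile(r"UAS\s*=\s*([0-9]+(?:\.[0-9]+)?)\s*%")

BASELINE = 66.1  # UAS of the unmodified grammar, as stated above


def parse_uas(output):
    """Return the UAS value found in the output of the comparison step."""
    match = UAS_LINE.search(output)
    if match is None:
        raise ValueError("no UAS line found in the output")
    return float(match.group(1))


# Sample output copied from the expected result of the unmodified grammar.
sample_output = (
    "./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100\n"
    "UAS =  66.1 %\n"
)

current = parse_uas(sample_output)
print(f"UAS = {current:.1f} % ({current - BASELINE:+.1f} points against the baseline)")
```

In practice the text could come from running the comparison yourself, for example via `subprocess.run(["make", "compare"], capture_output=True, text=True).stdout`, assuming `make compare` prints the score to standard output.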

Write your final UAS into the corresponding comment line in grammar.set:

# This is the SET grammar for Czech used in IA161 course
# 
# ===========   resulting UAS =  66.1 %  ===================

Last modified on Aug 31, 2021, 2:11:28 PM