1. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
{{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
}}}
1. Unzip the downloaded file
{{{
unzip ukol_ia161-parsing.zip
}}}
1. Go to the unzipped folder
{{{
cd ukol_ia161-parsing
}}}
1. [optional] Choose the language you want to work with. The default is English (`en`); it can be changed to Czech (`cs`) by editing `Makefile`:
{{{
nano Makefile
}}}
If you want to work with Czech, change the first line to
{{{
LANGUAGE=cs
}}}
1. Test the prepared program that analyses 100 selected sentences
{{{
make set_trees
make compare
}}}
The output should be
{{{
./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
UAS = 55.4 %
}}}
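UAS (unlabelled attachment score) is the percentage of tokens whose syntactic head in the parser output matches the head in the gold-standard tree. As a rough illustration of the metric only (not the actual `compare_dep_trees.py` code, which may represent the trees differently), assuming each tree is a list of (token, head index) pairs:
{{{
# Illustrative sketch of UAS over one sentence, assuming trees are
# token-aligned lists of (token, head_index) pairs; the real
# compare_dep_trees.py may use a different tree representation.
def uas(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(1 for (_, gh), (_, ph) in zip(gold, predicted) if gh == ph)
    return 100.0 * correct / len(gold)

# Example: 3 of 4 heads agree -> UAS = 75.0 %
gold = [("The", 3), ("old", 3), ("dog", 4), ("barks", 0)]
pred = [("The", 4), ("old", 3), ("dog", 4), ("barks", 0)]
print("UAS = %.1f %%" % uas(gold, pred))
}}}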
You can see a detailed evaluation (sentence by sentence) with
{{{
make compare SENTENCES=1
}}}
You can view the differences for one tree with
{{{
make diff SENTENCE=academic_librarians-10
}}}
The left window, `ud21_gum_dev/academic_librarians-10`, shows the
expected ground truth; the right window, `set_ud21_gum_dev/academic_librarians-10`, displays the current parsing result (which you will improve).[[br]]
Exit the diff by pressing `q`.[[br]]
You may inspect the tagged vertical text with
{{{
make vert SENTENCE=academic_librarians-10
}}}
You can view the two trees graphically with (`python3-tk` must be installed on the system)
{{{
make view SENTENCE=academic_librarians-10
}}}
For a remote tree view (i.e. inspecting the trees on a different computer), you may run
{{{
make html SENTENCE=academic_librarians-10
}}}
and point your browser to the `html/index.html` file.[[br]]
You can extract the text of the sentence easily with
{{{
make text SENTENCE=academic_librarians-10
}}}
An English translation of the Czech sentences can be obtained with
{{{
make texttrans SENTENCE=academic_librarians-10
}}}
1. Debug the parsing process with
{{{
make debug SENTENCE=academic_librarians-10
}}}
This will print the final rules used to build the tree. Adding
`DETAIL=1` shows all details of the parsing process, including
the unused rules.
{{{
make debug SENTENCE=academic_librarians-10 DETAIL=1
}}}
1. Look at the files (you may use the `mc` file manager; exit it with `Esc+0`):
* `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format (see the illustrative sample after this list).[[br]]
The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English.
* `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
* `data/trees/set_pdt2_etest` or `set_ud21_gum_dev` - 100 trees output by SET when running `make set_trees`
* `grammar-cs.set` or `grammar-en.set` - the grammar used when running SET
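For illustration, the vertical format has one token per line with tab-separated attributes and sentence boundaries marked by structural tags. The sample below is made up: the sentence header, the column order (word, lemma, tag) and the presence of a lemma column are assumptions; check the real layout with `make vert`.
{{{
<s id="example-1">
The     the     DT
dog     dog     NN
barks   bark    VBZ
.       .       .
</s>
}}}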

== Assignment ==

1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the English grammar `grammar-en.set` follow the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset]; the Czech grammar `grammar-cs.set` uses the [raw-attachment:tagset.pdf Brno tagset].
1. Develop a better grammar - repeat the process:
{{{
nano grammar-en.set # or use your favourite editor
make set_trees
make compare
}}}
to improve on the original UAS.
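If you prefer scripting your error analysis instead of reading `make diff` or `make compare SENTENCES=1` output, a small helper along these lines can list head mismatches for a sentence. It reuses the hypothetical (token, head index) representation from the UAS sketch above, so it is only an illustration and does not read the actual tree files produced by SET:
{{{
# Illustrative error-analysis helper, assuming the same hypothetical
# (token, head_index) pair lists as in the UAS sketch above.
def head_mismatches(gold, predicted):
    """Return (position, token, gold_head, predicted_head) for wrongly attached tokens."""
    return [(i, tok, gh, ph)
            for i, ((tok, gh), (_, ph)) in enumerate(zip(gold, predicted), start=1)
            if gh != ph]

gold = [("The", 3), ("old", 3), ("dog", 4), ("barks", 0)]
pred = [("The", 4), ("old", 3), ("dog", 4), ("barks", 0)]
for i, tok, gh, ph in head_mismatches(gold, pred):
    print("token %d %-6s gold head %d, predicted head %d" % (i, tok, gh, ph))
}}}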
1. Write the final UAS as a comment in `grammar-cs.set` or `grammar-en.set`:
{{{
# This is the SET grammar for English used in IA161 course
#
# =========== resulting UAS = 66.9 % ===================
}}}
1. Upload your `grammar-cs.set` or `grammar-en.set` to the homework vault.