17 | | 1. Go to http://ske.fi.muni.cz, log in and create a shadow copy of the Czech Wikipedia corpus by clicking on [[Image(add.png,valign=middle,nolink,class=intext)]]''Create grammar development corpus'' (if you do not see this link at the bottom of the main page, ask for it). |
18 | | 1. Develop your own sketch grammar that will capture the following semantic relations in this corpus: hypernymy/hyponymy, meronymy/holonymy (hint: use {{{DUAL}}} directive), optionally you can develop more relations (e.g. "is-defined-as"). |
19 | | Read related [https://www.sketchengine.co.uk/writing-sketch-grammars/ documentation]. Start with a couple of simple CQL queries that you pretest in the interface. |
20 | | 1. You can iteratively expand the grammar, upload it into the system, have the system compute word sketches and review the results. |
21 | | 1. When you are happy with the grammar, process the raw !WordSketch data (output of `dumpws` command) of your corpus. The data can be obtained in two ways: |
22 | | 1. smaller data (up to 100,000 relations) can be downloaded from web: [[BR]] |
23 | | `https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]] |
24 | | e.g. [[BR]] |
25 | | https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki [[BR]] |
26 | | [[BR]] |
27 | | First, you have to be authenticated at https://ske.fi.muni.cz/login/. |
28 | | `gramdev_czechwiki` is the ''corpus_id'' of the Czech Wikipedia corpus. [[BR]] |
29 | | Or, if you need more than 100,000 relations, you can use the second way: |
30 | | 1. log in to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]] |
31 | | {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}} [[BR]] |
32 | | For this you may need to ask for extra permission to registry directories. |
33 | | 5. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault. |
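The selection step above could be sketched in Python as follows. The column layout of the {{{dumpws}}} output is not given here, so the sketch assumes a hypothetical tab-separated format of `relation`, `headword`, `collocate`, `salience`; the function name `top_pairs` is also only illustrative. Adjust the field positions to the real output before using it.

```python
from collections import defaultdict


def top_pairs(lines, limit=100):
    """Group rows by relation and keep the `limit` most salient pairs of each.

    ASSUMPTION: every row is tab-separated as
    relation<TAB>headword<TAB>collocate<TAB>salience
    This layout is hypothetical; check it against the real dumpws output.
    """
    by_relation = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip rows that do not match the assumed layout
        relation, headword, collocate, score = fields[:4]
        try:
            by_relation[relation].append((float(score), headword, collocate))
        except ValueError:
            continue  # skip rows whose salience is not a number
    # Sort each relation by salience, highest first, and cut to `limit` rows.
    return {
        relation: sorted(rows, key=lambda row: row[0], reverse=True)[:limit]
        for relation, rows in by_relation.items()
    }


# Tiny made-up sample in the assumed layout (not real corpus data).
sample = [
    "hypernym\tpes\tzvíře\t9.5",
    "hypernym\tkočka\tzvíře\t11.2",
    "meronym\tkolo\tauto\t7.0",
]

for relation, rows in top_pairs(sample, limit=1).items():
    for score, headword, collocate in rows:
        print(f"{relation}\t{headword}\t{collocate}\t{score}")
```

For real data, the list of sample rows would be replaced by the lines of the downloaded or exported file.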
| 17 | We will develop/adjust the grammar of the SET parser.
| 18 | |
| 19 | 1. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]] |
| 20 | {{{ |
| 21 | wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip |
| 22 | }}} |
| 23 | 1. Unzip the downloaded file |
| 24 | {{{ |
| 25 | unzip ukol_ia161-parsing.zip |
| 26 | }}} |
| 27 | 1. Go to the unzipped folder
| 28 | {{{ |
| 29 | cd ukol_ia161-parsing |
| 30 | }}} |
| 31 | 1. Test the prepared program that analyses 100 selected sentences |
| 32 | {{{ |
| 33 | make set_trees |
| 34 | make compare |
| 35 | }}} |
| 36 | The output should be |
| 37 | {{{ |
| 38 | ./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100 |
| 39 | UAS = 66.1 % |
| 40 | }}} |
| 41 | You can see a detailed evaluation (sentence by sentence) with
| 42 | {{{ |
| 43 | make compare SENTENCES=1 |
| 44 | }}} |
| 45 | 1. Look at the files: |
| 46 | * `data/vert/pdt2_etest-sel100` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] |
| 47 | * `data/trees/pdt2_etest` - 100 gold standard dependency trees from the Prague Dependency Treebank |
| 48 | * `data/trees/set_pdt2_etest-sel100` - 100 trees output from SET by running `make set_trees` |
| 49 | * `grammar.set` - the grammar used in running SET |
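The UAS (unlabeled attachment score) reported by `make compare` is the percentage of tokens whose head in the parser output matches the head in the gold standard tree. The sketch below illustrates that definition only; it is not the code of `compare_dep_trees.py`, and the function name `uas` and the representation of a tree as a list of head indices are assumptions made for the example.

```python
def uas(gold_heads, predicted_heads):
    """Return the unlabeled attachment score in percent.

    ASSUMPTION: each tree is given as a list with one head index per token
    (0 stands for the root). The real tree files may be stored differently.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        return 0.0
    # A token counts as correct when its predicted head equals the gold head.
    correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
    return 100.0 * correct / len(gold_heads)


# Made-up five-token sentence: the fourth token gets the wrong head.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(f"UAS = {uas(gold, pred):.1f} %")  # prints "UAS = 80.0 %"
```

Because only head attachments are compared, a grammar change raises the score only when it moves tokens under the heads they have in the gold trees.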
| 50 | |
| 51 | == Assignment == |
| 52 | |
| 53 | 1. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the grammar are in the [raw-attachment:tagset.pdf Brno tagset]. |
| 54 | 1. Develop a better grammar - repeat the process:
| 55 | {{{ |
| 56 | edit grammar.set # use your favourite editor |
| 57 | make set_trees |
| 58 | make compare |
| 59 | }}} |
| 60 | to improve the original UAS |
| 61 | 1. Write the final UAS in the header of `grammar.set`
| 62 | {{{ |
| 63 | # This is the SET grammar for Czech used in IA161 course |
| 64 | # |
| 65 | # =========== resulting UAS = 66.1 % =================== |
| 66 | }}} |
| 67 | 1. Upload your `grammar.set` to the homework vault. |