ParsingCzech

Timestamp:: Oct 28, 2019, 9:19:25 PM
Author:: Ales Horak
Comment:: --

private/NlpInPracticeCourse/ParsingCzech

-                      v18
+                      v19
 == Practical Session ==
+. Go to http://ske.fi.muni.cz, log in, and create a shadow copy of the Czech Wikipedia corpus by clicking on [[Image(add.png,valign=middle,nolink,class=intext)]]''Create grammar development corpus'' (if you do not see this link at the bottom of the main page, ask for it).
+. Develop your own sketch grammar that captures the following semantic relations in this corpus: hypernymy/hyponymy and meronymy/holonymy (hint: use the {{{DUAL}}} directive). Optionally, you can cover more relations (e.g. "is-defined-as").
+    Read the related [https://www.sketchengine.co.uk/writing-sketch-grammars/ documentation]. Start with a couple of simple CQL queries that you pretest in the interface.
+. Expand the grammar iteratively: upload it into the system, have the system compute word sketches, and review the results.
+. When you are happy with the grammar, process the raw !WordSketch data (the output of the `dumpws` command) of your corpus. The data can be obtained in two ways:
+. smaller data (up to 100,000 relations) can be downloaded from the web: [[BR]]
+   `https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]]
+   e.g. [[BR]]
+   https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki [[BR]]
+   [[BR]]
+   First, you have to be authenticated at https://ske.fi.muni.cz/login/.
+   `gramdev_czechwiki` is the ''corpus_id'' of the Czech Wikipedia corpus. [[BR]]
+   Or, if you need more than 100,000 relations, use the second way:
+. log on to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]]
+   {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}} [[BR]]
+   For this you may need to ask for extra permission to access the registry directories.
+. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault.
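The selection step can be sketched in Python as follows. This is only a minimal illustration, not the expected solution: it ASSUMES that every line of the dump holds five tab-separated fields (headword, relation, collocate, frequency, salience score). The real column layout of `dumpws` may differ, so check your own dump and adjust the column indices first. The names `top_pairs` and the column constants are hypothetical.

```python
import heapq
import sys
from collections import defaultdict

# ASSUMPTION: five tab-separated columns per line, in this order.
# Check a few lines of your own dump and adjust these indices if needed.
HEAD, REL, COLL, FREQ, SCORE = range(5)


def top_pairs(lines, n=100):
    """Return {relation: [(score, headword, collocate), ...]}.

    Each list holds the n pairs with the highest score, best first.
    """
    by_rel = defaultdict(list)
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) < 5:
            continue  # skip blank or malformed lines
        try:
            score = float(row[SCORE])
        except ValueError:
            continue  # skip lines whose score column is not numeric
        by_rel[row[REL]].append((score, row[HEAD], row[COLL]))
    return {rel: heapq.nlargest(n, pairs) for rel, pairs in by_rel.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python top_pairs.py dump.txt > top100.tsv
    with open(sys.argv[1], encoding="utf-8") as dump:
        for rel, pairs in sorted(top_pairs(dump).items()):
            for score, head, coll in pairs:
                print(rel, head, coll, score, sep="\t")
```

Grouping by relation before ranking keeps a frequent relation from crowding out the rarer ones; `heapq.nlargest` avoids sorting the full list of pairs for each relation.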
+We will develop/adjust the grammar of the SET parser.
+. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
+{{{
+wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
+}}}
+. Unzip the downloaded file
+{{{
+unzip ukol_ia161-parsing.zip
+}}}
+. Go to the unzipped folder
+{{{
+cd ukol_ia161-parsing
+}}}
+. Test the prepared program that analyses 100 selected sentences
+{{{
+make set_trees
+make compare
+}}}
+ The output should be
+{{{
+./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
+UAS =  66.1 %
+}}}
+ You can see detailed evaluation (sentence by sentence) with
+{{{
+make compare SENTENCES=1
+}}}
+. Look at the files:
+ * `data/vert/pdt2_etest-sel100` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset]
+ * `data/trees/pdt2_etest` - 100 gold standard dependency trees from the Prague Dependency Treebank
+ * `data/trees/set_pdt2_etest-sel100` - 100 trees output from SET by running `make set_trees`
+ * `grammar.set` - the grammar used in running SET
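The UAS (unlabeled attachment score) printed by `make compare` is the percentage of tokens whose predicted head matches the gold-standard head. The sketch below shows the metric itself. It ASSUMES that each tree is already available as a flat list of head indices, one per token. The function name `uas` is hypothetical, and the supplied `compare_dep_trees.py` may differ in details such as how it reads the tree files.

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score in percent.

    Both arguments are sequences of head indices, one per token,
    in the same token order (ASSUMED representation).
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both trees must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Three tokens; the parser attaches the last one to the wrong head.
gold = [2, 0, 2]
predicted = [2, 0, 1]
print(f"UAS = {uas(gold, predicted):5.1f} %")
```

To score a whole test set, concatenate the head lists of all sentences before calling the function, so that every token carries the same weight.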
+== Assignment ==
+. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the grammar are in the [raw-attachment:tagset.pdf Brno tagset].
+. Develop a better grammar - repeat the process:
+{{{
+edit grammar.set # use your favourite editor
+make set_trees
+make compare
+}}}
+ to improve the original UAS
+. Write the final UAS into the header comment of `grammar.set`:
+{{{
+# This is the SET grammar for Czech used in IA161 course
+#
+# ===========   resulting UAS =  66.1 %  ===================
+}}}
+. Upload your `grammar.set` to the homework vault.
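If you prefer not to edit the header by hand, the UAS line can be updated with a short script. This is an optional sketch: it ASSUMES that the header line keeps the `resulting UAS = ... %` shape shown above. The helper name `set_uas` is hypothetical and is not part of the SET distribution.

```python
import re
import sys

# Matches the header line "# ====   resulting UAS =  66.1 %  ====".
UAS_LINE = re.compile(
    r"^(#\s*=+\s*resulting UAS\s*=\s*)[\d.]+(\s*%.*)$", re.MULTILINE
)


def set_uas(grammar_text, uas):
    """Return grammar_text with the number in the UAS header replaced."""
    new_text, n = UAS_LINE.subn(
        lambda m: f"{m.group(1)}{uas:.1f}{m.group(2)}", grammar_text, count=1
    )
    if n == 0:
        raise ValueError("no 'resulting UAS' header line found")
    return new_text


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python set_uas.py grammar.set 68.4
    path, value = sys.argv[1], float(sys.argv[2])
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(set_uas(text, value))
```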