Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/ParsingCzech


Timestamp: Oct 28, 2019, 9:19:25 PM
Author: Ales Horak
  • private/NlpInPracticeCourse/ParsingCzech

    v18 v19  
    1515== Practical Session ==
    1616
    17  1. Go to http://ske.fi.muni.cz, log in and create a shadow copy of the Czech Wikipedia corpus by clicking on [[Image(add.png,valign=middle,nolink,class=intext)]]''Create grammar development corpus'' (if you do not have such a link at the bottom of the main page, ask for it).
    18  1. Develop your own sketch grammar that will capture the following semantic relations in this corpus: hypernymy/hyponymy, meronymy/holonymy (hint: use {{{DUAL}}} directive), optionally you can develop more relations (e.g. "is-defined-as").
    19     Read related [https://www.sketchengine.co.uk/writing-sketch-grammars/ documentation]. Start with a couple of simple CQL queries that you pretest in the interface.
    20  1. You can iteratively expand the grammar, upload it into the system, have the system compute word sketches and review the results
    21  1. When you are happy with the grammar, process the raw !WordSketch data (output of `dumpws` command) of your corpus. The data can be obtained in two ways:
    22   1. smaller data (up to 100,000 relations) can be downloaded from web: [[BR]]
    23    `https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]]
    24    e.g. [[BR]]
    25    https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki [[BR]]
    26    [[BR]]
    27    First, you have to be authenticated at https://ske.fi.muni.cz/login/.
    28    `gramdev_czechwiki` is the ''corpus_id'' of the Czech Wikipedia corpus. [[BR]]
    29    Or, if you need more than 100,000 relations, use the second way:
    30   1. log on to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]]
    31    {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}} [[BR]]
    32    For this you may need to ask for extra permission to registry directories.
    33  5. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault.
     17We will develop/adjust the grammar of the SET parser.
     18
     191. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
     20{{{
     21wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
     22}}}
     231. Unzip the downloaded file
     24{{{
     25unzip ukol_ia161-parsing.zip
     26}}}
     271. Go to the unzipped folder
     28{{{
     29cd ukol_ia161-parsing
     30}}}
     311. Test the prepared program that analyses 100 selected sentences
     32{{{
     33make set_trees
     34make compare
     35}}}
     36 The output should be
     37{{{
     38./compare_dep_trees.py data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
     39UAS =  66.1 %
     40}}}
     41 You can see detailed evaluation (sentence by sentence) with
     42{{{
     43make compare SENTENCES=1
     44}}}
     451. Look at the files:
     46 * `data/vert/pdt2_etest-sel100` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset]
     47 * `data/trees/pdt2_etest` - 100 gold standard dependency trees from the Prague Dependency Treebank
     48 * `data/trees/set_pdt2_etest-sel100` - 100 trees output from SET by running `make set_trees`
     49 * `grammar.set` - the grammar used in running SET
     50
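The UAS (unlabeled attachment score) reported by `make compare` is the percentage of tokens whose predicted head matches the gold-standard head. The sketch below illustrates the metric only; it is not the course's `compare_dep_trees.py`, and the representation of each sentence as a list of head indices (with `0` for the root) is an assumption made for the example.

```python
from typing import Sequence


def uas(gold: Sequence[Sequence[int]], predicted: Sequence[Sequence[int]]) -> float:
    """Return the unlabeled attachment score as a percentage.

    Each sentence is a sequence of head indices, one per token
    (hypothetical representation: 0 marks the root token).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("sentence length mismatch")
        total += len(gold_heads)
        # A token counts as correct when its predicted head equals the gold head.
        correct += sum(g == p for g, p in zip(gold_heads, pred_heads))
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return 100.0 * correct / total


# Toy data: two sentences, five tokens, one wrong head in the first sentence.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
print(f"UAS = {uas(gold, pred):5.1f} %")  # prints "UAS =  80.0 %"
```

Because the score is computed over all tokens together, long sentences weigh more than short ones; the per-sentence view from `make compare SENTENCES=1` helps locate where the grammar loses the most attachments.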
     51== Assignment ==
     52
     531. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the grammar are in the [raw-attachment:tagset.pdf Brno tagset].
     541. Develop a better grammar - repeat the process:
     55{{{
     56edit grammar.set # use your favourite editor
     57make set_trees
     58make compare
     59}}}
     60 to improve the original UAS
     611. Write the final UAS into the header comment of `grammar.set`
     62{{{
     63# This is the SET grammar for Czech used in IA161 course
     64#
     65# ===========   resulting UAS =  66.1 %  ===================
     66}}}
     671. Upload your `grammar.set` to the homework vault.
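Recording the final UAS in the `grammar.set` header can also be scripted rather than edited by hand. The following is a minimal sketch, assuming the header line keeps the `resulting UAS =  66.1 %` shape shown above; the function name `set_reported_uas` and the value `67.3` are hypothetical and used only for illustration.

```python
import re

# Matches the number in a header line such as
# "# ===========   resulting UAS =  66.1 %  ===================".
UAS_LINE = re.compile(r"(resulting UAS =\s*)\d+(?:\.\d+)?(\s*%)")


def set_reported_uas(grammar_text: str, uas: float) -> str:
    """Return grammar_text with the first reported UAS value replaced."""
    new_text, count = UAS_LINE.subn(
        lambda m: f"{m.group(1)}{uas:.1f}{m.group(2)}",
        grammar_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no 'resulting UAS' header line found")
    return new_text


header = "# ===========   resulting UAS =  66.1 %  ===================\n"
print(set_reported_uas(header, 67.3), end="")
```

To apply it to the real file, read `grammar.set`, pass its text through the function, and write the result back before uploading.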