Changes between Version 5 and Version 6 of private/NlpInPracticeCourse/ParsingCzech


Timestamp:
Oct 26, 2015, 2:42:12 PM (9 years ago)
Author:
Ales Horak
Comment:

--

  • private/NlpInPracticeCourse/ParsingCzech

    v5 v6  
    15 15 == Practical Session ==
    16 16
    17  1. Go to http://ske.fi.muni.cz, log in, and create a shadow copy of the Czech Wikipedia corpus by clicking on "Create grammar development corpus".
     17 1. Go to http://ske.fi.muni.cz, log in, and create a shadow copy of the Czech Wikipedia corpus by clicking on "Create grammar development corpus" (if you do not see this link at the bottom of the main page, ask for it).
    18 18 1. Develop your own sketch grammar that captures the following semantic relations in this corpus: hypernymy/hyponymy and meronymy/holonymy (hint: use the {{{DUAL}}} directive). Optionally, you can develop more relations (e.g. "is-defined-as").
    19 19    Read the related [https://www.sketchengine.co.uk/writing-sketch-grammars/ documentation]. Start with a couple of simple CQL queries that you pretest in the interface.
    20 20 1. You can iteratively expand the grammar, upload it into the system, have the system compute word sketches, and review the results.
    21  1. When you are happy with the grammar, log on to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database:
    22 
    23     {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/<YOUR_CORPUS_ID>}}}
     21 1. When you are happy with the grammar, process the raw WordSketch data (the output of the `dumpws` command) of your corpus. The data can be obtained in two ways:
     22  1. Smaller data sets (up to 100,000 relations) can be downloaded from the web: [[BR]]
     23   `https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]]
     24   e.g. [[BR]]
     25   https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki [[BR]]
     26   [[BR]]
     27   `gramdev_czechwiki` is the `<corpus_id>` of the Czech Wikipedia corpus.
     28   If you need more than 100,000 relations, use the second option instead.
     29  1. Log on to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]]
     30   {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}}
    24 31 5. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault.
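
The selection in the last step could be sketched in Python roughly as follows. This is only an illustration under a loud assumption: each line of the `dumpws` output is taken to be tab-separated with the columns headword, relation, collocate, frequency, salience score. That column layout is not taken from the `dumpws` documentation, so check the real output and adjust the field indices. The function names `top_pairs_per_relation` and `write_top_pairs` are hypothetical helpers, not part of any Sketch Engine tool.

```python
import heapq
from collections import defaultdict


def top_pairs_per_relation(lines, n=100):
    """Return {relation: [(headword, collocate, score), ...]} with the n
    highest-scoring pairs per relation, best first.

    ASSUMPTION: every input line is tab-separated as
    headword, relation, collocate, frequency, salience score.
    Lines that do not fit this layout are skipped.
    """
    by_relation = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a data line under the assumed layout
        headword, relation, collocate = fields[0], fields[1], fields[2]
        try:
            score = float(fields[4])
        except ValueError:
            continue  # score column is not numeric
        by_relation[relation].append((score, headword, collocate))

    result = {}
    for relation, rows in by_relation.items():
        # Keep only the n rows with the largest salience score.
        best = heapq.nlargest(n, rows, key=lambda row: row[0])
        result[relation] = [(h, c, s) for s, h, c in best]
    return result


def write_top_pairs(in_stream, out_stream, n=100):
    """Read assumed dumpws lines from in_stream and write
    relation, headword, collocate, score as tab-separated lines."""
    top = top_pairs_per_relation(in_stream, n)
    for relation in sorted(top):
        for headword, collocate, score in top[relation]:
            out_stream.write(f"{relation}\t{headword}\t{collocate}\t{score}\n")
```

To use it as a filter, call `write_top_pairs(sys.stdin, sys.stdout)` after `import sys` and pipe the `dumpws` output into the script. `heapq.nlargest` avoids sorting every relation in full, which matters little at this scale but keeps memory use modest for large exports.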