Changes between Version 18 and Version 19 of private/NlpInPracticeCourse/ParsingCzech

Oct 28, 2019, 9:19:25 PM
Ales Horak



  • private/NlpInPracticeCourse/ParsingCzech

    v18 → v19

== Practical Session ==
Removed (v18):

 1. Log in and create a shadow copy of the Czech Wikipedia corpus by clicking on [[Image(add.png,valign=middle,nolink,class=intext)]]''Create grammar development corpus'' (if you do not have such a link at the bottom of the main page, ask for it).
 1. Develop your own sketch grammar that captures the following semantic relations in this corpus: hypernymy/hyponymy and meronymy/holonymy (hint: use the {{{DUAL}}} directive); optionally you can develop more relations (e.g. "is-defined-as").
    Read the related [ documentation]. Start with a couple of simple CQL queries that you pretest in the interface.
 1. You can iteratively expand the grammar, upload it into the system, have the system compute the word sketches, and review the results.
 1. When you are happy with the grammar, process the raw !WordSketch data (the output of the `dumpws` command) of your corpus. The data can be obtained in two ways:
  1. smaller data (up to 100,000 relations) can be downloaded from the web: [[BR]]
     `<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]]
     First, you have to be authenticated at
     `gramdev_czechwiki` is the ''corpus_id'' of the Czech Wikipedia corpus. [[BR]]
     Or, if you need more than 100,000 relations, you can use the other way:
  1. log on to the {{{}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]]
     {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}} [[BR]]
     For this you may need to ask for extra permission to the registry directories.
 5. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault.
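The selection step above can be sketched in a few lines of Python. The column layout assumed here (tab-separated relation, headword, collocate, salience) is an assumption — inspect the actual `dumpws` output and adjust the field indices accordingly:

```python
from collections import defaultdict

def top_salient_pairs(lines, k=100):
    """Keep the k most salient headword-collocate pairs per relation.

    ASSUMPTION: each input line is tab-separated as
    relation<TAB>headword<TAB>collocate<TAB>salience -- check the real
    dumpws output format and adjust the indices below if needed.
    """
    by_relation = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        relation, headword, collocate, salience = fields[:4]
        try:
            score = float(salience)
        except ValueError:
            continue  # skip header-like lines
        by_relation[relation].append((score, headword, collocate))
    # sort by salience, highest first, and truncate to k pairs
    return {rel: sorted(pairs, reverse=True)[:k]
            for rel, pairs in by_relation.items()}

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as f:
        for rel, pairs in top_salient_pairs(f).items():
            for score, head, coll in pairs:
                print(f"{rel}\t{head}\t{coll}\t{score}")
```

Run it as `python gramdev_dump.txt > top100.txt` and upload the result.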
Added (v19):

We will develop/adjust the grammar of the SET parser.
 1. Download the [[htdocs:bigdata/|SET parser with evaluation dataset]].
 1. Unzip the downloaded file.
 1. Go to the unzipped folder:
{{{
cd ukol_ia161-parsing
}}}
 1. Test the prepared program that analyses 100 selected sentences:
{{{
make set_trees
make compare
}}}
    The output should be:
{{{
./ data/trees/pdt2_etest data/trees/set_pdt2_etest-sel100
UAS =  66.1 %
}}}
    You can see a detailed evaluation (sentence by sentence) with:
{{{
make compare SENTENCES=1
}}}
 1. Look at the files:
  * `data/vert/pdt2_etest-sel100` - 100 input sentences in vertical format. The tag format is the Prague Dependency Treebank [ positional tagset].
  * `data/trees/pdt2_etest` - 100 gold-standard dependency trees from the Prague Dependency Treebank.
  * `data/trees/set_pdt2_etest-sel100` - 100 trees output by SET when running `make set_trees`.
  * `grammar.set` - the grammar used when running SET.
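The vertical files listed above can be inspected with a short Python reader. This is a minimal sketch assuming the usual word/lemma/tag column order and `<s>`/`</s>` sentence markers — check the actual file before relying on it:

```python
def read_vertical(lines):
    """Parse corpus 'vertical' text: one token per line, tab-separated
    columns, sentences delimited by <s>...</s> structural tags.
    ASSUMPTION: columns are word, lemma, tag in that order."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<"):      # structural tag: <s>, </s>, <doc ...>
            if line.startswith("</s") and current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            current.append((cols[0], cols[1], cols[2]))  # word, lemma, tag
    if current:                        # flush a trailing open sentence
        sentences.append(current)
    return sentences
```

Each sentence comes back as a list of `(word, lemma, tag)` triples, which is handy for counting tag frequencies when debugging grammar rules.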
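For reference, UAS (unlabeled attachment score), the number reported by `make compare`, is simply the percentage of tokens whose predicted head matches the gold head. A toy sketch of the metric (the real `` also handles the tree file format; here trees are plain lists of head indices):

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score: percentage of tokens whose predicted
    head index equals the gold head index."""
    assert len(gold_heads) == len(predicted_heads)
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)

# toy 5-token sentence, heads given as token indices (0 = artificial root)
gold = [0, 1, 1, 5, 3]
pred = [0, 1, 2, 5, 3]
print(f"UAS = {uas(gold, pred):.1f} %")  # -> UAS = 80.0 %
```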
== Assignment ==
 1. Study the [ SET documentation]. The tags used in the grammar are described in the [raw-attachment:tagset.pdf Brno tagset].
 1. Develop a better grammar - repeat the process:
{{{
edit grammar.set   # use your favourite editor
make set_trees
make compare
}}}
    to improve the original UAS.
 1. Write the final UAS in `grammar.set`:
{{{
# This is the SET grammar for Czech used in the IA161 course
# ===========   resulting UAS =  66.1 %  ===================
}}}
 1. Upload your `grammar.set` to the homework vault.