Parsing of Czech: Between Rules and Stats

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček

State of the Art

References

  1. PEI, Wenzhe; GE, Tao; CHANG, Baobao. An effective neural network model for graph-based dependency parsing. In: Proc. of ACL. 2015.
  2. CHOI, Jinho D.; TETREAULT, Joel; STENT, Amanda. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In: Proc. of ACL. 2015.
  3. DURRETT, Greg; KLEIN, Dan. Neural CRF Parsing. In: Proc. of ACL. 2015.

Practical Session

  1. Go to http://ske.fi.muni.cz, log in, and create a shadow copy of the Czech Wikipedia corpus by clicking Create grammar development corpus (if this link is missing from the bottom of the main page, ask for it).
  2. Develop your own sketch grammar that captures the following semantic relations in this corpus: hypernymy/hyponymy and meronymy/holonymy (hint: use the DUAL directive). Optionally, add further relations (e.g. "is-defined-as"). Read the related documentation, and start with a couple of simple CQL queries that you pretest in the interface.
  3. Expand the grammar iteratively: upload it into the system, have the system compute word sketches, and review the results.
  4. When you are happy with the grammar, process the raw WordSketch data of your corpus (the output of the dumpws command). The data can be obtained in two ways:
    1. Smaller data (up to 100,000 relations) can be downloaded from the web:
      https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki
      e.g.
      https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki

      You must first be authenticated at https://ske.fi.muni.cz/login/. Here, gramdev_czechwiki is the corpus_id of the Czech Wikipedia corpus.
      If you need more than 100,000 relations, use the second way:
    2. Log on to the alba.fi.muni.cz server and use the dumpws command to export the content of the word sketch database:
      dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki
      For this, you may need to ask for extra permissions to the registry directories.
  5. Process the output of dumpws with a simple Bash or Python script that selects the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault.
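
The selection in the last step could look roughly like the Python sketch below. It is only an illustration under a loud assumption: the column layout of the dumpws output is not described on this page, so the script assumes one tab-separated record per line in the order headword, relation, collocate, frequency, salience score. Check a few lines of your own dump and adjust the column unpacking before relying on it. The function name top_pairs and the constant TOP_N are made up for this example.

```python
import csv
import heapq
import sys
from collections import defaultdict

TOP_N = 100  # number of pairs to keep per relation


def top_pairs(lines, top_n=TOP_N):
    """Group records by relation and keep the top_n highest-scoring pairs.

    ASSUMPTION: each line is tab-separated as
    headword, relation, collocate, frequency, score.
    This layout is a guess; verify it against your actual dumpws output.
    Returns {relation: [(score, headword, collocate), ...]}, best first.
    """
    by_relation = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip lines that do not match the assumed layout
        headword, relation, collocate, _freq, score = row[:5]
        try:
            score = float(score)
        except ValueError:
            continue  # skip lines whose score column is not numeric
        by_relation[relation].append((score, headword, collocate))
    # nlargest returns the items sorted from highest to lowest score
    return {rel: heapq.nlargest(top_n, items)
            for rel, items in by_relation.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python top_pairs.py dump.txt > top100.tsv
    with open(sys.argv[1], encoding="utf-8", newline="") as dump:
        result = top_pairs(dump)
    for relation in sorted(result):
        for score, headword, collocate in result[relation]:
            print(relation, headword, collocate, score, sep="\t")
```

Grouping by relation first and then taking the largest scores keeps the script to a single pass over the dump, which matters for exports well beyond 100,000 relations.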