| 1 | = Parsing of Czech: Between Rules and Stats = |
| 2 | |
| 3 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák |
| 4 | |
| 5 | Prepared by: Miloš Jakubíček |
| 6 | |
| 7 | == State of the Art == |
| 8 | |
| 9 | === References === |
| 10 | |
| 11 | 1. PEI, Wenzhe; GE, Tao; CHANG, Baobao. An effective neural network model for graph-based dependency parsing. In: Proc. of ACL. 2015. |
| 12 | 1. CHOI, Jinho D.; TETREAULT, Joel; STENT, Amanda. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In: Proc. of ACL. 2015. |
| 13 | 1. DURRETT, Greg; KLEIN, Dan. Neural CRF Parsing. In: Proc. of ACL. 2015. |
| 14 | |
| 15 | == Practical Session == |
| 16 | |
| 17 | 1. Go to http://ske.fi.muni.cz, log in, and create a shadow copy of the Czech Wikipedia corpus by clicking on [[Image(add.png,valign=middle,nolink,class=intext)]]''Create grammar development corpus'' (if you do not see this link at the bottom of the main page, ask for it). |
| 18 | 1. Develop your own sketch grammar that captures the following semantic relations in this corpus: hypernymy/hyponymy and meronymy/holonymy (hint: use the {{{DUAL}}} directive). Optionally, you can add more relations (e.g. "is-defined-as"). |
| 19 | Read the related [https://www.sketchengine.co.uk/writing-sketch-grammars/ documentation]. Start with a couple of simple CQL queries that you pretest in the interface. |
| 20 | 1. Expand the grammar iteratively: upload it into the system, have the system compute word sketches, and review the results. |
| 21 | 1. When you are happy with the grammar, process the raw !WordSketch data (the output of the `dumpws` command) of your corpus. The data can be obtained in two ways: |
| 22 | 1. Smaller data (up to 100,000 relations) can be downloaded from the web: [[BR]] |
| 23 | `https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/<YOUR_USERNAME_IN_SKETCH_ENGINE>/gramdev_czechwiki` [[BR]] |
| 24 | e.g. [[BR]] |
| 25 | https://ske.fi.muni.cz/bonito/r.cgi/dumpws?corpname=user/novakjan/gramdev_czechwiki [[BR]] |
| 26 | [[BR]] |
| 27 | You must first be authenticated at https://ske.fi.muni.cz/login/. |
| 28 | `gramdev_czechwiki` is the ''corpus_id'' of the Czech Wikipedia corpus. [[BR]] |
| 29 | If you need more than 100,000 relations, use the second option: |
| 30 | 1. Log in to the {{{alba.fi.muni.cz}}} server and use the {{{dumpws}}} command to export the content of the word sketch database: [[BR]] |
| 31 | {{{dumpws /corpora/ca/user_data/<YOUR_USERNAME_IN_SKETCH_ENGINE>/registry/gramdev_czechwiki}}} [[BR]] |
| 32 | For this, you may need to ask for extra permission to access the registry directories. |
| 33 | 5. Process the output of {{{dumpws}}} with a simple Bash or Python script to select the 100 most salient headword-collocation pairs for each relation. Upload the resulting list into the IS vault. |
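
The last step can be sketched in Python as follows. This is only a minimal sketch: the column layout of the {{{dumpws}}} output is an assumption here (five tab-separated fields: relation, headword, collocate, frequency, salience score), so check a few lines of your real dump and adjust the column indices accordingly. The function name {{{top_pairs}}} is hypothetical as well.

```python
import heapq
from collections import defaultdict

# ASSUMPTION: each dump line has five tab-separated fields in this order.
# Inspect your real dumpws output and change these indices if it differs.
REL, HEAD, COLL, FREQ, SCORE = range(5)


def top_pairs(lines, n=100):
    """Return {relation: [(score, headword, collocate), ...]} holding the
    n highest-scoring pairs per relation, best first."""
    by_relation = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip lines that do not match the assumed layout
        try:
            score = float(fields[SCORE])
        except ValueError:
            continue  # skip lines whose score column is not numeric
        by_relation[fields[REL]].append((score, fields[HEAD], fields[COLL]))
    # nlargest compares tuples, so pairs are ranked by score first
    return {rel: heapq.nlargest(n, pairs) for rel, pairs in by_relation.items()}


def format_rows(result):
    """Yield one tab-separated output row per selected pair."""
    for rel in sorted(result):
        for score, head, coll in result[rel]:
            yield f"{rel}\t{head}\t{coll}\t{score}"
```

To use it, pass any iterable of lines, for example an open file containing the saved dump, to {{{top_pairs}}} and write the rows produced by {{{format_rows}}} to the file you upload.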