ParsingCzech

Timestamp: Nov 6, 2024, 8:28:33 PM (7 months ago)
Author: Ales Horak
Comment: --

private/NlpInPracticeCourse/ParsingCzech

-                      v39
+                      v40
 == Practical Session ==
+{{{
+#!div class="wiki-toc" style="width: 40%"
+**Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
+}}}
+We will develop/adjust the grammar of the SET parser (for English or Czech).[[br]][[br]][[br]]
+Open [https://colab.research.google.com/drive/1SUtMScLK-6sKsX5eYIUfFjBtgrKCpkRy?usp=sharing Google Colab notebook IA161-ParsingCzech.ipynb] and follow the text and code in it.
+. Download the [[htdocs:bigdata/ukol_ia161-parsing.zip|SET parser with evaluation dataset]]
+{{{
+wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/ukol_ia161-parsing.zip
+}}}
+. Unzip the downloaded file
+{{{
+unzip ukol_ia161-parsing.zip
+}}}
+. Go to the unzipped folder
+{{{
+cd ukol_ia161-parsing
+}}}
+. [optional] Choose the language you want to work with. The default is English (`en`); to switch to Czech (`cs`), edit the `Makefile`:
+{{{
+nano Makefile
+}}}
+ If you want to work with Czech, change the first line to
+{{{
+LANGUAGE=cs
+}}}
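The same switch can be made without opening an editor. The following is a minimal sketch, assuming (as the step above states) that the first line of `Makefile` is the `LANGUAGE=` assignment; the helper name `set_language` is hypothetical and not part of the distributed package.

```shell
# Hypothetical helper (not shipped with the SET package):
# rewrite the LANGUAGE= assignment on the first line of a Makefile.
# Usage: set_language cs [path/to/Makefile]
set_language() {
    lang="$1"
    makefile="${2:-Makefile}"
    # Only the two languages mentioned in this session are accepted.
    case "$lang" in
        en|cs) ;;
        *) echo "unsupported language: $lang" >&2; return 1 ;;
    esac
    # Touch line 1 only, and only if it is a LANGUAGE assignment.
    # A temporary file is used instead of `sed -i` for portability.
    sed "1s/^LANGUAGE *=.*/LANGUAGE=$lang/" "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile"
}
```

For example, `set_language cs` run inside the `ukol_ia161-parsing` folder should leave the rest of the `Makefile` unchanged; if the first line is not a `LANGUAGE=` assignment, nothing is modified.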
+. Test the prepared program that analyses 100 selected sentences
+{{{
+make set_trees
+make compare
+}}}
+ The output should be
+{{{
+./compare_dep_trees.py data/trees/ud21_gum_dev data/trees/set_ud21_gum_dev
+UAS =  55.4 %
+}}}
+ You can see a detailed evaluation (sentence by sentence) with
+{{{
+make compare SENTENCES=1
+}}}
+ You can view the differences for a single tree with
+{{{
+make diff SENTENCE=academic_librarians-10
+}}}
+ The left window (`ud21_gum_dev/academic_librarians-10`) shows the
+ expected ground truth; the right window (`set_ud21_gum_dev/academic_librarians-10`) displays the current parsing result, which you are to improve.[[br]]
+ Exit the diff by pressing `q`.[[br]]
+ You may inspect the tagged vertical text with
+ {{{
+ make vert SENTENCE=academic_librarians-10
+}}}
+ You can view the two trees with the following command (`python3-tk` must be installed on the system)
+ {{{
+make view SENTENCE=academic_librarians-10
+}}}
+ For a remote tree view (i.e. inspecting the trees on a different computer), you may run
+ {{{
+make html SENTENCE=academic_librarians-10
+}}}
+ and point your browser to the `html/index.html` file. [[br]]
+ You can extract the text of the sentence easily with
+ {{{
+make text SENTENCE=academic_librarians-10
+}}}
+ An English translation of the Czech sentences can be obtained with
+ {{{
+make texttrans SENTENCE=academic_librarians-10
+}}}
+. To debug the parsing process, use
+ {{{
+make debug SENTENCE=academic_librarians-10
+}}}
+ which will print the final rules used to build the tree. Adding
+ `DETAIL=1` will show all details of the parsing process, including
+ the unused rules.
+ {{{
+make debug SENTENCE=academic_librarians-10 DETAIL=1
+}}}
+. Look at the files (you may use the `mc` file manager; exit it with `Esc+0`):
+ * `data/vert/pdt2_etest` or `ud21_gum_dev` - 100 input sentences in vertical format.[[br]]
+  The tag format is the Prague Dependency Treebank [https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html positional tagset] for Czech and the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset] for English.
+ * `data/trees/pdt2_etest` or `ud21_gum_dev` - 100 gold standard dependency trees from the Prague Dependency Treebank or the Universal Dependencies GUM corpus
+ * `data/trees/set_pdt2_etest` or `set_ud21_gum_dev` - 100 trees output from SET by running `make set_trees`
+ * `grammar-cs.set` or `grammar-en.set` - the grammar used in running SET
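For reference, the UAS (unlabeled attachment score) printed by `make compare` is conventionally the percentage of tokens whose head in the parser output matches the head in the gold standard tree. The formula below is this standard definition, not taken from `compare_dep_trees.py`; whether that script skips punctuation or treats any tokens specially is not stated here, so treat such details as an assumption.

```latex
\mathrm{UAS} \;=\; \frac{\bigl|\{\, i \;:\; h^{\mathrm{SET}}_i = h^{\mathrm{gold}}_i \,\}\bigr|}{N} \times 100\,\%
```

Here $N$ is the total number of evaluated tokens, $h^{\mathrm{gold}}_i$ is the head of token $i$ in the gold standard tree and $h^{\mathrm{SET}}_i$ is the head assigned by SET. As a purely illustrative calculation, 554 correctly attached tokens out of 1000 would give a UAS of 55.4 %.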
+== Assignment ==
+. Study the [https://nlp.fi.muni.cz/trac/set/wiki/documentation SET documentation]. The tags used in the English `grammar-en.set` follow the [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html Penn Treebank tagset]; the tags in the Czech `grammar-cs.set` follow the [raw-attachment:tagset.pdf Brno tagset].
+. Develop a better grammar by repeating the following process:
+{{{
+nano grammar-en.set # or use your favourite editor
+make set_trees
+make compare
+}}}
+ to improve on the original UAS.
+. Write the final UAS into a comment in `grammar-cs.set` or `grammar-en.set`, for example:
+{{{
+# This is the SET grammar for English used in IA161 course
+#
+# ===========   resulting UAS =  66.9 %  ===================
+}}}
+. Upload the resulting `grammar-cs.set` or `grammar-en.set` with the improved UAS to the homework vault.
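While iterating on the grammar, it can help to keep a record of the score after each change. The following is a minimal sketch, assuming that `make compare` prints a line of the form `UAS =  55.4 %` as shown in the practical session; the helper names `extract_uas` and `log_uas` and the log file `uas_history.log` are hypothetical and not part of the distributed package.

```shell
# Hypothetical helper: read `make compare` output on stdin
# and print only the numeric UAS value (e.g. 55.4).
extract_uas() {
    sed -n 's/^UAS *= *\([0-9][0-9.]*\) *%.*$/\1/p'
}

# Hypothetical helper: run one edit-evaluate iteration and append
# the score with a timestamp to uas_history.log.
# Must be run from the unpacked ukol_ia161-parsing folder.
log_uas() {
    make set_trees >/dev/null &&
        make compare | extract_uas | while read -r uas; do
            echo "$(date '+%Y-%m-%d %H:%M:%S') UAS=$uas" | tee -a uas_history.log
        done
}
```

After each edit of the grammar file, calling `log_uas` instead of the two `make` commands should print the current score and keep the history in one place, which makes it easier to spot a rule change that lowered the UAS.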