Changes between Initial Version and Version 1 of documentation


Timestamp:
Mar 15, 2009, 12:30:03 PM (15 years ago)
Author:
Miloš Jakubíček
Comment:

added documentation

= Documentation =

== Introduction ==

SET is an open source tool for the syntactic analysis of natural languages. It is based on detecting important patterns in the text and incrementally segmenting the sentence. Its core consists of a set of patterns (or rules) and a parsing engine that analyses the input sentence according to the given rules. Currently, SET is distributed with a set of about 150 rules for parsing Czech. A simple tree viewer for displaying the parser output is also included.

== System features ==

The system parses a morphologically tagged sentence in the vertical (BRIEF) format, i.e. one token per line, in word - lemma - tag order. At present, the morphological tagging must be disambiguated, and the tags are expected in the attribute format used by the [http://nlp.fi.muni.cz/projekty/ajka ajka] morphological analyser. Examples of correct input files: sentence 1, sentence 2.

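As a rough illustration of the input format, the following Python sketch reads a vertical file into (word, lemma, tag) triples. The column separator is assumed to be whitespace, which the text above does not specify, and the sample sentence and its tags are hypothetical values that only imitate the ajka attribute style. The names `Token` and `read_vertical` are made up for this example and are not part of SET.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def read_vertical(text: str) -> List[Token]:
    """Parse vertical text: one token per line, in word - lemma - tag order."""
    tokens = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split()  # assumption: columns are whitespace-separated
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        tokens.append(Token(*fields))
    return tokens


# Hypothetical sample sentence; the tags are illustrative only.
SAMPLE = "Pes\tpes\tk1gMnSc1\nspí\tspát\tk5eAaImIp3nS\n"
tokens = read_vertical(SAMPLE)
for token in tokens:
    print(token.word, token.lemma, token.tag)
```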
The system returns the syntactic information found in the input sentence in several possible output formats:

    * '''All patterns found in the input sentence'''
      This information is printed on stderr as the matched tokens followed by the rule that matched them. It is marked by the label {{{Match found}}}.
    * '''Best matches'''
      The pattern matches that are selected by the parser's ranking functions and used for building the output tree. This information is also printed on stderr as the matched tokens followed by the rule that matched them. It is marked by the label {{{Match selected}}}.
    * '''Hybrid trees'''
      Full syntactic trees combining phrasal and dependency elements. This is the native output of the parser. In text form, it is printed on stdout; it can also be displayed in the graphical module.
    * '''Dependency trees'''
      Full syntactic trees containing only dependency elements, corresponding to the formalism used by the [http://ufal.mff.cuni.cz Institute of Formal and Applied Linguistics] in Prague. In text form, it is printed on stdout; it can also be displayed in the graphical module.

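Since both kinds of match reports go to stderr, a caller may want to separate them by their labels. The sketch below does this with a plain substring test. Only the two labels come from the description above; the exact layout of each stderr line is not documented here, so the sample lines are placeholders and `split_matches` is a hypothetical helper.

```python
def split_matches(stderr_text):
    """Separate stderr lines into 'found' and 'selected' match reports."""
    found, selected = [], []
    for line in stderr_text.splitlines():
        if "Match selected" in line:
            selected.append(line)
        elif "Match found" in line:
            found.append(line)
    return found, selected


# Placeholder stderr excerpt; the real line layout may differ.
EXAMPLE = (
    "Match found: <tokens> <rule A>\n"
    "Match found: <tokens> <rule B>\n"
    "Match selected: <tokens> <rule A>\n"
)
found, selected = split_matches(EXAMPLE)
print(len(found), "found,", len(selected), "selected")
```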
In text form, the output trees are encoded as a set of lines, each representing one node of the resulting tree. Each line contains four TAB-delimited fields:

    * Node ID (integer)
    * Node label
    * Node dependency ID (integer)
    * Dependency type ('p' for a phrasal edge, 'd' for a dependency edge)

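The four-field format can be read back into a tree with a few lines of Python. This is a minimal sketch under two assumptions that the text does not confirm: the node dependency ID is taken to be the ID of the governing (parent) node, and a node whose parent ID does not occur among the node IDs is treated as a root. The sample data, including the use of -1 for the root, is hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    node_id: int
    label: str
    parent_id: int  # "node dependency ID"; assumed to point to the parent
    edge_type: str  # 'p' (phrasal) or 'd' (dependency)


def read_tree(text):
    """Parse TAB-delimited node lines into a dict keyed by node ID."""
    nodes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        node_id, label, parent_id, edge_type = line.split("\t")
        if edge_type not in ("p", "d"):
            raise ValueError(f"unknown dependency type: {edge_type!r}")
        nodes[int(node_id)] = Node(int(node_id), label, int(parent_id), edge_type)
    return nodes


def children_of(nodes):
    """Map each parent ID to the IDs of its children."""
    children = defaultdict(list)
    for node in nodes.values():
        children[node.parent_id].append(node.node_id)
    return dict(children)


def render(nodes, children, node_id, depth=0):
    """Return indented lines for the subtree rooted at node_id."""
    node = nodes[node_id]
    lines = ["  " * depth + f"{node.label} ({node.edge_type})"]
    for child_id in sorted(children.get(node_id, [])):
        lines.extend(render(nodes, children, child_id, depth + 1))
    return lines


# Hypothetical parser output; the root's parent ID of -1 is an assumption.
SAMPLE = "0\tmalý\t1\td\n1\tpes\t2\td\n2\tspí\t-1\td\n"
nodes = read_tree(SAMPLE)
children = children_of(nodes)
roots = [n.node_id for n in nodes.values() if n.parent_id not in nodes]
for root in roots:
    print("\n".join(render(nodes, children, root)))
```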
The latest precision measurements (performed with SET version 0.2) show that the precision of the parser's dependency output ranges between 75 and 86 percent with respect to human-annotated Czech corpus data, depending on the particular test set.

== Program usage ==

The system is invoked as follows:

{{{./set.py [-gd] <file>}}}

where

    * {{{-g}}} selects graphical tree output (if not given, the output tree is printed only in the text format, which is hard to read),
    * {{{-d}}} switches to dependency tree output instead of hybrid trees, and
    * {{{<file>}}} should contain a tagged Czech sentence in the BRIEF format described above, in UTF-8 encoding.

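For scripted use, the command line above can be assembled and run from Python. The sketch below only relies on the documented `-g` and `-d` switches and the file argument. The helper names, the file name `sentence.vert`, and the assumption that the parser writes UTF-8 to stdout and stderr are illustrative and not taken from the SET distribution. Nothing is executed until `run_set` is called.

```python
import subprocess


def build_command(path, graphical=False, dependency=False, script="./set.py"):
    """Build the argument list for ./set.py [-gd] <file>."""
    cmd = [script]
    flags = ("g" if graphical else "") + ("d" if dependency else "")
    if flags:
        cmd.append("-" + flags)
    cmd.append(path)
    return cmd


def run_set(path, **options):
    """Run the parser and return (stdout, stderr) as text."""
    result = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        encoding="utf-8",  # assumption: output is UTF-8, like the input
        check=True,
    )
    return result.stdout, result.stderr


# Hypothetical input file name; prints the command without running it.
print(build_command("sentence.vert", dependency=True))
```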
The system parses the input sentence according to the rules defined in the file {{{grammar.set}}}, which is included in the installation. The structure of the rules and the analysis process are described in the following sections.

== Rules structure ==

=== Rules syntax ===

to be described...

=== SET rules ===

to be described...

== Implementation overview ==

to be described...