Changes between Version 1 and Version 2 of LexicalAnalysis


Timestamp:
Oct 11, 2013, 12:20:53 PM (11 years ago)
Author:
xmedved1
Comment:

--

  • LexicalAnalysis

    v1 v2  
    11
    22== SYNT lexical analysis ==
     3For better syntactic analysis, the SYNT system needs a lexical analysis of the given language. During lexical analysis the program assigns a preterminal to a given expression, and this preterminal is then used in the rules of the SYNT meta-grammar.
     4
     5{{{
     6Example:
     7         Lexical analysis:
     8         "October" -> MONTH
     9
     10         Metagrammar rule:
     11         adv -> NUMBER '.' MONTH
     12}}}
     13
     14We have now decided to convert this lexical analysis (originally written in C) into a format that anyone who knows regular expressions (mainly linguists) can understand. This change makes the system easy to adapt to another language and more flexible for future development.
     15
     16The new lexical analysis is based on re2c, a tool for writing very fast and very flexible scanners. The input for re2c is a program written in a very specific format [http://re2c.org/manual.html], but that format still contains C-like code. We therefore decided to start from a text file that contains only regular expressions, together with the actions to perform when an expression matches, and to transform it into a re2c-like file. For this task we created a Python script called "RE2re2c.py".
     17
     18The input for the script "RE2re2c.py" is a text file in which each rule contains regular expressions for the tag, lemma and word, followed by a list of actions. Strings in each part are written in quotes. These four parts are separated by the tab character "\t".
     19
     20{{{
     21Example:
     22               tag RE        lemma RE      word RE               action
     23
     24
     25             "k4".*"xC".*      "dva"           .*         preterm=__SYNT_NTERM_TWO;
     26}}}
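
The rule format above can be illustrated with a minimal Python sketch. It is not taken from "RE2re2c.py": the names `LexicalRule` and `parse_rule` are hypothetical, and the sketch assumes that runs of tabs used for alignment count as a single separator.

```python
from typing import NamedTuple


class LexicalRule(NamedTuple):
    """One lexical rule, in the column order shown in the example above."""
    tag_re: str
    lemma_re: str
    word_re: str
    action: str


def parse_rule(line: str) -> LexicalRule:
    # Assumption: empty fields produced by consecutive tabs are padding,
    # so they are dropped before the four parts are counted.
    fields = [field.strip() for field in line.split("\t") if field.strip()]
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated parts, got {len(fields)}")
    return LexicalRule(*fields)


rule = parse_rule('"k4".*"xC".*\t"dva"\t.*\tpreterm=__SYNT_NTERM_TWO;')
print(rule.tag_re, rule.lemma_re, rule.word_re, rule.action)
```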
     27
     28The user can also put a macro in a regular expression. A macro used in a lexical rule is substituted by its predefined regular expression. A macro is defined as "=!NAME RE", where NAME and RE are separated by the tab character "\t". In a lexical rule the macro name is written in curly braces.
     29
     30{{{
     31Example of macro:
     32                     ## defined macros
     33                     =!IS_NUMBER     [0-9]+
     34                     =!UPPERCASE     ("Á"|"É"|"Í"|"Ó"|"Ú"|"Ů"|"Ď"|"Ť"|"Ň"|"Ľ"|"Č"|"Ž"|"Š"|"Ĺ"|"Ŕ"|"Ř"|"Ě"|"Ý"|[A-Z])+
     35
     36                     ## rule that uses a macro
     37                     .*      .*      {IS_NUMBER}    preterm=NUM
     38}}}
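
Macro definition and substitution could be sketched in Python as follows. This is an illustration under stated assumptions, not the code of "RE2re2c.py": the helpers `parse_macro` and `expand_macros` are hypothetical, macro names are assumed to consist of uppercase letters, digits and underscores, and the expansion is wrapped in parentheses so that it stays a single unit inside a larger expression.

```python
import re

MACRO_PREFIX = "=!"
# Assumed shape of a macro reference: {NAME} with an uppercase identifier.
MACRO_REF = re.compile(r"\{([A-Z_][A-Z0-9_]*)\}")


def parse_macro(line: str) -> tuple[str, str]:
    """Split a '=!NAME<TAB>RE' definition into (NAME, RE)."""
    if not line.startswith(MACRO_PREFIX):
        raise ValueError("not a macro definition")
    name, _, body = line[len(MACRO_PREFIX):].partition("\t")
    return name.strip(), body.strip()


def expand_macros(regex: str, macros: dict[str, str]) -> str:
    """Replace every {NAME} in regex by the parenthesised macro body."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return "(" + macros[name] + ")"
    return MACRO_REF.sub(substitute, regex)


macros = dict([parse_macro("=!IS_NUMBER\t[0-9]+")])
print(expand_macros("{IS_NUMBER}", macros))
```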
     39
     40The user may write a comment for a defined macro or a lexical rule. The format of a comment is "#!comment", and "RE2re2c.py" recognizes two types of comments.
     41- The first type is an inline comment. It follows a lexical rule or a macro definition on the same line.
     42
     43{{{
     44Example of inline comment:
     45                              .*      {IS_MONTH}      .*      preterm=MONTH      #!lexical rule comment
     46                              =!IS_NUMBER     [0-9]+    #!macro comment
     47}}}
     48
     49- The second type is a top comment. It is placed on the line immediately before a lexical rule or a macro definition.
     50
     51
     52{{{
     53Example of top comment:
     54                          #!Top comment for lexical rule
     55                          .*      {IS_MONTH}      .*      preterm=MONTH
     56                          #!Top comment for macro specification
     57                          =!IS_NUMBER     [0-9]+
     58}}}
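
The two comment types can be told apart with a few lines of Python. The sketch below is only an illustration: `attach_comments` is a hypothetical helper, and it assumes that the marker "#!" never occurs inside a quoted string of a rule.

```python
COMMENT_MARK = "#!"


def attach_comments(lines):
    """Return (entry, comment) pairs; comment is None when there is none.

    A line that starts with '#!' is a top comment for the next entry.
    Text after '#!' on an entry line is an inline comment for that entry.
    """
    entries = []
    pending = None  # top comment waiting for the entry that follows it
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(COMMENT_MARK):
            pending = line[len(COMMENT_MARK):].strip()
            continue
        entry, mark, inline = line.partition(COMMENT_MARK)
        comment = inline.strip() if mark else pending
        entries.append((entry.rstrip(), comment))
        pending = None
    return entries


result = attach_comments([
    "#!Top comment for lexical rule",
    ".*\t{IS_MONTH}\t.*\tpreterm=MONTH",
    "=!IS_NUMBER\t[0-9]+\t#!macro comment",
])
for entry, comment in result:
    print(repr(entry), "<-", comment)
```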
     59
     60The output of "RE2re2c.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. If a regular expression in a lexical rule contains ".*", it is replaced by the special macro STRING, which is defined as follows: {{{STRING = [^\t\n\r\0]*}}} '''STRING is a reserved macro: the user is not allowed to define a macro named STRING.'''
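
A rough Python sketch of these two transformations is given below. The exact output notation of "RE2re2c.py" is not stated here, so the `\uXXXX` escape form, the space-separated output and the helper names `to_code_points` and `transform_regex` are all assumptions made for illustration.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')  # a quoted string literal in the rule file


def to_code_points(text: str) -> str:
    # Assumed notation: \uXXXX, or \UXXXXXXXX for characters beyond U+FFFF.
    return "".join(
        f"\\u{ord(ch):04x}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08x}"
        for ch in text
    )


def transform_regex(regex: str) -> str:
    """Escape quoted strings and replace '.*' outside them by STRING."""
    parts = []
    pos = 0
    for match in QUOTED.finditer(regex):
        parts.append(regex[pos:match.start()].replace(".*", "STRING"))
        parts.append('"' + to_code_points(match.group(1)) + '"')
        pos = match.end()
    parts.append(regex[pos:].replace(".*", "STRING"))
    # Pieces are joined with spaces purely for readability.
    return " ".join(part for part in parts if part)


print(transform_regex('"k4".*"xC".*'))
```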
     61
     62In the tag regular expression, ".*" is replaced with STRING, or, if the expression contains strings such as "k1", with [^\t\n\r\0k]*. This modification makes the program faster once the re2c file has been transformed into the final automaton written in C.
     63
     64The command that generates C code from the re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}. In the SYNT system the Makefile runs this step automatically.
     65
     66