Changes between Version 7 and Version 8 of LexicalAnalysis


Timestamp:
Feb 12, 2014, 12:01:27 PM (10 years ago)
Author:
xmedved1
Comment:

--

  • LexicalAnalysis

    v7 v8  
    1414We have now decided to rewrite this lexical analysis (originally written in C) in a format that anyone who knows regular expressions (mainly linguists) can understand. This change makes the system easy to adapt to another language and more flexible for future development.
    1515
    16 This new lexical analysis is based on re2c, a tool for writing very fast and very flexible scanners. The input to re2c is a program written in a very specific format [http://re2c.org/manual.html], but it still contains C-like code. We therefore decided to transform a text file that contains only regular expressions, and the actions to run when a regular expression matches, into a re2c-like file. For this task we created a Python script called "RE2re2c.py".
     16This new lexical analysis is based on re2c, a tool for writing very fast and very flexible scanners. The input to re2c is a program written in a very specific format [http://re2c.org/manual.html], but it still contains C-like code. We therefore decided to transform a text file that contains only regular expressions, and the actions to run when a regular expression matches, into a re2c-like file. For this task we created a Python script called "gen_lex.py".
    1717
    18 The input for the script "RE2re2c.py" is a text file that contains regular expressions for word, lemma and tag, and a list of actions (separated by ";"). Strings in each part are written in quotes. These four parts are separated by a tab character ("\t").
     18The input for the script "gen_lex.py" is a text file that contains regular expressions for word, lemma and tag, and a list of actions (separated by ";"). Strings in each part are written in quotes. These four parts are separated by a tab character ("\t").
    1919
    2020{{{
    2121Example:
    22                tag RE        lemma RE      word RE               action
     22               tag RE        lemma RE      word RE        word index           action
    2323
    2424
    25              "k4".*"xC".*      "dva"           .*         preterm=__SYNT_NTERM_TWO;
     25             "k4".*"xC".*      "dva"           .*           [0-9]+       preterm=__SYNT_NTERM_TWO;
    2626}}}
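The v8 rule layout shown above can be read with a few lines of Python. This is only a minimal sketch, not the code of "gen_lex.py": it assumes the five columns of the v8 example (tag RE, lemma RE, word RE, word index, actions) are separated by tab characters, and the names `LexRule` and `parse_rule` are hypothetical.

```python
from typing import List, NamedTuple


class LexRule(NamedTuple):
    """One lexical rule; the field names are illustrative assumptions."""
    tag_re: str
    lemma_re: str
    word_re: str
    word_index: str
    actions: List[str]


def parse_rule(line: str) -> LexRule:
    # Assumption: columns are separated by one or more tab characters.
    fields = [f.strip() for f in line.split("\t") if f.strip()]
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    tag_re, lemma_re, word_re, word_index, action_field = fields
    # Actions are separated by ";". This naive split also cuts inside
    # C-like blocks such as "if (...) {a; b;}", which a real tool must handle.
    actions = [a.strip() for a in action_field.split(";") if a.strip()]
    return LexRule(tag_re, lemma_re, word_re, word_index, actions)


rule = parse_rule('"k4".*"xC".*\t"dva"\t.*\t[0-9]+\tpreterm=__SYNT_NTERM_TWO;')
print(rule.tag_re, rule.actions)
```

Splitting on tabs rather than on arbitrary white space keeps spaces inside an action intact, which matters for actions that contain C-like code.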
    2727
    28 The user can also put a macro in a regular expression. A macro in a lexical rule is substituted by its predefined regular expression. A macro is defined as "=!NAME RE", where NAME and RE are separated by a tab character ("\t"). In a lexical rule, the macro is written in curly braces.
     28The user can also put a macro in a regular expression. A macro in a lexical rule is substituted by its predefined regular expression. A macro is defined as "m=NAME RE", where NAME and RE are separated by a tab character ("\t"). In a lexical rule, the macro is written in curly braces.
    2929
    3030{{{
    3131Example of macro:
    32                      ## defined macros
    33                      =!IS_NUMBER     [0-9]+
    34                      =!UPPERCASE     ("Á"|"É"|"Í"|"Ó"|"Ú"|"Ů"|"Ď"|"Ť"|"Ň"|"Ľ"|"Č"|"Ž"|"Š"|"Ĺ"|"Ŕ"|"Ř"|"Ě"|"Ý"|[A-Z])+
     32                     #= defined macros
     33                     m=IS_NUMBER     [0-9]+
     34                     m=UPPERCASE     ("Á"|"É"|"Í"|"Ó"|"Ú"|"Ů"|"Ď"|"Ť"|"Ň"|"Ľ"|"Č"|"Ž"|"Š"|"Ĺ"|"Ŕ"|"Ř"|"Ě"|"Ý"|[A-Z])+
    3535
    36                      ## rule that uses a macro
    37                      .*      .*      {IS_NUMBER}    preterm=NUM
     36                     #= rule that uses a macro
     37                     .*      .*      {IS_NUMBER}   [0-9]+  preterm=NUM
    3838}}}
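The macro mechanism described above amounts to a textual substitution, which could look roughly like this. The sketch assumes the v8 syntax "m=NAME<TAB>RE"; the function names `parse_macro` and `expand_macros`, and the choice to wrap the expansion in parentheses, are assumptions made for illustration and are not taken from "gen_lex.py".

```python
import re
from typing import Dict, Tuple


def parse_macro(line: str) -> Tuple[str, str]:
    """Parse a v8 macro definition of the form 'm=NAME<TAB>RE'."""
    if not line.startswith("m="):
        raise ValueError("not a macro definition")
    name, _, body = line[2:].partition("\t")
    name, body = name.strip(), body.strip()
    if not name or not body:
        raise ValueError("macro needs a name and a regular expression")
    return name, body


def expand_macros(regex: str, macros: Dict[str, str]) -> str:
    """Replace every {NAME} in a rule with the macro's regular expression."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        # Parentheses (an assumption) keep the expansion a single unit.
        return "(" + macros[name] + ")"

    # Names must start with a letter or underscore, so a repetition
    # count such as {2,3} is not mistaken for a macro reference.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, regex)


name, body = parse_macro("m=IS_NUMBER\t[0-9]+")
print(expand_macros("{IS_NUMBER}", {name: body}))
```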
    3939
    40 The user can write a comment for a defined macro or a lexical rule. "RE2re2c.py" recognizes two types of comments. The format of a comment is "#!comment".
     40The user can write a comment for a defined macro or a lexical rule. "gen_lex.py" recognizes two types of comments. The format of a comment is "#=comment".
    4141- The first type is an inline comment. It follows a lexical rule or a macro definition on the same line.
    4242
    4343{{{
    4444Example of in line comment:
    45                               .*      {IS_MONTH}      .*      preterm=MONTH      #!lexical rule comment
    46                               =!IS_NUMBER     [0-9]+    #!macro comment
     45                              .*      {IS_MONTH}      .*     [0-9]+    preterm=MONTH      #=lexical rule comment
     46                              m=IS_NUMBER     [0-9]+    #=macro comment
    4747}}}
    4848
     
    5252{{{
    5353Example of top comment:
    54                           #!Top comment for lexical rule
    55                           .*      {IS_MONTH}      .*      preterm=MONTH
    56                           #!Top comment for macro specification
    57                           =!IS_NUMBER     [0-9]+
     54                          #=Top comment for lexical rule
     55                          .*      {IS_MONTH}      .*      [0-9]+     preterm=MONTH
     56                          #=Top comment for macro specification
     57                          m=IS_NUMBER     [0-9]+
    5858}}}
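Both comment types from the examples above can be told apart by where the "#=" marker sits on the line. The following is a minimal sketch under the v8 syntax; `split_comment` is a hypothetical helper, and it does not handle a "#=" that appears inside a quoted string.

```python
from typing import Optional, Tuple

COMMENT_MARK = "#="  # v8 comment marker


def split_comment(line: str) -> Tuple[str, Optional[str]]:
    """Return (content, comment); comment is None when the line has none."""
    content, mark, comment = line.partition(COMMENT_MARK)
    if not mark:
        return line.strip(), None
    return content.strip(), comment.strip()


# Inline comment: content and comment are both present.
print(split_comment("m=IS_NUMBER\t[0-9]+\t#=macro comment"))
# Top comment: the content part is empty.
print(split_comment("#=Top comment for lexical rule"))
```

A line whose content part is empty is a top comment and would be attached to the next rule or macro definition; a line with both parts carries an inline comment.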
    5959
     
    6161
    6262{{{
    63 "k1".*  .*  {ONLY_FIRST_UPPER}    if (word_index>0){lemma->preterm=__SYNT_NTERM_NPR;lemma = word->duplicateLemma(LEMMAINDEX);}
     63"k1".*  .*  {ONLY_FIRST_UPPER}   WI    if (word_index>0){lemma->preterm=__SYNT_NTERM_NPR;lemma = word->duplicateLemma(LEMMAINDEX);}
    6464}}}
    6565
    66 The output of "RE2re2c.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. If a lexical rule contains ".*" in an RE, it is replaced by the special macro STRING. The STRING macro is specified as follows: {{{STRING = [^\t\n\r\0]*}}}. '''This is the only special macro, and the user is not allowed to define a macro with the name STRING.'''
     66The output of "gen_lex.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. If a lexical rule contains ".*" in an RE, it is replaced by the special macro STRING. The STRING macro is specified as follows: {{{STRING = [^\t\n\r\0]*}}}. '''This is the only special macro, and the user is not allowed to define a macro with the name STRING.'''
    6767
    6868In the tag regular expression, ".*" is replaced with STRING or, if the expression contains strings such as "k1", with {{{[^\t\n\r\0k]*}}}. This modification makes the program faster when the re2c file is transformed into the final automaton written in C.
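The basic replacement of ".*" by STRING could be sketched as follows. This is an illustration, not the logic of "gen_lex.py": the function name `replace_any_string` is hypothetical, only ".*" outside quoted strings is replaced, escaped quotes are not handled, and the narrower character class used for tag expressions is not modeled because the text does not spell out its exact rule.

```python
def replace_any_string(regex: str, replacement: str = "STRING") -> str:
    """Replace every '.*' that is outside double quotes with the macro name."""
    out = []
    in_quotes = False
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == '"':
            # Toggle quoted-string state; escaped quotes are not handled.
            in_quotes = not in_quotes
            out.append(ch)
            i += 1
        elif not in_quotes and regex.startswith(".*", i):
            # Surround the name with spaces so it stays a separate token.
            out.append(" " + replacement + " ")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out).strip()


print(replace_any_string('"k4".*"xC".*'))
```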
     
    7070The command to generate C code from the re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}. In the SYNT system, the Makefile takes care of this.
    7171
    72 Commands for RE2re2c.py: [[BR]]
     72Usage of gen_lex.py: [[BR]]
    7373
    74 {{{stdin | ./RE2re2c.py [-u] > stdout}}}[[BR]]
    75 {{{stdin | ./RE2re2c.py -o output [-u]}}}[[BR]]
    76 {{{./RE2re2c.py -i input -o output [-u]}}}[[BR]]
    77 {{{./Re2re2c.py -i input [-u] > stdout}}}[[BR]]
     74{{{gen_lex.py [-u] < INPUT > OUTPUT}}}[[BR]]
     75