
SYNT lexical analysis

To improve the syntactic analysis performed by the SYNT system, we need a lexical analysis of the given language. During lexical analysis, the program assigns a preterminal to a given expression, and this preterminal is then used in the rules of the SYNT meta-grammar.

Example:
         Lexical analysis:
         "Monday" -> MONTH

         Metagrammar rule:
         adv -> NUMBER '.' MONTH


We have now decided to convert this lexical analysis (originally written in C) into a format that anyone who knows regular expressions (mainly linguists) can understand. This change makes the system easier to adapt to another language and more flexible for future development.

The new lexical analysis is based on re2c, a tool for writing very fast and very flexible scanners. The input to re2c is a program written in a very specific format (see http://re2c.org/manual.html), but it still contains C-like code. We therefore decided to transform a text file that contains only regular expressions, together with the actions to run when a regular expression matches, into a re2c-like file. For this task we created a Python script called "gen_lex.py".

The input to "gen_lex.py" is a text file in which each rule contains regular expressions for the tag, lemma, word and word index, followed by a list of actions (separated by ";"). Strings in each part are written in quotes. The parts are separated by a tab character ("\t").

Example:
               tag RE        lemma RE      word RE        word index           action


             "k4".*"xC".*      "dva"           .*           [0-9]+       preterm=__SYNT_NTERM_TWO;

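The rule format above can be read with a few lines of Python. This is only a minimal sketch and not the actual code of "gen_lex.py": the names LexRule and parse_rule are hypothetical, and it assumes exactly five tab-separated fields in the order shown in the example (tag RE, lemma RE, word RE, word index, action). The action field is split naively on ";", so actions that contain ";" inside braces would need a smarter splitter.

```python
from typing import List, NamedTuple


class LexRule(NamedTuple):
    """One lexical rule (hypothetical container, not part of gen_lex.py)."""
    tag_re: str
    lemma_re: str
    word_re: str
    word_index_re: str
    actions: List[str]


def parse_rule(line: str) -> LexRule:
    """Split one tab-separated rule line into its five fields."""
    # Assumption: runs of tabs used for alignment are treated as one separator.
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    tag_re, lemma_re, word_re, index_re, action_field = fields
    # Naive split: does not handle ";" nested inside "{...}" blocks.
    actions = [a.strip() for a in action_field.split(";") if a.strip()]
    return LexRule(tag_re, lemma_re, word_re, index_re, actions)


rule = parse_rule('"k4".*"xC".*\t"dva"\t.*\t[0-9]+\tpreterm=__SYNT_NTERM_TWO;')
print(rule.tag_re)
print(rule.actions)
```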

The user can also put macros in a regular expression. A macro in a lexical rule is substituted by its predefined regular expression. A macro is defined as "m=NAME RE", where NAME and RE are separated by a tab character ("\t"). In a lexical rule, a macro is written in curly braces.

Example of macro:
                     #= defined macros
                     m=IS_NUMBER     [0-9]+
                     m=UPPERCASE     ("Á"|"É"|"Í"|"Ó"|"Ú"|"Ů"|"Ď"|"Ť"|"Ň"|"Ľ"|"Č"|"Ž"|"Š"|"Ĺ"|"Ŕ"|"Ř"|"Ě"|"Ý"|[A-Z])+

                     #= rule that uses a macro
                     .*      .*      {IS_NUMBER}   [0-9]+   preterm=NUM
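
Macro substitution as described above could look roughly like the following. This is a sketch under assumptions, not the real implementation: parse_macro and expand_macros are hypothetical names, macro names are assumed to be identifiers, and the definition is wrapped in parentheses so that it keeps its meaning inside a larger expression.

```python
import re
from typing import Dict, Tuple

# {NAME} where NAME is an identifier; counts such as {3} are left untouched.
MACRO_REF = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_macro(line: str) -> Tuple[str, str]:
    """Parse 'm=NAME<TAB>RE' into the pair (NAME, RE)."""
    if not line.startswith("m="):
        raise ValueError("not a macro definition")
    name, _, regex = line[2:].partition("\t")
    return name.strip(), regex.strip()


def expand_macros(field: str, macros: Dict[str, str]) -> str:
    """Replace every {NAME} in a rule field by its parenthesised definition."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return "(" + macros[name] + ")"
    return MACRO_REF.sub(substitute, field)


macros = dict([parse_macro("m=IS_NUMBER\t[0-9]+")])
expanded = expand_macros("{IS_NUMBER}", macros)
print(expanded)  # ([0-9]+)
```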


The user may write comments for a defined macro or lexical rule. "gen_lex.py" recognizes two types of comments. The format of a comment is "#=comment".

  • The first type is an inline comment. It follows a lexical rule or a macro definition on the same line.
Example of an inline comment:
                              .*      {IS_MONTH}      .*     [0-9]+    preterm=MONTH      #=lexical rule comment
                              m=IS_NUMBER     [0-9]+    #=macro comment
  • The second type is a top comment. It is placed on the line immediately before a lexical rule or macro definition.
Example of a top comment:
                          #=Top comment for lexical rule
                          .*      {IS_MONTH}      .*      [0-9]+     preterm=MONTH
                          #=Top comment for macro specification
                          m=IS_NUMBER     [0-9]+
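
The two comment types can be separated from rules and macros as in the sketch below. The function names split_comment and attach_comments are hypothetical, and the sketch assumes that "#=" never occurs inside a quoted string of a regular expression and that an inline comment takes precedence over a top comment.

```python
from typing import List, Optional, Tuple


def split_comment(line: str) -> Tuple[str, Optional[str]]:
    """Return (content, comment); comment is None when the line has none."""
    content, sep, comment = line.partition("#=")
    return content.rstrip(), (comment.strip() if sep else None)


def attach_comments(lines: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each rule or macro line with its inline or top comment."""
    entries = []
    pending = None  # top comment waiting for the next rule or macro
    for raw in lines:
        content, comment = split_comment(raw.strip())
        if not content:
            # Comment-only line becomes a top comment; a blank line drops it.
            pending = comment
            continue
        entries.append((content, comment if comment is not None else pending))
        pending = None
    return entries


entries = attach_comments([
    "#=Top comment for lexical rule",
    ".*\t{IS_MONTH}\t.*\t[0-9]+\tpreterm=MONTH",
    "m=IS_NUMBER\t[0-9]+    #=macro comment",
])
for content, comment in entries:
    print(repr(content), "->", comment)
```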


In lexical rules and macros you can use the predefined variables word, lemma, morf_info, word index and lemma index. For example, to write a lexical rule for words that are not in the first position of a sentence:

"k1".*  .*  {ONLY_FIRST_UPPER}   WI    if (word_index>0){lemma->preterm=__SYNT_NTERM_NPR;lemma = word->duplicateLemma(LEMMAINDEX);}


The output of "gen_lex.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. In the tag regular expression, every ".*" is replaced with the special macro STRING (defined as STRING = [^\t\n\r\0]*), or, if the expression contains strings such as "k1", with [^\t\n\r\0k]*. This modification makes the program faster when the re2c file is transformed into the final automaton written in C. The STRING macro must not be used in user input.
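
The replacement of ".*" in the tag part can be illustrated as follows. This is one possible reading of the rule above and not necessarily what "gen_lex.py" does: it assumes that a ".*" directly followed by a quoted string additionally excludes the first character of that string, and that any other ".*" becomes STRING. The name replace_wildcards is hypothetical, and a literal ".*" inside quotes is not handled.

```python
import re

# Characters excluded by the STRING macro, kept as literal escape sequences.
BASE_EXCLUDED = r"\t\n\r\0"
WILDCARD = re.compile(r"\.\*")


def replace_wildcards(tag_re: str) -> str:
    """Replace each '.*' in a tag RE by STRING or a narrowed character class."""
    def substitute(match):
        rest = tag_re[match.end():]
        # Assumption: a quoted literal right after '.*' narrows the class.
        if len(rest) >= 2 and rest[0] == '"' and rest[1] != '"':
            first = rest[1]
            if first in "\\]^-":
                first = "\\" + first  # escape class metacharacters
            return "[^" + BASE_EXCLUDED + first + "]*"
        return "STRING"
    return WILDCARD.sub(substitute, tag_re)


print(replace_wildcards('"k4".*"xC".*'))
print(replace_wildcards('.*"k1".*'))
```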


The command to generate C code from the re2c file is: re2c -isuF -o analyze_lex.c analyze_lex.re2c

Usage of gen_lex.py: gen_lex.py [-u] < INPUT > OUTPUT

Last modified on Feb 12, 2014, 12:23:20 PM