Changes between Version 8 and Version 9 of LexicalAnalysis


Timestamp:
Feb 12, 2014, 12:23:20 PM (10 years ago)
Author:
xmedved1
Comment:

--

  • LexicalAnalysis

    v8 v9  
    1111         adv -> NUMBER '.' MONTH
    1212}}}
     13[[BR]]
    1314
    1415We decided to rewrite this lexical analysis (originally written in C) in a format that anyone who knows regular expressions (mainly linguists) can understand. This change makes the system easier to adapt to another language and more flexible for future development.
     
    2526             "k4".*"xC".*      "dva"           .*           [0-9]+       preterm=__SYNT_NTERM_TWO;
    2627}}}
     28[[BR]]
    2729
    2830A user can also put a macro in a regular expression. A macro used in a lexical rule is substituted by its predefined regular expression. A macro is defined as "m=NAME RE", where NAME and RE are separated by a tab character ("\t"). In a lexical rule, a macro is written in curly braces.
     
    3739                     .*      .*      {IS_NUMBER}   [0-9]+   preterm=NUM
    3840}}}
     41[[BR]]
    3942
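The macro substitution described above can be sketched in Python. This is only a minimal illustration, not the actual "gen_lex.py" code: the function names `parse_macros` and `expand_macros` are hypothetical, and plain textual substitution is an assumption (the real tool may emit the macro differently in its re2c output).

```python
import re


def parse_macros(lines):
    """Collect macro definitions of the form 'm=NAME<TAB>RE' (hypothetical helper)."""
    macros = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("m="):
            continue  # comments and lexical rules are ignored here
        name, _, regex = line[2:].partition("\t")
        if name and regex:
            macros[name] = regex.strip()
    return macros


def expand_macros(rule, macros):
    """Replace each {NAME} in a lexical rule with its regular expression.

    Assumption: substitution is purely textual, and unknown names are left as-is.
    """
    def substitute(match):
        return macros.get(match.group(1), match.group(0))

    # Only identifiers directly wrapped in braces are treated as macro calls,
    # so quantifiers such as {2} and C action blocks are not touched.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, rule)


macros = parse_macros(["m=IS_NUMBER\t[0-9]+"])
print(expand_macros(".*\t.*\t{IS_NUMBER}\t[0-9]+\tpreterm=NUM", macros))
```

With the macro from the example above, the `{IS_NUMBER}` field of the rule becomes `[0-9]+`, while the other tab-separated fields stay unchanged.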
    4043A user may write a comment for a defined macro or lexical rule. "gen_lex.py" recognizes two types of comments. The comment format is "#=comment".
     
    5659                          #=Top comment for macro specification
    5760                          m=IS_NUMBER     [0-9]+
    58 }}}
     61}}}
     62[[BR]]
    5963
    6064In lexical rules and macros you can use the predefined variables word, lemma, morf_info, word index and lemma index. For example, to write a lexical rule for words that are not in the first position of a sentence:
     
    6266{{{
    6367"k1".*  .*  {ONLY_FIRST_UPPER}   WI    if (word_index>0){lemma->preterm=__SYNT_NTERM_NPR;lemma = word->duplicateLemma(LEMMAINDEX);}
    64 }}}
     68}}}
     69[[BR]]
    6570
    66 The output of "gen_lex.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. If a lexical rule contains ".*" in an RE, then it is replaced by the special macro STRING. The STRING macro is specified as follows: {{{STRING = [^\t\n\r\0]*}}}. '''This is a special macro, and the user is not allowed to define a macro named STRING.'''
     71The output of "gen_lex.py" is a re2c-like file in which all strings from the text file are transformed into Unicode code point format. In the tag regular expression part, every ".*" is replaced with the special macro '''STRING''' (specified as {{{STRING = [^\t\n\r\0]*}}}) or, if the expression contains strings such as "k1", with {{{[^\t\n\r\0k]*}}}. This modification makes the program faster when the re2c file is transformed into the final automaton written in C. '''The STRING macro must not be used in user input.'''
    6772
    68 In the tag regular expression part, ".*" is replaced with STRING or, if the expression contains strings such as "k1", with {{{[^\t\n\r\0k]*}}}. This modification makes the program faster when the re2c file is transformed into the final automaton written in C.
     73[[BR]]
    6974
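The ".*" replacement in the tag regular expression can be illustrated with a short Python sketch. It is not taken from "gen_lex.py": the function name `replace_wildcards` is hypothetical, and the handling of several quoted strings (excluding the first character of each one) is an assumption extrapolated from the single "k1" example above.

```python
import re

# Characters always excluded by the STRING macro, kept as literal text
# (backslash sequences), the way they would appear in the re2c file.
BASE_EXCLUDED = r"\t\n\r\0"


def replace_wildcards(tag_re):
    """Replace every '.*' in a tag RE (hypothetical helper).

    Without quoted strings, '.*' becomes the STRING macro. Otherwise it becomes
    a character class that also excludes the first character of each quoted
    string (assumption for the case of more than one string).
    """
    literals = re.findall(r'"([^"]+)"', tag_re)
    if not literals:
        return tag_re.replace(".*", "STRING")
    extra = ""
    for literal in literals:
        if literal[0] not in extra:
            extra += literal[0]
    return tag_re.replace(".*", "[^" + BASE_EXCLUDED + extra + "]*")


print(replace_wildcards('.*'))
print(replace_wildcards('"k1".*'))
```

The sketch does not escape characters that are special inside a character class (for example `]`), so it only covers simple tag strings such as "k1".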
    70 The command to generate C code from the re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}. In the SYNT system, the Makefile does all the work.
     75The command to generate C code from the re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}.
    7176
    72 Usage of gen_lex.py: [[BR]]
     77Usage of gen_lex.py: {{{gen_lex.py [-u] < INPUT > OUTPUT}}}[[BR]]
    7378
    74 {{{gen_lex.py [-u] < INPUT > OUTPUT}}}[[BR]]
    75