Changes between Version 8 and Version 9 of LexicalAnalysis
- Timestamp: Feb 12, 2014, 12:23:20 PM (10 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
LexicalAnalysis
v8 | v9 | line
11 | 11 | adv -> NUMBER '.' MONTH
12 | 12 | }}}
   | 13 | [[BR]]
13 | 14 |
14 | 15 | Now we decide to change this lexical analysis (originally written in C-code) into format that can understand everybody who knows regular expressions (mainly linguists). This change make system easily adaptable for another language and more flexible for future development.
…  | …  |
25 | 26 | "k4".*"xC".* "dva" .* [0-9]+ preterm=__SYNT_NTERM_TWO;
26 | 27 | }}}
   | 28 | [[BR]]
27 | 29 |
28 | 30 | User can put a macro in regular expression too. This macro in lexical rule is then substituted by predefined regular expression. The definition of macro is: "m=NAME RE". Where NAME and RE are separated by tab white space "\t". Macro in lexical rule is then written in curly braces.
…  | …  |
37 | 39 | .* .* {IS_NUMBER} [0-9]+ preterm=NUM
38 | 40 | }}}
   | 41 | [[BR]]
39 | 42 |
40 | 43 | User is allowed to write comment for defined macro or lexical rule. The "gen_lex.py" can recognize two types of comments. Format of comment is "#=comment".
…  | …  |
56 | 59 | #=Top commnet for macro specification
57 | 60 | m=IS_NUMBER [0-9]+
58 |    | }}}
   | 61 | }}}
   | 62 | [[BR]]
59 | 63 |
60 | 64 | In lexical rules and macros you can use predefined variables word, lemma, morf_info, word index and lemma index. For example if you want to write lexical rule for words that are not in the firs place of sentence:
…  | …  |
62 | 66 | {{{
63 | 67 | "k1".* .* {ONLY_FIRST_UPPER} WI if (word_index>0){lemma->preterm=__SYNT_NTERM_NPR;lemma = word->duplicateLemma(LEMMAINDEX);}
64 |    | }}}
   | 68 | }}}
   | 69 | [[BR]]
65 | 70 |
66 |    | The output of "gen_lex.py" is re2c like file and all strings from text file is transformed into Unicode code point format. I f lexical rule contains signs ".*" in RE than it is replaced by special macro STRING. The STRING macro is specified as follows: {{{STRING = [^\t\n\r\0]*}}}. '''This is only special macro and user is not allowed to specified macro with name STRING.'''
   | 71 | The output of "gen_lex.py" is re2c like file and all strings from text file is transformed into Unicode code point format. In tag regular expression part all ".*" signs are replaced with special macro '''STING''' (specified as follows {{{STRING = [^\t\n\r\0]*}}}) or if there are some strings like "k1" then it is replaced by {{{[^\t\n\r\0k]*}}}. This modification makes program faster when the re2c file is transformed in final automata written in C-code. '''This STRING macro is not allowed to use in user input.'''
67 | 72 |
68 |    | In tag regular expression part sign ".*" s replaced with STING or if there are some strings like "k1" then sign is replaced by {{{[^\t\n\r\0k]*}}}. This modification makes program faster when the re2c file is transformed in final automata written in C-code.
   | 73 | [[BR]]
69 | 74 |
70 |    | The command to generate C-code from re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}. In SYNT system Makefile makes all work.
   | 75 | The command to generate C-code from re2c file is: {{{re2c -isuF -o analyze_lex.c analyze_lex.re2c}}}.
71 | 76 |
72 |    | Usage of gen_lex.py: [[BR]]
   | 77 | Usage of gen_lex.py: {{{gen_lex.py [-u] < INPUT > OUTPUT}}}[[BR]]
73 | 78 |
74 |    | {{{gen_lex.py [-u] < INPUT > OUTPUT}}}[[BR]]
75 |    |
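
The macro convention described in the page text (a definition "m=NAME RE" with a tab between NAME and RE, a reference written as {NAME} in a lexical rule, and STRING being reserved) can be illustrated with a minimal Python sketch. This is not the code of gen_lex.py, whose implementation is not shown here; the helper names `parse_macros` and `expand_macros` are hypothetical, and plain textual substitution is an assumption about how the expansion could work.

```python
import re

# Assumption: STRING is the only reserved macro name, as the page states.
RESERVED = {"STRING"}


def parse_macros(lines):
    """Collect 'm=NAME<TAB>RE' definitions into a dict (hypothetical helper)."""
    macros = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("m="):
            # Comments ('#=...') and lexical rules are skipped in this sketch.
            continue
        name, sep, regex = line[2:].partition("\t")
        if not sep:
            raise ValueError(f"NAME and RE must be separated by a tab: {line!r}")
        if name in RESERVED:
            raise ValueError(f"macro name {name!r} is reserved")
        macros[name] = regex
    return macros


def expand_macros(rule, macros):
    """Replace each {NAME} in a rule with the regular expression defined for NAME."""
    def substitute(match):
        # Unknown names are left untouched rather than treated as errors.
        return macros.get(match.group(1), match.group(0))

    # Only identifier-like names match, so embedded C code in braces
    # (for example '{lemma->preterm=...;}') is not altered.
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, rule)


if __name__ == "__main__":
    macros = parse_macros(["#=Top comment for macro specification", "m=IS_NUMBER\t[0-9]+"])
    print(expand_macros(".* .* {IS_NUMBER} [0-9]+ preterm=NUM", macros))
```

Restricting the pattern to identifier-like names is a design choice of this sketch: it keeps the C action part of a rule, which also uses curly braces, out of the substitution.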