| 5 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/simon_britney.png)]] |
| 6 | |
| 7 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/ukazka1.png)]] |
| 8 | |
| 9 | == Sentence level analysis == |
| 10 | |
| 11 | ''' Natural language syntax ''' |
| 12 | * describes relationships among words |
| 13 | |
| 14 | ''' Automatic syntactic analysis ''' |
| 15 | * reveals relationships between words at various levels |
| 16 | * detects phrases (noun, prepositional, verb, ...) and clauses |
| 17 | |
| 18 | * '''| Simon | spoke | about sex | with Britney Spears |''' |
| 19 | * '''| Simon | spoke | about sex with Britney Spears |''' |
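The two bracketings above can be sketched in code. This is only a minimal illustration: the word spans are written by hand, whereas a real analyser would compute them, and the names `WORDS`, `SEGMENTATIONS` and `render` are hypothetical.

```python
# Toy illustration: one word sequence, two competing phrase segmentations.
WORDS = "Simon spoke about sex with Britney Spears".split()

# Each segmentation is a list of (start, end) word spans.
# The spans are hand-written here; a parser would have to produce them.
SEGMENTATIONS = {
    # "with Britney Spears" attaches to the verb "spoke"
    "verb_attachment": [(0, 1), (1, 2), (2, 4), (4, 7)],
    # "with Britney Spears" attaches to the noun "sex"
    "noun_attachment": [(0, 1), (1, 2), (2, 7)],
}


def render(words, spans):
    """Join the words of each span and separate the spans with ' | '."""
    chunks = [" ".join(words[start:end]) for start, end in spans]
    return "| " + " | ".join(chunks) + " |"


if __name__ == "__main__":
    for name, spans in SEGMENTATIONS.items():
        print(name, render(WORDS, spans))
```

Both segmentations cover exactly the same words, so the choice between them cannot be made from the word sequence alone.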
| 20 | |
| 21 | == Syntactic trees == |
| 22 | |
| 23 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/tree1.png)]] |
| 24 | |
| 25 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/tree2.png)]] |
| 26 | |
| 27 | == Why are we doing this? == |
| 28 | |
| 29 | Syntactic units are carriers of meaning |
| 30 | * “in the city” |
| 31 | * the meaning of “in” or “the” alone is unclear and complicated |
| 32 | * meaning of “in the city” is simply '''where''' |
| 33 | |
| 34 | Words are not enough |
| 35 | * '''red brick house''' vs. '''brick house red''' vs. '''red house brick''' |
| 36 | * '''Honey, give me love''' vs. '''Love, give me honey''' |
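A short sketch of why words alone are not enough: if word order and structure are thrown away, the example pairs above become indistinguishable. The helper `bag_of_words` is a hypothetical name, and stripping commas and lowercasing is a simplifying assumption.

```python
from collections import Counter


def bag_of_words(text):
    """Count the words of a text, ignoring case, commas and word order."""
    return Counter(text.lower().replace(",", "").split())


first = "Honey, give me love"
second = "Love, give me honey"

# The sentences differ, but their bags of words are identical.
same_bag = bag_of_words(first) == bag_of_words(second)
```

Any method that only sees these counts must treat both sentences the same way, which is the gap that syntactic analysis fills.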
| 37 | |
| 38 | Starting point for intelligent natural language applications |
| 39 | * extraction of facts & question answering |
| 40 | * logical analysis |
| 41 | * punctuation detection & grammar checking |
| 42 | * natural text generation |
| 43 | * authorship detection |
| 44 | * machine translation |
| 45 | |
| 46 | == Example: Extraction of facts == |
| 47 | |
| 48 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/ukazka2.png)]] |
| 49 | |
| 50 | |
| 51 | == Example: Logical analysis == |
| 52 | |
| 53 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/ukazka3.png)]] |
| 54 | |
| 55 | |
| 56 | == Example: Grammar checking == |
| 57 | |
| 58 | * Let’s eat grandma! |
| 59 | * syntactic analysis |
| 60 | * detection of non-probable constructions |
| 61 | * -> “grandma” is not a usual object of “eat” |
| 62 | * -> correction suggestion |
| 63 | |
| 64 | * Let’s eat, grandma! |
| 65 | * life saved :) |
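The steps above can be sketched roughly as follows. This is a toy illustration, not the checker described here: it assumes that a syntactic analyser has already delivered the verb and its object, and the table `OBJECT_COUNTS`, the set `ADDRESSABLE` and all numbers in them are invented for the example.

```python
# Hypothetical counts of how often a noun was seen as the direct object
# of a verb in some parsed corpus. The numbers are invented.
OBJECT_COUNTS = {
    ("eat", "pizza"): 120,
    ("eat", "soup"): 85,
    ("eat", "grandma"): 0,
}

# Hypothetical list of nouns that are commonly used to address a person.
ADDRESSABLE = {"grandma", "grandpa", "mum", "dad"}


def check_verb_object(verb, obj, threshold=1):
    """Return a suggestion if (verb, obj) looks improbable, else None."""
    if OBJECT_COUNTS.get((verb, obj), 0) >= threshold:
        return None  # the construction is common enough
    if obj in ADDRESSABLE:
        return f"insert a comma before '{obj}'"
    return f"unusual object '{obj}' for verb '{verb}'"


if __name__ == "__main__":
    print(check_verb_object("eat", "grandma"))
```

The suggestion is only as good as the analysis it receives: without the verb-object relation from the syntactic step, there is nothing to look up.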
| 66 | |
| 67 | [[Image(/trac/research/raw-attachment/wiki/en/SentenceLevelTextAnalysis/punctuation.jpg)]] |
| 68 | |
| 69 | |
| 70 | Similarly with other grammatical phenomena: |
| 71 | * “This is worth try” -> “This is worth try'''ing'''” |
| 72 | |
| 73 | |
| 74 | == How to analyse natural language syntax? == |
| 75 | |
| 76 | '''Prerequisites''' |
| 77 | * '''word level analysis''' (part of speech, gender, number) |
| 78 | * named entity recognition |
| 79 | * common sense information (e.g. “pregnant” goes with women only) |
| 80 | |
| 81 | '''Named entity recognition''' |
| 82 | * determine that e.g. “prof. Václav Šplíchal” is a person |
| 83 | * can be viewed as a sub-task of syntactic analysis |
| 84 | |
| 85 | '''Statistical methods''' |
| 86 | * people annotate a corpus |
| 87 | * statistical methods learn rules from the corpus |
| 88 | * universal across languages (to some extent) |
| 89 | * annotation is expensive |
| 90 | * hard to customize for different applications |
| 91 | * the annotated data are usually not large enough |
| 92 | |
| 93 | '''Rule-based methods''' |
| 94 | * specialists develop a set of rules (“grammar”) |
| 95 | * not universal, depends on specialists |
| 96 | * the grammar can become hard to maintain |
| 97 | * easy to customize for different applications |
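To make the idea of a hand-written rule concrete, here is a minimal sketch of a single rule applied to part-of-speech tagged words. It is an assumption-laden toy and not the grammar format of any particular analyser: the rule (optional determiner, any adjectives, one or more nouns), the tag names and the function `find_noun_phrases` are all hypothetical.

```python
def find_noun_phrases(tagged):
    """Find phrases matching: optional DET, any number of ADJ, one or more NOUN.

    `tagged` is a list of (word, tag) pairs from word level analysis.
    """
    phrases = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # one or more nouns
            k += 1
        if k > j:  # the rule matched: at least one noun was found
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:  # no match starting here, try the next word
            i += 1
    return phrases


SENTENCE = [
    ("the", "DET"), ("red", "ADJ"), ("brick", "NOUN"), ("house", "NOUN"),
    ("stands", "VERB"), ("in", "ADP"), ("the", "DET"), ("city", "NOUN"),
]
```

Customising such a grammar means editing the rule directly, which is easy for one rule and harder once many rules interact.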
| 98 | |
| 99 | '''Hybrids''' |
| 100 | |
| 101 | |
| 102 | |
| 103 | == Syntactic analysers in the NLP Centre == |
| 104 | |
| 105 | '''Synt''' |
| 106 | * C++, fast (0.07 s/sentence) |
| 107 | * based on an expressive meta-grammar |
| 108 | |
| 109 | '''SET''' |
| 110 | * Python, slower but easily adaptable |
| 111 | * based on a set of phrase patterns |
| 112 | |
| 113 | '''Synt+SET''' |
| 114 | * rule-based backbone with statistical extensions |
| 115 | * grammars for Czech, English and Slovak |
| 116 | * accuracy 85–90 % on newspaper texts |
| 117 | |
| 118 | '''Word Sketches''' |
| 119 | * very fast shallow syntax for large corpora |
| 120 | * 31 languages |
| 121 | |
| 122 | |
| 123 | == Conclusions == |
| 124 | Sentence level analysis |
| 125 | * detection of phrases and inter-word relationships |
| 126 | * their further processing |
| 127 | |
| 128 | Applications |
| 129 | * grammar checking |
| 130 | * extraction of information from text |
| 131 | * text generation |