| 38 | == Why does size matter so much? == |
| 39 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/distribution.png)]] |
| 40 | |
| 41 | == Corpora now == |
| 42 | |
| 43 | Corpora at NLP Centre: |
| 44 | * '''LARGE:''' billions (~10^(10)) of words |
| 45 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpora_langs.png)]] |
| 46 | * '''COMPLEX:''' multi-level, multi-value annotation, wide range of languages |
| 47 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/query.png)]] |
| 48 | |
| 49 | There is a strong need for search and retrieval that is: |
| 50 | * '''INTELLIGENT:''' complex searching involving large amounts of metadata |
| 51 | * '''VERY FAST:''' parallel and distributed processing |
| 52 | * '''ACCESSIBLE:''' interfaces for automatic processing via third-party tools |
| 53 | |
| 54 | == Applications == |
| 55 | |
| 56 | * '''information systems''' (going beyond fulltext search) |
| 57 | * '''information analytics''' (opinion mining, marketing assessment) |
| 58 | * '''intelligent text processing''' (predictive and adaptive writing, correction tools, efficient writing on mobile devices) |
| 59 | * '''computer lexicography''' (better dictionaries, larger dictionaries) |
| 60 | * '''machine translation''' (parallel corpora) |
| 61 | * '''statistics''' for enhancing NLP tools |
| 62 | |
| 63 | == What can we offer? == |
| 64 | |
| 65 | Ready-made tools for corpus building, management and effective search: |
| 66 | * '''Building:''' from your own data or from the web, including crawling, cleaning and deduplication |
| 67 | * '''Management:''' effective indexing in special DBMS |
| 68 | * '''Search:''' very fast evaluation of complex queries, keyword extraction, extraction of semantically related words, word sketches |
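
To give a feel for what keyword extraction involves, here is a minimal sketch, not the implementation used in the tools listed above. It assumes one common approach: score each word by the ratio of its normalised frequency in a focus corpus to its normalised frequency in a reference corpus. The function name `keyness` and the `smoothing` constant are hypothetical, chosen only for illustration.

```python
from collections import Counter


def keyness(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank words of the focus corpus by a smoothed frequency ratio.

    Illustrative only: `smoothing` is an assumed constant that avoids
    division by zero for words missing from the reference corpus.
    """
    focus = Counter(focus_tokens)
    ref = Counter(reference_tokens)
    focus_total = sum(focus.values())
    ref_total = sum(ref.values())
    if focus_total == 0 or ref_total == 0:
        raise ValueError("both corpora must contain at least one token")

    scores = {}
    for word, count in focus.items():
        # Normalise raw counts to frequencies per million tokens.
        focus_fpm = count * 1_000_000 / focus_total
        ref_fpm = ref[word] * 1_000_000 / ref_total
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)

    # Highest score first: words most typical of the focus corpus.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


focus_corpus = "corpus corpus query the the".split()
reference_corpus = "the the the the cat".split()
ranked = keyness(focus_corpus, reference_corpus)
```

In this toy run, words that occur only in the focus corpus rank highest, while a function word such as "the", which is relatively more frequent in the reference corpus, ranks last. On corpora of billions of words the counting step would come from a prebuilt index rather than in-memory token lists.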
| 69 | |
| 70 | Most of the tools are part of Sketch Engine, a product developed in collaboration with Lexical Computing Ltd. |
| 71 | |
| 72 | |
| 73 | == Demo: Sketch Engine == |
| 74 | |
| 75 | Compare and contrast words visually |
| 76 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/comparison.png)]] |
| 77 | |
| 78 | Build specialised corpora instantly from the Web |
| 79 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_build.png)]] |
| 80 | |
| 81 | Thesaurus |
| 82 | [[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_test.png)]] |
| 83 | |
| 84 | |
| 85 | == Conclusions == |
| 86 | Text corpora represent a '''valuable information source''' useful for many practical applications. |
| 87 | |
| 88 | Corpora as text databases require '''special solutions''' that are fast and powerful. |
| 89 | |
| 90 | There are a number of '''tools developed in the NLP Centre''' for corpus building, management and efficient search. |