Changes between Version 6 and Version 7 of en/ProcessingLargeTextCollections


Timestamp:
Jun 5, 2014, 11:48:11 AM (10 years ago)
Author:
xkocinc

  • en/ProcessingLargeTextCollections

    v6 v7  
    3636[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpora_size.png)]]
    3737
     38== Why does size matter so much? ==
     39[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/distribution.png)]]
     40
     41== Corpora now ==
     42
     43Corpora at the NLP Centre:
     44 * '''LARGE:''' billions of words (~10^10)
     45[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpora_langs.png)]]
     46 * '''COMPLEX:''' multi-level, multi-value annotation; wide range of languages
     47[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/query.png)]]
     48
     49There is a strong need for search and retrieval that is:
     50 * '''INTELLIGENT:''' complex searching involving large amounts of metadata
     51 * '''VERY FAST:''' parallel and distributed processing
     52 * '''ACCESSIBLE:''' interfaces for automatic processing via third-party tools
     53
     54== Applications ==
     55 
     56 * '''information systems''' (going beyond fulltext search)
     57 * '''information analytics''' (opinion mining, marketing assessment)
     58 * '''intelligent text processing''' (predictive and adaptive writing, correction tools, effective writing on mobile devices)
     59 * '''computer lexicography''' (better dictionaries, larger dictionaries)
     60 * '''machine translation''' (parallel corpora)
     61 * '''statistics''' for enhancing NLP tools
     62
     63== What can we offer? ==
     64
     65Ready-made tools for corpus building, management and effective search:
     66 * '''Building:''' from your own data or from the web; crawling, cleaning, deduplication
     67 * '''Management:''' effective indexing in special DBMS
     68 * '''Search:''' very fast evaluation of complex queries, keyword extraction, extraction of semantically related words, word sketches
     69
     70Most of the tools are part of Sketch Engine, a product developed in collaboration with Lexical Computing Ltd.
     71
     72
     73== Demo: Sketch Engine ==
     74
     75Compare and contrast words visually
     76[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/comparison.png)]]
     77
     78Build specialised corpora instantly from the Web
     79[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_build.png)]]
     80
     81Thesaurus
     82[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_test.png)]]
     83
     84
     85== Conclusions ==
     86Text corpora represent a '''valuable information source''' useful for many practical applications.
     87
     88Corpora as text databases require '''special solutions''' that are fast and powerful.
     89
     90There are a number of '''tools developed at the NLP Centre''' for corpus building, management and efficient search.
    3891
    3992
    4093
    41