= Processing of Very Large Text Collections =

Why process natural language texts?

 * '''lots''' of information, growing every day (web)
 * need for '''fast''' and continuous knowledge mining
 * '''no time''' for human intervention
 * '''large''' data make statistical processing possible
 * '''real''' data instead of false assumptions

== Information in Text ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/text.png)]]

== Text collection = a text corpus ==

 * text collection: usually referred to as a '''text corpus'''
 * '''humanities''' → corpus linguistics, language learning
 * '''computer science''' → effective design of specialized database management systems
 * '''applications''' → usage of ''any text'' as an information source

== Text Corpora as Information Source ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/goal.png)]]

== So what is a corpus? ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/what_is_corpus.png)]]

== Corpora ==

 * '''text type'''
   * ''general language'' (gather domain-independent information: common sense knowledge, global statistics, information defaults)
   * ''domain specific'' (gather domain-specific information: terminology, in-domain knowledge, contrast to common texts)
 * '''timeline'''
   * ''synchronic'': one time period / time span (→ what is up now?)
   * ''diachronic'': different time periods / time spans (→ what are the trends?)
 * '''language, written/spoken, metadata annotation type, ...'''

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpora_size.png)]]

== Why does size matter so much? ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/distribution.png)]]

== Corpora now ==

Corpora at the NLP Centre:

 * '''LARGE:''' billions (~10^10^) of words

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_langs.png)]]

 * '''COMPLEX:''' multi-level, multi-value annotation, wide range of languages

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/query.png)]]

There is a strong need for search/retrieval that is:

 * '''INTELLIGENT:''' complex searching involving large amounts of metadata
 * '''VERY FAST:''' parallel and distributed processing
 * '''ACCESSIBLE:''' interfaces for automatic processing via third-party tools

== Applications ==

 * '''information systems''' (going beyond fulltext search)
 * '''information analytics''' (opinion mining, marketing assessment)
 * '''intelligent text processing''' (predictive and adaptive writing, correction tools, effective writing on mobile devices)
 * '''computer lexicography''' (better dictionaries, larger dictionaries)
 * '''machine translation''' (parallel corpora)
 * '''statistics''' for enhancing NLP tools

== What can we offer? ==

Ready-made tools for corpus building, management and effective search:

 * '''Building:''' from own data or from the web; crawling, cleaning, deduplication
 * '''Management:''' effective indexing in a special DBMS
 * '''Search:''' very fast evaluation of complex queries, keyword extraction, extraction of semantically related words, word sketches

Most of the tools are part of Sketch Engine, a product developed in collaboration with Lexical Computing Ltd.
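
To make the idea of keyword extraction more concrete, the following is a minimal sketch of how domain-specific words can be found by contrasting a focus corpus with a general-language reference corpus. It is an illustration only, not the Sketch Engine implementation: the function names, the smoothed frequency-ratio score, and the `smoothing` parameter are all assumptions chosen for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def per_million(counts, total):
    """Convert raw counts to frequencies per million tokens."""
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keywords(focus_text, reference_text, smoothing=1.0, top=5):
    """Rank words of the focus text by how much more frequent they are
    there than in the reference text (hypothetical scoring, see lead-in)."""
    focus = Counter(tokenize(focus_text))
    reference = Counter(tokenize(reference_text))
    focus_fpm = per_million(focus, sum(focus.values()))
    reference_fpm = per_million(reference, sum(reference.values()))
    # Smoothed ratio of relative frequencies; the constant avoids
    # division by zero for words missing from the reference text.
    scores = {
        word: (freq + smoothing) / (reference_fpm.get(word, 0.0) + smoothing)
        for word, freq in focus_fpm.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]
```

Calling `keywords(domain_text, general_text)` returns the words that stand out in the domain text. Words that are common in both texts, such as function words, score close to 1 and fall to the bottom of the ranking. On real corpora the counts would come from an index rather than from re-tokenizing raw text on every call.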
== Demo: Sketch Engine ==

Compare and contrast words visually

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/comparison.png)]]

Build specialised corpora instantly from the Web

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_build.png)]]

Thesaurus

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/corpus_test.png)]]

== Conclusions ==

Text corpora represent a '''valuable information source''' useful for many practical applications.

Corpora as text databases require '''special solutions''' that are fast and powerful.

There are a number of '''tools developed in the NLP Centre''' for corpus building, management and efficient search.