Processing of Very Large Text Collections

Why process natural language texts?

Information in Text

text collection: usually referred to as a text corpus
humanities → corpus linguistics, language learning
computer science → effective design of specialized database management systems
applications → usage of any text as information source

text type
- general language (gather domain-independent information: common-sense knowledge, global statistics, information defaults)
- domain specific (gather domain-specific information: terminology, in-domain knowledge, contrast with general texts)
timeline
- synchronic: one time period / time span (→ what is up now?)
- diachronic: different time periods / time spans (→ what are the trends?)
language, written/spoken, metadata annotation type, ...

Corpora at NLP Centre:

A big need for search/retrieval that is:

information systems (going beyond fulltext search)
information analytics (opinion mining, marketing assessment)
intelligent text processing (predictive and adaptive writing, correction tools, efficient writing on mobile devices)
computer lexicography (better dictionaries, larger dictionaries)
machine translation (parallel corpora)
statistics for enhancing NLP tools

Ready-made tools for corpus building, management and effective search:

Building: from own data or from the web; crawling, cleaning, deduplication
Management: effective indexing in special DBMS
Search: very fast evaluation of complex queries, keyword extraction, extraction of semantically related words, word sketches
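
As an illustration of the deduplication step in corpus building, here is a minimal sketch that drops exact repeats of a paragraph after light normalization. It is not the method used by the NLP Centre tools: the function names `normalize` and `deduplicate` are hypothetical, and production pipelines typically also detect near-duplicates, which this sketch does not attempt.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, drop later exact repeats.

    Assumption: two paragraphs count as duplicates only if their
    normalized forms are identical (no near-duplicate detection).
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        digest = hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(paragraph)
    return unique
```

Storing fixed-size digests instead of the paragraphs themselves keeps the memory footprint small, which matters once the collection no longer fits comfortably in memory.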

Most of the tools are part of Sketch Engine, a product developed in collaboration with Lexical Computing Ltd.

Text corpora represent a valuable information source useful for many practical applications.

Corpora as text databases require special solutions that are fast and powerful.
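
To give a rough idea of why corpus search relies on dedicated indexing rather than scanning the text, the sketch below builds a small positional inverted index and uses it to answer a phrase query. It is a toy illustration under simplifying assumptions (whitespace tokenization, everything in memory), not a description of how the NLP Centre tools or Sketch Engine are implemented. The names `build_index` and `find_phrase` are hypothetical.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lowercased token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def find_phrase(index, phrase):
    """Return start positions at which the words of `phrase` occur consecutively."""
    words = [w.lower() for w in phrase]
    if not words:
        return []
    candidates = index.get(words[0], [])
    # Position sets of the remaining words, for constant-time membership tests.
    later = [set(index.get(w, [])) for w in words[1:]]
    return [
        start
        for start in candidates
        if all(start + offset + 1 in positions
               for offset, positions in enumerate(later))
    ]


tokens = "the cat sat on the mat".split()
index = build_index(tokens)
print(find_phrase(index, ["the", "mat"]))  # prints [4]
```

The query only touches the position lists of the words it mentions, so its cost depends on how frequent those words are rather than on the total size of the corpus.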

A number of tools for corpus building, management and efficient search have been developed at the NLP Centre.