= Processing of Very Large Text Collections =

== Why process natural language texts? ==

 * '''lots''' of information, growing every day (web)
 * need for '''fast''' and continuous knowledge mining
 * '''no time''' for human intervention
 * '''large''' data makes statistical processing possible
 * '''real''' data instead of false assumptions

== Information in Text ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/text.png)]]

== Text collection = a text corpus ==

 * a text collection is usually referred to as a '''text corpus'''
 * '''humanities''' → corpus linguistics, language learning
 * '''computer science''' → effective design of specialized database management systems
 * '''applications''' → use of ''any text'' as an information source

== Text Corpora as Information Source ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/goal.png)]]

== So what is a corpus? ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/what_is_corpus.png)]]

== Corpora ==

 * '''text type'''
   * ''general language'' (gathers domain-independent information: common-sense knowledge, global statistics, information defaults)
   * ''domain specific'' (gathers domain-specific information: terminology, in-domain knowledge, contrast to common texts)
 * '''timeline'''
   * ''synchronic'': one time period / time span (→ what is up now?)
   * ''diachronic'': different time periods / time spans (→ what are the trends?)
 * '''language, written/spoken, metadata annotation type, ...'''
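
To make the classification above concrete, here is a minimal Python sketch of how these dimensions might be recorded as corpus metadata; the class name, field names, and example values are illustrative assumptions, not a standard scheme:

{{{#!python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorpusDescription:
    """Illustrative metadata record for the corpus dimensions listed above."""
    name: str
    text_type: str                     # "general" or "domain-specific"
    timeline: str                      # "synchronic" or "diachronic"
    language: str                      # e.g. "en", "cs"
    modality: str                      # "written" or "spoken"
    annotation: Optional[str] = None   # e.g. "POS tags"; None for plain text

# Example: a synchronic, domain-specific corpus of written English texts
# (hypothetical corpus name, chosen only for the example)
corpus = CorpusDescription(
    name="example-medical-corpus",
    text_type="domain-specific",
    timeline="synchronic",
    language="en",
    modality="written",
    annotation="POS tags",
)
}}}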