= Processing of Very Large Text Collections =

== Why process natural language texts? ==

 * '''lots''' of information, growing every day (web)
 * need for '''fast''' and continuous knowledge mining
 * '''no time''' for human intervention
 * '''large''' data makes statistical processing possible
 * '''real''' data instead of false assumptions

== Information in Text ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/text.png)]]

== Text collection = a text corpus ==

 * a text collection is usually referred to as a '''text corpus'''
 * '''humanities''' → corpus linguistics, language learning
 * '''computer science''' → effective design of specialized database management systems
 * '''applications''' → use of ''any text'' as an information source

== Text Corpora as Information Source ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/goal.png)]]

== So what is a corpus? ==

[[Image(/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/what_is_corpus.png)]]

== Corpora ==

 * '''text type'''
   * ''general language'' (gathers domain-independent information: common-sense knowledge, global statistics, information defaults)
   * ''domain specific'' (gathers domain-specific information: terminology, in-domain knowledge, contrast to common texts)
 * '''timeline'''
   * ''synchronic'': one time period / time span (→ what is up now?)
   * ''diachronic'': different time periods / time spans (→ what are the trends?)
 * '''language, written/spoken, metadata annotation type, ...'''
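
To make the classification above concrete, here is a minimal Python sketch of how these dimensions might be recorded as corpus metadata; the class name, field names, and example values are illustrative assumptions, not a standard scheme:

{{{#!python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CorpusDescription:
    """Illustrative metadata record for the corpus dimensions listed above."""
    name: str
    text_type: str                     # "general" or "domain-specific"
    timeline: str                      # "synchronic" or "diachronic"
    language: str                      # e.g. "en", "cs"
    modality: str                      # "written" or "spoken"
    annotation: Optional[str] = None   # e.g. "POS tags"; None for plain text

# Example: a synchronic, domain-specific corpus of written English texts
# (hypothetical corpus name, chosen only for the example)
corpus = CorpusDescription(
    name="example-medical-corpus",
    text_type="domain-specific",
    timeline="synchronic",
    language="en",
    modality="written",
    annotation="POS tags",
)
}}}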