Version 4 (modified by xkocinc, 10 years ago)


Processing of Very Large Text Collections

Why process natural language texts?

  • lots of information, growing every day (web)
  • need for fast and continuous knowledge mining
  • no time for human intervention
  • large amounts of data make statistical processing possible
  • real data instead of false assumptions

Information in Text


Text collection = a text corpus

  • text collection: usually referred to as a text corpus
  • humanities → corpus linguistics, language learning
  • computer science → effective design of specialized database management systems
  • applications → use of any text as an information source

Text Corpora as Information Source


So what is a corpus?



  • text type
    • general language (gathers domain-independent information: common-sense knowledge, global statistics, information defaults)
    • domain specific (gathers domain-specific information: terminology, in-domain knowledge, contrast with general texts)
  • timeline
    • synchronic: one time period / time span (→ what is happening now?)
    • diachronic: different time periods / time spans (→ what are the trends?)
  • other criteria: language, written/spoken, metadata annotation type, ...
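
The classification above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy documents, the `domain` and `year` metadata fields, and the helper names `domain_terms` and `counts_by_year` are hypothetical and do not come from any particular corpus tool. The sketch contrasts a domain-specific subcorpus with a general one to find candidate terms, and groups counts by year to give a diachronic view.

```python
from collections import Counter

# Hypothetical toy corpus: each document carries simple metadata
# (text type as "domain", timeline as "year").
DOCS = [
    {"text": "the patient received a new vaccine", "domain": "medical", "year": 2010},
    {"text": "the vaccine trial tested the vaccine", "domain": "medical", "year": 2020},
    {"text": "the cat sat on the mat", "domain": "general", "year": 2010},
    {"text": "the dog chased the cat", "domain": "general", "year": 2020},
]


def word_counts(docs):
    """Count whitespace-separated tokens over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["text"].split())
    return counts


def relative_freq(counts):
    """Turn absolute counts into relative frequencies."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def domain_terms(docs, domain, reference="general"):
    """Words relatively more frequent in `domain` than in the reference subcorpus."""
    dom = relative_freq(word_counts([d for d in docs if d["domain"] == domain]))
    ref = relative_freq(word_counts([d for d in docs if d["domain"] == reference]))
    return sorted(w for w, f in dom.items() if f > ref.get(w, 0.0))


def counts_by_year(docs, word):
    """Diachronic view: number of occurrences of `word` per year."""
    per_year = Counter()
    for doc in docs:
        per_year[doc["year"]] += doc["text"].split().count(word)
    return dict(per_year)


print(domain_terms(DOCS, "medical"))
print(counts_by_year(DOCS, "vaccine"))
```

On data this small the contrast is crude: a function word such as "a" is reported as a medical term only because it happens to be missing from the general subcorpus. That is the point made earlier about large data: reliable statistics need far more text, and real systems typically use a proper statistical measure rather than a bare frequency comparison.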
