Version 5 (modified by xkocinc, 10 years ago)

Processing of Very Large Text Collections

Why process natural language texts?

  • lots of information, growing every day (web)
  • need for fast and continuous knowledge mining
  • no time for human intervention
  • large data volumes make statistical processing possible
  • real data instead of false assumptions

Information in Text

/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/text.png

Text collection = a text corpus

  • text collection: usually referred to as a text corpus
  • humanities → corpus linguistics, language learning
  • computer science → effective design of specialized database management systems
  • applications → usage of any text as information source

Text Corpora as Information Source

/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/goal.png

So what is a corpus?

/trac/research/raw-attachment/wiki/en/ProcessingLargeTextCollections/what_is_corpus.png

Corpora

  • text type
    • general language (gathers domain-independent information: common-sense knowledge, global statistics, information defaults)
    • domain-specific (gathers domain-specific information: terminology, in-domain knowledge, contrast with common texts)
  • timeline
    • synchronic: one time period / time span (→ what is up now?)
    • diachronic: different time periods / time spans (→ what are the trends?)
  • further criteria: language, written/spoken, metadata annotation type, ...
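
The "contrast to common texts" idea for domain-specific corpora can be illustrated with a small sketch: words that are much more frequent in a domain corpus than in a general-language corpus are candidate terminology. This is only one simple possibility, not a method taken from this page. The function name `keyness`, the `smoothing` parameter, and the per-million smoothed frequency ratio are assumptions chosen for illustration, and tokenization is reduced to plain whitespace splitting.

```python
from collections import Counter


def keyness(domain_tokens, general_tokens, smoothing=1.0):
    """Rank domain words by a smoothed frequency ratio (hypothetical helper).

    For each word in the domain corpus, compare its frequency per million
    tokens with its frequency per million tokens in the general corpus.
    The smoothing constant avoids division by zero for words that never
    occur in the general corpus.
    """
    if not domain_tokens or not general_tokens:
        raise ValueError("both corpora must contain at least one token")

    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    domain_total = len(domain_tokens)
    general_total = len(general_tokens)

    scores = {}
    for word, count in domain_counts.items():
        domain_pm = count * 1_000_000 / domain_total
        # Counter returns 0 for words missing from the general corpus.
        general_pm = general_counts[word] * 1_000_000 / general_total
        scores[word] = (domain_pm + smoothing) / (general_pm + smoothing)

    # Highest ratio first: the most domain-typical words.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data only; real corpora would contain millions of tokens.
domain = "the corpus is a text corpus".split()
general = "the cat is on the mat".split()
ranked = keyness(domain, general)
```

On realistic data the general corpus should be large enough that common words have stable frequencies; with toy input such as the above, the ranking only shows the mechanics.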

So is there a property that every corpus should aim for?