== Information in Text ==

[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/text.png)]]

== Text collection = a text corpus ==

 * text collection: usually referred to as a '''text corpus'''
 * '''humanities''' → corpus linguistics, language learning
 * '''computer science''' → effective design of specialized database management systems
 * '''applications''' → use of ''any text'' as an information source

== Text Corpora as Information Source ==
[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/goal.png)]]

== So what is a corpus? ==
[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/what_is_corpus.png)]]

== Corpora ==
 * '''text type'''
   * ''general language'' (gathers domain-independent information: common-sense knowledge, global statistics, information defaults)
   * ''domain-specific'' (gathers domain-specific information: terminology, in-domain knowledge, contrast with common texts)
 * '''timeline'''
   * ''synchronic'': one time period / time span (→ what is happening now?)
   * ''diachronic'': different time periods / time spans (→ what are the trends?)
 * '''language, written/spoken, metadata annotation type, ...'''

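The synchronic/diachronic distinction above can be illustrated with a minimal sketch. It assumes a diachronic corpus is stored as a mapping from time period to a list of texts and uses naive whitespace tokenization. The names `relative_frequency`, `trend`, and the toy data in `diachronic_corpus` are hypothetical, introduced only for illustration.

```python
from collections import Counter


def relative_frequency(texts, word):
    """Share of tokens equal to `word` in a list of texts.

    Tokenization is a naive lowercase whitespace split (an assumption;
    a real corpus tool would use a proper tokenizer).
    """
    tokens = [tok.lower() for text in texts for tok in text.split()]
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] / len(tokens)


def trend(corpus, word):
    """Relative frequency of `word` per time period, ordered by period label."""
    return {
        period: relative_frequency(texts, word)
        for period, texts in sorted(corpus.items())
    }


# Hypothetical toy diachronic corpus: time period -> texts from that period.
diachronic_corpus = {
    "1990s": ["the modem is slow", "the fax arrived"],
    "2010s": ["the cloud is fast", "the cloud stores the data"],
}

# A synchronic view looks at a single period only.
synchronic_view = relative_frequency(diachronic_corpus["2010s"], "cloud")

# A diachronic view compares the same measure across periods.
cloud_trend = trend(diachronic_corpus, "cloud")
```

Here `synchronic_view` answers "what is happening now?" for one period, while `cloud_trend` holds one value per period and so answers "what are the trends?".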