Changes between Version 1 and Version 2 of en/ProcessingLargeTextCollections


Timestamp:
Jun 5, 2014, 11:37:29 AM (10 years ago)
Author:
xkocinc
Comment:

--

  • en/ProcessingLargeTextCollections

    v1 v2  
    8 8 * '''real''' data instead of false assumptions
    9 9
     10== Information in Text ==
     11
     12[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/text.png)]]
     13
     14== Text collection = a text corpus ==
     15
     16 * text collection: usually referred to as '''text corpus'''
     17 * '''humanities''' → corpus linguistics, language learning
     18 * '''computer science''' → effective design of specialized database management systems
     19 * '''applications''' → usage of ''any text'' as an information source
     20
     21== Text Corpora as Information Source ==
     22[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/goal.png)]]
     23
     24== So what is a corpus? ==
     25[[Image(/trac/research/raw-attachment/wiki/en/WordLevelAnalysis/what_is_corpus.png)]]
     26
     27== Corpora ==
     28 * '''text type'''
     29   * ''general language'' (gather domain-independent information: common-sense knowledge, global statistics, information defaults)
     30   * ''domain specific'' (gather domain-specific information: terminology, in-domain knowledge, contrast to common texts)
     31 * '''timeline'''
     32   * ''synchronic'': one time period / time span (→ what is happening now?)
     33   * ''diachronic'': different time periods / time spans (→ what are the trends?)
     34 * '''language, written/spoken, metadata annotation type, ...'''
     35
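The distinctions above (general vs. domain-specific, synchronic vs. diachronic) can be illustrated with simple word counts. The following is only a minimal sketch under stated assumptions: the function names `relative_frequencies`, `keyness` and `trend`, the smoothing constant, and all toy texts are hypothetical and are not part of the original page. It ranks domain terminology by contrasting a domain corpus with a general one, and tracks a word's relative frequency across time periods.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in one (sub-)corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def keyness(domain_tokens, general_tokens, smoothing=1e-6):
    """Rank domain words by how much more frequent they are than in general text.

    The smoothing constant is an arbitrary choice that avoids division by zero
    for words that never occur in the general corpus.
    """
    domain = relative_frequencies(domain_tokens)
    general = relative_frequencies(general_tokens)
    scores = {w: f / (general.get(w, 0.0) + smoothing) for w, f in domain.items()}
    return sorted(scores, key=scores.get, reverse=True)


def trend(word, periods):
    """Relative frequency of one word in each time period (diachronic view)."""
    return {
        label: relative_frequencies(tokens).get(word, 0.0)
        for label, tokens in periods.items()
    }


# Toy data, invented for illustration only.
general = "the cat sat on the mat and the dog sat too".split()
domain = "the corpus query returns the concordance of the corpus".split()
ranked = keyness(domain, general)

periods = {
    "period_1": "we sent a letter and a fax".split(),
    "period_2": "we sent an email and another email".split(),
}
email_trend = trend("email", periods)
```

In this toy setup, words shared with the general corpus (such as "the") end up at the bottom of `ranked`, while in-domain terms rise to the top. A real corpus would need tokenization, normalization and a more robust statistic than a smoothed frequency ratio.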