Indexing and Searching Very Large Texts

Prepared by: Miloš Jakubíček

State of the Art


  1. RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace. 2000.
  2. JAKUBÍČEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme. p. 7.
  3. JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. In: PACLIC. 2010. p. 741-747.

Practical Session

  1. (optionally) log in to aurora
  2. write a program or script that finds all occurrences of a given word form in the vertical file and prints each one with a small context (at least 5 preceding and 5 following words)
  3. the script will take two arguments: path to the vertical file and word to be searched
    If you have logged in to aurora, you may use the fixed path to the vertical file as
    without the need to copy it.
  4. submit the script into the IS vault
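
One possible shape for such a script is sketched below in Python. It is not a reference solution, and it rests on assumptions that the task text does not state: that the vertical file holds one token per line with tab-separated attributes and the word form in the first column, and that structure tags such as <s> or <doc ...> stand on their own lines. The names read_tokens, concordance and main are made up for this illustration. The file is read in a single pass, so memory use stays small even for a very large text.

```python
import sys
from collections import deque

CONTEXT = 5  # words of context on each side (the task asks for at least 5)


def read_tokens(lines):
    """Yield word forms from vertical-format lines, skipping structure tags.

    Assumption: the word form is the first tab-separated column, and a line
    that starts with "<" and ends with ">" is a structure tag, not a token.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        yield line.split("\t", 1)[0]


def concordance(tokens, query, context=CONTEXT):
    """Yield (left, keyword, right) for each occurrence of query, in one pass."""
    left = deque(maxlen=context)  # the most recent `context` tokens
    pending = deque()             # hits still waiting for their right context
    for tok in tokens:
        for hit in pending:
            hit[2].append(tok)
        if tok == query:
            pending.append((list(left), tok, []))
        left.append(tok)
        # The oldest hit always has the longest right context, so check it first.
        while pending and len(pending[0][2]) >= context:
            yield pending.popleft()
    # Hits near the end of the file get whatever right context is available.
    yield from pending


def main(path, query):
    # errors="replace" keeps the scan going if the file has stray bad bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        for left, kw, right in concordance(read_tokens(f), query):
            print(" ".join(left), f"<{kw}>", " ".join(right))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
    else:
        print(f"usage: {sys.argv[0]} VERTICAL_FILE WORD", file=sys.stderr)
```

Matching here is exact and case-sensitive, and the context may cross sentence boundaries because structure tags are simply skipped. Both are design choices of this sketch rather than requirements of the task.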
