
Indexing and Searching Very Large Texts

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček

State of the Art

References

  1. RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace. 2000.
  2. JAKUBÍČEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme. p. 7.
  3. JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. In: PACLIC. 2010. p. 741-747.

Practical Session

Compare searching through (A) plain text using grep, (B) an indexed corpus using Manatee, and (C) a corpus indexed in an arbitrary SQL database. Use the vertical text of the BNC available at aurora:/corpora/vert/bnc/bnc.vert.xz.

Search for the phrase "test case" and display a context of 10 words before and after each occurrence.

(A) Plain text

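Hint: use grep -C to display context. In vertical text every token is on its own line, so lines of context roughly correspond to tokens of context.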
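For comparison with grep, the same linear scan can be sketched in Python. This is only an illustrative sketch, not the expected solution. It assumes that the word form is the first tab-separated column of each token line, that lines starting with "<" are structural markup, and that the file is UTF-8 encoded. The names read_tokens, find_phrase and search_file are hypothetical helpers introduced here.

```python
import lzma
from collections import deque
from typing import Deque, Iterable, Iterator, List, Sequence, Tuple

Hit = Tuple[List[str], List[str]]  # (left context, right context)


def read_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the word form of every token line of a vertical file.

    Assumption: the word is the first tab-separated column, and lines
    starting with '<' are structures such as <s> or <doc ...>.
    A token that is literally '<' would be skipped by this simple test.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        yield line.split("\t", 1)[0]


def find_phrase(tokens: Iterable[str], phrase: Sequence[str],
                context: int = 10) -> Iterator[Hit]:
    """Stream over tokens and yield (left, right) context for each hit.

    Only a small window is kept in memory, so the whole corpus is never
    loaded at once. The phrase must contain at least one token.
    """
    target = list(phrase)
    n = len(target)
    recent: Deque[str] = deque(maxlen=context + n)  # last tokens seen
    pending: Deque[Hit] = deque()  # hits still collecting right context
    for tok in tokens:
        for _, right in pending:
            right.append(tok)
        recent.append(tok)
        # Cheap check on the last token first, full comparison second.
        if tok == target[-1] and len(recent) >= n and list(recent)[-n:] == target:
            pending.append((list(recent)[:-n], []))
        while pending and len(pending[0][1]) >= context:
            yield pending.popleft()
    # Hits near the end of the input have a shorter right context.
    yield from pending


def search_file(path: str, phrase: Sequence[str], context: int = 10) -> None:
    """Print every hit in an xz-compressed vertical file (UTF-8 assumed)."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for left, right in find_phrase(read_tokens(handle), phrase, context):
            print(" ".join(left), "[" + " ".join(phrase) + "]", " ".join(right))
```

Calling search_file("bnc.vert.xz", ["test", "case"]) on a local copy would scan the whole file once. Like grep, it has to read every token, which is the cost the indexed variants (B) and (C) are meant to avoid.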

(B) Manatee

The corpus is already indexed in Manatee; try:

time corpquery bnc '[word="test"] [word="case"]'

(C) SQL database

Use your favourite SQL database; on aurora you can use sqlite3. A hint on how to import vertical text:

https://stackoverflow.com/questions/26065872/how-to-import-a-tsv-file-with-sqlite3
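As an alternative to the sqlite3 command-line import described at the link, the import and the phrase query can be sketched with Python's standard sqlite3 module. This is a minimal sketch under assumptions, not a reference solution. The table name tokens, its columns pos and word, and the helper names token_rows, build_index and find_bigram are hypothetical. The word form is assumed to be the first tab-separated column, and lines starting with "<" are treated as structural markup.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple


def token_rows(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (position, word) for every token line of a vertical file.

    Positions are consecutive across structure boundaries, so a phrase
    spanning two sentences would also match in this simple layout.
    """
    pos = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        yield pos, line.split("\t", 1)[0]
        pos += 1


def build_index(conn: sqlite3.Connection, lines: Iterable[str]) -> None:
    """Load tokens into a table and index them by word form."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens "
        "(pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    with conn:  # one transaction for the bulk insert
        conn.executemany(
            "INSERT INTO tokens (pos, word) VALUES (?, ?)", token_rows(lines)
        )
    # Creating the index after the bulk load avoids updating it per row.
    conn.execute("CREATE INDEX IF NOT EXISTS tokens_word ON tokens (word)")
    conn.commit()


def find_bigram(conn: sqlite3.Connection, first: str, second: str,
                context: int = 10) -> List[str]:
    """Return each occurrence of 'first second' with its context as text."""
    # Self-join: token b must directly follow token a.
    hits = conn.execute(
        "SELECT a.pos FROM tokens AS a "
        "JOIN tokens AS b ON b.pos = a.pos + 1 "
        "WHERE a.word = ? AND b.word = ? ORDER BY a.pos",
        (first, second),
    ).fetchall()
    results = []
    for (pos,) in hits:
        rows = conn.execute(
            "SELECT word FROM tokens WHERE pos BETWEEN ? AND ? ORDER BY pos",
            (pos - context, pos + 1 + context),
        )
        results.append(" ".join(word for (word,) in rows))
    return results
```

With a connection such as sqlite3.connect("bnc.db"), build_index is run once and find_bigram(conn, "test", "case") can then be timed separately, which keeps the indexing cost apart from the search cost when comparing with (A) and (B).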

For (A), (B) and (C), submit the commands you used and how long the search took to evaluate.

Last modified on Nov 5, 2020, 11:33:18 AM