Version 1 (modified by Ales Horak, 3 years ago)
copied from private/NlpInPracticeCourse/CorpusIndexing

Indexing and Searching Very Large Texts

IA161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Miloš Jakubíček

State of the Art

References

RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace. 2000.
JAKUBÍČEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme. p. 7.
JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. In: PACLIC. 2010. p. 741-747.

Practical Session

Compare searching through (A) plain text using grep, (B) an indexed corpus using Manatee, and (C) a corpus indexed in an arbitrary SQL database. Use the vertical text of the BNC, available at aurora:/corpora/vert/bnc/bnc.vert.xz.

Search for the phrase "test case" and display a context of 10 words before and after each occurrence.

(A) Plain text

Hint: use grep -C to display context. In vertical text each token occupies its own line, so lines of context correspond to tokens.
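Because the two words of the phrase sit on consecutive lines of the vertical file, a single-line grep pattern cannot match the phrase directly. The following is a minimal sketch of one workaround, not the expected solution. It assumes that the first tab-separated column holds the word form and that every line starting with < is a structure tag; the helper name phrase_context is hypothetical. It pairs each token with its successor so that the phrase fits on one line, lets grep print the context, and then reduces each line back to a single token.

```shell
# Usage: phrase_context WORD1 WORD2 N < vertical_text
# Prints N tokens before and after each occurrence of "WORD1 WORD2",
# one token per line. Assumption: column 1 is the word form.
phrase_context() {
    w1=$1 w2=$2 n=$3
    # Keep only the first column and drop structure lines such as <s>.
    # (This also drops any token that itself starts with "<".)
    cut -f1 | grep -v '^<' |
    # Pair every token with its successor: "test" + "case" -> "test case".
    # The last token is printed alone so that no token is lost.
    awk 'NR > 1 { print prev " " $0 } { prev = $0 } END { if (NR) print prev }' |
    # -F -x: match the whole line as a fixed string.
    # The hit line stands for WORD1 only, so WORD2 needs one extra line after.
    grep -F -x -B "$n" -A "$((n + 1))" -- "$w1 $w2" |
    # Reduce each paired line back to its first token.
    cut -d' ' -f1
}
```

On aurora this could be run as xzcat /corpora/vert/bnc/bnc.vert.xz | phrase_context test case 10. In bash, putting the time keyword in front measures the whole pipeline. grep separates non-adjacent hits with a line containing only --.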

(B) Manatee

The corpus is already indexed in Manatee; try:

time corpquery bnc '[word="test"] [word="case"]'

(C) SQL database

Use your favourite SQL database; on aurora you can use sqlite3. For a hint on how to import vertical text, see:

https://stackoverflow.com/questions/26065872/how-to-import-a-tsv-file-with-sqlite3
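As an illustration, here is a minimal sketch of one possible SQLite setup. It assumes the sqlite3 command-line shell, a first column that holds the word form, and search words that contain no single quotes. The table name tokens, the index name tokens_word and both helper names are hypothetical. Each token is stored with its position, so the phrase becomes a self-join on adjacent positions. The import uses .mode ascii with explicit separators instead of a CSV-style import, which should keep tokens such as a bare double quote from being treated as quoting.

```shell
# Usage: vert_to_sqlite DB_FILE < vertical_text
# Builds a table tokens(pos, word) with an index on word.
vert_to_sqlite() {
    db=$1
    tsv=$(mktemp)
    # Keep word forms only, drop structure lines, number the tokens.
    cut -f1 | grep -v '^<' | awk '{ print NR "\t" $0 }' > "$tsv"
    sqlite3 "$db" <<EOF
CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT);
.mode ascii
.separator "\t" "\n"
.import $tsv tokens
CREATE INDEX tokens_word ON tokens (word);
EOF
    rm -f "$tsv"
}

# Usage: sql_phrase_context DB_FILE WORD1 WORD2 N
# Prints "hit_position|word" rows: N tokens around each phrase occurrence.
# WORD1 and WORD2 are pasted into the SQL text, so they must not contain '.
sql_phrase_context() {
    sqlite3 "$1" "
SELECT h.pos, t.word
FROM (SELECT a.pos
      FROM tokens a JOIN tokens b ON b.pos = a.pos + 1
      WHERE a.word = '$2' AND b.word = '$3') AS h
JOIN tokens t ON t.pos BETWEEN h.pos - $4 AND h.pos + 1 + $4
ORDER BY h.pos, t.pos;"
}
```

A possible workflow is to run xzcat /corpora/vert/bnc/bnc.vert.xz | vert_to_sqlite bnc.db once and then time sql_phrase_context bnc.db test case 10. Only the second step is the search; the first step corresponds to indexing and is worth reporting separately.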

For (A), (B) and (C), submit the commands you used and how long the search took to evaluate.
