Indexing and Searching Very Large Texts
IA161 NLP in Practice Course, Course Guarantee: Aleš Horák
Prepared by: Miloš Jakubíček
State of the Art
References
- RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace [Corpus Managers and Their Effective Implementation]. 2000.
- JAKUBÍČEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme. p. 7.
- JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. In: PACLIC. 2010. p. 741-747.
Practical Session
Compare search through (A) plain text using grep, (B) an indexed corpus using Manatee, and (C) a corpus indexed in an arbitrary SQL database. Use the vertical text of the BNC, available at aurora:/corpora/vert/bnc/bnc.vert.xz, or download the compressed file via this URL (224 MB) or the uncompressed file here (2 GB).
Search for the phrase "test case" and display a context of 10 words before and after each occurrence of the search phrase.
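(A) plain text
Hint: use grep -C to display context. Note that grep -C counts lines; in vertical text each token is on its own line, so 10 lines of context correspond to 10 tokens.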
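For comparison with grep, the same linear scan can be sketched in Python. This is a minimal illustration, not a reference solution: the helper names tokens and find_phrase are hypothetical, and the sketch assumes that the word form is the first tab-separated column and that structure lines (such as sentence or document tags) are enclosed in angle brackets and contain no tabs.

```python
from collections import deque
from itertools import chain


def tokens(lines):
    """Yield the word form of every token line, skipping structure lines.

    Assumption: the word is the first tab-separated column and structure
    lines look like <s>, </s> or <doc ...> with no tab in them.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue
        yield line.split("\t", 1)[0]


def find_phrase(lines, phrase, context=10):
    """Yield (left, match, right) for each occurrence of `phrase`.

    The stream is read once; a bounded buffer holds just enough tokens
    to show `context` tokens on each side of a match.
    """
    phrase = list(phrase)
    n = len(phrase)
    buf = deque(maxlen=2 * context + n)
    # Padding flushes matches that lie closer than `context` to the end.
    padding = [None] * context
    for tok in chain(tokens(lines), padding):
        buf.append(tok)
        if len(buf) < context + n:
            continue
        start = len(buf) - context - n
        if all(buf[start + i] == phrase[i] for i in range(n)):
            window = list(buf)
            left = window[:start]
            right = [t for t in window[start + n:] if t is not None]
            yield left, phrase, right


# Possible usage, assuming the file is UTF-8 encoded:
# import lzma
# with lzma.open("bnc.vert.xz", "rt", encoding="utf-8") as f:
#     for left, match, right in find_phrase(f, ["test", "case"]):
#         print(" ".join(left), "|", " ".join(match), "|", " ".join(right))
```

Like grep, this reads the whole file for every query, so its running time grows with the corpus size; that is the baseline the indexed variants in (B) and (C) are meant to beat.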
(B)
Use a Google Colab environment with pre-built Manatee:
https://colab.research.google.com/drive/1mMKSZm__Cw7f2yuKhUY1MGseE5iCcK8f?usp=sharing
Compile the corpus using the configuration file according to the instructions available at
https://www.sketchengine.eu/documentation/local-installations/compiling-corpus/
For easy access to the compiled Manatee commands, adjust the PATH shell variable in the notebook via
PATH=%env PATH
%env PATH=/content/manatee-open-2.225.8/api:/content/manatee-open-2.225.8/src:$PATH
Adjust PATH and VERTICAL in the configuration file to the actual path and file name, e.g.
PATH .
VERTICAL bnc.vert
and refer to the corpus with its full path, e.g.
!compilecorp ./bnc
The corpus compilation process takes about 45 minutes.
After successful compilation, try (again with full path):
time corpquery ./bnc '[word="test"] [word="case"]'
See https://www.sketchengine.eu/documentation/corpus-querying/
(C)
Use your favourite SQL database, e.g. sqlite3.
Hint on how to import vertical text:
https://stackoverflow.com/questions/26065872/how-to-import-a-tsv-file-with-sqlite3
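As an alternative to the command-line import, the following is a minimal sketch using Python's built-in sqlite3 module. The table name token, the index name token_word, and the functions build_index, find_bigram and context are hypothetical choices for illustration. The sketch assumes that the word form is the first tab-separated column and that structure lines are enclosed in angle brackets; only the word attribute is stored.

```python
import sqlite3


def build_index(conn, lines):
    """Load token lines into a table keyed by corpus position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token (pos INTEGER PRIMARY KEY, word TEXT)"
    )

    def rows():
        pos = 0
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            # Assumption: structure lines look like <s> or <doc ...>.
            if line.startswith("<") and line.endswith(">") and "\t" not in line:
                continue
            yield pos, line.split("\t", 1)[0]
            pos += 1

    conn.executemany("INSERT INTO token (pos, word) VALUES (?, ?)", rows())
    # The index on word is what makes lookups faster than a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS token_word ON token (word)")
    conn.commit()


def find_bigram(conn, first, second):
    """Return positions where `first` is immediately followed by `second`."""
    sql = (
        "SELECT a.pos FROM token AS a "
        "JOIN token AS b ON b.pos = a.pos + 1 "
        "WHERE a.word = ? AND b.word = ? ORDER BY a.pos"
    )
    return [row[0] for row in conn.execute(sql, (first, second))]


def context(conn, pos, width=10, length=2):
    """Return the match at `pos` with `width` tokens on each side."""
    sql = "SELECT word FROM token WHERE pos BETWEEN ? AND ? ORDER BY pos"
    lo, hi = pos - width, pos + length - 1 + width
    return [row[0] for row in conn.execute(sql, (lo, hi))]


# Possible usage (file name and encoding are assumptions):
# conn = sqlite3.connect("bnc.sqlite")
# with open("bnc.vert", encoding="utf-8") as f:
#     build_index(conn, f)
# for pos in find_bigram(conn, "test", "case"):
#     print(" ".join(context(conn, pos)))
```

Storing an explicit position per token lets the phrase query be written as a self-join on adjacent positions; when reporting timings, it may be worth measuring the import and the query separately.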
(D) Bonus task: use SQLite3 with the FTS extension and a custom FTS tokenizer to index all attributes (word, lemma, tag).
For (A), (B) and (C), submit the commands you used and how long each search took to evaluate.
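One possible way to collect comparable timings from within the notebook is sketched below. The helper time_command is a hypothetical name; it measures wall-clock time only and discards the command's output so that printing the results does not dominate the measurement.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return (exit code, seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # discard output; only time is of interest
        stderr=subprocess.DEVNULL,
        check=False,
    )
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed


if __name__ == "__main__":
    # Self-check with a trivial command; replace it with the real query,
    # e.g. ["corpquery", "./bnc", '[word="test"] [word="case"]'].
    code, seconds = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, {seconds:.3f} s")
```

Running each search a few times and reporting the spread gives a fairer picture, since the first run typically includes the cost of reading the data from disk.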