= Indexing and Searching Very Large Texts =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák

Prepared by: Miloš Jakubíček

== State of the Art ==

=== References ===

 1. RYCHLÝ, Pavel, et al. Korpusové manažery a jejich efektivní implementace. 2000.
 1. JAKUBÍČEK, Miloš; KILGARRIFF, Adam; RYCHLÝ, Pavel. Effective Corpus Virtualization. In: Challenges in the Management of Large Corpora (CMLC-2) Workshop Programme. p. 7.
 1. JAKUBÍČEK, Miloš, et al. Fast Syntactic Searching in Very Large Corpora for Many Languages. In: PACLIC. 2010. p. 741-747.

== Practical Session ==

Compare searching through
 * (A) plain text using grep,
 * (B) an indexed corpus using Manatee,
 * (C) a corpus indexed in an arbitrary SQL database.

Use the vertical text of the BNC available at aurora:/corpora/vert/bnc/bnc.vert.xz, or download the compressed file via [htdocs:bigdata/bnc.vert.xz this URL] (224 MB) or the uncompressed file [htdocs:bigdata/bnc.vert here] (2 GB).

Search for the phrase "test case" and display a context of 10 words before and after each occurrence of the search phrase.

(A) Plain text. Hint: use `grep -C` to display context.

(B) Use a Google Colab environment with pre-built Manatee: https://colab.research.google.com/drive/1mMKSZm__Cw7f2yuKhUY1MGseE5iCcK8f?usp=sharing

Compile the corpus using the [raw-attachment:bnc configuration file] according to the instructions available at https://www.sketchengine.eu/documentation/local-installations/compiling-corpus/

For easy access to the compiled Manatee commands, adjust the `PATH` shell variable via
{{{
PATH=%env PATH
%env PATH=/content/manatee-open-2.225.8/api:/content/manatee-open-2.225.8/src:$PATH
}}}

Adjust `PATH` and `VERTICAL` in the configuration file to the actual path and file name, e.g.
{{{
PATH .
VERTICAL bnc.vert
}}}
and refer to the corpus by its path, e.g.
{{{
! compilecorp ./bnc
}}}

The corpus compilation process takes about 45 minutes.
After successful compilation, try (again referring to the corpus by its path):
{{{
time corpquery ./bnc '[word="test"] [word="case"]'
}}}
See https://www.sketchengine.eu/documentation/corpus-querying/ for details on corpus querying.

(C) Use your favourite SQL database, e.g. `sqlite3`. For a hint on how to import vertical text, see https://stackoverflow.com/questions/26065872/how-to-import-a-tsv-file-with-sqlite3

(D) Bonus task: use SQLite3 with the FTS extension and a custom FTS tokenizer to index all attributes (word, lemma, tag).

For (A), (B) and (C), submit the commands you used and how long each search took to evaluate.
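
As a starting point for task (C), the following is a minimal sketch of one possible approach using Python's standard `sqlite3` module. It is not a reference solution. The table name `token`, the index name `token_word` and the helper names `read_tokens`, `build_index` and `find_phrase` are made up for this illustration. The sketch assumes that the first tab-separated column of the vertical file is the word form and that every line starting with `<` is a structure tag; check both assumptions against the actual BNC vertical before relying on the results.

```python
import sqlite3
import time


def read_tokens(lines):
    """Yield (position, word) pairs from vertical text lines."""
    pos = 0
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: empty lines and lines starting with "<" are not tokens
        # (structure tags such as <doc>, <s>, </s>).
        if not line or line.startswith("<"):
            continue
        # Assumption: the word form is the first tab-separated column.
        yield pos, line.split("\t")[0]
        pos += 1


def build_index(lines, db_path=":memory:"):
    """Store one row per token and index the word column."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE token (pos INTEGER PRIMARY KEY, word TEXT)")
    conn.executemany("INSERT INTO token VALUES (?, ?)", read_tokens(lines))
    conn.execute("CREATE INDEX token_word ON token (word)")
    conn.commit()
    return conn


def find_phrase(conn, first, second, context=10):
    """Return each occurrence of 'first second' with surrounding context."""
    hits = conn.execute(
        "SELECT a.pos FROM token a JOIN token b ON b.pos = a.pos + 1 "
        "WHERE a.word = ? AND b.word = ? ORDER BY a.pos",
        (first, second),
    ).fetchall()
    results = []
    for (pos,) in hits:
        # The phrase occupies pos and pos + 1; add `context` tokens per side.
        words = conn.execute(
            "SELECT word FROM token WHERE pos BETWEEN ? AND ? ORDER BY pos",
            (pos - context, pos + 1 + context),
        )
        results.append(" ".join(word for (word,) in words))
    return results


if __name__ == "__main__":
    # Tiny made-up sample in vertical format, used instead of the real corpus.
    sample = [
        "<doc>", "<s>",
        "This\tDT\tthis", "is\tVBZ\tbe", "a\tDT\ta",
        "test\tNN\ttest", "case\tNN\tcase", ".\tSENT\t.",
        "</s>", "</doc>",
    ]
    conn = build_index(sample)
    start = time.perf_counter()
    hits = find_phrase(conn, "test", "case", context=2)
    elapsed = time.perf_counter() - start
    for hit in hits:
        print(hit)
    print(f"search took {elapsed:.6f} s")
```

To run this on the real data, pass an open file object and a database file name instead of the sample, e.g. `build_index(open("bnc.vert", encoding="utf-8"), "bnc.db")`; this assumes `bnc.db` does not exist yet. Measure the import and the search separately, since only the search time is comparable with (A) and (B).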