13 | | * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files: |
14 | | 1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`).[[BR]]Furthermore, the sentences may originate both from book OCR texts and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`). |
| 13 | * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files: |
| 14 | 1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`). |
15 | 15 | 1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences in the file to cross page boundaries (`all`) or not (`non-crossing`). |
16 | 16 | 1. When sentences originate from book OCR texts, they may come from book pages of different relevance.[[BR]]We either use sentences from all book pages (`all`) or just those considered relevant by expert annotators (`only-relevant`). |
17 | 17 | 1. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.[[BR]]Therefore, we also provide NER tags completed by language models (`automatically_tagged`) and human annotators (`tagged`). |