Changes between version 10 and version 11 of NerDataset
- Timestamp:
- 30 November 2022, 14:28:44 (3 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
NerDataset
v10 v11
5 5
6 6 == Contents ==
- 7 The dataset is stored in archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with following structure:
+ 7 The dataset is stored in archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with following contents:
8 8
9 9 * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
… …
11 11 1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
12 12 1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
- 13 * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files:
+ 13 * 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and contain extra details^1^ missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files:
14 14 1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
15 15 1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
… …
17 17 1. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.[[BR]]Therefore, we also provide NER tags completed by language models (`automatically_tagged`) and human annotators (`tagged`).
18 18 1. We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`).[[BR]]For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).
+ 19
+ 20 ^1^The extra details include nested entities such as locations in person names (Blažek z Kralup) and people in location names (Kostel sv. Martina).[[BR]]Use the `search.TaggedSentence.load()` function from [https://gitlab.fi.muni.cz/nlp/ahisto-modules/named-entity-search the ahisto_named_entity_search software tool] to load the `.docx` files together with the extra details.
19 21
20 22 == Citing ==
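
The paired `.sentences.txt` and `.ner_tags.txt` files described in the diff above could be read side by side roughly as follows. This is only a minimal sketch: the page does not specify the file format, so the code assumes that both files are line-aligned (line i of the tags file belongs to line i of the sentences file) and that tokens and tags are separated by whitespace. The function name `read_tagged_sentences` is hypothetical and is not part of the ahisto_named_entity_search tool; for `.docx` files and nested entities, the `search.TaggedSentence.load()` function mentioned in the footnote remains the documented route.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_tagged_sentences(
    sentences_path: Path, tags_path: Path
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, tags) pairs from a sentences file and a tags file.

    ASSUMPTION (not confirmed by the wiki page): the two files are
    line-aligned, and each line holds whitespace-separated tokens or tags.
    """
    with open(sentences_path, encoding="utf-8") as sentences, \
            open(tags_path, encoding="utf-8") as tags:
        # zip() stops at the shorter file; any extra lines are ignored.
        for line_number, (sentence, tag_line) in enumerate(
            zip(sentences, tags), start=1
        ):
            tokens = sentence.split()
            labels = tag_line.split()
            # Fail early if the assumed one-tag-per-token layout does not hold.
            if len(tokens) != len(labels):
                raise ValueError(
                    f"line {line_number}: {len(tokens)} tokens "
                    f"but {len(labels)} tags"
                )
            yield tokens, labels
```

To use it, pass the paths of one `.sentences.txt` / `.ner_tags.txt` pair extracted from the archive and iterate over the result. If the length check raises an error on real data, the alignment assumption above does not match the actual format and the parsing needs to be adjusted.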