Changes between version 27 and version 28 of NerDataset

Timestamp: 15 Jan 2023, 16:58:51 (2 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

NerDataset

-                      v27
+                      v28
 This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
-You can [http://hdl.handle.net/11234/1-5024 download the dataset] from the LINDAT/CLARIAH-CZ repository.
+You can [http://hdl.handle.net/11234/1-5024 download the dataset] from the LINDAT/CLARIAH-CZ repository.
 == Contents ==
 The dataset is structured as follows:
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
 . The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
 . The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
 …
 ||=dataset_mlm_non-crossing_only-relevant_validation =|| 549.4 kB|| 2,489|| 81,293|| 22,090||
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-small.zip?sequence=2&isAllowed=y named-entity-recognition-annotations-small.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We used them to produce our intermediate language models.[[BR]]These are the “small” sentences and NER tags that we used for the supervised training, validation, and testing of intermediate language models.[[BR]]Here are the five variables that we used to produce the different files:
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-small.zip?sequence=2&isAllowed=y named-entity-recognition-annotations-small.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `.ner_tags.txt`, and in one case also `.docx`.^1^[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We used them to produce our intermediate language models.[[BR]]These are the “small” sentences and NER tags that we used for the supervised training, validation, and testing of intermediate language models.[[BR]]Here are the five variables that we used to produce the different files:
 . The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
 . When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
 …
 ||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4,507|| 110|| 55|| 55|| 2,449||
- * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-large.zip?sequence=3&isAllowed=y named-entity-recognition-annotations-large.zip] (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.[[BR]]Here are the four variables that we used to produce the different files:
+ * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-large.zip?sequence=3&isAllowed=y named-entity-recognition-annotations-large.zip] (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.[[BR]]Here are the four variables that we used to produce the different files:
 . The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
 . The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
 …
 ||=dataset_mlm_all_only-relevant_validation_automatically_tagged_004 =|| 1.0 MB|| 2,786|| 107,609|| 8,937|| 4,125|| 4,812|| 27,019||
 ||=dataset_mlm_all_only-relevant_validation_automatically_tagged_007 =|| 989.2 kB|| 2,786|| 107,609|| 6,581|| 2,980|| 3,601|| 27,019||
-||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 8,0619|| 7,290|| 3,380|| 3,910|| 22,087||
-||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 8,0619|| 5,281|| 2,404|| 2,877|| 22,087||
+||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 80,619|| 7,290|| 3,380|| 3,910|| 22,087||
+||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 80,619|| 5,281|| 2,404|| 2,877|| 22,087||
 == Citing ==