Changes between version 24 and version 25 of NerDataset

Timestamp: 5 January 2023, 16:43:48 (3 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

Legend:

Unmodified
Added
Removed
Modified

NerDataset

-                      v24
+                      v25
 This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
-You can [https://hdl.handle.net/11234/1-4936 download the dataset] in the LINDAT/CLARIAH-CZ repository.
+You can download the dataset from the LINDAT/CLARIAH-CZ repository:
+ * [https://hdl.handle.net/11234/1-4936 The language modeling corpus and the “small” NER annotations that we used to train intermediate language models]
+ * The “large” NER annotations that we produced with our language models
 == Contents ==
 …
 ||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4,507|| 110|| 55|| 55|| 2,449||
+ * The archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip] (1.3 GB) contains 16 pairs of files named `*.sentences.txt` and `*.ner_tags.txt`. These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]We used the following four variables to produce the different files:
+   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
+   1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
+   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
+   1. We use an ensemble of a baseline model and weak fourth-generation NER models (`004`) or the final seventh-generation NER model (`007`).
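The four variables above combine into 2 × 2 × 2 × 2 = 16 file pairs. As a minimal sketch, assuming that each pair shares the stem shown in the first column of Table 3 and differs only in the `.sentences.txt` and `.ner_tags.txt` suffixes, the file names could be enumerated as follows. The helper names `large_annotation_stems` and `file_pair` are hypothetical and are not part of the dataset.

```python
from itertools import product

# Values of the four variables, as they appear in the file names of Table 3.
CROSSING = ("all", "non-crossing")    # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")  # all pages, or only human-selected ones
SPLIT = ("training", "validation")    # roughly 90% / 10%
GENERATION = ("004", "007")           # NER model(s) that produced the tags


def large_annotation_stems():
    """Return the 16 file name stems implied by the four variables."""
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}_automatically_tagged_{generation}"
        for crossing, relevance, split, generation
        in product(CROSSING, RELEVANCE, SPLIT, GENERATION)
    ]


def file_pair(stem):
    """Return the (sentences, NER tags) file names for one stem.

    Assumption: the suffix is appended directly to the stem.
    """
    return f"{stem}.sentences.txt", f"{stem}.ner_tags.txt"


stems = large_annotation_stems()
for stem in stems:
    print(*file_pair(stem))
```

Enumerating the names this way makes it easy to check that an unpacked archive contains every expected pair before training starts.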
+'''Table 3:''' Dataset statistics from the archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip], ordered by the number of tokens in descending order.
+|| ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
+||=dataset_mlm_all_all_training_automatically_tagged_007 =|| 860.0 MB|| 3,227,624|| 95,054,481|| 6,340,811|| 3,794,991|| 2,545,820|| 6,562,841||
+||=dataset_mlm_all_all_training_automatically_tagged_004 =|| 882.6 MB|| 3,227,624|| 95,054,481|| 9,727,269|| 5,429,801|| 4,297,468|| 6,562,841||
+||=dataset_mlm_non-crossing_all_training_automatically_tagged_004 =|| 736.0 MB|| 3,009,481|| 79,003,252|| 8,447,053|| 4,721,604|| 3,725,449|| 5,660,658||
+||=dataset_mlm_non-crossing_all_training_automatically_tagged_007 =|| 716.0 MB|| 3,009,481|| 79,003,252|| 5,441,290|| 3,264,675|| 2,176,615|| 5,660,658||
+||=dataset_mlm_all_all_validation_automatically_tagged_004 =|| 114.0 MB|| 402,179|| 12,240,756|| 1,201,467|| 659,139|| 542,328|| 1,319,365||
+||=dataset_mlm_all_all_validation_automatically_tagged_007 =|| 111.2 MB|| 402,179|| 12,240,756|| 781,509|| 462,102|| 319,407|| 1,319,365||
+||=dataset_mlm_non-crossing_all_validation_automatically_tagged_004 =|| 94.0 MB|| 372,880|| 10,061,113|| 1,035,283|| 571,082|| 464,201|| 1,141,033||
+||=dataset_mlm_non-crossing_all_validation_automatically_tagged_007 =|| 91.6 MB|| 372,880|| 10,061,113|| 663,793|| 395,771|| 268,022|| 1,141,033||
+||=dataset_mlm_all_only-relevant_training_automatically_tagged_004 =|| 11.5 MB|| 47,835|| 1,277,430|| 133,101|| 64,711|| 68,390|| 183,563||
+||=dataset_mlm_all_only-relevant_training_automatically_tagged_007 =|| 11.3 MB|| 47,835|| 1,277,430|| 99,103|| 50,544|| 48,559|| 183,563||
+||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 =|| 9.6 MB|| 44,155|| 1,066,545|| 116,176|| 55,996|| 60,180|| 158,622||
+||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 =|| 9.4 MB|| 44,155|| 1,066,545|| 85,675|| 43,360|| 42,315|| 158,622||
+||=dataset_mlm_all_only-relevant_validation_automatically_tagged_004 =|| 1.0 MB|| 2,786|| 107,609|| 8,937|| 4,125|| 4,812|| 27,019||
+||=dataset_mlm_all_only-relevant_validation_automatically_tagged_007 =|| 989.2 kB|| 2,786|| 107,609|| 6,581|| 2,980|| 3,601|| 27,019||
+||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 80,619|| 7,290|| 3,380|| 3,910|| 22,087||
+||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 80,619|| 5,281|| 2,404|| 2,877|| 22,087||
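The sentence, token, and B-* columns of Table 3 could be recomputed along the following lines. This is only a sketch under stated assumptions: that each line of a `*.sentences.txt` file holds one sentence of whitespace-separated tokens, that the matching line of the `*.ner_tags.txt` file holds one tag per token, and that entity continuations use `I-*` tags and non-entities use `O`. None of these format details are confirmed above, and the function `count_begin_tags` and the example lines are made up for illustration.

```python
from collections import Counter


def count_begin_tags(sentence_lines, tag_lines):
    """Count sentences, tokens, and B-* tags in line-aligned inputs.

    Assumption: tokens and tags are whitespace-separated and aligned.
    """
    stats = Counter()
    for sentence, tags in zip(sentence_lines, tag_lines):
        tokens, labels = sentence.split(), tags.split()
        if len(tokens) != len(labels):
            raise ValueError("sentence and tag line differ in length")
        stats["sentences"] += 1
        stats["tokens"] += len(tokens)
        for label in labels:
            if label.startswith("B-"):
                stats["B-*"] += 1   # all entity beginnings
                stats[label] += 1   # per-type count, e.g. B-PER or B-LOC
    return stats


# Made-up example lines, not taken from the dataset.
sentences = ["Jan Hus kázal v Praze .", "Král psal list ."]
tags = ["B-PER I-PER O O B-LOC O", "O O O O"]

stats = count_begin_tags(sentences, tags)
print(dict(stats))
```

For the real files, the two arguments could be open file objects, so that even the largest pairs are processed line by line without loading them into memory.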
 == Citing ==
 If you use our dataset in your work, please cite the following article: