Changes between version 24 and version 25 of NerDataset


Timestamp:
5 Jan 2023 16:43:48 (19 months ago)
Author:
xnovot32@fi.muni.cz
Comment:

--

  • NerDataset

    v24 v25  
    22This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
    33
    4 You can [https://hdl.handle.net/11234/1-4936 download the dataset] in the LINDAT/CLARIAH-CZ repository.
     4You can download the dataset from the LINDAT/CLARIAH-CZ repository:
     5
     6 * [https://hdl.handle.net/11234/1-4936 The language modeling corpus and the “small” NER annotations that we used to train intermediate language models]
     7 * The “large” NER annotations that we produced with our language models
    58
    69== Contents ==
     
    119122||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4,507|| 110|| 55|| 55|| 2,449||
    120123
     124 * The archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip] (1.3 GB) contains 16 pairs of files named `*.sentences.txt` and `*.ner_tags.txt`. These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The files differ in the following four variables:
     125   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, the text around page boundaries contains noise such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
     126   1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
     127   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
     128   1. The NER tags were produced either by an ensemble of a baseline model and weak fourth-generation NER models (`004`) or by the final seventh-generation NER model (`007`).
     129
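The four variables above combine into 2 × 2 × 2 × 2 = 16 base names, which match the row labels of Table 3. The following minimal sketch enumerates them and reads one pair of files. The helper names `dataset_basenames` and `read_tagged_sentences` are hypothetical, and the file layout (the base name followed directly by `.sentences.txt` or `.ner_tags.txt`, UTF-8 text, one sentence per line, whitespace-separated tokens and tags in matching order) is an assumption that should be checked against the archive.

```python
from itertools import product

# Labels of the four variables, as they appear in the file names.
CROSSING = ("all", "non-crossing")
RELEVANCE = ("all", "only-relevant")
SPLIT = ("training", "validation")
GENERATION = ("004", "007")


def dataset_basenames():
    """Return the 16 base names formed by combining the four variables."""
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}"
        f"_automatically_tagged_{generation}"
        for crossing, relevance, split, generation
        in product(CROSSING, RELEVANCE, SPLIT, GENERATION)
    ]


def read_tagged_sentences(sentences_path, tags_path):
    """Yield (tokens, tags) pairs from a sentences file and its tags file.

    Assumption: line N of the tags file holds one tag per token of
    line N of the sentences file, separated by whitespace.
    """
    with open(sentences_path, encoding="utf-8") as sentences, \
            open(tags_path, encoding="utf-8") as tags:
        for sentence_line, tag_line in zip(sentences, tags):
            tokens, labels = sentence_line.split(), tag_line.split()
            if len(tokens) != len(labels):
                raise ValueError("token/tag count mismatch")
            yield tokens, labels
```

For example, `read_tagged_sentences(name + ".sentences.txt", name + ".ner_tags.txt")` could be called for each `name` in `dataset_basenames()`. The length check makes a wrong guess about the file layout fail early instead of silently misaligning tokens and tags.
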
     130'''Table 3:''' Dataset statistics from the archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip], in descending order of the number of sentences.
     131
     132|| ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
     133||=dataset_mlm_all_all_training_automatically_tagged_007 =|| 860.0 MB|| 3,227,624|| 95,054,481|| 6,340,811|| 3,794,991|| 2,545,820|| 6,562,841||
     134||=dataset_mlm_all_all_training_automatically_tagged_004 =|| 882.6 MB|| 3,227,624|| 95,054,481|| 9,727,269|| 5,429,801|| 4,297,468|| 6,562,841||
     135||=dataset_mlm_non-crossing_all_training_automatically_tagged_004 =|| 736.0 MB|| 3,009,481|| 79,003,252|| 8,447,053|| 4,721,604|| 3,725,449|| 5,660,658||
     136||=dataset_mlm_non-crossing_all_training_automatically_tagged_007 =|| 716.0 MB|| 3,009,481|| 79,003,252|| 5,441,290|| 3,264,675|| 2,176,615|| 5,660,658||
     137||=dataset_mlm_all_all_validation_automatically_tagged_004 =|| 114.0 MB|| 402,179|| 12,240,756|| 1,201,467|| 659,139|| 542,328|| 1,319,365||
     138||=dataset_mlm_all_all_validation_automatically_tagged_007 =|| 111.2 MB|| 402,179|| 12,240,756|| 781,509|| 462,102|| 319,407|| 1,319,365||
     139||=dataset_mlm_non-crossing_all_validation_automatically_tagged_004 =|| 94.0 MB|| 372,880|| 10,061,113|| 1,035,283|| 571,082|| 464,201|| 1,141,033||
     140||=dataset_mlm_non-crossing_all_validation_automatically_tagged_007 =|| 91.6 MB|| 372,880|| 10,061,113|| 663,793|| 395,771|| 268,022|| 1,141,033||
     141||=dataset_mlm_all_only-relevant_training_automatically_tagged_004 =|| 11.5 MB|| 47,835|| 1,277,430|| 133,101|| 64,711|| 68,390|| 183,563||
     142||=dataset_mlm_all_only-relevant_training_automatically_tagged_007 =|| 11.3 MB|| 47,835|| 1,277,430|| 99,103|| 50,544|| 48,559|| 183,563||
     143||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 =|| 9.6 MB|| 44,155|| 1,066,545|| 116,176|| 55,996|| 60,180|| 158,622||
     144||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 =|| 9.4 MB|| 44,155|| 1,066,545|| 85,675|| 43,360|| 42,315|| 158,622||
     145||=dataset_mlm_all_only-relevant_validation_automatically_tagged_004 =|| 1.0 MB|| 2,786|| 107,609|| 8,937|| 4,125|| 4,812|| 27,019||
     146||=dataset_mlm_all_only-relevant_validation_automatically_tagged_007 =|| 989.2 kB|| 2,786|| 107,609|| 6,581|| 2,980|| 3,601|| 27,019||
     147||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 80,619|| 7,290|| 3,380|| 3,910|| 22,087||
     148||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 80,619|| 5,281|| 2,404|| 2,877|| 22,087||
     149
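As a small worked example of reading Table 3, the sketch below compares how densely the `004` ensemble and the `007` model tag the same sentences. The numbers are copied from the two `dataset_mlm_all_all_training` rows. The helper `tag_density` is a hypothetical name, and reading the ratio as "share of tokens that start an entity" assumes that each B-* tag marks the first token of one entity.

```python
# Token count shared by both `dataset_mlm_all_all_training` rows of Table 3.
TOKENS = 95_054_481

# (B-PER, B-LOC, B-*) counts per model generation, copied from Table 3.
COUNTS = {
    "004": (5_429_801, 4_297_468, 9_727_269),
    "007": (3_794_991, 2_545_820, 6_340_811),
}


def tag_density(b_tags, tokens):
    """Return the fraction of tokens that carry a B-* tag."""
    return b_tags / tokens


densities = {}
for generation, (b_per, b_loc, b_all) in COUNTS.items():
    # In these rows, the B-* column is the sum of B-PER and B-LOC.
    assert b_per + b_loc == b_all
    densities[generation] = tag_density(b_all, TOKENS)

for generation, density in sorted(densities.items()):
    print(f"{generation}: {density:.2%} of tokens carry a B-* tag")
```

On these two rows, the `004` ensemble assigns B-* tags to roughly one token in ten, whereas the `007` model tags noticeably fewer tokens in the same text.
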
    121150== Citing ==
    122151If you use our dataset in your work, please cite the following article: