Version 22 (modified by xnovot32@fi.muni.cz, 3 years ago)

--

A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era.
The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).

You can download the dataset from the LINDAT/CLARIAH-CZ repository.

Contents

The dataset is structured as follows:

  • The archive language-modeling-corpus.zip (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.
    We used the following three variables to produce the different files:
    1. The sentences are extracted from book OCR texts and may therefore span several pages.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).

Table 1: Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.

| file | size | # sentences | # tokens | # types |
|---|---|---|---|---|
| dataset_mlm_all_all_training | 630.7 MB | 3228077 | 96556612 | 6198957 |
| dataset_mlm_non-crossing_all_training | 524.1 MB | 3009931 | 80220907 | 5362515 |
| dataset_mlm_all_all_validation | 81.8 MB | 402184 | 12374044 | 1273737 |
| dataset_mlm_non-crossing_all_validation | 67.3 MB | 372885 | 10157799 | 1105583 |
| dataset_mlm_all_only-relevant_training | 8.1 MB | 47958 | 1286573 | 181845 |
| dataset_mlm_non-crossing_only-relevant_training | 6.7 MB | 44278 | 1074734 | 157354 |
| dataset_mlm_all_only-relevant_validation | 736.7 kB | 2791 | 108364 | 26986 |
| dataset_mlm_non-crossing_only-relevant_validation | 549.4 kB | 2489 | 81293 | 22090 |

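
The eight file names in Table 1 follow from the three two-valued variables described above (2 × 2 × 2 = 8). As a minimal sketch, the names can be enumerated in Python. The `dataset_mlm_` prefix and the order of the name parts are read off Table 1; the helper name `language_modeling_names` is hypothetical, and no file extension is assumed because none is stated here.

```python
from itertools import product

# Values of the three variables, as they appear in the file names of Table 1.
CROSSING = ("all", "non-crossing")      # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")    # all pages or only the relevant ones
SPLIT = ("training", "validation")      # roughly 90% / 10%


def language_modeling_names():
    """Return the base names of the language-modeling files (no extension)."""
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}"
        for crossing, relevance, split in product(CROSSING, RELEVANCE, SPLIT)
    ]


names = language_modeling_names()
print(len(names))  # 8
for name in names:
    print(name)
```
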
  • The archive named-entity-recognition-annotations.zip (978.29 MB) contains 82 tuples of files named *.sentences.txt, *.ner_tags.txt, and in one case also *.docx (see footnote 1).
    These files contain sentences and NER tags for supervised training, validation, and testing of language models.
    Here are the five variables that we used to produce the different files:
    1. The sentences may be retrieved from book OCR texts using information retrieval techniques (fuzzy-regex or manatee).
      The sentences may also originate from regests (regests) or from both books and regests (fuzzy-regex+regests and manatee+regests).
    2. When sentences originate from book OCR texts, they may span several pages of a book.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    3. When sentences originate from book OCR texts, they may come from book pages of different relevance.
      We either use sentences from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    4. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.
      Therefore, we also provide NER tags completed by language models (automatically_tagged) and human annotators (tagged).
    5. We split the sentences roughly into 80% for training (training), 10% for validation (validation), and 10% for testing (testing).
      For repeated testing, we subdivide the testing split (testing_001-400 and testing_401-500).
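
The five variables are encoded in the file names listed in Table 2. The following sketch splits a base name into its parts with a regular expression. The pattern is inferred from the names in Table 2 rather than from an official specification, and `parse_ner_name` is a hypothetical helper. Names built only from regests carry no page-boundary or relevance part, so those fields come back as `None`.

```python
import re

# Pattern inferred from the file names in Table 2 (an assumption, not a spec).
NER_NAME = re.compile(
    r"dataset_ner_"
    r"(?P<source>fuzzy-regex\+regests|manatee\+regests|fuzzy-regex|manatee|regests)"
    r"(?:_(?P<crossing>all|non-crossing)_(?P<relevance>all|only-relevant))?"
    r"_(?P<split>training|validation|testing)"
    r"(?:_(?P<subset>\d{3}-\d{3}))?"
    r"(?:_(?P<tagging>automatically_tagged|tagged))?"
)


def parse_ner_name(name):
    """Split a base file name into its variables; return None if it does not match."""
    match = NER_NAME.fullmatch(name)
    return match.groupdict() if match else None


parsed = parse_ner_name(
    "dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged"
)
print(parsed)
```

A missing `tagging` field presumably marks the tags obtained by information retrieval alone, before completion by language models or human annotators.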

1 The .docx files were authored by human annotators and contain extra details missing from the .sentences.txt and .ner_tags.txt files. The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).

Table 2: Dataset statistics from the archive named-entity-recognition-annotations.zip, ordered by the number of B-* tags.

| file | size | # sentences | # tokens | # B-* tags | # B-PER tags | # B-LOC tags | # types |
|---|---|---|---|---|---|---|---|
| dataset_ner_fuzzy-regex_all_all_training_automatically_tagged | 230.4 MB | 407395 | 24585832 | 2669582 | 1403789 | 1265793 | 2420836 |
| dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged | 231.6 MB | 411715 | 24735069 | 2640803 | 1378804 | 1261999 | 2427135 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged | 164.4 MB | 353301 | 17387149 | 2065805 | 1100245 | 965560 | 1850210 |
| dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged | 162.9 MB | 348981 | 17237912 | 2049537 | 1089768 | 959769 | 1843163 |
| dataset_ner_manatee+regests_all_all_training_automatically_tagged | 95.4 MB | 158759 | 10155332 | 1175031 | 563912 | 611119 | 1267107 |
| dataset_ner_manatee_all_all_training_automatically_tagged | 93.8 MB | 154439 | 10006095 | 1158763 | 553435 | 605328 | 1258983 |
| dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged | 64.5 MB | 134909 | 6795014 | 870613 | 423345 | 447268 | 932654 |
| dataset_ner_manatee_non-crossing_all_training_automatically_tagged | 63.0 MB | 130589 | 6645777 | 854345 | 412868 | 441477 | 923554 |
| dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged | 58.3 MB | 81651 | 6211198 | 685020 | 356017 | 329003 | 910379 |
| dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged | 58.1 MB | 81149 | 6193356 | 682993 | 354671 | 328322 | 908885 |
| dataset_ner_fuzzy-regex+regests_all_all_training | 218.0 MB | 411715 | 24735069 | 606807 | 290530 | 316277 | 2427135 |
| dataset_ner_fuzzy-regex_all_all_training | 217.7 MB | 407395 | 24585832 | 592822 | 281497 | 311325 | 2420836 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_training | 153.8 MB | 353301 | 17387149 | 494302 | 238381 | 255921 | 1850210 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged | 37.9 MB | 67971 | 3989670 | 487724 | 259777 | 227947 | 651387 |
| dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged | 37.7 MB | 67469 | 3971828 | 485697 | 258431 | 227266 | 649698 |
| dataset_ner_fuzzy-regex_non-crossing_all_training | 153.1 MB | 348981 | 17237912 | 480318 | 229349 | 250969 | 1843163 |
| dataset_ner_manatee+regests_all_all_validation_automatically_tagged | 21.0 MB | 28727 | 2249037 | 261612 | 120358 | 141254 | 427057 |
| dataset_ner_manatee_all_all_validation_automatically_tagged | 20.8 MB | 28225 | 2231195 | 259585 | 119012 | 140573 | 425088 |
| dataset_ner_manatee+regests_all_all_training | 88.9 MB | 158759 | 10155332 | 214566 | 79924 | 134642 | 1267107 |
| dataset_ner_manatee_all_all_training | 87.9 MB | 154439 | 10006095 | 200582 | 70892 | 129690 | 1258983 |
| dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged | 12.8 MB | 23643 | 1348859 | 176809 | 83699 | 93110 | 293119 |
| dataset_ner_manatee+regests_non-crossing_all_training | 59.8 MB | 134909 | 6795014 | 174902 | 65897 | 109005 | 932654 |
| dataset_ner_manatee_non-crossing_all_validation_automatically_tagged | 12.6 MB | 23141 | 1331017 | 174782 | 82353 | 92429 | 290894 |
| dataset_ner_manatee_non-crossing_all_training | 58.6 MB | 130589 | 6645777 | 160918 | 56865 | 104053 | 923554 |
| dataset_ner_fuzzy-regex+regests_all_all_validation | 54.2 MB | 81651 | 6211198 | 92485 | 46038 | 46447 | 910379 |
| dataset_ner_fuzzy-regex_all_all_testing | 54.2 MB | 80929 | 6167375 | 90747 | 45176 | 45571 | 908276 |
| dataset_ner_fuzzy-regex_all_all_validation | 54.4 MB | 81149 | 6193356 | 90719 | 44878 | 45841 | 908885 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_validation | 35.0 MB | 67971 | 3989670 | 75207 | 37496 | 37711 | 651387 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged | 6.6 MB | 14942 | 694242 | 73757 | 41838 | 31919 | 119272 |
| dataset_ner_fuzzy-regex_non-crossing_all_testing | 34.8 MB | 67208 | 3938611 | 73476 | 36506 | 36970 | 644220 |
| dataset_ner_fuzzy-regex_non-crossing_all_validation | 35.1 MB | 67469 | 3971828 | 73441 | 36336 | 37105 | 649698 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged | 5.3 MB | 13456 | 548928 | 61522 | 35007 | 26515 | 99275 |
| dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged | 5.1 MB | 10622 | 545005 | 57489 | 31361 | 26128 | 98843 |
| dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged | 4.6 MB | 11813 | 490147 | 51653 | 28315 | 23338 | 88535 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged | 3.7 MB | 9136 | 399691 | 45254 | 24530 | 20724 | 77963 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged | 3.8 MB | 10813 | 401164 | 44213 | 24435 | 19778 | 74376 |
| dataset_ner_manatee_all_only-relevant_training_automatically_tagged | 3.1 MB | 7493 | 340910 | 34247 | 17193 | 17054 | 66659 |
| dataset_ner_manatee+regests_all_all_validation | 19.5 MB | 28727 | 2249037 | 32546 | 12999 | 19547 | 427057 |
| dataset_ner_manatee_all_all_testing | 19.9 MB | 29516 | 2279822 | 32234 | 12555 | 19679 | 437414 |
| dataset_ner_manatee_all_all_validation | 19.4 MB | 28225 | 2231195 | 30780 | 11839 | 18941 | 425088 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_training | 6.3 MB | 14942 | 694242 | 30455 | 19214 | 11241 | 119272 |
| dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged | 2.3 MB | 6493 | 251927 | 27945 | 13958 | 13987 | 51600 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training | 5.0 MB | 13456 | 548928 | 27324 | 17257 | 10067 | 99275 |
| dataset_ner_manatee+regests_non-crossing_all_validation | 11.8 MB | 23643 | 1348859 | 26287 | 10498 | 15789 | 293119 |
| dataset_ner_manatee_non-crossing_all_testing | 12.2 MB | 24420 | 1384547 | 25937 | 10068 | 15869 | 300862 |
| dataset_ner_manatee_non-crossing_all_validation | 11.7 MB | 23141 | 1331017 | 24521 | 9338 | 15183 | 290894 |
| dataset_ner_manatee+regests_all_only-relevant_training | 4.4 MB | 11813 | 490147 | 24212 | 13626 | 10586 | 88535 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_training | 3.7 MB | 10813 | 401164 | 22583 | 12909 | 9674 | 74376 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged | 1.5 MB | 2776 | 158548 | 16901 | 9936 | 6965 | 44018 |
| dataset_ner_fuzzy-regex_all_only-relevant_training | 4.8 MB | 10622 | 545005 | 16471 | 10182 | 6289 | 98843 |
| dataset_ner_regests_training_automatically_tagged | 1.5 MB | 4320 | 149237 | 16268 | 10477 | 5791 | 29166 |
| dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged | 1.3 MB | 2274 | 140706 | 14874 | 8590 | 6284 | 39612 |
| dataset_ner_regests_training | 1.5 MB | 4320 | 149237 | 13984 | 9032 | 4952 | 29166 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_training | 3.5 MB | 9136 | 399691 | 13340 | 8225 | 5115 | 77963 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged | 1.1 MB | 2420 | 110376 | 12902 | 7592 | 5310 | 33352 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged | 885.1 kB | 1918 | 92534 | 10875 | 6246 | 4629 | 28676 |
| dataset_ner_manatee_all_only-relevant_training | 2.9 MB | 7493 | 340910 | 10228 | 4594 | 5634 | 66659 |
| dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged | 913.3 kB | 1972 | 97069 | 10180 | 5592 | 4588 | 28324 |
| dataset_ner_manatee_non-crossing_only-relevant_training | 2.2 MB | 6493 | 251927 | 8599 | 3877 | 4722 | 51600 |
| dataset_ner_manatee_all_only-relevant_validation_automatically_tagged | 730.1 kB | 1470 | 79227 | 8153 | 4246 | 3907 | 23569 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged | 683.4 kB | 1751 | 71948 | 8136 | 4501 | 3635 | 22133 |
| dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged | 500.3 kB | 1249 | 54106 | 6109 | 3155 | 2954 | 17138 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_validation | 1.4 MB | 2776 | 158548 | 4421 | 2817 | 1604 | 44018 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation | 998.7 kB | 2420 | 110376 | 3938 | 2519 | 1419 | 33352 |
| dataset_ner_manatee+regests_all_only-relevant_validation | 862.5 kB | 1972 | 97069 | 3347 | 1887 | 1460 | 28324 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_validation | 646.8 kB | 1751 | 71948 | 3094 | 1774 | 1320 | 22133 |
| dataset_ner_fuzzy-regex_all_only-relevant_testing | 1.3 MB | 2405 | 144684 | 2780 | 1784 | 996 | 39977 |
| dataset_ner_fuzzy-regex_all_only-relevant_validation | 1.2 MB | 2274 | 140706 | 2655 | 1657 | 998 | 39612 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing | 867.0 kB | 2034 | 98659 | 2292 | 1455 | 837 | 29874 |
| dataset_ner_regests_testing | 261.7 kB | 799 | 26148 | 2182 | 1422 | 760 | 8978 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation | 818.4 kB | 1918 | 92534 | 2172 | 1359 | 813 | 28676 |
| dataset_ner_regests_validation_automatically_tagged | 183.1 kB | 502 | 17842 | 2027 | 1346 | 681 | 6445 |
| dataset_ner_regests_validation | 181.7 kB | 502 | 17842 | 1766 | 1160 | 606 | 6445 |
| dataset_ner_manatee_all_only-relevant_validation | 681.8 kB | 1470 | 79227 | 1581 | 727 | 854 | 23569 |
| dataset_ner_manatee_all_only-relevant_testing | 678.8 kB | 1420 | 78751 | 1529 | 695 | 834 | 23949 |
| dataset_ner_manatee_non-crossing_only-relevant_validation | 465.9 kB | 1249 | 54106 | 1328 | 614 | 714 | 17138 |
| dataset_ner_manatee_non-crossing_only-relevant_testing | 469.1 kB | 1208 | 54391 | 1283 | 587 | 696 | 17713 |
| dataset_ner_regests_testing_001-400 | 129.8 kB | 400 | 12811 | 1164 | 789 | 375 | 5121 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged | 41.6 kB | 100 | 4507 | 530 | 287 | 243 | 2449 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_001-400 | 169.0 kB | 400 | 19554 | 439 | 201 | 238 | 7928 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 | 38.5 kB | 100 | 4507 | 110 | 55 | 55 | 2449 |
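
The tag columns of Table 2 count tags that open an entity. Below is a minimal sketch of how such counts could be computed. It assumes, which is not stated above, that a .sentences.txt file and its .ner_tags.txt file are line-aligned, with whitespace-separated tokens and one tag per token in a BIO-style scheme that includes `B-PER` and `B-LOC`. The sample sentence, its tags, and the helper `count_entity_starts` are illustrative only.

```python
from collections import Counter


def count_entity_starts(sentence_lines, tag_lines):
    """Count B-* tags per entity type in line-aligned sentences and tags."""
    counts = Counter()
    for sentence, tags in zip(sentence_lines, tag_lines):
        tokens, labels = sentence.split(), tags.split()
        if len(tokens) != len(labels):
            raise ValueError("sentence and tag line differ in length")
        counts.update(label for label in labels if label.startswith("B-"))
    return counts


# Hypothetical sample, not taken from the dataset files.
sentences = ["Blažek z Kralup přišel do Prahy ."]
tags = ["B-PER I-PER I-PER O O B-LOC O"]

counts = count_entity_starts(sentences, tags)
print(counts["B-PER"], counts["B-LOC"], sum(counts.values()))  # 1 1 2
```

With real files, the two line sequences would come from opening the paired files, for example with `open(path, encoding="utf-8")`.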

Citing

If you use our dataset in your work, please cite the following article:

TODO

If you use LaTeX, you can use the following BibTeX entry:

TODO