A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era.
The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).

You can download the dataset from the LINDAT/CLARIAH-CZ repository.

Contents

The dataset is structured as follows:

  • The archive language-modeling-corpus.zip (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.
    We used the following three variables to produce the different files:
    1. The sentences are extracted from book OCR texts and may therefore span several pages.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).
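
The three variables above combine into the eight file names listed in Table 1. The following is a minimal sketch of that naming scheme, assuming the pattern `dataset_mlm_<crossing>_<relevance>_<split>` read off the table; the function name `mlm_file_names` is hypothetical and the file extension inside the archive is not stated here, so only base names are produced.

```python
from itertools import product

# Values of the three variables, in the order they appear in the file names.
CROSSING = ("all", "non-crossing")    # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")  # which book pages are used?
SPLIT = ("training", "validation")    # training or validation split


def mlm_file_names():
    """Return the eight base names implied by the three variables.

    Assumption: the pattern is inferred from the names in Table 1.
    """
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}"
        for crossing, relevance, split in product(CROSSING, RELEVANCE, SPLIT)
    ]


names = mlm_file_names()
for name in names:
    print(name)
```

Each of the 2 × 2 × 2 combinations corresponds to one row of Table 1.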

Table 1: Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.

| file | size | # sentences | # tokens | # types |
|---|---|---|---|---|
| dataset_mlm_all_all_training | 630.7 MB | 3,228,077 | 96,556,612 | 6,198,957 |
| dataset_mlm_non-crossing_all_training | 524.1 MB | 3,009,931 | 80,220,907 | 5,362,515 |
| dataset_mlm_all_all_validation | 81.8 MB | 402,184 | 12,374,044 | 1,273,737 |
| dataset_mlm_non-crossing_all_validation | 67.3 MB | 372,885 | 10,157,799 | 1,105,583 |
| dataset_mlm_all_only-relevant_training | 8.1 MB | 47,958 | 1,286,573 | 181,845 |
| dataset_mlm_non-crossing_only-relevant_training | 6.7 MB | 44,278 | 1,074,734 | 157,354 |
| dataset_mlm_all_only-relevant_validation | 736.7 kB | 2,791 | 108,364 | 26,986 |
| dataset_mlm_non-crossing_only-relevant_validation | 549.4 kB | 2,489 | 81,293 | 22,090 |

  • The archive named-entity-recognition-annotations-small.zip (978.29 MB) contains 82 tuples of files named *.sentences.txt, *.ner_tags.txt, and in one case also *.docx (see footnote 1).
    These files contain the “small” sentences and NER tags that we used for the supervised training, validation, and testing of our intermediate language models.
    Here are the five variables that we used to produce the different files:
    1. The sentences may originate from book OCR texts using information retrieval techniques (fuzzy-regex or manatee).
      The sentences may also originate from regests (regests) or both books and regests (fuzzy-regex+regests and manatee+regests).
    2. When sentences originate from book OCR texts, they may span several pages of a book.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    3. When sentences originate from book OCR texts, they may come from book pages of different relevance.
      We either use sentences from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    4. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.
      Therefore, we also provide NER tags completed by language models (automatically_tagged) and human annotators (tagged).
    5. We split the sentences roughly into 80% for training (training), 10% for validation (validation), and 10% for testing (testing).
      For repeated testing, we subdivide the testing split (testing_001-400 and testing_401-500).
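
Each tuple pairs a *.sentences.txt file with a *.ner_tags.txt file. The sketch below shows one way such a pair might be read and how B-* tags could be counted. It rests on assumptions that are not confirmed by this page: that line i of both files describes the same sentence, that tokens and tags are whitespace-separated, and that tags follow a BIO-style scheme (only the B-PER and B-LOC labels appear in Table 2; I-PER and O are assumed). The function names and the toy sentence are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def read_tagged_sentences(sentences_path, tags_path):
    """Yield (tokens, tags) pairs from a sentences file and a tags file.

    Assumption: both files are line-aligned and whitespace-tokenized.
    Note that zip() stops at the end of the shorter file.
    """
    with open(sentences_path, encoding="utf-8") as sentences, \
            open(tags_path, encoding="utf-8") as tags:
        for number, (sentence, tag_line) in enumerate(zip(sentences, tags), start=1):
            tokens, labels = sentence.split(), tag_line.split()
            if len(tokens) != len(labels):
                raise ValueError(
                    f"line {number}: {len(tokens)} tokens but {len(labels)} tags"
                )
            yield tokens, labels


def count_begin_tags(pairs):
    """Count tags that open an entity, keyed by label (e.g. B-PER, B-LOC)."""
    return Counter(
        label for _, labels in pairs for label in labels if label.startswith("B-")
    )


# Toy demonstration with a made-up sentence (not taken from the dataset).
with tempfile.TemporaryDirectory() as directory:
    sentences_path = Path(directory) / "toy.sentences.txt"
    tags_path = Path(directory) / "toy.ner_tags.txt"
    sentences_path.write_text("Blažek z Kralup přijel do Prahy\n", encoding="utf-8")
    tags_path.write_text("B-PER I-PER I-PER O O B-LOC\n", encoding="utf-8")
    toy_pairs = list(read_tagged_sentences(sentences_path, tags_path))
    toy_counts = count_begin_tags(toy_pairs)

print(toy_counts)
```

If the real files use a different layout, only `read_tagged_sentences` would need to change; the counting step is independent of the file format.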

1 The .docx files were authored by human annotators and contain extra details missing from the *.sentences.txt and *.ner_tags.txt files. The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).

Table 2: Dataset statistics from the archive named-entity-recognition-annotations-small.zip, ordered by the number of B-* tags. In the article describing the dataset, the files dataset_ner_regests_training_* are referred to as Abstracts-Tiny, the files dataset_ner_manatee_non-crossing_only-relevant_* are referred to as Books-Small, and the files dataset_ner_manatee_non-crossing_only-relevant_*_automatically_tagged are referred to as Books-Medium.

| file | size | # sentences | # tokens | # B-* tags | # B-PER tags | # B-LOC tags | # types |
|---|---|---|---|---|---|---|---|
| dataset_ner_fuzzy-regex_all_all_training_automatically_tagged | 230.4 MB | 407,395 | 24,585,832 | 2,669,582 | 1,403,789 | 1,265,793 | 2,420,836 |
| dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged | 231.6 MB | 411,715 | 24,735,069 | 2,640,803 | 1,378,804 | 1,261,999 | 2,427,135 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged | 164.4 MB | 353,301 | 17,387,149 | 2,065,805 | 1,100,245 | 965,560 | 1,850,210 |
| dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged | 162.9 MB | 348,981 | 17,237,912 | 2,049,537 | 1,089,768 | 959,769 | 1,843,163 |
| dataset_ner_manatee+regests_all_all_training_automatically_tagged | 95.4 MB | 158,759 | 10,155,332 | 1,175,031 | 563,912 | 611,119 | 1,267,107 |
| dataset_ner_manatee_all_all_training_automatically_tagged | 93.8 MB | 154,439 | 10,006,095 | 1,158,763 | 553,435 | 605,328 | 1,258,983 |
| dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged | 64.5 MB | 134,909 | 6,795,014 | 870,613 | 423,345 | 447,268 | 932,654 |
| dataset_ner_manatee_non-crossing_all_training_automatically_tagged | 63.0 MB | 130,589 | 6,645,777 | 854,345 | 412,868 | 441,477 | 923,554 |
| dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged | 58.3 MB | 81,651 | 6,211,198 | 685,020 | 356,017 | 329,003 | 910,379 |
| dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged | 58.1 MB | 81,149 | 6,193,356 | 682,993 | 354,671 | 328,322 | 908,885 |
| dataset_ner_fuzzy-regex+regests_all_all_training | 218.0 MB | 411,715 | 24,735,069 | 606,807 | 290,530 | 316,277 | 2,427,135 |
| dataset_ner_fuzzy-regex_all_all_training | 217.7 MB | 407,395 | 24,585,832 | 592,822 | 281,497 | 311,325 | 2,420,836 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_training | 153.8 MB | 353,301 | 17,387,149 | 494,302 | 238,381 | 255,921 | 1,850,210 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged | 37.9 MB | 67,971 | 3,989,670 | 487,724 | 259,777 | 227,947 | 651,387 |
| dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged | 37.7 MB | 67,469 | 3,971,828 | 485,697 | 258,431 | 227,266 | 649,698 |
| dataset_ner_fuzzy-regex_non-crossing_all_training | 153.1 MB | 348,981 | 17,237,912 | 480,318 | 229,349 | 250,969 | 1,843,163 |
| dataset_ner_manatee+regests_all_all_validation_automatically_tagged | 21.0 MB | 28,727 | 2,249,037 | 261,612 | 120,358 | 141,254 | 427,057 |
| dataset_ner_manatee_all_all_validation_automatically_tagged | 20.8 MB | 28,225 | 2,231,195 | 259,585 | 119,012 | 140,573 | 425,088 |
| dataset_ner_manatee+regests_all_all_training | 88.9 MB | 158,759 | 10,155,332 | 214,566 | 79,924 | 134,642 | 1,267,107 |
| dataset_ner_manatee_all_all_training | 87.9 MB | 154,439 | 10,006,095 | 200,582 | 70,892 | 129,690 | 1,258,983 |
| dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged | 12.8 MB | 23,643 | 1,348,859 | 176,809 | 83,699 | 93,110 | 293,119 |
| dataset_ner_manatee+regests_non-crossing_all_training | 59.8 MB | 134,909 | 6,795,014 | 174,902 | 65,897 | 109,005 | 932,654 |
| dataset_ner_manatee_non-crossing_all_validation_automatically_tagged | 12.6 MB | 23,141 | 1,331,017 | 174,782 | 82,353 | 92,429 | 290,894 |
| dataset_ner_manatee_non-crossing_all_training | 58.6 MB | 130,589 | 6,645,777 | 160,918 | 56,865 | 104,053 | 923,554 |
| dataset_ner_fuzzy-regex+regests_all_all_validation | 54.2 MB | 81,651 | 6,211,198 | 92,485 | 46,038 | 46,447 | 910,379 |
| dataset_ner_fuzzy-regex_all_all_testing | 54.2 MB | 80,929 | 6,167,375 | 90,747 | 45,176 | 45,571 | 908,276 |
| dataset_ner_fuzzy-regex_all_all_validation | 54.4 MB | 81,149 | 6,193,356 | 90,719 | 44,878 | 45,841 | 908,885 |
| dataset_ner_fuzzy-regex+regests_non-crossing_all_validation | 35.0 MB | 67,971 | 3,989,670 | 75,207 | 37,496 | 37,711 | 651,387 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged | 6.6 MB | 14,942 | 694,242 | 73,757 | 41,838 | 31,919 | 119,272 |
| dataset_ner_fuzzy-regex_non-crossing_all_testing | 34.8 MB | 67,208 | 3,938,611 | 73,476 | 36,506 | 36,970 | 644,220 |
| dataset_ner_fuzzy-regex_non-crossing_all_validation | 35.1 MB | 67,469 | 3,971,828 | 73,441 | 36,336 | 37,105 | 649,698 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged | 5.3 MB | 13,456 | 548,928 | 61,522 | 35,007 | 26,515 | 99,275 |
| dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged | 5.1 MB | 10,622 | 545,005 | 57,489 | 31,361 | 26,128 | 98,843 |
| dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged | 4.6 MB | 11,813 | 490,147 | 51,653 | 28,315 | 23,338 | 88,535 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged | 3.7 MB | 9,136 | 399,691 | 45,254 | 24,530 | 20,724 | 77,963 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged | 3.8 MB | 10,813 | 401,164 | 44,213 | 24,435 | 19,778 | 74,376 |
| dataset_ner_manatee_all_only-relevant_training_automatically_tagged | 3.1 MB | 7,493 | 340,910 | 34,247 | 17,193 | 17,054 | 66,659 |
| dataset_ner_manatee+regests_all_all_validation | 19.5 MB | 28,727 | 2,249,037 | 32,546 | 12,999 | 19,547 | 427,057 |
| dataset_ner_manatee_all_all_testing | 19.9 MB | 29,516 | 2,279,822 | 32,234 | 12,555 | 19,679 | 437,414 |
| dataset_ner_manatee_all_all_validation | 19.4 MB | 28,225 | 2,231,195 | 30,780 | 11,839 | 18,941 | 425,088 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_training | 6.3 MB | 14,942 | 694,242 | 30,455 | 19,214 | 11,241 | 119,272 |
| dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged | 2.3 MB | 6,493 | 251,927 | 27,945 | 13,958 | 13,987 | 51,600 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training | 5.0 MB | 13,456 | 548,928 | 27,324 | 17,257 | 10,067 | 99,275 |
| dataset_ner_manatee+regests_non-crossing_all_validation | 11.8 MB | 23,643 | 1,348,859 | 26,287 | 10,498 | 15,789 | 293,119 |
| dataset_ner_manatee_non-crossing_all_testing | 12.2 MB | 24,420 | 1,384,547 | 25,937 | 10,068 | 15,869 | 300,862 |
| dataset_ner_manatee_non-crossing_all_validation | 11.7 MB | 23,141 | 1,331,017 | 24,521 | 9,338 | 15,183 | 290,894 |
| dataset_ner_manatee+regests_all_only-relevant_training | 4.4 MB | 11,813 | 490,147 | 24,212 | 13,626 | 10,586 | 88,535 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_training | 3.7 MB | 10,813 | 401,164 | 22,583 | 12,909 | 9,674 | 74,376 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged | 1.5 MB | 2,776 | 158,548 | 16,901 | 9,936 | 6,965 | 44,018 |
| dataset_ner_fuzzy-regex_all_only-relevant_training | 4.8 MB | 10,622 | 545,005 | 16,471 | 10,182 | 6,289 | 98,843 |
| dataset_ner_regests_training_automatically_tagged | 1.5 MB | 4,320 | 149,237 | 16,268 | 10,477 | 5,791 | 29,166 |
| dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged | 1.3 MB | 2,274 | 140,706 | 14,874 | 8,590 | 6,284 | 39,612 |
| dataset_ner_regests_training | 1.5 MB | 4,320 | 149,237 | 13,984 | 9,032 | 4,952 | 29,166 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_training | 3.5 MB | 9,136 | 399,691 | 13,340 | 8,225 | 5,115 | 77,963 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged | 1.1 MB | 2,420 | 110,376 | 12,902 | 7,592 | 5,310 | 33,352 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged | 885.1 kB | 1,918 | 92,534 | 10,875 | 6,246 | 4,629 | 28,676 |
| dataset_ner_manatee_all_only-relevant_training | 2.9 MB | 7,493 | 340,910 | 10,228 | 4,594 | 5,634 | 66,659 |
| dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged | 913.3 kB | 1,972 | 97,069 | 10,180 | 5,592 | 4,588 | 28,324 |
| dataset_ner_manatee_non-crossing_only-relevant_training | 2.2 MB | 6,493 | 251,927 | 8,599 | 3,877 | 4,722 | 51,600 |
| dataset_ner_manatee_all_only-relevant_validation_automatically_tagged | 730.1 kB | 1,470 | 79,227 | 8,153 | 4,246 | 3,907 | 23,569 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged | 683.4 kB | 1,751 | 71,948 | 8,136 | 4,501 | 3,635 | 22,133 |
| dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged | 500.3 kB | 1,249 | 54,106 | 6,109 | 3,155 | 2,954 | 17,138 |
| dataset_ner_fuzzy-regex+regests_all_only-relevant_validation | 1.4 MB | 2,776 | 158,548 | 4,421 | 2,817 | 1,604 | 44,018 |
| dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation | 998.7 kB | 2,420 | 110,376 | 3,938 | 2,519 | 1,419 | 33,352 |
| dataset_ner_manatee+regests_all_only-relevant_validation | 862.5 kB | 1,972 | 97,069 | 3,347 | 1,887 | 1,460 | 28,324 |
| dataset_ner_manatee+regests_non-crossing_only-relevant_validation | 646.8 kB | 1,751 | 71,948 | 3,094 | 1,774 | 1,320 | 22,133 |
| dataset_ner_fuzzy-regex_all_only-relevant_testing | 1.3 MB | 2,405 | 144,684 | 2,780 | 1,784 | 996 | 39,977 |
| dataset_ner_fuzzy-regex_all_only-relevant_validation | 1.2 MB | 2,274 | 140,706 | 2,655 | 1,657 | 998 | 39,612 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing | 867.0 kB | 2,034 | 98,659 | 2,292 | 1,455 | 837 | 29,874 |
| dataset_ner_regests_testing | 261.7 kB | 799 | 26,148 | 2,182 | 1,422 | 760 | 8,978 |
| dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation | 818.4 kB | 1,918 | 92,534 | 2,172 | 1,359 | 813 | 28,676 |
| dataset_ner_regests_validation_automatically_tagged | 183.1 kB | 502 | 17,842 | 2,027 | 1,346 | 681 | 6,445 |
| dataset_ner_regests_validation | 181.7 kB | 502 | 17,842 | 1,766 | 1,160 | 606 | 6,445 |
| dataset_ner_manatee_all_only-relevant_validation | 681.8 kB | 1,470 | 79,227 | 1,581 | 727 | 854 | 23,569 |
| dataset_ner_manatee_all_only-relevant_testing | 678.8 kB | 1,420 | 78,751 | 1,529 | 695 | 834 | 23,949 |
| dataset_ner_manatee_non-crossing_only-relevant_validation | 465.9 kB | 1,249 | 54,106 | 1,328 | 614 | 714 | 17,138 |
| dataset_ner_manatee_non-crossing_only-relevant_testing | 469.1 kB | 1,208 | 54,391 | 1,283 | 587 | 696 | 17,713 |
| dataset_ner_regests_testing_001-400 | 129.8 kB | 400 | 12,811 | 1,164 | 789 | 375 | 5,121 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged | 41.6 kB | 100 | 4,507 | 530 | 287 | 243 | 2,449 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged | 41.0 kB | 100 | 4,507 | 459 | 233 | 226 | 2,449 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_001-400 | 169.0 kB | 400 | 19,554 | 439 | 201 | 238 | 7,928 |
| dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 | 38.5 kB | 100 | 4,507 | 110 | 55 | 55 | 2,449 |

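
The last rows of Table 2 give three sets of tags for the same 100 testing sentences (testing_401-500): tags from information retrieval alone, tags completed by language models (automatically_tagged), and tags completed by human annotators (tagged). As a rough illustration of how many entities the retrieval step leaves untagged, the sketch below compares the B-* tag counts from the table. It compares totals only and makes no claim about whether the individual tags agree; the variable and function names are hypothetical.

```python
# B-* tag counts for the same 100 testing sentences (401-500), from Table 2.
b_tags = {
    "retrieval only": 110,        # ..._testing_401-500
    "automatically_tagged": 459,  # ..._testing_401-500_automatically_tagged
    "tagged": 530,                # ..._testing_401-500_tagged (human annotators)
}


def share_of_human_tags(count, human_count=b_tags["tagged"]):
    """Return a tag count as a fraction of the human-annotated count."""
    return count / human_count


for name, count in b_tags.items():
    share = share_of_human_tags(count)
    print(f"{name}: {count} B-* tags, {share:.1%} of the human count")
```

By this count, retrieval alone yields roughly a fifth as many B-* tags as the human annotators produced for these sentences.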
  • The archive named-entity-recognition-annotations-large.zip (1.31 GB) contains 16 tuples of files named *.sentences.txt and *.ner_tags.txt.
    These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.
    Here are the four variables that we used to produce the different files:
    1. The sentences are extracted from book OCR texts and may therefore span several pages.
      However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
      We either allow the sentences to cross page boundaries (all) or not (non-crossing).
    2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
    3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).
    4. We use an ensemble of a baseline model and weak fourth-generation NER models (004) or the final seventh-generation NER model (007).
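
The four variables above are encoded in the file names listed in Table 3. The following is a minimal sketch of how a base name might be split back into those variables, assuming the pattern `dataset_mlm_<crossing>_<relevance>_<split>_automatically_tagged_<generation>` read off the table; the names `LARGE_NAME` and `parse_large_name` are hypothetical.

```python
import re

# Pattern for base names such as
# dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007
# (assumption: inferred from the file names in Table 3).
LARGE_NAME = re.compile(
    r"dataset_mlm"
    r"_(?P<crossing>all|non-crossing)"      # sentences may cross page boundaries?
    r"_(?P<relevance>all|only-relevant)"    # all pages or only relevant ones
    r"_(?P<split>training|validation)"      # data split
    r"_automatically_tagged"
    r"_(?P<generation>004|007)"             # NER model generation
)


def parse_large_name(base_name):
    """Split a base file name into the four variables, or raise ValueError."""
    match = LARGE_NAME.fullmatch(base_name)
    if match is None:
        raise ValueError(f"unexpected file name: {base_name!r}")
    return match.groupdict()


example = parse_large_name(
    "dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007"
)
print(example)
```

Using `fullmatch` rejects names with extra suffixes, so a file extension would have to be stripped before parsing.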

Table 3: Dataset statistics from the archive named-entity-recognition-annotations-large.zip, ordered by the number of B-* tags. In the article describing the dataset, the files dataset_mlm_non-crossing_only-relevant_*_automatically_tagged_007 are referred to as Books-Large and the files dataset_mlm_all_all_training_automatically_tagged_007 are referred to as Books-Huge.

| file | size | # sentences | # tokens | # B-* tags | # B-PER tags | # B-LOC tags | # types |
|---|---|---|---|---|---|---|---|
| dataset_mlm_all_all_training_automatically_tagged_007 | 860.0 MB | 3,227,624 | 95,054,481 | 6,340,811 | 3,794,991 | 2,545,820 | 6,562,841 |
| dataset_mlm_all_all_training_automatically_tagged_004 | 882.6 MB | 3,227,624 | 95,054,481 | 9,727,269 | 5,429,801 | 4,297,468 | 6,562,841 |
| dataset_mlm_non-crossing_all_training_automatically_tagged_004 | 736.0 MB | 3,009,481 | 79,003,252 | 8,447,053 | 4,721,604 | 3,725,449 | 5,660,658 |
| dataset_mlm_non-crossing_all_training_automatically_tagged_007 | 716.0 MB | 3,009,481 | 79,003,252 | 5,441,290 | 3,264,675 | 2,176,615 | 5,660,658 |
| dataset_mlm_all_all_validation_automatically_tagged_004 | 114.0 MB | 402,179 | 12,240,756 | 1,201,467 | 659,139 | 542,328 | 1,319,365 |
| dataset_mlm_all_all_validation_automatically_tagged_007 | 111.2 MB | 402,179 | 12,240,756 | 781,509 | 462,102 | 319,407 | 1,319,365 |
| dataset_mlm_non-crossing_all_validation_automatically_tagged_004 | 94.0 MB | 372,880 | 10,061,113 | 1,035,283 | 571,082 | 464,201 | 1,141,033 |
| dataset_mlm_non-crossing_all_validation_automatically_tagged_007 | 91.6 MB | 372,880 | 10,061,113 | 663,793 | 395,771 | 268,022 | 1,141,033 |
| dataset_mlm_all_only-relevant_training_automatically_tagged_004 | 11.5 MB | 47,835 | 1,277,430 | 133,101 | 64,711 | 68,390 | 183,563 |
| dataset_mlm_all_only-relevant_training_automatically_tagged_007 | 11.3 MB | 47,835 | 1,277,430 | 99,103 | 50,544 | 48,559 | 183,563 |
| dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 | 9.6 MB | 44,155 | 1,066,545 | 116,176 | 55,996 | 60,180 | 158,622 |
| dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 | 9.4 MB | 44,155 | 1,066,545 | 85,675 | 43,360 | 42,315 | 158,622 |
| dataset_mlm_all_only-relevant_validation_automatically_tagged_004 | 1.0 MB | 2,786 | 107,609 | 8,937 | 4,125 | 4,812 | 27,019 |
| dataset_mlm_all_only-relevant_validation_automatically_tagged_007 | 989.2 kB | 2,786 | 107,609 | 6,581 | 2,980 | 3,601 | 27,019 |
| dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 | 754.3 kB | 2,484 | 80,619 | 7,290 | 3,380 | 3,910 | 22,087 |
| dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 | 740.2 kB | 2,484 | 80,619 | 5,281 | 2,404 | 2,877 | 22,087 |

Corpus

The file corpus.vert.gz (1.3G compressed) contains a vertical file with the results of optical character recognition, named entity recognition, language identification, and lemmatization on all books in the AHISTO project database. See also the schema of the vertical file. (Warning: The corpus is a work in progress and may change. Last modified: 2023-05-25)
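
The sketch below shows one way a gzip-compressed vertical file might be streamed without decompressing it to disk. It assumes the common vertical-file convention of one token per line with tab-separated attributes and structural markup (such as document boundaries) on lines of its own; the actual attributes and their order are defined by the schema of the vertical file, which is not reproduced here. The function name `iter_vertical` and the toy file contents are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def iter_vertical(path):
    """Yield ("structure", line) or ("token", attributes) items.

    Assumption: structural markup sits on its own line inside angle
    brackets, and every other non-empty line is one token whose
    attributes are separated by tabs.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("<") and line.endswith(">"):
                yield "structure", line
            else:
                yield "token", line.split("\t")


# Toy demonstration with made-up content (not taken from corpus.vert.gz).
with tempfile.TemporaryDirectory() as directory:
    toy_path = Path(directory) / "toy.vert.gz"
    with gzip.open(toy_path, mode="wt", encoding="utf-8") as out:
        out.write('<doc id="toy">\nPraha\tPraha\tB-LOC\n</doc>\n')
    toy_items = list(iter_vertical(toy_path))

for kind, value in toy_items:
    print(kind, value)
```

Because the function is a generator, memory use stays small even for a file of this size.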

Citing

An article describing our dataset is currently under review. A preprint is available on arXiv.

Last modified on 29 May 2023 at 9:13:20.