A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents
This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era.
The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
You can download the dataset from the LINDAT/CLARIAH-CZ repository.
Contents
The dataset is structured as follows:
- The archive language-modeling-corpus.zip (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.
  We used the following three variables to produce the different files:
  - The sentences are extracted from book OCR texts and may therefore span several pages. However, page boundaries contain pollutants such as running heads, footnotes, and page numbers. We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
  - The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
  - We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
Table 1: Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.
file | file size | # sentences | # tokens | # types |
---|---|---|---|---|
dataset_mlm_all_all_training | 630.7 MB | 3,228,077 | 96,556,612 | 6,198,957 |
dataset_mlm_non-crossing_all_training | 524.1 MB | 3,009,931 | 80,220,907 | 5,362,515 |
dataset_mlm_all_all_validation | 81.8 MB | 402,184 | 12,374,044 | 1,273,737 |
dataset_mlm_non-crossing_all_validation | 67.3 MB | 372,885 | 10,157,799 | 1,105,583 |
dataset_mlm_all_only-relevant_training | 8.1 MB | 47,958 | 1,286,573 | 181,845 |
dataset_mlm_non-crossing_only-relevant_training | 6.7 MB | 44,278 | 1,074,734 | 157,354 |
dataset_mlm_all_only-relevant_validation | 736.7 kB | 2,791 | 108,364 | 26,986 |
dataset_mlm_non-crossing_only-relevant_validation | 549.4 kB | 2,489 | 81,293 | 22,090 |
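The file names in Table 1 encode the three variables described above, in the order page crossing, page relevance, and split. The following is a minimal sketch of how a name could be decomposed, assuming the names appear exactly as listed in the table (without a file extension). The helper `parse_mlm_file_name` and the `MlmFileName` type are hypothetical and not part of the dataset.

```python
from typing import NamedTuple

# Allowed values, taken from the description of the three variables above.
CROSSING = {"all", "non-crossing"}
RELEVANCE = {"all", "only-relevant"}
SPLITS = {"training", "validation"}


class MlmFileName(NamedTuple):
    crossing: str   # whether sentences may cross page boundaries
    relevance: str  # whether all pages or only relevant pages were used
    split: str      # training or validation


def parse_mlm_file_name(name: str) -> MlmFileName:
    """Decompose a name such as dataset_mlm_non-crossing_all_training."""
    prefix = "dataset_mlm_"
    if not name.startswith(prefix):
        raise ValueError(f"unexpected prefix in file name: {name!r}")
    # The values themselves use hyphens, so underscores separate the fields.
    parts = name[len(prefix):].split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three fields in file name: {name!r}")
    crossing, relevance, split = parts
    if crossing not in CROSSING or relevance not in RELEVANCE or split not in SPLITS:
        raise ValueError(f"unknown field value in file name: {name!r}")
    return MlmFileName(crossing, relevance, split)
```

For example, `parse_mlm_file_name("dataset_mlm_non-crossing_only-relevant_validation")` yields the fields `non-crossing`, `only-relevant`, and `validation`.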
- The archive named-entity-recognition-annotations-small.zip (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `*.ner_tags.txt`, and in one case also `*.docx`.¹
  These files contain the “small” sentences and NER tags that we used for the supervised training, validation, and testing of our intermediate language models.
  Here are the five variables that we used to produce the different files:
  - The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`). The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `manatee+regests`).
  - When sentences originate from book OCR texts, they may span several pages of a book. However, page boundaries contain pollutants such as running heads, footnotes, and page numbers. We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
  - When sentences originate from book OCR texts, they may come from book pages of different relevance. We either use sentences from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
  - When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags. Therefore, we also provide NER tags completed by language models (`automatically_tagged`) and human annotators (`tagged`).
  - We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`). For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).
¹ The `.docx` files were authored by human annotators and contain extra details missing from the `.sentences.txt` and `.ner_tags.txt` files. The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).
Table 2: Dataset statistics from the archive named-entity-recognition-annotations-small.zip, ordered by the number of B-* tags. In the article describing the dataset, the files `dataset_ner_regests_training_*` are referred to as Abstracts-Tiny, the files `dataset_ner_manatee_non-crossing_only-relevant_*` are referred to as Books-Small, and the files `dataset_ner_manatee_non-crossing_only-relevant_*_automatically_tagged` are referred to as Books-Medium.
file | file size | # sentences | # tokens | # B-* tags | # B-PER tags | # B-LOC tags | # types |
---|---|---|---|---|---|---|---|
dataset_ner_fuzzy-regex_all_all_training_automatically_tagged | 230.4 MB | 407,395 | 24,585,832 | 2,669,582 | 1,403,789 | 1,265,793 | 2,420,836 |
dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged | 231.6 MB | 411,715 | 24,735,069 | 2,640,803 | 1,378,804 | 1,261,999 | 2,427,135 |
dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged | 164.4 MB | 353,301 | 17,387,149 | 2,065,805 | 1,100,245 | 965,560 | 1,850,210 |
dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged | 162.9 MB | 348,981 | 17,237,912 | 2,049,537 | 1,089,768 | 959,769 | 1,843,163 |
dataset_ner_manatee+regests_all_all_training_automatically_tagged | 95.4 MB | 158,759 | 10,155,332 | 1,175,031 | 563,912 | 611,119 | 1,267,107 |
dataset_ner_manatee_all_all_training_automatically_tagged | 93.8 MB | 154,439 | 10,006,095 | 1,158,763 | 553,435 | 605,328 | 1,258,983 |
dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged | 64.5 MB | 134,909 | 6,795,014 | 870,613 | 423,345 | 447,268 | 932,654 |
dataset_ner_manatee_non-crossing_all_training_automatically_tagged | 63.0 MB | 130,589 | 6,645,777 | 854,345 | 412,868 | 441,477 | 923,554 |
dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged | 58.3 MB | 81,651 | 6,211,198 | 685,020 | 356,017 | 329,003 | 910,379 |
dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged | 58.1 MB | 81,149 | 6,193,356 | 682,993 | 354,671 | 328,322 | 908,885 |
dataset_ner_fuzzy-regex+regests_all_all_training | 218.0 MB | 411,715 | 24,735,069 | 606,807 | 290,530 | 316,277 | 2,427,135 |
dataset_ner_fuzzy-regex_all_all_training | 217.7 MB | 407,395 | 24,585,832 | 592,822 | 281,497 | 311,325 | 2,420,836 |
dataset_ner_fuzzy-regex+regests_non-crossing_all_training | 153.8 MB | 353,301 | 17,387,149 | 494,302 | 238,381 | 255,921 | 1,850,210 |
dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged | 37.9 MB | 67,971 | 3,989,670 | 487,724 | 259,777 | 227,947 | 651,387 |
dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged | 37.7 MB | 67,469 | 3,971,828 | 485,697 | 258,431 | 227,266 | 649,698 |
dataset_ner_fuzzy-regex_non-crossing_all_training | 153.1 MB | 348,981 | 17,237,912 | 480,318 | 229,349 | 250,969 | 1,843,163 |
dataset_ner_manatee+regests_all_all_validation_automatically_tagged | 21.0 MB | 28,727 | 2,249,037 | 261,612 | 120,358 | 141,254 | 427,057 |
dataset_ner_manatee_all_all_validation_automatically_tagged | 20.8 MB | 28,225 | 2,231,195 | 259,585 | 119,012 | 140,573 | 425,088 |
dataset_ner_manatee+regests_all_all_training | 88.9 MB | 158,759 | 10,155,332 | 214,566 | 79,924 | 134,642 | 1,267,107 |
dataset_ner_manatee_all_all_training | 87.9 MB | 154,439 | 10,006,095 | 200,582 | 70,892 | 129,690 | 1,258,983 |
dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged | 12.8 MB | 23,643 | 1,348,859 | 176,809 | 83,699 | 93,110 | 293,119 |
dataset_ner_manatee+regests_non-crossing_all_training | 59.8 MB | 134,909 | 6,795,014 | 174,902 | 65,897 | 109,005 | 932,654 |
dataset_ner_manatee_non-crossing_all_validation_automatically_tagged | 12.6 MB | 23,141 | 1,331,017 | 174,782 | 82,353 | 92,429 | 290,894 |
dataset_ner_manatee_non-crossing_all_training | 58.6 MB | 130,589 | 6,645,777 | 160,918 | 56,865 | 104,053 | 923,554 |
dataset_ner_fuzzy-regex+regests_all_all_validation | 54.2 MB | 81,651 | 6,211,198 | 92,485 | 46,038 | 46,447 | 910,379 |
dataset_ner_fuzzy-regex_all_all_testing | 54.2 MB | 80,929 | 6,167,375 | 90,747 | 45,176 | 45,571 | 908,276 |
dataset_ner_fuzzy-regex_all_all_validation | 54.4 MB | 81,149 | 6,193,356 | 90,719 | 44,878 | 45,841 | 908,885 |
dataset_ner_fuzzy-regex+regests_non-crossing_all_validation | 35.0 MB | 67,971 | 3,989,670 | 75,207 | 37,496 | 37,711 | 651,387 |
dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged | 6.6 MB | 14,942 | 694,242 | 73,757 | 41,838 | 31,919 | 119,272 |
dataset_ner_fuzzy-regex_non-crossing_all_testing | 34.8 MB | 67,208 | 3,938,611 | 73,476 | 36,506 | 36,970 | 644,220 |
dataset_ner_fuzzy-regex_non-crossing_all_validation | 35.1 MB | 67,469 | 3,971,828 | 73,441 | 36,336 | 37,105 | 649,698 |
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged | 5.3 MB | 13,456 | 548,928 | 61,522 | 35,007 | 26,515 | 99,275 |
dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged | 5.1 MB | 10,622 | 545,005 | 57,489 | 31,361 | 26,128 | 98,843 |
dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged | 4.6 MB | 11,813 | 490,147 | 51,653 | 28,315 | 23,338 | 88,535 |
dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged | 3.7 MB | 9,136 | 399,691 | 45,254 | 24,530 | 20,724 | 77,963 |
dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged | 3.8 MB | 10,813 | 401,164 | 44,213 | 24,435 | 19,778 | 74,376 |
dataset_ner_manatee_all_only-relevant_training_automatically_tagged | 3.1 MB | 7,493 | 340,910 | 34,247 | 17,193 | 17,054 | 66,659 |
dataset_ner_manatee+regests_all_all_validation | 19.5 MB | 28,727 | 2,249,037 | 32,546 | 12,999 | 19,547 | 427,057 |
dataset_ner_manatee_all_all_testing | 19.9 MB | 29,516 | 2,279,822 | 32,234 | 12,555 | 19,679 | 437,414 |
dataset_ner_manatee_all_all_validation | 19.4 MB | 28,225 | 2,231,195 | 30,780 | 11,839 | 18,941 | 425,088 |
dataset_ner_fuzzy-regex+regests_all_only-relevant_training | 6.3 MB | 14,942 | 694,242 | 30,455 | 19,214 | 11,241 | 119,272 |
dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged | 2.3 MB | 6,493 | 251,927 | 27,945 | 13,958 | 13,987 | 51,600 |
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training | 5.0 MB | 13,456 | 548,928 | 27,324 | 17,257 | 10,067 | 99,275 |
dataset_ner_manatee+regests_non-crossing_all_validation | 11.8 MB | 23,643 | 1,348,859 | 26,287 | 10,498 | 15,789 | 293,119 |
dataset_ner_manatee_non-crossing_all_testing | 12.2 MB | 24,420 | 1,384,547 | 25,937 | 10,068 | 15,869 | 300,862 |
dataset_ner_manatee_non-crossing_all_validation | 11.7 MB | 23,141 | 1,331,017 | 24,521 | 9,338 | 15,183 | 290,894 |
dataset_ner_manatee+regests_all_only-relevant_training | 4.4 MB | 11,813 | 490,147 | 24,212 | 13,626 | 10,586 | 88,535 |
dataset_ner_manatee+regests_non-crossing_only-relevant_training | 3.7 MB | 10,813 | 401,164 | 22,583 | 12,909 | 9,674 | 74,376 |
dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged | 1.5 MB | 2,776 | 158,548 | 16,901 | 9,936 | 6,965 | 44,018 |
dataset_ner_fuzzy-regex_all_only-relevant_training | 4.8 MB | 10,622 | 545,005 | 16,471 | 10,182 | 6,289 | 98,843 |
dataset_ner_regests_training_automatically_tagged | 1.5 MB | 4,320 | 149,237 | 16,268 | 10,477 | 5,791 | 29,166 |
dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged | 1.3 MB | 2,274 | 140,706 | 14,874 | 8,590 | 6,284 | 39,612 |
dataset_ner_regests_training | 1.5 MB | 4,320 | 149,237 | 13,984 | 9,032 | 4,952 | 29,166 |
dataset_ner_fuzzy-regex_non-crossing_only-relevant_training | 3.5 MB | 9,136 | 399,691 | 13,340 | 8,225 | 5,115 | 77,963 |
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged | 1.1 MB | 2,420 | 110,376 | 12,902 | 7,592 | 5,310 | 33,352 |
dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged | 885.1 kB | 1,918 | 92,534 | 10,875 | 6,246 | 4,629 | 28,676 |
dataset_ner_manatee_all_only-relevant_training | 2.9 MB | 7,493 | 340,910 | 10,228 | 4,594 | 5,634 | 66,659 |
dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged | 913.3 kB | 1,972 | 97,069 | 10,180 | 5,592 | 4,588 | 28,324 |
dataset_ner_manatee_non-crossing_only-relevant_training | 2.2 MB | 6,493 | 251,927 | 8,599 | 3,877 | 4,722 | 51,600 |
dataset_ner_manatee_all_only-relevant_validation_automatically_tagged | 730.1 kB | 1,470 | 79,227 | 8,153 | 4,246 | 3,907 | 23,569 |
dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged | 683.4 kB | 1,751 | 71,948 | 8,136 | 4,501 | 3,635 | 22,133 |
dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged | 500.3 kB | 1,249 | 54,106 | 6,109 | 3,155 | 2,954 | 17,138 |
dataset_ner_fuzzy-regex+regests_all_only-relevant_validation | 1.4 MB | 2,776 | 158,548 | 4,421 | 2,817 | 1,604 | 44,018 |
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation | 998.7 kB | 2,420 | 110,376 | 3,938 | 2,519 | 1,419 | 33,352 |
dataset_ner_manatee+regests_all_only-relevant_validation | 862.5 kB | 1,972 | 97,069 | 3,347 | 1,887 | 1,460 | 28,324 |
dataset_ner_manatee+regests_non-crossing_only-relevant_validation | 646.8 kB | 1,751 | 71,948 | 3,094 | 1,774 | 1,320 | 22,133 |
dataset_ner_fuzzy-regex_all_only-relevant_testing | 1.3 MB | 2,405 | 144,684 | 2,780 | 1,784 | 996 | 39,977 |
dataset_ner_fuzzy-regex_all_only-relevant_validation | 1.2 MB | 2,274 | 140,706 | 2,655 | 1,657 | 998 | 39,612 |
dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing | 867.0 kB | 2,034 | 98,659 | 2,292 | 1,455 | 837 | 29,874 |
dataset_ner_regests_testing | 261.7 kB | 799 | 26,148 | 2,182 | 1,422 | 760 | 8,978 |
dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation | 818.4 kB | 1,918 | 92,534 | 2,172 | 1,359 | 813 | 28,676 |
dataset_ner_regests_validation_automatically_tagged | 183.1 kB | 502 | 17,842 | 2,027 | 1,346 | 681 | 6,445 |
dataset_ner_regests_validation | 181.7 kB | 502 | 17,842 | 1,766 | 1,160 | 606 | 6,445 |
dataset_ner_manatee_all_only-relevant_validation | 681.8 kB | 1,470 | 79,227 | 1,581 | 727 | 854 | 23,569 |
dataset_ner_manatee_all_only-relevant_testing | 678.8 kB | 1,420 | 78,751 | 1,529 | 695 | 834 | 23,949 |
dataset_ner_manatee_non-crossing_only-relevant_validation | 465.9 kB | 1,249 | 54,106 | 1,328 | 614 | 714 | 17,138 |
dataset_ner_manatee_non-crossing_only-relevant_testing | 469.1 kB | 1,208 | 54,391 | 1,283 | 587 | 696 | 17,713 |
dataset_ner_regests_testing_001-400 | 129.8 kB | 400 | 12,811 | 1,164 | 789 | 375 | 5,121 |
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged | 41.6 kB | 100 | 4,507 | 530 | 287 | 243 | 2,449 |
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged | 41.0 kB | 100 | 4,507 | 459 | 233 | 226 | 2,449 |
dataset_ner_manatee_non-crossing_only-relevant_testing_001-400 | 169.0 kB | 400 | 19,554 | 439 | 201 | 238 | 7,928 |
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 | 38.5 kB | 100 | 4,507 | 110 | 55 | 55 | 2,449 |
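Each tuple pairs a `*.sentences.txt` file with a `*.ner_tags.txt` file. The sketch below shows one way such a pair might be read and its `B-*` tags counted, as in the columns of Table 2. It rests on assumptions that are not confirmed by this README: that both files are UTF-8 text with one sentence per line, that tokens and tags are separated by whitespace, and that the i-th tag belongs to the i-th token. The function names are hypothetical.

```python
from collections import Counter
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple

TaggedSentence = Tuple[List[str], List[str]]


def read_tagged_sentences(sentences_path: str, ner_tags_path: str) -> Iterator[TaggedSentence]:
    """Yield (tokens, tags) pairs from two parallel files.

    ASSUMPTION: one sentence per line, whitespace-separated tokens and tags.
    """
    with open(sentences_path, encoding="utf-8") as sentences, \
         open(ner_tags_path, encoding="utf-8") as tags:
        for number, (sentence, tag_line) in enumerate(zip_longest(sentences, tags), start=1):
            if sentence is None or tag_line is None:
                raise ValueError(f"files differ in length at line {number}")
            tokens, labels = sentence.split(), tag_line.split()
            if len(tokens) != len(labels):
                raise ValueError(f"token/tag mismatch at line {number}")
            yield tokens, labels


def count_entity_starts(pairs: Iterable[TaggedSentence]) -> Counter:
    """Count tags that open an entity, such as B-PER and B-LOC."""
    counts: Counter = Counter()
    for _, labels in pairs:
        counts.update(label for label in labels if label.startswith("B-"))
    return counts
```

Given a pair of paths, `count_entity_starts(read_tagged_sentences(sentences_path, ner_tags_path))` returns per-tag counts whose sum would correspond to the `# B-* tags` column, provided the format assumptions hold.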
- The archive named-entity-recognition-annotations-large.zip (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `*.ner_tags.txt`.
  These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.
  Here are the four variables that we used to produce the different files:
  - The sentences are extracted from book OCR texts and may therefore span several pages. However, page boundaries contain pollutants such as running heads, footnotes, and page numbers. We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
  - The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
  - We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
  - We use an ensemble of a baseline model and weak fourth-generation NER models (`004`) or the final seventh-generation NER model (`007`).
Table 3: Dataset statistics from the archive named-entity-recognition-annotations-large.zip, ordered by the number of B-* tags. In the article describing the dataset, the files `dataset_mlm_non-crossing_only-relevant_*_automatically_tagged_007` are referred to as Books-Large and the files `dataset_mlm_all_all_training_automatically_tagged_007` are referred to as Books-Huge.
file | file size | # sentences | # tokens | # B-* tags | # B-PER tags | # B-LOC tags | # types |
---|---|---|---|---|---|---|---|
dataset_mlm_all_all_training_automatically_tagged_007 | 860.0 MB | 3,227,624 | 95,054,481 | 6,340,811 | 3,794,991 | 2,545,820 | 6,562,841 |
dataset_mlm_all_all_training_automatically_tagged_004 | 882.6 MB | 3,227,624 | 95,054,481 | 9,727,269 | 5,429,801 | 4,297,468 | 6,562,841 |
dataset_mlm_non-crossing_all_training_automatically_tagged_004 | 736.0 MB | 3,009,481 | 79,003,252 | 8,447,053 | 4,721,604 | 3,725,449 | 5,660,658 |
dataset_mlm_non-crossing_all_training_automatically_tagged_007 | 716.0 MB | 3,009,481 | 79,003,252 | 5,441,290 | 3,264,675 | 2,176,615 | 5,660,658 |
dataset_mlm_all_all_validation_automatically_tagged_004 | 114.0 MB | 402,179 | 12,240,756 | 1,201,467 | 659,139 | 542,328 | 1,319,365 |
dataset_mlm_all_all_validation_automatically_tagged_007 | 111.2 MB | 402,179 | 12,240,756 | 781,509 | 462,102 | 319,407 | 1,319,365 |
dataset_mlm_non-crossing_all_validation_automatically_tagged_004 | 94.0 MB | 372,880 | 10,061,113 | 1,035,283 | 571,082 | 464,201 | 1,141,033 |
dataset_mlm_non-crossing_all_validation_automatically_tagged_007 | 91.6 MB | 372,880 | 10,061,113 | 663,793 | 395,771 | 268,022 | 1,141,033 |
dataset_mlm_all_only-relevant_training_automatically_tagged_004 | 11.5 MB | 47,835 | 1,277,430 | 133,101 | 64,711 | 68,390 | 183,563 |
dataset_mlm_all_only-relevant_training_automatically_tagged_007 | 11.3 MB | 47,835 | 1,277,430 | 99,103 | 50,544 | 48,559 | 183,563 |
dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 | 9.6 MB | 44,155 | 1,066,545 | 116,176 | 55,996 | 60,180 | 158,622 |
dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 | 9.4 MB | 44,155 | 1,066,545 | 85,675 | 43,360 | 42,315 | 158,622 |
dataset_mlm_all_only-relevant_validation_automatically_tagged_004 | 1.0 MB | 2,786 | 107,609 | 8,937 | 4,125 | 4,812 | 27,019 |
dataset_mlm_all_only-relevant_validation_automatically_tagged_007 | 989.2 kB | 2,786 | 107,609 | 6,581 | 2,980 | 3,601 | 27,019 |
dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 | 754.3 kB | 2,484 | 80,619 | 7,290 | 3,380 | 3,910 | 22,087 |
dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 | 740.2 kB | 2,484 | 80,619 | 5,281 | 2,404 | 2,877 | 22,087 |
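In Table 3, the `004` and `007` variants of a file share the same sentences and tokens and differ only in the tags. One simple way to compare them is the number of `B-*` tags per thousand tokens. The sketch below does this arithmetic for the two `dataset_mlm_all_all_training_automatically_tagged_*` rows, with the figures copied from the table. The function name is hypothetical, and the ratio only describes how densely each model tags the text, not which tags are correct.

```python
# (number of tokens, number of B-* tags), copied from Table 3 for the files
# dataset_mlm_all_all_training_automatically_tagged_004 and ..._007.
ROWS = {
    "004": (95_054_481, 9_727_269),
    "007": (95_054_481, 6_340_811),
}


def tags_per_thousand_tokens(tokens: int, tags: int) -> float:
    """Return the number of entity-opening tags per 1,000 tokens."""
    if tokens <= 0:
        raise ValueError("the number of tokens must be positive")
    return 1000 * tags / tokens


if __name__ == "__main__":
    for generation, (tokens, tags) in ROWS.items():
        density = tags_per_thousand_tokens(tokens, tags)
        print(f"{generation}: {density:.1f} B-* tags per 1,000 tokens")
```

With these figures, the `004` ensemble assigns roughly 102.3 `B-*` tags per thousand tokens and the `007` model roughly 66.7.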
Corpus
The file corpus.vert.gz (1.3 GB compressed) contains a vertical file with the results of optical character recognition, named entity recognition, language identification, and lemmatization for all books in the AHISTO project database. See also the schema of the vertical file. (Warning: The corpus is a work in progress and may change. Last modified: 2023-05-25.)
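Because the corpus is large, it may be convenient to stream it instead of decompressing it first. The following is a minimal sketch, assuming the usual conventions of vertical files: UTF-8 text, one token per line with tab-separated attributes, and structural markup on lines of their own in the form of XML-like tags. The meaning and order of the attribute columns are defined by the schema of the vertical file and are not assumed here. The function name `iter_vertical_tokens` is hypothetical.

```python
import gzip
import re
from typing import Iterator, List

# A line consisting of a single XML-like tag, e.g. <doc id="1">, </s>, or <g/>.
# ASSUMPTION: structural markup always occupies a whole line of this shape.
STRUCTURE_LINE = re.compile(r"</?[A-Za-z_][^<>]*>")


def iter_vertical_tokens(path: str) -> Iterator[List[str]]:
    """Yield the tab-separated attributes of every token line in a gzipped vertical file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.rstrip("\n")
            if not line or STRUCTURE_LINE.fullmatch(line):
                continue  # skip blank lines and structural tags
            yield line.split("\t")
```

Iterating over `iter_vertical_tokens("corpus.vert.gz")` then produces one list of attribute values per token while keeping memory use low.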
Citing
An article describing our dataset is currently under review. A preprint is available on arXiv.