| 124 | * The archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip] (1.3 GB) contains 16 pairs of files named `*.sentences.txt` and `*.ner_tags.txt`. These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]We produced the different files by varying the following four variables: |
| 125 | 1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`). |
| 126 | 1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`). |
| 127 | 1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`). |
| 128 | 1. We tag the sentences automatically using either an ensemble of a baseline model and weak fourth-generation NER models (`004`) or the final seventh-generation NER model (`007`). |
| 129 | |
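The four variables above have two values each, which gives the 2 × 2 × 2 × 2 = 16 file pairs in the archive. The following minimal sketch enumerates the dataset names, assuming the naming pattern seen in the row labels of Table 3. The function name `dataset_basenames` is hypothetical, and it is also an assumption that the `.sentences.txt` and `.ner_tags.txt` suffixes are appended directly to these names.

```python
from itertools import product

# The two values of each variable, as they appear in the dataset names.
CROSSING = ("all", "non-crossing")     # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")   # all pages or only relevant pages
SPLIT = ("training", "validation")     # roughly 90% / 10%
MODEL = ("004", "007")                 # generation of the tagging model


def dataset_basenames():
    """Return all 16 dataset names (hypothetical helper, assumed pattern)."""
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}_automatically_tagged_{model}"
        for crossing, relevance, split, model
        in product(CROSSING, RELEVANCE, SPLIT, MODEL)
    ]


names = dataset_basenames()
for name in names:
    # Assumed file names of one pair; check them against the archive contents.
    print(f"{name}.sentences.txt", f"{name}.ner_tags.txt")
```
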
| 130 | '''Table 3:''' Dataset statistics from the archive [https://nlp.fi.muni.cz/projekty/ahisto/named-entity-recognition-annotations-large.zip named-entity-recognition-annotations-large.zip], ordered by the number of sentences. |
| 131 | |
| 132 | || ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =|| |
| 133 | ||=dataset_mlm_all_all_training_automatically_tagged_007 =|| 860.0 MB|| 3,227,624|| 95,054,481|| 6,340,811|| 3,794,991|| 2,545,820|| 6,562,841|| |
| 134 | ||=dataset_mlm_all_all_training_automatically_tagged_004 =|| 882.6 MB|| 3,227,624|| 95,054,481|| 9,727,269|| 5,429,801|| 4,297,468|| 6,562,841|| |
| 135 | ||=dataset_mlm_non-crossing_all_training_automatically_tagged_004 =|| 736.0 MB|| 3,009,481|| 79,003,252|| 8,447,053|| 4,721,604|| 3,725,449|| 5,660,658|| |
| 136 | ||=dataset_mlm_non-crossing_all_training_automatically_tagged_007 =|| 716.0 MB|| 3,009,481|| 79,003,252|| 5,441,290|| 3,264,675|| 2,176,615|| 5,660,658|| |
| 137 | ||=dataset_mlm_all_all_validation_automatically_tagged_004 =|| 114.0 MB|| 402,179|| 12,240,756|| 1,201,467|| 659,139|| 542,328|| 1,319,365|| |
| 138 | ||=dataset_mlm_all_all_validation_automatically_tagged_007 =|| 111.2 MB|| 402,179|| 12,240,756|| 781,509|| 462,102|| 319,407|| 1,319,365|| |
| 139 | ||=dataset_mlm_non-crossing_all_validation_automatically_tagged_004 =|| 94.0 MB|| 372,880|| 10,061,113|| 1,035,283|| 571,082|| 464,201|| 1,141,033|| |
| 140 | ||=dataset_mlm_non-crossing_all_validation_automatically_tagged_007 =|| 91.6 MB|| 372,880|| 10,061,113|| 663,793|| 395,771|| 268,022|| 1,141,033|| |
| 141 | ||=dataset_mlm_all_only-relevant_training_automatically_tagged_004 =|| 11.5 MB|| 47,835|| 1,277,430|| 133,101|| 64,711|| 68,390|| 183,563|| |
| 142 | ||=dataset_mlm_all_only-relevant_training_automatically_tagged_007 =|| 11.3 MB|| 47,835|| 1,277,430|| 99,103|| 50,544|| 48,559|| 183,563|| |
| 143 | ||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 =|| 9.6 MB|| 44,155|| 1,066,545|| 116,176|| 55,996|| 60,180|| 158,622|| |
| 144 | ||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 =|| 9.4 MB|| 44,155|| 1,066,545|| 85,675|| 43,360|| 42,315|| 158,622|| |
| 145 | ||=dataset_mlm_all_only-relevant_validation_automatically_tagged_004 =|| 1.0 MB|| 2,786|| 107,609|| 8,937|| 4,125|| 4,812|| 27,019|| |
| 146 | ||=dataset_mlm_all_only-relevant_validation_automatically_tagged_007 =|| 989.2 kB|| 2,786|| 107,609|| 6,581|| 2,980|| 3,601|| 27,019|| |
| 147 | ||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 80,619|| 7,290|| 3,380|| 3,910|| 22,087|| |
| 148 | ||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 80,619|| 5,281|| 2,404|| 2,877|| 22,087|| |
| 149 | |
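Statistics like those in Table 3 could be recomputed from one file pair roughly as follows. This is a sketch under assumptions that the page does not confirm: that each line of a `*.sentences.txt` file holds one sentence of whitespace-separated tokens, that the matching line of the `*.ner_tags.txt` file holds one tag per token, and that "# types" means the number of distinct tokens. The function name `dataset_statistics` is hypothetical.

```python
from collections import Counter


def dataset_statistics(sentence_lines, tag_lines):
    """Count sentences, tokens, B-* tags, and types (assumed file format)."""
    num_sentences = 0
    num_tokens = 0
    begin_tags = Counter()  # counts of B-PER, B-LOC, ...
    types = set()           # distinct tokens (assumed meaning of "# types")
    for sentence, tags in zip(sentence_lines, tag_lines):
        tokens = sentence.split()
        labels = tags.split()
        if len(tokens) != len(labels):
            raise ValueError("token and tag counts differ in a sentence")
        num_sentences += 1
        num_tokens += len(tokens)
        types.update(tokens)
        begin_tags.update(label for label in labels if label.startswith("B-"))
    return {
        "sentences": num_sentences,
        "tokens": num_tokens,
        "B-*": sum(begin_tags.values()),
        "B-PER": begin_tags["B-PER"],
        "B-LOC": begin_tags["B-LOC"],
        "types": len(types),
    }


# Toy in-memory example; real input would be the lines of one file pair.
sentences = ["Jan Hus was born in Husinec .", "Hus taught in Prague ."]
tags = ["B-PER I-PER O O O B-LOC O", "B-PER O O B-LOC O"]
stats = dataset_statistics(sentences, tags)
print(stats)
```

In every row of Table 3, the B-PER and B-LOC counts sum to the B-* count, so these two entity types account for all B-* tags in the datasets.
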