= A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents =

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era.[[BR]]The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER). You can [http://hdl.handle.net/11234/1-5024 download the dataset] from the LINDAT/CLARIAH-CZ repository.

== Contents ==

The dataset is structured as follows:

 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/language-modeling-corpus.zip?sequence=1&isAllowed=y language-modeling-corpus.zip] (633.79 MB) contains 8 files with sentences for the unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files (see the loading sketch after Table 1):
   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
   1. The sentences come either from all book pages (`all`) or just from those considered relevant by human annotators (`only-relevant`).
   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).

'''Table 1:''' Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.

|| ||= file size =||= # sentences =||= # tokens =||= # types =||
||=dataset_mlm_all_all_training =|| 630.7 MB|| 3,228,077|| 96,556,612|| 6,198,957||
||=dataset_mlm_non-crossing_all_training =|| 524.1 MB|| 3,009,931|| 80,220,907|| 5,362,515||
||=dataset_mlm_all_all_validation =|| 81.8 MB|| 402,184|| 12,374,044|| 1,273,737||
||=dataset_mlm_non-crossing_all_validation =|| 67.3 MB|| 372,885|| 10,157,799|| 1,105,583||
||=dataset_mlm_all_only-relevant_training =|| 8.1 MB|| 47,958|| 1,286,573|| 181,845||
||=dataset_mlm_non-crossing_only-relevant_training =|| 6.7 MB|| 44,278|| 1,074,734|| 157,354||
||=dataset_mlm_all_only-relevant_validation =|| 736.7 kB|| 2,791|| 108,364|| 26,986||
||=dataset_mlm_non-crossing_only-relevant_validation =|| 549.4 kB|| 2,489|| 81,293|| 22,090||
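For illustration, here is a minimal Python sketch of how one of the corpus files could be read. It assumes that the files are plain UTF-8 text with one sentence per line; this layout is an assumption rather than a documented guarantee, so verify it against the unpacked archive.

{{{#!python
# Hedged sketch: assumes plain UTF-8 text with one sentence per line.
def read_sentences(path):
    """Yield the non-empty sentences stored in one corpus file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.strip()
            if sentence:
                yield sentence

# Hypothetical file name following the naming scheme in Table 1.
sentences = list(read_sentences("dataset_mlm_non-crossing_only-relevant_validation"))
print(len(sentences), "sentences")
}}}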
 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-small.zip?sequence=2&isAllowed=y named-entity-recognition-annotations-small.zip] (978.29 MB) contains 82 tuples of files named `*.sentences.txt`, `*.ner_tags.txt`, and in one case also `*.docx`.^1^[[BR]]These files contain the “small” sentences and NER tags that we used for the supervised training, validation, and testing of our intermediate language models.[[BR]]Here are the five variables that we used to produce the different files (a reading sketch follows the footnote below):
   1. The sentences may originate from book OCR texts, where they were located using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or from both books and regests (`fuzzy-regex+regests` and `manatee+regests`).
   1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
   1. When sentences originate from book OCR texts, they may come from book pages of different relevance.[[BR]]We either use sentences from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
   1. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.[[BR]]Therefore, we also provide NER tags completed by language models (`automatically_tagged`) and by human annotators (`tagged`).
   1. We split the sentences roughly into 80% for training (`training`), 10% for validation (`validation`), and 10% for testing (`testing`).[[BR]]For repeated testing, we subdivide the testing split (`testing_001-400` and `testing_401-500`).

''^1^ The `*.docx` files were authored by human annotators and contain extra details missing from the `*.sentences.txt` and `*.ner_tags.txt` files. The extra details include nested entities such as locations in person names (e.g. “Blažek z __Kralup__”) and people in location names (e.g. “Kostel __sv. Martina__”).''
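For illustration, here is a minimal Python sketch of how one `*.sentences.txt` and `*.ner_tags.txt` pair could be read. It assumes that the two files are parallel line by line and that each tag line contains one space-separated BIO tag (e.g. `B-PER`, `I-LOC`, `O`, as suggested by the B-* statistics below) per whitespace token of the corresponding sentence; this layout is an assumption, not a documented guarantee, so verify it against the actual files.

{{{#!python
# Hedged sketch: assumes parallel, line-aligned files and whitespace
# tokenization with one BIO tag per token.

def read_ner_pair(sentences_path, ner_tags_path):
    """Yield (tokens, tags) pairs from one file tuple."""
    with open(sentences_path, encoding="utf-8") as sf, \
         open(ner_tags_path, encoding="utf-8") as tf:
        for sentence_line, tag_line in zip(sf, tf):
            tokens = sentence_line.split()
            tags = tag_line.split()
            assert len(tokens) == len(tags), "token/tag mismatch"
            yield tokens, tags

# Hypothetical file names following the naming scheme in Table 2.
pairs = list(read_ner_pair(
    "dataset_ner_regests_training.sentences.txt",
    "dataset_ner_regests_training.ner_tags.txt",
))
print(len(pairs), "annotated sentences")
}}}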
'''Table 2:''' Dataset statistics from the archive named-entity-recognition-annotations-small.zip, ordered by the number of B-* tags.

|| ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
||=dataset_ner_fuzzy-regex_all_all_training_automatically_tagged =|| 230.4 MB|| 407,395|| 24,585,832|| 2,669,582|| 1,403,789|| 1,265,793|| 2,420,836||
||=dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged =|| 231.6 MB|| 411,715|| 24,735,069|| 2,640,803|| 1,378,804|| 1,261,999|| 2,427,135||
||=dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged =|| 164.4 MB|| 353,301|| 17,387,149|| 2,065,805|| 1,100,245|| 965,560|| 1,850,210||
||=dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged =|| 162.9 MB|| 348,981|| 17,237,912|| 2,049,537|| 1,089,768|| 959,769|| 1,843,163||
||=dataset_ner_manatee+regests_all_all_training_automatically_tagged =|| 95.4 MB|| 158,759|| 10,155,332|| 1,175,031|| 563,912|| 611,119|| 1,267,107||
||=dataset_ner_manatee_all_all_training_automatically_tagged =|| 93.8 MB|| 154,439|| 10,006,095|| 1,158,763|| 553,435|| 605,328|| 1,258,983||
||=dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged =|| 64.5 MB|| 134,909|| 6,795,014|| 870,613|| 423,345|| 447,268|| 932,654||
||=dataset_ner_manatee_non-crossing_all_training_automatically_tagged =|| 63.0 MB|| 130,589|| 6,645,777|| 854,345|| 412,868|| 441,477|| 923,554||
||=dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged =|| 58.3 MB|| 81,651|| 6,211,198|| 685,020|| 356,017|| 329,003|| 910,379||
||=dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged =|| 58.1 MB|| 81,149|| 6,193,356|| 682,993|| 354,671|| 328,322|| 908,885||
||=dataset_ner_fuzzy-regex+regests_all_all_training =|| 218.0 MB|| 411,715|| 24,735,069|| 606,807|| 290,530|| 316,277|| 2,427,135||
||=dataset_ner_fuzzy-regex_all_all_training =|| 217.7 MB|| 407,395|| 24,585,832|| 592,822|| 281,497|| 311,325|| 2,420,836||
||=dataset_ner_fuzzy-regex+regests_non-crossing_all_training =|| 153.8 MB|| 353,301|| 17,387,149|| 494,302|| 238,381|| 255,921|| 1,850,210||
||=dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged =|| 37.9 MB|| 67,971|| 3,989,670|| 487,724|| 259,777|| 227,947|| 651,387||
||=dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged =|| 37.7 MB|| 67,469|| 3,971,828|| 485,697|| 258,431|| 227,266|| 649,698||
||=dataset_ner_fuzzy-regex_non-crossing_all_training =|| 153.1 MB|| 348,981|| 17,237,912|| 480,318|| 229,349|| 250,969|| 1,843,163||
||=dataset_ner_manatee+regests_all_all_validation_automatically_tagged =|| 21.0 MB|| 28,727|| 2,249,037|| 261,612|| 120,358|| 141,254|| 427,057||
||=dataset_ner_manatee_all_all_validation_automatically_tagged =|| 20.8 MB|| 28,225|| 2,231,195|| 259,585|| 119,012|| 140,573|| 425,088||
||=dataset_ner_manatee+regests_all_all_training =|| 88.9 MB|| 158,759|| 10,155,332|| 214,566|| 79,924|| 134,642|| 1,267,107||
||=dataset_ner_manatee_all_all_training =|| 87.9 MB|| 154,439|| 10,006,095|| 200,582|| 70,892|| 129,690|| 1,258,983||
||=dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged =|| 12.8 MB|| 23,643|| 1,348,859|| 176,809|| 83,699|| 93,110|| 293,119||
||=dataset_ner_manatee+regests_non-crossing_all_training =|| 59.8 MB|| 134,909|| 6,795,014|| 174,902|| 65,897|| 109,005|| 932,654||
||=dataset_ner_manatee_non-crossing_all_validation_automatically_tagged =|| 12.6 MB|| 23,141|| 1,331,017|| 174,782|| 82,353|| 92,429|| 290,894||
||=dataset_ner_manatee_non-crossing_all_training =|| 58.6 MB|| 130,589|| 6,645,777|| 160,918|| 56,865|| 104,053|| 923,554||
||=dataset_ner_fuzzy-regex+regests_all_all_validation =|| 54.2 MB|| 81,651|| 6,211,198|| 92,485|| 46,038|| 46,447|| 910,379||
||=dataset_ner_fuzzy-regex_all_all_testing =|| 54.2 MB|| 80,929|| 6,167,375|| 90,747|| 45,176|| 45,571|| 908,276||
||=dataset_ner_fuzzy-regex_all_all_validation =|| 54.4 MB|| 81,149|| 6,193,356|| 90,719|| 44,878|| 45,841|| 908,885||
||=dataset_ner_fuzzy-regex+regests_non-crossing_all_validation =|| 35.0 MB|| 67,971|| 3,989,670|| 75,207|| 37,496|| 37,711|| 651,387||
||=dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged =|| 6.6 MB|| 14,942|| 694,242|| 73,757|| 41,838|| 31,919|| 119,272||
||=dataset_ner_fuzzy-regex_non-crossing_all_testing =|| 34.8 MB|| 67,208|| 3,938,611|| 73,476|| 36,506|| 36,970|| 644,220||
||=dataset_ner_fuzzy-regex_non-crossing_all_validation =|| 35.1 MB|| 67,469|| 3,971,828|| 73,441|| 36,336|| 37,105|| 649,698||
||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged =|| 5.3 MB|| 13,456|| 548,928|| 61,522|| 35,007|| 26,515|| 99,275||
||=dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged =|| 5.1 MB|| 10,622|| 545,005|| 57,489|| 31,361|| 26,128|| 98,843||
||=dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged =|| 4.6 MB|| 11,813|| 490,147|| 51,653|| 28,315|| 23,338|| 88,535||
||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged =|| 3.7 MB|| 9,136|| 399,691|| 45,254|| 24,530|| 20,724|| 77,963||
||=dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged =|| 3.8 MB|| 10,813|| 401,164|| 44,213|| 24,435|| 19,778|| 74,376||
||=dataset_ner_manatee_all_only-relevant_training_automatically_tagged =|| 3.1 MB|| 7,493|| 340,910|| 34,247|| 17,193|| 17,054|| 66,659||
||=dataset_ner_manatee+regests_all_all_validation =|| 19.5 MB|| 28,727|| 2,249,037|| 32,546|| 12,999|| 19,547|| 427,057||
||=dataset_ner_manatee_all_all_testing =|| 19.9 MB|| 29,516|| 2,279,822|| 32,234|| 12,555|| 19,679|| 437,414||
||=dataset_ner_manatee_all_all_validation =|| 19.4 MB|| 28,225|| 2,231,195|| 30,780|| 11,839|| 18,941|| 425,088||
||=dataset_ner_fuzzy-regex+regests_all_only-relevant_training =|| 6.3 MB|| 14,942|| 694,242|| 30,455|| 19,214|| 11,241|| 119,272||
||=dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged =|| 2.3 MB|| 6,493|| 251,927|| 27,945|| 13,958|| 13,987|| 51,600||
||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training =|| 5.0 MB|| 13,456|| 548,928|| 27,324|| 17,257|| 10,067|| 99,275||
||=dataset_ner_manatee+regests_non-crossing_all_validation =|| 11.8 MB|| 23,643|| 1,348,859|| 26,287|| 10,498|| 15,789|| 293,119||
||=dataset_ner_manatee_non-crossing_all_testing =|| 12.2 MB|| 24,420|| 1,384,547|| 25,937|| 10,068|| 15,869|| 300,862||
||=dataset_ner_manatee_non-crossing_all_validation =|| 11.7 MB|| 23,141|| 1,331,017|| 24,521|| 9,338|| 15,183|| 290,894||
||=dataset_ner_manatee+regests_all_only-relevant_training =|| 4.4 MB|| 11,813|| 490,147|| 24,212|| 13,626|| 10,586|| 88,535||
||=dataset_ner_manatee+regests_non-crossing_only-relevant_training =|| 3.7 MB|| 10,813|| 401,164|| 22,583|| 12,909|| 9,674|| 74,376||
||=dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged =|| 1.5 MB|| 2,776|| 158,548|| 16,901|| 9,936|| 6,965|| 44,018||
||=dataset_ner_fuzzy-regex_all_only-relevant_training =|| 4.8 MB|| 10,622|| 545,005|| 16,471|| 10,182|| 6,289|| 98,843||
||=dataset_ner_regests_training_automatically_tagged =|| 1.5 MB|| 4,320|| 149,237|| 16,268|| 10,477|| 5,791|| 29,166||
||=dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged =|| 1.3 MB|| 2,274|| 140,706|| 14,874|| 8,590|| 6,284|| 39,612||
||=dataset_ner_regests_training =|| 1.5 MB|| 4,320|| 149,237|| 13,984|| 9,032|| 4,952|| 29,166||
||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_training =|| 3.5 MB|| 9,136|| 399,691|| 13,340|| 8,225|| 5,115|| 77,963||
||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged =|| 1.1 MB|| 2,420|| 110,376|| 12,902|| 7,592|| 5,310|| 33,352||
||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged =|| 885.1 kB|| 1,918|| 92,534|| 10,875|| 6,246|| 4,629|| 28,676||
||=dataset_ner_manatee_all_only-relevant_training =|| 2.9 MB|| 7,493|| 340,910|| 10,228|| 4,594|| 5,634|| 66,659||
||=dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged =|| 913.3 kB|| 1,972|| 97,069|| 10,180|| 5,592|| 4,588|| 28,324||
||=dataset_ner_manatee_non-crossing_only-relevant_training =|| 2.2 MB|| 6,493|| 251,927|| 8,599|| 3,877|| 4,722|| 51,600||
||=dataset_ner_manatee_all_only-relevant_validation_automatically_tagged =|| 730.1 kB|| 1,470|| 79,227|| 8,153|| 4,246|| 3,907|| 23,569||
||=dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged =|| 683.4 kB|| 1,751|| 71,948|| 8,136|| 4,501|| 3,635|| 22,133||
||=dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged =|| 500.3 kB|| 1,249|| 54,106|| 6,109|| 3,155|| 2,954|| 17,138||
||=dataset_ner_fuzzy-regex+regests_all_only-relevant_validation =|| 1.4 MB|| 2,776|| 158,548|| 4,421|| 2,817|| 1,604|| 44,018||
||=dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation =|| 998.7 kB|| 2,420|| 110,376|| 3,938|| 2,519|| 1,419|| 33,352||
||=dataset_ner_manatee+regests_all_only-relevant_validation =|| 862.5 kB|| 1,972|| 97,069|| 3,347|| 1,887|| 1,460|| 28,324||
||=dataset_ner_manatee+regests_non-crossing_only-relevant_validation =|| 646.8 kB|| 1,751|| 71,948|| 3,094|| 1,774|| 1,320|| 22,133||
||=dataset_ner_fuzzy-regex_all_only-relevant_testing =|| 1.3 MB|| 2,405|| 144,684|| 2,780|| 1,784|| 996|| 39,977||
||=dataset_ner_fuzzy-regex_all_only-relevant_validation =|| 1.2 MB|| 2,274|| 140,706|| 2,655|| 1,657|| 998|| 39,612||
||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing =|| 867.0 kB|| 2,034|| 98,659|| 2,292|| 1,455|| 837|| 29,874||
||=dataset_ner_regests_testing =|| 261.7 kB|| 799|| 26,148|| 2,182|| 1,422|| 760|| 8,978||
||=dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation =|| 818.4 kB|| 1,918|| 92,534|| 2,172|| 1,359|| 813|| 28,676||
||=dataset_ner_regests_validation_automatically_tagged =|| 183.1 kB|| 502|| 17,842|| 2,027|| 1,346|| 681|| 6,445||
||=dataset_ner_regests_validation =|| 181.7 kB|| 502|| 17,842|| 1,766|| 1,160|| 606|| 6,445||
||=dataset_ner_manatee_all_only-relevant_validation =|| 681.8 kB|| 1,470|| 79,227|| 1,581|| 727|| 854|| 23,569||
||=dataset_ner_manatee_all_only-relevant_testing =|| 678.8 kB|| 1,420|| 78,751|| 1,529|| 695|| 834|| 23,949||
||=dataset_ner_manatee_non-crossing_only-relevant_validation =|| 465.9 kB|| 1,249|| 54,106|| 1,328|| 614|| 714|| 17,138||
||=dataset_ner_manatee_non-crossing_only-relevant_testing =|| 469.1 kB|| 1,208|| 54,391|| 1,283|| 587|| 696|| 17,713||
||=dataset_ner_regests_testing_001-400 =|| 129.8 kB|| 400|| 12,811|| 1,164|| 789|| 375|| 5,121||
||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged =|| 41.6 kB|| 100|| 4,507|| 530|| 287|| 243|| 2,449||
||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged =|| 41.0 kB|| 100|| 4,507|| 459|| 233|| 226|| 2,449||
||=dataset_ner_manatee_non-crossing_only-relevant_testing_001-400 =|| 169.0 kB|| 400|| 19,554|| 439|| 201|| 238|| 7,928||
||=dataset_ner_manatee_non-crossing_only-relevant_testing_401-500 =|| 38.5 kB|| 100|| 4,507|| 110|| 55|| 55|| 2,449||
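As a usage example, the following Python sketch recomputes the kind of per-file statistics reported in Tables 2 and 3 (sentence, token, type, and B-* tag counts) from one sentence/tag file pair. It assumes the line-aligned, whitespace-tokenized layout described above, so the resulting numbers may differ from the published ones if the authors' tokenization differs.

{{{#!python
# Hedged sketch: recompute table-style statistics for one file pair,
# assuming one whitespace-tokenized sentence (and tag sequence) per line.
from collections import Counter

def dataset_statistics(sentences_path, ner_tags_path):
    n_sentences, n_tokens = 0, 0
    types = set()           # distinct token forms
    tag_counts = Counter()  # counts of B-* tags by entity class
    with open(sentences_path, encoding="utf-8") as sf, \
         open(ner_tags_path, encoding="utf-8") as tf:
        for sentence_line, tag_line in zip(sf, tf):
            tokens = sentence_line.split()
            n_sentences += 1
            n_tokens += len(tokens)
            types.update(tokens)
            tag_counts.update(t for t in tag_line.split() if t.startswith("B-"))
    return {
        "# sentences": n_sentences,
        "# tokens": n_tokens,
        "# types": len(types),
        "# B-* tags": sum(tag_counts.values()),
        "# B-PER tags": tag_counts["B-PER"],
        "# B-LOC tags": tag_counts["B-LOC"],
    }

# Hypothetical file names following the naming scheme in Table 2.
print(dataset_statistics(
    "dataset_ner_regests_validation.sentences.txt",
    "dataset_ner_regests_validation.ner_tags.txt",
))
}}}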
 * The archive [https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-5024/named-entity-recognition-annotations-large.zip?sequence=3&isAllowed=y named-entity-recognition-annotations-large.zip] (1.31 GB) contains 16 tuples of files named `*.sentences.txt` and `*.ner_tags.txt`.[[BR]]These files contain sentences and NER tags for the supervised training and validation of language models; we produced the NER tags with our language models.[[BR]]Here are the four variables that we used to produce the different files:
   1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
   1. The sentences come either from all book pages (`all`) or just from those considered relevant by human annotators (`only-relevant`).
   1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
   1. We tagged the sentences either with an ensemble of a baseline model and weak fourth-generation NER models (`004`) or with the final seventh-generation NER model (`007`).

'''Table 3:''' Dataset statistics from the archive named-entity-recognition-annotations-large.zip, ordered by the number of B-* tags.

|| ||= file size =||= # sentences =||= # tokens =||= # B-* tags =||= # B-PER tags =||= # B-LOC tags =||= # types =||
||=dataset_mlm_all_all_training_automatically_tagged_007 =|| 860.0 MB|| 3,227,624|| 95,054,481|| 6,340,811|| 3,794,991|| 2,545,820|| 6,562,841||
||=dataset_mlm_all_all_training_automatically_tagged_004 =|| 882.6 MB|| 3,227,624|| 95,054,481|| 9,727,269|| 5,429,801|| 4,297,468|| 6,562,841||
||=dataset_mlm_non-crossing_all_training_automatically_tagged_004 =|| 736.0 MB|| 3,009,481|| 79,003,252|| 8,447,053|| 4,721,604|| 3,725,449|| 5,660,658||
||=dataset_mlm_non-crossing_all_training_automatically_tagged_007 =|| 716.0 MB|| 3,009,481|| 79,003,252|| 5,441,290|| 3,264,675|| 2,176,615|| 5,660,658||
||=dataset_mlm_all_all_validation_automatically_tagged_004 =|| 114.0 MB|| 402,179|| 12,240,756|| 1,201,467|| 659,139|| 542,328|| 1,319,365||
||=dataset_mlm_all_all_validation_automatically_tagged_007 =|| 111.2 MB|| 402,179|| 12,240,756|| 781,509|| 462,102|| 319,407|| 1,319,365||
||=dataset_mlm_non-crossing_all_validation_automatically_tagged_004 =|| 94.0 MB|| 372,880|| 10,061,113|| 1,035,283|| 571,082|| 464,201|| 1,141,033||
||=dataset_mlm_non-crossing_all_validation_automatically_tagged_007 =|| 91.6 MB|| 372,880|| 10,061,113|| 663,793|| 395,771|| 268,022|| 1,141,033||
||=dataset_mlm_all_only-relevant_training_automatically_tagged_004 =|| 11.5 MB|| 47,835|| 1,277,430|| 133,101|| 64,711|| 68,390|| 183,563||
||=dataset_mlm_all_only-relevant_training_automatically_tagged_007 =|| 11.3 MB|| 47,835|| 1,277,430|| 99,103|| 50,544|| 48,559|| 183,563||
||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004 =|| 9.6 MB|| 44,155|| 1,066,545|| 116,176|| 55,996|| 60,180|| 158,622||
||=dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007 =|| 9.4 MB|| 44,155|| 1,066,545|| 85,675|| 43,360|| 42,315|| 158,622||
||=dataset_mlm_all_only-relevant_validation_automatically_tagged_004 =|| 1.0 MB|| 2,786|| 107,609|| 8,937|| 4,125|| 4,812|| 27,019||
||=dataset_mlm_all_only-relevant_validation_automatically_tagged_007 =|| 989.2 kB|| 2,786|| 107,609|| 6,581|| 2,980|| 3,601|| 27,019||
||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004 =|| 754.3 kB|| 2,484|| 80,619|| 7,290|| 3,380|| 3,910|| 22,087||
||=dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007 =|| 740.2 kB|| 2,484|| 80,619|| 5,281|| 2,404|| 2,877|| 22,087||

== Citing ==

If you use our dataset in your work, please cite the following article:

TODO

If you use LaTeX, you can use the following BibTeX entry:

{{{
TODO
}}}