
A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era.
The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).

You can download the dataset from the LINDAT/CLARIAH-CZ repository.

The dataset is structured as follows:

The archive language-modeling-corpus.zip (633.79 MB) contains 8 files with sentences for unsupervised training and validation of language models.
We used the following three variables to produce the different files:
1. The sentences are extracted from book OCR texts and may therefore span several pages.
  However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
  We either allow the sentences to cross page boundaries (all) or not (non-crossing).
2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).
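
The three variables above combine into eight base file names. A minimal Python sketch of that combination follows; it assumes the naming pattern dataset_mlm_<crossing>_<relevance>_<split> seen in the file listing, and it leaves out file extensions, which are not stated here. The function name mlm_file_names is hypothetical.

```python
from itertools import product

# Values of the three variables, as they appear in the file names.
CROSSING = ("all", "non-crossing")      # may sentences cross page boundaries?
RELEVANCE = ("all", "only-relevant")    # which book pages are used?
SPLIT = ("training", "validation")      # data split


def mlm_file_names():
    """Return the base names of all variable combinations (assumed pattern)."""
    return [
        f"dataset_mlm_{crossing}_{relevance}_{split}"
        for crossing, relevance, split in product(CROSSING, RELEVANCE, SPLIT)
    ]


if __name__ == "__main__":
    for name in mlm_file_names():
        print(name)
```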

Table 1: Dataset statistics from the archive language-modeling-corpus.zip, ordered by file size.

file	file size	# sentences	# tokens	# types
dataset_mlm_all_all_training	630.7 MB	3,228,077	96,556,612	6,198,957
dataset_mlm_non-crossing_all_training	524.1 MB	3,009,931	80,220,907	5,362,515
dataset_mlm_all_all_validation	81.8 MB	402,184	12,374,044	1,273,737
dataset_mlm_non-crossing_all_validation	67.3 MB	372,885	10,157,799	1,105,583
dataset_mlm_all_only-relevant_training	8.1 MB	47,958	1,286,573	181,845
dataset_mlm_non-crossing_only-relevant_training	6.7 MB	44,278	1,074,734	157,354
dataset_mlm_all_only-relevant_validation	736.7 kB	2,791	108,364	26,986
dataset_mlm_non-crossing_only-relevant_validation	549.4 kB	2,489	81,293	22,090
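
The "roughly 90%" training share can be checked against the sentence counts in Table 1. The following Python sketch copies the counts from the table; the dictionary keys and the helper training_share are names chosen for illustration only.

```python
# Sentence counts copied from Table 1: (training, validation).
SENTENCES = {
    "all_all": (3_228_077, 402_184),
    "non-crossing_all": (3_009_931, 372_885),
    "all_only-relevant": (47_958, 2_791),
    "non-crossing_only-relevant": (44_278, 2_489),
}


def training_share(training, validation):
    """Fraction of sentences that fall into the training split."""
    return training / (training + validation)


if __name__ == "__main__":
    for variant, (training, validation) in SENTENCES.items():
        share = training_share(training, validation)
        print(f"{variant}: {share:.1%} of sentences are in the training split")
```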

The archive named-entity-recognition-annotations-small.zip (978.29 MB) contains 82 tuples of files named *.sentences.txt, *.ner_tags.txt, and in one case also *.docx.¹
These files contain the “small” sentences and NER tags that we used for the supervised training, validation, and testing of our intermediate language models.
Here are the five variables that we used to produce the different files:
1. The sentences may originate from book OCR texts using information retrieval techniques (fuzzy-regex or manatee).
  The sentences may also originate from regests (regests) or both books and regests (fuzzy-regex+regests and manatee+regests).
2. When sentences originate from book OCR texts, they may span several pages of a book.
  However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
  We either allow the sentences to cross page boundaries (all) or not (non-crossing).
3. When sentences originate from book OCR texts, they may come from book pages of different relevance.
  We either use sentences from all book pages (all) or just those considered relevant by human annotators (only-relevant).
4. When sentences and NER tags originate from book OCR texts using information retrieval techniques, many entities in the sentences may lack tags.
  Therefore, we also provide NER tags completed by language models (automatically_tagged) and human annotators (tagged).
5. We split the sentences roughly into 80% for training (training), 10% for validation (validation), and 10% for testing (testing).
  For repeated testing, we subdivide the testing split (testing_001-400 and testing_401-500).
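
Each tuple pairs a *.sentences.txt file with a *.ner_tags.txt file. The Python sketch below reads such a pair side by side. It rests on an assumption that this page does not confirm: that line i of both files describes the same sentence, with whitespace-separated tokens in one file and one tag per token in the other. The function name read_tagged_sentences is hypothetical.

```python
from pathlib import Path


def read_tagged_sentences(sentences_path, tags_path):
    """Yield (tokens, tags) pairs from a pair of parallel files.

    Assumption: line i of both files describes the same sentence, and both
    lines split on whitespace into equally many items.
    """
    with Path(sentences_path).open(encoding="utf-8") as sentences, \
         Path(tags_path).open(encoding="utf-8") as tags:
        # zip() stops at the shorter file; surplus lines are ignored.
        for number, (sentence, tag_line) in enumerate(zip(sentences, tags), start=1):
            tokens, labels = sentence.split(), tag_line.split()
            if len(tokens) != len(labels):
                raise ValueError(
                    f"line {number}: {len(tokens)} tokens but {len(labels)} tags"
                )
            yield tokens, labels
```

Raising an error on a length mismatch, rather than silently truncating, makes it obvious when the format assumption does not hold for a given file.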

¹The .docx files were authored by human annotators and contain extra details missing from files .sentences.txt and .ner_tags.txt. The extra details include nested entities such as locations in person names (e.g. “Blažek z Kralup”) and people in location names (e.g. “Kostel sv. Martina”).

Table 2: Dataset statistics from the archive named-entity-recognition-annotations-small.zip, ordered by the number of B-* tags. In the article describing the dataset, the files dataset_ner_regests_training_* are referred to as Abstracts-Tiny, the files dataset_ner_manatee_non-crossing_only-relevant_* are referred to as Books-Small, and the files dataset_ner_manatee_non-crossing_only-relevant_*_automatically_tagged are referred to as Books-Medium.

file	file size	# sentences	# tokens	# B-* tags	# B-PER tags	# B-LOC tags	# types
dataset_ner_fuzzy-regex_all_all_training_automatically_tagged	230.4 MB	407,395	24,585,832	2,669,582	1,403,789	1,265,793	2,420,836
dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged	231.6 MB	411,715	24,735,069	2,640,803	1,378,804	1,261,999	2,427,135
dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged	164.4 MB	353,301	17,387,149	2,065,805	1,100,245	965,560	1,850,210
dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged	162.9 MB	348,981	17,237,912	2,049,537	1,089,768	959,769	1,843,163
dataset_ner_manatee+regests_all_all_training_automatically_tagged	95.4 MB	158,759	10,155,332	1,175,031	563,912	611,119	1,267,107
dataset_ner_manatee_all_all_training_automatically_tagged	93.8 MB	154,439	10,006,095	1,158,763	553,435	605,328	1,258,983
dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged	64.5 MB	134,909	6,795,014	870,613	423,345	447,268	932,654
dataset_ner_manatee_non-crossing_all_training_automatically_tagged	63.0 MB	130,589	6,645,777	854,345	412,868	441,477	923,554
dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged	58.3 MB	81,651	6,211,198	685,020	356,017	329,003	910,379
dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged	58.1 MB	81,149	6,193,356	682,993	354,671	328,322	908,885
dataset_ner_fuzzy-regex+regests_all_all_training	218.0 MB	411,715	24,735,069	606,807	290,530	316,277	2,427,135
dataset_ner_fuzzy-regex_all_all_training	217.7 MB	407,395	24,585,832	592,822	281,497	311,325	2,420,836
dataset_ner_fuzzy-regex+regests_non-crossing_all_training	153.8 MB	353,301	17,387,149	494,302	238,381	255,921	1,850,210
dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged	37.9 MB	67,971	3,989,670	487,724	259,777	227,947	651,387
dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged	37.7 MB	67,469	3,971,828	485,697	258,431	227,266	649,698
dataset_ner_fuzzy-regex_non-crossing_all_training	153.1 MB	348,981	17,237,912	480,318	229,349	250,969	1,843,163
dataset_ner_manatee+regests_all_all_validation_automatically_tagged	21.0 MB	28,727	2,249,037	261,612	120,358	141,254	427,057
dataset_ner_manatee_all_all_validation_automatically_tagged	20.8 MB	28,225	2,231,195	259,585	119,012	140,573	425,088
dataset_ner_manatee+regests_all_all_training	88.9 MB	158,759	10,155,332	214,566	79,924	134,642	1,267,107
dataset_ner_manatee_all_all_training	87.9 MB	154,439	10,006,095	200,582	70,892	129,690	1,258,983
dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged	12.8 MB	23,643	1,348,859	176,809	83,699	93,110	293,119
dataset_ner_manatee+regests_non-crossing_all_training	59.8 MB	134,909	6,795,014	174,902	65,897	109,005	932,654
dataset_ner_manatee_non-crossing_all_validation_automatically_tagged	12.6 MB	23,141	1,331,017	174,782	82,353	92,429	290,894
dataset_ner_manatee_non-crossing_all_training	58.6 MB	130,589	6,645,777	160,918	56,865	104,053	923,554
dataset_ner_fuzzy-regex+regests_all_all_validation	54.2 MB	81,651	6,211,198	92,485	46,038	46,447	910,379
dataset_ner_fuzzy-regex_all_all_testing	54.2 MB	80,929	6,167,375	90,747	45,176	45,571	908,276
dataset_ner_fuzzy-regex_all_all_validation	54.4 MB	81,149	6,193,356	90,719	44,878	45,841	908,885
dataset_ner_fuzzy-regex+regests_non-crossing_all_validation	35.0 MB	67,971	3,989,670	75,207	37,496	37,711	651,387
dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged	6.6 MB	14,942	694,242	73,757	41,838	31,919	119,272
dataset_ner_fuzzy-regex_non-crossing_all_testing	34.8 MB	67,208	3,938,611	73,476	36,506	36,970	644,220
dataset_ner_fuzzy-regex_non-crossing_all_validation	35.1 MB	67,469	3,971,828	73,441	36,336	37,105	649,698
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged	5.3 MB	13,456	548,928	61,522	35,007	26,515	99,275
dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged	5.1 MB	10,622	545,005	57,489	31,361	26,128	98,843
dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged	4.6 MB	11,813	490,147	51,653	28,315	23,338	88,535
dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged	3.7 MB	9,136	399,691	45,254	24,530	20,724	77,963
dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged	3.8 MB	10,813	401,164	44,213	24,435	19,778	74,376
dataset_ner_manatee_all_only-relevant_training_automatically_tagged	3.1 MB	7,493	340,910	34,247	17,193	17,054	66,659
dataset_ner_manatee+regests_all_all_validation	19.5 MB	28,727	2,249,037	32,546	12,999	19,547	427,057
dataset_ner_manatee_all_all_testing	19.9 MB	29,516	2,279,822	32,234	12,555	19,679	437,414
dataset_ner_manatee_all_all_validation	19.4 MB	28,225	2,231,195	30,780	11,839	18,941	425,088
dataset_ner_fuzzy-regex+regests_all_only-relevant_training	6.3 MB	14,942	694,242	30,455	19,214	11,241	119,272
dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged	2.3 MB	6,493	251,927	27,945	13,958	13,987	51,600
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training	5.0 MB	13,456	548,928	27,324	17,257	10,067	99,275
dataset_ner_manatee+regests_non-crossing_all_validation	11.8 MB	23,643	1,348,859	26,287	10,498	15,789	293,119
dataset_ner_manatee_non-crossing_all_testing	12.2 MB	24,420	1,384,547	25,937	10,068	15,869	300,862
dataset_ner_manatee_non-crossing_all_validation	11.7 MB	23,141	1,331,017	24,521	9,338	15,183	290,894
dataset_ner_manatee+regests_all_only-relevant_training	4.4 MB	11,813	490,147	24,212	13,626	10,586	88,535
dataset_ner_manatee+regests_non-crossing_only-relevant_training	3.7 MB	10,813	401,164	22,583	12,909	9,674	74,376
dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged	1.5 MB	2,776	158,548	16,901	9,936	6,965	44,018
dataset_ner_fuzzy-regex_all_only-relevant_training	4.8 MB	10,622	545,005	16,471	10,182	6,289	98,843
dataset_ner_regests_training_automatically_tagged	1.5 MB	4,320	149,237	16,268	10,477	5,791	29,166
dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged	1.3 MB	2,274	140,706	14,874	8,590	6,284	39,612
dataset_ner_regests_training	1.5 MB	4,320	149,237	13,984	9,032	4,952	29,166
dataset_ner_fuzzy-regex_non-crossing_only-relevant_training	3.5 MB	9,136	399,691	13,340	8,225	5,115	77,963
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged	1.1 MB	2,420	110,376	12,902	7,592	5,310	33,352
dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged	885.1 kB	1,918	92,534	10,875	6,246	4,629	28,676
dataset_ner_manatee_all_only-relevant_training	2.9 MB	7,493	340,910	10,228	4,594	5,634	66,659
dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged	913.3 kB	1,972	97,069	10,180	5,592	4,588	28,324
dataset_ner_manatee_non-crossing_only-relevant_training	2.2 MB	6,493	251,927	8,599	3,877	4,722	51,600
dataset_ner_manatee_all_only-relevant_validation_automatically_tagged	730.1 kB	1,470	79,227	8,153	4,246	3,907	23,569
dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged	683.4 kB	1,751	71,948	8,136	4,501	3,635	22,133
dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged	500.3 kB	1,249	54,106	6,109	3,155	2,954	17,138
dataset_ner_fuzzy-regex+regests_all_only-relevant_validation	1.4 MB	2,776	158,548	4,421	2,817	1,604	44,018
dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation	998.7 kB	2,420	110,376	3,938	2,519	1,419	33,352
dataset_ner_manatee+regests_all_only-relevant_validation	862.5 kB	1,972	97,069	3,347	1,887	1,460	28,324
dataset_ner_manatee+regests_non-crossing_only-relevant_validation	646.8 kB	1,751	71,948	3,094	1,774	1,320	22,133
dataset_ner_fuzzy-regex_all_only-relevant_testing	1.3 MB	2,405	144,684	2,780	1,784	996	39,977
dataset_ner_fuzzy-regex_all_only-relevant_validation	1.2 MB	2,274	140,706	2,655	1,657	998	39,612
dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing	867.0 kB	2,034	98,659	2,292	1,455	837	29,874
dataset_ner_regests_testing	261.7 kB	799	26,148	2,182	1,422	760	8,978
dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation	818.4 kB	1,918	92,534	2,172	1,359	813	28,676
dataset_ner_regests_validation_automatically_tagged	183.1 kB	502	17,842	2,027	1,346	681	6,445
dataset_ner_regests_validation	181.7 kB	502	17,842	1,766	1,160	606	6,445
dataset_ner_manatee_all_only-relevant_validation	681.8 kB	1,470	79,227	1,581	727	854	23,569
dataset_ner_manatee_all_only-relevant_testing	678.8 kB	1,420	78,751	1,529	695	834	23,949
dataset_ner_manatee_non-crossing_only-relevant_validation	465.9 kB	1,249	54,106	1,328	614	714	17,138
dataset_ner_manatee_non-crossing_only-relevant_testing	469.1 kB	1,208	54,391	1,283	587	696	17,713
dataset_ner_regests_testing_001-400	129.8 kB	400	12,811	1,164	789	375	5,121
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged	41.6 kB	100	4,507	530	287	243	2,449
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged	41.0 kB	100	4,507	459	233	226	2,449
dataset_ner_manatee_non-crossing_only-relevant_testing_001-400	169.0 kB	400	19,554	439	201	238	7,928
dataset_ner_manatee_non-crossing_only-relevant_testing_401-500	38.5 kB	100	4,507	110	55	55	2,449
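
The B-* columns of Table 2 count tags that open an entity, and the # B-* tags column appears to be the sum of the B-PER and B-LOC columns. A Python sketch of such a count follows. It assumes tags in a B-/I-/O scheme with the prefixes B-PER and B-LOC named in the table header; the helper count_entity_starts and the example tag sequences are made up for illustration.

```python
from collections import Counter


def count_entity_starts(tag_sequences):
    """Count tags that open an entity (B-PER, B-LOC, ...) over all sentences."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(tag for tag in tags if tag.startswith("B-"))
    return counts


if __name__ == "__main__":
    # Made-up tag sequences for illustration, not taken from the dataset.
    example = [["B-PER", "I-PER", "I-PER", "O"], ["O", "B-LOC", "O", "B-PER"]]
    counts = count_entity_starts(example)
    print("B-* tags:", sum(counts.values()))
    print("B-PER tags:", counts["B-PER"])
    print("B-LOC tags:", counts["B-LOC"])
```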

The archive named-entity-recognition-annotations-large.zip (1.31 GB) contains 16 tuples of files named *.sentences.txt and *.ner_tags.txt.
These files contain sentences and NER tags for supervised training, validation, and testing of language models. We produced them with our language models.
Here are the four variables that we used to produce the different files:
1. The sentences are extracted from book OCR texts and may therefore span several pages.
  However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.
  We either allow the sentences to cross page boundaries (all) or not (non-crossing).
2. The sentences come from all book pages (all) or just those considered relevant by human annotators (only-relevant).
3. We split the sentences roughly into 90% for training (training) and 10% for validation (validation).
4. We use an ensemble of a baseline model and weak fourth-generation NER models (004) or the final seventh-generation NER model (007).
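
The four variables can be recovered from a base file name. The Python sketch below does so with a regular expression; the pattern is an assumption derived from the file names listed for this archive, and the names NAME_PATTERN and parse_large_name are hypothetical.

```python
import re

# Assumed pattern, derived from the file names listed for this archive.
NAME_PATTERN = re.compile(
    r"dataset_mlm_(?P<crossing>all|non-crossing)"
    r"_(?P<relevance>all|only-relevant)"
    r"_(?P<split>training|validation)"
    r"_automatically_tagged_(?P<generation>004|007)"
)


def parse_large_name(name):
    """Split a base file name into its four variables, or raise ValueError."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_large_name("dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007"))
```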

Table 3: Dataset statistics from the archive named-entity-recognition-annotations-large.zip, ordered by the number of B-* tags. In the article describing the dataset, the files dataset_mlm_non-crossing_only-relevant_*_automatically_tagged_007 are referred to as Books-Large and the files dataset_mlm_all_all_training_automatically_tagged_007 are referred to as Books-Huge.

file	file size	# sentences	# tokens	# B-* tags	# B-PER tags	# B-LOC tags	# types
dataset_mlm_all_all_training_automatically_tagged_007	860.0 MB	3,227,624	95,054,481	6,340,811	3,794,991	2,545,820	6,562,841
dataset_mlm_all_all_training_automatically_tagged_004	882.6 MB	3,227,624	95,054,481	9,727,269	5,429,801	4,297,468	6,562,841
dataset_mlm_non-crossing_all_training_automatically_tagged_004	736.0 MB	3,009,481	79,003,252	8,447,053	4,721,604	3,725,449	5,660,658
dataset_mlm_non-crossing_all_training_automatically_tagged_007	716.0 MB	3,009,481	79,003,252	5,441,290	3,264,675	2,176,615	5,660,658
dataset_mlm_all_all_validation_automatically_tagged_004	114.0 MB	402,179	12,240,756	1,201,467	659,139	542,328	1,319,365
dataset_mlm_all_all_validation_automatically_tagged_007	111.2 MB	402,179	12,240,756	781,509	462,102	319,407	1,319,365
dataset_mlm_non-crossing_all_validation_automatically_tagged_004	94.0 MB	372,880	10,061,113	1,035,283	571,082	464,201	1,141,033
dataset_mlm_non-crossing_all_validation_automatically_tagged_007	91.6 MB	372,880	10,061,113	663,793	395,771	268,022	1,141,033
dataset_mlm_all_only-relevant_training_automatically_tagged_004	11.5 MB	47,835	1,277,430	133,101	64,711	68,390	183,563
dataset_mlm_all_only-relevant_training_automatically_tagged_007	11.3 MB	47,835	1,277,430	99,103	50,544	48,559	183,563
dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004	9.6 MB	44,155	1,066,545	116,176	55,996	60,180	158,622
dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007	9.4 MB	44,155	1,066,545	85,675	43,360	42,315	158,622
dataset_mlm_all_only-relevant_validation_automatically_tagged_004	1.0 MB	2,786	107,609	8,937	4,125	4,812	27,019
dataset_mlm_all_only-relevant_validation_automatically_tagged_007	989.2 kB	2,786	107,609	6,581	2,980	3,601	27,019
dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004	754.3 kB	2,484	80,619	7,290	3,380	3,910	22,087
dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007	740.2 kB	2,484	80,619	5,281	2,404	2,877	22,087

Corpus

The file corpus.vert.gz (1.3G compressed) contains a vertical file with the results of optical character recognition, named entity recognition, language identification, and lemmatization on all books in the AHISTO project database. See also the schema of the vertical file. (Warning: The corpus is a work in progress and may change. Last modified: 2023-05-25)
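
A compressed vertical file can be streamed without unpacking it first. The Python sketch below assumes the conventional vertical layout: one token per line with tab-separated attributes, and structures on lines of their own written as XML-like tags. The number and meaning of the attribute columns are not assumed; consult the schema of the vertical file for those. The function name read_vertical is hypothetical.

```python
import gzip


def read_vertical(path):
    """Yield ("structure", line) or ("token", [attributes...]) for each line.

    Assumption: structural lines start with "<" and end with ">"; every
    other non-empty line is a token with tab-separated attributes.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as vertical:
        for line in vertical:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("<") and line.endswith(">"):
                yield "structure", line
            else:
                yield "token", line.split("\t")
```

Because the function is a generator, it reads one line at a time and never holds the whole corpus in memory.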

Citing

An article describing our dataset is currently under review. A preprint is available on arXiv.

Last modified: 29 May 2023, 9:13:20
