Changes between Version 8 and Version 9 of NerDataset

v8	v9
7	7	The dataset is stored in the archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with the following structure:
8	8
9		* 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training of language models.[[BR]]We used the following three variables to produce the different files:
	9	* 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
10	10	1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
11	11	1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).