Changes between Version 8 and Version 9 of NerDataset
- Timestamp:
- 28 Nov 2022, 16:39:31 (20 months ago)
NerDataset
v8  v9
7   7     The dataset is stored in archive [https://nlp.fi.muni.cz/projects/ahisto/ner-dataset.zip ner-dataset.zip] (1.7 GB) with following structure:
8   8
9       - * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training of language models.[[BR]]We used the following three variables to produce the different files:
    9   + * 8 files named `dataset_mlm_*.txt` that contain sentences for unsupervised training and validation of language models.[[BR]]We used the following three variables to produce the different files:
10  10      1. The sentences are extracted from book OCR texts and may therefore span several pages.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).
11  11      1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).