Changes between version 9 and version 10 of NerDataset

Timestamp: 30 Nov 2022 14:16:45 (3 years ago)
Author: xnovot32@fi.muni.cz
Comment: --

Legend:

Unmodified
Added
Removed
Modified

NerDataset

v9	v10
11	11	1. The sentences come from all book pages (`all`) or just those considered relevant by human annotators (`only-relevant`).
12	12	1. We split the sentences roughly into 90% for training (`training`) and 10% for validation (`validation`).
13		* 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files:
	13	* 16 tuples of files named `dataset_ner_*.sentences.txt`, `.ner_tags.txt`, and in two cases also `.docx`.[[BR]]These files contain sentences and NER tags for supervised training, validation, and testing of language models.[[BR]]The `.docx` files are authored by human annotators and may contain extra details missing from files `.sentences.txt` and `.ner_tags.txt`.[[BR]]Here are the five variables that we used to produce the different files:
14	14	1. The sentences may originate from book OCR texts using information retrieval techniques (`fuzzy-regex` or `manatee`).[[BR]]The sentences may also originate from regests (`regests`) or both books and regests (`fuzzy-regex+regests` and `fuzzy-regex+manatee`).
15	15	1. When sentences originate from book OCR texts, they may span several pages of a book.[[BR]]However, page boundaries contain pollutants such as running heads, footnotes, and page numbers.[[BR]]We either allow the sentences to cross page boundaries (`all`) or not (`non-crossing`).