Changes between Version 14 and Version 15 of private/NlpInPracticeCourse/NamedEntityRecognition
Timestamp: Oct 9, 2017, 11:36:03 AM
--- private/NlpInPracticeCourse/NamedEntityRecognition (v14)
+++ private/NlpInPracticeCourse/NamedEntityRecognition (v15)
@@ -38,23 +38,27 @@
 Requirements: Java 8, python, gigabytes of memory, [raw-attachment:convert_cnec_stanford.py:wiki:en/AdvancedNlpCourse/NamedEntityRecognition convert_cnec_stanford.py], [raw-attachment:named_ent_dtest_unknown.tsv:wiki:en/AdvancedNlpCourse/NamedEntityRecognition named_ent_dtest_unknown.tsv], [raw-attachment:cnec.prop:wiki:en/AdvancedNlpCourse/NamedEntityRecognition cnec.prop]
 
-1. Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCOis your university ID.
+1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-04.txt` where ''UCO'' is your university ID.
 1. get the data: download CNEC from LINDAT/Clarin repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
 1. open the NE hierarchy:
-
-`evince cnec2.0/doc/ne-type-hierarchy.pdf`
+{{{
+evince cnec2.0/doc/ne-type-hierarchy.pdf
+}}}
 
 1. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest` and the final evaluation data is called `etest`.
 1. convert the train data to the Stanford NER format:
-
-`python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv`
+{{{
+python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv
+}}}
 
 Note that we removed documents that did not contain NEs. You can experiment with this option later.
 1. download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
 1. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later:
-
-`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
+{{{
+java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop
+}}}
 1. convert the test data to the Stanford NER format:
-
-`python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
+{{{
+python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv
+}}}
 1. evaluate the model on `dtest`:
 {{{