Changes between Version 13 and Version 14 of private/NlpInPracticeCourse/NamedEntityRecognition
Timestamp: Oct 9, 2017, 11:33:03 AM
--- private/NlpInPracticeCourse/NamedEntityRecognition (v13)
+++ private/NlpInPracticeCourse/NamedEntityRecognition (v14)
@@ -13,5 +13,5 @@
 === Example from IE ===
 
-In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain.
+|| In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain. ||
 
 Hannibal Lecter <-> Hopkins
@@ -19,9 +19,9 @@
 === Example concerning syntactic parsing ===
 
-Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.
+|| Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
 
 vs.
 
-Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
+|| Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
 
 === References ===
@@ -40,12 +40,27 @@
 1. Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCO is your university ID.
 1. get the data: download CNEC from LINDAT/Clarin repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
-1. open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
+1. open the NE hierarchy:
+
+`evince cnec2.0/doc/ne-type-hierarchy.pdf`
+
 1. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest` and the final evaluation data is called `etest`.
-1. convert the train data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs. You can experiment with this option later.
+1. convert the train data to the Stanford NER format:
+
+`python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv`
+
+Note that we removed documents that did not contain NEs. You can experiment with this option later.
 1. download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
-1. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
-1. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
-1. evaluate the model on `dtest`: [[BR]]
-`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. [[BR]]
+1. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later:
+
+`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
+1. convert the test data to the Stanford NER format:
+
+`python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
+1. evaluate the model on `dtest`:
+{{{
+java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+-loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv
+}}}
+
 You should see results like:
 {{{
@@ -58,8 +73,20 @@
 }}}
 In the output, the first column is the input tokens, the second column is the correct (gold) answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected incorrectly the boundaries, or classified an entity incorrectly.
-10. evaluate the model on `dtest` with only NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`. Copy the result to `<YOUR_FILE>`.
-11. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`. Copy the result to `<YOUR_FILE>`.
+10. evaluate the model on `dtest` with only NEs that are not present in the train data:
+{{{
+java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+-loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv
+}}}
+
+Copy the result to `<YOUR_FILE>`.
+11. test on your own input:
+{{{
+java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+-loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
+}}}
+
+Copy the result to `<YOUR_FILE>`.
 
 12. (optional) try to improve the train data suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the wordshape (something following the pattern: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1)` or word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
-13. (optional) evaluate the model on dtest, final evaluation on etest
+13. (optional) evaluate the model on `dtest`, final evaluation on `etest`.
 
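The `dtest` evaluation step in the page text above asks the reader to estimate how often the model missed an entity, got its boundaries wrong, or assigned the wrong class. The following is a minimal Python sketch of one way to count those cases, not part of the course materials. It assumes the `-testFile` output has been saved to a file with one token per line in three whitespace-separated columns (token, gold label, predicted label; the page only documents the first two columns), that sentences are separated by blank lines, that `O` marks non-entity tokens, and that an entity is a maximal run of identical labels (no BIO prefixes). The function names and the sample rows are invented for illustration.

```python
from collections import Counter


def spans(labels, background="O"):
    """Return the set of (start, end, label) for maximal runs of one non-background label."""
    labels = list(labels)
    found, start = set(), None
    # A trailing background sentinel closes a run that reaches the end of the sentence.
    for i, lab in enumerate(labels + [background]):
        if start is not None and lab != labels[start]:
            found.add((start, i, labels[start]))
            start = None
        if start is None and lab != background:
            start = i
    return found


def classify_errors(gold, pred):
    """Compare gold and predicted label sequences of one sentence, entity by entity."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    counts = Counter()
    for g_start, g_end, g_label in gold_spans:
        if (g_start, g_end, g_label) in pred_spans:
            counts["correct"] += 1
            continue
        overlapping = [p for p in pred_spans if p[0] < g_end and g_start < p[1]]
        if not overlapping:
            counts["missed"] += 1
        elif any(p[0] == g_start and p[1] == g_end for p in overlapping):
            counts["wrong_type"] += 1       # same boundaries, different class
        else:
            counts["wrong_boundary"] += 1   # overlaps a prediction, boundaries differ
    for p in pred_spans:
        # Predicted entities that touch no gold entity at all.
        if not any(p[0] < g[1] and g[0] < p[1] for g in gold_spans):
            counts["spurious"] += 1
    return counts


def read_sentences(lines):
    """Yield (gold, pred) label lists per sentence; lines with fewer than 3 columns end a sentence."""
    gold, pred = [], []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            if gold:
                yield gold, pred
            gold, pred = [], []
            continue
        gold.append(parts[-2])
        pred.append(parts[-1])
    if gold:
        yield gold, pred


def summarize(lines):
    """Sum the per-sentence error counts over the whole output."""
    total = Counter()
    for gold, pred in read_sentences(lines):
        total += classify_errors(gold, pred)
    return total


# Invented sample rows (token, gold, predicted), for illustration only.
SAMPLE = [
    "Hannibal\tPERSON\tPERSON",
    "Lecter\tPERSON\tPERSON",
    "visited\tO\tO",
    "Pink\tORGANIZATION\tPERSON",
    "Floyd\tORGANIZATION\tPERSON",
    "in\tO\tO",
    "New\tLOCATION\tO",
    "York\tLOCATION\tLOCATION",
    "",
    "Hopkins\tPERSON\tO",
]

print(dict(summarize(SAMPLE)))
```

On the invented sample this counts one correct entity, one wrong class, one wrong boundary and one missed entity. To run it on real output, redirect the classifier's standard output to a file (the name is up to you) and pass `open(path, encoding="utf-8")` to `summarize`.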