Context Navigation

NamedEntityRecognition

Timestamp:: Oct 12, 2015, 9:55:04 AM (10 years ago)
Author:: Ales Horak
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

private/NlpInPracticeCourse/NamedEntityRecognition

-                      v9
+                      v10
 . train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
 . convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
+. evaluate the model on `dtest`:
+`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. You should see results like:
+. evaluate the model on `dtest`: [[BR]]
+ `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. [[BR]]
+ You should see results like:
 {{{
 CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
 …
          Totals 0.7814  0.7711  0.7763  994     278     295
 }}}
 In the output, the first column is the input tokens, the second column is the correct (gold) answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected incorrectly the boundaries, or classified an entity incorrectly.
+ In the output, the first column is the input tokens, the second column is the correct (gold) answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected incorrectly the boundaries, or classified an entity incorrectly.
 . evaluate the model on `dtest` with only NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`. Copy the result to `<YOUR_FILE>`.
 . test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`. Copy the result to `<YOUR_FILE>`.
+(optional) 12. try to improve the train data
+suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the wordshape (something following the pattern: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1)` or word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
+(optional) 13. evaluate the model on dtest, final evaluation on etest
+. (optional) try to improve the train data suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the wordshape (something following the pattern: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1)` or word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
+. (optional) evaluate the model on dtest, final evaluation on etest