NamedEntityRecognition

Timestamp:: Oct 9, 2017, 11:33:03 AM
Author:: Ales Horak
Comment:: --

private/NlpInPracticeCourse/NamedEntityRecognition

-                      v13
+                      v14
 === Example from IE ===
-In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain.
+|| In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain. ||
 Hannibal Lecter <-> Hopkins
 …
 === Example concerning syntactic parsing ===
-Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.
+|| Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
 vs.
-Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
+|| Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
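The contrast between the two inputs above can be reproduced with a small pre-processing step: once the NE spans are known, each multi-word entity is joined into a single token before the sentence is passed to a parser. The following is only a minimal Python sketch. The function name `join_entities` and the span format (token offsets, end exclusive) are assumptions made for illustration, not part of the course materials.

```python
def join_entities(tokens, spans, sep="_"):
    """Merge each (start, end) token span (end exclusive) into a single token."""
    ends = {start: end for start, end in spans}
    out = []
    i = 0
    while i < len(tokens):
        end = ends.get(i)
        if end is not None and end > i:
            # Join all tokens of the entity with the separator.
            out.append(sep.join(tokens[i:end]))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = "Wish You Were Here is the ninth studio album".split()
# The span (0, 4) marks the album title as one entity (hand-annotated here).
joined = join_entities(tokens, [(0, 4)])
print(" ".join(joined))
```

In practice the spans would come from an NE recognizer rather than being written by hand.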
 === References ===
 …
 . Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCO is your university ID.
 . get the data: download CNEC from LINDAT/Clarin repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
-. open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
+. open the NE hierarchy:
+ `evince cnec2.0/doc/ne-type-hierarchy.pdf`
 . the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest` and the final evaluation data is called `etest`.
-. convert the train data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs. You can experiment with this option later.
+. convert the train data to the Stanford NER format:
+ `python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv`
+ Note that we removed documents that did not contain NEs. You can experiment with this option later.
 . download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
-. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
-. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
-. evaluate the model on `dtest`: [[BR]]
- `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. [[BR]]
+. train the model using the default settings (cnec.prop). N.B.: `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION; you can extend the markup conversion later:
+ `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
+. convert the test data to the Stanford NER format:
+ `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
+. evaluate the model on `dtest`:
+{{{
+java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+  -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv
+}}}
  You should see results like:
 {{{
 …
 }}}
  In the output, the first column is the input tokens, the second column is the correct (gold) answers, and the third column is the answer guessed by the classifier. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected the boundaries incorrectly, or classified an entity incorrectly.
-. evaluate the model on `dtest` with only NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`. Copy the result to `<YOUR_FILE>`.
-. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`. Copy the result to `<YOUR_FILE>`.
+. evaluate the model on `dtest` with only NEs that are not present in the train data:
+ {{{
+java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+   -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv
+}}}
+ Copy the result to `<YOUR_FILE>`.
+. test on your own input:
+ {{{
+java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
+  -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
+}}}
+ Copy the result to `<YOUR_FILE>`.
 . (optional) try to improve the training. Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the `wordShape` value (something following the pattern `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1`) or other word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
-. (optional) evaluate the model on dtest, final evaluation on etest
+. (optional) evaluate the model on `dtest`, final evaluation on `etest`.
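The error estimate asked for in the evaluation step (missed entities, wrong boundaries, wrong class) can be approximated with a short script over the classifier output. The Python sketch below rests on several assumptions: the output has been saved as tab-separated lines of token, gold label and predicted label; `O` is the background label; and an entity is a maximal run of identical non-`O` labels. The function names and the sample lines are invented for illustration and are not real CNEC results.

```python
from collections import Counter


def spans(labels, background="O"):
    """Return {(start, end): label} for maximal runs of one non-background label."""
    labels = list(labels)
    found = {}
    start = None
    for i, label in enumerate(labels + [background]):
        if start is not None and label != labels[start]:
            found[(start, i)] = labels[start]
            start = None
        if start is None and label != background:
            start = i
    return found


def read_output(lines):
    """Collect gold and predicted labels; short or blank lines act as sentence breaks."""
    gold, pred = [], []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            gold.append("O")
            pred.append("O")
        else:
            gold.append(parts[1])
            pred.append(parts[2])
    return gold, pred


def classify(gold, pred):
    """Count, for each gold entity, how the prediction relates to it."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    counts = Counter()
    for (gs, ge), label in gold_spans.items():
        if (gs, ge) in pred_spans:
            same = pred_spans[(gs, ge)] == label
            counts["correct" if same else "wrong_type"] += 1
        elif any(ps < ge and gs < pe for ps, pe in pred_spans):
            counts["wrong_boundary"] += 1
        else:
            counts["missed"] += 1
    return counts


# Made-up sample in the assumed three-column format.
sample = [
    "Hannibal\tPERSON\tPERSON",
    "Lecter\tPERSON\tPERSON",
    "",
    "American\tORGANIZATION\tO",
    "Film\tORGANIZATION\tORGANIZATION",
    "Institute\tORGANIZATION\tORGANIZATION",
    "Hopkins\tPERSON\tLOCATION",
    "Brno\tLOCATION\tO",
]
counts = classify(*read_output(sample))
print(dict(counts))
```

The sketch counts only errors on gold entities; spurious predictions (entities found where the gold data has none) would need a second pass over the predicted spans.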