Try a machine learning approach (using the Stanford NER) with the prepared data.
Observe the results:
1. measure precision, recall, and F1-score on the test data
1. find NEs not present in the train data
1. find NEs that were not recognized
1. discuss what types of NEs are easy/difficult to recognize
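The observation steps above boil down to simple set operations over `(text, type)` entity pairs. A minimal sketch, assuming the gold, predicted, and train-data entity sets have already been extracted; the example entities are made up:

```python
# Sketch: NER evaluation with set operations.
# `gold`, `predicted`, and `train_entities` are hypothetical example data;
# in the workshop they would be extracted from the corpus files.
gold = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON"),
        ("Masarykova univerzita", "ORGANIZATION")}
predicted = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON"),
             ("Praha", "LOCATION")}
train_entities = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON")}

tp = len(gold & predicted)            # correctly recognized NEs
precision = tp / len(predicted)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)

unseen = gold - train_entities        # NEs not present in the train data
missed = gold - predicted             # NEs that were not recognized

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("unseen:", unseen)
print("missed:", missed)
```

This exact-match evaluation over whole entities is also what the Stanford NER scorer reports per entity type.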
In this workshop, we train a new NER application for the Czech language. We work with free resources and software tools: the Czech Named Entity Corpus (CNEC) and the Stanford NER application.

Requirements: Java 8, Python, a few gigabytes of memory

1. create `<YOUR_FILE>`, a text file named `ia161-UCO-03.txt`, where UCO is your university ID
1. get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B04-C)
1. open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
1. the data is organized into three disjoint datasets: the training data (`train`), the development test data (`dtest`), and the final evaluation data (`etest`)
1. convert the train data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs; you can experiment with this option later.
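The converted file uses the Stanford NER column format: one token per line with its label in a tab-separated second column, `O` marking tokens outside any NE, and a blank line between sentences. A tiny sketch with a made-up sentence:

```python
# Sketch of the Stanford NER .tsv format; the sentence is illustrative,
# not taken from CNEC.
tokens = [("Václav", "PERSON"), ("Havel", "PERSON"), ("navštívil", "O"),
          ("Brno", "LOCATION"), (".", "O")]

# One token and its label per line, separated by a tab;
# a blank line separates sentences.
tsv = "\n".join(f"{word}\t{label}" for word, label in tokens) + "\n"
print(tsv)
```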
1. download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
1. train the model using the default settings (`cnec.prop`): `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`. Note that `convert_cnec_stanford.py` recognizes only PERSON, LOCATION, and ORGANIZATION; you can extend the markup conversion later.
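The actual `cnec.prop` is provided with the workshop materials; for orientation, a file along these lines is what the training command expects. This is only an illustrative sketch built from common `CRFClassifier` options, with the file names taken from the steps above:

```
# Illustrative cnec.prop sketch -- common CRFClassifier options;
# the real file ships with the workshop materials.
trainFile = named_ent_train.tsv
serializeTo = cnec-3class-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```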
1. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
1. evaluate the model on `dtest`:
`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. You should see results like:
{{{
CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
       Entity   P       R       F1      TP      FP      FN
     LOCATION   0.7962  0.7849  0.7905  332     85      91
 ORGANIZATION   0.7059  0.6019  0.6497  192     80      127
       PERSON   0.8062  0.8592  0.8319  470     113     77
       Totals   0.7814  0.7711  0.7763  994     278     295
}}}
In the output, the first column contains the input tokens, the second column the correct (gold) answers, and the third column the model's answers. Observe the differences.
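The P, R, and F1 columns in the table above follow directly from the TP/FP/FN counts; for instance, the Totals row can be recomputed like this:

```python
# Recompute the Totals row of the evaluation output from its raw counts.
tp, fp, fn = 994, 278, 295

precision = tp / (tp + fp)                          # share of predicted NEs that are correct
recall = tp / (tp + fn)                             # share of gold NEs that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")  # P=0.7814 R=0.7711 F1=0.7763
```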
10. evaluate the model on `dtest`, restricted to NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`
11. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`

12. (optional) try to improve the train data.
Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, or change the word shape features (patterns such as `dan[12](bio)?(UseLC)?`, `jenny1(useLC)?`, `chris[1234](useLC)?`, `cluster1`; see the documentation)
13. (optional) evaluate the model on `dtest`; run the final evaluation on `etest`

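For suggestion 12, gazetteers are wired in through the properties file. A hedged sketch of possible additions, assuming the standard `CRFClassifier` gazetteer flags; `czech.gazette` is a made-up file name:

```
# Illustrative additions to cnec.prop for step 12 -- the option names are
# standard CRFClassifier flags; the gazetteer file name is invented.
useKnownLCWords = false

# each gazetteer line has the form "CLASS entry", e.g. "LOCATION Brno"
useGazettes = true
gazette = czech.gazette
cleanGazette = true
```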