Try a machine learning approach (using the Stanford NER) with the prepared data.
Observe the results:
1. measure precision, recall, and F1-score on the test data
1. find NEs not present in the train data
1. find NEs that were not recognized
1. discuss what types of NEs are easy/difficult to recognize
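The observation steps above boil down to simple set operations over `(text, type)` entity pairs. A minimal sketch, assuming the gold, predicted, and train-data entity sets have already been extracted; the example entities are made up:

```python
# Sketch: NER evaluation with set operations.
# `gold`, `predicted`, and `train_entities` are hypothetical example data;
# in the workshop they would be extracted from the corpus files.
gold = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON"),
        ("Masarykova univerzita", "ORGANIZATION")}
predicted = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON"),
             ("Praha", "LOCATION")}
train_entities = {("Brno", "LOCATION"), ("Karel Čapek", "PERSON")}

tp = len(gold & predicted)            # correctly recognized NEs
precision = tp / len(predicted)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)

unseen = gold - train_entities        # NEs not present in the train data
missed = gold - predicted             # NEs that were not recognized

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
print("unseen:", unseen)
print("missed:", missed)
```

This exact-match evaluation over whole entities is also what the Stanford NER scorer reports per entity type.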
In this workshop, we train a new NER application for the Czech language. We work with free resources and software tools: the Czech Named Entity Corpus (CNEC) and the Stanford NER application.

Requirements: Java 8, Python, a few gigabytes of memory

1. create `<YOUR_FILE>`, a text file named `ia161-UCO-03.txt`, where UCO is your university ID
1. get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B04-C)
1. open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
1. the data is organized into three disjoint datasets: the training data (`train`), the development test data (`dtest`), and the final evaluation data (`etest`)
1. convert the train data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs; you can experiment with this option later.
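The converted file uses the Stanford NER column format: one token per line with its label in a tab-separated second column, `O` marking tokens outside any NE, and a blank line between sentences. A tiny sketch with a made-up sentence:

```python
# Sketch of the Stanford NER .tsv format; the sentence is illustrative,
# not taken from CNEC.
tokens = [("Václav", "PERSON"), ("Havel", "PERSON"), ("navštívil", "O"),
          ("Brno", "LOCATION"), (".", "O")]

# One token and its label per line, separated by a tab;
# a blank line separates sentences.
tsv = "\n".join(f"{word}\t{label}" for word, label in tokens) + "\n"
print(tsv)
```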
1. download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
1. train the model using the default settings (`cnec.prop`): `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`. Note that `convert_cnec_stanford.py` recognizes only PERSON, LOCATION, and ORGANIZATION; you can extend the markup conversion later.
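The actual `cnec.prop` is provided with the workshop materials; for orientation, a file along these lines is what the training command expects. This is only an illustrative sketch built from common `CRFClassifier` options, with the file names taken from the steps above:

```
# Illustrative cnec.prop sketch -- common CRFClassifier options;
# the real file ships with the workshop materials.
trainFile = named_ent_train.tsv
serializeTo = cnec-3class-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```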
1. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
1. evaluate the model on `dtest`:
`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. You should see results like:
{{{
CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
       Entity   P       R       F1      TP      FP      FN
     LOCATION   0.7962  0.7849  0.7905  332     85      91
 ORGANIZATION   0.7059  0.6019  0.6497  192     80      127
       PERSON   0.8062  0.8592  0.8319  470     113     77
       Totals   0.7814  0.7711  0.7763  994     278     295
}}}
In the output, the first column contains the input tokens, the second column the correct (gold) answers, and the third column the model's answers. Observe the differences.
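The P, R, and F1 columns in the table above follow directly from the TP/FP/FN counts; for instance, the Totals row can be recomputed like this:

```python
# Recompute the Totals row of the evaluation output from its raw counts.
tp, fp, fn = 994, 278, 295

precision = tp / (tp + fp)                          # share of predicted NEs that are correct
recall = tp / (tp + fn)                             # share of gold NEs that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")  # P=0.7814 R=0.7711 F1=0.7763
```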
10. evaluate the model on `dtest`, restricted to NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`
11. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`

12. (optional) try to improve the train data.
Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, or change the word shape features (patterns such as `dan[12](bio)?(UseLC)?`, `jenny1(useLC)?`, `chris[1234](useLC)?`, `cluster1`; see the documentation)
13. (optional) evaluate the model on `dtest`; run the final evaluation on `etest`

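For suggestion 12, gazetteers are wired in through the properties file. A hedged sketch of possible additions, assuming the standard `CRFClassifier` gazetteer flags; `czech.gazette` is a made-up file name:

```
# Illustrative additions to cnec.prop for step 12 -- the option names are
# standard CRFClassifier flags; the gazetteer file name is invented.
useKnownLCWords = false

# each gazetteer line has the form "CLASS entry", e.g. "LOCATION Brno"
useGazettes = true
gazette = czech.gazette
cleanGazette = true
```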