Changes between Version 4 and Version 5 of private/NlpInPracticeCourse/NamedEntityRecognition


Timestamp: Oct 11, 2015, 3:13:43 PM
Author: Zuzana Nevěřilová
  • private/NlpInPracticeCourse/NamedEntityRecognition

    v4 v5  
    3232== Practical Session ==
    3333
    34 Try naive gazetteer method (implement substring search) on prepared data.
    35 Observe the results:
    36   1. what happens to every string present in the gazetteer?
    37   1. what happens to NE not present in the gazetteer?
     34=== Czech Named Entity Recognition ===
    3835
    39 Try machine learning approach (use the Stanford NER) with prepared data.
    40 Observe the results:
    41   1. measure precision, recall, and F1-score on the test data
    42   1. find NEs not present in the train data
    43   1. find NEs that were not recognized
    44   1. discuss what types of NE are easy/difficult to recognize
     36In this workshop, we train a new NER model for the Czech language. We work with freely available resources and software tools: the Czech Named Entity Corpus (CNEC) and the Stanford NER application.
     37
     38Requirements: Java 8, Python, and several gigabytes of memory
     39
     401. Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCO is your university ID.
     411. Get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B04-C).
     421. Open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
     431. The data is organized into three disjoint datasets: `train` (training data), `dtest` (development test data), and `etest` (final evaluation data).
     441. Convert the training data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs. You can experiment with this option later.
     451. Download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it).
     461. Train the model using the default settings (cnec.prop): `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`. Note that `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION; you can extend the markup conversion later.
     471. Convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
     481. Evaluate the model on `dtest`:
     49`java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. You should see results like:
     50{{{
     51CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
     52         Entity P       R       F1      TP      FP      FN
     53       LOCATION 0.7962  0.7849  0.7905  332     85      91
     54   ORGANIZATION 0.7059  0.6019  0.6497  192     80      127
     55         PERSON 0.8062  0.8592  0.8319  470     113     77
     56         Totals 0.7814  0.7711  0.7763  994     278     295
     57}}}
     58In the output, the first column is the input tokens, the second column is the correct (gold) answers, and the third column is the answer guessed by the classifier. Observe the differences between the last two columns.
     5910. Evaluate the model on `dtest` with only NEs that are not present in the training data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`
     6011. Test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`
     61
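The P, R and F1 columns in the sample output follow directly from the TP, FP and FN counts, so you can recompute them to check your own runs. The sketch below is not part of the Stanford NER distribution; the helper names `prf` and `report` are made up for illustration, and the counts are copied from the sample output above.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def report(counts):
    """Per-entity scores plus a 'Totals' row built from the summed counts."""
    rows = {name: prf(*c) for name, c in counts.items()}
    totals = tuple(sum(col) for col in zip(*counts.values()))
    rows["Totals"] = prf(*totals)
    return rows


# (TP, FP, FN) per entity type, copied from the sample output.
counts = {
    "LOCATION": (332, 85, 91),
    "ORGANIZATION": (192, 80, 127),
    "PERSON": (470, 113, 77),
}

for name, (p, r, f1) in report(counts).items():
    print(f"{name:>15} {p:.4f}  {r:.4f}  {f1:.4f}")
```

Note that the Totals row is computed from the summed counts (micro-average), not by averaging the three per-entity scores.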
     62(optional) 12. Try to improve the training data.
     63Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, or change the `wordShape` setting (a value following one of the patterns `dan[12](bio)?(UseLC)?`, `jenny1(useLC)?`, `chris[1234](useLC)?`, `cluster1`) or other word shape features (see the documentation).
     64(optional) 13. Evaluate the model on `dtest`, then run the final evaluation on `etest`.
     65
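To see which named entities in `dtest` never occur in the training data (the case evaluated in step 10), you can compare the entity strings in the two converted files. How `named_ent_dtest_unknown.tsv` was actually produced is not stated here; the following is only one possible approach. It assumes the converted files hold one `token<TAB>label` pair per line, with `O` as the background label and blank lines between documents. The function name `read_entities` and the sample tokens are made up for illustration.

```python
def read_entities(lines, background="O"):
    """Collect (entity text, label) pairs from token<TAB>label lines.

    Consecutive tokens with the same non-background label are joined
    into one entity, so two adjacent entities of the same type are
    merged (a limitation of labels without B-/I- prefixes).
    """
    entities = set()
    tokens, label = [], None
    for line in list(lines) + [""]:  # the sentinel flushes the last entity
        parts = line.rstrip("\n").split("\t")
        tag = parts[1] if len(parts) >= 2 else background
        if tag != label and tokens:
            entities.add((" ".join(tokens), label))
            tokens = []
        if tag != background:
            tokens.append(parts[0])
            label = tag
        else:
            label = None
    return entities


# Tiny made-up samples; with real data, pass open file objects instead,
# e.g. read_entities(open("named_ent_train.tsv", encoding="utf-8")).
train_lines = ["Jan\tPERSON", "Novák\tPERSON", "žije\tO", "v\tO", "Praze\tLOCATION"]
dtest_lines = ["Jan\tPERSON", "Novák\tPERSON", "navštívil\tO", "Brno\tLOCATION"]

unseen = read_entities(dtest_lines) - read_entities(train_lines)
print(sorted(unseen))
```

This compares exact surface strings, so inflected forms of the same name (common in Czech) count as different entities.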