Changes between Version 13 and Version 14 of private/NlpInPracticeCourse/NamedEntityRecognition


Timestamp:
Oct 9, 2017, 11:33:03 AM (6 years ago)
Author:
Ales Horak
Comment:

--

  • private/NlpInPracticeCourse/NamedEntityRecognition

    v13 v14  
    1313=== Example from IE ===
    1414
    15 In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain.
     15|| In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain. ||
    1616
    1717Hannibal Lecter <-> Hopkins
     
    1919=== Example concerning syntactic parsing ===
    2020
    21 Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.
     21|| Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
    2222
    2323vs.
    2424
    25 Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
     25|| Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
    2626
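The underscore-joined variant above can be produced mechanically once the entity spans are known. The following is a minimal sketch, assuming whitespace tokenization and entity spans given as half-open token index ranges; the function name `join_entities` and the span format are illustrative assumptions, not part of the course materials.

```python
def join_entities(tokens, spans):
    """Merge each (start, end) token span into one underscore-joined token.

    `spans` are half-open, non-overlapping token index ranges
    (a hypothetical input format chosen for this sketch).
    """
    ends = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        end = ends.get(i)
        if end is not None and end > i:
            # Collapse the whole entity into a single token for the parser.
            merged.append("_".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    sentence = ("Wish You Were Here is the ninth studio album "
                "by the English progressive rock group Pink Floyd .")
    print(" ".join(join_entities(sentence.split(), [(0, 4)])))
```

With the title marked as one token, a parser no longer has to analyse "Wish You Were Here" as a clause of its own.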
    2727=== References ===
     
    40401. Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCO is your university ID.
    41411. get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
    42 1. open the NE hierarchy: `acroread cnec1.1/doc/ne-type-hierarchy.pdf`
     421. open the NE hierarchy:
     43
     44 `evince cnec2.0/doc/ne-type-hierarchy.pdf`
     45
    43461. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest` and the final evaluation data is called `etest`.
    44 1. convert the train data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_train.xml > named_ent_train.tsv`. Note that we removed documents that did not contain NEs. You can experiment with this option later.
     471. convert the train data to the Stanford NER format:
     48
     49 `python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv`
     50
     51 Note that we removed documents that did not contain NEs. You can experiment with this option later.
    45521. download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
    46 1. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
    47 1. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
    48 1. evaluate the model on `dtest`: [[BR]]
    49  `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. [[BR]]
     531. train the model using the default settings (`cnec.prop`). Note that `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION; you can extend the markup conversion later:
     54
     55 `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
     561. convert the test data to the Stanford NER format:
     57
     58 `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
     591. evaluate the model on `dtest`:
     60{{{
     61java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
     62  -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv
     63}}}
     64
    5065 You should see results like:
    5166{{{
     
    5873}}}
    5974 In the output, the first column is the input tokens, the second column is the correct (gold) answers, and the third column is the model's predictions. Observe the differences. Copy the result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected the boundaries incorrectly, or classified an entity incorrectly.
    60 10. evaluate the model on `dtest` with only NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`. Copy the result to `<YOUR_FILE>`.
    61 11. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`. Copy the result to `<YOUR_FILE>`.
     7510. evaluate the model on `dtest` with only NEs that are not present in the train data:
     76 {{{
     77java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
     78   -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv
     79}}}
     80
     81 Copy the result to `<YOUR_FILE>`.
     8211. test on your own input:
     83 {{{
     84java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
     85  -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
     86}}}
     87
     88 Copy the result to `<YOUR_FILE>`.
    6289
    639012. (optional) try to improve the training. Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, or change the word shape setting (a value following one of the patterns `dan[12](bio)?(UseLC)?`, `jenny1(useLC)?`, `chris[1234](useLC)?`, `cluster1`) or other word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
    64 13. (optional) evaluate the model on dtest, final evaluation on etest
     9113. (optional) evaluate the model on `dtest`, then run the final evaluation on `etest`.
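The error estimate asked for in the `dtest` evaluation step (missed entity, wrong boundaries, wrong class) can be approximated with a short script. This is a minimal sketch, assuming the evaluation output has been saved as whitespace-separated lines of token, gold label and predicted label, that `O` marks tokens outside any entity, and that a maximal run of one label is one entity. The helper names and the toy rows are made up for illustration and are not real CNEC output.

```python
from collections import Counter


def read_columns(lines):
    """Collect gold and predicted labels; a blank line becomes an 'O' separator."""
    gold, predicted = [], []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            gold.append(parts[1])
            predicted.append(parts[2])
        else:
            # Keep entities from running across sentence boundaries.
            gold.append("O")
            predicted.append("O")
    return gold, predicted


def spans(labels, outside="O"):
    """Return (start, end, label) for each maximal run of one non-outside label."""
    result, start = [], None
    for i, label in enumerate(labels + [outside]):
        if start is not None and label != labels[start]:
            result.append((start, i, labels[start]))
            start = None
        if start is None and label != outside:
            start = i
    return result


def classify_errors(gold, predicted):
    """Sort every gold entity into correct / missed / wrong_class / wrong_boundary."""
    predicted_spans = spans(predicted)
    counts = Counter()
    for g_start, g_end, g_label in spans(gold):
        overlapping = [p for p in predicted_spans
                       if p[0] < g_end and g_start < p[1]]
        if not overlapping:
            counts["missed"] += 1
        elif (g_start, g_end, g_label) in overlapping:
            counts["correct"] += 1
        elif any(p[:2] == (g_start, g_end) for p in overlapping):
            counts["wrong_class"] += 1
        else:
            counts["wrong_boundary"] += 1
    return counts


if __name__ == "__main__":
    # Toy rows invented for this sketch.
    toy = [
        "Jan\tPERSON\tPERSON",
        "Novak\tPERSON\tO",
        "lives\tO\tO",
        "in\tO\tO",
        "Brno\tLOCATION\tORGANIZATION",
        "",
        "Praha\tLOCATION\tO",
    ]
    print(classify_errors(*read_columns(toy)))
```

The sketch only walks over gold entities, so entities the model predicted where the gold data has none are not counted; they would need a second pass over the predicted spans.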
    6592
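To see which entities in `dtest` are actually unseen in `train`, the two converted files can be compared directly. The following is a minimal sketch, assuming both files use a two-column `token<TAB>label` layout with blank lines between sentences, and that "not present in the train data" means the entity's surface string never occurs as an entity in `train`. How `named_ent_dtest_unknown.tsv` was really built is not stated here, so treat the function names and the matching rule as assumptions.

```python
def read_tsv(lines):
    """Parse 'token<TAB>label' lines into sentences; blank lines separate them."""
    sentences, current = [], []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0]:
            current.append((parts[0], parts[1]))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def entities(sentence, outside="O"):
    """Yield (text, label) for each maximal run of one non-outside label."""
    run, run_label = [], outside
    for token, label in sentence + [("", outside)]:
        if label != run_label:
            if run:
                yield " ".join(run), run_label
            run = []
            run_label = label
        if label != outside:
            run.append(token)


def unseen_entities(train_sentences, test_sentences):
    """Entity strings that occur in the test sentences but never in train."""
    known = {text for s in train_sentences for text, _ in entities(s)}
    found = {text for s in test_sentences for text, _ in entities(s)}
    return sorted(found - known)


if __name__ == "__main__":
    with open("named_ent_train.tsv", encoding="utf-8") as f:
        train = read_tsv(f)
    with open("named_ent_dtest.tsv", encoding="utf-8") as f:
        dtest = read_tsv(f)
    for text in unseen_entities(train, dtest):
        print(text)
```

Comparing the size of this list with the total number of `dtest` entities gives a rough idea of how much of the score on the full `dtest` may come from entities the model has already seen.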