Changes between Version 9 and Version 10 of private/AdvancedNlpCourse/NamedEntityRecognition


Ignore:
Timestamp:
Oct 12, 2015, 9:55:04 AM (5 years ago)
Author:
Ales Horak
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • private/AdvancedNlpCourse/NamedEntityRecognition

    v9 v10  
    46461. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
    47471. convert the test data to the Stanford NER format: `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
    48 1. evaluate the model on `dtest`:
    49 `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. You should see results like:
     481. evaluate the model on `dtest`: [[BR]]
     49 `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest.tsv`. [[BR]]
     50 You should see results like:
    5051{{{
    5152CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
     
    5657         Totals 0.7814  0.7711  0.7763  994     278     295
    5758}}}
    58 In the output, the first column is the input tokens, the second column is the correct (gold) answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected incorrectly the boundaries, or classified an entity incorrectly.
     59 In the output, the first column is the input tokens, the second column is the correct (gold) answers. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected incorrectly the boundaries, or classified an entity incorrectly.
    596010. evaluate the model on `dtest` with only NEs that are not present in the train data: `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -testFile named_ent_dtest_unknown.tsv`. Copy the result to `<YOUR_FILE>`.
    606111. test on your own input: `java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt`. Copy the result to `<YOUR_FILE>`.
    6162
    62 (optional) 12. try to improve the train data
    63 suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the wordshape (something following the pattern: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1)` or word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
    64 (optional) 13. evaluate the model on dtest, final evaluation on etest
     6312. (optional) try to improve the train data suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the wordshape (something following the pattern: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1)` or word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
     6413. (optional) evaluate the model on dtest, final evaluation on etest
    6565