Changes between Version 14 and Version 15 of private/NlpInPracticeCourse/NamedEntityRecognition


Timestamp: Oct 9, 2017, 11:36:03 AM
Author: Ales Horak
Comment: --

  • private/NlpInPracticeCourse/NamedEntityRecognition

    v14 v15
    38  38  Requirements: Java 8, python, gigabytes of memory, [raw-attachment:convert_cnec_stanford.py:wiki:en/AdvancedNlpCourse/NamedEntityRecognition convert_cnec_stanford.py], [raw-attachment:named_ent_dtest_unknown.tsv:wiki:en/AdvancedNlpCourse/NamedEntityRecognition named_ent_dtest_unknown.tsv], [raw-attachment:cnec.prop:wiki:en/AdvancedNlpCourse/NamedEntityRecognition cnec.prop]
    39  39
    40      1. Create `<YOUR_FILE>`, a text file named ia161-UCO-03.txt where UCO is your university ID.
        40  1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-04.txt` where ''UCO'' is your university ID.
    41  41  1. get the data: download CNEC from LINDAT/Clarin repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
    42  42  1. open the NE hierarchy:
    43
    44       `evince cnec2.0/doc/ne-type-hierarchy.pdf`
        43  {{{
        44  evince cnec2.0/doc/ne-type-hierarchy.pdf
        45  }}}
    45  46
    46  47  1. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest` and the final evaluation data is called `etest`.
    47  48  1. convert the train data to the Stanford NER format:
    48
    49       `python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv`
        49  {{{
        50  python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml > named_ent_train.tsv
        51  }}}
    50  52
    51  53   Note that we removed documents that did not contain NEs. You can experiment with this option later.
    52  54  1. download the Stanford NE recognizer http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
    53  55  1. train the model using the default settings (cnec.prop), N.B. that the `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION, you can extend the markup conversion later:
    54
    55       `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop`
        56  {{{
        57  java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop cnec.prop
        58  }}}
    56  59  1. convert the test data to the Stanford NER format:
    57
    58       `python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv`
        60   {{{
        61   python convert_cnec_stanford.py named_ent_dtest.xml > named_ent_dtest.tsv
        62  }}}
    59  63  1. evaluate the model on `dtest`:
    60  64  {{{