Changes between Version 21 and Version 22 of private/AdvancedNlpCourse/NamedEntityRecognition


Timestamp: Jan 3, 2021, 7:45:08 PM
Author: Zuzana Nevěřilová

1. David Nadeau, Satoshi Sekine: A survey of named entity recognition and classification. In Satoshi Sekine and Elisabete Ranchhod (eds.) Named Entities: Recognition, classification and use. Lingvisticæ Investigationes 30:1. 2007. pp. 3–26 [[http://brown.cl.uni-heidelberg.de/~sourjiko/NER_Literatur/survey.pdf]]
1. Charles Sutton and Andrew !McCallum: An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 4 (4). 2012. [[http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf]]
1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova: BERT: Pre-training of deep bidirectional transformers for language understanding. 2019. [[https://arxiv.org/abs/1810.04805]]

== Practical Session ==

=== Czech Named Entity Recognition ===

In this workshop, we train a new NER application for the Czech language. We work with free resources & software tools: the Czech NE Corpus (CNEC) and the Stanford NER application. (In version 22 of this page, the Stanford NER application is replaced by the FastText pre-trained word embeddings, and we build a neural network to solve the problem.)

Requirements: Java 8, Python, gigabytes of memory, [raw-attachment:convert_cnec_stanford.py:wiki:en/AdvancedNlpCourse/NamedEntityRecognition convert_cnec_stanford.py], [raw-attachment:get_unknown.py:wiki:en/AdvancedNlpCourse/NamedEntityRecognition get_unknown.py], [raw-attachment:cnec.prop:wiki:en/AdvancedNlpCourse/NamedEntityRecognition cnec.prop]

1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-04.txt` where ''UCO'' is your university ID.
1. get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8)
1. open the NE hierarchy:
{{{
evince cnec2.0/doc/ne-type-hierarchy.pdf
}}}
1. the data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest`, and the final evaluation data is called `etest`.
1. convert the train data to the Stanford NER format:
{{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml \
  > named_ent_train.tsv
}}}
 Note that we removed documents that did not contain NEs. You can experiment with this option later.
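 For orientation, the Stanford NER training format is one token per line followed by a tab and a label, with `O` marking tokens outside any entity. The sketch below only illustrates that shape; the sentence and tags are invented, and the real conversion is done by `convert_cnec_stanford.py`:
 ```python
# Illustration of the two-column Stanford NER format (token TAB label).
# "O" marks tokens that are not part of any named entity.
tokens = ["Václav", "Havel", "navštívil", "Brno", "."]
labels = ["PERSON", "PERSON", "O", "LOCATION", "O"]

lines = ["{}\t{}".format(tok, lab) for tok, lab in zip(tokens, labels)]
tsv = "\n".join(lines)
print(tsv)
```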
1. download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it)
1. train the model using the default settings (`cnec.prop`). Note that `convert_cnec_stanford.py` only recognizes PERSON, LOCATION and ORGANIZATION; you can extend the markup conversion later:
{{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -prop cnec.prop
}}}
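 If you want to inspect or tweak `cnec.prop`, a CRFClassifier properties file generally looks like the sketch below. The feature flags shown are standard Stanford NER options; the exact contents of the course's `cnec.prop` attachment may differ:
 ```properties
# Sketch of a Stanford NER training configuration (not the exact cnec.prop).
trainFile = named_ent_train.tsv
serializeTo = cnec-3class-model.ser.gz
# column 0 holds the token, column 1 the gold label
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```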
1. convert the test data to the Stanford NER format:
 {{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_dtest.xml \
  > named_ent_dtest.tsv
}}}
1. evaluate the model on `dtest`:
{{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest.tsv
}}}

 You should see results like:
{{{
CRFClassifier tagged 19993 words in 900 documents at 2388.94 words per second.
         Entity P       R       F1      TP      FP      FN
            LOC 0.7064  0.7586  0.7316  308     128     98
            ORG 0.6943  0.5576  0.6185  184     81      146
          OTHER 0.6224  0.6498  0.6358  590     358     318
            PER 0.7727  0.8236  0.7974  425     125     91
         Totals 0.6853  0.6977  0.6914  1507    692     653
}}}
 In the per-token output, the first column is the input token, the second column is the correct (gold) answer, and the third column is the model's answer. Observe the differences. Copy the training result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, incorrectly detected the entity boundaries, or classified an entity incorrectly.
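 The Totals row of the table can be sanity-checked from the raw counts: precision is TP/(TP+FP), recall is TP/(TP+FN), and F1 is their harmonic mean. In Python:
 ```python
# Recompute the Totals row of the evaluation table from TP/FP/FN.
tp, fp, fn = 1507, 692, 653

precision = tp / (tp + fp)                         # 1507 / 2199
recall = tp / (tp + fn)                            # 1507 / 2160
f1 = 2 * precision * recall / (precision + recall) # harmonic mean

print(round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.6853 0.6977 0.6914
```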
10. evaluate the model on `dtest` with only NEs that are not present in the train data. First, you need to keep only those documents that contain no NEs from the training data. Use the script `get_unknown.py`, then run the NER:
 {{{
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest_unknown.tsv
}}}

 Copy the result to `<YOUR_FILE>`.
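 `get_unknown.py` is provided as an attachment; conceptually, the filtering it performs can be sketched as a set comparison of entity strings. The function and the toy documents below are illustrative, not the script's actual code:
 ```python
# Keep only test documents whose entities never occur in the training data.
def filter_unknown(test_docs, train_entities):
    """test_docs: list of (document, set_of_entity_strings) pairs."""
    unknown_docs = []
    for doc, entities in test_docs:
        # a document qualifies if it has entities and none were seen in train
        if entities and entities.isdisjoint(train_entities):
            unknown_docs.append(doc)
    return unknown_docs

train_entities = {"Praha", "Václav Havel"}
test_docs = [
    ("Václav Havel žil v Praze.", {"Václav Havel", "Praha"}),
    ("Petr Novák navštívil Ostravu.", {"Petr Novák", "Ostrava"}),
]
print(filter_unknown(test_docs, train_entities))
# → ['Petr Novák navštívil Ostravu.']
```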
11. test on your own input:
 {{{
java -mx600m -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
}}}

 Copy the result to `<YOUR_FILE>`.

12. (optional) try to improve the results: set `useKnownLCWords` to false, add gazetteers, remove punctuation, try to change the word shape (following a pattern such as: `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1`) or the word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
13. (optional) evaluate the model on `dtest`, final evaluation on `etest`.

1. Open Google Colab at [[https://colab.research.google.com/drive/1mnz-P30CLxrxQ0yyqpcLwVJgi7e59shi?usp=sharing]]
1. Follow the instructions in the notebook. There are three obligatory tasks. Write down your answers to `<YOUR_FILE>`.