Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2018/NamedEntityRecognition


Timestamp: Sep 12, 2019, 11:11:31 AM (18 months ago)
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/NamedEntityRecognition

= Named Entity Recognition =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová

== State of the Art ==

Named entity recognition (NER) aims to ''recognize'' and ''classify'' names of people, locations, organizations, products, artworks, and sometimes also dates, money, measurements (numbers with units), law or patent numbers, etc. Known issues are ambiguity of words (e.g. ''May'' can be a month, a verb, or a name), ambiguity of classes (e.g. ''HMS Queen Elizabeth'' contains a person's name but denotes a ship), and the inherent incompleteness of lists of named entities (NEs).

NER is used mainly in information extraction (IE), but it can also significantly improve other NLP tasks such as syntactic parsing.

=== Example from IE ===

|| In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain. ||

Hannibal Lecter <-> Hopkins

=== Example concerning syntactic parsing ===

|| Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd. ||

vs.

|| Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd. ||
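
The second variant can be produced by joining the tokens of each recognized entity before the sentence is passed to a parser. The following is a minimal sketch, assuming the recognizer returns entity spans as (start, end) token offsets with an exclusive end; the function name `merge_entities` and the span format are illustrative and not part of any tool used in this course.

```python
def merge_entities(tokens, spans, joiner="_"):
    """Join the tokens of each entity span into a single token.

    tokens: list of str
    spans:  list of (start, end) token offsets, end exclusive (assumed format)
    """
    merged = []
    position = 0
    for start, end in sorted(spans):
        if start < position:
            # Skip spans that overlap an already merged entity.
            continue
        merged.extend(tokens[position:start])          # tokens before the entity
        merged.append(joiner.join(tokens[start:end]))  # the entity as one token
        position = end
    merged.extend(tokens[position:])                   # tokens after the last entity
    return merged


sentence = ("Wish You Were Here is the ninth studio album by the "
            "English progressive rock group Pink Floyd.").split()
# The album title occupies the first four tokens.
print(" ".join(merge_entities(sentence, [(0, 4)])))
```

Passing `(15, 17)` as a second span would join the band name in the same way.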

=== References ===

 1. David Nadeau and Satoshi Sekine: A survey of named entity recognition and classification. In Satoshi Sekine and Elisabete Ranchhod (eds.): Named Entities: Recognition, classification and use. Lingvisticæ Investigationes 30:1. 2007. pp. 3–26. [[http://brown.cl.uni-heidelberg.de/~sourjiko/NER_Literatur/survey.pdf]]
 1. Charles Sutton and Andrew !McCallum: An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 4 (4). 2012. [[http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf]]

== Practical Session ==

=== Czech Named Entity Recognition ===

In this workshop, we train a new NER application for the Czech language. We work with free resources and software tools: the Czech NE Corpus (CNEC) and the Stanford NER application.

Requirements: Java 8, Python, gigabytes of memory, [raw-attachment:convert_cnec_stanford.py:wiki:en/AdvancedNlpCourse/NamedEntityRecognition convert_cnec_stanford.py], [raw-attachment:named_ent_dtest_unknown.tsv:wiki:en/AdvancedNlpCourse/NamedEntityRecognition named_ent_dtest_unknown.tsv], [raw-attachment:cnec.prop:wiki:en/AdvancedNlpCourse/NamedEntityRecognition cnec.prop]
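
The Stanford NER format used in the steps below is plain text with one token per line, the token and its label separated by a tab, `O` marking tokens outside any entity, and an empty line between documents. The following is a minimal sketch of writing such a file, assuming the column order token, label; the helper `to_stanford_tsv` and the sample sentences are illustrative, and the actual `convert_cnec_stanford.py` may work differently.

```python
def to_stanford_tsv(documents):
    """Serialize documents to tab-separated token/label lines.

    documents: list of documents, each a list of (token, label) pairs;
    the label is an entity class such as PERSON, or "O" for other tokens.
    """
    blocks = []
    for document in documents:
        lines = [f"{token}\t{label}" for token, label in document]
        blocks.append("\n".join(lines))
    # An empty line separates consecutive documents.
    return "\n\n".join(blocks) + "\n"


# Illustrative sample, not taken from CNEC.
sample = [
    [("Václav", "PERSON"), ("Havel", "PERSON"), ("navštívil", "O"),
     ("Brno", "LOCATION"), (".", "O")],
    [("Masarykova", "ORGANIZATION"), ("univerzita", "ORGANIZATION")],
]
print(to_stanford_tsv(sample), end="")
```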

1. Create `<YOUR_FILE>`, a text file named `ia161-UCO-04.txt` where ''UCO'' is your university ID.
1. Get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8).
1. Open the NE hierarchy:
{{{
evince cnec2.0/doc/ne-type-hierarchy.pdf
}}}

1. The data is organized into 3 disjoint datasets: the training data is called `train`, the development test data is called `dtest`, and the final evaluation data is called `etest`.
1. Convert the train data to the Stanford NER format:
{{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml \
  > named_ent_train.tsv
}}}

 Note that documents that do not contain any NEs are removed. You can experiment with this option later.
1. Download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it).
1. Train the model using the default settings (`cnec.prop`). N.B. `convert_cnec_stanford.py` only recognizes PERSON, LOCATION, and ORGANIZATION; you can extend the markup conversion later:
{{{
java -cp stanford-ner-2017-06-09/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -prop cnec.prop
}}}
1. Convert the test data to the Stanford NER format:
{{{
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_dtest.xml \
  > named_ent_dtest.tsv
}}}
1. Evaluate the model on `dtest`:
{{{
java -cp stanford-ner-2017-06-09/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest.tsv
}}}

 You should see results like:
{{{
CRFClassifier tagged 12120 words in 441 documents at 8145.16 words per second.
         Entity P       R       F1      TP      FP      FN
       LOCATION 0.7962  0.7849  0.7905  332     85      91
   ORGANIZATION 0.7059  0.6019  0.6497  192     80      127
         PERSON 0.8062  0.8592  0.8319  470     113     77
         Totals 0.7814  0.7711  0.7763  994     278     295
}}}
 In the output, the first column contains the input tokens, the second column the correct (gold) answers, and the third column the answers predicted by the model. Observe the differences. Copy the evaluation result to `<YOUR_FILE>`. Try to estimate in how many cases the model missed an entity, detected its boundaries incorrectly, or classified it incorrectly.
10. Evaluate the model on `dtest` with only NEs that are not present in the train data:
{{{
java -cp stanford-ner-2017-06-09/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz \
  -testFile named_ent_dtest_unknown.tsv
}}}

 Copy the result to `<YOUR_FILE>`.
11. Test on your own input:
{{{
java -mx600m -cp stanford-ner-2017-06-09/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier cnec-3class-model.ser.gz -textFile sample.txt
}}}

 Copy the result to `<YOUR_FILE>`.

12. (optional) Try to improve the results. Suggestions: set `useKnownLCWords` to false, add gazetteers, remove punctuation, or try to change the `wordShape` setting (a value following the pattern `dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1`) or other word shape features (see the documentation). Copy the result to `<YOUR_FILE>`.
13. (optional) Evaluate the model on `dtest`; do the final evaluation on `etest`.
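
The error estimate asked for in the `dtest` evaluation step can be approximated with a short script. The sketch below assumes that the test output was saved as tab-separated lines with three columns (token, gold label, predicted label), treats every maximal run of identical non-`O` labels as one entity, and counts an entity as correct only if both its boundaries and its class match. All function names are illustrative, and the counts may differ slightly from those reported by `CRFClassifier`.

```python
def read_columns(lines, outside="O"):
    """Collect gold and predicted labels from token<TAB>gold<TAB>predicted lines."""
    gold, predicted = [], []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            gold.append(parts[1])
            predicted.append(parts[2])
        elif not line.strip():
            # Keep entities from spanning a document boundary.
            gold.append(outside)
            predicted.append(outside)
    return gold, predicted


def entity_spans(labels, outside="O"):
    """Return the set of (start, end, label) for maximal runs of one entity label."""
    spans = set()
    start = None
    for i, label in enumerate(labels + [outside]):  # sentinel closes the last run
        if start is not None and label != labels[start]:
            spans.add((start, i, labels[start]))
            start = None
        if start is None and label != outside:
            start = i
    return spans


def prf(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def analyse(gold_labels, predicted_labels):
    """Score exact entity matches and classify the gold entities that were not matched."""
    gold = entity_spans(gold_labels)
    predicted = entity_spans(predicted_labels)
    scores = prf(len(gold & predicted), len(predicted - gold), len(gold - predicted))
    errors = {"missed": 0, "boundary": 0, "class": 0}
    for g_start, g_end, _ in gold - predicted:
        overlapping = [p for p in predicted if p[0] < g_end and g_start < p[1]]
        if not overlapping:
            errors["missed"] += 1
        elif any(p[:2] == (g_start, g_end) for p in overlapping):
            errors["class"] += 1     # same boundaries, different class
        else:
            errors["boundary"] += 1  # overlapping prediction, different boundaries
    return scores, errors


# Illustrative label sequences, not real model output.
gold = ["PERSON", "PERSON", "O", "LOCATION", "O", "ORGANIZATION", "ORGANIZATION", "O", "PERSON"]
pred = ["PERSON", "PERSON", "O", "ORGANIZATION", "O", "ORGANIZATION", "O", "O", "O"]
print(analyse(gold, pred))
print([round(value, 4) for value in prf(994, 278, 295)])
```

Applied to the counts in the `Totals` row of the sample output, `prf(994, 278, 295)` gives 0.7814, 0.7711, and 0.7763 after rounding, which agrees with the reported scores.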