Named Entity Recognition
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Zuzana Nevěřilová
State of the Art
NER aims to recognize and classify names of people, locations, organizations, products, and artworks, and sometimes dates, money, measurements (numbers with units), law or patent numbers, etc. Known issues are the ambiguity of words (e.g. May can be a month, a verb, or a name), the ambiguity of classes (e.g. HMS Queen Elizabeth can be a ship rather than a person), and the inherent incompleteness of lists of NEs.
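The list-lookup limitations above can be illustrated with a toy gazetteer sketch. The word lists here are tiny made-up placeholders, not real gazetteers:

```python
# Naive gazetteer lookup, illustrating why NER cannot rely on word lists alone.
PERSON_NAMES = {"May", "Hopkins"}   # "May" is also a month and a verb
MONTHS = {"May", "June"}

def lookup(token):
    """Return all classes a bare word-list lookup would assign to a token."""
    classes = []
    if token in PERSON_NAMES:
        classes.append("PERSON")
    if token in MONTHS:
        classes.append("MONTH")
    return classes or ["O"]  # "O" = outside any named entity

print(lookup("May"))     # ambiguous: ['PERSON', 'MONTH']
print(lookup("Lecter"))  # missing from every list: ['O'] — incompleteness
```

Resolving such ambiguity requires context, which is why statistical sequence models (such as the CRF used below) outperform plain list lookup.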
Named entity recognition (NER) is used mainly in information extraction (IE), but it can also significantly improve other NLP tasks such as syntactic parsing.
Example from IE
In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain.
Hannibal Lecter <-> Hopkins
Example concerning syntactic parsing
Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.
vs.
Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
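Joining a recognized multiword NE into a single token, as in the second sentence above, can be sketched as a simple preprocessing step (the function name is illustrative):

```python
def join_named_entities(sentence, entities):
    """Replace each known multiword NE with an underscore-joined token,
    so that a syntactic parser treats it as a single unit."""
    for ne in entities:
        sentence = sentence.replace(ne, ne.replace(" ", "_"))
    return sentence

s = "Wish You Were Here is the ninth studio album by Pink Floyd."
print(join_named_entities(s, ["Wish You Were Here", "Pink Floyd"]))
# Wish_You_Were_Here is the ninth studio album by Pink_Floyd.
```

Without this step, a parser would most likely analyze "Wish You Were Here" as a clause rather than as the subject of the sentence.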
References
- David Nadeau, Satoshi Sekine: A survey of named entity recognition and classification. In Satoshi Sekine and Elisabete Ranchhod (eds.) Named Entities: Recognition, classification and use. Lingvisticæ Investigationes 30:1. 2007. pp. 3–26 http://brown.cl.uni-heidelberg.de/~sourjiko/NER_Literatur/survey.pdf
- Charles Sutton and Andrew McCallum: An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 4 (4). 2012. http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
Practical Session
Czech Named Entity Recognition
In this workshop, we train a new NER application for the Czech language. We work with free resources and software tools: the Czech Named Entity Corpus (CNEC) and the Stanford NER application.
Requirements: Java 8, Python, several gigabytes of memory, convert_cnec_stanford.py, get_unknown.py, cnec.prop
- Create <YOUR_FILE>, a text file named ia161-UCO-04.txt, where UCO is your university ID.
- Get the data: download CNEC from the LINDAT/CLARIN repository (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-1B22-8).
- Open the NE hierarchy:

```
evince cnec2.0/doc/ne-type-hierarchy.pdf
```
- The data is organized into three disjoint datasets: the training data is called train, the development test data is called dtest, and the final evaluation data is called etest.
- Convert the train data to the Stanford NER format:

```
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_train.xml \
    > named_ent_train.tsv
```
Note that we removed documents that did not contain NEs. You can experiment with this option later.
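Stanford NER training data is column-oriented: one token and its label per line, separated by a tab, with a blank line between sentences. A minimal sketch of writing that format (the function name and the toy sentence are illustrative, not part of convert_cnec_stanford.py):

```python
def write_stanford_tsv(sentences, path):
    """Write tokenized, labeled sentences in the Stanford NER TSV format:
    one 'token<TAB>label' pair per line, sentences separated by a blank line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, label in sentence:
                f.write(f"{token}\t{label}\n")
            f.write("\n")  # blank line ends the sentence

# Toy example using the 3-class setup of this session
sentences = [[("Karel", "PERSON"), ("Čapek", "PERSON"), ("wrote", "O"), ("R.U.R.", "O")]]
write_stanford_tsv(sentences, "toy_train.tsv")
```

The column positions must match the `map` setting in the .prop file (e.g. word=0,answer=1).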
- Download the Stanford NE recognizer from http://nlp.stanford.edu/software/CRF-NER.shtml (and read about it).
- Train the model using the default settings (cnec.prop). Note that convert_cnec_stanford.py only recognizes PERSON, LOCATION, and ORGANIZATION; you can extend the markup conversion later:

```
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
    edu.stanford.nlp.ie.crf.CRFClassifier \
    -prop cnec.prop
```
- Convert the test data to the Stanford NER format:

```
python convert_cnec_stanford.py cnec2.0/data/xml/named_ent_dtest.xml \
    > named_ent_dtest.tsv
```
- Evaluate the model on dtest:

```
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
    edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier cnec-3class-model.ser.gz \
    -testFile named_ent_dtest.tsv
```

You should see results like:

```
CRFClassifier tagged 19993 words in 900 documents at 2388.94 words per second.
Entity  P       R       F1      TP      FP      FN
LOC     0.7064  0.7586  0.7316  308     128     98
ORG     0.6943  0.5576  0.6185  184     81      146
OTHER   0.6224  0.6498  0.6358  590     358     318
PER     0.7727  0.8236  0.7974  425     125     91
Totals  0.6853  0.6977  0.6914  1507    692     653
```

In the output, the first column is the input tokens and the second column is the correct (gold) answers. Observe the differences. Copy the training result to <YOUR_FILE>. Try to estimate in how many cases the model missed an entity, detected its boundaries incorrectly, or classified an entity incorrectly.
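The P, R, and F1 columns in the evaluation table follow directly from the TP/FP/FN counts; a quick check of the Totals row:

```python
def precision(tp, fp):
    """Fraction of predicted entities that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of gold entities that were found."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Totals row from the output above: TP=1507, FP=692, FN=653
p = precision(1507, 692)
r = recall(1507, 653)
print(round(p, 4), round(r, 4), round(f1(p, r), 4))  # 0.6853 0.6977 0.6914
```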
- Evaluate the model on dtest using only NEs that are not present in the train data. First, keep only those documents that do not contain NEs from the training data, using the script get_unknown.py; then run the NER:

```
java -cp stanford-ner-2018-10-16/stanford-ner.jar \
    edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier cnec-3class-model.ser.gz \
    -testFile named_ent_dtest_unknown.tsv
```
Copy the result to <YOUR_FILE>.
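The idea behind this filtering step can be sketched as follows. This is not the actual get_unknown.py; it assumes documents are lists of (token, label) pairs and matches entities by token string, which the real script may do differently:

```python
def collect_entities(documents):
    """Collect the set of entity token strings seen in labeled documents.
    Each document is assumed to be a list of (token, label) pairs."""
    return {tok for doc in documents for tok, label in doc if label != "O"}

def keep_unknown_only(test_docs, train_entities):
    """Keep only test documents whose entities all fall outside the
    training set, so the model is evaluated on unseen names only."""
    kept = []
    for doc in test_docs:
        ents = {tok for tok, label in doc if label != "O"}
        if ents and not (ents & train_entities):
            kept.append(doc)
    return kept

train = [[("Praha", "LOCATION"), ("je", "O")]]
test = [[("Brno", "LOCATION")], [("Praha", "LOCATION")]]
print(len(keep_unknown_only(test, collect_entities(train))))  # 1
```

Scores on this filtered set show how well the model generalizes beyond memorized names.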
- Test on your own input:

```
java -mx600m -cp stanford-ner-2018-10-16/stanford-ner.jar \
    edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier cnec-3class-model.ser.gz \
    -textFile sample.txt
```
Copy the result to <YOUR_FILE>.
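With -textFile, the classifier typically prints each token with its label appended after a slash (check the outputFormat setting in the documentation if your output looks different). A small sketch of parsing that form back into pairs:

```python
def parse_slash_tags(line):
    """Parse Stanford NER plain-text output of the form 'token/LABEL ...'
    (the slashTags output format) into (token, label) pairs."""
    pairs = []
    for chunk in line.split():
        token, _, label = chunk.rpartition("/")
        pairs.append((token, label))
    return pairs

out = "Antonín/PERSON Dvořák/PERSON lived/O in/O Praha/LOCATION"
print(parse_slash_tags(out))
```

rpartition splits on the last slash, so tokens that themselves contain a slash keep their form.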
- (optional) Try to improve the training data suggestions: set useKnownLCWords to false, add gazetteers, remove punctuation, or change the word shape features (patterns such as dan[12](bio)?(UseLC)?, jenny1(useLC)?, chris[1234](useLC)?, cluster1; see the documentation). Copy the result to <YOUR_FILE>.
- (optional) Evaluate the model on dtest, then run the final evaluation on etest.
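For the optional tuning step, a sketch of the relevant property overrides. The key names come from the Stanford NER documentation; the file paths and gazetteer name are placeholders:

```
# trainFile/serializeTo/map as in the default cnec.prop
trainFile = named_ent_train.tsv
serializeTo = cnec-3class-model.ser.gz
map = word=0,answer=1

# turn off the known-lowercase-words feature
useKnownLCWords = false

# switch the word shape function (e.g. chris2useLC, dan2useLC, ...)
wordShape = chris2useLC

# add a gazetteer (placeholder file; see the docs for the entry format)
useGazettes = true
gazette = czech-gazetteer.txt
sloppyGazette = true
```

Retrain with -prop pointing at the modified file and compare the dtest scores against the baseline above.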
Attachments
- cnec.prop (351 bytes)