= Named Entity Recognition =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161 Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová

== TODO until 31.5.2015 ==

 1. choose particular papers for [[#References|References]] below (that will serve as input for the lecture later on)
 1. prepare the [[#PracticalSession|Practical Session]]

== State of the Art ==

Named entity recognition (NER) aims to ''recognize'' and ''classify'' names of people, locations, organizations, products, and artworks, and sometimes also dates, amounts of money, measurements (numbers with units), law or patent numbers, etc. Known issues are the ambiguity of words (e.g. ''May'' can be a month, a verb, or a name), the ambiguity of classes (e.g. ''HMS Queen Elizabeth'' is a ship although it contains the name of a person), and the inherent incompleteness of lists of named entities (NEs).

NER is used mainly in information extraction (IE), but it can also significantly improve other NLP tasks such as syntactic parsing.

=== Example from IE ===

In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the #1 movie villain.

Extracted relation: Hannibal Lecter (character) <-> Hopkins (actor)

=== Example concerning syntactic parsing ===

Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.

vs.

Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
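
In the first variant, a parser may analyse ''Wish You Were Here'' as a clause; in the second, the recognized name is a single token that can act as the subject. A minimal sketch of this preprocessing step follows, assuming the multi-word names have already been recognized. The function name `merge_entities` and the list of names are hypothetical and serve only as an illustration.

```python
def merge_entities(text, entities):
    """Join each known multi-word name into one underscore-separated token."""
    # Process longer names first so that a short name cannot split a longer one.
    for name in sorted(entities, key=len, reverse=True):
        text = text.replace(name, name.replace(" ", "_"))
    return text


sentence = ("Wish You Were Here is the ninth studio album by the "
            "English progressive rock group Pink Floyd.")

# Hypothetical NER output: only the album title is merged here.
merged = merge_entities(sentence, ["Wish You Were Here"])
print(merged)
```

The merged sentence can then be passed to the parser in place of the original one.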

=== References ===

 1. David Nadeau, Satoshi Sekine: A survey of named entity recognition and classification. In Satoshi Sekine and Elisabete Ranchhod (eds.) Named Entities: Recognition, classification and use. Lingvisticæ Investigationes 30:1. 2007. pp. 3–26 [[http://brown.cl.uni-heidelberg.de/~sourjiko/NER_Literatur/survey.pdf]]
 1. Charles Sutton, Andrew !McCallum: An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 4:4. 2012. [[http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf]]

== Practical Session ==

Try the naive gazetteer method (implement a substring search) on the prepared data.
Observe the results:
  1. what happens to every string present in the gazetteer?
  1. what happens to NEs not present in the gazetteer?
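
One possible starting point for the substring search is sketched below. The gazetteer entries, the labels, and the sample sentence are toy assumptions, not the prepared data, and the function name `gazetteer_tag` is hypothetical.

```python
def gazetteer_tag(text, gazetteer):
    """Return sorted (start, end, surface, label) tuples for every
    occurrence of a gazetteer string in text (plain substring search)."""
    matches = []
    for entry, label in gazetteer.items():
        start = text.find(entry)
        while start != -1:
            matches.append((start, start + len(entry), entry, label))
            # Continue searching right after the beginning of this match.
            start = text.find(entry, start + 1)
    return sorted(matches)


# Toy gazetteer (assumption): surface string -> class label.
gazetteer = {"May": "PERSON", "Pink Floyd": "ORGANIZATION"}
text = "In May, Brian May and the Mayor met Pink Floyd and Roger Waters."

matches = gazetteer_tag(text, gazetteer)
for start, end, surface, label in matches:
    print(start, end, surface, label)
```

Compare the printed matches with the sentence: check which occurrences of ''May'' were tagged and whether ''Roger Waters'' appears in the output.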

Try a machine learning approach (use Stanford NER) with the prepared data.
Observe the results:
  1. measure precision, recall, and F1-score on the test data
  1. find NEs not present in the training data
  1. find NEs that were not recognized
  1. discuss what types of NEs are easy/difficult to recognize
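
The measurements above can be sketched as follows, assuming that gold and predicted entities are represented as sets of `(start, end, label)` tuples and that only exact matches count as correct. This representation and the toy values are assumptions; the output of Stanford NER has to be converted to it first.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over two sets of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations (assumption): (start token, end token, class label).
gold = {(0, 2, "PER"), (5, 7, "ORG"), (9, 10, "LOC"), (12, 13, "PER")}
predicted = {(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "LOC")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

# NEs that were not recognized: in the gold data but not in the prediction.
missed = gold - predicted
print(sorted(missed))
```

The same set difference, applied to the entity strings of the test and training data, gives the NEs not present in the training data.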