Version 23 (modified by Ales Horak, 6 months ago)

Named Entity Recognition

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová

State of the Art

NER aims to recognize and classify names of people, locations, organizations, products, and artworks, and sometimes also dates, monetary amounts, measurements (numbers with units), law or patent numbers, etc. Known issues are the ambiguity of words (e.g. May can be a month, a verb, or a name), the ambiguity of classes (e.g. Queen Elizabeth can be a person, while HMS Queen Elizabeth is a ship), and the inherent incompleteness of lists of NEs.
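The ambiguity and incompleteness issues can be illustrated with a toy gazetteer lookup. This is only a minimal sketch: the `GAZETTEER` dictionary, its class labels, and the `lookup` function are hypothetical names introduced for illustration, not part of any course material or library.

```python
# Hypothetical toy gazetteer: surface form -> set of possible NE classes.
# The entries and class labels are illustrative assumptions.
GAZETTEER = {
    "May": {"DATE", "PERSON"},
    "Queen Elizabeth": {"PERSON", "SHIP"},
    "Pink Floyd": {"ORGANIZATION"},
}


def lookup(mention):
    """Return the sorted candidate classes for a mention, or [] if it is unlisted."""
    return sorted(GAZETTEER.get(mention, set()))


print(lookup("May"))              # ['DATE', 'PERSON']  (ambiguous)
print(lookup("Hannibal Lecter"))  # []  (the list is incomplete)
```

A lookup alone cannot choose between the candidate classes, and it returns nothing for names missing from the list, which is why context-aware models are needed.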

Named entity recognition (NER) is used mainly in information extraction (IE), but it can also significantly improve other NLP tasks such as syntactic parsing.

Example from IE

In 2003, Hannibal Lecter (as portrayed by Hopkins) was chosen by the American Film Institute as the number one movie villain.

Hannibal Lecter <-> Hopkins

Example concerning syntactic parsing

Wish You Were Here is the ninth studio album by the English progressive rock group Pink Floyd.

vs.

Wish_You_Were_Here is the ninth studio album by the English progressive rock group Pink Floyd.
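The second variant treats the recognized entity as a single token, so a parser no longer has to analyse "Wish You Were Here" as a clause. A minimal sketch of such a preprocessing step follows; the function name `merge_entities` and the span format (token offsets, end exclusive) are assumptions made for this illustration.

```python
def merge_entities(tokens, spans):
    """Join the tokens covered by each (start, end) span with underscores.

    Spans use token offsets with an exclusive end and must be non-empty
    and non-overlapping (an assumption of this sketch).
    """
    ends = {}
    for start, end in spans:
        if end <= start:
            raise ValueError("empty or reversed span: %r" % ((start, end),))
        ends[start] = end

    merged, i = [], 0
    while i < len(tokens):
        if i in ends:
            # Replace the whole entity span with one underscore-joined token.
            merged.append("_".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "Wish You Were Here is the ninth studio album".split()
print(merge_entities(tokens, [(0, 4)]))
# ['Wish_You_Were_Here', 'is', 'the', 'ninth', 'studio', 'album']
```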

References

  1. David Nadeau, Satoshi Sekine: A survey of named entity recognition and classification. In Satoshi Sekine and Elisabete Ranchhod (eds.) Named Entities: Recognition, classification and use. Lingvisticæ Investigationes 30:1. 2007. pp. 3–26 http://brown.cl.uni-heidelberg.de/~sourjiko/NER_Literatur/survey.pdf
  2. Charles Sutton and Andrew McCallum: An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 4 (4). 2012. http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
  3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. https://arxiv.org/abs/1810.04805

Practical Session

Czech Named Entity Recognition

In this workshop, we train a new NER application for the Czech language. We work with free resources and software tools: the Czech Named Entity Corpus (CNEC) and the fastText pre-trained word embeddings. We build a neural network to solve the task.

  1. Create <YOUR_FILE>, a text file named ia161-UCO-04.txt where UCO is your university ID.
  2. Open Google Colab at https://colab.research.google.com/drive/1mnz-P30CLxrxQ0yyqpcLwVJgi7e59shi?usp=sharing
  3. Follow the instructions in the notebook. There are three obligatory tasks. Write your answers in <YOUR_FILE>.
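Neural NER taggers are commonly trained on one label per token. The sketch below shows the widely used BIO encoding of entity spans. It is an assumption for illustration only: the notebook may use a different scheme, the `PER` label is not taken from the CNEC tag set, and `to_bio` is a hypothetical helper name.

```python
def to_bio(tokens, entities):
    """Convert entity spans to BIO tags.

    entities: list of (start, end, label) with token offsets, end exclusive,
    assumed to be non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # tokens inside the entity
    return tags


tokens = ["Hannibal", "Lecter", "was", "portrayed", "by", "Hopkins"]
tags = to_bio(tokens, [(0, 2, "PER"), (5, 6, "PER")])
print(list(zip(tokens, tags)))
```

With this encoding, each token's word embedding is the input and its tag is the target the network learns to predict.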
