Extracting structured information from text

IA161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Zuzana Nevěřilová

State of the Art

Information extraction (IE) is a technology based on analyzing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as Internet search engines like Google.
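
As a minimal illustration of the "text in, fixed-format data out" idea, the following Python sketch turns a sentence into a list of records using a single regular expression. The pattern, the sample sentence, and the field names (`person`, `job_title`, `organization`) are assumptions made for this example only; practical IE systems rely on much richer linguistic processing than one regex.

```python
import re

# Hypothetical surface pattern: "<Name Surname>, <job title> of|at <Organization>"
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"(?P<job_title>[a-z][a-z ]*?) (?:of|at) "
    r"(?P<organization>[A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_records(text):
    """Return one dict per match: fixed-format records from free text."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


sample = (
    "Jane Smith, chief executive of Acme Corp, met "
    "John Brown, head of research at Initech."
)
records = extract_records(sample)
for record in records:
    print(record)
```

Each record has the same keys, so the output could be written straight to a database table or a spreadsheet.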

References

Cunningham, Hamish. An Introduction to Information Extraction. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. Neural relation extraction: a survey. arXiv e-prints (2020): arXiv-2007.

Practical Session

We will extract information from news articles using GATE.

Create <YOUR_FILE>, a text file named ia161-UCO-08.txt where UCO is your university ID.
Download GATE from https://gate.ac.uk/download/ and install it (Java 8 is required). Use either the MS installer or the Java installer; the Java installer can be run as an application or from the command line:
```
java -jar gate-<VERSION>-installer.jar
```
Run GATE:
```
GATE_Developer_<VERSION>/bin/gate.sh
```
Load ANNIE (with defaults) and read about its components.
Create document(s):
- right click on Language Resources/New/GATE Document in the left menu
- change markupAware to false
- change sourceUrl to stringContent and paste some news text
- repeat these steps for each additional document
- you can find three sample texts here: text1.txt, text2.txt, text3.txt
Create corpus:
- right click on Language Resources/New/GATE Corpus in the left menu
- drag and drop the documents into the corpus
Run ANNIE: click on Applications/Annie in the left menu and select the corpus.
Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List.

So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type Lookup, which marks text spans matched by the gazetteer lists.
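
To make the idea behind Lookup annotations concrete, here is a rough Python sketch of gazetteer matching over tokens. It is not GATE's implementation: the three-entry `GAZETTEER`, the `tokenize` and `lookup` helpers, and the longest-match strategy are assumptions for illustration, with the `majorType` feature modeled loosely on what ANNIE shows in the Annotation List.

```python
import re

# Hypothetical mini-gazetteer: entry -> majorType (real lists are far larger).
GAZETTEER = {
    "chief executive": "jobtitle",
    "professor": "jobtitle",
    "london": "location",
}


def tokenize(text):
    """Return Token annotations as dicts with character offsets."""
    return [
        {"type": "Token", "start": m.start(), "end": m.end(), "string": m.group()}
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


def lookup(text, tokens, max_len=3):
    """Return Lookup annotations, preferring the longest match at each token."""
    annotations = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try spans of max_len tokens first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start, end = tokens[i]["start"], tokens[i + n - 1]["end"]
            major_type = GAZETTEER.get(text[start:end].lower())
            if major_type:
                annotations.append(
                    {"type": "Lookup", "start": start, "end": end,
                     "majorType": major_type}
                )
                step = n
                break
        i += step
    return annotations


text = "The chief executive flew to London."
lookups = lookup(text, tokenize(text))
for a in lookups:
    print(text[a["start"]:a["end"]], a["majorType"])
```

A Lookup only says that a span appears in some list; deciding whether it really is a job title or a location in context is left to later pipeline stages.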

We will now add rules for extracting job titles and the names of the people who hold them. The rules are defined in the grammars jobtitle.jape and jobtitleperson.jape.

Right click Processing Resources/New/JAPE Transducer in the left menu
Download the grammar(s).
Click on grammarURL and choose the grammar file jobtitle.jape
Click on Applications/Annie in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
Run ANNIE again: Click on Applications/Annie in the left menu
Observe the annotated results: click on a document, then on Annotation Sets and/or Annotation List. If the grammar matched anything, you will see a new annotation type, JobTitle.
Observe the grammars jobtitle.jape and jobtitleperson.jape
Add a new transducer with the grammar jobtitleperson.jape and observe the results.
Optionally, you can add further documents and observe how universal the jobtitleperson.jape grammar is.
Using the above grammars as a model, write your own grammar that extracts new relations (e.g. job title in company, or person works in company).
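
Before writing the grammar, it can help to think of a rule as a pattern over a sequence of existing annotations that produces a new annotation. The Python sketch below mimics that idea for a "job title in company" relation. It does not reproduce jobtitle.jape or jobtitleperson.jape and is not JAPE syntax; the pattern (JobTitle, then the token "of" or "at", then Organization), the `JobTitleInCompany` type name, and the hand-built annotation list are all assumptions for illustration.

```python
def match_job_in_company(annotations):
    """Find JobTitle, Token 'of'/'at', Organization in direct sequence.

    `annotations` is a list of dicts sorted by start offset, standing in
    for what earlier pipeline stages would have produced.
    """
    relations = []
    triples = zip(annotations, annotations[1:], annotations[2:])
    for first, middle, last in triples:
        if (
            first["type"] == "JobTitle"
            and middle["type"] == "Token"
            and middle["string"].lower() in {"of", "at"}
            and last["type"] == "Organization"
        ):
            relations.append({
                "type": "JobTitleInCompany",  # hypothetical annotation type
                "start": first["start"],
                "end": last["end"],
                "job": first["string"],
                "company": last["string"],
            })
    return relations


text = "Mary Jones, chief engineer at Globex, resigned."


def ann(type_, string):
    """Build an annotation for the first occurrence of `string` in `text`."""
    start = text.index(string)
    return {"type": type_, "start": start, "end": start + len(string),
            "string": string}


# Hand-built input; in GATE these would come from ANNIE and the transducers.
annotations = [
    ann("Person", "Mary Jones"),
    ann("Token", ","),
    ann("JobTitle", "chief engineer"),
    ann("Token", "at"),
    ann("Organization", "Globex"),
]
relations = match_job_in_company(annotations)
for r in relations:
    print(r["job"], "->", r["company"], "|", text[r["start"]:r["end"]])
```

The sketch requires the three annotations to be directly adjacent, so it misses variants such as "chief engineer and founder of Globex"; how much variation a pattern tolerates is the same trade-off you will face in your own grammar.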

Write your observations to <YOUR_FILE>. In particular, comment on how well the Gazetteer and the NE Transducer perform, and describe how well the grammars work. Note that no coreference resolution is used (optionally, you can try adding one). Copy the grammar you wrote in the last step to <YOUR_FILE>.

Last modified on Aug 30, 2022, 10:39:15 AM

Attachments (10)

wiki.txt (1.8 KB) - added by Ales Horak 3 years ago.
wiki.phrases (25.1 KB) - added by Ales Horak 3 years ago.
wiki.output (883 bytes) - added by Ales Horak 3 years ago.
demo.py (1.3 KB) - added by Ales Horak 3 years ago.
jobtitle.jape (322 bytes) - added by Ales Horak 3 years ago.
jobtitleperson.jape (649 bytes) - added by Ales Horak 3 years ago.
text1.txt (1.4 KB) - added by Ales Horak 3 years ago.
text2.txt (355 bytes) - added by Ales Horak 3 years ago.
text3.txt (2.1 KB) - added by Ales Horak 3 years ago.
annie.png (8.9 KB) - added by Ales Horak 3 years ago.