Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2021/InformationExtraction


Timestamp:
Aug 30, 2022, 10:39:15 AM
Author:
Ales Horak
Comment:

copied from private/NlpInPracticeCourse/InformationExtraction

  • en/NlpInPracticeCourse/2021/InformationExtraction

     1= Extracting structured information from text =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák
     4
     5Prepared by: Zuzana Nevěřilová
     6
     7
     8== State of the Art ==
     9
     10Information extraction (IE) is a technology based on
     11analyzing natural language in order to extract snippets
     12of information. The process takes texts (and sometimes
     13speech) as input and produces fixed-format, unambiguous
     14data as output. This data may be used directly for
     15display to users, or may be stored in a database or
     16spreadsheet for later analysis, or may be used for
     17indexing purposes in information retrieval (IR) applications
     18such as Internet search engines like Google.
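
To make the "text in, fixed-format data out" idea concrete, here is a minimal Python sketch. It is only an illustration and not how GATE works internally: the two regular expressions, the record layout, and the sample sentence are assumptions made up for this example.

```python
import re

# Hypothetical patterns for illustration only; real IE systems rely on
# much richer rules, gazetteers, or statistical models.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")
MONEY = re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)\b")


def extract(text):
    """Return fixed-format records (type, character offsets, matched text)."""
    records = []
    for ann_type, pattern in (("Date", DATE), ("Money", MONEY)):
        for m in pattern.finditer(text):
            records.append({
                "type": ann_type,
                "start": m.start(),
                "end": m.end(),
                "text": m.group(0),
            })
    # Order the records by their position in the text.
    return sorted(records, key=lambda r: r["start"])


text = "Acme Corp. paid $12.5 million for the startup on 3 March 2021."
records = extract(text)
for record in records:
    print(record)
```

Records of this shape can be written to a database or spreadsheet, or used as index terms, which is the kind of downstream use described above.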
     19
     20=== References ===
     21
     22 1. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
     23 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
     24 1. Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. [https://arxiv.org/abs/2007.04247 Neural relation extraction: a survey.] arXiv preprint arXiv:2007.04247, 2020.
     25
     26== Practical Session ==
     27
     28We will extract information from news articles using GATE.
     29
     30 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
     31 1. Download GATE from https://gate.ac.uk/download/ and install it (Java 8 is required). Use either the MS installer or the Java installer; the Java installer can be run as an app or from the command line:
     32 {{{
     33java -jar gate-<VERSION>-installer.jar
     34}}}
     35 1. Run GATE:
     36 {{{
     37GATE_Developer_<VERSION>/bin/gate.sh
     38}}}
     39 1. Load ANNIE (with defaults) and read about its components [[br]]
     40 [[Image(annie.png)]]
     41 1. Create document(s):
     42   * right click on `Language Resources/New/GATE Document` in the left menu
     43   * change {{{markupAware}}} to {{{false}}}
     44   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
     45   * repeat these steps for each additional document
     46   * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
     47 1. Create corpus:
     48   * right click on `Language Resources/New/GATE Corpus` in the left menu
     49   * drag and drop the documents into the corpus
     50 1. Run ANNIE: Click on `Applications/Annie` in the left menu, select `Corpus`
     51 1. Observe the annotated results, click on a document, then `Annotation Sets` and/or `Annotation List`.
     52
     53So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`.
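
The `Lookup` annotations come from the gazetteer, which marks spans of text that match entries in its word lists. The Python sketch below illustrates the idea with a longest-match lookup. The three entries, the `majorType` values, and the case-insensitive matching are assumptions chosen for the example; they are not ANNIE's actual lists or default settings.

```python
import re

# Hypothetical gazetteer: tokenized phrase -> majorType.
# ANNIE's real lists are far larger; these entries are made up.
GAZETTEER = {
    ("prime", "minister"): "jobtitle",
    ("minister",): "jobtitle",
    ("new", "york"): "location",
}
MAX_LEN = max(len(phrase) for phrase in GAZETTEER)


def lookup(text):
    """Annotate the longest gazetteer match starting at each token."""
    tokens = [(m.group(0), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            # Lowercasing is a simplification made for this sketch.
            key = tuple(tok[0].lower() for tok in tokens[i:i + n])
            if key in GAZETTEER:
                annotations.append({
                    "type": "Lookup",
                    "majorType": GAZETTEER[key],
                    "start": tokens[i][1],
                    "end": tokens[i + n - 1][2],
                })
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no entry starts at this token
    return annotations


text = "The prime minister arrived in New York."
lookups = lookup(text)
for a in lookups:
    print(a["majorType"], repr(text[a["start"]:a["end"]]))
```

Because the longest phrase is tried first, "prime minister" is annotated as one span rather than as a bare "minister".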
     54
     55We now add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].
     56
     57 1. Right click `Processing Resources/New/JAPE Transducer` in the left menu
     58 1. Download the grammar(s).
     59 1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}
     60 1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
     61 1. Run ANNIE again: Click on `Applications/Annie` in the left menu
     62 1. Observe the annotated results, click on a document, then `Annotation Sets` and/or `Annotation List`. If the grammar matched anything, you will see the new annotation type `JobTitle`.
     63 1. Observe the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}
     64 1. Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.
     65 1. Optionally, add further documents and observe how well the {{{jobtitleperson.jape}}} grammar generalizes.
     66 1. Following the example of the above grammars, write your own grammar that extracts a new relation (e.g. ''job title in company'' or ''person works in company'').
     67
     68Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and the NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try one).
     69Copy your grammar from the last step to {{{<YOUR_FILE>}}}.
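
When designing your own grammar, it can help to think of a rule as a pattern over existing annotations rather than over raw text. The Python sketch below mimics that idea for a "job title followed by a person name" pattern. It does not reproduce the attached grammars, and it is not JAPE: the `JobTitlePerson` annotation name, the helper `ann`, and the sample sentence are hypothetical.

```python
def ann(text, ann_type, phrase):
    """Hypothetical helper: build an annotation for the first occurrence of phrase."""
    start = text.index(phrase)
    return {"type": ann_type, "start": start, "end": start + len(phrase)}


def match_jobtitle_person(text, annotations):
    """Find a JobTitle annotation directly followed by a Person annotation."""
    anns = sorted(annotations, key=lambda a: a["start"])
    relations = []
    for left, right in zip(anns, anns[1:]):
        # Only whitespace may separate the two annotations.
        gap = text[left["end"]:right["start"]]
        if (left["type"] == "JobTitle" and right["type"] == "Person"
                and gap.strip() == ""):
            relations.append({
                "type": "JobTitlePerson",  # assumed name, for illustration
                "start": left["start"],
                "end": right["end"],
                "jobTitle": text[left["start"]:left["end"]],
                "person": text[right["start"]:right["end"]],
            })
    return relations


text = "Director Jane Smith met chairman John Brown in Brno."
annotations = [
    ann(text, "JobTitle", "Director"),
    ann(text, "Person", "Jane Smith"),
    ann(text, "JobTitle", "chairman"),
    ann(text, "Person", "John Brown"),
    ann(text, "Location", "Brno"),
]
relations = match_jobtitle_person(text, annotations)
for r in relations:
    print(r["jobTitle"], "->", r["person"])
```

A relation such as ''person works in company'' follows the same scheme with different annotation types on the left and right side, and possibly specific tokens allowed in the gap.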