InformationExtraction

Timestamp:: Aug 30, 2022, 10:39:15 AM (3 years ago)
Author:: Ales Horak
Comment:: copied from private/NlpInPracticeCourse/InformationExtraction

en/NlpInPracticeCourse/2021/InformationExtraction

                       v1
+= Extracting structured information from text =
+[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák
+Prepared by: Zuzana Nevěřilová
+== State of the Art ==
+Information extraction (IE) is a technology based on
+analyzing natural language in order to extract snippets
+of information. The process takes texts (and sometimes
+speech) as input and produces fixed-format, unambiguous
+data as output. This data may be used directly for
+display to users, or may be stored in a database or
+spreadsheet for later analysis, or may be used for
+indexing purposes in information retrieval (IR) applications
+such as Internet search engines like Google.
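As a toy illustration of "text in, fixed-format data out", the following minimal sketch extracts records from a sentence with a single regular expression. The `Appointment` record, the pattern, and the sample sentence are assumptions made for this example; real IE systems such as GATE use far richer pipelines.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Appointment:
    """Hypothetical fixed-format record produced by the extractor."""
    person: str
    title: str
    organization: str


# Assumed surface pattern: "<Person>, <job title> of <Organization>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"(?P<title>[a-z]+(?: [a-z]+)*) of "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_appointments(text):
    """Return one Appointment per pattern match in the text."""
    return [
        Appointment(m["person"], m["title"], m["org"])
        for m in PATTERN.finditer(text)
    ]


text = "Jane Smith, chief executive of Acme Corp, spoke on Monday. Analysts disagreed."
records = extract_appointments(text)
for record in records:
    print(record)
```

The resulting records could be written to a database or a spreadsheet, which is the kind of downstream use described above.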
+=== References ===
+. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005.
+. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
+. Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. [https://arxiv.org/abs/2007.04247 Neural relation extraction: a survey.] arXiv e-prints (2020): arXiv-2007.
+== Practical Session ==
+We will extract information from news articles using GATE.
+. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
+. Download and install GATE from https://gate.ac.uk/download/ (Java 8 is required). Use either the MS installer or the Java installer; the Java installer can be run as an app or from the command line:
+ {{{
+java -jar gate-<VERSION>-installer.jar
+}}}
+. Run GATE
+ {{{
+GATE_Developer_<VERSION>/bin/gate.sh
+}}}
+. Load ANNIE (with defaults) and read about its components [[br]]
+ [[Image(annie.png)]]
+. Create document(s):
+   * right click on `Language Resources/New/GATE Document` in the left menu
+   * change {{{markupAware}}} to {{{false}}}
+   * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
+   * repeat these steps for each additional document
+   * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
+. Create corpus:
+   * right click on `Language Resources/New/GATE Corpus` in the left menu
+   * drag and drop the documents into the corpus
+. Run ANNIE: Click on `Applications/Annie` in the left menu, select `Corpus`
+. Observe the annotated results, click on a document, then `Annotation Sets` and/or `Annotation List`.
+So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`.
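To make the idea of `Lookup` annotations more concrete, here is a minimal sketch of a gazetteer lookup over tokens with character offsets. It is not GATE's implementation: the miniature `GAZETTEER`, the tokenizer, and the greedy longest-match strategy are assumptions chosen for illustration only.

```python
import re

# Hypothetical miniature gazetteer: lower-cased phrase -> majorType.
GAZETTEER = {
    "chief executive": "jobtitle",
    "president": "jobtitle",
    "london": "location",
}


def tokenize(text):
    """Return tokens as (start, end, string) tuples with character offsets."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def lookup(text, tokens, max_len=3):
    """Greedy longest-match lookup of token sequences in the gazetteer."""
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            phrase = text[start:end].lower()
            if phrase in GAZETTEER:
                annotations.append({
                    "type": "Lookup",
                    "start": start,
                    "end": end,
                    "majorType": GAZETTEER[phrase],
                })
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations


text = "The chief executive flew to London."
tokens = tokenize(text)
lookups = lookup(text, tokens)
for annotation in lookups:
    print(annotation, repr(text[annotation["start"]:annotation["end"]]))
```

Each annotation is only a span of offsets plus features; later processing steps, such as grammar rules, can then match on these annotations instead of on raw strings.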
+We now add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape].
+. Right click `Processing Resources/New/JAPE Transducer` in the left menu
+. Download the grammar(s).
+. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}
+. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
+. Run ANNIE again: Click on `Applications/Annie` in the left menu
+. Observe the annotated results: click on a document, then `Annotation Sets` and/or `Annotation List`. If the grammar matched anything, you will see a new annotation type `JobTitle`.
+. Observe the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}
+. Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.
+. Optionally, you can add further documents and observe how universal the {{{jobtitleperson.jape}}} grammar is.
+. Using the above grammars as a model, write your own grammar that extracts new relations (e.g. job title in a company, or person works in a company).
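The general idea behind a rule that links a job title to a person can be sketched outside GATE as pattern matching over annotation spans. The sketch below is not a translation of {{{jobtitleperson.jape}}}, whose contents are not shown here; the annotation dictionaries, the `JobTitlePerson` relation name, and the `max_gap` parameter are all assumptions for illustration.

```python
def annotate(text, phrase, ann_type):
    """Hypothetical helper: build an annotation for the first occurrence of phrase."""
    start = text.index(phrase)
    return {"type": ann_type, "start": start, "end": start + len(phrase)}


def match_title_person(text, annotations, max_gap=2):
    """Emit a relation when a JobTitle is directly followed by a Person.

    Only spaces and commas, at most max_gap characters, may separate the two.
    """
    ordered = sorted(annotations, key=lambda a: a["start"])
    relations = []
    for left, right in zip(ordered, ordered[1:]):
        if left["type"] == "JobTitle" and right["type"] == "Person":
            gap = text[left["end"]:right["start"]]
            if len(gap) <= max_gap and gap.strip(" ,") == "":
                relations.append({
                    "type": "JobTitlePerson",
                    "title": text[left["start"]:left["end"]],
                    "person": text[right["start"]:right["end"]],
                    "start": left["start"],
                    "end": right["end"],
                })
    return relations


text = "Acme named chief executive Jane Smith on Monday."
annotations = [
    annotate(text, "Jane Smith", "Person"),
    annotate(text, "chief executive", "JobTitle"),
]
relations = match_title_person(text, annotations)
print(relations)
```

A grammar for a new relation, such as person works in company, follows the same shape: choose the annotation types to match, constrain what may appear between them, and create a new annotation spanning the match.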
+Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try one).
+Copy your grammar from the last step to {{{<YOUR_FILE>}}}.