| 1 | = Extracting structured information from text = |
| 2 | |
| 3 | [[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák |
| 4 | |
| 5 | Prepared by: Zuzana Nevěřilová |
| 6 | |
| 7 | |
| 8 | == State of the Art == |
| 9 | |
| 10 | Information extraction (IE) is a technology based on |
| 11 | analyzing natural language in order to extract snippets |
| 12 | of information. The process takes texts (and sometimes |
| 13 | speech) as input and produces fixed-format, unambiguous |
| 14 | data as output. This data may be used directly for |
| 15 | display to users, or may be stored in a database or |
| 16 | spreadsheet for later analysis, or may be used for |
| 17 | indexing purposes in information retrieval (IR) applications |
| 18 | such as Internet search engines like Google. |
| 19 | |
| 20 | === References === |
| 21 | |
| 22 | 1. Cunningham, Hamish. [https://gate.ac.uk/sale/ell2/ie/ An Introduction to Information Extraction]. Encyclopedia of Language and Linguistics, 2nd Edition. Elsevier, 2005. |
| 23 | 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. |
| 24 | 1. Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. [https://arxiv.org/abs/2007.04247 Neural relation extraction: a survey]. arXiv preprint arXiv:2007.04247, 2020. |
| 25 | |
| 26 | == Practical Session == |
| 27 | |
| 28 | We will extract information from news articles using GATE. |
| 29 | |
| 30 | 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID. |
| 31 | 1. Download GATE from https://gate.ac.uk/download/ and install it (Java 8 is required). Run either the MS installer or the Java installer; the Java installer can also be started from the command line: |
| 32 | {{{ |
| 33 | java -jar gate-<VERSION>-installer.jar |
| 34 | }}} |
| 35 | 1. Run GATE: |
| 36 | {{{ |
| 37 | GATE_Developer_<VERSION>/bin/gate.sh |
| 38 | }}} |
| 39 | 1. Load ANNIE (with defaults) and read about its components. [[br]] |
| 40 | [[Image(annie.png)]] |
| 41 | 1. Create document(s): |
| 42 | * right click on `Language Resources/New/GATE Document` in the left menu |
| 43 | * change {{{markupAware}}} to {{{false}}} |
| 44 | * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text |
| 45 | * repeat these steps for each additional document |
| 46 | * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt] |
| 47 | 1. Create corpus: |
| 48 | * right click on `Language Resources/New/GATE Corpus` in the left menu |
| 49 | * drag and drop the documents into the corpus |
| 50 | 1. Run ANNIE: click on `Applications/Annie` in the left menu and select your corpus under `Corpus`. |
| 51 | 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`. |
| 52 | |
| 53 | So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type `Lookup`. |
| 54 | |
| 55 | We now add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape]. |
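
To give an idea of what such a rule looks like before opening the attachments, here is a minimal JAPE sketch modelled on the job title example in the GATE User Guide. It is not a copy of {{{jobtitle.jape}}}. It assumes that the ANNIE gazetteer marks job titles with `Lookup` annotations whose `majorType` is `jobtitle`; the phase and rule names are made up for illustration.

{{{
// Illustrative phase name; the attached grammar may use a different one.
Phase: JobTitleSketch
// Only Lookup annotations are visible to this phase.
Input: Lookup
Options: control = appelt

// Hypothetical rule: one or two adjacent gazetteer hits of type "jobtitle"
// are merged into a single JobTitle annotation.
Rule: JobTitleFromLookup
(
  {Lookup.majorType == jobtitle}
  ({Lookup.majorType == jobtitle})?
):jobtitle
-->
:jobtitle.JobTitle = {rule = "JobTitleFromLookup"}
}}}

The left-hand side (before `-->`) is the pattern over existing annotations; the right-hand side creates the new annotation over the span bound to the label `jobtitle`.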
| 56 | |
| 57 | 1. Right click `Processing Resources/New/JAPE Transducer` in the left menu |
| 58 | 1. Download the grammar(s). |
| 59 | 1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}} |
| 60 | 1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources) |
| 61 | 1. Run ANNIE again: click on `Applications/Annie` in the left menu. |
| 62 | 1. Observe the annotated results: click on a document, then on `Annotation Sets` and/or `Annotation List`. If the text contains any job titles, you will see the new annotation type `JobTitle`. |
| 63 | 1. Inspect the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}. |
| 64 | 1. Add a new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results. |
| 65 | 1. Optionally, add further documents and observe how universal the {{{jobtitleperson.jape}}} grammar is. |
| 66 | 1. Using the grammars above as a model, write your own grammar that extracts new relations (e.g. job title in a company, or person works in a company). |
| 67 | |
| 68 | Write your observations to {{{<YOUR_FILE>}}}. In particular, comment on how well the Gazetteer and NE Transducer perform and describe how well the grammars work. Note that no coreference resolution is used (optionally, you can try one). |
| 69 | Copy the grammar you wrote in the last step into {{{<YOUR_FILE>}}}. |
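
As a possible starting point for your own grammar, the following JAPE sketch links a person to an organization ("person works in company"). It is only one way to approach the task and makes several assumptions: the transducer runs after the NE Transducer and after the {{{jobtitle.jape}}} transducer, so that `Person`, `Organization` and `JobTitle` annotations already exist; the phase, rule and output annotation names (`WorksIn`) are hypothetical; and the `@string` meta-property on the right-hand side should be checked against the JAPE chapter of the GATE User Guide for your GATE version.

{{{
// Hypothetical phase name.
Phase: PersonWorksInCompany
// SpaceToken is deliberately left out of Input, so whitespace is skipped.
Input: Token Person JobTitle Organization
Options: control = appelt

// Matches patterns such as "<Person>, <JobTitle> of <Organization>"
// or "<Person> from <Organization>". Comma and job title are optional.
Rule: PersonAtOrganization
(
  ({Person}):person
  ({Token.string == ","})?
  ({JobTitle})?
  {Token.string ==~ "of|at|from"}
  ({Organization}):org
):relation
-->
:relation.WorksIn = {
  rule = "PersonAtOrganization",
  person = :person@string,        // text covered by the Person annotation
  organization = :org@string      // text covered by the Organization annotation
}
}}}

A rule this narrow will miss many phrasings (for example "Acme's chief executive John Smith"), which is worth discussing in your observations.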