Changes between Version 22 and Version 23 of private/NlpInPracticeCourse/InformationExtraction


Timestamp:
Sep 18, 2023, 1:25:38 PM (8 months ago)
Author:
Zuzana Nevěřilová
Comment:

--

  • private/NlpInPracticeCourse/InformationExtraction

    v22 v23  
    2323 1. Piskorski, J. and Yangarber, R. Information Extraction: Past, Present and Future, pages 23–49. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
    2424 1. Aydar, Mehmet, Ozge Bozal, and Furkan Ozbay. [https://arxiv.org/abs/2007.04247 Neural relation extraction: a survey.] arXiv e-prints (2020).
     25 1. Li, Qing, et al. "A comprehensive exploration of semantic relation extraction via pre-trained CNNs." Knowledge-Based Systems (2020): 105488.
     26
    2527
    2628== Practical Session ==
    2729
    28 We will extract information from news articles using GATE.
    2930
    30  1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
    31  1. Download and install GATE (Java 8 is necessary) from https://gate.ac.uk/download/. Either run the MS installer or the Java installer, install and run as app or in the command line:
    32  {{{
    33 java -jar gate-<VERSION>-installer.jar
    34 }}}
    35  1. Run GATE
    36  {{{
    37 GATE_Developer_<VERSION>/bin/gate.sh
    38 }}}
    39  1. Load ANNIE (with defaults), read about its components [[br]]
    40  [[Image(annie.png)]]
    41  1. Create document(s):
    42    * right click on `Language Resources/New/GATE Document` in the left menu
    43    * change {{{markupAware}}} to {{{false}}}
    44    * change {{{sourceUrl}}} to {{{stringContent}}} and paste some news text
    45    * repeat these steps
    46    * you can find three sample texts here: [raw-attachment:text1.txt text1.txt], [raw-attachment:text2.txt text2.txt], [raw-attachment:text3.txt text3.txt]
    47  1. Create corpus:
    48    * right click on `Language Resources/New/GATE Corpus` in the left menu
    49    * drag and drop the documents into the corpus
    50  1. Run ANNIE: Click on `Applications/Annie` in the left menu, select `Corpus`
    51  1. Observe the annotated results, click on a document, then `Annotation Sets` and/or `Annotation List`.
     31The task uses a Python notebook that runs in a web browser in the [https://colab.research.google.com/ Google Colaboratory] environment
     32with MU G-Suite disk access.
    5233
    53 So far, GATE has not done much more than Stanford NER. Note, however, that all tokens are annotated and POS-tagged. Also note the annotation type Lookup.
     34To run the code in a local environment instead, you need
     35Python 3 and the NLTK module.
    5436
    55 We add rules for extracting ''job titles'' and the respective ''person names''. The rules are defined in the grammars [raw-attachment:jobtitle.jape] and [raw-attachment:jobtitleperson.jape]
     37 1. Create {{{<YOUR_FILE>}}}, a text file named {{{ia161-UCO-08.txt}}} where '''UCO''' is your university ID.
     38 1. Access the [https://colab.research.google.com/drive/1lHphWGR-i6P7HqTJ_39Eo8FnPe8OJuSD Python notebook in the Google Colab environment] and make your own copy. Do not forget to save your work if you want to see your changes later; leaving the browser will discard all changes!
     39 1. The notebook reads the file {{{input.txt}}} (each line has the form {{{word|definition}}}) and outputs a hypernym for each word.
     40 1. The default approach is naive: ''the first noun in the definition is the hypernym''.
     41 1. Using the gold standard, evaluate the naive approach.
     42 1. Improve the {{{find_hyper()}}} function to provide better results. Evaluate the new version.
     43 1. Copy the updated function {{{find_hyper()}}} and the output into {{{<YOUR_FILE>}}}. Please don't submit the whole notebook.
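The naive baseline described in the steps above can be sketched as follows. This is only an illustration, not the notebook's actual code: the names {{{find_hyper_naive}}}, {{{parse_line}}} and {{{nltk_tagger}}} are hypothetical, and the sketch assumes Penn Treebank tags as produced by NLTK's {{{pos_tag}}} (which requires the NLTK tokenizer and tagger data to be downloaded first).

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # list of (token, POS tag) pairs


def nltk_tagger(text: str) -> Tagged:
    """Tokenize and POS-tag text with NLTK (assumes its data is installed)."""
    import nltk  # imported lazily so the rest of the sketch runs without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


def parse_line(line: str) -> Tuple[str, str]:
    """Split one 'word|definition' input line into its two parts."""
    word, definition = line.rstrip("\n").split("|", 1)
    return word.strip(), definition.strip()


def find_hyper_naive(
    definition: str,
    tagger: Callable[[str], Tagged] = nltk_tagger,
) -> Optional[str]:
    """Naive rule: the first noun in the definition is the hypernym.

    Returns the lower-cased first token whose tag starts with 'NN'
    (NN, NNS, NNP, NNPS), or None if the definition contains no noun.
    """
    for token, tag in tagger(definition):
        if tag.startswith("NN"):
            return token.lower()
    return None
```

Passing the tagger as a parameter keeps the function easy to test without NLTK data; with NLTK installed, {{{find_hyper_naive(parse_line(line)[1])}}} applies the rule to one input line.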
    5644
    57  1. Right click `Processing Resources/New/JAPE Transducer` in the left menu
    58  1. Download the grammar(s).
    59  1. Click on {{{grammarURL}}} and choose the grammar file {{{jobtitle.jape}}}
    60  1. Click on `Applications/Annie` in the left menu and add the JAPE Transducer to the ANNIE pipeline (Selected Processing Resources)
    61  1. Run ANNIE again: Click on `Applications/Annie` in the left menu
    62  1. Observe the annotated results, click on a document, then `Annotation Sets` and/or `Annotation List`. If applicable, you can see new annotation `JobTitle`.
    63  1. Observe the grammars {{{jobtitle.jape}}} and {{{jobtitleperson.jape}}}
    64  1. Add new transducer with the grammar {{{jobtitleperson.jape}}} and observe the results.
    65  1. Optionally, you can add further documents and observe how universal the {{{jobtitleperson.jape}}} grammar is.
    66  1. Following the grammars above, write your own grammar that extracts new relations (e.g. job title in a company, or person works in a company).
     45Gold standard for evaluating your results: [[raw-attachment:gold_en.txt|gold_en.txt]]
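Evaluation against the gold standard could look like the sketch below. It assumes that {{{gold_en.txt}}} uses a {{{word|hypernym}}} line format analogous to the input file; this is an assumption, so check the actual file first. The helper names {{{load_pairs}}} and {{{accuracy}}} are hypothetical.

```python
from typing import Dict, Iterable


def load_pairs(lines: Iterable[str]) -> Dict[str, str]:
    """Read 'word|value' lines into a dict, skipping blank or malformed lines."""
    pairs: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue
        word, value = line.split("|", 1)
        pairs[word.strip()] = value.strip().lower()
    return pairs


def accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold words whose predicted hypernym matches exactly.

    Words missing from the predictions count as errors.
    """
    if not gold:
        return 0.0
    correct = sum(1 for word, hyper in gold.items() if predicted.get(word) == hyper)
    return correct / len(gold)


# Tiny made-up example (not taken from the course data):
gold = load_pairs(["dog|animal", "oak|tree", "car|vehicle"])
predicted = {"dog": "animal", "oak": "plant"}
print(f"accuracy = {accuracy(predicted, gold):.2f}")  # prints "accuracy = 0.33"
```

Exact string match is a strict criterion; running the same function on the naive and the improved {{{find_hyper()}}} output gives directly comparable numbers.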
    6746
    68 Next, we will try to find similar information using a BERT model. Open the Jupyter notebook from https://colab.research.google.com/drive/1F2fxnCMwxlLvZgAp2WyamHpYSGl9aFlU and experiment with texts and questions.
    69 
    70 Write your observations to {{{<YOUR_FILE>}}}: in particular, comment on how well the Gazetteer and NE Transducer perform, and describe how well the grammar works. Note that no coreference resolution is used (optionally, you can try one). Comment on the differences between the GATE and BERT outputs.
    71 Copy your grammar from the last point to {{{<YOUR_FILE>}}}.