Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2020/LanguageResourcesFromWeb


Timestamp:
Aug 31, 2021, 2:11:37 PM (3 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/LanguageResourcesFromWeb

  • en/AdvancedNlpCourse2020/LanguageResourcesFromWeb

    v1 v1  
     1= Building Language Resources from the Web =
     2Web crawling, boilerplate removal, de-duplication and plagiarism detection.
     3
     4[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák
     5
     6
     7Prepared by: Vít Suchomel[[BR]]
     8[http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf Slides]
     9
     10
     11== State of the Art ==
     12
     13=== References ===
     14 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
     15 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
     16 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
     17 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
     18 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
     19 1. Suchomel, Vít. "Better Web Corpora For Corpus Linguistics And NLP." Dissertation thesis. Masaryk University, 2020.
     20{{{#!comment
     21 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
     22 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
     23}}}
     24
     25
     26== Practical Session task ==
     27=== Plagiarists vs. plagiarism detectors ===
     28
     29Either: Create 5 documents on a similar topic and 5 plagiarisms of these documents, 10 documents in total. //(For the sake of simplicity, a plagiarism cannot have more than one source here.)// //The minimal homework.//
     30  * 100 words <= document length <= 500 words
     31  * 20 % <= plagiarism content <= 90 %
     32  * File format: A POS-tagged vertical consisting of {{{doc}}} structures with the attributes {{{author}}}, {{{id}}}, {{{class}}}, {{{source}}}. The pair (author, id) is unique. Start with id = 1. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original).
     33  * A POS-tagged vertical: 3 TAB-separated columns: word, lemma (the base form of the word), POS/morphological tag.
     34  * Text processing pipelines for converting a text file to a 3-column vertical:
     35    * Czech: {{{asteria04:/opt/majka_pipe/majka-czech_v2.sh | cut -f1-3}}} or a [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface] (short documents only)
     36      * See an example below.
     37    * English: {{{asteria04:/opt/treetagger_pipe/tt-english_v2.1.sh}}}
     38  * For each plagiarism:
     39    * describe the plagiarism technique(s) used
     40    * which detection methods might be able to reveal it -- give reasons
     41    * which detection methods might not be able to reveal it -- give reasons
     42  * **Submit a text file containing 10 documents according to the requirements + 1 text file describing the techniques used and your estimate of which detection techniques may or may not work.**
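
The format requirements above can be checked mechanically before submission. The following Python 3 code is a minimal sketch, not an official validator for the course: the names `parse_vertical` and `check_documents` are hypothetical, and it assumes that every `<doc ...>` tag sits on its own line with double-quoted attributes. It also assumes that every token line (punctuation included) counts as one word.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (attributes, tokens) for each <doc> structure in a vertical."""
    attrs, tokens = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            attrs, tokens = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</doc>"):
            yield attrs, tokens
            attrs = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structures such as <s>, </s>, <g/>
        elif line.strip() and attrs is not None:
            tokens.append(line.split("\t"))


def check_documents(docs, min_words=100, max_words=500):
    """Return a list of problems found; an empty list means none were found."""
    problems, seen = [], set()
    for attrs, tokens in docs:
        key = (attrs.get("author"), attrs.get("id"))
        label = "doc %s/%s" % key
        missing = [a for a in ("author", "id", "class", "source") if a not in attrs]
        if missing:
            problems.append("%s: missing attributes %s" % (label, ", ".join(missing)))
        if key in seen:
            problems.append("%s: pair (author, id) is not unique" % label)
        seen.add(key)
        if attrs.get("class") not in ("original", "plagiarism"):
            problems.append('%s: class must be "original" or "plagiarism"' % label)
        if attrs.get("class") == "original" and attrs.get("source") != attrs.get("id"):
            problems.append("%s: an original must have source equal to its id" % label)
        if any(len(t) != 3 for t in tokens):
            problems.append("%s: token line without exactly 3 columns" % label)
        if not min_words <= len(tokens) <= max_words:
            problems.append("%s: %d words, expected %d to %d"
                            % (label, len(tokens), min_words, max_words))
    return problems
```

A call such as `check_documents(parse_vertical(open("documents.vert", encoding="utf-8")))` (the file name is only an example) returns the list of problems. The 20 % to 90 % plagiarism content requirement is not checked by this sketch.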
     43
     44Or: Select a detection algorithm and implement it in Python. //The right homework if you want to learn something.//
     45  * A basic detection script to extend: [raw-attachment:plagiarism_simple.py] -- usage: {{{python plagiarism_simple.py < training_data.vert}}}
     46    * This script implements a bag-of-words approach with cosine similarity of word vectors. //(For the sake of simplicity, a plagiarism cannot have more than one source here.)//
     47    * You can modify the script to
     48      * use input attributes other than the word, or a combination of attributes, e.g. the lemma or the morphological tag,
     49      * or implement another lexical or syntactic detection approach, e.g. word n-grams or the Levenshtein distance,
     50      * or implement another semantic detection approach, e.g. the similarity of {{{word2vec}}} vectors.
     51  * Input format: A 3-column vertical, see above. [raw-attachment:training_data.vert]
     52  * Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
     53  * Your script will be evaluated using data made by others.
     54  * Describe which plagiarism detection technique(s) were implemented -- put it in a comment at the beginning of your script.
     55  * **Submit the modified script (or your own script) with a short description in a comment.** (The training set output of the script is not required.)
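
One of the extensions listed above, word n-grams over lemmas, could look like the Python 3 sketch below. It is not the code of `plagiarism_simple.py`, and all function names are hypothetical. It assumes that each document has already been read into a dict with the keys `id`, `class`, `source` and `lemmas`, that the `class` attribute is available to the detector, and that `-` is printed when no source reaches the (arbitrarily chosen) threshold.

```python
def ngrams(items, n=3):
    """Return the set of n-grams (as tuples) of a token sequence."""
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def containment(suspicious, source, n=3):
    """Share of the suspicious document's n-grams that also occur in the source."""
    grams = ngrams(suspicious, n)
    if not grams:
        return 0.0
    return len(grams & ngrams(source, n)) / len(grams)


def detect(docs, n=3, threshold=0.2):
    """Return (id, detected source id or None, real source id) per plagiarism."""
    originals = [d for d in docs if d["class"] == "original"]
    results = []
    for doc in docs:
        if doc["class"] != "plagiarism":
            continue
        best_id, best_score = None, 0.0
        for candidate in originals:
            score = containment(doc["lemmas"], candidate["lemmas"], n)
            if score > best_score:
                best_id, best_score = candidate["id"], score
        detected = best_id if best_score >= threshold else None
        results.append((doc["id"], detected, doc["source"]))
    return results


def evaluate(results):
    """Compute precision, recall and F1 over the detected sources."""
    detected = [r for r in results if r[1] is not None]
    correct = sum(1 for r in detected if r[1] == r[2])
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(results) if results else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def format_results(results):
    """One plagiarism per line (TAB separated), followed by the evaluation line."""
    lines = ["%s\t%s\t%s" % (i, d if d is not None else "-", s)
             for i, d, s in results]
    lines.append("precision %.2f, recall %.2f, F1 %.2f" % evaluate(results))
    return "\n".join(lines)
```

Containment is asymmetric on purpose: it measures how much of the suspicious document is covered by the candidate source, so text added around copied passages lowers the score less than it would with a symmetric measure.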
     56
     57=== Examples of a source document and a plagiarism document ===
     58{{{
     59<doc author="Já První" id="1" class="original" source="1">
     60<s>
     61Dnes    dnes    k6eAd1
     62je      být     k5eAaImIp3nS
     63pěkný   pěkný   k2eAgInSc4d1
     64den     den     k1gInSc4
     65<g/>
     66!       !       k?
     67</s>
     68</doc>
     69<doc author="Já První" id="2" class="plagiarism" source="1">
     70<s>
     71Dnes    dnes    k6eAd1
     72je      být     k5eAaImIp3nS
     73ale     ale     k9
     74pěkný   pěkný   k2eAgInSc4d1
     75den     den     k1gInSc4
     76<g/>
     77!       !       k?
     78</s>
     79</doc>
     80}}}
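
In the example above, the plagiarism differs from its source by a single inserted token (`ale`). A token-level Levenshtein distance, one of the lexical approaches mentioned in the task, makes this visible. The Python 3 code below is a minimal sketch: `levenshtein` is a hypothetical helper, the token lists are typed in by hand from the word column, and structures such as `<s>` and `<g/>` are left out.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; each insert, delete or substitute costs 1."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


original = ["Dnes", "je", "pěkný", "den", "!"]
plagiarism = ["Dnes", "je", "ale", "pěkný", "den", "!"]

distance = levenshtein(original, plagiarism)
# Normalise by the longer document so that 1.0 means identical token sequences.
similarity = 1 - distance / max(len(original), len(plagiarism))
print(distance, round(similarity, 2))
```

Working on tokens rather than characters keeps the distance insensitive to word length; running the same function on the lemma column instead would also ignore changes in inflection.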
     81
     82How to produce the 3-column POS-tagged vertical from a plain text file:
     83{{{
     84scp plagiarism.txt aurora.fi.muni.cz:~/
     85ssh aurora.fi.muni.cz
     86ssh asteria04
     87cat ~/plagiarism.txt | /opt/majka_pipe/majka-czech_v2.sh | cut -f1-3 > ~/plagiarism.vert  #Czech
     88cat ~/plagiarism.txt | /opt/treetagger_pipe/tt-english_v2.1.sh > ~/plagiarism.vert        #English
     89logout
     90logout
     91scp aurora.fi.muni.cz:~/plagiarism.vert ./
     92}}}
     93
     94How to run the sample detection script:
     95{{{
     96python2 plagiarism_simple.py < plagiarism.vert
     97}}}
     98
     99