Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2015/LanguageResourcesFromWeb


Timestamp:
Sep 11, 2017, 4:38:19 PM (7 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/LanguageResourcesFromWeb

     1= Building Language Resources from the Web =
     2Web crawling, boilerplate removal, de-duplication and plagiarism detection.
     3
     4[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák
     5
     6Prepared by: Vít Suchomel
     7
     8== State of the Art ==
     9
     10=== References ===
     11 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
     12 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
     13 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
     14 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
     15 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
     16{{{#!comment
     17 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
     18 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
     19}}}
     20
     21== Slides ==
     22[https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf]
     23
     24== Practical Session task ==
     25=== Plagiarists vs. plagiarism detectors ===
     26 1. Create 5 documents on a similar topic and 5 plagiarisms of these documents, 10 documents in total. Each document must be 100 to 500 words long. Plagiarized content must make up 20% to 75% of each plagiarism (100% if done well).
     27 1. Select a detection algorithm and implement it in Python. Your own script must detect at least one of your plagiarisms and must fail to detect at least one.
     28 1. Input format: a POS-tagged vertical file consisting of 10 `doc` structures with the attributes `author`, `id`, `class`, and `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
     29 1. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
     30 1. Your script will be evaluated using data made by others.
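
One simple way to approach step 2 is to compare documents by the overlap of their lemma n-grams. The sketch below is only an illustration and is not the course's `plagiarism_simple.py`: the function names, the bigram size, and the 0.4 threshold are assumptions made for this example. Returning a document's own id when no source is found is also an assumed convention, modelled on the output example.

```python
def ngrams(tokens, n=2):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_source(doc_id, docs, n=2, threshold=0.4):
    """Return the id of the most similar other document, or doc_id itself
    when no other document reaches the threshold (assumed convention)."""
    target = ngrams(docs[doc_id], n)
    best_id, best_sim = doc_id, 0.0
    for other_id, lemmas in docs.items():
        if other_id == doc_id:
            continue
        sim = jaccard(target, ngrams(lemmas, n))
        if sim > best_sim:
            best_id, best_sim = other_id, sim
    return best_id if best_sim >= threshold else doc_id


# Hypothetical lemma sequences, keyed by document id.
docs = {
    "1": ["dnes", "být", "pěkný", "den", "!"],
    "3": ["dnes", "být", "ale", "pěkný", "den", "!"],
    "5": ["xx", "xx", "xx", "xx", "xx", "xx"],
}
print(detect_source("3", docs))  # most similar document is "1"
print(detect_source("5", docs))  # nothing similar enough, so its own id
```

A set-based measure like this ignores word order beyond the n-gram window, so a plagiarism that substitutes many words can slip under the threshold. That is the kind of weakness the task asks you to both exploit and guard against.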
     31
     32=== Text processing pipelines ===
     33 * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
     34 * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface]
     35 * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
     36
     37=== Input data example (2 + 3 documents only) ===
     38{{{
     39<doc author="Já První" id="1" class="original" source="1">
     40<s>
     41Dnes    dnes    k6eAd1
     42je  být k5eAaImIp3nS
     43pěkný   pěkný   k2eAgInSc4d1    pěkný
     44den den k1gInSc4    den
     45<g/>
     46!   !   k?
     47</s>
     48</doc>
     49<doc author="Já První" id="2" class="original" source="2">
     50<s>
     51Dnes    dnes    k6eAd1
     52je  být k5eAaImIp3nS
     53moc mnoho   k6
     54pěkný   pěkný   k2eAgInSc4d1    pěkný
     55den den k1gInSc4    den
     56<g/>
     57!   !   k?
     58</s>
     59</doc>
     60<doc author="Já První" id="3" class="plagiarism" source="1">
     61<s>
     62Dnes    dnes    k6eAd1
     63je  být k5eAaImIp3nS
     64ale ale k9
     65pěkný   pěkný   k2eAgInSc4d1    pěkný
     66den den k1gInSc4    den
     67<g/>
     68!   !   k?
     69</s>
     70</doc>
     71<doc author="Já První" id="4" class="plagiarism" source="2">
     72<s>
     73Dnes    dnes    k6eAd1
     74je  být k5eAaImIp3nS
     75ale ale k9
     76moc mnoho   k6
     77pěkný   pěkný   k2eAgInSc4d1    pěkný
     78den den k1gInSc4    den
     79<g/>
     80!   !   k?
     81</s>
     82</doc>
     83<doc author="Já První" id="5" class="plagiarism" source="1">
     84<s>
     85xx  xx  x
     86xx  xx  x
     87xx  xx  x
     88xx  xx  x
     89xx  xx  x
     90<g/>
     91xx  xx  x
     92</s>
     93</doc>
     94}}}
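
A minimal sketch of reading this format in Python follows. It assumes tab-separated columns with the lemma in the second column. The names `parse_vertical` and `DOC_ATTR` are illustrative only and are not taken from the provided script.

```python
import re

# Matches attribute="value" pairs inside a <doc ...> tag.
DOC_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Parse a vertical file into a list of (attributes, lemmas) pairs.

    Assumes one token per line with tab-separated columns (word, lemma, tag).
    Structural tags such as <s> and <g/> are skipped; note that this simple
    check would also skip a token that itself begins with '<'.
    """
    docs = []
    attrs, lemmas = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            attrs, lemmas = dict(DOC_ATTR.findall(line)), []
        elif line.startswith("</doc"):
            docs.append((attrs, lemmas))
            attrs = None
        elif attrs is not None and line and not line.startswith("<"):
            columns = line.split("\t")
            if len(columns) >= 2:
                lemmas.append(columns[1])
    return docs


sample = [
    '<doc author="Já První" id="1" class="original" source="1">',
    "<s>",
    "Dnes\tdnes\tk6eAd1",
    "je\tbýt\tk5eAaImIp3nS",
    "<g/>",
    "!\t!\tk?",
    "</s>",
    "</doc>",
]
for attrs, lemmas in parse_vertical(sample):
    print(attrs["id"], attrs["class"], lemmas)
```

In a real script the `lines` argument could simply be `sys.stdin`, since the function only iterates over its input.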
     95
     96=== Output example ===
     97{{{
     98Doc set by Já První
     993       1       1
     1004       2       2
     1015       5       1
     102Set precision: 0.67, recall: 1.00, F1: 0.80
     103}}}
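
The F1 value on the evaluation line is the harmonic mean of precision and recall. The sketch below shows only the report formatting and the F1 arithmetic. The function names are hypothetical, and the precision and recall values are copied from the example above rather than computed, because how they are counted is up to the evaluation script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def format_report(author, rows, precision, recall):
    """Build the report: a header, one TAB-separated line per plagiarism
    (id, detected source id, real source id), and the evaluation line."""
    lines = ["Doc set by %s" % author]
    lines += ["\t".join(row) for row in rows]
    lines.append("Set precision: %.2f, recall: %.2f, F1: %.2f"
                 % (precision, recall, f1_score(precision, recall)))
    return "\n".join(lines)


# Rows and precision/recall values taken from the output example above.
rows = [("3", "1", "1"), ("4", "2", "2"), ("5", "5", "1")]
print(format_report("Já První", rows, 2 / 3, 1.0))
```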
     104
     105=== Skeleton script implementing a simple plagiarism detection technique (to be extended) ===
     106[attachment:plagiarism_simple.py]
     107
     108Usage (Czech or English input):
     109{{{
     110cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
     111cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
     112}}}