Changes between Version 8 and Version 9 of private/AdvancedNlpCourse/LanguageResourcesFromWeb


Timestamp:
Oct 26, 2015, 3:41:39 PM
Author:
xsuchom2
Comment:

Slides update, Practical Session task

  • private/AdvancedNlpCourse/LanguageResourcesFromWeb

    v8 v9  
    1212 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
    1313 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
     14 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, pages 1613–0073.
     15 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
     16{{{#!comment
    1417 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
    15  1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
     18 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
     19}}}
    1620
    17 == Slides including Practical Session task ==
     21== Slides ==
     22[http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]
    1823
    19 [http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]
     24== Practical Session task ==
     25=== Plagiarists vs. plagiarism detectors ===
     26 1. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20% ≤ plagiarised content ≤ 75% (100% if done well).
     27 1. Select a detection algorithm and implement it in Python. At least one of your own plagiarisms must be detected by your script, and at least one must go undetected.
     28 1. Input format: a POS-tagged vertical consisting of 10 `doc` structures with attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
     29 1. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
     30 1. Your script will be evaluated using data made by others.
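
One possible detection technique for step 2 is to compare sets of lemma n-grams ("shingles") using Jaccard similarity. The following is only a minimal sketch under stated assumptions, not the method of the attached script: the function names, the n-gram size and the threshold value are hypothetical and would need tuning on real data.

```python
def shingles(lemmas, n=3):
    """Return the set of all n-grams (as tuples) over a list of lemmas."""
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))


def detect_source(suspect, originals, n=3, threshold=0.2):
    """Return the id of the most similar original document, or None
    if no original reaches the (assumed) similarity threshold.

    suspect:   list of lemmas of the examined document
    originals: dict mapping document id -> list of lemmas
    """
    suspect_shingles = shingles(suspect, n)
    best_id, best_sim = None, 0.0
    for doc_id, lemmas in originals.items():
        sim = jaccard(suspect_shingles, shingles(lemmas, n))
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    return best_id if best_sim >= threshold else None
```

A document for which `detect_source` returns `None` would be reported as an original, that is, with its own id as the detected source.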
     31
     32=== Text processing pipelines ===
     33 * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
     34 * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
     35
     36=== Input data example (2 + 3 documents only) ===
     37{{{
     38<doc author="Já První" id="1" class="original" source="1">
     39<s>
     40Dnes    dnes    k6eAd1
     41je  být k5eAaImIp3nS
     42pěkný   pěkný   k2eAgInSc4d1    pěkný
     43den den k1gInSc4    den
     44<g/>
     45!   !   k?
     46</s>
     47</doc>
     48<doc author="Já První" id="2" class="original" source="2">
     49<s>
     50Dnes    dnes    k6eAd1
     51je  být k5eAaImIp3nS
     52moc mnoho   k6
     53pěkný   pěkný   k2eAgInSc4d1    pěkný
     54den den k1gInSc4    den
     55<g/>
     56!   !   k?
     57</s>
     58</doc>
     59<doc author="Já První" id="3" class="plagiarism" source="1">
     60<s>
     61Dnes    dnes    k6eAd1
     62je  být k5eAaImIp3nS
     63ale ale k9
     64pěkný   pěkný   k2eAgInSc4d1    pěkný
     65den den k1gInSc4    den
     66<g/>
     67!   !   k?
     68</s>
     69</doc>
     70<doc author="Já První" id="4" class="plagiarism" source="2">
     71<s>
     72Dnes    dnes    k6eAd1
     73je  být k5eAaImIp3nS
     74ale ale k9
     75moc mnoho   k6
     76pěkný   pěkný   k2eAgInSc4d1    pěkný
     77den den k1gInSc4    den
     78<g/>
     79!   !   k?
     80</s>
     81</doc>
     82<doc author="Já První" id="5" class="plagiarism" source="1">
     83<s>
     84xx  xx  x
     85xx  xx  x
     86xx  xx  x
     87xx  xx  x
     88xx  xx  x
     89<g/>
     90xx  xx  x
     91</s>
     92</doc>
     93}}}
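
The vertical format shown above can be read with a few lines of Python. This is a minimal sketch rather than the code of the attached script: the name `parse_vertical` and the returned dictionary layout are assumptions made for illustration, and the columns are assumed to be tab-separated (word, lemma, tag).

```python
import re

# Matches one attribute="value" pair inside a <doc ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Parse an iterable of vertical lines into a list of documents.

    Each document is a dict holding the <doc> attributes plus 'tokens',
    a list of (word, lemma, tag) triples. Other structural tags such as
    <s>, </s> and <g/> are skipped.
    """
    docs = []
    current = None
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('<doc'):
            current = dict(ATTR_RE.findall(line))
            current['tokens'] = []
        elif line.startswith('</doc'):
            if current is not None:
                docs.append(current)
            current = None
        elif line.startswith('<') and line.endswith('>'):
            continue  # structural tag other than <doc>
        elif current is not None and line.strip():
            fields = line.split('\t')
            # Keep the first three columns; pad if some are missing.
            word, lemma, tag = (fields + ['', ''])[:3]
            current['tokens'].append((word, lemma, tag))
    return docs
```

In a pipeline like the ones listed under "Text processing pipelines", the function could be called as `parse_vertical(sys.stdin)` after `import sys`.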
     94
     95=== Output example ===
     96{{{
     97Doc set by Já První
     983       1       1
     994       2       2
     1005       5       1
     101Set precision: 0.67, recall: 1.00, F1: 0.80
     102}}}
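
The F1 value in the evaluation line is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below formats result lines and the evaluation line in the layout of the example above. The function names are hypothetical, and since the task text does not specify how precision and recall themselves are counted, they are taken as inputs here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def format_result(doc_id, detected_source, real_source):
    """One output line: id TAB detected source id TAB real source id."""
    return '\t'.join(str(x) for x in (doc_id, detected_source, real_source))


def format_evaluation(precision, recall):
    """Evaluation line in the layout of the output example."""
    return 'Set precision: %.2f, recall: %.2f, F1: %.2f' % (
        precision, recall, f1_score(precision, recall))
```

With precision 2/3 and recall 1.0, `format_evaluation` yields the evaluation line of the example: 2 · 0.67 · 1.00 / 1.67 ≈ 0.80.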
     103
     104=== Skeleton script implementing a simple plagiarism detection technique, to be extended ===
     105[raw-attachment:plagiarism_simple.py]
     106
     107Usage (Czech or English input):
     108{{{
     109cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
     110cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
     111}}}