LanguageResourcesFromWeb

Timestamp:: Oct 26, 2015, 3:41:39 PM
Author:: xsuchom2
Comment:: Slides update, Practical Session task

private/NlpInPracticeCourse/LanguageResourcesFromWeb

-                      v8
+                      v9
 . Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
 . Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
+. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
+. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
+{{{#!comment
 . Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
+. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
+. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
+}}}
+== Slides including Practical Session task ==
+== Slides ==
+[http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf]
+== Practical Session task ==
+=== Plagiarists vs. plagiarism detectors ===
+. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20% ≤ plagiarism content ≤ 75% (100% if done well).
+. Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your plagiarisms and must fail to detect at least 1 of them.
+. Input format: POS-tagged vertical consisting of 10 structures `doc` with attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source (in case of a plagiarism) or the document's own id (in case of an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
+. Output format: One plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
+. Your script will be evaluated using data made by others.
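One possible detection algorithm for the task above is to compare documents by the overlap of their lemma n-grams. The following is only a minimal sketch under stated assumptions, not the method used by the course's script: the function names, the `threshold` value, and the choice of Jaccard similarity are all illustrative.

```python
from typing import Dict, List, Set, Tuple


def shingles(lemmas: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lemma n-grams (shingles) of one document."""
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_source(doc_id: str,
                  lemmas_by_id: Dict[str, List[str]],
                  threshold: float = 0.5,
                  n: int = 3) -> str:
    """Return the id of the most similar other document if its similarity
    reaches the threshold, otherwise the document's own id.

    Assumption: returning the own id means "no plagiarism detected".
    """
    own = shingles(lemmas_by_id[doc_id], n)
    best_id, best_sim = doc_id, 0.0
    for other_id, lemmas in lemmas_by_id.items():
        if other_id == doc_id:
            continue
        sim = jaccard(own, shingles(lemmas, n))
        if sim > best_sim:
            best_id, best_sim = other_id, sim
    return best_id if best_sim >= threshold else doc_id
```

The measure is symmetric, so it cannot tell which of two similar documents is the source. A caller would have to restrict `lemmas_by_id` to the suspicious document plus its candidate sources.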
+=== Text processing pipelines ===
+ * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
+ * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
+=== Input data example (2 + 3 documents only) ===
+{{{
+<doc author="Já První" id="1" class="original" source="1">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="2" class="original" source="2">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+moc mnoho   k6
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="3" class="plagiarism" source="1">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+ale ale k9
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="4" class="plagiarism" source="2">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+ale ale k9
+moc mnoho   k6
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="5" class="plagiarism" source="1">
+<s>
+xx  xx  x
+xx  xx  x
+xx  xx  x
+xx  xx  x
+xx  xx  x
+<g/>
+xx  xx  x
+</s>
+</doc>
+}}}
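Input in this format could be read along the following lines. This is a hedged sketch, not the parser from the attached script: `parse_vertical` and `ATTR_RE` are hypothetical names, and the code assumes tab-separated token lines with at least three columns (word, lemma, tag).

```python
import re
from typing import Dict, Iterable, List

# Matches one attribute pair such as id="1" inside a <doc ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines: Iterable[str]) -> List[Dict]:
    """Parse vertical text into a list of documents.

    Each document is a dict holding the `doc` attributes plus a "tokens"
    list of (word, lemma, tag) tuples. Other structure tags such as <s>,
    </s> and <g/> are skipped.
    """
    docs = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<doc"):
            current = dict(ATTR_RE.findall(line))
            current["tokens"] = []
        elif line.startswith("</doc"):
            if current is not None:
                docs.append(current)
            current = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # sentence boundary, glue tag, or other structure
        elif current is not None and line.strip():
            cols = line.split("\t")
            if len(cols) >= 3:  # assumption: word, lemma, tag come first
                current["tokens"].append((cols[0], cols[1], cols[2]))
    return docs
```

A script reading from a pipe could call `parse_vertical(sys.stdin)` after importing `sys`.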
+=== Output example ===
+{{{
+Doc set by Já První
+       1       1
+       2       2
+       5       1
+Set precision: 0.67, recall: 1.00, F1: 0.80
+}}}
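The evaluation line could be computed as sketched below. The exact definitions used by the course are not given here, so the following are assumptions: a document counts as flagged when its detected source id differs from its own id, and a flag is correct only when the document really is a plagiarism and the detected source equals the real one. The function names are hypothetical.

```python
from typing import List, Tuple


def evaluate(rows: List[Tuple[str, str, str]]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1.

    Each row is (doc id, detected source id, real source id).
    """
    flagged = [r for r in rows if r[1] != r[0]]   # reported as plagiarism
    actual = [r for r in rows if r[2] != r[0]]    # really a plagiarism
    true_pos = [r for r in flagged if r[2] != r[0] and r[1] == r[2]]
    precision = len(true_pos) / len(flagged) if flagged else 0.0
    recall = len(true_pos) / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def format_evaluation(precision: float, recall: float, f1: float) -> str:
    """Format the evaluation line in the style of the output example."""
    return "Set precision: {:.2f}, recall: {:.2f}, F1: {:.2f}".format(
        precision, recall, f1)
```

With two correct detections out of three flagged documents and two real plagiarisms, this gives precision 2/3 and recall 1, so F1 = 2 · (2/3) · 1 / (2/3 + 1) = 0.8.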
+=== Skeleton script implementing a simple plagiarism detection technique, to be extended ===
+[raw-attachment:plagiarism_simple.py]
+Usage (Czech or English input):
+{{{
+cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
+cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
+}}}