= Building Language Resources from the Web =
Web crawling, boilerplate removal, de-duplication and plagiarism detection.

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

== State of the Art ==

=== References ===
 1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
 1. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
 1. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
 1. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
 1. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
{{{#!comment
 1. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
 1. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
}}}

== Slides ==
[https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf]

== Practical Session task ==
=== Plagiarists vs. plagiarism detectors ===
 1. Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20 % ≤ plagiarism content ≤ 75 % (100 % if done well).
 1. Select a detection algorithm and implement it in Python. At least 1 of your own plagiarisms must be detected by your script, and at least 1 must not be detected by it.
 1. Input format: POS-tagged vertical consisting of 10 `doc` structures with attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
 1. Output format: One plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
 1. Your script will be evaluated using data made by others.
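
One possible detection algorithm for step 2 is word n-gram (shingle) overlap measured by Jaccard similarity. The sketch below is only an illustration and is not necessarily the technique used in the attached frame script. The function names, the shingle size `n` and the `threshold` value are assumptions chosen for the example.

```python
def shingles(tokens, n=3):
    """Return the set of word n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_source(suspect, originals, n=3, threshold=0.2):
    """Return the id of the most similar original, or None.

    suspect   -- list of lemmas of the suspicious document
    originals -- dict mapping document id to its list of lemmas
    A source is reported only if its similarity is strictly above threshold.
    """
    suspect_shingles = shingles(suspect, n)
    best_id, best_score = None, threshold
    for doc_id, tokens in originals.items():
        score = jaccard(suspect_shingles, shingles(tokens, n))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id


# Toy data: lemmas of very short documents, so bigrams (n=2) are used.
originals = {
    "1": "dnes být pěkný den !".split(),
    "2": "dnes být mnoho pěkný den !".split(),
}
suspect = "dnes být ale pěkný den !".split()
print(best_source(suspect, originals, n=2))
```

Working on lemmas rather than word forms makes the comparison robust to inflection, but not to synonym replacement or reordering, which is one way a plagiarism can escape this kind of detector.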

=== Text processing pipelines ===
 * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
 * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface]
 * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}

=== Input data example (2 + 3 documents only) ===
{{{
<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="2">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
<g/>
xx  xx  x
</s>
</doc>
}}}
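
A minimal sketch of reading this input format in Python follows. It assumes that every line starting with `<` is a structure tag and that the columns of a token line are separated by whitespace, with the lemma in the second column. The function name `parse_vertical` and the `lemmas` key are hypothetical and are not taken from the frame script.

```python
import re

DOC_OPEN = re.compile(r'<doc\s+([^>]*)>')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Parse vertical text into a list of dicts, one per <doc> structure.

    Each dict holds the doc attributes (author, id, class, source)
    and the list of lemmas under the key "lemmas".
    """
    docs = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            current = dict(ATTRIBUTE.findall(match.group(1)))
            current["lemmas"] = []
            continue
        if line.startswith("</doc"):
            if current is not None:
                docs.append(current)
            current = None
            continue
        # Skip other structures (<s>, </s>, <g/>), empty lines,
        # and anything outside a document.
        if current is None or not line or line.startswith("<"):
            continue
        columns = line.split()
        if len(columns) >= 2:
            current["lemmas"].append(columns[1])
    return docs


sample = [
    '<doc author="Já První" id="3" class="plagiarism" source="1">',
    '<s>',
    'Dnes\tdnes\tk6eAd1',
    'je\tbýt\tk5eAaImIp3nS',
    '<g/>',
    '!\t!\tk?',
    '</s>',
    '</doc>',
]
for doc in parse_vertical(sample):
    print(doc["id"], doc["class"], doc["source"], doc["lemmas"])
```

In a real script the lines would typically come from `sys.stdin`. A token that itself starts with `<` would be skipped by this sketch.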

=== Output example ===
{{{
Doc set by Já První
3       1       1
4       2       2
5       5       1
Set precision: 0.67, recall: 1.00, F1: 0.80
}}}
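
The F1 measure on the evaluation line is the harmonic mean of precision and recall. The sketch below only shows that last step and the formatting of the line. How precision and recall themselves are counted is left to the detection script, so they are taken as given here; the function names are hypothetical.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluation_line(precision, recall):
    """Format the evaluation line in the style of the output example."""
    return "Set precision: %.2f, recall: %.2f, F1: %.2f" % (
        precision, recall, f1_measure(precision, recall))


# Values matching the example: precision 2/3, recall 1.
print(evaluation_line(2 / 3, 1.0))
# Set precision: 0.67, recall: 1.00, F1: 0.80
```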

=== Skeleton script to extend (implements a simple plagiarism detection technique) ===
[attachment:plagiarism_simple.py]

Usage (Czech or English input):
{{{
cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
}}}