Building Language Resources from the Web

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

State of the Art

References

Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, pages 1613–0073.
13th evaluation lab on uncovering plagiarism, authorship, and social software misuse

Slides

https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf

Practical Session task

Plagiarists vs. plagiarism detectors

Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words ≤ document length ≤ 500 words. 20 % ≤ plagiarism content ≤ 75 % (or 100 % if done well).
Select a detection algorithm and implement it in Python. At least 1 of your own plagiarisms must be detected and at least 1 must go undetected by your own script.
Input format: a POS-tagged vertical consisting of 10 doc structures with the attributes author, id, class and source. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
Output format: one plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
Your script will be evaluated using data made by others.
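
A minimal sketch of one detection technique that could be selected for this task: overlap of word n-grams (shingles), scored by the Jaccard coefficient. It is not the course's frame script. The function names, the shingle size n and the threshold are illustrative assumptions, and each document is assumed to be already reduced to a list of lemmas.

```python
def shingles(tokens, n=3):
    """Return the set of word n-grams (shingles) of a token list."""
    if len(tokens) < n:
        # Too short for a full n-gram: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_source(doc_id, docs, n=3, threshold=0.2):
    """Return the id of the most similar other document.

    docs maps document id -> list of lemmas. When no other document
    reaches the (assumed) threshold, the document's own id is returned,
    meaning "no plagiarism detected".
    """
    own = shingles(docs[doc_id], n)
    best_id, best_sim = doc_id, 0.0
    for other_id, tokens in docs.items():
        if other_id == doc_id:
            continue
        sim = jaccard(own, shingles(tokens, n))
        if sim > best_sim:
            best_id, best_sim = other_id, sim
    return best_id if best_sim >= threshold else doc_id


# Lemmas of four small documents (hypothetical ids "1", "2", "3", "5").
docs = {
    "1": ["dnes", "být", "pěkný", "den", "!"],
    "2": ["dnes", "být", "mnoho", "pěkný", "den", "!"],
    "3": ["dnes", "být", "ale", "pěkný", "den", "!"],
    "5": ["xx"] * 6,
}
print(detect_source("3", docs, n=2))  # prints 1
```

Small n makes the detector tolerant of inserted or reordered words but raises the number of false matches between documents on the same topic; the threshold trades precision against recall in the same way.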

Text processing pipelines

Czech: alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3
Czech web interface
English: alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'

Input data example (2 + 3 documents only)

<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="2">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
<g/>
xx  xx  x
</s>
</doc>
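
A minimal sketch of how such a vertical could be read in Python. It assumes tab-separated columns (word, lemma, tag) and treats any tab-free line of the form <...> as a structure; the name parse_vertical is hypothetical and is not taken from the frame script.

```python
import re

DOC_OPEN = re.compile(r'<doc\b([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (attributes, tokens) for every <doc> structure.

    attributes is a dict such as {"author": ..., "id": ...};
    tokens is a list of (word, lemma, tag) tuples.
    """
    attrs, tokens = None, []
    for line in lines:
        line = line.rstrip('\n')
        is_structure = ('\t' not in line
                        and line.startswith('<') and line.endswith('>'))
        if is_structure:
            opening = DOC_OPEN.match(line)
            if opening:
                attrs, tokens = dict(ATTR.findall(opening.group(1))), []
            elif line.startswith('</doc') and attrs is not None:
                yield attrs, tokens
                attrs = None
            # Other structures (<s>, </s>, <g/>) are skipped.
        elif line.strip() and attrs is not None:
            fields = line.split('\t')
            # Pad short lines so that every token has three fields.
            word, lemma, tag = (fields + ['', ''])[:3]
            tokens.append((word, lemma, tag))
```

In a script fed by one of the pipelines, the documents could then be collected with `for attrs, tokens in parse_vertical(sys.stdin)`.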

Output example

Doc set by Já První
3       1       1
4       2       2
5       5       1
Set precision: 0.67, recall: 1.00, F1: 0.80
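
The evaluation line can be reproduced with the sketch below. The counting convention is an assumption inferred from the sample output, not taken from the frame script: every output line counts as a reported detection, a line with a wrong source is a false positive, and only plagiarisms without any output line count as false negatives. The function name evaluate and the parameter total_plagiarisms are hypothetical.

```python
def evaluate(rows, total_plagiarisms=None):
    """Return (precision, recall, F1) for (id, detected, real) rows.

    Assumed convention: each row is a reported detection; plagiarisms
    with no row at all are misses (counted via total_plagiarisms).
    """
    rows = list(rows)
    tp = sum(1 for _, detected, real in rows if detected == real)
    fp = len(rows) - tp
    fn = 0
    if total_plagiarisms is not None:
        fn = max(total_plagiarisms - len(rows), 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


rows = [("3", "1", "1"), ("4", "2", "2"), ("5", "5", "1")]
p, r, f1 = evaluate(rows)
line = "Set precision: %.2f, recall: %.2f, F1: %.2f" % (p, r, f1)
print(line)  # Set precision: 0.67, recall: 1.00, F1: 0.80
```

Under a stricter convention, where a wrongly attributed plagiarism also counts as a miss, the same rows would give a recall of 2/3, so the convention should be checked against the frame script before comparing scores.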

Frame script to extend, implementing a simple plagiarism detection technique

plagiarism_simple.py

Usage (Czech or English input):

cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py