Building Language Resources from the Web
Web crawling, boilerplate removal, de-duplication and plagiarism detection.
IA161 Advanced NLP Course, Course Guarantee: Aleš Horák
Prepared by: Vít Suchomel
State of the Art
References
- Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
- Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
- Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
- 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse
Slides
Practical Session task
Plagiarists vs. plagiarism detectors
- Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. 100 words $\leq$ document length $\leq$ 500 words. 20\% $\leq$ plagiarised content $\leq$ 75\% (100\% if done well).
- Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your own plagiarisms and must fail to detect at least 1 of them.
- Input format: POS-tagged vertical consisting of 10 structures doc with attributes author, id, class, source. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (in case of plagiarism) or the document's own id (in case of original).\footnote{{\tiny For the sake of simplicity: a plagiarism cannot have more than one source here.}}
- Output format: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
- Your script will be evaluated using data made by others.
Text processing pipelines
- Czech:
alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3
- Czech web interface
- English:
alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'
Input data example (2 + 3 documents only)
<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
moc	mnoho	k6
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
ale	ale	k9
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
ale	ale	k9
moc	mnoho	k6
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx	xx	x
xx	xx	x
xx	xx	x
xx	xx	x
xx	xx	x
<g/>
xx	xx	x
</s>
</doc>
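Reading this input could be sketched as follows. This is a minimal sketch, not the course's frame script: the function name `parse_vertical` and the returned dictionary layout are hypothetical. It assumes each structure tag sits on its own line and each token line has at least three tab-separated columns (word, lemma, tag); any extra columns are ignored. As a simplification, a token line that itself starts with `<` is skipped.

```python
import re

# Opening <doc ...> tag and its key="value" attributes (attribute order is not assumed).
DOC_OPEN = re.compile(r'<doc\b([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield one dict per <doc>: its attributes plus a 'tokens' list of (word, lemma, tag)."""
    doc = None
    for line in lines:
        line = line.rstrip('\n')
        match = DOC_OPEN.match(line)
        if match:
            # Start a new document and collect author, id, class, source.
            doc = dict(ATTR.findall(match.group(1)))
            doc['tokens'] = []
        elif line.startswith('</doc'):
            if doc is not None:
                yield doc
            doc = None
        elif doc is not None and line and not line.startswith('<'):
            # Token line: keep the first three columns only.
            fields = line.split('\t')
            if len(fields) >= 3:
                doc['tokens'].append(tuple(fields[:3]))
```

In a script fed by one of the pipelines, the documents could be read with `for doc in parse_vertical(sys.stdin)` and split into originals and plagiarisms by `doc['class']`.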
Output example
Doc set by Já První
3	1	1
4	2	2
5	5	1
Set precision: 0.67, recall: 1.00, F1: 0.80
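The evaluation line can be reproduced from counts of true positives, false positives and false negatives using the standard definitions of precision, recall and F1. The sketch below is an illustration only: the helper names `prf` and `format_report` are hypothetical, and how the course evaluation tallies the three counts is not specified here. The counts 2, 1 and 0 mentioned in the note after the code are an assumption chosen because they match the example figures.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def format_report(author, rows, tp, fp, fn):
    """Format one document set: a header, one TAB-separated row per plagiarism, the scores."""
    lines = ['Doc set by %s' % author]
    # Each row is (id, detected source id, real source id).
    lines += ['\t'.join(row) for row in rows]
    lines.append('Set precision: %.2f, recall: %.2f, F1: %.2f' % prf(tp, fp, fn))
    return '\n'.join(lines)
```

Calling `format_report('Já První', [('3', '1', '1'), ('4', '2', '2'), ('5', '5', '1')], 2, 1, 0)` yields the same five lines as the output example.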
Frame script implementing a simple plagiarism detection technique, to be extended
Usage (Czech or English input):
cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
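One possible direction for extending the frame script is to compare documents by their lemma n-grams (shingles) using Jaccard similarity, a near-duplicate detection technique. The sketch below is not what `plagiarism_simple.py` does. The function names, the parameters `n` and `threshold`, and the convention of returning the suspect's own id when nothing passes the threshold (modelled on the row `5 5 1` in the output example) are all assumptions.

```python
def shingles(lemmas, n=2):
    """Return the set of lemma n-grams (shingles) of one document."""
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_source(suspect_id, suspect_lemmas, originals, n=2, threshold=0.3):
    """Return the id of the most similar original document.

    originals maps document id -> list of lemmas. If no original reaches
    the threshold, the suspect's own id is returned (assumed convention).
    """
    suspect_set = shingles(suspect_lemmas, n)
    best_id, best_score = suspect_id, 0.0
    for orig_id, orig_lemmas in originals.items():
        score = jaccard(suspect_set, shingles(orig_lemmas, n))
        if score > best_score:
            best_id, best_score = orig_id, score
    return best_id if best_score >= threshold else suspect_id
```

With `n=2` and `threshold=0.3`, the lemma sequences from the input example give sources 1, 2 and 5 (own id) for documents 3, 4 and 5. Longer documents would likely need a larger `n` and a tuned threshold. Comparing lemmas rather than word forms makes the measure less sensitive to inflection changes, which matters for Czech.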







