Building Language Resources from the Web
Web crawling, boilerplate removal, de-duplication and plagiarism detection.
IA161 Advanced NLP Course, Course Guarantor: Aleš Horák
Prepared by: Vít Suchomel
State of the Art
References
- Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
- Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
- Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
- Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
- 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse
Slides
http://nlp.fi.muni.cz/~xsuchom2/anlp-05-LanguageResourcesFromWeb.pdf
Practical Session task
Plagiarists vs. plagiarism detectors
- Create 5 documents (on a similar topic) and 5 plagiarisms of these documents, 10 documents in total. Each document must be 100 to 500 words long. Plagiarised content must make up 20% to 75% of each plagiarism (100% if done well).
- Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your plagiarisms and must fail to detect at least 1 of them.
- Input format: POS-tagged vertical consisting of 10 doc structures with the attributes author, id, class and source. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
- Output format: one plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
- Your script will be evaluated using data made by others.
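One possible detection algorithm, given here only as a minimal sketch, is to compare sets of lemma n-grams with the Jaccard coefficient. This is not necessarily the technique used in plagiarism_simple.py. The function names, the bigram size and the threshold of 0.2 are assumptions chosen for illustration, and the sketch assumes that suspicious documents are compared only against documents of class "original". The lemmas come from the input data example on this page.

```python
def shingles(lemmas, n=2):
    """Return the set of lemma n-grams (shingles) of one document."""
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_source(suspect_id, suspect_lemmas, originals, n=2, threshold=0.2):
    """Return the id of the most similar original document.

    If no original is more similar than `threshold` (an assumed value),
    the suspect's own id is returned, i.e. nothing was detected.
    """
    suspect_set = shingles(suspect_lemmas, n)
    best_id, best_sim = suspect_id, threshold
    for orig_id, orig_lemmas in originals.items():
        sim = jaccard(suspect_set, shingles(orig_lemmas, n))
        if sim > best_sim:
            best_id, best_sim = orig_id, sim
    return best_id


# Lemma sequences taken from the input data example on this page.
originals = {
    "1": ["dnes", "být", "pěkný", "den", "!"],
    "2": ["dnes", "být", "mnoho", "pěkný", "den", "!"],
}
# id -> (lemmas, real source id)
suspects = {
    "3": (["dnes", "být", "ale", "pěkný", "den", "!"], "1"),
    "4": (["dnes", "být", "ale", "mnoho", "pěkný", "den", "!"], "2"),
    "5": (["xx"] * 6, "1"),
}

for doc_id, (lemmas, real_source) in suspects.items():
    detected = detect_source(doc_id, lemmas, originals)
    # Required output format: id TAB detected source id TAB real source id
    print(f"{doc_id}\t{detected}\t{real_source}")
```

With these assumed settings the sketch assigns document 3 to source 1 and document 4 to source 2, and leaves document 5 undetected. Larger n makes the comparison stricter, while a lower threshold trades precision for recall.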
Text processing pipelines
- Czech:
alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3
- English:
alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'
Input data example (2 + 3 documents only)
<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
moc	mnoho	k6
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
ale	ale	k9
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes	dnes	k6eAd1
je	být	k5eAaImIp3nS
ale	ale	k9
moc	mnoho	k6
pěkný	pěkný	k2eAgInSc4d1	pěkný
den	den	k1gInSc4	den
<g/>
!	!	k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx	xx	x
xx	xx	x
xx	xx	x
xx	xx	x
xx	xx	x
<g/>
xx	xx	x
</s>
</doc>
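Reading this input can be sketched as follows. This is a minimal illustration, not the parser from plagiarism_simple.py: the name parse_vertical is hypothetical, and the sketch assumes one token per line with tab-separated word, lemma and tag columns, with every structure tag on its own line. As a simplification, any line starting with "<" is treated as markup, so a token that itself begins with "<" would be skipped.

```python
import re

DOC_OPEN = re.compile(r'<doc\s+([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (attributes, tokens) for each <doc> structure.

    attributes is a dict such as {"author": ..., "id": ..., ...};
    tokens is a list of (word, lemma, tag) tuples. Structure lines
    such as <s>, </s> and <g/> are skipped, and any columns beyond
    the third are ignored.
    """
    attrs, tokens = None, []
    for line in lines:
        line = line.rstrip('\n')
        match = DOC_OPEN.match(line)
        if match:
            attrs, tokens = dict(ATTR.findall(match.group(1))), []
        elif line.startswith('</doc'):
            if attrs is not None:
                yield attrs, tokens
            attrs = None
        elif line.startswith('<') or not line.strip():
            continue  # other markup or an empty line
        elif attrs is not None:
            cols = line.split('\t')
            if len(cols) >= 3:
                tokens.append(tuple(cols[:3]))


sample = (
    '<doc author="Já První" id="1" class="original" source="1">\n'
    '<s>\n'
    'Dnes\tdnes\tk6eAd1\n'
    'je\tbýt\tk5eAaImIp3nS\n'
    '<g/>\n'
    '!\t!\tk?\n'
    '</s>\n'
    '</doc>\n'
)

for attrs, tokens in parse_vertical(sample.splitlines(True)):
    lemmas = [lemma for _word, lemma, _tag in tokens]
    print(attrs["id"], attrs["class"], lemmas)
```

In a real run the lines would come from sys.stdin rather than from the sample string, since the tagging pipelines write the vertical to standard output.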
Output example
Doc set by Já První
3	1	1
4	2	2
5	5	1
Set precision: 0.67, recall: 1.00, F1: 0.80
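The F1 value on the evaluation line is the harmonic mean of precision and recall. The short sketch below only checks that relation for the numbers shown above. It takes precision and recall as given, because this page does not spell out how they are counted. The function name f1_score is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the evaluation line of the output example.
print(f"F1: {f1_score(0.67, 1.00):.2f}")  # F1: 0.80
```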
Skeleton script implementing a simple plagiarism detection technique, intended to be extended
Usage (Czech or English input):
cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
Attachments (3)
- training_data.vert (491.9 KB)
- plagiarism_simple.py (6.7 KB)
- anlp-05-LanguageResourcesFromWeb.pdf (6.2 MB)