
Version 15 (modified by xsuchom2, 3 years ago)

Building Language Resources from the Web

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

IA161 Advanced NLP Course, Course Guarantor: Aleš Horák

Prepared by: Vít Suchomel

State of the Art

References

  1. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
  2. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
  3. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
  4. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, CEUR Workshop Proceedings, ISSN 1613-0073.
  5. 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse

Slides

http://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf

Practical Session task

Plagiarists vs. plagiarism detectors

See the slides.

Text processing pipelines

  • Czech: alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3
  • Czech web interface
  • English: alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'

Input data example (2 + 3 documents only)

<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="2" class="original" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="3" class="plagiarism" source="1">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="4" class="plagiarism" source="2">
<s>
Dnes    dnes    k6eAd1
je  být k5eAaImIp3nS
ale ale k9
moc mnoho   k6
pěkný   pěkný   k2eAgInSc4d1    pěkný
den den k1gInSc4    den
<g/>
!   !   k?
</s>
</doc>
<doc author="Já První" id="5" class="plagiarism" source="1">
<s>
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
xx  xx  x
<g/>
xx  xx  x
</s>
</doc>
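
The input format above can be read with a few lines of Python. The following is a minimal sketch, not part of the provided script: the names DOC_HEADER and parse_vertical are hypothetical, and it assumes each token line carries word, lemma and tag in its first three columns (tab-separated in the real data, shown with spaces here).

```python
import re

# Matches headers such as <doc author="..." id="1" class="original" source="1">
DOC_HEADER = re.compile(
    r'<doc author="([^"]*)" id="(\d+)" class="(original|plagiarism)" source="(\d+)"'
)


def parse_vertical(lines):
    """Yield one dict per <doc>, with its metadata and (word, lemma, tag) tokens."""
    doc = None
    for line in lines:
        line = line.rstrip("\n")
        header = DOC_HEADER.match(line)
        if header:
            author, doc_id, doc_class, source = header.groups()
            doc = {
                "author": author,
                "id": int(doc_id),
                "class": doc_class,
                "source": int(source),
                "tokens": [],
            }
        elif line.startswith("</doc>"):
            if doc is not None:
                yield doc
            doc = None
        elif doc is not None and line and not line.startswith("<"):
            # Token line. Assumption: no token starts with "<"; structural
            # tags such as <s>, </s> and <g/> are skipped by the test above.
            fields = line.split("\t") if "\t" in line else line.split()
            if len(fields) >= 3:
                doc["tokens"].append(tuple(fields[:3]))
```

Calling list(parse_vertical(sys.stdin)) after importing sys would collect all documents from a pipeline; grouping them by the author attribute then gives the per-author document sets that the output below reports on.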

Output example

Doc set by Já První
3       1       1
4       2       2
5       5       1
Set precision: 0.67, recall: 1.00, F1: 0.80
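
The F1 value in the summary line is the harmonic mean of precision and recall. The sketch below shows that arithmetic. The function name precision_recall_f1 is hypothetical, and the counts tp=2, fp=1, fn=0 are an assumption: they are one combination that reproduces the figures above, since the page does not state how the script counts hits and misses.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Assumed counts, chosen only because they match the example output.
p, r, f1 = precision_recall_f1(tp=2, fp=1, fn=0)
print("Set precision: %.2f, recall: %.2f, F1: %.2f" % (p, r, f1))
# prints: Set precision: 0.67, recall: 1.00, F1: 0.80
```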

Skeleton script implementing a simple plagiarism detection technique, to be extended

plagiarism_simple.py

Usage (Czech or English input):

cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
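
As a starting point for extending the script, one simple technique is to compare bag-of-lemmas vectors by cosine similarity and report the most similar original. The sketch below is an illustration under stated assumptions, not the contents of plagiarism_simple.py: the names cosine and detect_source and the threshold of 0.5 are hypothetical, and the lemma lists are taken from the input data example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two Counter-based frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def detect_source(suspicious_lemmas, originals, threshold=0.5):
    """Return (doc_id, score) of the most similar original, or (None, score)
    when no original reaches the (assumed) similarity threshold."""
    suspect = Counter(suspicious_lemmas)
    best_id, best_score = None, 0.0
    for doc_id, lemmas in originals.items():
        score = cosine(suspect, Counter(lemmas))
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Lemmas of the two original documents from the input data example.
originals = {
    1: ["dnes", "být", "pěkný", "den", "!"],
    2: ["dnes", "být", "mnoho", "pěkný", "den", "!"],
}
doc3 = ["dnes", "být", "ale", "pěkný", "den", "!"]
print(detect_source(doc3, originals)[0])  # prints 1
```

Word frequencies are easy to defeat by inserting or replacing words, so a natural next step is to compare word or lemma n-grams (shingles) instead of single lemmas, in the spirit of the near-duplicate detection chapter listed in the references.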
