LanguageResourcesFromWeb

Timestamp: Sep 11, 2017, 4:38:19 PM
Author: Ales Horak
Comment: copied from private/AdvancedNlpCourse/LanguageResourcesFromWeb

en/AdvancedNlpCourse2015/LanguageResourcesFromWeb

                       v1
+= Building Language Resources from the Web =
+Web crawling, boilerplate removal, de-duplication and plagiarism detection.
+[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák
+Prepared by: Vít Suchomel
+== State of the Art ==
+=== References ===
+. Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
+. Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
+. Hacohen-Kerner, Yaakov, Aharon Tayeb, and Natan Ben-Dror. "Detection of simple plagiarism in computer science papers." Coling, 2010.
+. Potthast, M., Hagen, M., Goering, S., Rosso, P., and Stein, B. (2015). Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF, ISSN 1613-0073.
+. [http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/ 13th evaluation lab on uncovering plagiarism, authorship, and social software misuse]
+{{{#!comment
+. Potthast et al. "Overview of the 6th International Competition on Plagiarism Detection." CLEF, 2014.
+. Potthast, M., Stein, B., Barron-Cedeno, A., and Rosso, P. (2010). An Evaluation Framework for Plagiarism Detection. In Proceedings of COLING 2010, Beijing, China. ACL.
+}}}
+== Slides ==
+[https://nlp.fi.muni.cz/trac/research/raw-attachment/wiki/en/AdvancedNlpCourse/anlp-05-LanguageResourcesFromWeb.pdf]
+== Practical Session task ==
+=== Plagiarists vs. plagiarism detectors ===
+. Create 5 documents on a similar topic and 5 plagiarisms of these documents, 10 documents in total. Each document must be 100 to 500 words long, and plagiarized content must make up 20% to 75% of each plagiarism (100% if done well).
+. Select a detection algorithm and implement it in Python. Your own script must detect at least 1 of your own plagiarisms and must fail to detect at least 1 of them.
+. Input format: POS-tagged vertical consisting of 10 `doc` structures with the attributes `author`, `id`, `class`, `source`. The pair (author, id) is unique. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original). For the sake of simplicity, a plagiarism cannot have more than one source here.
+. Output format: one plagiarism per line: id `TAB` detected source id `TAB` real source id. Evaluation line: precision, recall, F1 measure.
+. Your script will be evaluated using data made by others.
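One possible detection algorithm, given here only as a minimal sketch and not as the technique used in the course's frame script, compares documents by their shared word n-grams (shingles). The function names `shingles`, `containment` and `most_similar`, the default n-gram length and the input layout (a list of lemmas per document) are all assumptions made for illustration. The sketch only ranks candidate sources; deciding which document of a similar pair is the original is left open.

```python
def shingles(tokens, n=3):
    """Return the set of word n-grams (shingles) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspect, source, n=3):
    """Fraction of the suspect's shingles that also occur in the source."""
    suspect_shingles = shingles(suspect, n)
    if not suspect_shingles:
        return 0.0
    shared = suspect_shingles & shingles(source, n)
    return len(shared) / len(suspect_shingles)


def most_similar(suspect, candidates, n=3):
    """Return (candidate id, score) of the candidate with the highest containment.

    candidates: dict mapping document id -> list of lemmas (hypothetical layout).
    Returns (None, 0.0) when there are no candidates.
    """
    best_id, best_score = None, 0.0
    for cand_id, cand_tokens in candidates.items():
        score = containment(suspect, cand_tokens, n)
        if best_id is None or score > best_score:
            best_id, best_score = cand_id, score
    return best_id, best_score


if __name__ == "__main__":
    originals = {
        "1": ["dnes", "být", "pěkný", "den", "!"],
        "2": ["dnes", "být", "mnoho", "pěkný", "den", "!"],
    }
    suspect = ["dnes", "být", "ale", "mnoho", "pěkný", "den", "!"]
    print(most_similar(suspect, originals, n=2))
```

A score threshold (to be tuned on your own data) would then decide whether the best candidate is reported as a source at all. Short n-grams catch lightly edited copies but also raise the number of false alarms between documents on a similar topic.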
+=== Text processing pipelines ===
+ * Czech: {{{alba:/opt/majka/majka-desamb-czech.sh | cut -f1-3}}}
+ * Czech [http://nlp.fi.muni.cz/projekty/rule_ind/index.cgi web interface]
+ * English: {{{alba:/opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}'}}}
+=== Input data example (2 + 3 documents only) ===
+{{{
+<doc author="Já První" id="1" class="original" source="1">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="2" class="original" source="2">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+moc mnoho   k6
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="3" class="plagiarism" source="1">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+ale ale k9
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="4" class="plagiarism" source="2">
+<s>
+Dnes    dnes    k6eAd1
+je  být k5eAaImIp3nS
+ale ale k9
+moc mnoho   k6
+pěkný   pěkný   k2eAgInSc4d1    pěkný
+den den k1gInSc4    den
+<g/>
+!   !   k?
+</s>
+</doc>
+<doc author="Já První" id="5" class="plagiarism" source="1">
+<s>
+xx  xx  x
+xx  xx  x
+xx  xx  x
+xx  xx  x
+xx  xx  x
+<g/>
+xx  xx  x
+</s>
+</doc>
+}}}
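Reading this input could look roughly like the following sketch. It assumes that the columns are separated by tabs, that the second column holds the lemma, and that every line starting with `<` and ending with `>` is a structure tag; the names `ATTR_RE` and `parse_vertical` are hypothetical and not taken from the frame script.

```python
import re

# Matches attribute="value" pairs inside a <doc ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (attributes, lemmas) for each <doc> structure in a vertical file."""
    attrs, lemmas = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            # A new document starts: remember its attributes.
            attrs, lemmas = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</doc"):
            if attrs is not None:
                yield attrs, lemmas
            attrs = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structures such as <s>, </s>, <g/>
        elif line.strip() and attrs is not None:
            columns = line.split("\t")
            # Assumption: column 2 is the lemma; fall back to the word form.
            lemmas.append(columns[1] if len(columns) > 1 else columns[0])


if __name__ == "__main__":
    import sys
    for doc_attrs, doc_lemmas in parse_vertical(sys.stdin):
        print(doc_attrs.get("id"), doc_attrs.get("class"), len(doc_lemmas))
```

Comparing lemmas rather than word forms makes the detection less sensitive to inflection, which matters for Czech in particular.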
+=== Output example ===
+{{{
+Doc set by Já První
+       1       1
+       2       2
+       5       1
+Set precision: 0.67, recall: 1.00, F1: 0.80
+}}}
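The evaluation line can be computed as in the sketch below, which uses the standard definitions: precision is the share of reported plagiarisms whose source was identified correctly, recall is the share of real plagiarisms that were found with the correct source, and F1 is their harmonic mean. The exact counting used by the frame script may differ; the function names and the dictionary layout are assumptions, and the sample values are illustrative rather than derived from the example data above.

```python
def evaluate(detected, real):
    """Return (precision, recall, F1) for a set of reported plagiarisms.

    detected: dict mapping document id -> source id reported by the detector
              (only documents reported as plagiarisms).
    real:     dict mapping document id -> true source id
              (only documents that really are plagiarisms).
    """
    true_positives = sum(1 for doc_id, source_id in detected.items()
                         if real.get(doc_id) == source_id)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def evaluation_line(detected, real):
    """Format the evaluation line in the style of the output example."""
    return "Set precision: %.2f, recall: %.2f, F1: %.2f" % evaluate(detected, real)


if __name__ == "__main__":
    # Illustrative values: two correct detections and one false alarm.
    detected = {"3": "1", "4": "2", "2": "1"}
    real = {"3": "1", "4": "2"}
    print(evaluation_line(detected, real))
```

With two correct detections, one false alarm and no missed plagiarism, precision is 2/3, recall is 1 and F1 is 0.8.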
+=== Frame script to extend (implements a simple plagiarism detection technique) ===
+[attachment:plagiarism_simple.py]
+Usage (Czech or English input):
+{{{
+cat *.vert | /opt/majka/majka-desamb-czech.sh | cut -f1-3 | python plagiarism_simple.py
+cat *.vert | /opt/TreeTagger/tools/tt-english_v2.sh | awk '{print $1"\t"$3"\t"$2}' | python plagiarism_simple.py
+}}}