Building Large Language Resources from the Web

Web crawling, boilerplate removal, de-duplication and plagiarism detection.

IV161 NLP in Practice course, course guarantor: Aleš Horák

Prepared by: Vít Suchomel
Slides

References

State of the Art

Suchomel, Vít. "Better Web Corpora For Corpus Linguistics And NLP." Dissertation thesis. Masaryk University, 2020.
Wu, Junchao, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. "A survey on LLM-generated text detection: Necessity, methods, and future directions." Computational Linguistics 51, no. 1 (2025): 275-338.
Jauhiainen, Tommi, Heidi Jauhiainen, and Krister Lindén. "HeLI-OTS, Off-the-shelf Language Identifier for Text." In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association (ELRA), 2022.
Bevendorff, Janek, et al. "Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection." In Advances in Information Retrieval (ECIR 2021), March 2021. Springer.

Other useful references

Chapters 19 and 20 from C. D. Manning et al. "Introduction to Information Retrieval". Cambridge University Press, 2008.
Schäfer, Roland, and Felix Bildhauer. "Web corpus construction". Morgan & Claypool Publishers, 2013.
Pomikálek, Jan. "Removing boilerplate and duplicate content from web corpora." Dissertation thesis. Masaryk University, 2011.
Broder, Andrei Z. "Identifying and filtering near-duplicate documents." In Annual symposium on combinatorial pattern matching, pp. 1-10. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000.

Practical Session task

Plagiarists vs. plagiarism detectors

Either: Create 5 documents on a similar topic and 5 plagiarisms of these documents, 10 documents in total. (For simplicity, each plagiarism has exactly one source here.) This is the minimal homework.

100 words <= document length <= 500 words
20 % <= plagiarism content <= 90 %
File format: A POS-tagged vertical consisting of doc structures with the attributes author, id, class and source. The pair (author, id) is unique. Start with id = 1. Class is "original" or "plagiarism". Source is the id of the source document (for a plagiarism) or the document's own id (for an original).
A POS-tagged vertical has one token per line in 3 TAB-separated columns: word, lemma (the base form of the word), POS/morphological tag.
Text processing pipelines for converting a text file to a 3-column vertical:
- Czech: asteria04:/opt/majka_pipe/majka-czech_v2.sh | cut -f1-3 or a web interface (short documents only)
  - See an example below.
- English: asteria04:/opt/treetagger_pipe/tt-english_v3.1.sh
For each plagiarism:
- describe the plagiarism technique(s) used
- which detection methods might be able to reveal it – give reasons
- which detection methods might not be able to reveal it – give reasons
Submit one text file containing the 10 documents according to the requirements, plus one text file describing the techniques used and your estimate of which detection techniques may or may not work.
If you don't have access to NLPC machines, it is permitted to submit plain text with document structures instead of a POS tagged vertical.
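The structural requirements above can be checked automatically before submitting. The following is a minimal sketch, not part of the course materials: the function name check_documents and the dict layout (keys author, id, class, source, words) are assumptions made for illustration. It also assumes that source points to an original by the same author, and it leaves the 20 % to 90 % plagiarism content bound unchecked, because how that share is measured is not specified here.

```python
def check_documents(docs, min_words=100, max_words=500):
    """Return a list of problems found; an empty list means all checks passed.

    Each document is a dict with the keys author, id, class, source (strings)
    and words (a list of tokens). This layout is an assumption of the sketch.
    """
    problems = []

    # The pair (author, id) must be unique.
    keys = [(d["author"], d["id"]) for d in docs]
    for key in sorted(set(keys)):
        if keys.count(key) > 1:
            problems.append(f"{key}: the pair (author, id) is not unique")

    originals = {(d["author"], d["id"]) for d in docs if d["class"] == "original"}

    for d in docs:
        key = (d["author"], d["id"])
        if d["class"] == "original":
            # An original names itself as its source.
            if d["source"] != d["id"]:
                problems.append(f"{key}: an original must have source equal to its id")
        elif d["class"] == "plagiarism":
            # Assumption: the source is an original written by the same author.
            if (d["author"], d["source"]) not in originals:
                problems.append(f"{key}: source {d['source']} is not a known original")
        else:
            problems.append(f"{key}: class must be 'original' or 'plagiarism'")

        # Whether punctuation counts as a word is left to the caller.
        n_words = len(d["words"])
        if not min_words <= n_words <= max_words:
            problems.append(f"{key}: {n_words} words, expected {min_words} to {max_words}")

    return problems
```

Calling check_documents on the 10 submitted documents and printing the returned list gives a quick overview of anything that still violates the requirements.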

Or: Select a detection algorithm and implement it in Python. This is the right homework if you want to learn something.

A basic detection script to extend: use the interactive version in Google Colab, or download plagiarism_simple.py and run it yourself: python3 plagiarism_simple.py < training_data.vert.
- This script implements a bag-of-words approach with cosine similarity of word vectors. (For simplicity, each plagiarism has exactly one source here.)
- You can modify the script to
  - use other input attributes than the word or a combination of attributes, e.g. the lemma or the morphological tag
  - or implement another lexical or syntactic detection approach, e.g. word n-grams or Levenshtein distance
  - or implement another semantic detection approach, e.g. the similarity of word2vec vectors
  - or do it another way, be creative; describe how it works in comments in the code.
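To make the options above concrete, here is a hedged sketch of two lexical similarity measures: a bag-of-words cosine similarity and a word n-gram Jaccard overlap. It is not the code of plagiarism_simple.py; all function names are hypothetical, and the inputs are plain lists of words or lemmas.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngrams(tokens, n=3):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram_similarity(tokens_a, tokens_b, n=3):
    """Share of n-grams common to both documents (intersection over union)."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_source(suspect, originals, similarity):
    """Id of the original most similar to the suspect document.

    originals maps a document id to its token list (assumed layout).
    """
    return max(originals, key=lambda doc_id: similarity(suspect, originals[doc_id]))


# Lemmas of the two example sentences used on this page.
original = ["dnes", "být", "pěkný", "den", "!"]
plagiarism = ["dnes", "být", "ale", "pěkný", "den", "!"]
print(cosine_similarity(original, plagiarism))
print(jaccard_ngram_similarity(original, plagiarism, n=2))
```

The bag-of-words measure ignores word order, so it is barely affected by the inserted word, while the n-gram overlap drops more sharply because every n-gram spanning the insertion changes. Which behaviour is preferable depends on the plagiarism techniques you expect.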
Input format: A 3-column vertical, see above. Training data: training_data.vert
Output: One plagiarism per line: id TAB detected source id TAB real source id. Evaluation line: precision, recall, F1 measure.
Your script will be evaluated using data made by others.
Describe which plagiarism detection technique(s) were implemented in a comment at the beginning of your script.
Submit the modified script (or your own script) with a short description in a comment. (The training set output of the script is not required.)
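The evaluation line can be computed as in the following sketch. It shows one plausible reading of the measures, not necessarily the exact counting used by plagiarism_simple.py: a detection counts as correct only if the detected source equals the real source, precision is taken over all documents the detector flagged, and recall over all real plagiarisms. The function name and the dict layout are assumptions.

```python
def evaluate(detected, gold):
    """Return (precision, recall, F1) of detected plagiarism sources.

    detected maps a flagged document to the source id the detector chose;
    gold maps each real plagiarism to its real source id. Keys are
    (author, id) pairs, since the id alone is unique only per author.
    """
    true_positives = sum(
        1 for doc, source in detected.items() if gold.get(doc) == source
    )
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up illustration data: three flagged documents, four real plagiarisms.
detected = {("A", "2"): "1", ("A", "4"): "3", ("A", "6"): "1"}
gold = {("A", "2"): "1", ("A", "4"): "3", ("A", "6"): "5", ("A", "8"): "7"}
precision, recall, f1 = evaluate(detected, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this illustration two of the three flagged documents have the right source and two of the four real plagiarisms are found, so precision is 2/3 and recall is 1/2.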

Examples of a source document and a plagiarism document

<doc author="Já První" id="1" class="original" source="1">
<s>
Dnes    dnes    k6eAd1
je      být     k5eAaImIp3nS
pěkný   pěkný   k2eAgInSc4d1
den     den     k1gInSc4
<g/>
!       !       k?
</s>
</doc>
<doc author="Já První" id="2" class="plagiarism" source="1">
<s>
Dnes    dnes    k6eAd1
je      být     k5eAaImIp3nS
ale     ale     k9
pěkný   pěkný   k2eAgInSc4d1
den     den     k1gInSc4
<g/>
!       !       k?
</s>
</doc>
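
Documents in this format can be read with a few lines of Python. This is a hedged sketch, not the parser used by plagiarism_simple.py: parse_vertical is a hypothetical helper, and it assumes TAB-separated columns, non-nested doc structures, and that any other line of the form <...> is structural markup such as <s>, </s> or <g/>.

```python
import re

# Matches attribute="value" pairs in a <doc ...> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield one dict per doc structure.

    Each dict holds the doc attributes (author, id, class, source) and
    a "tokens" list of (word, lemma, tag) tuples.
    """
    doc = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<doc"):
            doc = dict(ATTR_RE.findall(line))
            doc["tokens"] = []
        elif line.startswith("</doc>"):
            if doc is not None:
                yield doc
            doc = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structural tags: <s>, </s>, <g/>
        elif line.strip() and doc is not None:
            # Keep the first three columns; pad if some are missing.
            cols = line.split("\t")
            word, lemma, tag = (cols + ["", ""])[:3]
            doc["tokens"].append((word, lemma, tag))


sample = [
    '<doc author="Já První" id="2" class="plagiarism" source="1">',
    "<s>",
    "Dnes\tdnes\tk6eAd1",
    "je\tbýt\tk5eAaImIp3nS",
    "ale\tale\tk9",
    "<g/>",
    "!\t!\tk?",
    "</s>",
    "</doc>",
]
for doc in parse_vertical(sample):
    print(doc["author"], doc["id"], doc["class"], doc["source"], len(doc["tokens"]))
```

To read a real file, pass an open file object or sys.stdin instead of the sample list. Any column beyond the third is ignored, which matches the cut -f1-3 step of the Czech pipeline.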

How to produce the 3-column POS-tagged vertical from a plain text file:

scp plagiarism.txt aurora.fi.muni.cz:~/
ssh aurora.fi.muni.cz
ssh asteria04
cat ~/plagiarism.txt | /opt/majka_pipe/majka-czech_v2.sh | cut -f1-3 > ~/plagiarism.vert  #Czech
cat ~/plagiarism.txt | /opt/treetagger_pipe/tt-english_v3.1.sh > ~/plagiarism.vert        #English
logout
logout
scp aurora.fi.muni.cz:~/plagiarism.vert ./

How to run the sample detection script with the training data:

python3 plagiarism_simple.py < training_data.vert

or with your own vertical:

python3 plagiarism_simple.py < plagiarism.vert