Stylometry

IV161 NLP in Practice Course, Course Guarantor: Aleš Horák

Prepared by: Honza Rygl, Aleš Horák, Radoslav Sabol

State of the Art

The analysis of an author's characteristic writing style and vocabulary has been used to infer traits such as authorship, age, or gender from documents, both by manual linguistic approaches and by automatic algorithmic methods.

The most common approach to stylometry problems is to combine stylistic analysis with machine learning techniques:

specific style markers are extracted from the text,
a classification procedure is applied to the extracted markers.
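
The two steps above can be sketched as follows. This is only a minimal illustration, not the method used in the course notebook: the short function-word list, the nearest-centroid classifier, and all names (`extract_markers`, `train`, `classify`, the toy training texts) are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
import math
import re

# Illustrative function-word list (an assumption); real systems use far longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def extract_markers(text):
    """Step 1: relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(labelled_texts):
    """Build one centroid per author from (author, text) pairs."""
    by_author = {}
    for author, text in labelled_texts:
        by_author.setdefault(author, []).append(extract_markers(text))
    return {author: centroid(vecs) for author, vecs in by_author.items()}

def classify(model, text):
    """Step 2: pick the author whose centroid is closest (Euclidean distance)."""
    vec = extract_markers(text)
    return min(model, key=lambda author: math.dist(vec, model[author]))

# Toy data, for illustration only.
training = [
    ("author_a", "It was the best of times, it was the worst of times."),
    ("author_a", "It was the age of wisdom, it was the age of foolishness."),
    ("author_b", "Call me Ishmael. Some years ago, never mind how long precisely."),
    ("author_b", "Whenever I find myself growing grim about the mouth, I go to sea."),
]
model = train(training)
predicted = classify(model, "It was the season of light, it was the season of darkness.")
print(predicted)
```

In practice the nearest-centroid step would usually be replaced by a standard machine learning classifier, and the markers would be computed over much longer texts; the separation into an extraction function and a classification function is the part that carries over.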

References

Huang, B., Chen, C., & Shu, K. (2025). Authorship attribution in the era of LLMs: Problems, methodologies, and challenges. ACM SIGKDD Explorations Newsletter, 26(2), 21-43.
Bevendorff, J. et al. (2023). Overview of PAN 2023: Authorship Verification, Multi-author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13982. Springer, Cham.
Lemmens, J., Markov, I., & Daelemans, W. (2021). Improving hate speech type and target detection with hateful metaphor features. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (pp. 7-16).
Kestemont, M. (2014). Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014 (pp. 59-66).
Daelemans, W. (2013). Explanation in computational stylometry. In International conference on intelligent text processing and computational linguistics (pp. 451-462). Springer, Berlin, Heidelberg.
Lukin, E., Roberts, J.C., Berdik, D. et al. (2023). Adjectives and adverbs as stylometric analysis parameters. Int J Digit Humanities.

Practical Session (For working in Google Colab)

The task uses a Python notebook that runs in a web browser in the Google Colaboratory environment.

To run the code in a local environment instead, you need Python 3 and Jupyter Notebook.

Stylometric Feature Extraction

In this workshop, we will experiment with extracting style markers from text and using them as features.
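
As a rough idea of what such style markers can look like, the sketch below computes a few surface-level features from raw text. It does not reproduce the interface expected by the course notebook: the function name `style_markers`, the choice of features, and the sample sentence are all assumptions made for illustration.

```python
import re
import string

def style_markers(text):
    """Return a dict of simple surface-level style markers (illustrative choice)."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    n_words = len(words) or 1  # guard against empty input
    n_chars = len(text) or 1
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "type_token_ratio": len(set(words)) / n_words,
        "punctuation_rate": sum(ch in string.punctuation for ch in text) / n_chars,
        "uppercase_rate": sum(ch.isupper() for ch in text) / n_chars,
    }

features = style_markers("Well, well! Who is there? It is me.")
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Features like these are cheap to compute and largely topic-independent, which is why they are a common starting point; richer markers (function-word frequencies, character n-grams, part-of-speech patterns) follow the same pattern of mapping a text to a fixed set of numbers.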

Access the Python notebook in the Google Colab environment. Please make your own copy of the notebook (File->Save a Copy in Drive).

Do not forget to save your work if you want to see your changes later; leaving the browser will throw away all changes!

Follow the instructions in the notebook.
Implement a feature extractor that improves the performance of the existing system.
Download your assignment.py and submit it by uploading it to the [en/NlpInPracticeCourse homework vault].