Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2023/Stylometry


Timestamp:
Sep 3, 2024, 2:50:03 PM
Author:
Ales Horak
Comment:

copied from private/NlpInPracticeCourse/Stylometry

  • en/NlpInPracticeCourse/2023/Stylometry

    v1 v1  
     1= Stylometry =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantee: Aleš Horák
     4
     5Prepared by: Honza Rygl, Aleš Horák, Radoslav Sabol
     6
     7== State of the Art ==
     8
     9The analysis of an author's characteristic
     10writing style and vocabulary has been used to determine the authorship of documents, and author traits such as age or gender,
     11by both manual linguistic approaches and automatic algorithmic methods.
     12
     13The most common approach to stylometry problems
     14is to combine stylistic analysis with machine learning techniques:
     15 1. specific style markers are extracted,
     16 2. a classification procedure is applied to the extracted markers.
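
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three markers and the nearest-centroid classifier are chosen here for brevity and are not the method used by any particular stylometry tool, and all function names are hypothetical.

```python
import math
from collections import Counter


def extract_markers(text):
    """Step 1: turn a text into a small vector of style markers."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        sum(len(w) for w in words) / n_words,     # average word length
        sum(c in ",;:" for c in text) / n_chars,  # rate of inner punctuation
        sum(c in "!?" for c in text) / n_chars,   # rate of emphatic punctuation
    ]


def train_centroids(samples):
    """Step 2a: average the marker vectors of each label (nearest-centroid model)."""
    sums, counts = {}, Counter()
    for text, label in samples:
        vec = extract_markers(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(text, centroids):
    """Step 2b: predict the label whose centroid is closest to the text's markers."""
    vec = extract_markers(text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))
```

In practice the markers would be scaled to comparable ranges and the classifier replaced by a stronger machine learning model; the structure (extract, then classify) stays the same.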
     17
     18
     19=== References ===
     20
     21 1. Bevendorff, J. et al. (2023). Overview of PAN 2023: Authorship Verification, Multi-author Writing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13982. Springer, Cham [[https://link.springer.com/chapter/10.1007/978-3-031-28241-6_60 | link]]
     22 1. Lemmens, J., Markov, I., & Daelemans, W. (2021). Improving hate speech type and target detection with hateful metaphor features. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (pp. 7-16). [[https://aclanthology.org/2021.nlp4if-1.2.pdf | pdf]]
     23 1. Kestemont, M. (2014). Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014 (pp. 59–66). [[http://aclweb.org/anthology/W14-0908 | pdf]]
     24 1. Daelemans, W. (2013). Explanation in computational stylometry. In International conference on intelligent text processing and computational linguistics (pp. 451-462). Springer, Berlin, Heidelberg. [[https://www.clips.uantwerpen.be/sites/default/files/daelemans2013.pdf | pdf]]
     25 1. Lukin, E., Roberts, J.C., Berdik, D. et al. Adjectives and adverbs as stylometric analysis parameters. Int J Digit Humanities (2023). [[https://link.springer.com/article/10.1007/s42803-023-00065-y | link]]
     26
     27== Practical Session ==
     28
     29{{{
     30#!div class="wiki-toc" style="width: 40%"
     31**Note:** If you are new to the [https://en.wikipedia.org/wiki/Command-line_interface command line interface] via a [https://en.wikipedia.org/wiki/Terminal_emulator terminal window], you may find the **[https://ubuntu.com/tutorials/command-line-for-beginners#3-opening-a-terminal tutorial for working in terminal]** useful.
     32}}}
     33
     34Students will work with the ''Style & Identity Recognition'' (SIR) tool. They will test this tool on prepared data.
     35The goal will be to implement a small function to extract style markers from a text.
     36
     371. Log in to the `asteria04.fi.muni.cz` server:
     38{{{
     39ssh asteria04.fi.muni.cz
     40}}}
     412. Download the [[htdocs:bigdata/stylometry-assignment.zip|ZIP archive with the Python packages for the assignment]]
     42{{{
     43wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/stylometry-assignment.zip
     44}}}
     453. Unzip the downloaded file
     46{{{
     47unzip stylometry-assignment.zip
     48}}}
     494. Go to the unzipped folder
     50{{{
     51cd sir-assignment
     52}}}
     535. Test the prepared program that analyses data from on-line dating services to distinguish gender (masculine/feminine) by text style features
     54{{{
     55./run.sh
     56}}}
     57
     58The default dataset is in Czech; this can be changed by editing `run.sh` (use e.g. `vi` or `nano` as the editor):
     59{{{
     60nano run.sh
     61}}}
     62For English, comment out the `cs` line and uncomment the line with `export DATA_LANGUAGE='en'`:
     63{{{
     64#export DATA_LANGUAGE='cs'
     65export DATA_LANGUAGE='en'
     66}}}
     67
     68`run.sh` can have two optional parameters:
     69{{{
     70./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]
     71}}}
     72The defaults, used when `./run.sh` is run without parameters, are `10` cycles and no documents shown (equivalent to `./run.sh 10 0`). A longer test such as `./run.sh 100` may give better results, but not necessarily.
     73
     74Example with document output (second parameter greater than `0`):
     75{{{
     76[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
     77pos: 5
     78expected: on
     79predicted: ona
     80text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
     81      venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny
     82      veselé povahy preferovány; ona je to nejspíš nutnost :-)
     83morphology: [
     84     1. <s>           <s>
     85     2. Ahoj          N.N.I.S.1.-.-.-.-.-.A.-.-.-.-
     86     3. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     87     4. (             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     88     5. nejen         T.T.-.-.-.-.-.-.-.-.-.-.-.-.-
     89     6. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     90     7. pro           R.R.-.-.4.-.-.-.-.-.-.-.-.-.-
     91     8. výlety        N.N.I.P.4.-.-.-.-.-.A.-.-.-.-
     92     9. do            R.R.-.-.2.-.-.-.-.-.-.-.-.-.-
     93    10. víru          N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
     94    11. podivnězimního  A.A.N.S.2.-.-.-.-.1.A.-.-.-.-
     95    12. velkoměsta    N.N.N.S.2.-.-.-.-.-.A.-.-.-.-
     96    13. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     97    14. či            J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
     98    15. divočiny      N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
     99    16. venkova       N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
     100    17. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     101    18. hledá         V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
     102    19. se            P.7.-.X.4.-.-.-.-.-.-.-.-.-.-
     103    20. partnerka     N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
     104    21. přiměřených   A.A.I.P.2.-.-.-.-.1.A.-.-.-.-
     105    22. rozměrů       N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
     106    23. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     107    24. tvarů         N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
     108    25. a             J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
     109    26. úrovně        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
     110    27. .             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     111    28. <s>           <s>
     112    29. Slečny        N.N.F.P.1.-.-.-.-.-.A.-.-.-.-
     113    30. veselé        A.A.N.S.1.-.-.-.-.1.A.-.-.-.-
     114    31. povahy        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
     115    32. preferovány   V.s.T.P.-.-.-.X.X.-.A.P.-.-.-
     116    33. ;             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     117    34. ona           P.P.F.S.1.-.-.3.-.-.-.-.-.-.-
     118    35. je            V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
     119    36. to            P.D.N.S.1.-.-.-.-.-.-.-.-.-.-
     120    37. nejspíš       D.g.-.-.-.-.-.-.-.3.A.-.-.-.-
     121    38. nutnost       N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
     122    39. :             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     123    40. -             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     124    41. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     125]
     126Acc: 73.8 +- 2.1% (baseline 50.0%, 10 iterations)
     127}}}
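
The final `Acc:` line summarises the accuracy over all testing cycles. How the tool computes it is not shown here; the sketch below assumes the two numbers are the mean and the sample standard deviation of the per-cycle accuracies, and that the baseline is passed in rather than derived from the data. The function name `summarize_accuracy` is hypothetical.

```python
import statistics


def summarize_accuracy(accuracies, baseline=0.5):
    """Format per-cycle accuracies (fractions in 0..1) as an `Acc:` style line.

    Assumption: mean +- sample standard deviation; the real tool may differ.
    """
    mean = statistics.mean(accuracies)
    # The sample standard deviation needs at least two values.
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return (
        f"Acc: {mean * 100:.1f} +- {spread * 100:.1f}% "
        f"(baseline {baseline * 100:.1f}%, {len(accuracies)} iterations)"
    )
```

Under this reading, a change to `assignment.py` is only a clear improvement when the mean rises by more than the reported spread.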
     128To print more details, add code to `http_server/basic_task.py` after the `# print explanation(s)` comment.
     129
     130[[https://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html | Tagset reference]]
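
The tags in the output above are positional: each of the 15 dot-separated positions encodes one morphological category, and `-` means the category does not apply. A minimal sketch of decoding such a tag follows. The position names are taken from the tagset reference; the helper name `decode_tag` and the dot-separated input format (as printed by `run.sh`) are assumptions about how you would use it in your own code.

```python
# Names of the 15 positions of a PDT positional tag, in order.
POSITIONS = (
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "var",
)


def decode_tag(tag):
    """Map a dot-separated positional tag to {category: value}, skipping '-'.

    Returns an empty dict for anything that is not a 15-position tag,
    e.g. the sentence boundary marker `<s>`.
    """
    values = tag.split(".")
    if len(values) != len(POSITIONS):
        return {}
    return {name: value for name, value in zip(POSITIONS, values) if value != "-"}
```

For example, the tag of `partnerka` decodes to a noun (`pos` `N`) that is feminine (`F`), singular (`S`), in the first case (nominative).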
     131
     132=== Task ===
     133Examine the files in the `stylometry_features` folder.
     134Modify the `assignment.py` file to increase the accuracy of the methods (use `vi` or `nano` for editing).
     135You can create additional classes in this file and rename the existing ones as needed. Don't forget to add new classes to the `assignments` list in this file.
     136
     137Suggestions for your inspiration include:
     138 * diacritics usage (yes/no), a regular expression will be needed
     139 * sentence endings (number of sentences, or typical endings)
     140 * repetitions of words in sentences/in the text
     141 * usage of uppercase letters
     142 * length of sentences/text
     143 * POS tags (n-grams)
     144 * word n-grams
     145 * character n-grams
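
Several of the suggestions above can be computed with the standard library alone. The functions below are a sketch, not part of the assignment package: their names are hypothetical, they take plain text rather than whatever document object `assignment.py` passes to its classes, and the diacritics pattern covers Czech letters only. Adapt them to the class interface you find in the `stylometry_features` folder.

```python
import re
from collections import Counter

# Czech letters with diacritics; IGNORECASE also covers the uppercase forms.
DIACRITICS = re.compile(r"[áčďéěíňóřšťúůýž]", re.IGNORECASE)
# A run of sentence-final punctuation, e.g. ".", "?!" or "...".
SENTENCE_END = re.compile(r"[.!?]+")


def uses_diacritics(text):
    """Diacritics usage (yes/no)."""
    return bool(DIACRITICS.search(text))


def uppercase_ratio(text):
    """Share of uppercase letters among all letters."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def sentence_lengths(text):
    """Length of each sentence in words (naive split on . ! ?)."""
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return [len(s.split()) for s in sentences]


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Counts of overlapping, lowercased word n-grams."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

The same sliding-window idea used in `word_ngrams` applies to POS tags: replace the list of words with the list of tags (or just their first position) from the morphological analysis.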
     146
     147Each modification can be tested by running `./run.sh` again.
     148The first call of `run.sh` can be slower, because documents are morphologically analysed during the first run.
     149
     150Write the resulting `Acc:` line into the top comment of the `assignment.py` file. [[br]]
     151Submit your `assignment.py` file.