Stylometry

IA161 Advanced NLP Course, Course Guarantee: Aleš Horák

Prepared by: Honza Rygl, Aleš Horák

State of the Art

The analysis of an author's characteristic writing style and vocabulary has been used to uncover traits of a document's author, such as identity (authorship), age, or gender, by both manual linguistic approaches and automatic algorithmic methods.

The most common approach to stylometry problems is to combine stylistic analysis with machine learning techniques:

  1. specific style markers are extracted,
  2. a classification procedure is applied to the extracted markers.
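
The two steps above can be sketched as follows. This is only a minimal illustration and not how the SIR tool works: the choice of character trigram frequencies as style markers, the nearest-centroid classifier with cosine similarity, all function names, and the two training texts are assumptions made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Step 1: style markers as relative frequencies of character n-grams."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero for short texts
    return {gram: count / total for gram, count in counts.items()}


def centroid(vectors):
    """Average several sparse marker vectors into one class profile."""
    summed = Counter()
    for vector in vectors:
        summed.update(vector)  # adds the values key by key
    return {gram: value / len(vectors) for gram, value in summed.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * b.get(gram, 0.0) for gram, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts):
    """Build one averaged profile per label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(char_ngrams(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(profiles, text):
    """Step 2: classify by the most similar class profile."""
    markers = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(markers, profiles[label]))


# Toy data invented for this sketch; real experiments need many documents.
toy_training = [
    ("hledám partnerku pro výlety", "on"),
    ("hledám partnera pro výlety", "ona"),
]
profiles = train(toy_training)
print(predict(profiles, "partnerku na výlety"))
```

Any other marker extractor that returns a dictionary of numeric values could replace char_ngrams here without changing the classification step.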

References

  1. Bevendorff, J., et al. (2020). Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham.
  2. Stamatatos, E. (2009). A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.
  3. Kestemont, M. (2014). Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014, 59–66.
  4. Daelemans, W. (2013). Explanation in Computational Stylometry. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 451-462). Springer, Berlin, Heidelberg.

Practical Session

Students will work with the Style & Identity Recognition (SIR) tool. They will test this tool on prepared data. The goal will be to implement a small function to extract style markers from a text.

  1. Log in to the asteria04.fi.muni.cz server
    ssh asteria04.fi.muni.cz
    
  2. Download the ZIP archive with the Python packages for the assignment
    wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/stylometry-assignment.zip
    
  3. Unzip the downloaded file
    unzip stylometry-assignment.zip
    
  4. Go to the unzipped folder
    cd sir-assignment
    
  5. Test the prepared program that analyses data from on-line dating services to distinguish gender (masculine/feminine) by text style features
    ./run.sh
    

run.sh can have two optional parameters:

./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]

The default values, used when ./run.sh is run without parameters, are 10 cycles and no documents shown (the same as ./run.sh 10 0). Testing features over more cycles, e.g. ./run.sh 100, could provide better results (but not necessarily).

Example with document output (second parameter >0):

[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
pos: 5
expected: on
predicted: ona
text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
      venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny
      veselé povahy preferovány; ona je to nejspíš nutnost :-)
morphology: [
     1. <s>           <s>
     2. Ahoj          N.N.I.S.1.-.-.-.-.-.A.-.-.-.-
     3. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     4. (             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     5. nejen         T.T.-.-.-.-.-.-.-.-.-.-.-.-.-
     6. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     7. pro           R.R.-.-.4.-.-.-.-.-.-.-.-.-.-
     8. výlety        N.N.I.P.4.-.-.-.-.-.A.-.-.-.-
     9. do            R.R.-.-.2.-.-.-.-.-.-.-.-.-.-
    10. víru          N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
    11. podivnězimního  A.A.N.S.2.-.-.-.-.1.A.-.-.-.-
    12. velkoměsta    N.N.N.S.2.-.-.-.-.-.A.-.-.-.-
    13. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    14. či            J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
    15. divočiny      N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    16. venkova       N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
    17. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    18. hledá         V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
    19. se            P.7.-.X.4.-.-.-.-.-.-.-.-.-.-
    20. partnerka     N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
    21. přiměřených   A.A.I.P.2.-.-.-.-.1.A.-.-.-.-
    22. rozměrů       N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
    23. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    24. tvarů         N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
    25. a             J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
    26. úrovně        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    27. .             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    28. <s>           <s>
    29. Slečny        N.N.F.P.1.-.-.-.-.-.A.-.-.-.-
    30. veselé        A.A.N.S.1.-.-.-.-.1.A.-.-.-.-
    31. povahy        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    32. preferovány   V.s.T.P.-.-.-.X.X.-.A.P.-.-.-
    33. ;             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    34. ona           P.P.F.S.1.-.-.3.-.-.-.-.-.-.-
    35. je            V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
    36. to            P.D.N.S.1.-.-.-.-.-.-.-.-.-.-
    37. nejspíš       D.g.-.-.-.-.-.-.-.3.A.-.-.-.-
    38. nutnost       N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
    39. :             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    40. -             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    41. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
]
Acc: 73.8 +- 2.1% (baseline 50.0%, 10 iterations)

You can print more details by editing http_server/basic_task.py after the # print explanation(s) comment.
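
To help read the Acc: line in the example output, the sketch below shows one way such a summary could be produced from per-cycle accuracies. It assumes that the reported number is the mean over the testing cycles and that the value after +- is a standard deviation; this page does not say how run.sh actually computes it. The function name and the accuracy values are invented for illustration.

```python
import statistics


def summarize_accuracy(cycle_accuracies, baseline=0.5):
    """Format the mean and spread of per-cycle accuracies (fractions in 0..1)."""
    mean = statistics.mean(cycle_accuracies)
    # Population standard deviation; a sample deviation (statistics.stdev)
    # would be an equally plausible choice. This is an assumption.
    spread = statistics.pstdev(cycle_accuracies)
    return "Acc: {:.1f} +- {:.1f}% (baseline {:.1f}%, {} iterations)".format(
        mean * 100, spread * 100, baseline * 100, len(cycle_accuracies)
    )


# Invented per-cycle accuracies, not results of the SIR tool.
print(summarize_accuracy([0.70, 0.80]))
```

Under this reading, a feature change is convincing only when the improvement in the mean is clearly larger than the reported spread.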

Task

Examine the files in the stylometry_features folder. Modify the assignment.py file to increase the accuracy of the methods (use vi or nano for editing). To improve the accuracy score, you can create additional classes in this file and change their names and class names. Don't forget to add new classes to the assignments list in this file.

Suggestions for your inspiration include:

  • diacritics usage (yes/no), a regular expression will be needed
  • sentence endings (number of sentences, or typical endings)
  • repetitions of words in sentences/in the text
  • usage of uppercase letters
  • length of sentences/text
  • POS tags (n-grams)
  • word n-grams
  • character n-grams

Each modification can be tested by running ./run.sh again. The first call of run.sh can be slower, because documents are morphologically analysed during the first run.

Write the resulting Acc: line into the top comment of the assignment.py file.
Submit your assignment.py file.

Last modified on Aug 31, 2021, 2:11:05 PM