Changes between Initial Version and Version 1 of en/NlpInPracticeCourse/2022/Stylometry


Timestamp: Sep 13, 2023, 2:46:07 PM
Author: Ales Horak
Comment: copied from private/NlpInPracticeCourse/Stylometry

= Stylometry =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/NlpInPracticeCourse|NLP in Practice Course]], Course Guarantor: Aleš Horák

Prepared by: Honza Rygl, Aleš Horák

== State of the Art ==

The analysis of an author's characteristic writing style and vocabulary
has been used to uncover traits of document authors, such as identity (authorship), age, or gender,
by both manual linguistic approaches and automatic algorithmic methods.

The most common approach to stylometry problems
is to combine stylistic analysis with machine learning techniques:
 1. specific style markers are extracted,
 2. a classification procedure is applied to the extracted markers.
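
The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular tool: the two markers (punctuation ratio and uppercase ratio), the nearest-centroid classifier, the function names, and the toy texts are all assumptions chosen for brevity.

```python
import math
from collections import defaultdict


def extract_markers(text):
    """Step 1: turn a text into a small vector of style markers."""
    n_chars = max(len(text), 1)
    return [
        sum(c in ".,;:!?" for c in text) / n_chars,  # punctuation ratio
        sum(c.isupper() for c in text) / n_chars,    # uppercase ratio
    ]


def train_centroids(labelled_texts):
    """Step 2a: average the marker vectors of each class."""
    grouped = defaultdict(list)
    for text, label in labelled_texts:
        grouped[label].append(extract_markers(text))
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(text, centroids):
    """Step 2b: predict the class whose centroid is nearest."""
    markers = extract_markers(text)
    return min(centroids, key=lambda label: math.dist(markers, centroids[label]))


# Made-up toy data, only to show the call sequence.
training = [
    ("HELLO!!! WOW, GREAT!!!", "shouting"),
    ("NO WAY!!! STOP, NOW!!!", "shouting"),
    ("i think this is a reasonable proposal", "calm"),
    ("we could perhaps meet tomorrow afternoon", "calm"),
]
centroids = train_centroids(training)
prediction = classify("WHAT?! REALLY, AGAIN!!!", centroids)
print(prediction)
```

In practice the marker vector is much richer and the classifier is a proper machine learning model; the sketch only shows how the extraction and classification stages fit together.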

=== References ===

 1. Bevendorff, J., et al. (2020). Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, Cham. [[https://pan.webis.de/downloads/publications/papers/bevendorff_2020.pdf|pdf]]
 1. Lemmens, J., Markov, I., & Daelemans, W. (2021). Improving hate speech type and target detection with hateful metaphor features. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (pp. 7-16). [[https://aclanthology.org/2021.nlp4if-1.2.pdf|pdf]]
 1. Kestemont, M. (2014). Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014 (pp. 59-66). [[http://aclweb.org/anthology/W14-0908|pdf]]
 1. Daelemans, W. (2013). Explanation in computational stylometry. In International conference on intelligent text processing and computational linguistics (pp. 451-462). Springer, Berlin, Heidelberg. [[https://www.clips.uantwerpen.be/sites/default/files/daelemans2013.pdf|pdf]]

== Practical Session ==

Students will work with the ''Style & Identity Recognition'' (SIR) tool and test it on prepared data.
The goal is to implement a small function that extracts style markers from a text.

1. Log in to the `asteria04.fi.muni.cz` server:
{{{
ssh asteria04.fi.muni.cz
}}}
2. Download the [[htdocs:bigdata/stylometry-assignment.zip|ZIP archive with the Python packages of the assignment]]:
{{{
wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/stylometry-assignment.zip
}}}
3. Unzip the downloaded file:
{{{
unzip stylometry-assignment.zip
}}}
4. Go to the unzipped folder:
{{{
cd sir-assignment
}}}
5. Test the prepared program, which analyses data from online dating services to distinguish gender (masculine/feminine) by text style features:
{{{
./run.sh
}}}

The default dataset is for the Czech language; this can be changed by editing `run.sh` (use e.g. `vi` or `nano` as the editor):
{{{
nano run.sh
}}}
For English, comment out the `cs` line and uncomment the line with `export DATA_LANGUAGE='en'`:
{{{
#export DATA_LANGUAGE='cs'
export DATA_LANGUAGE='en'
}}}

`run.sh` accepts two optional parameters:
{{{
./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]
}}}
Running `./run.sh` without parameters uses the default values: `10` cycles and no documents shown (equivalent to `./run.sh 10 0`). A longer test such as `./run.sh 100` may give better results, but not necessarily.
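
The number of cycles matters because the tool reports one summary line of the form `Acc: 73.8 +- 2.1% (baseline 50.0%, 10 iterations)`. The sketch below shows one plausible way to build such a line from per-cycle accuracies. It assumes that the value after `+-` is the sample standard deviation over the cycles; how SIR actually computes it is not stated here. The function name and the accuracy values are made up for illustration.

```python
from statistics import mean, stdev


def summarize_accuracy(accuracies, baseline):
    """Format per-cycle accuracies (fractions in [0, 1]) as one summary line."""
    avg = 100 * mean(accuracies)
    # Sample standard deviation (an assumption); it needs at least two cycles.
    spread = 100 * stdev(accuracies) if len(accuracies) > 1 else 0.0
    return (f"Acc: {avg:.1f} +- {spread:.1f}% "
            f"(baseline {100 * baseline:.1f}%, {len(accuracies)} iterations)")


# Made-up per-cycle accuracies, not real SIR output.
cycles = [0.70, 0.75, 0.72, 0.78, 0.74]
line = summarize_accuracy(cycles, baseline=0.5)
print(line)
```

Under this reading, more cycles do not raise the accuracy itself; they make the reported mean and spread a more stable estimate.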

Example with document output (second parameter greater than `0`):
{{{
[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
pos: 5
expected: on
predicted: ona
text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
      venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny
      veselé povahy preferovány; ona je to nejspíš nutnost :-)
morphology: [
     1. <s>           <s>
     2. Ahoj          N.N.I.S.1.-.-.-.-.-.A.-.-.-.-
     3. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     4. (             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     5. nejen         T.T.-.-.-.-.-.-.-.-.-.-.-.-.-
     6. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
     7. pro           R.R.-.-.4.-.-.-.-.-.-.-.-.-.-
     8. výlety        N.N.I.P.4.-.-.-.-.-.A.-.-.-.-
     9. do            R.R.-.-.2.-.-.-.-.-.-.-.-.-.-
    10. víru          N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
    11. podivnězimního  A.A.N.S.2.-.-.-.-.1.A.-.-.-.-
    12. velkoměsta    N.N.N.S.2.-.-.-.-.-.A.-.-.-.-
    13. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    14. či            J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
    15. divočiny      N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    16. venkova       N.N.I.S.2.-.-.-.-.-.A.-.-.-.-
    17. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    18. hledá         V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
    19. se            P.7.-.X.4.-.-.-.-.-.-.-.-.-.-
    20. partnerka     N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
    21. přiměřených   A.A.I.P.2.-.-.-.-.1.A.-.-.-.-
    22. rozměrů       N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
    23. ,             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    24. tvarů         N.N.I.P.2.-.-.-.-.-.A.-.-.-.-
    25. a             J.^.-.-.-.-.-.-.-.-.-.-.-.-.-
    26. úrovně        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    27. .             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    28. <s>           <s>
    29. Slečny        N.N.F.P.1.-.-.-.-.-.A.-.-.-.-
    30. veselé        A.A.N.S.1.-.-.-.-.1.A.-.-.-.-
    31. povahy        N.N.F.S.2.-.-.-.-.-.A.-.-.-.-
    32. preferovány   V.s.T.P.-.-.-.X.X.-.A.P.-.-.-
    33. ;             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    34. ona           P.P.F.S.1.-.-.3.-.-.-.-.-.-.-
    35. je            V.B.-.S.-.-.-.3.P.-.A.A.-.-.-
    36. to            P.D.N.S.1.-.-.-.-.-.-.-.-.-.-
    37. nejspíš       D.g.-.-.-.-.-.-.-.3.A.-.-.-.-
    38. nutnost       N.N.F.S.1.-.-.-.-.-.A.-.-.-.-
    39. :             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    40. -             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
    41. )             Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-
]
Acc: 73.8 +- 2.1% (baseline 50.0%, 10 iterations)
}}}
To print more details, edit `http_server/basic_task.py` after the `# print explanation(s)` comment.
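
The morphology listing above pairs each token with a dot-separated positional tag. The following sketch shows how such tags could be turned into part-of-speech n-grams. It assumes that the first position of the tag encodes the part of speech, as the listing suggests (`N` for nouns, `V` for verbs, `Z` for punctuation). The list-of-pairs input format and the function names are hypothetical; how SIR passes morphology to feature classes should be checked in the `stylometry_features` folder.

```python
from collections import Counter


def pos_of(tag):
    """Return the first position of a dot-separated positional tag."""
    return tag.split(".")[0]


def pos_ngrams(tagged_tokens, n=2):
    """Count n-grams over the part-of-speech letters of (word, tag) pairs."""
    # Sentence boundary markers (<s>) are skipped in this sketch.
    pos = [pos_of(tag) for _, tag in tagged_tokens if tag != "<s>"]
    return Counter(tuple(pos[i:i + n]) for i in range(len(pos) - n + 1))


# First tokens of the example listing, copied by hand.
tokens = [
    ("<s>", "<s>"),
    ("Ahoj", "N.N.I.S.1.-.-.-.-.-.A.-.-.-.-"),
    (",", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("(", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("nejen", "T.T.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    (")", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("pro", "R.R.-.-.4.-.-.-.-.-.-.-.-.-.-"),
    ("výlety", "N.N.I.P.4.-.-.-.-.-.A.-.-.-.-"),
]
bigrams = pos_ngrams(tokens)
print(bigrams.most_common(3))
```

Relative frequencies of such n-grams, rather than raw counts, are usually a better style marker because they do not depend on the length of the text.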

=== Task ===
Examine the files in the `stylometry_features` folder.
Modify the `assignment.py` file to increase the accuracy of the methods (use `vi` or `nano` for editing).
You can create additional classes in this file (changing their names and class names as needed) to improve the accuracy score. Don't forget to add new classes to the `assignments` list in this file.

Suggestions for inspiration:
 * diacritics usage (yes/no); a regular expression will be needed
 * sentence endings (number of sentences, or typical endings)
 * repetitions of words in sentences/in the text
 * usage of uppercase letters
 * length of sentences/text
 * POS tags (n-grams)
 * word n-grams
 * character n-grams
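
A few of the suggestions above can be sketched as plain Python functions. These are standalone illustrations, not the class interface expected by `assignment.py`; that interface should be taken from the files in `stylometry_features`. The function names, the set of Czech letters with diacritics, and the naive sentence splitting are assumptions.

```python
import re
from collections import Counter

# Czech letters with diacritics (assumed set; extend it for other languages).
DIACRITICS = re.compile(r"[áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ]")


def uses_diacritics(text):
    """Diacritics usage (yes/no)."""
    return DIACRITICS.search(text) is not None


def uppercase_ratio(text):
    """Share of uppercase letters among all letters."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def sentence_lengths(text):
    """Word counts of sentences, split naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


sample = "Slečny veselé povahy preferovány; ona je to nejspíš nutnost :-)"
print(uses_diacritics(sample), uppercase_ratio(sample))
print(sentence_lengths(sample))
print(char_ngrams(sample).most_common(3))
```

Each function returns a value that can be computed per document; whether it helps is an empirical question, so test every new marker with `./run.sh`.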

Each modification can be tested by running `./run.sh` again.
The first call of `run.sh` can be slower, because documents are morphologically analysed during the first run.

Write the resulting `Acc:` line into the top comment of the `assignment.py` file. [[br]]
Submit your `assignment.py` file.