Changes between Initial Version and Version 1 of en/AdvancedNlpCourse2018/Stylometry


Timestamp:
Sep 12, 2019, 11:11:11 AM (5 years ago)
Author:
Ales Horak
Comment:

copied from private/AdvancedNlpCourse/Stylometry

  • en/AdvancedNlpCourse2018/Stylometry

    v1 v1  
     1= Stylometry =
     2
     3[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantor: Aleš Horák
     4
     5Prepared by: Honza Rygl
     6
     7== State of the Art ==
     8
     9The analysis of an author's characteristic
     10writing style and vocabulary has been used to uncover traits of the authors of documents, such as identity (authorship), age, or gender,
     11by both manual linguistic approaches and automatic algorithmic methods.
     12
     13The most common approach to stylometry problems
     14is to combine stylistic analysis with machine learning techniques:
     15 1. specific style markers are extracted,
     16 2. a classification procedure is applied to the extracted markers.
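
The two steps above can be sketched in a few lines of Python. This is only an illustration, not the method used by the course tool: the three markers, the nearest-centroid classifier, the function names, and the toy texts and labels are all assumptions made up for this example.

```python
import math
import re


def extract_markers(text):
    """Step 1: turn a text into a small vector of style markers."""
    words = re.findall(r"\w+", text)
    letters = [ch for ch in text if ch.isalpha()]
    return [
        # mean word length
        sum(len(w) for w in words) / max(len(words), 1),
        # share of uppercase letters among all letters
        sum(ch.isupper() for ch in letters) / max(len(letters), 1),
        # share of punctuation characters in the whole text
        sum(ch in ".,;:!?" for ch in text) / max(len(text), 1),
    ]


def train_centroids(labelled_texts):
    """Step 2a: average the marker vectors of each class."""
    sums, counts = {}, {}
    for text, label in labelled_texts:
        vec = extract_markers(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def classify(text, centroids):
    """Step 2b: pick the class whose centroid is closest (Euclidean)."""
    vec = extract_markers(text)
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))


# Made-up toy data with hypothetical labels, for illustration only.
TOY_DATA = [
    ("HELLO!!! CALL ME NOW!!!", "author_a"),
    ("WOW, SO GREAT!!!", "author_a"),
    ("hello, i would appreciate a quiet conversation", "author_b"),
    ("looking for somebody who enjoys long walks", "author_b"),
]

centroids = train_centroids(TOY_DATA)
print(classify("WHAT A DAY!!!", centroids))
```

In practice the markers would be scaled to comparable ranges and a stronger classifier would be trained on far more data; the sketch only shows how extraction and classification fit together.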
     17
     18
     19=== References ===
     20
     21 1. Stamatatos, E. (2009), A Survey of Modern Authorship Attribution Methods, Journal of the American Society for Information Science and Technology, 60(3), 538–556. [[http://www.clips.ua.ac.be/~walter/educational/material/Stamatatos_survey2009.pdf | pdf]]
     22 2. Kestemont, M. (2014), Function Words in Authorship Attribution. From Black Magic to Theory? Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014, 59–66. [[http://aclweb.org/anthology/W14-0908 | pdf]]
     23 3. Daelemans, W., Explanation in Computational Stylometry.
     24
     25== Practical Session ==
     26
     27Students will get to know the *Style & Identity Recognition* tool and test it on prepared data.
     28Their goal is to implement a small function that extracts style markers from a text.
     29
     301. Log in to the `asteria04.fi.muni.cz` server:
     31{{{
     32ssh asteria04.fi.muni.cz
     33}}}
     342. Download the [[htdocs:bigdata/stylometry-assignment.zip|Python package with the assignment]]:
     35{{{
     36wget https://nlp.fi.muni.cz/trac/research/chrome/site/bigdata/stylometry-assignment.zip
     37}}}
     383. Unzip the downloaded file
     39{{{
     40unzip stylometry-assignment.zip
     41}}}
     424. Go to the unzipped folder:
     43{{{
     44cd sir-assignment
     45}}}
     465. Run the prepared program, which analyses texts from online dating services and predicts the author's gender (masculine/feminine) from text style features:
     47{{{
     48./run.sh
     49}}}
     50
     51`run.sh` can have two optional parameters:
     52{{{
     53./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]
     54}}}
     55The default values, i.e. running `./run.sh` without parameters, are `100` cycles and `no documents` (`./run.sh 100 0`). For faster feature testing even `./run.sh 10` should be sufficient.
     56
     57Example with document output:
     58{{{
     59[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
     60author: on
     61text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
     62venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny veselé
     63povahy preferovány; ona je to nejspíš nutnost :-)
     64morphology: ((u'<s>', u'<s>'), (u'Ahoj', u'N.N.I.S.1.-.-.-.-.-.A.-.-.-.-'),
     65(u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'(',
     66u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'nejen', u'T.T.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     67(u')', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'pro',
     68u'R.R.-.-.4.-.-.-.-.-.-.-.-.-.-'), (u'v\xfdlety',
     69u'N.N.I.P.4.-.-.-.-.-.A.-.-.-.-'), (u'do', u'R.R.-.-.2.-.-.-.-.-.-.-.-.-.-'),
     70(u'v\xedru', u'N.N.I.S.2.-.-.-.-.-.A.-.-.-.-'), (u'podivn\u011bzimn\xedho',
     71u'A.A.N.S.2.-.-.-.-.1.A.-.-.-.-'), (u'velkom\u011bsta',
     72u'N.N.N.S.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     73(u'\u010di', u'J.^.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'divo\u010diny',
     74u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'venkova',
     75u'N.N.I.S.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     76(u'hled\xe1', u'V.B.-.S.-.-.-.3.P.-.A.A.-.-.-'), (u'se',
     77u'P.7.-.X.4.-.-.-.-.-.-.-.-.-.-'), (u'partnerka',
     78u'N.N.F.S.1.-.-.-.-.-.A.-.-.-.-'), (u'p\u0159im\u011b\u0159en\xfdch',
     79u'A.A.I.P.2.-.-.-.-.1.A.-.-.-.-'), (u'rozm\u011br\u016f',
     80u'N.N.I.P.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     81(u'tvar\u016f', u'N.N.I.P.2.-.-.-.-.-.A.-.-.-.-'), (u'a',
     82u'J.^.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'\xfarovn\u011b',
     83u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'.', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     84(u'<s>', u'<s>'), (u'Sle\u010dny', u'N.N.F.P.1.-.-.-.-.-.A.-.-.-.-'),
     85(u'vesel\xe9', u'A.A.N.S.1.-.-.-.-.1.A.-.-.-.-'), (u'povahy',
     86u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'preferov\xe1ny',
     87u'V.s.T.P.-.-.-.X.X.-.A.P.-.-.-'), (u';', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     88(u'ona', u'P.P.F.S.1.-.-.3.-.-.-.-.-.-.-'), (u'je',
     89u'V.B.-.S.-.-.-.3.P.-.A.A.-.-.-'), (u'to', u'P.D.N.S.1.-.-.-.-.-.-.-.-.-.-'),
     90(u'nejsp\xed\u0161', u'D.g.-.-.-.-.-.-.-.3.A.-.-.-.-'), (u'nutnost',
     91u'N.N.F.S.1.-.-.-.-.-.A.-.-.-.-'), (u':', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
     92(u'-', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u')',
     93u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'))
     94Acc: 73.7 +- 2.2% (baseline 50.0%, 10 iterations)
     95}}}
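
The `morphology` field in the output above is a sequence of `(word, tag)` pairs, which makes POS n-grams easy to count. The following is a minimal sketch under two assumptions: that the first dot-separated position of each tag is the coarse part of speech (as the sample suggests, e.g. `N`, `V`, `Z`), and that the hypothetical helper `pos_ngrams` receives the pairs directly. It is written for Python 3; the `u'...'` strings in the sample suggest the tool itself may run under Python 2, so some adaptation may be needed.

```python
from collections import Counter


def pos_ngrams(morphology, n=2):
    """Count n-grams over the first position of each positional tag."""
    # '<s>' has no dots, so it is kept as-is and marks a sentence start.
    coarse = [tag.split(".")[0] for _word, tag in morphology]
    # zip over n shifted copies of the list yields consecutive n-tuples.
    return Counter(zip(*(coarse[i:] for i in range(n))))


# First few pairs copied from the sample output above.
sample = (
    ("<s>", "<s>"),
    ("Ahoj", "N.N.I.S.1.-.-.-.-.-.A.-.-.-.-"),
    (",", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("(", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("nejen", "T.T.-.-.-.-.-.-.-.-.-.-.-.-.-"),
)

for gram, count in pos_ngrams(sample).items():
    print(gram, count)
```

Relative frequencies (counts divided by the total number of n-grams) are usually a better feature than raw counts, because they do not depend on text length.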
     96
     97=== Task ===
     98Examine the files in the `stylometry_features` folder.
     99Modify the `assignment.py` file to increase the accuracy of the methods (use `vi` or `nano` for editing).
     100You can add further classes to this file and name them as you like; the goal is to improve the accuracy score. Don't forget to add new classes to the `assignments` list in this file.
     101
     102Suggestions for your inspiration include:
     103 * diacritics usage (yes/no), a regular expression will be needed
     104 * sentence endings (number of sentences, or typical endings)
     105 * repetitions of words in sentences/in the text
     106 * usage of uppercase letters
     107 * length of sentences/text
     108 * POS tags (n-grams)
     109 * word n-grams
     110 * character n-grams
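
A few of the suggestions above can be sketched as plain Python functions. These are illustrations only: the function names are made up, the class interface expected by `assignment.py` is not reproduced here, and the diacritics pattern assumes Czech texts. The code targets Python 3 and may need adapting if the tool runs under Python 2.

```python
import re
from collections import Counter

# Czech letters with diacritics, both cases (assumption: the texts are Czech).
DIACRITICS = re.compile(r"[áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ]")


def uses_diacritics(text):
    """Diacritics usage (yes/no)."""
    return DIACRITICS.search(text) is not None


def uppercase_ratio(text):
    """Share of uppercase letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def sentence_lengths(text):
    """Word count of each sentence, splitting naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


text = "Ahoj. Slečny veselé povahy preferovány!"
print(uses_diacritics(text))
print(uppercase_ratio(text))
print(char_ngrams("ahoj", n=3))
print(sentence_lengths(text))
```

The sentence splitter is deliberately naive: emoticons such as `:-)` or abbreviations will confuse it, which may or may not matter for a style feature.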
     111
     112Each modification can be tested by running `./run.sh` again.
     113The first call of `run.sh` can be slower, because documents are morphologically analysed during the first run.
     114
     115Submit your assignment file. Write clean Python code and follow PEP 8 (https://www.python.org/dev/peps/pep-0008/).