= Stylometry =

[[https://is.muni.cz/auth/predmet/fi/ia161|IA161]] [[en/AdvancedNlpCourse|Advanced NLP Course]], Course Guarantee: Aleš Horák

Prepared by: Honza Rygl

== State of the Art ==

Text analysis of an author's characteristic
writing style and vocabulary has been used to uncover the authorship of documents and author traits such as age or gender,
both by manual linguistic approaches and by automatic algorithmic methods.

The most common approach to stylometry problems
is to combine stylistic analysis with machine learning techniques:
 1. specific style markers are extracted,
 2. a classification procedure is applied to the extracted markers.
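
The two steps above can be sketched in a few lines of Python 3. This is only an illustration, not the method used by any particular tool: the marker set (relative frequencies of a handful of English function words plus average word length), the nearest-centroid classifier, and all names (`FUNCTION_WORDS`, `extract_markers`, `train_centroids`, `classify`) are assumptions made for this example.

```python
from collections import Counter

# Illustrative marker set only; real systems use far richer features.
FUNCTION_WORDS = ("a", "the", "of", "and", "i", "you")


def extract_markers(text):
    """Step 1: turn a text into a fixed-length vector of style markers."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)  # avoid division by zero for empty texts
    # Relative frequency of each function word, then average word length.
    vector = [counts[w] / total for w in FUNCTION_WORDS]
    vector.append(sum(len(w) for w in words) / total)
    return vector


def train_centroids(labelled_texts):
    """Step 2a: average the marker vectors of each class."""
    sums, sizes = {}, Counter()
    for text, label in labelled_texts:
        vec = extract_markers(text)
        previous = sums.get(label, [0.0] * len(vec))
        sums[label] = [s + v for s, v in zip(previous, vec)]
        sizes[label] += 1
    return {label: [s / sizes[label] for s in vec]
            for label, vec in sums.items()}


def classify(text, centroids):
    """Step 2b: choose the class whose centroid is closest to the text."""
    vec = extract_markers(text)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))

    return min(centroids, key=squared_distance)


training = [
    ("the cat and the dog", "A"),
    ("the end of the road", "A"),
    ("i think you know i do", "B"),
    ("you and i", "B"),
]
centroids = train_centroids(training)
print(classify("the king of the hill", centroids))
```

In practice the classifier would usually come from a machine learning library, and the markers would be scaled so that no single one dominates the distance.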


=== References ===

 1. Stamatatos, E. (2009), A Survey of Modern Authorship Attribution Methods, Journal of the American Society for Information Science and Technology, 60(3), 538–556. [[http://www.clips.ua.ac.be/~walter/educational/material/Stamatatos_survey2009.pdf | pdf]]
 2. Kestemont, M. (2014), Function Words in Authorship Attribution: From Black Magic to Theory? Proceedings of the 3rd Workshop on Computational Linguistics for Literature, EACL 2014, 59–66. [[http://aclweb.org/anthology/W14-0908 | pdf]]
 3. Daelemans, W. (2013), Explanation in Computational Stylometry.

== Practical Session ==

Students will get to know the *Style & Identity Recognition* tool and test it on prepared data.
Their goal is to implement a small function that extracts style markers from a text.

1. Log in to the `asteria04.fi.muni.cz` server:
{{{
ssh asteria04.fi.muni.cz
}}}
2. Download the [[htdocs:bigdata/stylometry-assignment.zip|Python package with the assignment]].
3. Unzip the downloaded file
{{{
unzip stylometry-assignment.zip
}}}
4. Go to the unzipped folder:
{{{
cd sir-assignment
}}}
5. Run the prepared program, which analyses texts from online dating services and predicts the author's gender (masculine/feminine) from text style features:
{{{
./run.sh
}}}

`run.sh` accepts two optional parameters:
{{{
./run.sh  [number_of_testing_cycles]  [show_first_N_erroneously_predicted_documents]
}}}
Running `./run.sh` without parameters uses the default values: `100` cycles and no documents shown (equivalent to `./run.sh 100 0`). For faster feature testing, `./run.sh 10` should be sufficient.

Example with document output:
{{{
[xrygl@asteria04:~/temp/sir-assignment]$ ./run.sh 10 1
author: on
text: Ahoj, (nejen) pro výlety do víru podivnězimního velkoměsta, či divočiny
venkova, hledá se partnerka přiměřených rozměrů, tvarů a úrovně. Slečny veselé
povahy preferovány; ona je to nejspíš nutnost :-)
morphology: ((u'<s>', u'<s>'), (u'Ahoj', u'N.N.I.S.1.-.-.-.-.-.A.-.-.-.-'),
(u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'(',
u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'nejen', u'T.T.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u')', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'pro',
u'R.R.-.-.4.-.-.-.-.-.-.-.-.-.-'), (u'v\xfdlety',
u'N.N.I.P.4.-.-.-.-.-.A.-.-.-.-'), (u'do', u'R.R.-.-.2.-.-.-.-.-.-.-.-.-.-'),
(u'v\xedru', u'N.N.I.S.2.-.-.-.-.-.A.-.-.-.-'), (u'podivn\u011bzimn\xedho',
u'A.A.N.S.2.-.-.-.-.1.A.-.-.-.-'), (u'velkom\u011bsta',
u'N.N.N.S.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'\u010di', u'J.^.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'divo\u010diny',
u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'venkova',
u'N.N.I.S.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'hled\xe1', u'V.B.-.S.-.-.-.3.P.-.A.A.-.-.-'), (u'se',
u'P.7.-.X.4.-.-.-.-.-.-.-.-.-.-'), (u'partnerka',
u'N.N.F.S.1.-.-.-.-.-.A.-.-.-.-'), (u'p\u0159im\u011b\u0159en\xfdch',
u'A.A.I.P.2.-.-.-.-.1.A.-.-.-.-'), (u'rozm\u011br\u016f',
u'N.N.I.P.2.-.-.-.-.-.A.-.-.-.-'), (u',', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'tvar\u016f', u'N.N.I.P.2.-.-.-.-.-.A.-.-.-.-'), (u'a',
u'J.^.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u'\xfarovn\u011b',
u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'.', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'<s>', u'<s>'), (u'Sle\u010dny', u'N.N.F.P.1.-.-.-.-.-.A.-.-.-.-'),
(u'vesel\xe9', u'A.A.N.S.1.-.-.-.-.1.A.-.-.-.-'), (u'povahy',
u'N.N.F.S.2.-.-.-.-.-.A.-.-.-.-'), (u'preferov\xe1ny',
u'V.s.T.P.-.-.-.X.X.-.A.P.-.-.-'), (u';', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'ona', u'P.P.F.S.1.-.-.3.-.-.-.-.-.-.-'), (u'je',
u'V.B.-.S.-.-.-.3.P.-.A.A.-.-.-'), (u'to', u'P.D.N.S.1.-.-.-.-.-.-.-.-.-.-'),
(u'nejsp\xed\u0161', u'D.g.-.-.-.-.-.-.-.3.A.-.-.-.-'), (u'nutnost',
u'N.N.F.S.1.-.-.-.-.-.A.-.-.-.-'), (u':', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'),
(u'-', u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'), (u')',
u'Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-'))
Acc: 73.7 +- 2.2% (baseline 50.0%, 10 iterations)
}}}
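
The `morphology` field in the output above is a sequence of (word, tag) pairs. A minimal Python 3 sketch of how POS n-grams could be counted from such a sequence follows. It assumes that the first dot-separated field of each tag is the part of speech, as the sample suggests (`N` for nouns, `V` for verbs, `Z` for punctuation); the function name `pos_ngrams` and the shortened sample are made up for this illustration and are not part of the assignment package.

```python
from collections import Counter


def pos_ngrams(morphology, n=2):
    """Count n-grams over coarse part-of-speech symbols.

    `morphology` is a sequence of (word, tag) pairs as in the sample output.
    Assumption: the first dot-separated field of a tag is the part of speech.
    """
    # Sentence markers such as ('<s>', '<s>') stay as their own symbol,
    # so the n-grams can also capture how sentences begin.
    symbols = [tag.split(".")[0] for _word, tag in morphology]
    # Zip n shifted copies of the symbol list to get consecutive n-tuples.
    return Counter(zip(*(symbols[i:] for i in range(n))))


# Shortened, hand-copied excerpt of the sample output above.
sample = (
    ("<s>", "<s>"),
    ("Ahoj", "N.N.I.S.1.-.-.-.-.-.A.-.-.-.-"),
    (",", "Z.:.-.-.-.-.-.-.-.-.-.-.-.-.-"),
    ("hledá", "V.B.-.S.-.-.-.3.P.-.A.A.-.-.-"),
    ("se", "P.7.-.X.4.-.-.-.-.-.-.-.-.-.-"),
    ("partnerka", "N.N.F.S.1.-.-.-.-.-.A.-.-.-.-"),
)
bigrams = pos_ngrams(sample, n=2)
unigrams = pos_ngrams(sample, n=1)
```

Dividing each count by the total number of n-grams gives relative frequencies, which are easier to compare across documents of different lengths.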

=== Task ===
Examine the files in the `stylometry_features` folder.
Modify the `assignment.py` file to increase the accuracy of the methods.
You can create additional classes in this file and rename the existing ones; the aim is to improve the accuracy score. Don't forget to add new classes to the `assignments` list in this file.

Suggestions for your inspiration include:
 * diacritics usage (yes/no), a regular expression will be needed
 * sentence endings (number of sentences, or typical endings)
 * repetitions of words in sentences/in the text
 * usage of uppercase letters
 * length of sentences/text
 * POS tags (n-grams)
 * word n-grams
 * character n-grams

Each modification can be tested by running `./run.sh` again.
The first call of `run.sh` can be slower, because documents are morphologically analysed during the first run.

Submit your assignment file. Write clean Python code and follow PEP 8 (https://www.python.org/dev/peps/pep-0008/).