= Towards the automatic detection of syntactic differences

**Author: [https://www.universiteitleiden.nl/en/staffmembers/martin-kroon Martin Kroon]**, PhD candidate, Leiden University, The Netherlands[[br]]

**Tuesday 12:00, November 12, 2019**[[br]]
**NLP lab, room B203**[[br]]


=== Abstract:

The field of comparative syntax aims at developing a theoretical model
of the syntactic properties all languages have in common and of the
range and limits of syntactic variation. Massive automatic comparison of
languages in parallel corpora will greatly speed up and enhance the
development of such a model. In this talk I will discuss previously
obtained results and briefly touch on ideas for future research.

First I will discuss a preprocessing tool that selects parallel sentence
pairs that are suitable for comparative syntactic research, filtering
out sentence pairs that are syntactically too different. Results were
obtained through experiments on Dutch, German and English, and suggest
that a graph edit distance on parse trees yields the best results.
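
To make the filtering idea concrete, here is a minimal sketch in Python. It is not the tool presented in the talk: instead of a full graph edit distance it uses a simpler stand-in, a top-down (Selkow-style) tree edit distance. The tree encoding, the names `size`, `top_down_distance` and `keep_pair`, and the threshold value are all assumptions made for illustration.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple of
# trees. Labels could be part-of-speech tags of a dependency parse.

def size(tree):
    """Number of nodes in the tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down tree edit distance: relabel a node (cost 1), or insert/delete
    a whole subtree (cost = its size). A stand-in for graph edit distance."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    relabel = 0 if label_a == label_b else 1
    rows, cols = len(kids_a) + 1, len(kids_b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j] + size(kids_a[i - 1]),          # delete subtree
                dp[i][j - 1] + size(kids_b[j - 1]),          # insert subtree
                dp[i - 1][j - 1]
                + top_down_distance(kids_a[i - 1], kids_b[j - 1]),  # align
            )
    return relabel + dp[-1][-1]

def keep_pair(src_tree, tgt_tree, max_normalised=0.5):
    """Keep a sentence pair if its size-normalised distance is small enough.
    The threshold 0.5 is an arbitrary, hypothetical choice."""
    dist = top_down_distance(src_tree, tgt_tree)
    return dist / max(size(src_tree), size(tgt_tree)) <= max_normalised

# Toy trees (hypothetical parses, not real corpus data).
with_det = ("VERB", (("PRON", ()), ("NOUN", (("DET", ()),))))
without_det = ("VERB", (("PRON", ()), ("NOUN", ())))
bare_noun = ("NOUN", ())
```

With these toy trees, `keep_pair(with_det, without_det)` keeps the pair, because the trees differ only by one determiner, while `keep_pair(with_det, bare_noun)` rejects it. In practice the threshold would have to be tuned on annotated data.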

I will furthermore discuss recent results in extracting syntactic
differences from parallel corpora. We build on Wiersma et al.'s (2011)
method and apply the Minimum Description Length principle to the task.
After mining for characteristic part-of-speech patterns by compressing
the data, we extract differences between languages in the distribution
of the patterns found. Results were obtained through experiments on
Dutch, English and Czech, and show useful and meaningful differences,
which can guide linguists in their comparative syntactic research.
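
The two steps named above, scoring patterns by how well they compress the data and then comparing how often they are used in each language, can be illustrated with a toy sketch. It is not the method from the talk: the two-part score is a deliberately crude approximation of MDL, and the function names (`cover`, `data_bits`, `mdl_gain`, `usage_rate`), the model cost and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def cover(sentence, pattern):
    """Greedily replace left-to-right occurrences of `pattern` (a tuple of
    part-of-speech tags) by a single symbol."""
    out, i, n = [], 0, len(pattern)
    while i < len(sentence):
        if tuple(sentence[i:i + n]) == pattern:
            out.append(pattern)
            i += n
        else:
            out.append(sentence[i])
            i += 1
    return out

def data_bits(corpus):
    """Shannon code length (in bits) of the corpus under its own symbol
    frequencies."""
    counts = Counter(sym for sent in corpus for sym in sent)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

def mdl_gain(corpus, pattern):
    """Bits saved by adding `pattern` to the model. The model cost is an
    assumed, simplistic one: each tag in the pattern costs log2(|tagset|)."""
    tagset = {tag for sent in corpus for tag in sent}
    model_bits = len(pattern) * math.log2(len(tagset))
    covered = [cover(sent, pattern) for sent in corpus]
    return data_bits(corpus) - (model_bits + data_bits(covered))

def usage_rate(corpus, pattern):
    """Average number of pattern occurrences per sentence."""
    hits = sum(cover(sent, pattern).count(pattern) for sent in corpus)
    return hits / len(corpus)

# Toy tag sequences (hypothetical, not real corpus data).
lang_a = [["DET", "NOUN", "VERB"]] * 4
lang_b = [["NOUN", "VERB"], ["DET", "NOUN", "VERB"]]
pattern = ("DET", "NOUN")

gain = mdl_gain(lang_a, pattern)  # positive: the pattern compresses lang_a
difference = usage_rate(lang_a, pattern) - usage_rate(lang_b, pattern)
```

A pattern with a positive gain is worth keeping in the model, and a large `difference` flags it as a candidate syntactic difference for a linguist to inspect. A real system would search over many candidate patterns and test whether the differences are statistically significant.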