Corpus Pattern Analysis

Corpus Pattern Analysis (CPA) is a procedure in corpus linguistics that associates word meaning with word use by means of analysis of phraseological patterns and collocations. The first product of CPA is PDEV (Pattern Dictionary of English Verbs). To browse completed verbs in PDEV, click on public access.

Work on CPA now continues in the DVC project (Disambiguation of Verbs by Collocation), hosted by RIILP at the University of Wolverhampton.

The current version of the Shallow Ontology of CPA Semantic Types can now be accessed here.

Corpus Pattern Analysis (CPA) is a new technique for mapping meaning onto words in text. It is currently being used to build a 'Pattern Dictionary of English Verbs', which will be a fundamental resource for use in computational linguistics, language teaching, and cognitive science. It is based on the Theory of Norms and Exploitations (TNE, see Hanks 2004 and forthcoming, Hanks and Pustejovsky 2005). TNE in turn is a theory that owes much to the work of Pustejovsky on the Generative Lexicon (see Pustejovsky 1995), to Wilks's theory of preference semantics (e.g. Wilks 1975), to Sinclair's work on corpus analysis and collocations (e.g. Sinclair 1966, 1987, 1991, 2004), to the Cobuild project in lexical computing (Sinclair et al. 1987), and to the Hector project (Atkins 1993; Hanks 1994). CPA is also influenced by frame semantics (Fillmore and Atkins 1992). It is complementary to FrameNet. Where FrameNet offers an in-depth analysis of semantic frames, CPA offers a systematic analysis of the patterns of meaning and use of each verb. Each CPA pattern can in principle be plugged into a FrameNet semantic frame.

The focus of the analysis is on the prototypical syntagmatic patterns with which words in use are associated. Patterns for verbs and patterns for nouns are different in kind. Noun patterns consist of a number of corpus-derived gnomic statements, into which the most significant collocates are grouped and incorporated. Verb patterns consist not only of the basic "argument structure" or "valency structure" of each verb (typically with semantic values stated for each of the elements), but also of subvalency features, where relevant, such as the presence or absence of a determiner in noun phrases constituting a direct object. For example, the meaning of take place is quite different from the meaning of take his place. The possessive determiner makes all the difference to the meaning.
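The role of a subvalency feature can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the CPA toolset: the function name, the labels it returns, and the small lists of verb forms and possessive determiners are all hypothetical. It shows that the presence or absence of a possessive determiner before "place" is enough to separate the two patterns mentioned above.

```python
# Hypothetical sketch: all names and word lists here are illustrative
# assumptions, not taken from CPA or PDEV.
TAKE_FORMS = {"take", "takes", "took", "taken", "taking"}
POSSESSIVE_DETERMINERS = {"my", "your", "his", "her", "its", "our", "their"}


def classify_take_place(tokens):
    """Label a tokenized clause by the determiner found between 'take' and 'place'.

    Returns "no_determiner" for clauses like "took place",
    "possessive_determiner" for clauses like "took his place",
    and None when neither configuration is found.
    """
    lowered = [t.lower() for t in tokens]
    if "place" not in lowered:
        return None
    place_idx = lowered.index("place")
    # Find the first form of 'take' that precedes 'place'.
    take_idx = next(
        (i for i in range(place_idx) if lowered[i] in TAKE_FORMS), None
    )
    if take_idx is None:
        return None
    between = lowered[take_idx + 1:place_idx]
    if not between:
        return "no_determiner"
    if len(between) == 1 and between[0] in POSSESSIVE_DETERMINERS:
        return "possessive_determiner"
    return None


if __name__ == "__main__":
    print(classify_take_place("The meeting took place yesterday".split()))
    print(classify_take_place("He took his place at the table".split()))
```

A real analysis would work from parsed corpus data rather than token lists; the sketch only makes the point that the determiner slot is itself part of the pattern.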

No attempt is made in CPA to identify the meaning of a verb or noun directly, as a word in isolation. Instead, meanings are associated with prototypical sentence contexts. Concordance lines are grouped into semantically motivated syntagmatic patterns. Associating a "meaning" with each pattern is a secondary step, carried out in close coordination with the assignment of concordance lines to patterns. The identification of a syntagmatic pattern is not an automatic procedure: it calls for a great deal of lexicographic art. Among the most difficult of all lexicographic decisions is the selection of an appropriate level of generalization on the basis of which senses are to be distinguished. For example, one might say that the intransitive verb abate has only one sense ("become less in intensity"), or one might separate storm abate from political protest abate, on the grounds that the two contexts have different implicatures. That is a simple example, but in more complex cases (e.g. the verb bear) patterns are indispensable for effective disambiguation. Bearing a heavy burden is a pattern that normally has an abstract interpretation in English (as opposed to, say, carrying a heavy load), and the meaning is associated with the prototypical phrase, which is quite different in turn from I can't bear it.

In CPA, the "meaning" of a pattern is expressed as a set of basic implicatures. For example, one pattern for the verb file is: [[Human = Plaintiff]] file [[Procedure = Lawsuit]], of which the implicature may be expressed as "If you file a lawsuit, you are acting as the plaintiff and you activate a procedure by which you hope to obtain redress for some wrong that you believe has been done to you". Depending on the proposed application, the implicature of a pattern may be expressed in any of a wide variety of other ways, e.g. as a translation into another language or as a synonym set such as "file = activate, start, begin, lodge". Each argument of each pattern is linked to a node in a shallow semantic ontology (Pustejovsky et al. 2004).
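One way to picture such a pattern as data is the minimal Python sketch below. It is not the PDEV storage format: the class names, field names, and the render method are hypothetical. It also assumes that in a notation like [[Human = Plaintiff]] the left-hand side names the ontology node and the right-hand side names a contextual role; the text above does not spell this out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Argument:
    """One argument slot of a pattern (hypothetical representation)."""
    semantic_type: str           # assumed to name a node in the shallow ontology
    role: Optional[str] = None   # assumed contextual role, e.g. "Plaintiff"

    def render(self) -> str:
        if self.role:
            return f"[[{self.semantic_type} = {self.role}]]"
        return f"[[{self.semantic_type}]]"


@dataclass(frozen=True)
class Pattern:
    """A verb pattern paired with its implicature (hypothetical representation)."""
    verb: str
    subject: Argument
    direct_object: Optional[Argument]
    implicature: str
    synonyms: Tuple[str, ...] = ()  # alternative expression of the implicature

    def render(self) -> str:
        parts = [self.subject.render(), self.verb]
        if self.direct_object is not None:
            parts.append(self.direct_object.render())
        return " ".join(parts)


# The 'file' example from the text, encoded with the assumed structure.
FILE_LAWSUIT = Pattern(
    verb="file",
    subject=Argument("Human", "Plaintiff"),
    direct_object=Argument("Procedure", "Lawsuit"),
    implicature=(
        "If you file a lawsuit, you are acting as the plaintiff and you "
        "activate a procedure by which you hope to obtain redress for some "
        "wrong that you believe has been done to you"
    ),
    synonyms=("activate", "start", "begin", "lodge"),
)

if __name__ == "__main__":
    print(FILE_LAWSUIT.render())
```

Keeping the implicature as free text and the synonym set as a separate field reflects the point that the same pattern can carry different expressions of its meaning depending on the application.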



CPA is a collaborative research project at Masaryk University in Brno and Brandeis University, Waltham, Massachusetts.



The CPA software tools were developed within projects LC536 and 2C06009. The owner of these tools is the NLP Centre, Faculty of Informatics, Masaryk University. The licence is available here.

Pavel Rychly