Third International Workshop on DICTIONARY WRITING SYSTEMS (DWS 2004)

Brno, Czech Republic, 6-7 September 2004



Pavel Rychlý
Bonito: Corpus Management for Dictionary Writing
Judith Rosenhouse
A Trilingual Dictionary: Trilingual Problems?
Silvie Cinková
Swedish-Czech Combinatorial Valency Lexicon of Support Nouns
Kristin Bakken & Oddrun Grønvik
Norsk Ordbok 2014 (The Norwegian Dictionary 2014)
Oddrun Grønvik, Christian-Emil Ore, Daniel Ridings, Lars Jørgen Tvedt
The dictionary writing system for The Norwegian Dictionary
Marie Bilde Rasmussen
Revising dictionaries - 14 years of SGML editing experience
Pornpimon Palingoon, Pornchan Chantanapraiwan, Sapa Chanyachatchawan
Users' Expectation on Web-based Thai<->English Dictionary
Gintaras Barisevicius & Elvinas Cernys
English-Lithuanian and Lithuanian-English Lexicon Database Management System for MT
Mutsuko Tomokiyo
Description of pragmatic properties of lexis in monolingual dictionaries for the Papillon database
Jens Erlandsen
iLex - new DWS
Maddalena Toscano and Giuseppe Marzatico and Salvatore La Gala and Massimiliano Sorrentino
Building a corpus based Kiswahili-Italian on-line lexical data base
Thatsanee Charoenporn, Canasai Kruengkrai, Virach Sornlertlamvanich, Thanaruk Theeramunkong, and Hitoshi Isahara
Corpus-based Dictionary Development System
David Joffe & Gilles-Maurice de Schryver
TshwaneLex - Professional off-the-shelf lexicography software
Mathieu Poumeyrol
Dictionary Writing Searches
Dave Moskovitz
Mātāpuna Dictionary Database System

Pavel Rychlý: Bonito: Corpus Management for Dictionary Writing

Bonito is a graphical user interface (GUI) for the Manatee corpus management system. It lets users formulate queries and run them against various corpora. The results are displayed clearly, can be modified in various ways, and can serve as the basis for computing statistics. Manatee is independent of language, encoding and tag set, and can handle huge corpora with extensive annotation.

Sketch Engine is a new dictionary writing system based on Manatee (including Bonito). It creates word sketches and a thesaurus from a corpus. A word sketch is a one-page, automatically generated summary of a word's grammatical and collocational behavior.
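The idea of a word sketch can be illustrated with a minimal Python sketch. It assumes that (relation, headword, collocate) triples have already been extracted from a corpus; the function name, the relation labels and the toy data are hypothetical, and raw frequency stands in for whatever salience statistic the real system applies.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword, top_n=3):
    """Group the collocates of `headword` by grammatical relation.

    `triples` is an iterable of (relation, headword, collocate) tuples,
    assumed to have been extracted from a parsed or tagged corpus.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Keep only the most frequent collocates for each relation.
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}

# Hypothetical toy data, not taken from any real corpus.
TRIPLES = [
    ("object_of", "dictionary", "compile"),
    ("object_of", "dictionary", "compile"),
    ("object_of", "dictionary", "revise"),
    ("modifier", "dictionary", "bilingual"),
    ("modifier", "dictionary", "bilingual"),
    ("modifier", "dictionary", "online"),
    ("object_of", "corpus", "annotate"),
]

sketch = word_sketch(TRIPLES, "dictionary")
```

Each key of `sketch` is one grammatical relation and each value is a ranked list of collocates, which is roughly the shape of the one-page summary described above.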

In the demo, basic Bonito and Sketch Engine features will be presented together with examples of corpus configuration and definition of word sketch grammatical relations.


Judith Rosenhouse: A Trilingual Dictionary: Trilingual Problems?

The first trilingual Hebrew - Literary Arabic - Colloquial Arabic dictionary appeared in 2001. The dictionary under discussion is a practical modern dictionary of over 20,000 items. The paper discusses several problematic points from the human compiler's perspective. In this case the lexicographer had to decide which items to include in the dictionary (linguistic considerations), in which order to present them (linguistic and editorial decisions), and how to deal with the orthographic problems (linguistic, editorial and computational considerations). The paper explains how these problems were solved and gives examples of each. Conclusions are drawn concerning language-specific and general linguistic dictionaries.


Silvie Cinková: Swedish-Czech Combinatorial Valency Lexicon of Support Nouns

Support Verb Constructions (SVCs) are combinations of an abstract noun and a lexical verb. From the semantic point of view, the noun seems to be part of a complex predicate rather than the object (or subject) of the verb, despite what the surface syntax suggests. An SVC is usually semantically transparent. Its meaning is concentrated in the noun phrase, whereas the semantic content of the verb is reduced or generalized. If we look upon SVCs as collocations, the noun is apparently the base, while the verb is the collocate. The matching verb is generally unpredictable, though a metaphorical motivation can often be traced. Even from a cross-linguistic perspective it is the noun that forms the common denominator for equivalent support verb constructions, whereas the support verbs do not necessarily match. Hence, SVCs hardly affect foreign language reception, but they cause problems in foreign language production.

To make this field more accessible to Czech students of Swedish, Swedish SVCs are being extracted, provided with Czech translation equivalents and organized into an XML-structured lexicon. They are lemmatized by their noun components. Different readings of the given nouns are separated. A "reading" is defined by a valency frame within the Praguian FGD framework (Functional Generative Description). A deep-syntactic valency frame is stated for each reading of the given noun. When describing the valency frame of a noun, the noun is not considered as part of an SVC.

For each noun frame, the relevant SVCs are sorted by the basic Lexical Functions (LFs) Oper, Func, Labor and Copul. In order to describe their semantics in more detail, complementary LFs are employed, such as the phasal LFs (Incep, Dur and Fin) and the causative LFs (Caus, Liqu and Perm), as well as the LF Anti to state the negation patterns. The "value" field of each LF includes information on definiteness, number and adjectival modification restrictions in the noun phrase. It is also stated when the noun takes an obligatory prepositional complement in an SVC. In addition, the SVCs are (whenever possible) marked as "telic" or "atelic". This last feature should, in combination with the information on noun number and definiteness, make it easier for Czech students to express the event structure in Swedish. Unlike Swedish, Czech always indicates the event structure linguistically, as it employs aspect as a grammatical category in verbs. Moreover, the category of aspect is anchored in the very morphology of most verbs. Some other event features, e.g. iterativity, also find expression in morphology. Thus the event structure in Germanic languages appears rather puzzling to Czech speakers, who miss a general way to express aspect. SVCs are often referred to as one means of marking event structure, since the nouns in SVCs typically denote events and states. A kind of event structure opposition is assumed between a synthetic predicate and the corresponding SVC. SVCs obviously can emphasize inchoativity, durativity and terminativity. However, none of this corresponds directly to the Slavic category of aspect, which is apparently the product of several event structure features in combination, one of which is telicity.

On the one hand, a vast number of apparently lexicalized SVCs can be extracted from monolingual lexicons and corpora. On the other hand, the SVCs form regular patterns that enable the production of well-formed ad hoc SVCs that often show the same morphosyntactic behavior as the lexicalized ones. This lexicon does not cover the latter aspect, as it is merely meant to serve as a starting point for a combinatorial valency lexicon of verbs that typically act as support verbs. The productive patterns will be stated in the lexicon of verbs.


Kristin Bakken & Oddrun Grønvik: Norsk Ordbok 2014 (The Norwegian Dictionary 2014)

The presentation will give a short outline of the history and scope of Norsk Ordbok (“The Norwegian Dictionary”), which will be a 12-volume dictionary covering both one of Norway’s written standards and the Norwegian dialects. The dictionary was granted substantial fresh funding from the government in 2002, allowing us to expand the editorial staff and plan for completion in 2014. We will comment on the need to “translate” projects like ours into a political reality in order to secure our financial basis. The bulk of the presentation will be concerned with the ideals and challenges connected to our ambitious transition to a digital editing platform. We originally had four goals for the work with new digital tools: they should make dictionary writing more time-efficient, they should ease the process of training many new editors in a short time, they should give us a simpler and faster production phase, and finally they should improve the quality of our dictionary. In order to achieve these goals we have cooperated closely for the past two years with the Unit for digital documentation at the University of Oslo, and our common results will be presented in a separate paper at the workshop (cf. abstract by Ridings & Ore). The route we have followed to reach these results will be commented on in this paper. The interdisciplinary dialogue and the experience we have gained from it have been both rewarding and challenging, and we will share some of our experiences with the process.


Oddrun Grønvik, Christian-Emil Ore, Daniel Ridings, Lars Jørgen Tvedt: The dictionary writing system for The Norwegian Dictionary

The dictionary writing system used in NO2014 consists of three major components: a metadictionary, an editing system and a corpus system. These are all tightly integrated and are being used in production. They will be presented in this section from a practical viewpoint with constant reference to the working processes of NO2014. The top level, the metadictionary, is the hub of the system. It consists of all data that is relevant for making a dictionary article: lemmata with associated attributes, transcribed electronic slips and corpus evidence are just a few of the pieces of information it contains. It ties together all the individual data collections into one system in such a way that each independent collection can access other collections through the metadictionary. The basic outline of dictionary articles is generated from this collection and refined by the editing application.

The second layer consists of an editing application. The lexicographers use it to present an analysis of the language evidence contained in the metadictionary in categories that are then combined into published dictionary articles. The application applies constraints to the categories in order to ensure consistency between articles written by various lexicographers and conformity to the project's style manual. It is tightly integrated with the metadictionary so evidence can be reused and refined analyses can be stored back in the metadictionary for production. Since the metadictionary is the hub of the system, those working with other pieces of evidence found in the system (the slips, the facsimiles and the corpus evidence) will have access to the refined analysis that a lexicographer has produced.

The third aspect of the editorial system being used by NO2014 is the corpus. It consists of 30 million words annotated according to LE-PAROLE conventions. The corpus resides in an Oracle database. Access to the corpus is provided through an application integrated with the metadictionary and over the web. Initially the corpus was designed to complement the evidence that had been gathered in earlier decades with modern material. It now includes even whole works that had previously only been excerpted and stored on slips.

The application consists of a concordance and various routines to identify collocations, idioms and semantic fields.
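As an illustration of what a concordance routine produces, here is a minimal keyword-in-context sketch in Python. It is not the NO2014 implementation: it assumes a tokenized text held in memory rather than a database-backed corpus, and the function name and sample sentence are hypothetical.

```python
def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `width` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text.
TEXT = "the editor checks the corpus before the editor writes the entry"
hits = concordance(TEXT.split(), "editor", width=2)
```

Counting the words that recur in the left and right contexts of such lines is one simple starting point for the collocation routines mentioned above.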


Marie Bilde Rasmussen: Revising dictionaries - 14 years of SGML editing experience

Compared to compiling a dictionary from scratch, revising an existing dictionary involves quite a different editorial process. At Gyldendal Publishers in Copenhagen, the majority of dictionary projects consist of revising and reusing dictionary data in some way or another. In my presentation I will show some of the editing tasks involved in a dictionary revision and discuss the related DWS functionality that is needed.

In our understanding, the revision of a dictionary consists of adding and deleting content. Usually it also involves a great deal of structural adjustment in order to obtain a high degree of conformity between the existing data and the lexicographic and editorial principles we want to apply to the new edition.

In most cases this structural adjustment cannot be done automatically. Sometimes our data are first generation in SGML format (having been parsed from e.g. a digital photocomposition format) and are therefore likely to be less consistently marked up than we could wish for. Sometimes the existing data are well structured, but we want to mark up information types that haven’t been marked up before. Sometimes the rectification of structure just cannot be done correctly without adding or deleting content. These are all tasks that cannot be carried out automatically, but must be performed by a human.

In our experience, this kind of work is costly and time-consuming. In each project we therefore carefully analyse the different needs for structural adjustment and then prioritise which of the tasks we are willing to spend time and money on. The prioritised, well-defined tasks are then carried out sequentially, one task at a time, each applied to the whole dictionary. This means that we do not rectify all articles one by one to a stage of 100% conformity with the editorial principles.

I will take a closer look at tasks that might involve DWS functionalities, for instance navigation in structure, moving around tree fragments, validation, structural search and replace functionality and the ability to change the grammar during the editorial process.


Pornpimon Palingoon, Pornchan Chantanapraiwan, Sapa Chanyachatchawan: Users' Expectation on Web-based Thai<->English Dictionary

In Thailand, the NECTEC management team anticipated the need for a web-based dictionary and started the project in 2000. In this paper we report on an experiment, conducted between June 2003 and 2004, to evaluate users’ expectations of a Thai <-> English web-based dictionary. User information was collected in a first questionnaire survey. Based on its results, the subjects were divided into two groups, those preferring paper-based and those preferring web-based dictionaries, and both groups answered a second questionnaire. The survey results reveal that the users differ in age, gender, occupation, education level and place of residence. Moreover, the respondents prefer web-based to paper-based bilingual dictionaries because web-based dictionaries can offer additional features such as sound, linked pages and useful tips. From this experiment we can forecast the need for a web-based dictionary, which should be developed gradually.


Gintaras Barisevicius & Elvinas Cernys: English-Lithuanian and Lithuanian-English Lexicon Database Management System for MT

In our lexicon we have implemented all English and Lithuanian parts of speech, including noun, verb, adjective, pronoun, numeral etc. We have chosen a very flexible way of implementing it. Since we use a database management system (the exact software is not essential, but we use MySQL), the lexicon is very easy to modify: attributes can be added or deleted, and the names or types of existing ones can be changed. It is possible to extend the same database to other languages as well. This only requires new tables for the new language; the existing target (or source) language remains the same. We have also allowed for context in nouns, so the user can set priorities for finding words in a certain domain.

We use the Java programming language for the implementation, so it is possible to make the system available on-line. In addition, the system is thus available on Windows, Linux, Mac and other operating systems.

The user-friendly interface and the possibility of seeing all generated forms (in Lithuanian there are usually 14 forms of a noun, 147 forms of an adjective, more than 229 forms of a verb etc.) make the process of filling the dictionary with new words very efficient.

We also took into account the possibility of entering words with many meanings. They are numbered in ascending order by their priority. Polysemy is handled in both translation directions.
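A minimal sketch of such a layout is given below, under stated assumptions: Python's built-in sqlite3 module stands in for MySQL, and all table and column names (`en_word`, `lt_word`, `en_lt_link`, `priority`, `domain`) are hypothetical rather than taken from the actual system. It shows one table per language, a link table that numbers meanings by priority, and an attribute added after the fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One word table per language; adding a language means adding tables.
    CREATE TABLE en_word (id INTEGER PRIMARY KEY, form TEXT, pos TEXT);
    CREATE TABLE lt_word (id INTEGER PRIMARY KEY, form TEXT, pos TEXT);
    -- One row per meaning; a lower priority number means a preferred meaning.
    CREATE TABLE en_lt_link (
        en_id INTEGER REFERENCES en_word(id),
        lt_id INTEGER REFERENCES lt_word(id),
        priority INTEGER
    );
    INSERT INTO en_word VALUES (1, 'bank', 'noun');
    INSERT INTO lt_word VALUES (1, 'krantas', 'noun'), (2, 'bankas', 'noun');
    INSERT INTO en_lt_link VALUES (1, 1, 2), (1, 2, 1);
""")

def translations(form):
    """Return Lithuanian equivalents of an English form, preferred meaning first."""
    rows = conn.execute(
        """SELECT lt_word.form FROM en_word
           JOIN en_lt_link ON en_lt_link.en_id = en_word.id
           JOIN lt_word ON lt_word.id = en_lt_link.lt_id
           WHERE en_word.form = ?
           ORDER BY en_lt_link.priority""",
        (form,),
    )
    return [row[0] for row in rows]

# Adding a new attribute later is a single schema change.
conn.execute("ALTER TABLE en_word ADD COLUMN domain TEXT")
```

The same link table can be queried from the Lithuanian side, which is one way to cover polysemy in both translation directions.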

Since the system is planned to be developed further, we have thought of additional interface features that might improve the user's work with the machine translation system. We must state that functions such as phrase translation, syntax rules and translation using text collections exist only at the theoretical level and are planned for implementation in the near future.

It is possible to demonstrate the working system as well. For the moment it is an off-line system.


Mutsuko Tomokiyo: Description of pragmatic properties of lexis in monolingual dictionaries for the Papillon database

This paper proposes adding pragmatic tags for lexis to the current tag set of monolingual dictionaries whose lexicography is based on the sense-text (Meaning-Text) theory, and using these tags to create acception links automatically in the Papillon database.

We have three motivations for proposing it:

First, from a user-oriented viewpoint, information on the pragmatics of lexis is indispensable, especially for non-native speakers, in order to give complete information on lexis and to enable them to handle foreign languages adequately.

Second, a description of the pragmatic aspects of lexis will enable publishers to edit conversation-oriented dictionaries, and also to develop dictionaries containing information useful for speech dialogue translation.

Finally, from the database developer's position, one of the difficulties in constructing a multilingual database is establishing semantic pairs of acceptions between languages L1 and L2, that is, matching the corresponding acceptions of polysemous words.

Currently, a method has been proposed for making matching pairs automatically by computing the meaning distance between lexis in a thesaurus [Laf, 02].

Our idea is to describe pragmatic information in monolingual dictionaries and to use it for automatic acception linking. In this paper we will first pick out and observe some deictic expressions in Japanese from the point of view of pragmatics, and then examine the pragmatic tags to be assigned to them. Finally, we will write dictionary descriptions for some words and show that it is possible to make matching pairs of lexies for polysemous words.


Jens Erlandsen: iLex - new DWS

This paper will present a new DWS, iLEX, developed and marketed by EMP in Denmark. Taking a number of real life situations as starting points, the presentation will concentrate on the following areas of functionality: 1. Presentation of content and structure for editing; 2. Editing functionality; 3. Searching in dictionary data and structure; and finally, 4. Step-wise refinement of project setup and dictionary design.

iLEX is an ergonomic and powerful tool combining effective and flexible editing with easy and fast access to data. It is based on XML and Unicode and is implemented in Java. It consists of a flexible editor and a fast, powerful XML repository database developed to cover the special needs for these types of data and projects.

Searching as an integrated part of a DWS based on XML and Unicode

Storing dictionary data in XML enables very specific and precise searches. Powerful searching with combinations of metadata, entry structure and content is always included in requirement specifications for a DWS. XPath and XQuery offer very powerful and general queries, but at the price of user-friendliness. The same goes for regular expressions.
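To make the trade-off concrete, the following Python sketch combines entry structure and content in one query, using the limited XPath subset of the standard library's `xml.etree.ElementTree`. It does not show the iLEX query language; the entry structure, element names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry structure; element and attribute names are illustrative only.
DICTIONARY_XML = """
<dictionary>
  <entry id="e1">
    <headword>bank</headword>
    <sense><pos>noun</pos><definition>a financial institution</definition></sense>
    <sense><pos>verb</pos><definition>to deposit money</definition></sense>
  </entry>
  <entry id="e2">
    <headword>river</headword>
    <sense><pos>noun</pos><definition>a large natural stream of water</definition></sense>
  </entry>
</dictionary>
"""

root = ET.fromstring(DICTIONARY_XML)

# Structure plus content: headwords of entries that have at least one verb sense.
verb_headwords = [
    entry.findtext("headword")
    for entry in root.findall("entry")
    if entry.find("sense[pos='verb']") is not None
]

# Metadata: look up a single entry by its id attribute.
river = root.find("entry[@id='e2']")
```

The path expressions are precise but hardly self-explanatory to a lexicographer, which is the user-friendliness cost noted above.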

Powerful and efficient searching in XML has up to now been rather costly and complicated to provide and maintain. Furthermore, the needs might differ between phases of a project, between tasks, and between projects and works. This might lead to the use of several query languages, or of one very flexible language.

A number of concrete needs related to different situations and data will be identified, classified and related to the query language developed for the iLEX data base – and a number of searches will be demonstrated live. Presentation of search results and further use of them as an integrated part of the editing and project managing process are demonstrated and discussed.

Flexible presentations for editing dictionary data stored in XML

One of the important features of XML is that presentation is separated from structure and content. The very high structural complexity of dictionary entries, with very little content per element, makes it difficult to read and edit data presented on screen directly in standard XML format.

In many cases editing has already been done for years in other formats. Moving the editorial process to XML might not be easy. In addition, different presentations might be useful for different tasks, users and situations. Integrated presentation of structure and content might not only make editing easier and more ergonomic and facilitate easy shifts between different presentations, but also offer greater flexibility and an easier project design process.

XSLT and XSL-FO are included in iLEX for data exchange and printing. But a thorough analysis of users' needs and of the relations between data structures and presentations has led to the inclusion of a new formatting concept for the editing process. It is powerful but still easy to work with. It not only offers a number of formatting aids, but is also open to the inclusion of functionality in the on-screen presentations.

Starting from mapping out needs, the presentation will continue with an analysis of structures and their related presentations. Finally, live demonstrations are given and possibilities and limitations of the current version of the language are discussed.


Maddalena Toscano and Giuseppe Marzatico and Salvatore La Gala and Massimiliano Sorrentino: Building a corpus based Kiswahili-Italian on-line lexical data base

The project aims at building a corpus-based, on-line accessible Kiswahili-Italian lexical data base, intended especially for Italian students of the Swahili language course.

The lexical data base (M. Toscano and G. Marzatico and S. La Gala) has been specially designed for the project. It is structured in groups and elements, according to the TEI guidelines, adapted to the needs of Kiswahili entries. Information related to an entry can be structured according to main groups (Dictionary Scrap, Esempio, Etimologia, Forma, Gruppo grammaticale, Omografo, Termini correlati, Significato, Traduzione, Confronta). Each main group can contain information structured according to elements. In order to allow maximum flexibility, almost every element can appear in almost every group. Part-of-speech labels suitable for describing specific Kiswahili language elements are being prepared. Though the set of available groups cannot be modified, the set of elements is accessible to the operator and can be adapted as needed. To fill in an entry, the operator can either select the groups and elements to be used in the entry under preparation or, alternatively, prepare a mask (or set of masks) containing the necessary fields and then proceed to fill in the data. Once an entry has been completed, the operator can approve it for publication and make it available for external search. Multiple users can be given access for filling in the data.

The data base is searchable by an external user by entry and by part of speech.

The data contained in the data base can be exported either as .doc documents or with .xml tagging.

The corpus used as a source for the lexical data base consists mainly of contemporary Kiswahili creative literature. Purpose-built software for searching and retrieving the contexts of Kiswahili inflected forms is under preparation (by M. Toscano and M. Sorrentino).

This is a small project that relies on limited funds and on cooperation from the local staff. Recently a cooperation has been established with TUKI-UDSM (Institute of Kiswahili Research - University of Dar es Salaam), which will allow staff exchange between Dar es Salaam and Naples. The full list of parts of speech and labels, plus the publication of the main grammatical entries, will be the first result of the project, possibly within a year.


Thatsanee Charoenporn, Canasai Kruengkrai, Virach Sornlertlamvanich, Thanaruk Theeramunkong, and Hitoshi Isahara: Corpus-based Dictionary Development System

This paper describes the ideas behind the construction of TCL's Computational Lexicon, which is a basic lexical knowledge base for Natural Language Processing. We focus on three main points. The first is the structure of lexical information retrieval, in which the entries and the general information are extracted and converted from an existing dictionary and mapped into a well-formed XML document. Our current computational lexical system, available on the WWW, provides bilingual (Thai–English) information. The second is the development of the statistical corpus-based editor for inserting, updating and refining lexical entries. The third is our further step of linking TCL's computational lexicon with other existing dictionaries in order to extend it from a bilingual to a multilingual lexicon. We also describe the need for an appropriate method of linking the lexicon with others, the development of a multilingual interface, and the problem of concept alignment.


David Joffe & Gilles-Maurice de Schryver: TshwaneLex - Professional off-the-shelf lexicography software

0. Introduction: On user-friendliness

TshwaneLex is a software program for compiling monolingual, bilingual or semi-bilingual dictionaries. TshwaneLex contains various innovative features designed to optimise the process of producing dictionaries, and to improve consistency and quality of the final dictionary product. The research question that drove the development was whether an off-the-shelf and language-independent dictionary writing system (DWS) could be designed – for output to paper or an electronic medium (e.g. CD-ROM or the Web) – that any lexicographers could customise to their own taste for the production of any type of (mainstream) dictionary. In this presentation we show that this bold aim was indeed achieved, and we will illustrate this with data from some of the current TshwaneLex users, which include amongst others the South African National Lexicography Units under PanSALB (Pan South African Language Board), Macmillan, and Oxford University Press.

From a designer's point of view the aim implied that a DWS had to be created with (1) full Unicode support, and (2) customisable DTDs, and which (3) ensures and enforces cross-reference integrity, (4) allows for various visualisations of the data distribution structure on micro- and macro-level (through e.g. tree views, linked view modes and the use of Rulers), and (5) is user-friendly. Regarding the latter, strong emphasis has indeed been placed during the design of TshwaneLex on producing a user-friendly tool, to reduce required training time, and also based on the principle that lexicographers should not need advanced computer literacy skills in order to compile dictionaries. Another major underlying design principle of TshwaneLex has been that the software should automate as much as possible for the lexicographer.

A selection of TshwaneLex features and accompanying screenshots is presented below.

1. Full Unicode Support

Unicode, the international character set standard, is fully supported for every aspect throughout TshwaneLex.

Windows IMEs: Data can be entered directly into TshwaneLex using any of the IMEs (Input Method Editors) available in Microsoft Windows 2000 or Windows XP, such as those for Chinese, Japanese, Korean or Arabic.

2. Customisable DTD (Document Type Definition)

A DTD (Document Type Definition) is used to describe the structure of lemmas for a particular dictionary project. TshwaneLex allows the DTD to be fully customised by the user on a per dictionary project basis.

A user-friendly interface design, supplemented with detailed documentation, allows end-users to configure the DTD without assistance from an IT expert. TshwaneLex also creates a sensible default DTD for new dictionary projects, allowing a user to get “up and running” within minutes.

Template DTDs may be created, allowing a DTD to be easily re-used for other dictionary projects, or allowing new projects to be initially based on standard DTDs.

TshwaneLex enforces DTD constraints, preventing lexicographers from creating invalid entries and thus ensuring consistency throughout the dictionary.

Compatible with XML DTDs (supports elements, attributes, entities, and relational constraints such as “one child only”, “one or more”, “zero or more”, etc.).

The DTD allows certain fields to be restricted to a selection from a list, such as for instance from a list of parts of speech or labels.

A powerful Styles system allows all aspects of the visual output of every element type to be configured.
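The kind of constraint enforcement described in this section (child-count constraints and fields restricted to a list) can be sketched in Python as follows. This is not how TshwaneLex implements it: the rule tables, element names and `validate` function are hypothetical, and unlike a real DTD the sketch checks neither child order nor undeclared elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: (minimum, maximum) child counts; None means unbounded.
CHILD_RULES = {
    "entry": {"headword": (1, 1), "sense": (1, None)},   # one only, one or more
    "sense": {"pos": (1, 1), "definition": (0, None)},   # one only, zero or more
}
# Fields restricted to a selection from a closed list.
ALLOWED_VALUES = {"pos": {"noun", "verb", "adjective"}}

def validate(element):
    """Return a list of human-readable constraint violations for an element tree."""
    errors = []
    for child_tag, (low, high) in CHILD_RULES.get(element.tag, {}).items():
        count = len(element.findall(child_tag))
        if count < low or (high is not None and count > high):
            errors.append(f"<{element.tag}>: {count} <{child_tag}> children not allowed")
    allowed = ALLOWED_VALUES.get(element.tag)
    if allowed is not None and (element.text or "").strip() not in allowed:
        errors.append(f"<{element.tag}>: value {element.text!r} not in list")
    # Recurse so that nested elements are checked as well.
    for child in element:
        errors.extend(validate(child))
    return errors

good = ET.fromstring(
    "<entry><headword>run</headword><sense><pos>verb</pos></sense></entry>")
bad = ET.fromstring(
    "<entry><sense><pos>vrb</pos></sense></entry>")
```

Here `validate(good)` returns an empty list, while `validate(bad)` reports the missing headword and the part-of-speech value that is not in the list. An editor that refuses to save an entry while the list is non-empty would prevent invalid entries from being created.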

3. Cross-Reference System

Show related cross-references: Whenever one views or works on a lemma in TshwaneLex, all lemmas with cross-references to the current lemma as well as all lemmas cross-referenced by the current lemma are immediately shown in the lemma preview area.

Automatic homonym and sense number updating: TshwaneLex automatically updates the cross-reference target homonym and sense numbers when these change on the cross-reference target lemma.

Ensures and enforces cross-reference integrity throughout the dictionary editing process. There is no need to keep track of cross-references manually.
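A minimal Python sketch of cross-reference bookkeeping follows, assuming entries held in a plain dictionary with a hypothetical `see_also` field. It illustrates the three ideas listed in this section (showing incoming references, updating references when the target changes, and detecting broken references) and makes no claim about TshwaneLex's internal design.

```python
def dangling_references(entries):
    """Return (source, target) pairs whose target lemma does not exist."""
    return [
        (lemma, target)
        for lemma, data in entries.items()
        for target in data["see_also"]
        if target not in entries
    ]

def incoming_references(entries, lemma):
    """Return the lemmas that cross-reference `lemma`."""
    return sorted(src for src, data in entries.items() if lemma in data["see_also"])

def rename_lemma(entries, old, new):
    """Rename a lemma and update every cross-reference that points at it."""
    entries[new] = entries.pop(old)
    for data in entries.values():
        data["see_also"] = [new if t == old else t for t in data["see_also"]]

# Hypothetical toy data.
ENTRIES = {
    "colour": {"see_also": ["hue"]},
    "hue": {"see_also": ["colour"]},
    "tint": {"see_also": ["colour", "shade"]},
}
```

Calling `rename_lemma(ENTRIES, "colour", "color")` rewrites the references in "hue" and "tint" in one step, so nothing has to be tracked by hand, and `dangling_references(ENTRIES)` flags the reference from "tint" to the missing lemma "shade".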

4. Bilingual Editing Features

Side by side editing window layout allows the lexicographer to view or work on both sides of the dictionary simultaneously.

When Linked View mode is selected, TshwaneLex automatically shows all lemmas on the other side of the dictionary related to the currently selected lemma.

Automated Lemma Reversal functions save the lexicographer valuable time when creating the reverse side of the dictionary.
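The basic idea behind lemma reversal can be sketched in a few lines of Python. The function name and the English-German fragment are hypothetical, and a real reversal function would have to carry over far more than bare translation equivalents (parts of speech, usage labels, sense divisions).

```python
from collections import defaultdict

def reverse_dictionary(entries):
    """Build a draft reverse side: each translation equivalent becomes a lemma."""
    reversed_side = defaultdict(list)
    for lemma, equivalents in entries.items():
        for equivalent in equivalents:
            reversed_side[equivalent].append(lemma)
    # Sort for a stable, alphabetical draft that a lexicographer can then edit.
    return {lemma: sorted(sources) for lemma, sources in sorted(reversed_side.items())}

# Hypothetical English-to-German fragment.
EN_DE = {
    "house": ["Haus"],
    "home": ["Haus", "Heim"],
    "dog": ["Hund"],
}
DE_EN = reverse_dictionary(EN_DE)
```

The result is only a draft: it saves typing, but the lexicographer still has to judge which reversed equivalents deserve a place on the other side of the dictionary.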

5. Miscellaneous Other Features

Direct export to RTF (Microsoft Word), XML, HTML or HTML/CSS.

Sound files can be attached to any field.

The Filter function allows the lexicographer to define criteria for viewing a subset of the data, for example “show all homonyms” or “show all cross-references”. More advanced filters may also be defined through the use of Boolean operators.

Customisation of the language of the meta-language: TshwaneLex allows multiple translated sets of labels to be defined for displaying information such as cross-reference type, part of speech and usage information.

The Full Dictionary Search tool allows fast text searches on the entire dictionary, with options such as case-sensitivity or whole-word/partial-word matching. Advanced users may also use regular expressions.

The Compare/Merge tool allows different versions of a database to be visually compared with one another, allowing changes to be merged into the current database. Changes made by a lexicographer working at home, or by lexicographers that are not connected directly to the main database, can be easily merged back into the main database via a user-friendly interface.

An online (Web) dictionary module and an electronic (CD-ROM) dictionary module are also available for TshwaneLex.

For more information about TshwaneLex, please visit the TshwaneDJe HLT website. Low-cost licenses are available for academic/non-profit use, as well as for use for endangered languages. TshwaneDJe HLT also provides consulting and training services for all aspects of the dictionary compilation process.


Mathieu Poumeyrol: Dictionary Writing Searches


IDM has been involved in dictionary writing systems for 5 years. The first major development was the DPS/CorpusViewer pair, designed with Longman Dictionary as a replacement for their previous tool. It now appears that these requirements met most of the needs of other dictionary publishers, and IDM offers commercial licenses and services around both tools.

IDM has also been chosen by OED to design their new editing system. The complexity and the demands of the OED project are leading to major modifications to the dictionary writing system platform, and IDM is planning a version 2 of the DPS. One big issue addressed has been dictionary searches.

Dictionary editing and search

The OED team has always used the search engine as the main (if not the only) entry point into their dictionary. The dictionary itself is stored in a single SGML-like database, along with all its attached meta-data. The tagging, designed in the early 1980s, was compact, and many semantic elements were implicit (“anything in a sense and not in a quotation is a definition”, for instance).

As the team felt the need for more data structuring and for separating out the meta-data, the search scenarios had to be listed and understood in order to provide efficient cross-database (dictionary and workflow) search features.

Dictionary content search

In the new system, the meta-data will be stored separately in an SQL database, and searches on the meta-data will be handled using built-in SQL features. The dictionary content is stored, in XML format, in the SQL database, and we have designed and implemented a separate set of tools to provide indexing, search and retrieval of the dictionary content.
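The split described here, with meta-data in ordinary SQL columns and the dictionary content held as XML in the same database, can be sketched as follows. This is a minimal sketch using SQLite and the Python standard library, not IDM's design: the table names, column names, element names and sample rows are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical schema: one table for XML content, one for workflow meta-data.
SCHEMA = """
CREATE TABLE entry (
    id          INTEGER PRIMARY KEY,
    headword    TEXT NOT NULL,
    content_xml TEXT NOT NULL
);
CREATE TABLE entry_meta (
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    status   TEXT NOT NULL,
    editor   TEXT NOT NULL
);
"""


def make_database() -> sqlite3.Connection:
    """Create an in-memory database with two invented entries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO entry (id, headword, content_xml) VALUES (?, ?, ?)",
        [
            (1, "tempest", "<entry><hw>tempest</hw><sense><def>A violent storm.</def></sense></entry>"),
            (2, "calm", "<entry><hw>calm</hw><sense><def>Absence of wind.</def></sense></entry>"),
        ],
    )
    conn.executemany(
        "INSERT INTO entry_meta (entry_id, status, editor) VALUES (?, ?, ?)",
        [(1, "in_revision", "editor_a"), (2, "published", "editor_b")],
    )
    return conn


def headwords_with_status(conn: sqlite3.Connection, status: str) -> List[str]:
    """Meta-data search: answered entirely by SQL."""
    rows = conn.execute(
        "SELECT e.headword FROM entry e "
        "JOIN entry_meta m ON m.entry_id = e.id "
        "WHERE m.status = ? ORDER BY e.headword",
        (status,),
    )
    return [row[0] for row in rows]


def definitions(conn: sqlite3.Connection, headword: str) -> List[str]:
    """Content search: fetch the stored XML and inspect it outside SQL."""
    row = conn.execute(
        "SELECT content_xml FROM entry WHERE headword = ?", (headword,)
    ).fetchone()
    if row is None:
        return []
    return [d.text or "" for d in ET.fromstring(row[0]).iter("def")]


if __name__ == "__main__":
    db = make_database()
    print(headwords_with_status(db, "in_revision"))
    print(definitions(db, "tempest"))
```

In this sketch the XML is parsed on demand; a real system with the response times required here would rely on a dedicated index over the content, as the abstract describes.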

The variety of search use cases led to a long list of search features, including structural-level, word-level and character-level operators:

Structural-level operators apply to the XML tagging and make it possible to look for tagging patterns. For instance, a lexicographer may want to find a pattern similar to the one he is about to create, in order to check that it is allowed. Or a dictionary editor or manager may want to check the implications of a small DTD change before it makes huge parts of the dictionary invalid (looking for “senses without definition”, or “cross-references in etymology”).

Word-level operators make it possible to find phrases or concordances in the dictionary, using sequence and proximity operators. To be efficient and reliable, the search should handle inflected forms. Several dictionaries of equivalent forms may also be relevant, to support British and American or historical spelling variations.

Character-level operators make it possible to check the use of abbreviations, or to look for spelling variations not included in the dictionary. Regular expression support is important, as is flexible selection of case sensitivity and accent sensitivity.
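The two structural checks mentioned above, “senses without definition” and “cross-references in etymology”, can be sketched with Python's standard XML library. The element names (`entry`, `sense`, `def`, `etym`, `xr`) and the sample document are assumptions chosen for the illustration; they are not the OED tagging or the IDM query language.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Invented sample data with hypothetical element names.
SAMPLE = """
<dictionary>
  <entry id="e1"><hw>storm</hw>
    <sense n="1"><def>Violent weather.</def></sense>
    <sense n="2"><quotation>A storm is coming.</quotation></sense>
  </entry>
  <entry id="e2"><hw>calm</hw>
    <etym>From Old French <xr>calme</xr>.</etym>
    <sense n="1"><def>Absence of wind.</def></sense>
  </entry>
</dictionary>
"""


def senses_without_definition(xml_text: str) -> List[Tuple[str, str]]:
    """Return (entry id, sense number) for every sense lacking a <def> child."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for entry in root.iter("entry"):
        for sense in entry.iter("sense"):
            if sense.find("def") is None:
                hits.append((entry.get("id"), sense.get("n")))
    return hits


def entries_with_xref_in_etymology(xml_text: str) -> List[str]:
    """Return the ids of entries with an <xr> anywhere inside an <etym>."""
    root = ET.fromstring(xml_text.strip())
    return [
        entry.get("id")
        for entry in root.iter("entry")
        if entry.find(".//etym//xr") is not None
    ]


if __name__ == "__main__":
    print(senses_without_definition(SAMPLE))
    print(entries_with_xref_in_etymology(SAMPLE))
```

Queries of this kind let an editor list the entries a proposed DTD change would invalidate before the change is made.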

Most real-life use cases require mixing these three levels of search: for instance, a lexicographer looking for the right way to abbreviate Shakespeare when quoting from The Tempest may want to look for occurrences of words starting with “shak” and “temp” (character level) that are close enough to each other (proximity, word level) inside a quotation element (XML, structural level).

Other requirements

A command-line interface also had to be sufficient to drive all of the search features, without requiring dozens of clicks to launch a request. A request had to feel virtually instantaneous to the user (less than 1 second for most queries).

Indexing speed had to allow compilation of the dictionary data overnight.


Dave Moskovitz: Mātāpuna Dictionary Database System

The Mātāpuna Dictionary Database System (MDDS) provides a platform supporting the work of the writers, editors, and managers of Te Matapuna, the first monolingual dictionary of the Maori language. All of the text in the dictionary, including definitions and examples, will be in Maori. The system has been extremely successful in allowing Maori language experts from all over the country to collaborate on this important scholarly work. The system is flexible and very easy to use, and was inexpensive to commission.

The system is Open Source software, and will be freely available to anyone, anywhere in the world needing a lexicographical software system. To our knowledge, this is the only Open Source project that has been initiated by a New Zealand government entity, as well as being the first Open Source lexicography software project.

MDDS has made a huge difference to the monolingual dictionary project, enabling users to easily collect, validate, analyse, share, and report on the data, as well as printing out draft copies of the dictionary.