Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval

 

Svetla Koeva (Sofia Univ., Bulgaria)

Max Silberztein (Univ. de Franche-Comté)

 

 

Abstract

 

This paper aims to provide a methodology for integrating natural-language semantic relations into the NooJ system. The idea is to create specialized Semantic Dictionaries for English, French and Bulgarian, each based on a different semantic relation, and to use the structured data for information retrieval from multilingual texts.

The integration of semantic relations into the INTEX system was first proposed at the sixth INTEX workshop by Svetla Koeva and Stoyan Mihov (Semantic Relations in INTEX) and was later developed within the joint RILA research project "Information retrieval based on semantic relations" between LASELDI, Université de Franche-Comté, and the Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.

All word meanings in a language are interconnected by semantic relations, forming a huge network that is modelled by WordNet, initially developed at Princeton University for English. The success of the English WordNet has led to the emergence of several projects aiming at the development of WordNets for languages other than English, including French and Bulgarian. The Bulgarian WordNet, BulNet (Koeva, 2004), has been under development for four years within the framework of the BalkaNet project; over 21 000 synsets and 40 000 literals have been encoded so far.

A careful selection determines which of the seventeen semantic, morpho-semantic and extralinguistic relations included so far in the WordNet structure are most appropriate for the purposes of information retrieval. These are relations of equivalence, inheritance, similarity, and thematic domain affiliation: synonymy (a reflexive, symmetric, and transitive relation of equivalence); hypernymy (an inverse, asymmetric, and transitive relation between synonym sets, or synsets, that relates general and more specific concepts); meronymy (an asymmetric relation which links synsets denoting wholes with those denoting their parts); similar to (a symmetric relation between similar adjectival synsets); verb group (a symmetric relation between semantically related verb synsets); also see (a symmetric relation between synsets of verbs or adjectives that are close in meaning); and category domain (an extralinguistic relation between a synset denoting a concept and the sphere of knowledge it belongs to). At the first stage of our research we selected those of the relations enumerated above that relate nouns (and verbs) only: synonymy, hypernymy, meronymy (in fact comprising three different semantic types of meronymy: part, portion and member), and category domain.
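
The formal properties listed above can be made concrete with a minimal sketch. It is illustrative only: the synset identifiers, the literals, and the single-parent assumption for hypernymy are invented for the example and do not reflect the actual WordNet or NooJ data formats.

```python
# Toy data: synset identifiers and literals are hypothetical.
SYNSETS = {
    "s1": ["car", "automobile"],
    "s2": ["motor vehicle"],
    "s3": ["vehicle"],
}

# Hypernymy as child synset -> parent synset (one parent assumed for simplicity).
HYPERNYM_OF = {"s1": "s2", "s2": "s3"}


def synonyms(literal):
    """Synonymy is an equivalence: a literal's synonyms are all literals
    sharing a synset with it, the literal itself included (reflexivity)."""
    result = set()
    for literals in SYNSETS.values():
        if literal in literals:
            result.update(literals)
    return result


def hypernym_chain(synset_id):
    """Hypernymy is transitive: every ancestor of a synset is a hypernym."""
    chain = []
    seen = set()
    current = HYPERNYM_OF.get(synset_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)  # guards against accidental cycles in the data
        current = HYPERNYM_OF.get(current)
    return chain
```

Symmetry of synonymy shows up as synonyms("car") and synonyms("automobile") returning the same set, while asymmetry of hypernymy shows up as hypernym_chain("s3") being empty.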

For each of the considered relations, a separate Semantic Dictionary is created for English, French and Bulgarian. The Semantic Dictionaries are built from the WordNet structures for the corresponding languages, on the one hand, and from the respective inflectional dictionaries, on the other. The DELAF dictionaries (Silberztein, 1993), initially used in INTEX, have been converted into the NooJ format (Silberztein, 2004). Each Semantic Dictionary consists of pairs of literals defined for the corresponding semantic relation (all possible combinations of the literals in the given synsets are listed). We have thus created six semantic sub-dictionaries for each language, which can be combined in different ways within a language as well as across languages. From the WordNet structure alone we can generate Semantic Dictionaries of citation forms only, since that is how literals are encoded in WordNet, and this would significantly reduce the retrieval capacity of the dictionaries. The problem is handled by combining the Semantic Dictionaries with the respective inflectional dictionaries in the DELAF format. Because the Bulgarian inflectional system has been analysed and formally described within the FST framework (Koeva, 1998), sub-dictionaries of different scope and size (including ones in the DELAF format) can be developed for different purposes. As a result, all word forms in the inflectional DELAF dictionaries are associated with the entry words, which provides appropriate lexical coverage. Problems arise when a citation form is ambiguous (associated with different parts of speech or with different inflectional types); such cases require validation procedures during dictionary creation.
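
The construction described above (pairs of literals per synset, then expansion with inflected forms) might be sketched as follows. This is a simplified illustration under stated assumptions: the data are invented, the inflectional dictionary is reduced to a plain mapping rather than real DELAF entries, and treating the pairs as ordered pairs of distinct literals is our reading of "all possible combinations".

```python
from itertools import permutations

# Hypothetical toy data, not actual WordNet or DELAF content.
SYNSET = ["car", "automobile"]

# Simplified stand-in for an inflectional dictionary: citation form -> word forms.
INFLECTED_FORMS = {
    "car": ["car", "cars"],
    "automobile": ["automobile", "automobiles"],
}


def synonym_pairs(literals):
    """List all ordered pairs of distinct literals within one synset."""
    return list(permutations(literals, 2))


def expand_with_forms(pairs, forms):
    """Associate every word form of the first literal with the second
    (entry) literal; unknown literals fall back to their citation form."""
    expanded = []
    for source, target in pairs:
        for form in forms.get(source, [source]):
            expanded.append((form, target))
    return expanded


PAIRS = synonym_pairs(SYNSET)
EXPANDED = expand_with_forms(PAIRS, INFLECTED_FORMS)
```

With the toy data, the two citation-form pairs grow into four entries, so that a text occurrence of "cars" can also be linked to "automobile". An ambiguous citation form would need an extra disambiguation key (for example part of speech) before the lookup, which this sketch omits.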

The WordNet structure also covers compound literals, some of which have only one form while others are inflected. Additional work is therefore required to incorporate the compound entries into the WordNet structure, on the one hand, and to associate them with the relevant inflectional type, on the other. So far we have represented the citation forms of the compound literals in the Bulgarian WordNet.

The main applications of the Semantic Dictionaries implemented in NooJ are directed towards information retrieval by means of semantic equivalence (with the synonymy dictionaries), semantic specification (with the hypernymy dictionaries), similarity relations, and domain-specific relations. The next step is the incorporation of translation equivalents into the semantic dictionaries for the three languages; the Bulgarian-English pairing has already been completed. Some problems concerning semantic ambiguity remain to be solved, since different meanings of the same word can lead to wrong extractions.
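
Retrieval by semantic equivalence can be illustrated with a short sketch. It does not use NooJ itself: the synonymy dictionary, the tokenizer, and the sample documents are all hypothetical stand-ins chosen for the example.

```python
import re

# Hypothetical synonymy dictionary: word form -> set of equivalent citation forms.
SYNONYMY = {
    "car": {"car", "automobile"},
    "cars": {"car", "automobile"},
    "automobile": {"car", "automobile"},
    "automobiles": {"car", "automobile"},
}


def tokenize(text):
    """Very rough tokenizer: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def retrieve(query, documents):
    """Return the indices of documents containing any word form that is
    semantically equivalent to the query word."""
    targets = SYNONYMY.get(query.lower(), {query.lower()})
    hits = []
    for index, text in enumerate(documents):
        if any(SYNONYMY.get(token, {token}) & targets for token in tokenize(text)):
            hits.append(index)
    return hits


DOCS = [
    "Two automobiles were parked outside.",
    "The train was late.",
    "She bought a car.",
]
```

Here a query for "car" also matches the document that mentions only "automobiles". The sketch makes no attempt at sense disambiguation, so a polysemous query word would produce exactly the kind of wrong extraction mentioned above.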