DEVELOPING THE BULGARIAN NOOJ MODULE WITH A VIEW TO ENABLING LARGE COVERAGE LEXICAL MATERIAL

Svetla Koeva, Borislav Rizov, Svetlozara Leseva

(Institute for Bulgarian Language, Bulgarian Academy of Sciences)

svetla@ibl.bas.bg ; bobby@ibl.bas.bg ; zara@ibl.bas.bg

Abstract

This presentation describes the existing Bulgarian NLP resources and their adaptation to NooJ, with a view to creating a reliable framework for the development and enhancement of the Bulgarian NooJ module. The first step is the incorporation of the DELAS and DELACF dictionaries in the .nod format. The DELAS dictionary used in the Bulgarian INTEX module is derived from a large-coverage Grammar Dictionary of Bulgarian (BGD; Koeva, 1998) consisting of more than 80,000 entries, each assigned the name of the FST for the inflection type it belongs to. The conversion thus involves the BGD itself, the automatic association of the respective inflectional FSTs with the inflection types in the .nod format, and the description of those types in the .flx format. The resulting dictionary is then tested on corpus excerpts and the errors found are corrected.
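The conversion step described above can be illustrated with a minimal sketch. It assumes a DELAS-style source line of the shape `lemma,CODE+Feature`, where CODE is a category letter followed by the number of the inflection type, and a NooJ-style target line of the shape `lemma,CATEGORY+FLX=PARADIGM+Feature`. The concrete codes (`N41`, `N7`), the feature `Hum`, and the function names are hypothetical and chosen for illustration only; the actual encoding used in the BGD may differ.

```python
import re

# Assumed DELAS-style entry: lemma,<CATEGORY><number>+Feature+...
# (the exact layout of the real BGD entries is an assumption here)
DELAS_LINE = re.compile(
    r"^(?P<lemma>[^,]+),(?P<cat>[A-Z]+)(?P<num>\d+)(?P<feats>(?:\+[^+\s]+)*)\s*$"
)


def delas_to_nooj(line):
    """Rewrite one assumed DELAS-style entry as an assumed NooJ-style entry.

    Returns None when the line does not match the expected shape, so that
    such entries can be collected and inspected by hand.
    """
    match = DELAS_LINE.match(line.strip())
    if match is None:
        return None
    # The inflection type keeps its old code as the paradigm name.
    paradigm = match.group("cat") + match.group("num")
    return f"{match.group('lemma')},{match.group('cat')}+FLX={paradigm}{match.group('feats')}"


def convert(lines):
    """Convert many entries; return (converted, rejected) lists."""
    converted, rejected = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = delas_to_nooj(line)
        if entry is None:
            rejected.append(line.strip())
        else:
            converted.append(entry)
    return converted, rejected


if __name__ == "__main__":
    sample = ["маса,N41", "учител,N7+Hum", "malformed entry"]
    ok, bad = convert(sample)
    print(ok)
    print(bad)
```

Keeping the rejected lines separate mirrors the testing stage mentioned above: entries that fail conversion are surfaced for manual correction rather than silently dropped.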

The existing DELACF dictionaries of Bulgarian present a more difficult task, as their entries still have to be associated with their corresponding inflection types. The Bulgarian WordNet database serves as a complementary source of compound forms alongside the DELACF dictionaries.

A further step towards a system with exhaustive lexical coverage is the testing of the resulting dictionaries on a large corpus, the analysis of the results, and the enlargement of the dictionaries. To this end, it is envisaged that the unrecognised tokens will be categorised and handled either in the dictionary or with morphological grammars that recognise distinct derivation types, for example diminutive nouns, certain prefixed words such as classes of verbs, verbal nouns derived from prefixed words, feminine person nouns derived from the corresponding masculine nouns, etc. Thus, the semantic and morpho-semantic relations between the lexical items that follow these derivation patterns will be accounted for and encoded in the Bulgarian NooJ module, thereby broadening automatic information retrieval and making it more powerful.
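The triage of unrecognised tokens outlined above might be sketched as follows. In NooJ itself this would be expressed with morphological grammars; the Python below only illustrates the underlying logic of stripping an affix and checking whether the remainder is a known lemma. The rule tables contain just three illustrative patterns (diminutive -че, feminine -ка, verbal prefix пре-), and the names `SUFFIX_RULES`, `PREFIX_RULES`, and `classify_unknown` are hypothetical.

```python
# Illustrative rules only: (derivation label, affix). A real grammar would
# cover many more affixes and the stem alternations that accompany them.
SUFFIX_RULES = [
    ("diminutive noun", "че"),       # столче <- стол
    ("feminine person noun", "ка"),  # учителка <- учител
]
PREFIX_RULES = [
    ("prefixed verb", "пре"),        # препиша <- пиша
]


def classify_unknown(token, lexicon):
    """Guess the derivation type of a token missing from the dictionary.

    Returns (label, base) when stripping a known affix yields a lemma that
    is present in the lexicon, otherwise ("unclassified", None).
    """
    for label, suffix in SUFFIX_RULES:
        if token.endswith(suffix):
            base = token[: -len(suffix)]
            if base in lexicon:
                return label, base
    for label, prefix in PREFIX_RULES:
        if token.startswith(prefix):
            base = token[len(prefix):]
            if base in lexicon:
                return label, base
    return "unclassified", None


if __name__ == "__main__":
    lexicon = {"стол", "учител", "пиша"}
    for token in ["столче", "учителка", "препиша", "компютър"]:
        print(token, classify_unknown(token, lexicon))
```

Returning the base lemma together with the label keeps the link between the derived word and its source, which is the kind of morpho-semantic relation the module is meant to encode.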

The work in this direction is to be carried out within the NooJ framework as part of the RILA project, under which exhaustive lexical material will provide the basis for automatic information extraction based on the semantic relations linking words within a language and cross-linguistically.