Migration of Intex resources towards NooJ: the case of Serbian
Ranka Stanković 1, Ivan Obradović 1, Cvetana Krstev 2,
Duško Vitas 3, Gordana Pavlović-Lažetić 3
1Faculty of Mining and Geology, University of Belgrade, Đušina 7, CS – 11000 Belgrade
2Faculty of Philology, University of Belgrade
Studentski trg 3, CS – 11000 Belgrade
3Faculty of Mathematics, University of Belgrade
Studentski trg 16, CS – 11000 Belgrade
Intex has been used for processing Serbian text for almost a decade. To that end, numerous lexical resources have been developed, which fall into two broad categories:
(1) dictionaries in DELA format and
(2) transducers.
Among the dictionaries, the most developed is the dictionary of simple words, which currently comprises 73,000 lemmas (DELAS entries) and more than a million word forms (DELAF entries). Besides these dictionaries of general lexica, dictionaries of proper names have also been developed, comprising at this moment approximately 21,000 lemmas, or 145,000 forms. A dictionary of compound words is under development, namely of compound nouns, prepositions, conjunctions and adverbs, as well as compound toponyms and proper names. Many auxiliary dictionaries have also been developed, such as special purpose filter dictionaries and auxiliary dictionaries for the processing of particular texts.
As for the transducers, priority was given to those that describe inflectional classes, namely the transducers that generate DELAF dictionaries from the corresponding DELAS dictionaries. There are 770 transducers in this group, including 333 for nouns, 60 for adjectives and 344 for verbs. A large number of lexical transducers have also been developed for derivation in Serbian, as well as a number of transducers for the identification of specific forms, such as acronyms with their inflected and derived forms (e.g. OEBS, OEBS-a, OEBS-ov). Transducers for disambiguation are under development.
A specific feature of Serbian is the simultaneous use of two alphabets, Latin and Cyrillic. Although the Cyrillic alphabet is promoted as the official alphabet, the Latin alphabet is used just as widely in both traditional paper publishing and electronic publishing. In order to avoid creating and maintaining dual resources, the Latin alphabet and the ASCII code are used for development purposes. This means that letters specific to the Serbian alphabet are coded by two alphabetic signs: for example, Latin š (Cyrillic ш) is coded as sx, and Latin nj (Cyrillic њ) as nx. From this single source, collections of lexical resources were derived for Intex users in any of the 8-bit coding schemes for the Latin or Cyrillic alphabet.
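The two-sign coding described above can be illustrated with a minimal Python sketch. It is not the authors' conversion tool: the function name `decode` and the table names are hypothetical, the digraph table contains only the two codes stated in the text (sx and nx), and only lowercase input is handled. The single-letter table is the standard Serbian Latin-to-Cyrillic correspondence.

```python
# Hypothetical sketch of decoding the ASCII development coding.
# Only the two digraph codes given in the text are included;
# the rest of the scheme is deliberately left out.
DIGRAPHS = {
    "sx": ("š", "ш"),   # (Latin output, Cyrillic output)
    "nx": ("nj", "њ"),
}

# Standard one-to-one Serbian Latin -> Cyrillic letters (lowercase only).
PLAIN = dict(zip("abvgdezijklmnoprstufhc", "абвгдезијклмнопрстуфхц"))


def decode(text, target):
    """Convert ASCII-coded text to Unicode; target is 'latin' or 'cyrillic'."""
    if target not in ("latin", "cyrillic"):
        raise ValueError("target must be 'latin' or 'cyrillic'")
    column = 0 if target == "latin" else 1
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # A two-sign code maps to one Serbian letter.
            out.append(DIGRAPHS[pair][column])
            i += 2
        else:
            ch = text[i]
            # Latin output keeps plain letters; Cyrillic output maps them.
            out.append(ch if column == 0 else PLAIN.get(ch, ch))
            i += 1
    return "".join(out)


print(decode("sxkola", "latin"))     # škola
print(decode("sxkola", "cyrillic"))  # школа
print(decode("konx", "cyrillic"))    # коњ
```

In this sketch a plain sequence n + j in the source stays two letters in Cyrillic (for example, "injekcija" yields "инјекција"), whereas nx yields the single letter њ, so both target alphabets can be derived from the same coded entry.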
Moving the further development of lexical resources for processing Serbian texts from the Intex environment to the NooJ environment, and tailoring NooJ to the needs of the end user, presupposes the conversion of all resources into Unicode. Given the use of two alphabets in Serbian, we have decided to keep the development resources in the Latin alphabet with the adopted transliteration scheme. This allows for the automatic production of both Latin and Cyrillic Unicode versions of the resources, which can be used, as needed, either separately or jointly (for texts where both alphabets are present). For the dictionaries, this procedure will be applied to the DELAF resources directly, which makes the translation of the numerous inflectional transducers unnecessary. In this paper we present these automatic procedures and investigate the possibilities for their inclusion in the NooJ environment.
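Applying the conversion directly to DELAF entries could look roughly like the sketch below. It assumes the usual DELAF line layout `form,lemma.codes` with no escaped commas or dots, and it transliterates only the form and the lemma, leaving the grammatical codes untouched. The function names and the sample code string `N:fs2` are hypothetical, and the digraph table again holds only the two codes given in the text.

```python
# Hypothetical sketch: produce a Cyrillic DELAF line from an ASCII-coded one.
DIGRAPHS = {"sx": "ш", "nx": "њ"}  # only the codes stated in the text
PLAIN = dict(zip("abvgdezijklmnoprstufhc", "абвгдезијклмнопрстуфхц"))


def to_cyrillic(word):
    """Transliterate one lowercase ASCII-coded word into Cyrillic."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(PLAIN.get(word[i], word[i]))
            i += 1
    return "".join(out)


def convert_delaf_line(line):
    """Convert 'form,lemma.codes', keeping the codes exactly as they are."""
    form, rest = line.split(",", 1)
    lemma, codes = rest.split(".", 1)
    return f"{to_cyrillic(form)},{to_cyrillic(lemma)}.{codes}"


print(convert_delaf_line("sxkole,sxkola.N:fs2"))  # школе,школа.N:fs2
```

Keeping the codes outside the transliteration step matters because they are Latin-letter tags, not Serbian text; a real converter would also need to handle uppercase letters, the full digraph table, and escaped separators.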
Lexical resources developed within the Intex environment have also been used to develop textual and lexical resources in other formats, primarily the MULTEXT-East morphosyntactic descriptions (for the lexicon and the morphosyntactically annotated Serbian translation of Orwell’s 1984). In this paper we will also examine procedures for automatic import and export between resources in NooJ format and resources in MULTEXT-East and other currently accepted formats, such as LMF/MAF.