On the Ambiguity of Serbian Texts and Methods to disambiguate it

Cvetana Krstev (University of Belgrade)

and Duško Vitas (University of Belgrade)

{cvetana,vitas}@matf.bg.ac.yu

Abstract

Electronic dictionaries and transducers with lexical constraints, used in the Intex environment, make it possible to annotate Serbian texts with good lexical coverage. Many word forms are, however, ambiguous. In this paper we first distinguish different levels of ambiguity, for example the assignment of different parts of speech, of different lemmas, and of different grammatical categories. We then present the origins of ambiguity in Serbian texts, such as the lack of stress marks in written text, which makes many Serbian word forms homographs although they are not homophones.
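The levels of ambiguity listed above can be illustrated with a minimal sketch. The tag names, the `Analysis` record, and the three-entry lexicon are assumptions made for illustration; they do not reproduce the actual Intex dictionary format.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str        # hypothetical part-of-speech tag
    features: str   # hypothetical grammatical categories


def ambiguity_level(analyses):
    """Return the coarsest level at which the candidate analyses differ."""
    if len({a.pos for a in analyses}) > 1:
        return "part of speech"
    if len({a.lemma for a in analyses}) > 1:
        return "lemma"
    if len({a.features for a in analyses}) > 1:
        return "grammatical categories"
    return "unambiguous"


# Toy lexicon (illustrative, simplified analyses).
LEXICON = {
    # verb 'am' vs. adjective 'alone'
    "sam": [Analysis("jesam", "V", "pres 1 sg"),
            Analysis("sam", "A", "nom sg masc")],
    # pronoun 'we' vs. dative of the pronoun 'I'
    "mi": [Analysis("mi", "PRO", "nom pl"),
           Analysis("ja", "PRO", "dat sg")],
    # one lemma, several case/number readings
    "kuće": [Analysis("kuća", "N", "gen sg"),
             Analysis("kuća", "N", "nom pl"),
             Analysis("kuća", "N", "acc pl")],
}

for form, analyses in LEXICON.items():
    print(form, "->", ambiguity_level(analyses))
```

The function reports only the coarsest difference; a form that differs in part of speech usually differs in lemma and categories as well.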

Ambiguity can be resolved in several ways, some of which are presented in this paper. It can be resolved manually, and a program has been developed to support the process of manual disambiguation. For automatic disambiguation, special dictionaries can be used first of all, such as dictionaries of the Filter and Disamb types. We outline the structure of these dictionaries for Serbian and assess how effective they are for disambiguation. We also explore the possibility of disambiguating by restricting the general dictionaries: lemmas belonging to a dialect or pronunciation not used in a given text can be excluded from the dictionary, as can certain word forms, e.g. particular verb tenses.
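Restricting a general dictionary can be sketched as a simple filter over its entries. The `Entry` record, the pronunciation labels `"ek"`, `"ijek"` and `"both"`, and the three entries are hypothetical; a real dictionary would encode this information in its own way.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    form: str
    lemma: str
    pos: str
    features: str
    pronunciation: str  # assumed labels: "ek", "ijek", or "both"


# Toy dictionary: 'leta' is the genitive of both 'leto' (summer,
# Ekavian) and 'let' (flight, same in both pronunciations).
DICTIONARY = [
    Entry("leta", "leto", "N", "gen sg", "ek"),
    Entry("leta", "let", "N", "gen sg", "both"),
    Entry("ljeta", "ljeto", "N", "gen sg", "ijek"),
]


def restrict(entries, text_pronunciation):
    """Drop entries whose pronunciation is not used in the text."""
    return [e for e in entries
            if e.pronunciation in (text_pronunciation, "both")]


def lookup(entries, form):
    """Return all candidate entries for a word form."""
    return [e for e in entries if e.form == form]


print(len(lookup(DICTIONARY, "leta")))                    # full dictionary
print(len(lookup(restrict(DICTIONARY, "ijek"), "leta")))  # restricted
```

In this toy case the form leta has two candidate lemmas in the full dictionary but only one once the dictionary is restricted to an Ijekavian text.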

Ambiguity can be further reduced by local grammars, which can use different kinds of constraints. The first kind relies on the fact that certain sequences of word forms can always (or often) be interpreted unambiguously. For instance, the sequence no do koga can be interpreted unambiguously although each of its three word forms is ambiguous when regarded individually. The second kind is positional and relies on word order, which is fixed in certain cases (although Serbian is usually described as a language with free word order). For instance, at the beginning of a text unit such as a sentence, the word form mi ‘we’ can be interpreted unambiguously as the nominative case, whereas out of context it can also be the dative case of the pronoun ja ‘I’. The third kind relies on the agreement of word forms in certain syntactic constructions. Certain prepositions govern certain cases, so the noun phrase that follows must be in that case, which excludes other readings. Moreover, within a noun phrase adjectives must agree with the corresponding noun not only in case but also in number and gender. We investigate how the order in which local grammars are applied affects the quality of disambiguation.
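The positional and agreement constraints described above can be sketched as two small filtering rules. This is not Intex local-grammar syntax: the `Reading` record, the tag values, the two-entry preposition table, and the assumption that the prepositions themselves are already unambiguous are all simplifications made for illustration.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    lemma: str
    pos: str
    case: str    # hypothetical tags: "nom", "gen", "dat", "acc", or ""
    number: str  # "sg", "pl", or ""


# Assumed toy table: preposition -> cases it governs.
PREPOSITION_CASES = {"od": {"gen"}, "ka": {"dat"}}


def keep(readings, predicate):
    """Filter readings, but never discard all of them."""
    kept = [r for r in readings if predicate(r)]
    return kept or readings


def sentence_initial_rule(sentence):
    """Positional constraint: sentence-initial 'mi' is nominative."""
    if sentence and sentence[0][0].lower() == "mi":
        form, readings = sentence[0]
        sentence[0] = (form, keep(readings, lambda r: r.case == "nom"))
    return sentence


def preposition_rule(sentence):
    """Agreement constraint: the word after a preposition must be
    in a case that the preposition governs."""
    for i in range(len(sentence) - 1):
        allowed = PREPOSITION_CASES.get(sentence[i][0].lower())
        if allowed:
            form, readings = sentence[i + 1]
            sentence[i + 1] = (form, keep(readings,
                                          lambda r: r.case in allowed))
    return sentence


# 'Mi idemo od kuće' with simplified candidate readings.
sentence = [
    ("Mi", [Reading("mi", "PRO", "nom", "pl"),
            Reading("ja", "PRO", "dat", "sg")]),
    ("idemo", [Reading("ići", "V", "", "")]),
    ("od", [Reading("od", "PREP", "", "")]),
    ("kuće", [Reading("kuća", "N", "gen", "sg"),
              Reading("kuća", "N", "nom", "pl"),
              Reading("kuća", "N", "acc", "pl")]),
]

for rule in (sentence_initial_rule, preposition_rule):
    sentence = rule(sentence)

for form, readings in sentence:
    print(form, readings)
```

Keeping each constraint as a separate function makes it easy to vary the order of application, which is the kind of experiment the paragraph above refers to.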

Finally, we analyze the success rate of the applied methods for automatic disambiguation, as well as possibilities for improvement. In particular, we investigate what benefit can be expected from switching to NooJ.