Building a lexicon-grammar of frozen sentences of Portuguese: The inheritance problem revisited
Graça Fernandes (Universidade do Algarve, Portugal)
A. Correia (Universidade do Algarve, Portugal)
and Jorge Baptista (Universidade do Algarve, Portugal, and L2F – Spoken Language Systems Laboratory, Inesc-ID Lisboa, Lisbon, Portugal)
(jbaptis@ualg.pt)
Abstract
For some time, a small team of linguists has been building a lexicon-grammar of frozen sentences (or idioms) of European Portuguese (Baptista et al. 2004), based on the theoretical and methodological principles of M. Gross (1982, 1996). At its current stage, this lexicon-grammar contains over 3,500 frequently used frozen sentences, belonging to several formal classes, along with a detailed description of their syntactic properties. Before it can be applied to real, large corpora, several linguistic issues must be addressed, including the (pre-)processing of compound tenses (M. Gross 1999a, b), insertions, word-order constraints, pronouns and negation, to name just a few. Many of these aspects are dealt with in Intex v. 4.33 (Silberztein 2004: 180-190) by building lexical finite-state transducers from the lexicon-grammar tables, by way of a master graph that describes the sentence forms resulting from these phenomena.
In this paper, we briefly present the current state of the lexicon-grammar of Portuguese frozen sentences and then focus on the problem of inheritance of morphological features in the lexical analysis of these complex forms. Tagging frozen sentences requires assigning them inflection values in much the same way as for simple verbs. The problem becomes more complicated with compound tenses, where some form of calculus of the complex verb is required (Silberztein 2004: 189-190). For languages with a limited number of auxiliary constructions, compound tenses may not constitute a major problem. Portuguese, however, has a large number of auxiliaries that show non-trivial, complex combinatorial patterns, making it necessary to deal with them before, and independently of, the lexical analysis of idioms. In other cases, it may also be useful to state in the idiom’s tag certain inflection values resulting from long-distance constraints. For example, the gender-number values of some words in the frozen sentence, which depend on the gender-number values of its subject, may serve as a clue for establishing the subject-verb relation during syntactic analysis, especially in the context of anaphora resolution.
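To make the inheritance problem concrete, the following is a minimal Python sketch, not the Intex implementation. It assumes tokens that already carry a lemma and an inflection code, reduces an auxiliary chain placed before the idiom's verb, and lets the idiom tag inherit the inflection of the finite (first) verb. The feature codes, the two-entry auxiliary table, the `Token` class and the `tag_idiom` function are all hypothetical and chosen only for illustration; the idiom used is "bater as botas" ('to die').

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    feats: str  # hypothetical inflection code, e.g. "P3s" = present, 3rd person singular


# Hypothetical toy table: (auxiliary lemma, code of the next verb form) -> value label.
# Real Portuguese auxiliary patterns are far more numerous and complex.
AUX_PATTERNS = {
    ("ter", "K"): "perfect",     # ter + past participle (code "K" is assumed)
    ("ir", "W"): "prospective",  # ir + infinitive (code "W" is assumed)
}

# Hypothetical idiom entry: (verb lemma, frozen complement forms) -> idiom lemma.
IDIOMS = {("bater", ("as", "botas")): "bater as botas"}


def tag_idiom(tokens):
    """Return (idiom lemma, inherited features), or None if no idiom matches."""
    if not tokens:
        return None
    finite = tokens[0]  # the first verb carries tense and person-number
    labels = []
    i = 0
    # Consume the auxiliary chain, one auxiliary at a time.
    while i + 1 < len(tokens):
        label = AUX_PATTERNS.get((tokens[i].lemma, tokens[i + 1].feats))
        if label is None:
            break
        labels.append(label)
        i += 1
    verb = tokens[i]
    complement = tuple(t.form for t in tokens[i + 1:])
    idiom = IDIOMS.get((verb.lemma, complement))
    if idiom is None:
        return None
    # The idiom tag inherits the finite verb's inflection plus the auxiliary values.
    feats = "+".join([finite.feats] + labels)
    return idiom, feats
```

In this sketch, "vai bater as botas" and "bateu as botas" both receive the lemma "bater as botas", but the first inherits its inflection from the auxiliary "vai" rather than from the infinitive "bater". Long-distance values, such as gender-number agreement with the subject, would need additional fields that are not modelled here.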
References
Baptista, J.; A. Correia; G. Fernandes. 2004. Frozen Sentences of Portuguese: Formal Descriptions for NLP. Workshop on Multiword Expressions: Integrating Processing, International Conference of the European Chapter of the Association for Computational Linguistics, Barcelona (Spain), July 26, 2004. ACL: Barcelona, pp. 72-79.
Gross, M. 1982. Une classification des phrases figées du français. Revue Québécoise de Linguistique 11-2: 151-185. Montréal: UQAM.
Gross, M. 1996. Lexicon-Grammar. In Brown, K.; Miller, J. (eds.), Concise Encyclopedia of Syntactic Theories. Oxford: Pergamon Press, pp. 224-259.
Gross, M. 1999. Lemmatization of Compound Tenses in English. In C. Fairon (ed.), Analyse Lexicale et Syntaxique: Le système INTEX, Lingvisticae Investigationes 22: 71-122. John Benjamins Pub. Co.
Silberztein, M. 2004. Intex Manual [v. 4.33]. http://msh.univ-fcomte.fr/intex/downloads/Manual.pdf