Building a lexicon-grammar of frozen sentences of Portuguese: The inheritance problem revisited

 

Graça Fernandes (Universidade do Algarve, Portugal)

A. Correia (Universidade do Algarve, Portugal)

and Jorge Baptista (Universidade do Algarve, Portugal, and LF – Spoken Language Systems Laboratory, Inesc-ID Lisboa, Lisbon, Portugal)

(jbaptis@ualg.pt)

 

Abstract

For some time, a small team of linguists has been building a lexicon-grammar of frozen sentences (or idioms) of European Portuguese (Baptista et al. 2004), based on the theoretical and methodological principles of M. Gross (1982, 1996). At its current stage, this lexicon-grammar contains over 3,500 frequently used frozen sentences, belonging to several formal classes, along with a detailed description of their syntactic properties. Before these descriptions can be applied to real, large-sized corpora, several linguistic issues must be addressed, including the (pre-)processing of compound tenses (M. Gross 1999a, b), insertions, word-order constraints, pronouns and negation, to name but a few. Many of these aspects are dealt with in Intex v. 4.33 (Silberztein 2004: 180-190) by building lexical finite-state transducers from the lexicon-grammar tables, by way of a master graph that describes the sentence forms resulting from these phenomena.

In this paper, we briefly present the current state of the lexicon-grammar of Portuguese frozen sentences, and then focus on the problem of inheritance of morphological features in the lexical analysis of these complex forms. Tagging frozen sentences implies assigning them inflection values in much the same way as simple verbs. The problem becomes more complicated with compound tenses, where some form of calculus of the complex verb is required (Silberztein 2004: 189-190). For languages with a limited number of auxiliary constructions, compound tenses may not constitute a major problem. Portuguese, however, has a large number of auxiliaries which show non-trivial, complex combinatorial patterns, making it necessary to deal with them before, and independently of, the lexical analysis of idioms. In other cases, it may also be useful to state in the idiom's tag certain inflection values resulting from long-distance constraints. For example, the gender-number values of some words in the frozen sentence, which depend on the gender-number values of its subject, may function as a clue to establish the subject-verb relation during syntactic analysis, especially in the context of anaphora resolution.
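The inheritance problem can be illustrated with a minimal sketch. It is not the Intex mechanism nor the authors' implementation: the Token class, the IDIOMS mini-lexicon, the tag_idiom function and the inflection codes ("I3s", "J3s", "K") are all hypothetical, and the idiom bater as botas is used only as an example. The sketch assumes that auxiliaries have already been identified in a prior step, and shows the idiom tag inheriting its inflection value from the leftmost (finite) auxiliary of a compound tense rather than from the idiom's own verb form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    lemma: str
    pos: str        # toy part-of-speech label: "AUX", "V", "DET", "N"
    infl: str = ""  # toy inflection code (hypothetical), e.g. "I3s"


# Hypothetical mini-lexicon: idiom -> sequence of lemmas, verb first.
IDIOMS = {"bater as botas": ["bater", "o", "bota"]}


def tag_idiom(tokens: List[Token], idiom: str) -> Optional[dict]:
    """Find the idiom and build a tag that inherits inflection values.

    If the idiom's verb is preceded by a contiguous chain of auxiliaries,
    the inflection value is taken from the leftmost auxiliary; otherwise
    it is taken from the idiom's own verb.
    """
    lemmas = IDIOMS[idiom]
    n = len(lemmas)
    for i in range(len(tokens) - n + 1):
        if [t.lemma for t in tokens[i:i + n]] != lemmas:
            continue
        # Walk leftwards over the auxiliaries that precede the verb.
        j = i
        while j > 0 and tokens[j - 1].pos == "AUX":
            j -= 1
        aux_chain = tokens[j:i]
        source = aux_chain[0] if aux_chain else tokens[i]
        return {
            "lemma": idiom,
            "infl": source.infl,
            "aux": [t.lemma for t in aux_chain],
            "span": (j, i + n),
        }
    return None


# "tinha batido as botas": compound tense with the auxiliary ter.
compound = [
    Token("tinha", "ter", "AUX", "I3s"),
    Token("batido", "bater", "V", "K"),
    Token("as", "o", "DET", "fp"),
    Token("botas", "bota", "N", "fp"),
]
compound_tag = tag_idiom(compound, "bater as botas")
```

In this toy setting the tag for the compound form carries the auxiliary's value ("I3s") and records the auxiliary lemma, while a simple form such as bateu as botas would keep the value of the verb itself. A realistic treatment would also have to handle insertions, clitic pronouns and negation between the auxiliaries and the idiom, which the sketch ignores.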

 

References

Baptista, J.; A. Correia; G. Fernandes. 2004. Frozen Sentences of Portuguese: Formal Descriptions for NLP. Workshop on Multiword Expressions: Integrating Processing, International Conference of the European Chapter of the Association for Computational Linguistics, Barcelona (Spain), July 26, 2004. ACL: Barcelona, pp. 72-79.

Gross, M. 1982. Une classification des phrases figées du français. Revue Québécoise de Linguistique 11-2: 151-185. Montréal: UQAM.

Gross, M. 1996. Lexicon-Grammar. In Brown, K.; Miller, J. (eds.), Concise Encyclopedia of Syntactic Theories. Oxford: Pergamon Press, pp. 224-259.

Gross, M. 1999. Lemmatization of Compound Tenses in English. In C. Fairon (ed.), Analyse Lexicale et Syntaxique: Le système INTEX, Lingvisticae Investigationes 22: 71-122. John Benjamins Pub. Co.

Silberztein, M. 2004. Intex Manual [v. 4.33]. http://msh.univ-fcomte.fr/intex/downloads/Manual.pdf