In what follows, we present applied linguistics research that aims to increase the efficiency of an existing rule-based information extraction system by enhancing it with further grammatical knowledge. Our work concentrates on the NewsPro information extraction system [Prószéky2004], developed jointly by MorphoLogic Ltd., the Institute of Informatics at Szeged University, and the Linguistics Institute of the Hungarian Academy of Sciences. The system was developed and tested on a corpus of short business news.
NewsPro performs a shallow syntactic analysis on the input text, then matches pre-defined semantic patterns (so-called 'event frames') to the text. When a pattern matches, the slots of the event frame are filled with elements of the text, so the output identifies the main event of the news item as well as its participants and circumstances. Semantic patterns are centered around finite verbs, while their complements and adjuncts represent participants, circumstances and other additional information. Pattern matching is therefore based on the verb previously recognised as the predicate, and on its argument structure. This method relies on the assumption that in short news items it is always the verbal predicate that expresses the main event. Although this approach works in most cases, it has the disadvantage of omitting secondary information (frequently indicated as the cause or the antecedent of the main event) from pattern matching. The reason is that secondary information is represented grammatically by non-finite verbal forms such as participles or deverbal nouns. For example:
[A gyártók által tegnap bejelentett árcsökkentések] nyomán megnőtt a kereslet az új autók iránt.
Due to [the price cuts announced yesterday by the manufacturers], demand for new cars has increased.
In the sentence above, NewsPro is able to identify the main event (i.e. the increase in demand), but not the bracketed constituent, which expresses an earlier event, conceived as the cause of the main event. However, the user may be interested in learning about the antecedents and about the connection between the two pieces of information.
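The limitation described above can be pictured with a toy sketch of verb-centered frame matching. This is not NewsPro's actual frame format: the dictionary layout, the function name `match_frame`, the event label `INCREASE` and the slot name `theme` are all assumptions made for illustration. The point is only that a frame keyed on the finite verb fills its own slots and silently drops anything it does not mention, such as the constituent governed by 'nyomán'.

```python
from typing import Dict, Optional


def match_frame(frame: Dict, clause: Dict) -> Optional[Dict[str, str]]:
    """Fill the slots of a hypothetical event frame from a shallow-parsed clause.

    Returns None when the finite verb differs or a required role is missing.
    """
    if clause["verb"] != frame["verb"]:
        return None
    filled = {"event": frame["event"]}
    for role, slot in frame["slots"].items():
        if role not in clause["args"]:
            return None
        filled[slot] = clause["args"][role]
    return filled


# Hypothetical frame centered on the finite verb 'megnő' (increase).
frame = {"verb": "megnő", "event": "INCREASE", "slots": {"subject": "theme"}}

# Hand-built shallow analysis of the example sentence (an assumption, not parser output).
clause = {
    "verb": "megnő",
    "args": {
        "subject": "a kereslet az új autók iránt",
        "nyomán": "a gyártók által tegnap bejelentett árcsökkentések",
    },
}

result = match_frame(frame, clause)
# The 'nyomán' constituent is not a slot of the frame, so it is absent from result.
```

In this sketch the earlier event (the announced price cuts) never reaches the output, because no slot of the verb-centered frame refers to it.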
This phenomenon is to be handled by a preprocessing module within NewsPro. The module transforms input participial structures into complete sentences with a finite verb as their predicate. The further steps of processing, such as syntactic parsing and semantic pattern matching, can then run on the transformed sentences without any modification. Moreover, as Hungarian constituent order is relatively free, we expect the system to yield better results on automatically generated sentences, since their constituent order is uniformly SVO.
As a first step, we examined NP-internal participial structures, namely those containing a past participle. We assumed that these phrases can be transformed into sentences because a) the participle preserves the meaning of the base verb, and b) its arguments, or at least some of them, can be derived from the structure of the noun phrase containing it. As past participles express anteriority, the predicate of the output sentences has to be in the past tense.
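The transformation outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Intex grammars used in this work: the classes `ParticipialNP` and `Clause`, the function `to_clause` and the `[past]` feature notation are hypothetical, the input is assumed to be already analysed, and no Hungarian morphological generation is attempted. The head noun of the noun phrase becomes the object, the agent (for instance from an 'által' phrase) becomes the subject, and the base verb of the participle becomes a past-tense predicate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParticipialNP:
    head: str                      # head noun of the NP, patient of the base verb
    base_verb: str                 # lemma of the verb underlying the past participle
    agent: Optional[str] = None    # agent, e.g. recovered from an 'által' (by) phrase
    adjuncts: List[str] = field(default_factory=list)


@dataclass
class Clause:
    subject: Optional[str]
    predicate: str
    tense: str
    obj: str
    adjuncts: List[str]

    def constituents(self) -> List[str]:
        """Return constituents in fixed S-V-O-adjunct order, skipping a missing subject."""
        parts = [self.subject, f"{self.predicate}[{self.tense}]", self.obj, *self.adjuncts]
        return [p for p in parts if p is not None]


def to_clause(np: ParticipialNP) -> Clause:
    """Map a past-participial NP to a finite clause; the tense is always past (anteriority)."""
    return Clause(
        subject=np.agent,
        predicate=np.base_verb,
        tense="past",
        obj=np.head,
        adjuncts=list(np.adjuncts),
    )


# Hand-built analysis of the bracketed NP from the example sentence.
np_example = ParticipialNP(
    head="árcsökkentések",
    base_verb="bejelent",
    agent="a gyártók",
    adjuncts=["tegnap"],
)
clause = to_clause(np_example)
```

Here `clause.constituents()` yields the subject, the past-tense predicate, the object and the time adjunct in that order. A real module would additionally have to generate the inflected verb form and the case-marked object.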
The success of the preprocessing module depends not only on the grammatical well-formedness of the output but also on how informative the transformed sentences are. We attempted to develop an algorithm that filters presumably informative participial structures on the basis of grammatical information alone.
The transformational rules, as well as other tasks related to the preprocessing of the text, were implemented in Intex [Silberztein1993]. Its local grammars make extensive use of dictionaries, which have the advantage of encoding morphosyntactic and semantic information at a single level, so that this information is accessible at all levels of the analysis. This feature of the system was particularly useful for us, since our transformational grammars refer to the participle's base verb as well as to the verb's morphosyntactic features.
Prószéky, G.: Information Extraction from Short Business News Items. In: Alexin, Z., Csendes, D. (eds.): Proceedings of the Hungarian Computational Linguistics Conference. Szeged University Press, Szeged, 2003, pp. 161-167.
Silberztein, M.: Dictionnaires électroniques et analyse automatique de textes: Le système Intex. Masson, Paris, 1993.