Shallow versus deep syntactic dependencies for relation extraction tasks

Assume that the syntactic representation of an English sentence (as determined by an automatic parser) is a 'dependency structure': a rooted, acyclic, directed graph whose nodes are a subset of the words in the input sentence, and whose edges are labelled with symbols from a vocabulary of grammatical relations such as "subject" and "object". What additional properties should dependency structures have in order to provide more effective input to a typical relation extraction system?

The aim of this project is to evaluate three systems of syntactic dependency representation for English, which differ in the relative "depth" of the underlying linguistic analyses. The evaluation will measure how effective each representation format is as input to typical relation extraction tasks.

Syntactic dependency representation systems

Recent years have seen growing research interest in syntactic parsers which output, not labelled bracketings corresponding to syntactic phrase structure trees, but rather sets of labelled dependencies between heads and dependents. It has been argued that parser output based on syntactic dependencies is a better option for two main reasons: (a) this format is more theory-neutral, allowing a more level playing field for parser evaluation; and (b) syntactic dependencies are more appropriate for information extraction tasks than labelled bracketings, since they are closer to the underlying predicate-argument structure.

A number of different systems of syntactic dependency representation have been proposed for English, which can be seen as varying according to the "depth" of the linguistic analyses they presuppose. The most basic, "surface-y" systems assume that the syntactic representation of a sentence constitutes a tree - in other words, every word (apart from the one that functions as the "root") is a dependent of exactly one other word (e.g. Link parser, Minipar, CoNLL shared tasks 2006-2008). More sophisticated systems allow for "reentrant" structures, where a word may simultaneously be a dependent of two distinct heads, allowing for a better analysis of phenomena like control, relativisation and coordination (e.g. Stanford Typed Dependencies, RASP grammatical relations). McConville and Dzikovska (2008) argue in favour of taking this trend to its logical conclusion - a system of "deep" syntactic dependencies, involving full normalisation of well-known syntactic alternations such as passive, dative shift and the distinctions between meaningful and non-meaningful prepositions, and between predicative and attributive adjectives.

In this project, three distinct syntactic dependency representation systems will be evaluated:

  • CoNLL dependencies (unordered trees)
  • Stanford typed dependencies (limited reentrancy/normalisation)
  • Deep syntactic dependencies (full reentrancy/normalisation)

Let's assume that dependency structures are rooted and acyclic. We can then experiment with the presence/absence of the following additional structural constraints (keeping issues of vocabulary constant):

  • projective (relevant to discontinuous dependencies)
  • non-reentrant (relevant to control and verb coordination)
  • connected (relevant to whether function words and expletives can be ignored)
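
As a rough illustration of the three constraints above, the following Python sketch checks each one on a small dependency structure. The edge encoding (head, dependent, label) over 1-based word positions, the artificial root at position 0, the label names, and the precise definitions used (projective = no crossing arcs; connected = every word reachable from the root) are all assumptions made for this example, not part of any of the three representation systems.

```python
from collections import defaultdict

def is_non_reentrant(edges):
    """True if no word is a dependent of more than one head."""
    heads = defaultdict(set)
    for head, dep, _label in edges:
        heads[dep].add(head)
    return all(len(h) <= 1 for h in heads.values())

def is_projective(edges):
    """True if no two arcs cross when drawn above the sentence."""
    spans = [tuple(sorted((head, dep))) for head, dep, _label in edges]
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    return True

def is_connected(edges, n_words):
    """True if every word 1..n_words is reachable from the root (position 0)."""
    children = defaultdict(list)
    for head, dep, _label in edges:
        children[head].append(dep)
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen >= set(range(1, n_words + 1))

# "John wants to leave": 1=John, 2=wants, 3=to, 4=leave (labels are illustrative)
TREE = [(0, 2, "root"), (2, 1, "subj"), (2, 4, "xcomp"), (4, 3, "aux")]
# Control analysis: "John" is also the subject of "leave"
REENTRANT = TREE + [(4, 1, "subj")]
# Function word "to" left out of the structure
NO_FUNCTION_WORDS = [(0, 2, "root"), (2, 1, "subj"), (2, 4, "xcomp")]

print(is_non_reentrant(TREE), is_non_reentrant(REENTRANT))
print(is_connected(TREE, 4), is_connected(NO_FUNCTION_WORDS, 4))
print(is_projective(TREE))
```

Checks of this kind could be used to confirm which constraints a given parser output actually satisfies before it is fed to a relation extractor.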

Then, we can experiment with vocabulary issues:

  • normalisation of non-canonical constructions (bounded, e.g. passives, control, dative shift; unbounded, e.g. relative clauses, topicalisation, tough-movement)
  • distinction between raising and control
  • conjunctions
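
To make the idea of normalising a non-canonical construction concrete, here is a minimal Python sketch for the passive. It assumes edges are (head lemma, dependent, label) triples with Stanford-style label names (nsubj, dobj, nsubjpass, auxpass, agent); the mapping itself is a hypothetical simplification for illustration, not the scheme proposed by McConville and Dzikovska (2008).

```python
# Hypothetical relabelling: passive-clause labels -> active-voice counterparts
PASSIVE_MAP = {"nsubjpass": "dobj", "agent": "nsubj"}

def normalise_passive(edges):
    """Relabel passive dependencies as their active counterparts.

    The passive auxiliary (label 'auxpass') carries no argument
    information in this sketch, so its edge is dropped.
    """
    normalised = []
    for head, dep, label in edges:
        if label == "auxpass":
            continue
        normalised.append((head, dep, PASSIVE_MAP.get(label, label)))
    return normalised

# "John saw Mary" (active) versus "Mary was seen by John" (passive),
# with heads given as lemmas so the two analyses are comparable
ACTIVE = [("see", "John", "nsubj"), ("see", "Mary", "dobj")]
PASSIVE = [("see", "Mary", "nsubjpass"), ("see", "was", "auxpass"),
           ("see", "John", "agent")]

print(set(normalise_passive(PASSIVE)) == set(ACTIVE))
```

After normalisation, the active and passive sentences yield the same set of dependencies, which is the kind of property that might make a "deeper" representation more useful to a relation extractor.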

Relation extraction tasks

[Short introduction to relation extraction]

Three distinct relation extraction tasks will be undertaken, from three contrasting domains:

  • biomedical - protein-protein interactions and tissue expressions in the ITI TXM corpora
  • educational - some relation extraction task using the Beetle corpus?
  • cultural heritage - some relation extraction task using Kate Byrne's corpus?

Workplan

Given some corpus C which has already been annotated with respect to named entities and relations:

  1. annotate C for CoNLL dependencies, Stanford typed dependencies and deep syntactic dependencies
  2. train three relation extractors on C's training set, using each of the syntactic dependency annotations as features
  3. determine the accuracy of each of the three relation extractors on C's test set
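
The comparison in steps 2 and 3 might be organised along the following lines. This is only a sketch under strong assumptions: each example is a dict holding a gold relation label and one precomputed feature string per dependency scheme, and the "extractor" is a trivial lookup table standing in for whatever learner is eventually chosen. All names (`train_extractor`, `compare_schemes`, the scheme keys) are hypothetical.

```python
from collections import Counter, defaultdict

SCHEMES = ["conll", "stanford", "deep"]

def train_extractor(examples, scheme):
    """Memorise the most frequent relation label per feature value."""
    counts = defaultdict(Counter)
    for ex in examples:
        counts[ex["features"][scheme]][ex["relation"]] += 1
    # Back off to the overall majority label for unseen feature values
    fallback = Counter(ex["relation"] for ex in examples).most_common(1)[0][0]
    table = {feat: c.most_common(1)[0][0] for feat, c in counts.items()}
    return lambda ex: table.get(ex["features"][scheme], fallback)

def accuracy(predict, examples):
    """Fraction of examples whose predicted relation matches the gold label."""
    correct = sum(predict(ex) == ex["relation"] for ex in examples)
    return correct / len(examples)

def compare_schemes(train, test, schemes=SCHEMES):
    """Train one extractor per dependency scheme and score it on the test set."""
    return {s: accuracy(train_extractor(train, s), test) for s in schemes}
```

Calling `compare_schemes(train, test)` on the training and test portions of C would return one accuracy figure per representation, which is the quantity the workplan sets out to compare.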

References

Mark McConville and Myroslava O. Dzikovska (2008). 'Deep' Grammatical Relations for Semantic Interpretation. In: Proceedings of the COLING'08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.

-- MarkMcConville - 12 Aug 2008

Topic revision: r5 - 26 Aug 2008 - 14:55:23 - MarkMcConville
 