Alex et al. (2008)

Bea Alex, Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin and Xinglong Wang (2008). "The ITI TXM Corpora: tissue expressions and protein-protein interactions". Proceedings of LREC'08 Workshop on . . .

Document selection (PPI corpus)

  • 12,704 full-text XML-encoded articles downloaded from PubMedCentral OpenAccess
  • 7,720 of these articles were selected automatically, since they contained at least one of the following words/phrases (typically associated with PPIs) - bind, complex, interact, apoptosis, ubiquitination, mitosis, nuclear envelope, cell cycle, phosphorylation, glycosylation, signal transduction, nuclear receptors
  • 213 of these articles were selected manually by domain experts, since they contained interactions that were experimentally proven within the paper
  • 133 articles were left over after excluding those used during piloting and those judged unsuitable for annotation
  • an additional 84 articles were harvested from PubMed in the same way (non-XML encoded), in order to ensure we had enough articles
  • the resulting 217 articles were split into training (133), development (39) and test (45) sets
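
The automatic keyword step in the second bullet above could be sketched roughly as follows. This is an illustration only: the function names `mentions_ppi_keyword` and `select_articles`, the case-insensitive substring matching, and the `(article_id, text)` input shape are assumptions, not the authors' actual selection code.

```python
# Keyword list as given in the PPI document selection notes above.
PPI_KEYWORDS = [
    "bind", "complex", "interact", "apoptosis", "ubiquitination",
    "mitosis", "nuclear envelope", "cell cycle", "phosphorylation",
    "glycosylation", "signal transduction", "nuclear receptors",
]


def mentions_ppi_keyword(text, keywords=PPI_KEYWORDS):
    """Return True if the text contains at least one keyword.

    Assumption: a case-insensitive substring match, so "bind" also
    matches "binding"; the original matching rule is not stated.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def select_articles(articles):
    """Keep the (article_id, text) pairs whose text mentions a keyword."""
    return [(aid, text) for aid, text in articles if mentions_ppi_keyword(text)]
```

Substring matching is the loosest reasonable reading of "contained at least one of the following words/phrases"; a token-based or stemmed match would select a somewhat different set of articles.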

Document selection (TE corpus)

  • 12,060 articles downloaded from PubMed containing one of the following words/phrases - gene expression regulation, signal transduction, protein biosynthesis, cell differentiation, apoptosis, mitosis, cell cycle, phosphorylation
  • these articles were randomised and the first 4,237 were examined by a domain expert; 1,600 were selected as containing mention of the presence or absence of mRNA or protein in any organism or tissue
  • the first 238 of these articles that were not used in testing or rejected by annotators made it into the final version of the corpus
  • the resulting 238 articles were split into training (151), development (41) and test (46) sets

Annotation (preprocessing)

  • texts were tokenised and sentence boundaries inserted
  • PPI corpus contains 75K sentences and 2.0M tokens
  • TE corpus contains 63K sentences and 1.9M tokens

Annotation (named entities)

PPI corpus:

  • proteins or other related entities involved in PPIs
    • Protein (89K instances)
    • Complex (8K)
    • Fusion (4K)
    • Fragment (13K)
    • Mutant (5K)
  • attributes of PPI relations

TE corpus:

  • entities involved in TE relations
    • Tissue (36K)
    • Protein (61K)
    • Complex (4K)
    • Fusion (1.5K)
    • Fragment (4K)
    • Mutant (2K)
    • Gene (12K)
    • mRNAcDNA (8K)
    • GOMOP [Gene or mRNAcDNA or Protein] (5K)
  • attributes of TE relations

Named entities are allowed to NEST but not CROSS. Named entities are normally CONTINUOUS, and any discontinuous ones must be marked as such with an XML attribute. For example, "A and B cells" is marked as two distinct named entities:

  • A and B cells [marked as discontinuous]
  • B cells
Annotators could override the pre-existing tokenisation using character offsets.
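
The "nest but not cross" constraint can be checked mechanically on character offsets. The sketch below is a minimal illustration, assuming entities are given as half-open `(start, end)` character spans; the function names are hypothetical and this is not the annotation tool's own validation code.

```python
def spans_cross(a, b):
    """Return True if two (start, end) spans cross.

    Spans are half-open: [start, end). Two spans cross when they
    overlap but neither one fully contains the other.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return overlap and not (a_in_b or b_in_a)


def find_crossing_pairs(spans):
    """List every pair of spans that violates the no-crossing rule."""
    return [
        (a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if spans_cross(a, b)
    ]


# In the text "A and B cells", the span for "A and B cells" is (0, 13)
# and the span for "B cells" is (6, 13): nested, so allowed.
text = "A and B cells"
outer = (0, len(text))
inner = (text.index("B cells"), len(text))
```

Here `spans_cross(outer, inner)` is false because the second span lies entirely inside the first, whereas a pair such as `(0, 7)` and `(6, 13)` would be reported as crossing.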

Certain types of named entity were normalised to one or more of the standard, publicly available biomedical databases:

Normalisation of protein, gene and mRNAcDNA entities involved TWO identifiers:
  • full normalisation - RefSeq/EntrezGene identifiers
  • species normalisation - NCBI Taxonomy identifiers
Note that the training portion of the PPI corpus was only species normalised, due to lack of time.

Annotation (relations)

PPI corpus:

  • PPI - interaction between two Proteins (11.5K instances) [InteractionWord is also marked]
  • FRAG - connects a Mutant/Fragment to its parent Protein (16K)

TE corpus:

  • TE - a Gene or gene product is expressed in a particular Tissue (12.4K) [ExpressionLevelWord is also marked]
  • CHILD-PARENT - connects a Mutant/Fragment with its parent Protein (4.7K)

Both intra- and inter-sentential relations were annotated.

Annotation (properties)

Added to PPI and TE relations to give more information:

  • IsPositive [Positive/Negative] - Is the relation affirmed or denied?
  • IsDirect [Direct/NonDirect] - Is the relation direct or not?
  • IsProven [Proven/Referenced/Unspecified] - Is the relation experimentally proven in the article?
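
One way to picture a relation carrying these three properties is as a small validated record. The sketch below is only an illustration: the class name `Relation`, its field names, and the example arguments are hypothetical, and the corpus itself stores this information in XML rather than in Python objects.

```python
from dataclasses import dataclass

# Allowed values, copied from the property list above.
ALLOWED_VALUES = {
    "is_positive": {"Positive", "Negative"},
    "is_direct": {"Direct", "NonDirect"},
    "is_proven": {"Proven", "Referenced", "Unspecified"},
}


@dataclass(frozen=True)
class Relation:
    rel_type: str     # "PPI" or "TE"
    arg1: str         # first participant (hypothetical identifier)
    arg2: str         # second participant (hypothetical identifier)
    is_positive: str
    is_direct: str
    is_proven: str

    def __post_init__(self):
        # Reject any property value outside the documented set.
        for field_name, allowed in ALLOWED_VALUES.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )


# A denied, indirect interaction that the article only cites.
example = Relation("PPI", "protein_1", "protein_2",
                   "Negative", "NonDirect", "Referenced")
```

Keeping the allowed values in one table means a mistyped property such as "Indirect" fails at construction time rather than surfacing later as an unexpected category.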

Annotation (attributes)

Named links from relations to other entities:

  • PPI corpus:
  • TE corpus:
    • te_rel_ent-drug-compound (any drug compound applied)
    • te_rel_ent-exp-method1 (the method used to detect the expression participants)
    • te_rel_ent-disease (any disease affecting the tissue)
    • te_rel_ent-dev-stage (the developmental stage of the tissue)
    • te_rel_ent-expr-word (a term indicating the level of expression)

Annotation (process)

-- MarkMcConville - 28 Aug 2008

Topic revision: r3 - 29 Aug 2008 - 09:31:45 - MarkMcConville