225 biomedical paper abstracts, taken from Medline, with 1,970 sentences. 200 were known to mention protein-protein interactions (PPIs); the remaining 25 were known not to.

The abstracts are sentence-split, tokenised, and annotated with proteins and PPIs.


Directory structure:

> proteins
    > abstract1-1
    > abstract9-242   [747 files in total]
> interactions
    > abstract_11780382
    > abstract_11815670   [28 files in total]
    > abstract_for_10074428
    > abstract_for_9427624   [197 files in total]
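
As a quick sanity check on the layout above, the following sketch counts the files in each group by name. The regular expressions are inferred from the example names in the listing only, and `count_corpus_files` and its `root` argument are hypothetical helpers, not part of the corpus distribution.

```python
import re
from collections import Counter
from pathlib import Path

# Filename patterns inferred from the listing above (an assumption);
# adjust them if the actual files carry extensions or other prefixes.
PATTERNS = {
    "proteins": re.compile(r"^abstract\d+-\d+$"),
    "interactions_with_ppi": re.compile(r"^abstract_for_\d+$"),
    "interactions_without_ppi": re.compile(r"^abstract_\d+$"),
}


def classify(name: str) -> str:
    """Return the group a filename belongs to, or 'unknown'."""
    for label, pattern in PATTERNS.items():
        if pattern.match(name):
            return label
    return "unknown"


def count_corpus_files(root: str) -> Counter:
    """Count files per group under root/proteins and root/interactions."""
    counts = Counter()
    for sub in ("proteins", "interactions"):
        directory = Path(root) / sub
        if not directory.is_dir():
            continue
        for path in directory.iterdir():
            if path.is_file():
                counts[classify(path.name)] += 1
    return counts
```

If the listing is accurate, `count_corpus_files` should report 747 files under proteins, and 197 and 28 files for the two groups under interactions.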

Structure of proteins files

Each abstract file in the proteins directory is structured as follows:

> ArticleTitle
> AbstractText
Both elements consist of PCDATA where strings denoting proteins are included within 'prot' elements. Note that these are recursive:
  <prot><prot>p38</prot> stress-activated protein kinase</prot> inhibitor reverses
  <prot><prot>bradykinin B(1)</prot> receptor</prot>-mediated component of inflammatory hyperalgesia.
Where annotators have been unsure as to whether a given string denotes a protein or not, they have used an element called '?'. To make the files valid XML, I have changed this to an element called 'x' (as well as adding the root element 'Abstract').
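
Because the files are valid XML, the nested annotations can be read with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` to list every 'prot' and 'x' element together with its full text. The sample string is the title of abstract 1-1 written out by hand, and it assumes that 'ArticleTitle' is a direct child of the 'Abstract' root; `protein_mentions` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def protein_mentions(element):
    """Yield (tag, text) for each 'prot' or 'x' element, in document order.

    An outer element is yielded before the elements nested inside it, and
    its text includes the text of those nested elements.
    """
    for node in element.iter():
        if node.tag in ("prot", "x"):
            # itertext() walks nested elements too; collapse whitespace.
            text = " ".join("".join(node.itertext()).split())
            yield node.tag, text


# Hand-written sample mirroring the title of abstract 1-1 (layout assumed).
sample = (
    "<Abstract><ArticleTitle>"
    "<prot><prot>p38</prot> stress-activated protein kinase</prot> inhibitor reverses "
    "<prot><prot>bradykinin B(1)</prot> receptor</prot>-mediated component of "
    "inflammatory hyperalgesia.</ArticleTitle></Abstract>"
)

root = ET.fromstring(sample)
mentions = list(protein_mentions(root.find("ArticleTitle")))
# mentions[0] == ("prot", "p38 stress-activated protein kinase")
# mentions[1] == ("prot", "p38")
```

Uncertain annotations come back with the tag 'x', so a caller can keep or discard them as needed.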

Here is a complete example abstract (1-1), with protein names in square brackets:

[[p38] stress-activated protein kinase] inhibitor reverses [[bradykinin B(1)] receptor]-mediated 
component of inflammatory hyperalgesia.

The effects of a [[p38] stress-activated protein kinase] inhibitor, 
4-(4-fluorophenyl)-2-(-4-methylsulfonylphenyl)-5-(4-pyridynyl) imidazole (SB203580), were 
evaluated in a rat model of inflammatory hyperalgesia. 
Oral, but not intrathecal, administration of SB203580 significantly reversed inflammatory 
mechanical hyperalgesia induced by injection of complete Freund's adjuvant into the hindpaw. 
SB203580 did not, however, affect the increased levels of [interleukin-1beta] and 
[cyclo-oxygenase 2 protein] observed in the hindpaw following complete Freund's adjuvant injection.
Intraplantar injection of [interleukin-1beta] into the hindpaw elicited mechanical hyperalgesia in 
the ipsilateral paw, as well as in the contralateral paw, following intraplantar injection of the 
[[bradykinin B(1)] receptor] agonist [des-Arg(9)-bradykinin]. 
Oral administration of SB203580 1 h prior to [interleukin-1beta] administration prevented the 
development of hyperalgesia in the ipslateral paw and the contralateral 
[[bradykinin B(1)] receptor]-mediated hyperalgesia. 
In addition, following [interleukin-1beta] injection into the ipsilateral paw, co-administration 
of SB203580 with [des-Arg(9)-bradykinin] into the contralateral paw inhibited the 
[[bradykinin B(1)] receptor]-mediated hyperalgesia.
In human embryonic kidney 293 cells expressing the human [[bradykinin B(1)] receptor], its agonist
[des-Arg(10)-kallidin] produced a rapid phosphorylation of endogenous 
[[p38] stress-activated protein kinase]. 
Our data suggest that [[p38] stress-activated protein kinase] is involved in the development of 
inflammatory hyperalgesia in the rat, and that its pro-inflammatory effects involve the induction 
of the [[bradykinin B(1)] receptor] as well as functioning as its downstream effector.
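
The square-bracket rendering used in this example could be produced from the XML along the following lines. This is a minimal sketch: `to_brackets` is a hypothetical helper, and it leaves the content of 'x' elements unbracketed, which is a choice made here rather than something the corpus documentation specifies.

```python
import xml.etree.ElementTree as ET


def to_brackets(element):
    """Render an element's text, wrapping each 'prot' element in [ ]."""
    parts = [element.text or ""]
    for child in element:
        inner = to_brackets(child)  # recurse so nested proteins nest too
        if child.tag == "prot":
            inner = "[" + inner + "]"
        parts.append(inner)
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


title = ET.fromstring(
    "<ArticleTitle><prot><prot>p38</prot> stress-activated protein kinase</prot>"
    " inhibitor</ArticleTitle>"
)
rendered = to_brackets(title)
# rendered == "[[p38] stress-activated protein kinase] inhibitor"
```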

Structure of interactions files

The 28 files named 'abstract_N':

Each sentence is on its own line. Protein names are tagged in 'prot' elements as discussed above. No other structural elements appear. These files do not contain any PPIs.

The 197 files named 'abstract_for_N':

Again, each sentence is on its own line, and protein names are tagged in 'prot' elements. Protein names are additionally wrapped in <p1 pair=n> or <p2 pair=n> elements, according to whether the protein has been annotated as the first or the second protein, respectively, in the PPI uniquely identified by 'n'. Presumably, each of these files has at least one such PPI marked.
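
Since the 'pair' attribute is shown without quotes, these files may not parse as XML, so the sketch below scans a sentence with regular expressions instead. It assumes the tags look like `<p1 pair=N>...</p1>` and `<p2 pair=N>...</p2>` and that they are properly closed. `extract_pairs` is a hypothetical helper, and the sample sentence is invented for illustration rather than taken from the corpus.

```python
import re
from collections import defaultdict

# Matches opening and closing p1, p2 and prot tags (tag syntax is assumed).
TAG = re.compile(r"<\s*(/?)\s*(p1|p2|prot)\b([^>]*)>")
# Matches pair=N with or without quotes around N.
PAIR = re.compile(r'pair\s*=\s*"?(\w+)"?')


def extract_pairs(sentence):
    """Return {pair_id: {"p1": text, "p2": text}} for one annotated sentence."""
    pairs = defaultdict(dict)
    stack = []  # open p1/p2 elements as [role, pair_id, text pieces]
    pos = 0
    for match in TAG.finditer(sentence):
        text = sentence[pos:match.start()]
        for frame in stack:  # text belongs to every element still open
            frame[2].append(text)
        pos = match.end()
        closing, name, attrs = match.groups()
        if name == "prot":
            continue  # prot tags only delimit names; nothing to record
        if not closing:
            found = PAIR.search(attrs)
            stack.append([name, found.group(1) if found else None, []])
        else:
            # Close the most recently opened element with the same role.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == name:
                    role, pair_id, pieces = stack.pop(i)
                    pairs[pair_id][role] = " ".join("".join(pieces).split())
                    break
    return dict(pairs)


# Invented example sentence, not corpus data.
example = ("We show that <p1 pair=1><prot>IL-2</prot></p1> binds "
           "<p2 pair=1><prot>IL-2 receptor</prot></p2>.")
result = extract_pairs(example)
# result == {"1": {"p1": "IL-2", "p2": "IL-2 receptor"}}
```

Pair identifiers are kept as strings, and a protein that takes part in several PPIs is handled through the stack, since each enclosing p1 or p2 element collects the same text.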


R. Bunescu and R. J. Mooney (2004). "Collective information extraction with relational Markov networks". Proceedings of ACL'04.


Used in: Yusuke Miyao, Rune Saetre, Kenji Sagae, Takuya Matsuzaki and Jun'ichi Tsujii (2008). "Task-oriented evaluation of syntactic parsers and their representations". Proceedings of ACL'08.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan and Alberto Lavelli (2006). "Investigating a generic paraphrase-based approach for relation extraction". Proceedings of EACL'06.

Claudio Giuliano, Alberto Lavelli and Lorenza Romano (2006). "Exploiting shallow linguistic information for relation extraction from biomedical literature". Proceedings of EACL'06.


  • formulated a list of lexico-syntactic templates which are commonly instantiated by PPIs, along with an associated normalised form
  • took the AImed corpus and eliminated all abstracts which do not contain at least one annotated PPI
  • eliminated all markings of protein references which are embedded within some other protein reference, as well as all PPIs which refer to one of the eliminated protein references
  • transformed all remaining protein names into the form "ProtN" where N is a number
  • eliminated all annotated PPIs which are reflexive
  • divided the remaining abstracts into a development set (60%) and a test set (40%)
  • two annotators looked at each of the 575 surviving annotated PPIs in the development set and: (a) determined whether or not the PPI instantiates one of the pre-identified lexico-syntactic templates (kappa = 0.85); and (b) if so, noted the normalised form of the relevant template (agreement = 0.96)
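
The filtering steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation: the `Mention` class, the function names and the character-offset representation are all hypothetical. "Reflexive" is interpreted here as a PPI whose two arguments are the same mention, "ProtN" numbering is assumed to follow textual order, and the random split with a fixed seed is a stand-in, since the text does not say how the 60/40 division was made.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A protein reference as a character span (hypothetical representation)."""
    start: int
    end: int
    text: str


def drop_embedded(mentions):
    """Keep only mentions that are not contained in another mention."""
    def contains(outer, inner):
        return (outer != inner
                and outer.start <= inner.start
                and inner.end <= outer.end)
    return [m for m in mentions if not any(contains(o, m) for o in mentions)]


def filter_ppis(ppis, kept):
    """Drop PPIs that use a removed mention, and reflexive PPIs."""
    kept_set = set(kept)
    return [(a, b) for a, b in ppis
            if a in kept_set and b in kept_set and a != b]


def normalise_names(mentions):
    """Map each mention to 'ProtN', numbered from 1 in textual order."""
    ordered = sorted(mentions, key=lambda m: (m.start, m.end))
    return {m: f"Prot{i}" for i, m in enumerate(ordered, start=1)}


def split_dev_test(abstract_ids, dev_fraction=0.6, seed=0):
    """Shuffle abstract ids reproducibly and split them dev/test."""
    ids = sorted(abstract_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * dev_fraction)
    return ids[:cut], ids[cut:]


# Toy data: one nested mention pair and one separate mention.
outer = Mention(0, 35, "p38 stress-activated protein kinase")
inner = Mention(0, 3, "p38")
other = Mention(50, 74, "bradykinin B(1) receptor")

kept = drop_embedded([outer, inner, other])
ppis = filter_ppis([(inner, other), (outer, other), (outer, outer)], kept)
names = normalise_names(kept)
dev, test = split_dev_test(range(10))
```

In the toy data, the embedded mention "p38" is removed along with the PPI that refers to it, the self-interaction is discarded, and the two surviving mentions become "Prot1" and "Prot2".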

