  • Software review

Protein-protein interaction databases: keeping up with growing interactomes

Abstract

Over the past few years, the number of known protein-protein interactions has increased substantially. To make this information more readily available, a number of publicly available databases have set out to collect and store protein-protein interaction data. Protein-protein interactions have been retrieved from six major databases, integrated and the results compared. The six databases (the Biological General Repository for Interaction Datasets [BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct] and the Human Protein Reference Database [HPRD]) differ in scope and content; integration of all datasets is non-trivial owing to differences in data annotation. With respect to human protein-protein interaction data, HPRD seems to be the most comprehensive. To obtain a complete dataset, however, interactions from all six databases have to be combined. To overcome this limitation, meta-databases such as the Agile Protein Interaction Database (APID) offer access to integrated protein-protein interaction datasets, although these also currently have certain restrictions.

The nature of protein-protein interaction data

Proteins do not act independently but in a network of complex molecular interactions. Therefore, it is important to identify physical interactions between proteins. Different experimental techniques have been developed to measure physical interactions between proteins; these methods vary considerably, not least in terms of the data they produce.

To give some examples, two widely used methods adapted for high-throughput approaches are the yeast two-hybrid (Y2H) system [1] and affinity purification followed by mass spectrometry (AP-MS) [2].

The Y2H system assays whether two proteins physically interact with each other (Figure 1). Genetically modified yeast strains are used to express a 'bait' and a 'prey' protein, which, if they interact, trigger the expression of a reporter gene. The method has been used for large-scale screening studies of a variety of model organisms, including yeast, fly and humans.

Figure 1

The yeast two-hybrid experiment. The yeast two-hybrid system utilises the DNA binding domain and the activation domain of a yeast transcription factor. The bait proteins A and C are fused to a DNA binding domain, allowing them to bind to a binding site in the promoter region of a reporter gene. The prey protein B is fused to an activation domain that can activate the expression of a gene. (a) Protein B does not bind to protein A, therefore, the reporter gene is not activated. (b) Protein B binds to protein C, thereby activating the reporter gene.

In an AP-MS experiment, a protein of interest is fused to a protein fragment (the 'tag'), which allows its purification (Figure 2). This tagged protein is expressed and purified from the cell extract using the tag, for example by antibodies binding specifically to the tag. Proteins binding the tagged protein are co-purified and subsequently identified by MS. The most widely used variation of the AP-MS method is tandem affinity purification followed by mass spectrometry (TAP-MS). In TAP-MS, the protein of interest is attached to a larger protein tag, which allows two consecutive affinity purification steps [2]. Large-scale TAP-MS experiments have been performed for yeast and human proteins [3–5]. Currently, several variations of these two methods, as well as a number of other methods, are used to identify protein-protein interactions (PPIs) [6–8].

Figure 2

An affinity purification experiment followed by mass spectrometry. The protein of interest, F (red circle), is fused to a protein fragment, the 'tag' (red rectangle). The tag allows this protein to be purified biochemically. Proteins binding to the tagged protein (blue) are co-purified, whereas proteins not binding to protein F (yellow) are discarded. The purified proteins can be released using enzymatic cleavage (scissors) or other methods, depending on the nature of the tag. These proteins are then identified by mass spectrometry.

PPI datasets are often visualised as graphs [9, 10]. Proteins are represented as nodes, and interactions as connections between nodes. For example, if the interaction between two proteins is detected by a Y2H experiment, we represent this physical interaction by an undirected connection between the two nodes. In a more detailed representation, we could distinguish between bait and prey proteins and use a directed connection, drawn as an arrow pointing from bait to prey. Using graphs to describe the results of AP-MS protein interaction screens is not always as straightforward as for Y2H data. An AP-MS experiment identifies a whole protein complex rather than pairwise interactions, so its results have to be converted into a graph using either the matrix or the spokes model (Figure 3). The matrix model assumes that all proteins of a purified complex interact; therefore, in the graph every protein is connected to every other protein of the complex. The spokes model assumes no additional interactions between proteins in a complex other than between the tagged protein and each co-purified protein.

Figure 3

Spokes and matrix models. In order to represent the complexes identified by the AP-MS methods as binary pairs, the spokes model (a) or the matrix model (b) can be used. In the spokes model, it is assumed that only the tagged protein interacts with the other proteins in the complex. In the matrix model, it is assumed that all proteins in a complex interact with each other. While the spokes model probably leads to an underestimate of the real interactions, the matrix model leads to an overestimate.
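The two models can be illustrated with a short Python sketch. It is a minimal illustration rather than the procedure used by any of the databases; the function names and the toy complex (tagged protein F with three co-purified proteins G, H and I) are assumptions made for this example.

```python
from itertools import combinations


def spokes_pairs(bait, preys):
    """Spokes model: connect the tagged (bait) protein to each co-purified protein."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: connect every pair of proteins in the purified complex."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical complex: tagged protein F purified together with G, H and I.
spokes = spokes_pairs("F", ["G", "H", "I"])
matrix = matrix_pairs("F", ["G", "H", "I"])
print(len(spokes), len(matrix))  # prints: 3 6
```

For a complex of n proteins the spokes model yields n - 1 binary interactions and the matrix model n(n - 1)/2, so the difference between the two grows quickly with complex size.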

Graph representation allows the data to be analysed using a graph-theoretical framework. Many graph analysis algorithms have been applied to PPI datasets; these approaches have been reviewed in detail elsewhere [11–16].
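As a small illustration of the graph view, the following Python sketch stores an undirected PPI network as adjacency sets and computes the degree (number of interaction partners) of each protein. The helper name and the toy interactions are assumptions for this example and are not taken from any database.

```python
from collections import defaultdict


def build_graph(interactions):
    """Build an undirected graph as a mapping from protein to its set of partners."""
    adjacency = defaultdict(set)
    for protein_a, protein_b in interactions:
        adjacency[protein_a].add(protein_b)
        adjacency[protein_b].add(protein_a)
    return adjacency


# Hypothetical binary interactions between proteins A, B, C and D.
toy_edges = [("A", "B"), ("B", "C"), ("B", "D")]
graph = build_graph(toy_edges)
degree = {protein: len(partners) for protein, partners in graph.items()}
print(degree["B"])  # prints: 3
```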

PPI databases

The primary resources for PPI data are individual scientific publications. Several public databases collect published PPI data and provide researchers with access to their curated datasets. These usually reference the original publication and the experimental method used to determine each individual interaction. Database designers choose to represent these data in different ways, and the wide spectrum of experimental methods makes it difficult to design a single data model that captures all necessary experimental detail. To overcome this problem, the International Molecular Exchange (IMEx; http://imex.sourceforge.net/) consortium was formed. IMEx aims to enable the exchange of data and to avoid the duplication of curation effort. To that end, an XML-based proteomics standard, termed the proteomics standards initiative - molecular interaction (PSI-MI), has been developed [17]. At the time of writing, however, no data had yet been exchanged, and it was therefore necessary to combine PPI data from all available databases using the authors' own scripts to obtain as comprehensive a network as possible.

Here, the focus is on six databases: the Biological General Repository for Interaction Datasets (BioGRID) [18], the Molecular INTeraction database (MINT) [19], the Biomolecular Interaction Network Database (BIND) [20], the Database of Interacting Proteins (DIP) [21], the IntAct molecular interaction database (IntAct) [22] and the Human Protein Reference Database (HPRD) [23] (see Table 1). These databases report only experimentally verified interactions.

Table 1 PPI databases

DIP, IntAct and MINT are active members of the IMEx initiative; the curation accuracy of these three databases was assessed recently by Cusick et al. [24]. HPRD focuses entirely on human proteins, providing not only information on protein interactions, but also a variety of protein-specific information, such as post-translational modifications, disease associations and enzyme-substrate relationships. One of the first interaction databases, BIND, initiated in 2001 by the University of Toronto and the University of British Columbia, is part of the Biomolecular Object Network Databank (BOND) and was subsequently acquired by the company Thomson Reuters.

The following comparison is based on complete sets of binary interactions that were downloaded from the individual databases in May 2008. IntAct and MINT derive binary interactions from protein complexes using the spokes model. No other database provided any information on which model is applied. Only 'physical interactions' are considered here, although most databases also provide 'genetic interactions' (for example, two non-essential genes that lead to a non-viable phenotype if they are knocked out simultaneously). Furthermore, interactions were only accepted if a publication identifier was provided along with the interacting proteins.
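The two acceptance criteria can be expressed as a simple filter. The sketch below is written under stated assumptions: the record layout and the field names `interaction_type` and `publication_id` are hypothetical and do not correspond to the download format of any particular database.

```python
def is_accepted(record):
    """Keep a record only if it is a physical interaction with a publication identifier."""
    is_physical = record.get("interaction_type") == "physical"
    has_publication = bool(record.get("publication_id"))
    return is_physical and has_publication


# Hypothetical records; field names and values are illustrative only.
toy_records = [
    {"a": "P1", "b": "P2", "interaction_type": "physical", "publication_id": "pub1"},
    {"a": "P1", "b": "P3", "interaction_type": "genetic", "publication_id": "pub2"},
    {"a": "P2", "b": "P3", "interaction_type": "physical", "publication_id": ""},
]
accepted = [record for record in toy_records if is_accepted(record)]
print(len(accepted))  # prints: 1
```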

Currently, the most comprehensive database in terms of individual interactions is IntAct, with almost 130,000 unique interactions from up to 131 different organisms. Despite these large numbers, it cites only about 3,000 different publications. Whereas IntAct seems to concentrate on high-throughput studies, HPRD also takes into account small-scale publications. Although restricted to human proteins, HPRD reports over 36,000 unique interactions from more than 18,000 publications. Only BioGRID cites a similar number of publications (16,369); it is also the second largest database in terms of the number of unique interactions. It should be noted that the databases examine publications in different depth, and that a higher number of publications does not necessarily imply a greater curation effort.

The majority of known protein interactions involve proteins from Saccharomyces cerevisiae and Homo sapiens. Individual high-throughput interaction screens were carried out for some other organisms; these high-throughput studies usually account for the majority of all known interactions in the corresponding organism. By contrast, known protein interactions for S. cerevisiae and H. sapiens are dispersed over numerous publications. For this reason, the number of interactions for humans and yeast can vary considerably between different databases, depending on their coverage of the literature.

Differences between the PPI databases

Ideally, every database would extract the same interactions from a given publication. Unfortunately, this is not the case. Of the 14,899 publications shared by at least two databases, 5,782 (39 per cent) were reported with a different number of interactions in different databases. For example, for the publication reporting the most interactions [25], a minimum of 18,877 (BIND) and a maximum of 20,800 interactions (DIP) were reported. According to the abstract, the number of interactions is 20,405, which, again, is different from the number reported by all five databases that cite this publication. In this case, the variation is presumably due to problems with identifier mapping. Many databases use different identifiers, which do not always map in a perfect one-to-one relationship to the originally published identifiers. BioGRID (20,220 interactions) uses the original gene identifiers, but still lacks 185 interactions.
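A comparison of this kind can be sketched in Python as follows. This is not the authors' script: the record layout (database, publication identifier, two proteins), the function name and the toy data are assumptions. The last two lines simply re-derive the percentage and the BioGRID shortfall quoted above.

```python
from collections import defaultdict


def compare_publication_counts(records):
    """Count interactions per publication and database, then flag disagreements.

    records: iterable of (database, publication_id, protein_a, protein_b).
    Returns the number of publications shared by at least two databases and
    the set of shared publications whose interaction counts differ.
    """
    pairs = defaultdict(lambda: defaultdict(set))
    for database, publication_id, protein_a, protein_b in records:
        pairs[publication_id][database].add(frozenset((protein_a, protein_b)))

    shared = {pub: dbs for pub, dbs in pairs.items() if len(dbs) >= 2}
    differing = {
        pub for pub, dbs in shared.items()
        if len({len(found) for found in dbs.values()}) > 1
    }
    return len(shared), differing


# Hypothetical records from two placeholder databases, db_x and db_y.
toy_records = [
    ("db_x", "pub1", "A", "B"), ("db_x", "pub1", "A", "C"),
    ("db_y", "pub1", "A", "B"),
    ("db_x", "pub2", "D", "E"), ("db_y", "pub2", "E", "D"),
    ("db_x", "pub3", "F", "G"),
]
n_shared, differing = compare_publication_counts(toy_records)
print(n_shared, sorted(differing))  # prints: 2 ['pub1']

print(round(100 * 5782 / 14899))  # prints: 39
print(20405 - 20220)              # prints: 185
```

Note that equal counts do not guarantee that two databases extracted the same interactions; comparing the sets of pairs directly would be a stricter check.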

As a second example, using a Y2H screen, Rual et al. detected 2,754 interactions between human proteins [26]. The authors compared their experimental findings with a literature-curated PPI network of 4,076 interactions. This resulted in a combined network of 6,438 interactions. HPRD (2,371 interactions), IntAct (2,671 interactions) and MINT (2,463 interactions) report only experimentally detected interactions for this reference. BioGRID reports 6,295 interactions for this study, of which 2,594 quote Y2H as the detection method. These also overlap with the interactions reported by the other databases for this reference. The remaining 3,895 interactions quote affinity capture as the detection method and possibly refer to the literature-curated interactions.

For a number of other publications, differences can be explained by different confidence sets or thresholds [27, 28] or differences in the application of the matrix or spokes model. Often, no obvious reason for different numbers of interactions could be found.

Integration of PPI data

Integration of data from the different databases is not trivial. Although many databases provide their interactions in the PSI-MI format, its controlled vocabulary is often not used or is used incorrectly. Furthermore, a variety of different gene or protein identifiers are used, even within some of the databases. Although a gene can give rise to several different proteins (due to alternative splicing), we mapped all identifiers to Ensembl gene identifiers to avoid any ambiguities. This procedure is based on mapping tables obtained from UniProt [29]. Only interactions in which both proteins could be mapped to an Ensembl gene identifier were considered for further analysis.
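The mapping step can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the mapping table is a plain dictionary with placeholder identifiers (`P_A1`, `GENE_A` and so on) standing in for real protein and Ensembl gene identifiers, and each interaction is stored as an unordered pair so that A-B and B-A are counted once.

```python
def map_interactions(interactions, id_to_gene):
    """Map both partners to gene identifiers; drop pairs that cannot be fully mapped."""
    mapped = set()
    for protein_a, protein_b in interactions:
        gene_a = id_to_gene.get(protein_a)
        gene_b = id_to_gene.get(protein_b)
        if gene_a is None or gene_b is None:
            continue  # at least one partner has no gene identifier
        mapped.add(frozenset((gene_a, gene_b)))
    return mapped


# Placeholder mapping table: two isoforms (P_A1, P_A2) map to the same gene.
toy_mapping = {"P_A1": "GENE_A", "P_A2": "GENE_A", "P_B": "GENE_B"}
toy_interactions = [("P_A1", "P_B"), ("P_B", "P_A2"), ("P_A1", "P_UNKNOWN")]
gene_level = map_interactions(toy_interactions, toy_mapping)
print(len(gene_level))  # prints: 1
```

Mapping to genes collapses isoform-specific interactions into a single gene-level interaction, which removes ambiguity at the cost of isoform detail.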

After unifying all identifiers for eukaryotic organisms, the four model organisms Caenorhabditis elegans, Drosophila melanogaster, S. cerevisiae and H. sapiens showed the highest number of interactions (Table 2). The focus here has been on PPIs in eukaryotes, but the reader should note that high-throughput datasets also exist for a variety of prokaryotes, including Escherichia coli, Campylobacter jejuni and Helicobacter pylori. Previous studies reported little overlap between individual PPI datasets [15]. Likewise, there is little redundancy in the combined set of interactions (Table 2). Between 1 per cent (D. melanogaster) and 18 per cent (H. sapiens) of all interactions are reported by more than one publication. Interestingly, the proportion of interactions that were reported by different methods reaches up to 25 per cent for yeast and 42 per cent for humans (Table 2). Although many small-scale publications apply more than one method to confirm an interaction, this number is most likely an overestimate, because databases use different nomenclature and spelling variations to describe experimental detection methods. Therefore, more interactions appear to be confirmed by several methods than really are.
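The effect of inconsistent method names on the redundancy estimate can be shown with a small Python sketch. The evidence records, the field names and the normalisation rule (lower-casing and treating hyphens as spaces) are assumptions chosen for illustration; real method vocabularies would need a curated mapping.

```python
from collections import defaultdict


def fraction_with_multiple(evidence, key):
    """Fraction of interactions supported by more than one distinct value of key(record)."""
    values = defaultdict(set)
    for record in evidence:
        values[record["pair"]].add(key(record))
    if not values:
        return 0.0
    return sum(1 for found in values.values() if len(found) > 1) / len(values)


def normalise_method(name):
    """Crude normalisation: lower-case, hyphens to spaces, collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


# Hypothetical evidence: the same method is spelled differently by two sources.
toy_evidence = [
    {"pair": frozenset(("G1", "G2")), "publication": "pub1", "method": "two hybrid"},
    {"pair": frozenset(("G1", "G2")), "publication": "pub2", "method": "Two-Hybrid"},
    {"pair": frozenset(("G3", "G4")), "publication": "pub3", "method": "affinity capture"},
]
by_publication = fraction_with_multiple(toy_evidence, lambda r: r["publication"])
by_raw_method = fraction_with_multiple(toy_evidence, lambda r: r["method"])
by_clean_method = fraction_with_multiple(toy_evidence, lambda r: normalise_method(r["method"]))
print(by_publication, by_raw_method, by_clean_method)  # prints: 0.5 0.5 0.0
```

In the toy data, one interaction appears to be confirmed by two methods only because of a spelling variation; after normalisation the apparent confirmation disappears.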

Table 2 Redundancy of PPIs.

As mentioned above, databases focus their curation efforts on different publications. Consequently, only a subset of all protein interactions can be found in more than one database (Table 2). These range from 42 per cent of yeast interactions and 51 per cent of human interactions to 72 per cent of fly interactions and 86 per cent of worm interactions.

To assess these differences in more detail, the relative pairwise overlap of human protein interactions between databases was calculated (Table 3). All databases have their highest relative overlap when compared with HPRD, which reports the most interactions. High overlaps were also found between DIP and BioGRID (55 per cent) and between MINT and IntAct (59 per cent). Even the largest database (HPRD), however, covers only two-thirds of all reported human protein interactions.

Table 3 Overlap of human PPIs between databases.
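The text does not spell out how the relative overlap is defined. The sketch below assumes one plausible definition: the fraction of interactions in one database that are also found in another, which makes the measure asymmetric. The database names `db_x` and `db_y` and their contents are placeholders.

```python
def relative_overlap(reference, other):
    """Fraction of interactions in `reference` that also occur in `other` (assumed definition)."""
    return len(reference & other) / len(reference) if reference else 0.0


def coverage(databases, name):
    """Fraction of all interactions (union over all databases) found in one database."""
    union = set().union(*databases.values())
    return len(databases[name]) / len(union) if union else 0.0


def pair(a, b):
    """Unordered protein pair, so that A-B and B-A are the same interaction."""
    return frozenset((a, b))


# Placeholder databases with hypothetical interactions.
toy_databases = {
    "db_x": {pair("A", "B"), pair("A", "C"), pair("B", "C"), pair("C", "D")},
    "db_y": {pair("B", "C"), pair("C", "D"), pair("D", "E")},
}
print(relative_overlap(toy_databases["db_x"], toy_databases["db_y"]))  # prints: 0.5
print(coverage(toy_databases, "db_x"))  # prints: 0.8
```

Under this definition a small database can show a high overlap with a large one while the reverse overlap stays low, which is consistent with all databases overlapping most strongly with the largest one.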

Meta-databases

None of the existing PPI databases provides an exhaustive dataset. Therefore, some groups have set up meta-databases that provide protein interaction data extracted and integrated from other databases. Currently, one of the most comprehensive meta-databases appears to be the Agile Protein Interaction Database (APID) [30]. APID extracts interactions from the six databases described above, mapping all proteins to UniProt identifiers [29]. Via a web interface, the user can query for proteins of interest. APID references the database from which an interaction is derived and provides the related information available in the original database, such as the detection method and the publication identifier. In addition, APID incorporates biological information from various other databases, such as the Gene Ontology [31] and Pfam databases [32]. Unfortunately, a download of the complete dataset is currently not possible due to licensing issues. APID is generally in good agreement with the results of the authors' data integration. For the time being, APID seems a good source of interactome data.

Several other meta-databases exist, but these usually focus on a single organism [33] or incorporate various other types of interactions, such as computationally predicted protein interactions and co-citation of proteins [34]. For a comprehensive list of available databases, the reader is referred to the Pathguide [35].

Conclusions

PPI databases not only report their data in different ways, using different ontologies, but their curators also report different PPIs when examining the same publication. In addition, all databases include different publications. It is therefore not surprising that every database reports different PPIs. The pairwise overlap among databases analysed here reaches up to 75 per cent, but always falls short of a perfect 100 per cent. Similar results were obtained in related studies [12, 24]. Until a data exchange between databases is implemented, a comprehensive set of interactions can only be obtained through data integration of several databases. Meta-databases, such as APID, provide access to more comprehensive datasets, but do not always allow the download of their complete data. Furthermore, by their very nature, meta-databases will always be less up to date than the original databases.

PPI databases have improved greatly over the past couple of years, and important issues, such as data exchange, are currently being addressed by some of the databases described here. An important step towards increasing the quantity and quality of protein interaction data would be to introduce a submission requirement, as already exists for sequence and microarray data. These data have to be submitted to public databases prior to publication in a scientific journal, which ensures data availability and consistent annotation, and enables researchers to utilise the data with greatest efficiency.

References

  1. Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature. 1989, 340: 245-246. 10.1038/340245a0.

  2. Rigaut G, Shevchenko A, Rutz B, et al: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotech. 1999, 17: 1030-1032. 10.1038/13732.

  3. Gavin AC, Aloy P, Grandi P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.

  4. Bouwmeester T, Bauch A, Ruffner H, et al: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol. 2004, 6: 97-105. 10.1038/ncb1086.

  5. Gavin AC, Bosche M, Krause R, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.

  6. Berggård T, Linse S, James P: Methods for the detection and analysis of protein-protein interactions. Proteomics. 2007, 7: 2833-2842. 10.1002/pmic.200700131.

  7. Phizicky EM, Fields S: Protein-protein interactions: Methods for detection and analysis. Microbiol Rev. 1995, 59: 94-123.

  8. Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol. 2007, 3: e42. 10.1371/journal.pcbi.0030042.

  9. Suderman M, Hallett M: Tools for visually exploring biological networks. Bioinformatics. 2007, 23: 2651-2659. 10.1093/bioinformatics/btm401.

  10. Cline MS, Smoot M, Cerami E, et al: Integration of biological networks and gene expression data using Cytoscape. Nat Protocols. 2007, 2: 2366-2382. 10.1038/nprot.2007.324.

  11. Albert R, Barabasi AL: Statistical mechanics of complex networks. Rev Mod Phys. 2002, 74: 47-97. 10.1103/RevModPhys.74.47.

  12. Futschik ME, Chaurasia G, Herzel H: Comparison of human protein protein interaction maps. Bioinformatics. 2007, 23: 605-611. 10.1093/bioinformatics/btl683.

  13. Huber W, Carey V, Long L, et al: Graphs in molecular biology. BMC Bioinformatics. 2007, 8: S8-

  14. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88-

  15. von Mering C, Krause R, Snel B, et al: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.

  16. Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol. 2000, 18: 1257-1261. 10.1038/82360.

  17. Kerrien S, Orchard S, Montecchi-Palazzi L, et al: Broadening the horizon - Level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 2007, 5: 44. 10.1186/1741-7007-5-44.

  18. Stark C, Breitkreutz BJ, Reguly T, et al: BioGRID: A general repository for interaction datasets. Nucl Acids Res. 2006, 34: D535-D539. 10.1093/nar/gkj109.

  19. Zanzoni A, Montecchi-Palazzi L, Quondam M, et al: MINT: A Molecular INTeraction database. FEBS Lett. 2002, 513: 135-140. 10.1016/S0014-5793(01)03293-8.

  20. Bader GD, Donaldson I, Wolting C, et al: BIND - The Biomolecular Interaction Network Database. Nucl Acids Res. 2001, 29: 242-245. 10.1093/nar/29.1.242.

  21. Xenarios I, Rice DW, Salwinski L, et al: DIP: The Database of Interacting Proteins. Nucl Acids Res. 2000, 28: 289-291. 10.1093/nar/28.1.289.

  22. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al: IntAct: An open source molecular interaction database. Nucl Acids Res. 2004, 32: D452-D455. 10.1093/nar/gkh052.

  23. Peri S, Navarro JD, Amanchy R, et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.

  24. Cusick ME, Hu H, Smolyar A, et al: Literature-curated protein interaction datasets. Nat Meth. 2009, 6: 39-46. 10.1038/nmeth.1284.

  25. Giot L, Bader JS, Brouwer C, et al: A protein interaction map of Drosophila melanogaster. Science. 2003, 302: 1727-1736.

  26. Rual J-F, Venkatesan K, Hao T, et al: Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005, 437: 1173-1178. 10.1038/nature04209.

  27. John PM, Russell SL, Asa BH, et al: Large-scale identification of yeast integral membrane protein interactions. Proc Natl Acad Sci USA. 2005, 102: 12123-12128. 10.1073/pnas.0505482102.

  28. Formstecher E, Aresta S, Collura V, et al: Protein interaction mapping: A Drosophila case study. Genome Res. 2005, 15: 376-384. 10.1101/gr.2659105.

  29. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucl Acids Res. 2008, 36: D190-D195. 10.1093/nar/gkn141.

  30. Prieto C, De Las Rivas J: APID: Agile Protein Interaction Data Analyzer. Nucl Acids Res. 2006, 34: W298-W302. 10.1093/nar/gkl128.

  31. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.

  32. Finn RD, Tate J, Mistry J, et al: The Pfam protein families database. Nucl Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.

  33. Chaurasia G, Iqbal Y, Hanig C, et al: UniHI: An entry gate to the human protein interactome. Nucl Acids Res. 2007, 35: D590-D594. 10.1093/nar/gkl817.

  34. Jensen LJ, Kuhn M, Stark M, et al: STRING 8 - A global view on proteins and their functional interactions in 630 organisms. Nucl Acids Res. 2009, 37: D412-D416. 10.1093/nar/gkn760.

  35. Bader GD, Cary MP, Sander C: Pathguide: A pathway resource list. Nucl Acids Res. 2006, 34: D504-D506. 10.1093/nar/gkj126.

Acknowledgements

The authors would like to thank all developers and curators of the protein-protein interaction databases. Without their effort, our life would be much harder. We thank Henning Hermjakob for helpful discussions. We are grateful for funding from the British Council/DAAD as part of the ARC programme (ARC1297).

Author information

Corresponding author

Correspondence to Thomas Schlitt.

About this article

Cite this article

Lehne, B., Schlitt, T. Protein-protein interaction databases: keeping up with growing interactomes. Hum Genomics 3, 291 (2009). https://doi.org/10.1186/1479-7364-3-3-291
