L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA 50011, USA

Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA

Abstract

Background

The abundant data available for protein interaction networks have not yet been fully understood. New types of analyses are needed to reveal organizational principles of these networks to investigate the details of functional and regulatory clusters of proteins.

Results

In the present work, individual clusters identified by an eigenmode analysis of the connectivity matrix of the protein-protein interaction network in yeast are investigated for possible functional relationships among the members of the cluster. With our functional clustering we have successfully predicted several new protein-protein interactions that indeed have been reported recently.

Conclusion

Eigenmode analysis of the entire connectivity matrix yields both a global and a detailed view of the network. We have shown that the eigenmode clustering not only is guided by the number of proteins with which each protein interacts, but also leads to functional clustering that can be applied to predict new protein interactions.

Background

Systems biology is a new frontier for bioinformatics research, aimed at understanding complex biological systems in cells by integrating interactions between large numbers of constituent components, including genes, proteins, and metabolites. Examples of systems biology research include studies of gene interaction networks

Proteins represent the major category of large functional biomolecules. How proteins interact with one another is a current subject of many high-throughput studies. The number of proteins in an organism can reach tens of thousands. Comprehending the functional, developmental, and regulatory networks comprising these temporal and spatial protein pairs is a formidable task

Protein clustering in global interaction networks is important for revealing cellular functionality (for example

Our approach in this paper, which uses a connectivity matrix and its subsequent eigenvalue/eigenvector decomposition, is also based on the topological properties of the interaction network as a whole. Although the significant proteins in each eigenvector form clusters, these clusters differ from those obtained by methods that are based solely on protein properties, because they reflect the organizational patterns of the protein interactions themselves based on topological considerations.

In the present study, we show that computational analyses of experimental data on protein networks can lead to discoveries of new, unexpected relationships, which emphasizes the importance of the global view of a protein interaction network.

Results and discussion

In this paper, we have used the methodology of spectral graph analysis. We earlier applied a similar approach for protein dynamics analysis using elastic network models

We used the yeast protein interaction data available in the GRID (General Repository for Interaction Data sets)

Connectivity matrix and its eigenmode analysis

We have converted the pairwise interaction information obtained from the GRID data into a connectivity matrix **C **for subsequent analyses. The individual off-diagonal elements of the symmetric matrix **C **are as follows: 1, if two proteins interact; 0, if they do not. The diagonal elements of the connectivity matrix are taken as the negative sum of the other row (or column) elements. We then applied the standard method of matrix eigenanalysis used in linear algebra. Readers who are non-experts in this field may read a brief tutorial provided in the Methods section.
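The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_connectivity_matrix`, the toy protein labels, and the choice to ignore self-interactions are all assumptions made for the example.

```python
import numpy as np

def build_connectivity_matrix(pairs):
    """Return (names, C) for a list of undirected pairwise interactions.

    Off-diagonal C[i, j] is 1 if proteins i and j interact and 0 otherwise;
    each diagonal element is the negative sum of the other elements in its row.
    """
    names = sorted({p for pair in pairs for p in pair})
    index = {name: i for i, name in enumerate(names)}
    C = np.zeros((len(names), len(names)))
    for a, b in pairs:
        if a == b:
            continue  # assumption: self-interactions are ignored
        i, j = index[a], index[b]
        C[i, j] = C[j, i] = 1.0
    # The diagonal is still zero here, so each row sum equals the node degree.
    np.fill_diagonal(C, -C.sum(axis=1))
    return names, C

# Hypothetical toy interaction list (not taken from the GRID data).
toy_pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
names, C = build_connectivity_matrix(toy_pairs)
print(names)
print(C)
```

With this convention every row of the matrix sums to zero, which is why the matrix is singular.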

We should note that our definition of the diagonal elements of the connectivity matrix as the negative sums of all non-diagonal elements of the given column (or row) automatically leads to a connectivity matrix that is singular, which must therefore be analyzed through the Singular Value Decomposition technique. This definition of the diagonal elements, which implies the singularity of the matrix, has a deep physical meaning when the technique is applied to protein structures. For example in the case of elastic network models of proteins

A similar study of the protein-protein interaction network in budding yeast by using spectral graph theory has been published by Bu

Eigenvalue distribution for the yeast protein-protein interaction network.

Singular Value Decomposition (SVD) computations

In this work, we used SVD (since det **C **= 0) to extract eigenvalues and eigenvectors instead of alternative clustering methods. The SVD method has been extremely useful in the development of elastic network models

We have applied the SVD subroutine available in the LAPACK
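As a rough illustration of this step, the sketch below uses NumPy, whose `numpy.linalg.svd` and `numpy.linalg.eigh` functions call LAPACK internally. The specific LAPACK routine used in this work is not reproduced here, and the 4 × 4 matrix is a hypothetical toy network. For a symmetric matrix built with negative diagonal sums, all eigenvalues are non-positive, so the singular values are simply the eigenvalues with the sign flipped.

```python
import numpy as np

# Hypothetical 4-protein connectivity matrix (off-diagonal 1 = interaction,
# diagonal = negative row sum of the off-diagonal elements).
C = np.array([[-3.0, 1.0, 1.0, 1.0],
              [1.0, -2.0, 1.0, 0.0],
              [1.0, 1.0, -2.0, 0.0],
              [1.0, 0.0, 0.0, -1.0]])

# Singular value decomposition: C = U diag(s) Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(C)

# Symmetric eigendecomposition: eigenvalues sorted in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

print("singular values:", s)
print("eigenvalues:    ", eigvals)
# The smallest singular value is numerically zero because C is singular.
print("smallest singular value:", s[-1])
```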

The sorted eigenvalues of the connectivity matrix for the yeast protein-protein interaction network are shown in Figure

**The rank of the average degree of connectivity of eigenclusters as a function of the eigenvector index **for the cutoff values of (a) 0.01, (b) 0.05, (c) 0.1

As an example showing how we have chosen significant proteins in an eigenvector, Fig. ^{th }component of the eigenvector corresponds to the ^{th }protein in the connectivity matrix **C**. In this way, we may define an eigencluster as a set of proteins corresponding to the most significant components of a given eigenvector.

**Significant proteins corresponding to eigencluster #21**. Only two proteins represent significant non-zero components: SER3 (YER081W) with a value of -0.99, and SLT2 (YHR030C) with a value of 0.055 (shown with arrows). Interestingly, both of these proteins have 96 neighbors.

We should note that the same protein(s) may belong to several different clusters. This is because each cluster corresponds to an eigenvector related to a specific eigenvalue. Because the whole protein interaction network database for yeast contains 4906 proteins, there are also 4906 eigenvectors and corresponding eigenclusters. Since each cluster contains at least several proteins, every protein usually belongs to several clusters. This corresponds to the situation in the normal mode analysis of protein motions, where a given residue can be involved at once in several functionally important motions, which may lead to functional promiscuity of proteins

The inherent limitations in interaction maps

Interactome datasets contain protein interaction information obtained using a wide range of experimental methods, each providing data with differing reliability due to the limitations of the method used. One approach to reconcile the reliability of the protein interaction data obtained using various experimental methods is to assign weights to the interactions based on either the confidence of the particular experimental method, or the confirmation of interactions by additional experimental methods (e.g. ref

Classical wet-bench molecular biology approaches that focus on a single protein interaction are generally accurate. However, when a high-throughput method (e.g. the yeast two-hybrid assays) is used, the number of wrongly annotated interactions (i.e. false positives) increases, and sometimes, even some reported protein interactions cannot be reconciled with the known protein complexes

The exact false positive rates and completeness of these large-scale experiments are relatively unknown because of coverage limitations: When Vidal and co-workers

Another limitation in protein networks is that global protein interaction networks present a rather static picture of protein interactions, neglecting transport and kinetic aspects. There are two distinct

Despite these difficulties, computational analyses of protein interaction networks could be extremely useful if, for example, they can suggest new likely pairings that have not yet been discovered, or reveal new structural or functional linkages within clusters of proteins from the protein network.

The yeast GRID database: physical, genetic, and functional interactions

The GRID database used for our calculations contains not only physical interactions, but also genetic and functional interactions. We should keep in mind that the available physical interactions in the database cannot always be understood in the sense that two molecules are selectively and specifically binding

The protein contact matrix decomposition leads to clusters of proteins having similar numbers of interactions

The rank of the average degree of connectivity for eigenvector clusters is shown in Figure ^{th }cluster, which is an interesting finding in itself. The remainder of the clusters contains proteins with small numbers of neighbors, and as a result the presence of noise disturbs the linearity of the plot. Figure

Highly non-random nature of interaction clustering

To investigate whether the observed clustering of interactions detected by the eigenanalysis of the GRID database is random, we have performed a simple numerical experiment. We have compared the distribution of eigenvalues for three different cases: for the original GRID yeast dataset matrix; for the same matrix but with randomly shuffled interactions; and for the matrix obtained from the original one by randomly removing 10% of interactions. The results shown in Fig.

**Comparison of eigenvalue distributions **for (a) the GRID yeast protein interaction matrix (denoted as Original), (b) the randomly shuffled matrix that has the same number of connections as the original matrix (Shuffled), and (c) the matrix that has 10% fewer interactions than the original matrix (Reduced). There is a clear difference between the randomly shuffled case and the original or reduced GRID data sets.

The clusters revealed by the eigenanalysis show order and contain functional information. This is an important observation motivating further, more detailed studies. The distributions of eigenvalues of the original and the reduced matrices are quite similar to each other, in contrast to that of the shuffled matrix. This indicates that despite possible experimental errors and many undiscovered interactions in the GRID database, the overall shape of the eigenvalue distribution and the resulting interaction clusters are conserved. This conservation can be exploited for predictive purposes.
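The numerical experiment can be sketched as follows. This is only an illustration under stated assumptions: the shuffling procedure is taken here to mean placing the same number of interactions uniformly at random among all protein pairs, the helper names are hypothetical, and the 30-node hub-dominated network is a toy stand-in for the GRID data.

```python
import numpy as np

def connectivity_from_edges(n, edges):
    """Connectivity matrix with 1 for interacting pairs and negative row sums on the diagonal."""
    C = np.zeros((n, n))
    for i, j in edges:
        C[i, j] = C[j, i] = 1.0
    np.fill_diagonal(C, -C.sum(axis=1))
    return C

def shuffled_edges(n, m, rng):
    """Assumed shuffling scheme: m edges placed uniformly at random among all pairs."""
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    chosen = rng.choice(len(all_pairs), size=m, replace=False)
    return [all_pairs[k] for k in chosen]

def reduced_edges(edges, fraction, rng):
    """Randomly remove the given fraction of edges."""
    keep = len(edges) - int(round(fraction * len(edges)))
    chosen = rng.choice(len(edges), size=keep, replace=False)
    return [edges[k] for k in chosen]

rng = np.random.default_rng(0)
n = 30
# Hypothetical toy network: node 0 is a hub, plus a short chain among nodes 1..10.
edges = [(0, j) for j in range(1, n)] + [(j, j + 1) for j in range(1, 10)]

matrices = {
    "original": connectivity_from_edges(n, edges),
    "shuffled": connectivity_from_edges(n, shuffled_edges(n, len(edges), rng)),
    "reduced": connectivity_from_edges(n, reduced_edges(edges, 0.10, rng)),
}
# Sorted eigenvalues (ascending, i.e. most negative first) for each case.
spectra = {name: np.linalg.eigvalsh(C) for name, C in matrices.items()}
for name, eigvals in spectra.items():
    print(name, "most negative eigenvalues:", np.round(eigvals[:3], 2))
```

Plotting the three sorted spectra against the eigenvalue index gives the kind of comparison described in the text.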

Eigenvector cutoff and node connectivities

We have investigated all eigenmodes in our analysis. For each eigenvector, we have tabulated all the proteins corresponding to components having absolute values above 0.05 and have examined the connectivities among them as specified by the GRID database. We have also calculated the number of neighbors (degree of connectivity) for each protein. We found that the protein with the highest number of protein connections is JSN1 (YJR091C – names in parentheses are the systematic names, sometimes called ORF names/numbers) with 288 neighbors. The protein that has the second highest number of neighbors is YKE2 (YLR200W), with 166 neighbors. In our analysis, JSN1 is the protein that corresponds to the smallest eigenvalue, and YKE2 corresponds to the second smallest eigenvalue (Table

The 5 smallest eigenvalues and the proteins related to the corresponding eigenvectors. The number of connections for each protein, and its rank order based on the number of connections are shown in the last two columns.

| Eigenvalue index | Eigenvalue | Proteins | # of neighbors | Rank order |
|---|---|---|---|---|
| 1 | -289.02 | JSN1 (YJR091C) | 288 | 1 |
| 2 | -167.12 | YKE2 (YLR200W) | 166 | 2 |
| 3 | -161.21 | PAC10 (YGR078C) | 160 | 3 |
|   |         | GIM5 (YML094W) | 160 | 4 |
| 4 | -161.02 | PAC10 (YGR078C) | 160 | 3 |
|   |         | GIM5 (YML094W) | 160 | 4 |
| 5 | -149.11 | YPT6 (YLR262C) | 148 | 5 |
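The selection of significant proteins by an eigenvector cutoff, together with the degree of each protein, can be sketched as below. The function name `eigenclusters`, the toy matrix, and the protein labels are assumptions made for illustration; the 0.05 default mirrors the cutoff quoted in the text.

```python
import numpy as np

def eigenclusters(C, names, cutoff=0.05):
    """For every eigenvector, list the proteins whose component exceeds the cutoff in absolute value."""
    eigvals, eigvecs = np.linalg.eigh(C)  # ascending order: most negative eigenvalue first
    clusters = []
    for k, lam in enumerate(eigvals):
        significant = np.flatnonzero(np.abs(eigvecs[:, k]) > cutoff)
        clusters.append((float(lam), [names[i] for i in significant]))
    return clusters

# Hypothetical toy network: protein "A" interacts with all the others.
names = ["A", "B", "C", "D"]
C = np.array([[-3.0, 1.0, 1.0, 1.0],
              [1.0, -2.0, 1.0, 0.0],
              [1.0, 1.0, -2.0, 0.0],
              [1.0, 0.0, 0.0, -1.0]])

# With this diagonal convention the degree of each protein is minus the diagonal element.
degree = dict(zip(names, (-np.diag(C)).astype(int)))

clusters = eigenclusters(C, names, cutoff=0.5)
smallest_eigenvalue, top_cluster = clusters[0]
print(smallest_eigenvalue, top_cluster, degree)
```

In this toy case the eigenvector of the smallest eigenvalue is dominated by the most highly connected protein, which parallels the pattern in the table above, where each of the smallest eigenvalues lies just below the negative of the corresponding number of neighbors.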

A critical question remains: are these clusters formed solely according to the number of interacting proteins (i.e. spectral clustering)? Or does the function of proteins influence clustering (i.e. functional clustering)? The data we provide in this paper support the functional clustering hypothesis.

The spectral and functional nature of clusters may not be exclusive: their detailed nature could drive evolution in such a manner that the function of the protein is influenced not only by its functional type, but also by the number of protein neighbors in the whole network in order to create some vital control mechanisms to support cellular fitness. We will explore the presence of functional clustering in the following examples.

Extracting sub-nets with significant interconnections (Eigenvector #23)

In the case of eigenvector #23, there are 6 significant proteins. These proteins, shown in Table

The significant proteins in eigencluster #23, their number of connections in the protein-protein interaction network, their corresponding eigenvector component values, and GO molecular function annotations.

| Proteins | Value | Number of connections | GO Molecular Function annotation |
|---|---|---|---|
| CLA4 (YNL298W) | -0.41 | 91 | protein serine/threonine kinase activity |
| FKS1 (YLR342W) | -0.34 | 90 | 1,3-beta-glucan synthase activity |
| ARP2 (YDL029W) | -0.09 | 88 | ATP binding; actin binding; structural constituent of cytoskeleton |
| SMI1 (YGR229C) | -0.05 | 82 | molecular function unknown |
| PHO85 (YPL031C) | 0.06 | 81 | cyclin-dependent protein kinase activity |
| RVS167 (YDR388W) | 0.83 | 92 | cytoskeletal protein binding |

**Connections for the proteins interacting within cluster #23**. The edges represent experimental protein interactions.

Sub-nets with few interconnections – are there missing links? (eigenvector #67)

The cluster for eigenvector #67 has more proteins than do the clusters for eigenvectors #21 and #23. The significant proteins in this eigencluster are shown in Table

The significant proteins in eigencluster #67, their number of connections in the protein-protein interaction network, their corresponding eigenvector component values, and GO molecular function annotations.

| Proteins | Value | Number of connections | GO Molecular Function annotation |
|---|---|---|---|
| MUS81 (YDR386W) | -0.12 | 45 | endonuclease activity |
| CSM3 (YMR048W) | -0.11 | 53 | molecular function unknown |
| PSE1 (YMR308C) | -0.07 | 42 | protein carrier activity |
| CKA1 (YIL035C) | 0.05 | 66 | protein kinase CK2 activity |
| RPC40 (YPR110C) | 0.05 | 67 | DNA-directed RNA polymerase activity |
| HRR25 (YPL204W) | 0.08 | 63 | casein kinase activity |
| GLC7 (YER133W) | 0.11 | 52 | protein phosphatase type 1 activity |
| BUD20 (YLR074C) | 0.20 | 56 | molecular function unknown |
| SEN15 (YMR059W) | 0.30 | 55 | tRNA-intron endonuclease activity |
| HHF1 (YBR009C) | 0.87 | 53 | DNA binding |

**The protein cluster from eigenvector #67, which has a star-like form**. This form offers a major contrast with that of the cluster shown in Fig. 5, which has more interconnections. CSM3 (YMR048W) is unconnected to the rest of the proteins in this cluster.

There is also a question as to whether CSM3 may in fact be functionally disconnected from the other proteins in this cluster, as suggested by the GRID database. Is it possible to functionally relate CSM3 to other proteins in the cluster? The function of CSM3 is currently unknown according to the GRID database; however, it is known that the protein participates in meiotic chromosome segregation and DNA replication

Even small subnets that are not fully connected may have missing links (eigenvector #4850)

For the upper end of eigenvalue distribution, we have analyzed the case for eigenvector #4850. The connectivities and GO annotations for this cluster are also shown in Fig.

The significant proteins in eigencluster #4850, their number of connections in the protein-protein interaction network, their corresponding eigenvector component values, and GO molecular function annotations.

| Proteins | Value | Number of connections | GO Molecular Function annotation |
|---|---|---|---|
| URA10 (YMR271C) | 0.15 | 2 | orotate phosphoribosyltransferase activity |
| MNE1 (YOR350C) | 0.26 | 3 | molecular function unknown |
| RPS20 (YHL015W) | 0.58 | 2 | structural constituent of ribosome |
| GPI13 (YLL031C) | 0.75 | 1 | transferase activity, transferring phosphorus-containing groups |

**The three proteins connected in cluster #4850**. The fourth protein URA10 (YMR271C) present in this cluster is not connected to the three shown here.
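The search for possible missing links inside an eigencluster can be sketched as a simple enumeration: list the known interactions among the cluster members, then report the members with no link inside the cluster and the pairs that are not yet connected. The function name `candidate_links` and the protein labels `P1` to `P4` and `X9` are hypothetical; the pairs it returns are only candidates for further scrutiny, not predictions made by the authors.

```python
from itertools import combinations

def candidate_links(members, interactions):
    """Split all pairs of cluster members into known interactions and unconnected pairs."""
    known = {frozenset(pair) for pair in interactions}
    present, missing = [], []
    for a, b in combinations(members, 2):
        if frozenset((a, b)) in known:
            present.append((a, b))
        else:
            missing.append((a, b))
    linked = {p for pair in present for p in pair}
    # Members with no interaction partner inside the cluster.
    isolated = [m for m in members if m not in linked]
    return present, missing, isolated

# Hypothetical cluster and interaction list (X9 lies outside the cluster).
members = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "X9")]

present, missing, isolated = candidate_links(members, interactions)
print("known links:    ", present)
print("candidate links:", missing)
print("isolated:       ", isolated)
```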

Functional modules assemble proteins with similar functions and biological processes

For each eigenvector of the interaction matrix, we have analyzed the similarities of Gene Ontology (GO) annotations within a given cluster using FunSpec

GO assignments of biological processes and molecular functions for examples of individual eigenclusters with FunSpec.[82] The number of proteins with the same GO annotation is given in parentheses, and the numbers in square brackets are the p-values for the assignments.

| Eigenvector | Number of significant proteins | GO Biological Process | GO Molecular Function |
|---|---|---|---|
| 106 | 17 | DNA metabolism (10) [3 × 10^{-8}], chromosome organization and biogenesis (7) [1 × 10^{-7}], nuclear organization and biogenesis (7) [4 × 10^{-7}], M phase (7) [3 × 10^{-6}], cell organization and biogenesis (11) [4 × 10^{-6}] | Double-stranded DNA binding (2) [2 × 10^{-4}], single-stranded DNA binding (2) [4 × 10^{-4}], DNA helicase (2) [1 × 10^{-3}], DNA binding (5) [3 × 10^{-3}], Binding (8) [5 × 10^{-3}] |
| 267 | 55 | Cell growth and maintenance (50) [9 × 10^{-8}], RNA metabolism (14) [2 × 10^{-7}], RNA processing (13) [5 × 10^{-7}], microtubule-based process (8) [8 × 10^{-7}], mRNA processing (9) [9 × 10^{-7}], nucleobase, nucleoside, nucleotide, and nucleic acid metabolism (24) [2 × 10^{-6}] | Binding (28) [1 × 10^{-8}], nucleic acid binding (21) [2 × 10^{-7}], RNA binding (13) [2 × 10^{-7}], mRNA binding (6) [5 × 10^{-5}] |
| 304 | 51 | Nucleobase, nucleoside, nucleotide, and nucleic acid metabolism (36) [1 × 10^{-14}], RNA processing (19) [1 × 10^{-13}], mRNA processing (14) [4 × 10^{-13}], RNA metabolism (19) [1 × 10^{-12}], RNA splicing (12) [6 × 10^{-11}], mRNA splicing (11) [1 × 10^{-10}], metabolism (44) [2 × 10^{-10}], cell growth and/or maintenance (49) [7 × 10^{-10}] | Binding (32) [8 × 10^{-13}], nucleic acid binding (25) [2 × 10^{-11}], RNA binding (14) [7 × 10^{-9}], mRNA binding (6) [4 × 10^{-5}] |
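To give a sense of how enrichment p-values of this kind can arise, the sketch below computes a one-sided hypergeometric tail probability, a common choice for GO enrichment. It is an assumption that this matches the test behind the tabulated values; FunSpec's exact procedure and any multiple-testing correction are not reproduced. The genome-wide count of annotated proteins (300) is a hypothetical number chosen only for the example, whereas 4906, 17 and 10 are taken from the text and the table.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """Probability of at least k annotated proteins in a cluster of size n,

    drawn without replacement from N proteins of which K carry the annotation.
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# 4906 proteins in the network, a cluster of 17 proteins, 10 of them sharing an annotation.
# K = 300 annotated proteins genome-wide is a hypothetical value for illustration only.
p = hypergeom_pvalue(N=4906, K=300, n=17, k=10)
print(f"enrichment p-value under these assumptions: {p:.2e}")
```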

We show the sub-network diagram for eigenvector #106 in Figure

**The sub-network diagram of significant proteins and their interactions for eigenvector #106**. Each node represents a protein, and each edge an experimental interaction in the GRID database. No interactions were indicated for the four central proteins, but by inference their function should be related. Experimental data indicate that SGS1 can be essential[83] in the absence of TOP1, so possibly these two proteins may substitute functionally for one another, thereby suggesting the additional interactions shown by dotted lines.

Another interesting aspect of this eigencluster shown in Fig.

Functional modules can be utilized to successfully predict new interactions

After we obtained the preliminary results, new interactions obtained by Krogan ^{th }eigenvector cluster shown in Fig.

**The cluster corresponding to proteins and interactions in eigenvector #124**. The edges represent interaction information confirmed by experiments. Each node represents a protein. TOP1 and ARP1 are proteins originally found to be unconnected to the network in the cluster in the previous version of the GRID database. The thick line is the newly discovered interaction between proteins TOP1 (YOL006C) and TIF4631 (YGR162W) in the newer version of the GRID database. We expect that protein ARP1 (YHR129C) (shown with an arrow) most likely should also be connected to some other members of the cluster.

In this eigenvector, all significant proteins except TOP1 and ARP1 are connected to each other forming a full interactive cluster according to older GRID yeast data. According to our functional clustering hypothesis, these two proteins should, however, be connected to the interactive module of other proteins in the cluster. Was this discrepancy due to limitation of our clustering hypothesis or the lack of data? The new interaction data from Krogan

Conclusion

We have analyzed the yeast protein interaction network by building a connectivity matrix and by applying singular value decomposition to obtain eigenvectors. We have observed that significant proteins in each eigenvector not only have similar degrees, but also are most likely to interact with each other. These proteins therefore form "functional clusters", and these clusters can guide future experiments to predict new interactions. More detailed interpretations of these networks can be obtained by further studies utilizing information about protein structures. Our method can be especially useful for larger, more complex organisms where collection of the protein interaction data is more complicated. Our results encourage further analyses to confirm that functional clusters detected by our method reflect the modular nature of protein interaction networks and originate from the evolutionary preservation of cellular fitness.

Methods

Eigenanalysis of matrices

For a given square matrix **A **of size N × N the eigenvalues λ_{i }and eigenvectors **x**_{i }(1 ≤ i ≤ N) of size N correspond to the solution of the equation

**Ax **= λ**x ** (1)

The equation **Ax **= λ**x **represents a concise notation for a system of linear equations that has nontrivial solutions only if the determinant

det (**A **- λ**I**_{N}) = 0 (2)

where **I**_{N }is the identity matrix of size N × N. This is satisfied only for certain values of λ, called eigenvalues, which are roots of the characteristic equation of **A **(that is, a polynomial of degree N in λ). For each eigenvalue λ_{i }(1 ≤ i ≤ N) there is a corresponding eigenvector **x**_{i }that satisfies the equation **Ax**_{i }= λ_{i}**x**_{i}. If some eigenvalues of the matrix **A **are zero, then the matrix **A **is singular, its determinant det **A **= 0, and the inverse matrix **A**^{-1 }that satisfies the relation **AA**^{-1 }= **A**^{-1}**A **= **I**_{N }does not exist. A standard mathematical approach to deal with such cases is the computation of the matrix pseudoinverse by using the singular value decomposition method, which will be discussed in the next sub-section.
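A small worked example may help here. For the 2 × 2 matrix with rows (-1, 1) and (1, -1), the characteristic equation is (-1 - λ)^2 - 1 = λ(λ + 2) = 0, so the eigenvalues are -2 and 0 and the matrix is singular. The sketch below checks this numerically with NumPy and also computes the pseudoinverse; the matrix is a hypothetical two-protein example, and `numpy.linalg.pinv` is used as one standard SVD-based way to obtain a pseudoinverse.

```python
import numpy as np

# Hypothetical two-protein connectivity matrix: one interaction, negative row sums on the diagonal.
A = np.array([[-1.0, 1.0],
              [1.0, -1.0]])

eigvals, eigvecs = np.linalg.eigh(A)  # ascending eigenvalues; columns of eigvecs are eigenvectors

# Check A x = lambda x for every eigenpair.
for k, lam in enumerate(eigvals):
    x = eigvecs[:, k]
    assert np.allclose(A @ x, lam * x)

print("eigenvalues:", eigvals)
print("determinant:", np.linalg.det(A))  # zero up to rounding, so no ordinary inverse exists

# The pseudoinverse is still defined (NumPy computes it via SVD).
A_pinv = np.linalg.pinv(A)
print("pseudoinverse:\n", A_pinv)
```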

Singular value decomposition

Generally, any matrix **A **of size M × N (with M ≥ N) can be written as a product

**A **= **UΛV**^{T } (3)

where ** Λ **is the square matrix of size N × N containing non-negative values λ

It can be shown that the original contact (connectivity) matrix **C **= [C_{ij}] for the protein network can be written as

**C **= **UΛU**^{T } (4)

where ** Λ **is the diagonal matrix containing eigenvalues λ

C_{ij }= Σ_{k }λ_{k }u_{ki }u_{kj } (5)

where u_{ki }denotes the i^{th }component of the eigenvector corresponding to the k^{th }eigenvalue. Equation 5 can be viewed as the eigenvalue expansion of the contact matrix. From Eq. 5 it follows:

The eigenvalues with the smallest indices (that correspond to the largest absolute values of λ, as seen in Fig.
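The eigenvalue expansion of the contact matrix mentioned above can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a hypothetical 4 × 4 matrix, builds one rank-one term per eigenvalue, and reports how much of the matrix is left unexplained as the terms are added in order of decreasing absolute eigenvalue.

```python
import numpy as np

# Hypothetical 4-protein connectivity matrix (negative row sums on the diagonal).
C = np.array([[-3.0, 1.0, 1.0, 1.0],
              [1.0, -2.0, 1.0, 0.0],
              [1.0, 1.0, -2.0, 0.0],
              [1.0, 0.0, 0.0, -1.0]])

# Columns of U are the eigenvectors; eigenvalues come out in ascending order,
# which here is also the order of decreasing absolute value (all are <= 0).
eigvals, U = np.linalg.eigh(C)

# One rank-one term lambda_k * u_k u_k^T per eigenvalue.
terms = [lam * np.outer(U[:, k], U[:, k]) for k, lam in enumerate(eigvals)]
full = sum(terms)

# Residual (Frobenius norm) after keeping only the first m terms.
errors = [float(np.linalg.norm(C - sum(terms[:m]))) for m in range(1, len(terms) + 1)]
print("residual after each added term:", np.round(errors, 3))
```

In this toy case the residual shrinks fastest for the first terms, since each term's contribution scales with the absolute value of its eigenvalue.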

Acknowledgements

The authors acknowledge the financial support provided by the NIH grants R01GM072014 and R33GM066387. The authors would also like to thank James C. Coyle for his assistance with LAPACK.