Abstract
Background
Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks.
Results
Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a nonlinear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors).
Conclusion
SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from http://chembank.broad.harvard.edu/resources/. Source code is available upon request.
Background
The field of graph theory concerns itself with the formal study of graphs – structures containing vertices and edges linking these vertices. Scientifically, graphs can be used to represent networks embodying many different relationships among data, including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as nodes (vertices) and interactions (edges) that can carry different weights.
Graph-theoretic metrics, including eigenspectra, have been used to analyze diverse sets of data in the fields of computational chemistry and bioinformatics. Protein-protein interaction networks in Saccharomyces cerevisiae, for example, have been shown to exhibit scale-free properties [1], and databases of mRNAs can be mined using spectral properties of graphs created by their secondary structure [2]. Graph theory has also been used in conjunction with combinations of small-molecule probes to derive signatures of biological states using chemical-genomic profiling [3].
Despite the widespread use of graph theory in these fields, however, there are few user-friendly tools for analyzing network properties. SpectralNET is a graphical application that calculates a wide variety of graph-theoretic metrics, including eigenvalues and eigenvectors of the adjacency matrix (a simple matrix representation of the nodes and edges of a graph) [4], Laplacian matrix [5], and normalized Laplacian matrix, for networks that are either randomly generated or uploaded by the user. SpectralNET is available both as an ASP.NET web application and as a standalone .NET executable. While SpectralNET was originally written to analyze chemical genetic assay data, it should be of use to any researcher interested in graph-theoretic metrics and eigenspectra.
Implementation
SpectralNET was originally written as an ASP.NET application in C#, and has subsequently been ported to a standalone .NET executable version (also written in C#). ASP.NET was originally chosen because it provided a fast, easy way to offer users a thin client, obviating the need for large amounts of computational power on the client machine, as is often needed to perform large matrix calculations. A standalone version was created for three primary reasons: it avoids the problem of timeouts inherent when using a web interface (a potential issue when performing long-duration calculations), it is more easily distributable, and porting from ASP.NET to a .NET executable is a relatively simple matter.
Many computations are performed directly in C#, such as graph instantiation and metric calculation. Matrix computations (including eigendecomposition) are performed using the NMath Suite (CenterSpace Software, Corvallis, Oregon). Because the NMath Suite is a commercially licensed library, those receiving source code from the authors must supply their own means of performing matrix eigendecomposition in order to modify and redeploy the application. The implementation of the Fermi-Dirac integral, used in the calculation of spectral density, is ported from Michele Goano's implementation in FORTRAN [6]. Because SpectralNET uses a third-party library for matrix calculations that is partially implemented using Managed Extensions for C++, SpectralNET will not be portable to Linux until the Mono implementation of this C++ language feature is complete.
Results and discussion
Graph creation
Idealized random networks can be automatically generated by the application, or networks can be uploaded by the user for analysis. SpectralNET can automatically generate random Erdős-Rényi graphs [7], Barabási-Albert (scale-free) graphs [8], rewiring Barabási-Albert graphs [9], Watts-Strogatz (small-world) graphs [10], or hierarchical graphs [11]. Each automatically generated graph type is customizable with algorithmic parameters. SpectralNET was designed with extensibility in mind, so that users may request additional random graph types, provided they submit a succinct algorithm to the author, or create their own.
Networks can be uploaded by the user in the form of a Pajek file [12] or a tab-delimited text file with one edge per line (see Additional File 1: HumanPPI_nodenodeweight.txt for an example network definition file defining a network of human protein-protein interactions). Raw data files can also be uploaded to the application, where each line of data is represented as a labeled vertex. Vertices can be connected with edge weights equal to the square of the correlation of their associated input data, or according to their Euclidean distance as defined by the Eigenmap algorithm [13]. If raw data is uploaded by the user, principal component analysis (PCA) [14,15] can optionally be performed on the data before calculating edge weights.
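As an illustration of the tab-delimited input described above, the following Python sketch reads one edge per line into a symmetric adjacency map. The column layout (node, node, optional weight), the default weight of 1, the handling of comment lines, and the name parse_edge_list are assumptions made for this example; they are not taken from the SpectralNET source code.

```python
import csv
import io
from collections import defaultdict


def parse_edge_list(text, default_weight=1.0):
    """Parse 'node<TAB>node[<TAB>weight]' lines into a symmetric adjacency map."""
    adjacency = defaultdict(dict)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank, malformed, or comment lines (an assumption of this sketch).
        if len(row) < 2 or row[0].startswith("#"):
            continue
        u, v = row[0].strip(), row[1].strip()
        has_weight = len(row) > 2 and row[2].strip()
        w = float(row[2]) if has_weight else default_weight
        # The network is treated as undirected, so store both directions.
        adjacency[u][v] = w
        adjacency[v][u] = w
    return dict(adjacency)


# Hypothetical three-protein network; the last edge omits its weight.
example = "P1\tP2\t1.0\nP2\tP3\t0.5\nP3\tP1\n"
graph = parse_edge_list(example)
```

Storing each edge in both directions keeps later degree and neighbor lookups simple, at the cost of doubling memory for very large networks.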
Additional File 1. Human PPI network definition file. Network definition file representing a network of human protein-protein interactions. Data for this network was parsed from the MIPS Mammalian Protein-Protein Interaction Database. The numbers contained in this file correspond to the "shortLabel" annotation of proteins in the XML representation of the MIPS database.
Graph analysis
After processing the input network, SpectralNET displays a wide variety of graph-analytic metrics. For example, the degree and clustering coefficient are displayed for each vertex. The degree of a vertex is the number of edges incident upon that vertex; for weighted graphs, SpectralNET calculates this as the sum of these edges' weights. The clustering coefficient of a node represents the proportion of its neighbors that are connected to each other, and is calculated for a node i as:

$$C_i = \frac{2 n_i}{k_i (k_i - 1)}$$
where $n_i$ denotes the number of edges connecting neighbors of node i to each other, and $k_i$ denotes the number of neighbors of node i [16]. In addition, the minimum, average, and maximum distances of each vertex are displayed, which are defined as the shortest, average, and maximum distances, respectively, from the node to any other node in the graph. The components of the adjacency, Laplacian, and normalized Laplacian eigenvectors corresponding to the vertex are also shown, where the adjacency matrix is defined as the matrix A with the following elements:

$$A_{ij} = \begin{cases} w(e) & \text{if an edge } e \text{ connects nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$
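the Laplacian matrix is defined as the matrix L with the following elements:

$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w(e) & \text{if an edge } e \text{ connects nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$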
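where w(e) denotes the weight of edge e; and the normalized Laplacian matrix is defined as the matrix $\mathcal{L}$ with the following elements:

$$\mathcal{L}_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } d_i \neq 0 \\ -\dfrac{w(e)}{\sqrt{d_i d_j}} & \text{if an edge } e \text{ connects nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$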
where $d_i$ denotes the degree of node i [5]. It should be noted that Chung defines the Laplacian matrix as the normalized form above, but we use the more commonly found definition (for an example, see Mohar [17]).
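The per-vertex metrics and the three matrices described above can be sketched in a few lines of Python. This is a minimal illustration and not SpectralNET's C# implementation: the function names, the dense list-of-lists representation, and the choice to carry edge weights into the adjacency and normalized Laplacian matrices are assumptions of the example. Degree is the sum of incident edge weights, as stated in the text.

```python
import math


def build_matrices(n, edges):
    """Return (A, L, N, d) for an undirected graph with n nodes.

    edges: list of (i, j, weight) with 0 <= i, j < n and no self-loops.
    A is the adjacency matrix, L the Laplacian, N the normalized Laplacian,
    and d the list of (weighted) degrees.
    """
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        A[i][j] = A[j][i] = float(w)
    d = [sum(row) for row in A]  # weighted degree of each node
    L = [[d[i] if i == j else -A[i][j] for j in range(n)] for i in range(n)]
    N = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                N[i][j] = 1.0 if d[i] != 0 else 0.0
            elif A[i][j] != 0:
                N[i][j] = -A[i][j] / math.sqrt(d[i] * d[j])
    return A, L, N, d


def clustering_coefficient(A, i):
    """2 * n_i / (k_i * (k_i - 1)); returns 0 for nodes with fewer than two neighbors."""
    nbrs = [j for j, w in enumerate(A[i]) if w != 0 and j != i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if A[nbrs[a]][nbrs[b]] != 0)
    return 2.0 * links / (k * (k - 1))


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 0.
A, L, N, d = build_matrices(4, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (0, 3, 1.0)])
```

In the toy graph, node 0 has three neighbors of which only one pair (1, 2) is connected, so its clustering coefficient is 1/3, while node 1 has a coefficient of 1.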
Many large networks derived from biological data are composed of multiple subgraphs that are not always connected together. SpectralNET computes many properties based on the selected or "active" connected component. For the active connected component, its size and average diameter are displayed in addition to graphs of degree distribution [18], clustering coefficient by degree, and average distance by degree [19]. Graphs of eigenvalues, eigenvectors, inverse participation ratios, and spectral densities of the three matrix types are also displayed. The inverse participation ratio is defined for each normalized eigenvector, summing over its N components, as:

$$I_j = \sum_{k=1}^{N} \left[ (e_j)_k \right]^4$$
where $e_j$ represents the eigenvector. Spectral density, or the density of the eigenvalues, is plotted with the eigenvalues on the horizontal axis and their density on the vertical axis, with the function ρ defined on any eigenvalue as:

$$\rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j)$$

where λ is the eigenvalue, $\lambda_j$ is the j-th of the N eigenvalues, and δ represents the delta function, implemented as described above [20]. Most graphs can be mouse-clicked to select the vertex corresponding to a desired data point, and eigenvalue graphs can be sorted by value or by vertex degree. All calculated graph metrics can be exported as a tab-delimited text file for further analysis.
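The two spectral quantities above can be sketched as follows, given eigenvalues and eigenvectors that have already been computed. Note the substitution: this example replaces the delta function with a narrow Gaussian of adjustable width, which is a common smoothing choice but is not the Fermi-Dirac-based implementation used by SpectralNET. The function names and the default width are assumptions of the sketch.

```python
import math


def inverse_participation_ratio(vec):
    """Sum of fourth powers of the components of the normalized vector."""
    norm2 = sum(x * x for x in vec)
    return sum((x * x / norm2) ** 2 for x in vec)


def spectral_density(eigenvalues, x, width=0.1):
    """(1/N) * sum_j delta(x - lambda_j), with delta approximated by a Gaussian.

    The Gaussian approximation (standard deviation `width`) is an assumption
    of this sketch, not the method used by SpectralNET.
    """
    n = len(eigenvalues)
    coeff = 1.0 / (width * math.sqrt(2.0 * math.pi))
    total = sum(coeff * math.exp(-0.5 * ((x - lam) / width) ** 2)
                for lam in eigenvalues)
    return total / n


# A delocalized eigenvector spreads evenly over all nodes; a localized one
# sits on a single node.
delocalized = inverse_participation_ratio([1.0, 1.0, 1.0, 1.0])
localized = inverse_participation_ratio([0.0, 0.0, 3.0, 0.0])

# Density of a small hypothetical spectrum sampled on a regular grid.
spectrum = [0.0, 1.0, 1.0, 3.0]
grid = [-2.0 + 0.01 * m for m in range(701)]
density = [spectral_density(spectrum, x) for x in grid]
```

The inverse participation ratio ranges from 1/N for an eigenvector spread evenly over N nodes to 1 for an eigenvector confined to a single node, which is why it is useful for spotting localized modes.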
Visualization and dimensionality reduction
The main graph display window of SpectralNET offers two interactive graphical network displays that support zooming and allow vertex selection by mouse-click. The default display view is the graph as laid out by the Fruchterman-Reingold algorithm [21], which positions vertices by force-directed placement. The other available display is the network's Laplacian embedding, which locates vertices in two-dimensional Euclidean space using the corresponding second and third Laplacian eigenvector components (the first Laplacian eigenvector is degenerate: its components are identical for every vertex of a connected component, so it carries no positional information). Other Laplacian eigenvector components can be exported for visualization in higher dimensions.
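A Laplacian embedding of this kind can be sketched in pure Python. SpectralNET itself relies on the NMath Suite for eigendecomposition; the cyclic Jacobi routine below is a stand-in chosen only so that the example needs no external library, and the names jacobi_eigh and laplacian_embedding are hypothetical. The sketch assumes a connected, undirected graph given as (i, j, weight) triples.

```python
import math


def jacobi_eigh(matrix, max_sweeps=50, tol=1e-20):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (values, vectors) sorted by ascending eigenvalue, where
    vectors[m] lists the components of the m-th eigenvector.
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) element.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # A <- A P (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- P^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # V <- V P accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda m: a[m][m])
    values = [a[m][m] for m in order]
    vectors = [[v[k][m] for k in range(n)] for m in order]
    return values, vectors


def laplacian_embedding(n, edges):
    """Place each node at its second and third Laplacian eigenvector components."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    values, vectors = jacobi_eigh(L)
    return list(zip(vectors[1], vectors[2]))


# A 12-node ring: the embedding places the nodes on a circle.
ring = [(i, (i + 1) % 12, 1.0) for i in range(12)]
coords = laplacian_embedding(12, ring)
```

For the unrewired ring, every node lands at the same distance from the origin, which mirrors the ring-shaped embeddings of small-world networks discussed in the example analysis. Jacobi rotations scale poorly with network size, so a production tool would use an optimized linear algebra library instead.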
In conjunction with uploaded raw data, Laplacian embedding allows the user to see a reduced-dimensionality view of high-dimensionality input, once this input is converted into a network. If the user chooses to process input data using the Eigenmap algorithm, Laplacian embedding shows the reduced-dimensionality result [13]. Dimensionality reduction has proven to be a useful tool in computational chemistry and bioinformatics; for example, Agrafiotis [22] used multidimensional scaling (MDS) to reduce the dimensionality of combinatorial library descriptors, and Lin [15] used PCA to analyze single nucleotide polymorphisms from genomic data. We chose to implement Laplacian embedding rather than MDS or other algorithms in SpectralNET because of promising results in the field of machine learning [23]. Although dimensionality reduction is especially useful for analyzing high-dimensional data, Laplacian embedding is an elegant display choice for any input network (see the next section for an example using a scale-free biological network). For a simpler (linear) dimensionality-reduced view of the input data, SpectralNET also has the option of viewing the results of PCA (though this view is not available when a network definition file, such as a Pajek file, is used). Both Laplacian embedding and PCA can be viewed in three dimensions with a Virtual Reality Modeling Language (VRML) viewer.
Example analysis of a randomly generated small-world network and a biological scale-free network
SpectralNET provides an easy-to-use interface for creating a randomly generated small-world network. All that is required is to supply the desired number of nodes, the desired number of neighbors to which to connect each node, and the desired random probability that an edge is rewired. For this example we create a network with 300 nodes in which each node is connected to four neighbors, and edges are rewired with 4% probability.
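The three parameters named above map directly onto the standard Watts-Strogatz construction. The following Python sketch is one way to generate such a network; it is not SpectralNET's own generator, and the function name, the seed parameter, and the rule of never creating self-loops or duplicate edges are assumptions of the example.

```python
import random


def watts_strogatz(n, k, p, seed=None):
    """Ring of n nodes, each joined to its k nearest neighbors (k even),
    with every edge rewired to a random endpoint with probability p."""
    rng = random.Random(seed)
    edges = set()
    # Start from a regular ring lattice.
    for i in range(n):
        for step in range(1, k // 2 + 1):
            edges.add((i, (i + step) % n))
    # Visit each lattice edge once and rewire its far endpoint with probability p.
    for i in range(n):
        for step in range(1, k // 2 + 1):
            old = (i, (i + step) % n)
            if rng.random() < p:
                candidates = [j for j in range(n)
                              if j != i
                              and (i, j) not in edges
                              and (j, i) not in edges]
                if candidates:
                    edges.discard(old)
                    edges.add((i, rng.choice(candidates)))
    return edges


# The network from the example: 300 nodes, 4 neighbors each, 4% rewiring.
network = watts_strogatz(300, 4, 0.04, seed=1)
```

Rewiring preserves the number of edges (here 300 × 4 / 2 = 600), so only the placement of a few long-range shortcuts distinguishes the result from the original ring lattice.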
The default view of the graph is its Fruchterman-Reingold display, which, as noted above, uses force-directed placement to draw graph nodes (Figure 1). While the Fruchterman-Reingold display offers a quickly generated view of large networks, relatively little information about the global organization of the network is observable in the display of this small-world network (one cannot tell, for example, that the graph is a small-world network by its Fruchterman-Reingold display alone). In order to see the graph as drawn by the Laplacian eigenvector components of each node, the "Laplacian Embedding" radio button underneath the graph display is selected. In contrast to the Fruchterman-Reingold display, the Laplacian embedding of this small-world network (Figure 2) conveys significantly more information about its topology. In this display, it is clear that the small-world network was generated by placing neighboring nodes next to each other in a ring-like fashion; the theoretical ring structure is represented literally in the Laplacian embedding.
Figure 1. Fruchterman-Reingold display of a small-world network. Fruchterman-Reingold display of a randomly generated small-world graph. The node selection panel and node information panel are visible to the left of the display.
Figure 2. Laplacian embedding of a small-world network. Laplacian embedding of the randomly generated small-world network depicted in Figure 1, as drawn by SpectralNET.
Real-world biological networks are also amenable to topological analysis using Laplacian embeddings. In order to generate a suitable biological network to analyze, the MIPS Mammalian Protein-Protein Interaction Database [26] was downloaded and parsed into a node-node-weight file for import into SpectralNET (see Additional File 1: HumanPPI_nodenodeweight.txt). The Laplacian embedding of the largest connected component of the resulting graph (Figure 3) shows a central hub of highly connected proteins joined to four connected branches. Spectral analysis similar to that performed below shows that the network is scale-free in nature, as is further evidenced by the fact that there are many more low-degree proteins than high-degree proteins, with the relationship between number of proteins and protein degree following a power-law distribution (data not shown). The scale-free nature of this network suggests that highly connected proteins in the central hub may perform a coordinating role for the proteins in this interaction network. Examining the most highly connected protein in the central hub of the network (indicated in Figure 3) shows that, indeed, it is the transcriptional coactivator SRC-1, which receives and augments signals from multiple pathways [27]. Readers with further interest in topological analysis of biological networks are encouraged to read Farkas et al. [28] for a global analysis of the transcriptional regulatory network of S. cerevisiae or Jeong et al. [29] for an analysis of the protein interaction network of yeast.
Figure 3. Virtual Reality Modeling Language (VRML) diagram of a human protein interaction network. Laplacian embedding of a scale-free biological network generated from a curated online database of protein interactions in humans (MIPS Mammalian Protein-Protein Interaction Database). For data, see Additional File 1: HumanPPI_nodenodeweight.txt.
In addition to the graphical display of networks, SpectralNET enables analysis of spectral properties of input networks, which can shed light on graph topology. One way to do this is to compare the spectral properties of an input network with those of idealized random networks. Here we use a small-world network that is similar, but not identical, to the randomly generated small-world network described above. It was created by attaching complete subgraphs, varying in size from three to six nodes, to nodes arrayed in a ring (see Additional File 2: Smallworld_nodenodeweight for the network definition file; the network was originally described by Comellas [24]) (Figure 4). The spectral properties of this graph can be used to help identify its topology, in this case by comparing its adjacency and Laplacian spectral densities with those of idealized random graphs (Figure 5) [5,20]. Spectral density measures the density of surrounding eigenvalues at each eigenvalue and serves as an especially useful metric of global graph topology. The plot of these values for the example network is most similar to the corresponding plots for a Watts-Strogatz network (in this network, there are 500 nodes connected to 6 neighbors, with a rewiring probability of 1%), despite the fact that there are only 33 nodes in the example network. Thus, even when an example network has relatively few nodes, comparison of spectral properties of the graph to idealized graphs can yield clues about the network's topology.
Additional File 2. Small-world network definition file. Network definition file for a 33-node small-world network with attached complete subgraphs.
Figure 4. Laplacian embedding of an uploaded small-world network. Laplacian embedding of a small-world network (n = 33) created by attaching complete subgraphs to nodes arrayed in a ring. The subgraphs each appear as a single point because their constituent nodes have identical connectivity profiles, yielding identical Laplacian eigenvector components.
Figure 5. Comparison of spectral properties of two small-world networks. Plots of spectral density of the adjacency and Laplacian eigenvalues for a randomly generated Erdős-Rényi graph, a randomly generated Watts-Strogatz graph, a randomly generated Barabási-Albert graph, and the small-world network depicted in Figure 4 consisting of complete subgraphs attached to nodes arrayed in a ring. The input small-world network is most similar to the randomly generated Watts-Strogatz network, since they have the most similar topologies.
Dimensionality reduction of a real-world chemical dataset to analyze QSAR
In addition to performing spectral analysis of networks, SpectralNET can also perform dimensionality reduction on chemical datasets to analyze quantitative structure-activity relationships (QSAR). In this example, we upload a set of chemical descriptor data into SpectralNET and analyze it using the Laplacian Eigenmap algorithm originally developed by Belkin and Niyogi [13]. Each row of the input file describes one small molecule, and all of the molecules were created by the same diversity-oriented synthesis pathway [25]. Each column of the data represents a different molecular descriptor, a metric used to capture one aspect of the compound, such as volume, surface area, or number of rings.
The Laplacian Eigenmap algorithm in SpectralNET connects these small molecules to their K nearest neighbors (measured by Euclidean distance), where K is an algorithmic parameter supplied by the user. In this example, we choose K = 7 to yield a reasonable number of edges in the resulting graph. Weights are assigned to each edge in one of two ways: every edge can have a weight of one, or weights can be assigned to edges by the following formula:

$$W_{ij} = e^{-\lVert x_i - x_j \rVert^2 / t}$$

where $W_{ij}$ represents the weight of an edge connecting nodes i and j, $x_i$ and $x_j$ are the corresponding input data points, and t is an algorithmic parameter [13]. For the molecular descriptor dataset, edge weights of one were chosen (it should be noted that when applying the second method to this dataset, increasing values of t eventually resulted in convergence to the same result as this method around t = 20,000). SpectralNET also offers the option of performing PCA on the input data before running the Laplacian Eigenmap algorithm; this preprocessing is enabled by default and remains enabled for this example.
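The graph-construction step described above can be sketched as follows. This is an illustrative Python version rather than SpectralNET's implementation: the function name knn_graph is hypothetical, and the rule that an edge exists whenever either endpoint lists the other among its K nearest neighbors is an assumption about how the neighbor relation is symmetrized.

```python
import math


def knn_graph(points, k, t=None):
    """Connect each point to its k nearest neighbors by Euclidean distance.

    Returns {(i, j): weight} with i < j. The weight is 1 when t is None,
    and exp(-||x_i - x_j||^2 / t) otherwise (heat-kernel weighting).
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    weights = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sqdist(points[i], points[j]))[:k]
        for j in nearest:
            # An edge is kept if either endpoint selects the other (assumption).
            edge = (min(i, j), max(i, j))
            if t is None:
                weights[edge] = 1.0
            else:
                weights[edge] = math.exp(-sqdist(points[i], points[j]) / t)
    return weights


# Four hypothetical one-descriptor compounds.
data = [(0.0,), (1.0,), (2.5,), (10.0,)]
unit = knn_graph(data, k=1)
heat = knn_graph(data, k=1, t=2.0)
nearly_unit = knn_graph(data, k=1, t=1e9)
```

As t grows, every heat-kernel weight approaches one, which is consistent with the observation above that large values of t converge to the unit-weight result.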
The resultant Laplacian embedding of the graph, which can be viewed by selecting the "Laplacian Embedding" radio button underneath the graph view pane, is the reduced-dimensionality result of the Laplacian Eigenmap algorithm (Figure 6). Like PCA, the Laplacian Eigenmap algorithm performs dimensionality reduction on an input dataset such that relationships among the data are captured by fewer dimensions. Unlike PCA, however, it is not a linear transformation of the data, and the resulting nonlinear dimensionality reduction can offer a more powerful view of the data than does PCA.
Figure 6. Laplacian Eigenmap result for a molecular descriptor dataset. A network of small molecules encoded as molecular descriptors, connected by similarity and displayed using the Laplacian Eigenmap algorithm, which plots each small molecule according to its corresponding Laplacian eigenvector components. Small molecules are colored according to the value of their minimized energy, one of the molecular descriptors of the original dataset.
Because Laplacian Eigenmaps is a local, rather than global, algorithm, it seeks to preserve local topological features of the data in its reduced-dimensionality space [13]. Thus, it is difficult to compare its performance relative to a linear, global algorithm like PCA without labeled features on which to classify the data and a rigorous comparison across multiple datasets and data types. However, by visual inspection of points clustered together in the Laplacian Eigenmap result (from the highlighted areas in Figure 6), one can see that they are structurally similar relative to a set of random compounds selected from the space as a whole (Figure 7), and the two outlier groups visible in the original image are also chemically similar (data not shown). The same dataset plotted on its first two principal components (via PCA) yields no clustering comparable to that of Laplacian Eigenmaps; instead, one large cluster and a second, smaller diffuse cluster are visible (Figure 8). Additional support for nonlinear QSAR methods comes from Douali et al. [30], who found that a nonlinear QSAR approach using neural networks predicted activities very well, outperforming other methods found in the literature. A more rigorous comparison of these algorithms in the context of molecular descriptor data is ongoing.
Figure 7. Comparison of chemical structures from Laplacian Eigenmap clusters. Comparison of chemical structures from the example real-world dataset of molecular descriptors depicted in Figure 6, taken either (A) from the group labeled "A", (B) from the group labeled "B", or (C) at random from the entire set.
Figure 8. Principal Components Analysis result for a molecular descriptor dataset. The network of small molecules depicted in Figure 6, displayed using the first two principal components of the data as derived from PCA. Small molecules are colored according to the values of their minimized energies.
Conclusion
SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. The software allows users to analyze idealized random networks or uploaded real-world datasets, and exposes metrics like the clustering coefficient, average distance, and degree distribution in an easy-to-use graphical manner. In addition, SpectralNET calculates and plots eigenspectra for three important matrices related to the network and provides several powerful graph visualizations.
SpectralNET is available as both a standalone .NET executable and an ASP.NET web application. Source code is available by request from the author.
Availability and requirements
Project name: SpectralNET
Project home page: http://chembank.broad.harvard.edu/resources/
Operating system(s): Windows
Programming language: C#
Other requirements: The .NET framework v1.1 or higher
License: The SpectralNET software is provided "as is" with no guarantee or warranty of any kind. SpectralNET is freely redistributable in binary format for all noncommercial use. Source code is available to noncommercial users by request of the primary author. Any other use of the software requires special permission from the primary author.
Any restriction to use by nonacademics: Contact authors
Authors' contributions
JF developed and tested the software, wrote the initial version of the manuscript, and co-designed the software; PC provided feedback and data for molecular descriptor analysis, assisted with design of the software, and edited the manuscript; SS provided project guidance and edited the manuscript; SH initially conceived of and co-designed the software and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We gratefully acknowledge the Broad Institute of Harvard University and MIT, the National Cancer Institute (Initiative for Chemical Genetics), and the National Institute of General Medical Sciences (Center of Excellence for Chemical Methodology and Library Development) for support of this research. S.L.S. is an Investigator at the Howard Hughes Medical Institute.
References

1. Eisenberg E, Levanon E: Preferential attachment in the protein network evolution. Phys Rev Lett 2003, 91:138701.
2. Fera D, Kim N, Shiddelfrim N, Zorn J, Laserson U, Gan HH, Schlick T: RAG: RNA-As-Graphs web resource. BMC Bioinformatics 2004, 5:88.
3. Haggarty S, Clemons P, Schreiber S: Chemical genomic profiling of biological networks using graph theory and combinations of small molecule perturbations. J Am Chem Soc 2003, 125:10543-10545.
4. Chartrand G: Introductory Graph Theory. New York: Dover; 1985.
5. Chung F: Spectral Graph Theory. Providence: American Mathematical Society; 1997.
6. Goano M: Algorithm 745: Computation of the complete and incomplete Fermi-Dirac integral. ACM Trans Math Software 1995, 21:221-232.
8. Barabási AL, Albert R: Emergence of scaling in random networks.
9. Albert R, Barabási AL: Topology of evolving networks: Local events and universality. Phys Rev Lett 2000, 85:5234-5237.
10. Watts D, Strogatz S: Collective dynamics of 'small-world' networks. Nature 1998, 393:440-442.
11. Barabási AL, Dezso Z, Ravasz E, Yook SH, Oltvai Z: Scale-free and hierarchical structures in complex networks.
12. Batagelj V, Mrvar A: PAJEK – program for large network analysis.
13. Belkin M, Niyogi P: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 2003, 15:1373-1396.
14. Hotelling H: Analysis of a complex of statistical variables into principal components.
15. Lin Z, Altman R: Finding haplotype tagging SNPs by use of principal components analysis. Am J Hum Genet 2004, 75:850-861.
16. Nacher JC, Ueda N, Yamada T, Kanehisa M, Akutsu T: Clustering under the line graph transformation: application to reaction network. BMC Bioinformatics 2004, 5:207.
17. Mohar B: Some applications of Laplace eigenvalues of graphs. In Graph Symmetry: Algebraic Methods and Applications. Edited by Hahn G, Sabidussi G. Dordrecht: Kluwer Academic Publishers; 1997:225-275.
18. Barabási AL, Ravasz E, Vicsek T: Deterministic scale-free networks. Physica A 2001, 299:559-564.
20. Farkas I, Derenyi I, Barabási AL, Vicsek T: Spectra of "real-world" graphs: Beyond the semicircle law. Phys Rev E 2001, 64:026704.
21. Fruchterman T, Reingold E: Graph drawing by force-directed placement.
22. Agrafiotis D, Lobanov V: Multidimensional scaling of combinatorial libraries without explicit enumeration. J Comput Chem 2001, 22:1712-1722.
23. Belkin M, Niyogi P: Semi-supervised learning on Riemannian manifolds. Machine Learning 2004, 56:209-239.
24. Comellas F, Sampels M: Deterministic small-world networks. Physica A 2002, 309:231-235.
25. Stavenger R, Schreiber S: Asymmetric Catalysis in Diversity-Oriented Organic Synthesis: Enantioselective Synthesis of 4320 Encoded and Spatially Segregated Dihydropyrancarboxamides. Angew Chem Intl Ed 2001, 40:3417-3421.
26. Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenback I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes H, Ruepp A, Frishman D: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21:832-834.
27. Kalkhoven E, Valentine J, Heery D, Parker M: Isoforms of steroid receptor co-activator 1 differ in their ability to potentiate transcription by the oestrogen receptor. The EMBO Journal 1998, 17:232-243.
28. Farkas I, Jeong H, Vicsek T, Barabási AL, Oltvai Z: The topology of the transcription regulatory network in the yeast, S. cerevisiae. Physica A 2003, 318:601-612.
29. Jeong H, Mason S, Barabási AL, Oltvai Z: Lethality and centrality in protein networks. Nature 2001, 411:41-42.
30. Douali L, Villemin D, Cherqauoi C: Neural networks: Accurate nonlinear QSAR model for HEPT derivatives. Chem Inf Comput Sci 2003, 43:1200-1207.