European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge CB10 1SD, UK

Channing Laboratory, Brigham and Women's Hospital, 75 Francis Street, Boston MA 02115, USA

Vital-IT Center, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland

Fred Hutchinson Cancer Research Center, Computational Biology Group, 1100 Fairview Avenue North – M2-B876, P.O. Box 19024, Seattle WA 98109-1024, USA

Abstract

Graph theoretical concepts are useful for the description and analysis of interactions and relationships in biological systems. We give a brief introduction to some of these concepts and their areas of application in molecular biology. We discuss software that is available through the Bioconductor project and present a simple example application: the integration of a protein-protein interaction network and a co-expression network.

Introduction

Molecular biology is concerned with enumerating and characterizing all the building blocks of living systems, as well as with their relationships: how the properties and the activity of one element affect those of another. For example, certain proteins have the capability of binding to particular regions of a cell's DNA, thereby activating or inhibiting the transcription of messenger RNA that codes for another protein, or even for that protein itself. Many proteins have the capability of binding to other proteins, forming a complex that can perform actions that none of the individual constituent proteins could perform alone. There are thousands, perhaps millions, of different types and states of proteins in a living organism, and the number of possible interactions between them is enormous. The language of graph theory offers a mathematical abstraction for the description of such relationships. The beauty and usefulness of this abstraction is that it allows concepts and tools to be developed independently of the concrete application. Many scientists and engineers are familiar with the benefits of abstraction that lie in linear algebra, calculus or probability theory; the goal of this article is to demonstrate some of the scope and power of the theory of graphs for the biology of gene regulation.

A graph consists of a set of nodes and a set of edges that connect the nodes. The nodes are the entities of interest and the edges represent relationships between the entities. For example, the entities may be a set of proteins in a cell, and the relationship modeled may be the existence of a physical interaction between two proteins. A graph is specified by the set of nodes

We will use the terms

Applications

Graphs play roles in three complementary areas. First, graphs provide a data structure for

A second application of graphs to molecular biology is to model measured data. Many types of molecular biological experiments produce data that convey relationships between molecules. For example, in a yeast two-hybrid screen, the data are the observation that a pair of proteins worked together to create a transcription initiation complex. In a chromatin immunoprecipitation microarray experiment (ChIP-chip), the data are the strength of the binding of the pulled-down protein to the queried DNA regions, which themselves may be linked to one or several genes whose transcription is regulated through them.

A further role for graphs is in

Graphical models

Definitions

The presentation and notation here is based on that used in

A graph is specified by the set of nodes (the term

In some cases, such as in transcription factor networks, the relationships between the nodes in the graph are directed. A graph can simultaneously contain directed and undirected relationships.

An edge is said to be

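Two nodes are said to be adjacent if they are joined by an edge. In an adjacency matrix $A$, the elements $a_{ij}$ denote the presence (and possibly, weight or type) of an edge from node $i$ to node $j$.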

The

In Figure

**A simple graph**. An example for a simple directed graph.

A walk from node $v_0$ to node $v_n$ is an alternating sequence of nodes and edges,

$$\langle v_0, e_1, v_1, \dots, v_{n-1}, e_n, v_n \rangle$$

such that the endpoints of $e_i$ are $v_{i-1}$ and $v_i$ for $i = 1, \dots, n$. The walk is closed if $v_0 = v_n$.
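The condition in this definition, that each edge in the sequence joins the node before it to the node after it, can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not code from the article: the function names `is_valid_sequence` and `is_closed` and the toy edge labels are hypothetical, and edges are treated as undirected.

```python
def is_valid_sequence(sequence, edges):
    """Check an alternating node/edge sequence [v0, e1, v1, ..., en, vn].

    `edges` maps an edge label to its pair of endpoints (undirected).
    """
    if len(sequence) % 2 == 0:
        # A valid sequence starts and ends with a node, so its length is odd.
        return False
    for i in range(1, len(sequence), 2):
        label = sequence[i]
        if label not in edges:
            return False
        # The endpoints of e_i must be the neighbouring nodes v_{i-1} and v_i.
        if set(edges[label]) != {sequence[i - 1], sequence[i + 1]}:
            return False
    return True


def is_closed(sequence):
    """A sequence is closed if it ends at the node where it started."""
    return sequence[0] == sequence[-1]


# Toy triangle graph (illustrative data only).
edges = {"e1": ("a", "b"), "e2": ("b", "c"), "e3": ("c", "a")}
triangle = ["a", "e1", "b", "e2", "c", "e3", "a"]
broken = ["a", "e2", "c"]  # e2 joins b and c, not a and c
```

Here `is_valid_sequence(triangle, edges)` holds and the sequence is closed, whereas `broken` is rejected because `e2` does not join its neighbouring nodes.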

A node

The

For a graph

**Set operations on graphs**. Set operations on two undirected graphs ug1 and ug2.

Connectivity properties can also be described in terms of nodes. Sometimes there is interest in those nodes whose deletion from a connected graph

A

A

For the sake of simplicity, we now diverge somewhat from the standard graph theoretic terminology for concepts of graph unions and intersections. In many applications, the node and edge sets of the graphs we need to consider are subsets of a large, but limited set of nodes and edges, for example all annotated genes in a genome. We define the

Cohesive subgroups

In applications, the identification of maximal cliques is often of limited interest because the requirement of complete connectivity is so restrictive. When dealing with imperfect systems or with experimental data, we may need to consider more general notions of cohesive subgroups. Our description here follows that of

An $n$-clique is a subgraph $G_s$ such that the distance $d(u, v) \le n$ for every pair of nodes $u, v$ in $G_s$. A 1-clique is the same as a clique.
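The distance condition behind this definition (every pair of nodes in the subgroup lies within a given distance $n$ of each other) can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the article: the function names are hypothetical, the graph is undirected and unweighted, and distances are measured along shortest paths in the whole graph, which is one common convention and an assumption here.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def within_distance(adj, members, n):
    """True if every pair of nodes in `members` is at distance <= n."""
    for u, v in combinations(members, 2):
        d = bfs_distances(adj, u).get(v)  # None if v is unreachable from u
        if d is None or d > n:
            return False
    return True


# Toy path graph a - b - c - d (illustrative data only).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

On this path graph, the nodes `a`, `b`, `c` satisfy the condition for n = 2, while adding `d` violates it because `a` and `d` are three edges apart.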

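A $k$-plex is a maximal subgraph $G_s$ containing $g_s$ nodes, in which each node is adjacent to no fewer than $g_s - k$ nodes in the subgraph. Let $d_s(i)$ denote the degree of node $i$ within $G_s$. Then a $k$-plex is a subgraph $G_s$ such that $d_s(i) \ge g_s - k$ for all nodes $i$ in $G_s$, and such that no further node can be added to $G_s$ without violating this condition.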
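The degree condition in this definition (each node of a subgroup with $g_s$ nodes is adjacent to no fewer than $g_s - k$ nodes of the subgroup) is easy to test for a candidate node set. The sketch below is a hedged Python illustration with a hypothetical function name; it checks only the degree condition and does not test maximality.

```python
def satisfies_degree_condition(adj, members, k):
    """True if every node in `members` has at least g_s - k neighbours
    inside `members`, where g_s is the number of nodes in `members`."""
    members = set(members)
    g_s = len(members)
    return all(len(members & set(adj[u])) >= g_s - k for u in members)


# Toy graphs (illustrative data only): a 4-cycle and a triangle.
cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
```

In the 4-cycle each node has two neighbours among the four members, so the condition holds for k = 2 but not for k = 1; in the triangle it already holds for k = 1, which corresponds to complete connectivity.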

One way to view this definition is that we are allowing up to

A

Distances

The length of paths between nodes in a graph can be used to induce a distance between nodes. In many cases, the shortest path will be used, but other alternatives may be appropriate for applications. If the graph has weighted edges, then these can easily be accommodated. Multi-graphs (graphs with multiple types of edges) can have different distances determined by the different types of edges. Other notions of distance, such as the number of paths that exist between two points

For example, for the Gene Ontology (GO) graph, two measures between the subgraphs $g_i$ and $g_j$ have been used: (i) a measure between subgraph $g_i$ and subgraph $g_j$ that is computed as the length of the shortest shared path to the root node, and (ii) the similarity between subgraph $g_i$ and subgraph $g_j$, which is computed as the number of nodes in the intersection of the two subgraphs divided by the number of nodes in their union. We note though that the relations in the GO graph are not designed to imply distances between the terms.
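The second measure, the size of the intersection divided by the size of the union, can be sketched as follows. This is a minimal Python illustration, not the article's code: the function name `union_intersection_similarity` and the node labels are hypothetical, and returning 0.0 for two empty node sets is an assumption made only to avoid division by zero.

```python
def union_intersection_similarity(nodes_i, nodes_j):
    """Number of shared nodes divided by the number of nodes in the union."""
    a, b = set(nodes_i), set(nodes_j)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty subgraphs have similarity 0
    return len(a & b) / len(union)


# Hypothetical node sets of two subgraphs that share the root and one term.
g_i = {"root", "term1", "term2"}
g_j = {"root", "term1", "term3"}
similarity = union_intersection_similarity(g_i, g_j)  # 2 shared / 4 in union
```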

Once a decision has been made about a distance measure for objects organized in a graph, standard tools for cluster analysis or multidimensional scaling can be applied to the inter-object distances. Naturally, the choice of the distance measure is essential for the outcome of the analysis, and the choice should not be driven by mathematical or computational convenience, but rather by a good understanding of the biological question.
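As an illustration of how a path-based distance yields a matrix of inter-object distances, the sketch below computes shortest-path lengths on a graph with weighted edges using Dijkstra's algorithm. It is a hedged Python sketch with hypothetical function names, assuming non-negative edge weights; unreachable pairs are given an infinite distance, which is a choice made here and one that downstream clustering tools may need to handle explicitly.

```python
import heapq


def shortest_path_lengths(adj, source):
    """Dijkstra's algorithm; `adj` maps node -> {neighbour: non-negative weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def distance_matrix(adj):
    """All-pairs shortest-path distances, rows and columns in sorted node order."""
    nodes = sorted(adj)
    rows = []
    for u in nodes:
        dist = shortest_path_lengths(adj, u)
        rows.append([dist.get(v, float("inf")) for v in nodes])
    return nodes, rows


# Toy weighted graph with one isolated node (illustrative data only).
adj = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 4.0, "b": 2.0},
    "d": {},
}
nodes, dmat = distance_matrix(adj)
```

In this example the distance from `a` to `c` is 3.0 (via `b`) rather than the direct edge weight 4.0, and the isolated node `d` is at infinite distance from all others.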

Special types of graphs

There are special types of graphs that deserve attention because they play important roles in applications. The main ones are bipartite graphs, hypergraphs, and directed acyclic graphs (DAGs).

Bipartite Graphs

If the nodes of a graph

Two graphs called ^{t }and the one mode graph for ^{t }⊗

The

In social network analysis, the two types of nodes are often referred to as

It is worth noting that

Hypergraphs

Hypergraphs are closely related to bipartite graphs

A _{i }= {

The hyperedges in a _{i }= (_{i }while _{0 }≡ _{1}, _{1},..., _{n}, _{n }≡ _{i }is distinct, and for _{i-1 }= tail(_{i}) and _{i }= head(_{i}).

Directed acyclic graphs

An important class of directed graphs are the

Uncertainty and missing edges

Using graphs as models for data analysis and data representation poses a number of challenges. In many cases, the reported graphs are imperfect.

While the presence of an edge between two nodes usually has a well-defined interpretation, for non-edges the interpretation is often less clear. We can distinguish between two cases: the existence of the edge was tested and not found, or it was never tested in the first place. Both cases are usually represented by the absence of an edge, but their interpretation is quite different.

The error rates in binary data are often described by the concepts of false positives and negatives, but in many applications we will need to address the following three categories:

**false positives** – relationships that appear in the experimental data, but are not real;

**false negatives** – relationships that are real and were probed experimentally, but were erroneously not detected; and

**untested relationships** – where no measurement was attempted and hence no information is available.
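One way to keep tested-but-absent and untested relationships apart in software is to record, alongside the observed edges, the set of node pairs that were actually probed. The following minimal Python sketch illustrates the idea; the names `EdgeStatus` and `edge_status` and the protein labels are hypothetical and are not part of any package discussed in this article.

```python
from enum import Enum


class EdgeStatus(Enum):
    OBSERVED = "observed"                  # edge reported by the experiment
    TESTED_NOT_FOUND = "tested, not found" # pair was probed, no edge detected
    UNTESTED = "untested"                  # pair was never probed


def edge_status(u, v, observed, tested):
    """Classify the undirected pair (u, v).

    `observed` and `tested` are sets of frozensets of node pairs;
    every observed pair is treated as tested.
    """
    pair = frozenset((u, v))
    if pair in observed:
        return EdgeStatus.OBSERVED
    if pair in tested:
        return EdgeStatus.TESTED_NOT_FOUND
    return EdgeStatus.UNTESTED


# Illustrative data only: P1-P2 was detected, P1-P3 was probed without success.
observed = {frozenset(("P1", "P2"))}
tested = {frozenset(("P1", "P2")), frozenset(("P1", "P3"))}
```

With this bookkeeping, the pair P2-P3 is reported as untested rather than as a non-edge, so an analysis can treat it as missing data.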

In order to make appropriate use of the data, we will need to keep these issues in mind as we explore the resultant graphs. Uncertainty is usually not part of a purely mathematical approach to graph theory, but it cannot be ignored in the context of experimental data. Uncertainty affects how we use and think about graphs or networks. Uncertainty of relationships being modeled also impacts the design of software, the choice of algorithms, and the interpretation of the output.

Particular attention is due to the fact that the three sources of error mentioned above often do not occur "randomly", but may be associated with properties of the nodes. For example, more research has been directed towards genes that are known to be implicated in human diseases, hence it should come as no surprise that literature-based interaction networks are denser, and may indeed contain more false positives and fewer untested relationships, in regions around these genes than around less popular genes.

Computational aspects

Representation

An abstract graph can be represented for computational purposes in many different ways. Among the common representations are

**node and edge list** – a list whose elements correspond to the nodes in the graph, where each element consists of two objects: the name of a node, and the list of other nodes to which it is connected.

**adjacency matrices** – a square matrix whose elements can be Boolean, real-valued or categorical variables and denote the existence, weight or type of an edge.

**from-to matrices** – a matrix with two or more columns, in which each row contains the start and end nodes of an edge and possibly weights, types, etc.
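The three representations listed above carry the same information and can be converted into one another. The sketch below shows this in Python for a small undirected, unweighted graph; it is an illustration under stated assumptions, with hypothetical function names, and is not the data structure used by the Bioconductor packages.

```python
def from_to_to_adjacency_list(nodes, from_to):
    """Convert a from-to list of undirected edges into a node and edge list."""
    adj = {n: [] for n in nodes}
    for u, v in from_to:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def adjacency_list_to_matrix(nodes, adj):
    """Convert a node and edge list into a Boolean adjacency matrix."""
    index = {n: k for k, n in enumerate(nodes)}
    matrix = [[False] * len(nodes) for _ in nodes]
    for u, neighbours in adj.items():
        for v in neighbours:
            matrix[index[u]][index[v]] = True
    return matrix


# Toy graph with one isolated node (illustrative data only).
nodes = ["a", "b", "c", "d"]
from_to = [("a", "b"), ("b", "c")]
adj = from_to_to_adjacency_list(nodes, from_to)
matrix = adjacency_list_to_matrix(nodes, adj)
```

Note that the from-to list alone cannot represent the isolated node `d`, which is why the node set is passed separately; the adjacency matrix grows with the square of the number of nodes, whereas the other two grow with the number of edges.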

For bipartite graphs with node sets

The representation used for a graph can have a profound effect on the running time of algorithms that are applied to it. It is advisable to make timing comparisons on different representations before committing to a particular one. The most appropriate or efficient strategy for representing the graph will depend on many factors such as the size of the graph and the types of operations that are going to be applied to it. The

Algorithms

There are many existing, well-tested and high-quality implementations of graph algorithms. It is inefficient and often more error-prone to reimplement algorithms for which good implementations already exist. Bioconductor provides interfaces to many of the algorithms coded in the open source Boost graph library

However, good implementations for many of the algorithms required in bioinformatics are still needed. Algorithms adapted to deal with incompleteness and uncertainty are of particular interest. For example, Scholtens and Gentleman

Software from the Bioconductor project

Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data

Among the graph-related packages, it is worth differentiating between packages that are mainly infrastructure (sets of tools that can be used to create other pieces of software) and packages that are designed to provide an end-user application. The packages

The

Case study: using graphs for comparing transcription and interaction data

As a very simple example, we demonstrate how graph concepts can be used to do an analysis that relates gene expression data to protein interaction data.

Proteins that form a functional complex need to be expressed concurrently, hence we expect that something can be learned from comparing co-expression and protein complex co-membership. In particular, we consider the question of whether genes in a protein complex are more likely to have a similar pattern of gene expression than genes in different complexes.

The analysis that we demonstrate in the following was reported by Balasubramanian et al.

For concreteness, we will show the R programming code to perform this analysis. Figures

**Sweave source code of the article**. The Sweave markup of this paper, including the text in LaTeX format and the program code for the example analysis in the Case Study and the generation of all figures.

**The largest connected component of the PPI graph**. Layout of the connected component sG1 of the protein-protein interaction graph litG.

**The second-largest connected component of the PPI graph**. Layout of the connected component sG2 of the protein-protein interaction graph litG.

**Statistical significance of the overlap between PPI and co-expression graphs**. The

**2-cliques in the overlap graph**. Layout of the overlap graph commonG. There are three 2-cliques, each of size at least 4, marked by node color. Two nodes are part of two different 2-cliques, marked in a darker color.

**k-cores in the overlap graph**. Layout of the overlap graph commonG. Three 2-cores are marked by node color.

We set up the comparison by creating the two graphs as objects in the R language and counting how many edges they have in common. To see whether this number is significantly above what could be expected

The Data

The R package

A graphNEL graph with undirected edges

Number of Nodes = 2885

Number of Edges = 315

The code above shows the first two rows (genes) of ccyclered, the sizes of the 30 clusters, and a summary of the

Exploration of the PPI graph

To explore the graph litG, we can employ the functionality of the package

cc is a list of the connected components of litG. There are 2587 singletons (connected components of size 1), and the largest connected component has size 88. Let us plot the two largest components using the
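For readers who want to see what a connected-components computation involves, the following Python sketch finds the components of a small undirected graph and sorts them by size. It is a language-neutral illustration with a hypothetical function name and toy data, not the R code used for litG.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph, largest first.

    `adj` maps each node to an iterable of its neighbours.
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:  # depth-first traversal from `start`
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    components.sort(key=len, reverse=True)
    return components


# Toy graph: a - b - c, d - e, and a singleton f (illustrative data only).
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "d": ["e"], "e": ["d"],
    "f": [],
}
cc = connected_components(adj)
n_singletons = sum(1 for component in cc if len(component) == 1)
```

Here `cc[0]` is the largest component and `n_singletons` counts the components of size 1, mirroring the summary given in the text.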

select the largest subgraph,

lay it out using the function agopen, which is an interface to the

The graph is shown in Figure

Construction of the cluster graph

There is a specialized graph class

Statistical analysis of the graph overlap

It is now easy to determine how many pairs of genes both have a protein-protein interaction and are found in the same expression cluster. We find the intersection of the cluster-graph and the literature graph using the R function intersection.

A graphNEL graph with undirected edges

Number of Nodes = 2885

Number of Edges = 42
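Conceptually, the intersection of two undirected graphs on the same node set keeps exactly those edges that occur in both. A minimal Python sketch of this operation follows; the function names and the toy edges are hypothetical, and the sketch is not the implementation behind the R function.

```python
def edge_set(edges):
    """Represent undirected edges as frozensets so that (u, v) equals (v, u)."""
    return {frozenset(edge) for edge in edges}


def edge_intersection(edges_1, edges_2):
    """Edges present in both graphs, assuming a shared set of node labels."""
    return edge_set(edges_1) & edge_set(edges_2)


# Illustrative data only: an interaction graph and a co-expression cluster graph.
ppi_edges = [("A", "B"), ("B", "C"), ("C", "D")]
cluster_edges = [("B", "A"), ("C", "D"), ("A", "D")]
common = edge_intersection(ppi_edges, cluster_edges)
```

In this toy example two edges, A-B and C-D, are common to both graphs; the order in which the endpoints are listed does not matter.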

We find that 42 edges are in common. We now ask whether this number is statistically interesting, i.e. different from what could be expected by chance. We do this by generating a null distribution via permutation of node labels on the observed graph. The following function implements this.
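The permutation procedure can be sketched as follows in Python. This is a hedged illustration of the general technique, not the article's R function: all names are hypothetical, the node labels of the first graph are permuted while the second graph is held fixed, and the one-sided empirical p-value with a +1 correction is a common convention assumed here.

```python
import random


def count_common_edges(edges_1, edges_2):
    """Number of undirected edges shared by two graphs on the same node labels."""
    set_1 = {frozenset(e) for e in edges_1}
    set_2 = {frozenset(e) for e in edges_2}
    return len(set_1 & set_2)


def permutation_null(nodes, edges_1, edges_2, n_perm=1000, seed=0):
    """Null distribution of the overlap under random relabelling of graph 1."""
    rng = random.Random(seed)
    nodes = list(nodes)
    null_counts = []
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))  # random bijection on node labels
        permuted = [(relabel[u], relabel[v]) for u, v in edges_1]
        null_counts.append(count_common_edges(permuted, edges_2))
    return null_counts


def empirical_p_value(observed, null_counts):
    """One-sided p-value: fraction of permutations with overlap >= observed."""
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)


# Illustrative data only.
nodes = list("abcdefgh")
ppi_edges = [("a", "b"), ("b", "c"), ("c", "d")]
cluster_edges = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
observed = count_common_edges(ppi_edges, cluster_edges)
null_counts = permutation_null(nodes, ppi_edges, cluster_edges, n_perm=200, seed=1)
p_value = empirical_p_value(observed, null_counts)
```

Permuting node labels preserves the structure of each graph (its degree sequence and number of edges) and only breaks the correspondence between the two, which is the hypothesis being tested.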

Figure

Cohesive subgroups

Let us look at cohesive subgroups of the intersection graph commonG. First, we remove the singleton nodes,

then we use the functions from the

kcliq, the return value of kCliques is a list whose

[[1]]

[1] "YBR009C" "YBR010W" "YNL030W" "YNL031C"

[[2]]

[1] "YBL035C" "YJR043C" "YNL102W" "YPR135W"

[[3]]

[1] "YBR088C" "YDL102W" "YJR006W" "YJR043C" "YNL102W"

Remember that a 2-clique is a subgraph in which the distance between each pair of nodes is ≤ 2. Any subgraph of size ≤ 3 satisfies this requirement trivially, hence we consider those with size ≥ 4. They are shown in Figure

YBL035C YJR043C YNL102W YPR135W

"POL12" "POL32" "POL1" "CTF4"

YBL035C

B subunit of DNA polymerase alpha-primase complex, required for initiation of DNA replication during mitotic and premeiotic DNA synthesis; also functions in telomere capping and length regulation

YJR043C

Third subunit of DNA polymerase delta, involved in chromosomal DNA replication; required for error-prone DNA synthesis in the presence of DNA damage and processivity; interacts with Hys2p, PCNA (Pol30p), and Pol1p

YNL102W

Catalytic subunit of the DNA polymerase alpha-primase complex, required for the initiation of DNA replication during mitotic DNA synthesis and premeiotic DNA synthesis

YPR135W

Chromatin-associated protein, required for sister chromatid cohesion; interacts with DNA polymerase alpha (Pol1p) and may link DNA synthesis to sister chromatid cohesion

The first 2-clique is a duplicated pair of histone proteins:

YBR009C YBR010W YNL030W YNL031C

"HHF1" "HHT1" "HHF2" "HHT2"

A _{max }in the graph, the second is a list of length 3 with the

[1] 2

[1] "lambda-0 sets" "lambda-1 sets" "lambda-2 sets"

[[1]]

[1] "YBR009C" "YBR010W" "YNL030W" "YNL031C"

[[2]]

[1] "YDL102W" "YJR006W" "YJR043C"

[[3]]

[1] "YDR356W" "YHR172W" "YNL126W"

In this particular example, we note that the

Discussion

Graphs play a role in computational molecular biology in many ways, among them: the representation and integration of experimental datasets as graphs; the interactive navigation and visualization of these large and complex datasets by a human researcher; the computation of solutions to problems such as cliques and cohesive subgroups, graph alignment, optimal paths or path-sets; and the estimation of, and statistical inference on, an underlying ("hidden") graph from noisy observational data.

There is a substantial body of existing methodology in graph theory that is relevant to these questions, and it is a challenging and exciting task to establish the most appropriate and effective models. There is a need for theoretical development of the field, but also for software that integrates data analytic and statistical inference capabilities with methods for querying and manipulating graphs.

We have produced an approach to such an environment in Bioconductor. We made extensive use of existing software in particular from the Graphviz

Acknowledgements

W.H. and R.G. acknowledge support from HFSP research grant RGP0022/2005-C. L.L. was supported by a grant from Intel Corp. to the Vital-IT Center. We are thankful to the Boost and Graphviz groups for providing their software.

This article has been published as part of