Faculty of Computer and Information Science, University of Ljubljana, Slovenia

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX, USA

Abstract

Background

Researchers in systems biology use network visualization to summarize the results of their analysis. Such networks often include unconnected components, which popular network alignment algorithms place arbitrarily with respect to the rest of the network. This can lead to misinterpretations due to the proximity of otherwise unrelated elements.

Results

We propose a new network layout optimization technique called FragViz which can incorporate additional information on relations between unconnected network components. It uses a two-step approach by first arranging the nodes within each of the components and then placing the components so that their proximity in the network corresponds to their relatedness. In the experimental study with the leukemia gene networks we demonstrate that FragViz can obtain network layouts which are more interpretable and hold additional information that could not be exposed using classical network layout optimization algorithms.

Conclusions

Network visualization relies on computational techniques for proper placement of objects under consideration. These algorithms need to be fast so that they can be incorporated in responsive interfaces required by the explorative data analysis environments. Our layout optimization technique FragViz meets these requirements and specifically addresses the visualization of fragmented networks, for which standard algorithms do not consider similarities between unconnected components. The experiments confirmed the claims on speed and accuracy of the proposed solution.

Background

From the onset of systems biology, visualization of networks has played a key role in communicating the relations between objects of interest and the structure of the problem domain. Gene networks

Formally, a network is a graph which consists of vertices (nodes) linked by edges. In systems biology, vertices can represent genes, proteins, metabolites, diseases, or other objects of interest. Edges abstract the relations between these objects.

The network often consists of a large number of unconnected components, like the recently published yeast protein interaction network

For illustration consider the network from Figure

**Four components of the leukemia gene network from Figure 2**. The layout was optimized by a standard Fruchterman-Reingold algorithm (a) and by FragViz (b). FragViz optimization additionally used the information on vertex distances.

**The network (N1) of the significantly differentially expressed genes from the leukemia data set, where the similarity between the chosen genes was calculated based on their co-membership in biological pathways**. Only components with at least two vertices (similarity threshold equal to 0.7) are included in the network. Genes represented with solid circles were significantly over-expressed in the ALL samples and genes shown as empty circles had higher expression in the AML samples. The individual components are named according to the prevailing Gene Ontology annotation. Components were grouped and labeled manually by the expert.

In the paper we introduce a generally applicable technique called FragViz for placing the components according to the background data on their similarity. For example, rendering a network from Figure

FragViz uses a two-step network layout optimization procedure. It first applies the standard Fruchterman-Reingold algorithm separately on each unconnected component to optimize the layout of its vertices. Then it optimizes the global placement and orientation of components using a semi-physical model where the forces between components are inferred from similarities between the corresponding vertices in these components.

The data on similarity of the network nodes can either come from the same data source used to infer the structure of the network, or can be provided by supplying any additional information. Most often, the network's structure itself is derived from the

Related work

The proposed approach belongs to the family of algorithms for force-directed placement of objects into two-dimensional projections, and is strongly related to two kinds of algorithms: the optimization of network layout and multidimensional scaling (MDS)

The two kinds of algorithms are related. It is possible to lay out a network by representing it with a distance matrix and performing MDS-based optimization. Or, vice versa, we can convert a distance matrix into a weighted complete graph and use a graph layout optimization in place of MDS. The optimizations would yield different results, as each method optimizes its own stress function, designed to match the goals of the particular projection. For instance, in network layout optimization, the projected distance between unconnected vertices has no effect as long as it is large in comparison with distances between the connected vertices. In contrast, MDS optimizes distances between all pairs of objects, including the most distant.
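The two conversions mentioned above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the paper: the function names are hypothetical, the graph is assumed to be unweighted, and shortest-path length is assumed as the graph-derived distance.

```python
from itertools import combinations


def graph_to_distances(n, edges):
    """All-pairs shortest-path lengths (Floyd-Warshall) of an unweighted graph.

    Pairs of vertices in different components keep the distance infinity,
    so a fragmented network cannot be handed to MDS in this form.
    """
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def distances_to_complete_graph(dist):
    """Weighted complete graph: one edge per vertex pair, weight = distance."""
    return [(i, j, dist[i][j]) for i, j in combinations(range(len(dist)), 2)]
```

Note that the first conversion leaves infinite distances between unconnected components, which is exactly the situation in which additional similarity data are needed.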

With regard to the optimization procedure, algorithms make assumptions about the structure of the data. Network layout algorithms work best for graphs in which most vertices have only a small number of neighbors. MDS, on the contrary, considers distances between all pairs of objects, a data structure that can be represented with a complete weighted graph. Force-directed network layout optimization algorithms do not work well on densely connected graphs (

There are a number of algorithms that use the metaphors from either network layout algorithms or MDS or both, trying to adapt each one for a particular data structure or heuristically improve runtime performance. Clustered graphs, for one, include groups of vertices that are related to each other. Clusters can be determined by observing the density of mutual connections between vertices or they can be based on data describing the vertices. Various algorithms have been designed that can detect such clusters

The method described in this paper, FragViz, is a representative of context-specific methods for layout optimization. Unlike the other methods reviewed in this section, it specifically addresses layout optimization for graphs consisting of isolated components, which are given in advance and represent meaningful entities, such as groups of genes related to a particular process. The components, in turn, need to be considered jointly, based on their mutual relations, which may stem from individual relations between member vertices. The natural approach to this particular data structure is to first optimize the layout of each component independently, and then optimize the position and rotation of the components. We achieve this by a combination of network layout and MDS-based algorithms. Notice that, as further addressed in the Discussion, other, perhaps more straightforward adaptations of existing approaches could address such data, but perform worse both in terms of runtime and quality of the resulting layout.

Methods

The input to FragViz is a list of network components and a matrix of (dis)similarities between the network's vertices. FragViz first uses a network layout optimization technique, such as the Fruchterman-Reingold algorithm

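Formally, we are given a graph consisting of unconnected components $C_k$, together with a matrix of dissimilarities $d_{ij}$ between its vertices. The layout of each component $C_k$ is fixed and given by the positions of its vertices inside its own fixed coordinate system. We will denote the position of vertex $V_i$ by $\mathbf{v}_i$. We also assume that the internal coordinate systems are centered, $\sum_{i \in C_k} \mathbf{v}_i = \mathbf{0}$ for each $C_k$. The task is to find the placement $\mathbf{c}_k$ and orientation $\phi_k$ of the coordinate systems for all components, which reflect the given dissimilarities $d_{ij}$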

Description of a physical system

We will base the method on a physical metaphor. Imagine each component as a board with vertices as pegs. Pegs from different components are connected with springs of different lengths corresponding to the given dissimilarities

Assume that all vertices have equivalent mass $m$. The mass of component $C_k$ is

and component's moment of inertia is

The force between a pair of points ($V_i$, $V_j$) is defined by Hooke's law,

where $\mathbf{g}_i$ and $\mathbf{g}_j$ are positions of vertices in a global coordinate system,

where $V_i \in C_k$.
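The quantities named so far can be written out from their verbal definitions. The block below is a sketch under stated assumptions rather than the authors' exact formulas: $d_{ij}$ denotes the given dissimilarity between vertices $V_i$ and $V_j$, the symbols $M_k$, $I_k$ and $R(\phi_k)$ and the unit spring constant are assumptions, and the tags are chosen only to agree with the later references to (3) and (4).

```latex
\begin{align}
M_k &= m\,|C_k| \tag{1}\\
I_k &= m \sum_{i \in C_k} \lVert \mathbf{v}_i \rVert^2 \tag{2}\\
\mathbf{F}_{ij} &= \bigl(\lVert \mathbf{g}_j - \mathbf{g}_i \rVert - d_{ij}\bigr)\,
  \frac{\mathbf{g}_j - \mathbf{g}_i}{\lVert \mathbf{g}_j - \mathbf{g}_i \rVert} \tag{3}\\
\mathbf{g}_i &= \mathbf{c}_k + R(\phi_k)\,\mathbf{v}_i, \qquad
R(\phi) = \begin{pmatrix}\cos\phi & -\sin\phi\\ \sin\phi & \cos\phi\end{pmatrix} \tag{4}
\end{align}
```

With this sign convention, a spring that is stretched beyond its rest length $d_{ij}$ pulls $V_i$ towards $V_j$, and a compressed spring pushes the two apart.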

Let $\mathbf{F}_i$ be the sum of forces acting on vertex $V_i$

The force causes linear acceleration

and angular acceleration

of the component. We shall assume infinite friction, so the component does not retain any momentum. At each instant, the component moves by a distance proportional to the linear acceleration, $\Delta\mathbf{c}_k \sim \mathbf{a}_k$, and rotates by an angle proportional to the angular acceleration, $\Delta\phi_k \sim \alpha_k$, so

These equations allow for a computer simulation of the physical process. Starting from a random placement of components, we iteratively compute the forces $\mathbf{F}_i$, and move and rotate the components accordingly until the system reaches an optimum in which all $\mathbf{F}_i$ are negligible.
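One iteration of such a simulation might look as follows. This is a minimal sketch, not the Orange implementation: the dictionary layout of a component (`c`, `phi`, `v`, `ids`), the unit vertex mass, the unit spring constant and the step size `eta` are all assumptions made for illustration.

```python
import math


def global_positions(comps):
    """Map vertex id -> (x, y, component index), rotating and translating
    each component's internal coordinates into the global system."""
    pos = {}
    for k, comp in enumerate(comps):
        cx, cy = comp["c"]
        cos_p, sin_p = math.cos(comp["phi"]), math.sin(comp["phi"])
        for vid, (x, y) in zip(comp["ids"], comp["v"]):
            pos[vid] = (cx + cos_p * x - sin_p * y, cy + sin_p * x + cos_p * y, k)
    return pos


def simulation_step(comps, d, eta=0.05):
    """Compute spring forces between vertices of different components, then
    move and rotate every component proportionally to its acceleration."""
    pos = global_positions(comps)
    force = {vid: [0.0, 0.0] for vid in pos}
    ids = sorted(pos)
    for a, i in enumerate(ids):
        for j in ids[a + 1:]:
            xi, yi, ki = pos[i]
            xj, yj, kj = pos[j]
            if ki == kj:
                continue  # springs only join pegs on different boards
            dx, dy = xj - xi, yj - yi
            dist = math.hypot(dx, dy) or 1e-9
            f = dist - d[i][j]  # Hooke's law with unit spring constant
            fx, fy = f * dx / dist, f * dy / dist
            force[i][0] += fx
            force[i][1] += fy
            force[j][0] -= fx
            force[j][1] -= fy
    for comp in comps:
        mass = len(comp["ids"])  # unit mass per vertex
        inertia = sum(x * x + y * y for x, y in comp["v"]) or 1.0
        cx, cy = comp["c"]
        fx = sum(force[v][0] for v in comp["ids"])
        fy = sum(force[v][1] for v in comp["ids"])
        torque = sum((pos[v][0] - cx) * force[v][1] - (pos[v][1] - cy) * force[v][0]
                     for v in comp["ids"])
        comp["c"] = (cx + eta * fx / mass, cy + eta * fy / mass)
        comp["phi"] += eta * torque / inertia
```

Every step recomputes all pairwise distances between vertices of different components, which is the cost that the approximate method avoids.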

Approximate simulation

Computer simulation of the system described above is rather slow. We can speed it up by first computing the positions of components and then rotating them in place. The result is only approximately optimal with regard to the total stress (3), yet we will experimentally show that the difference is negligible.

For positioning the components, the approximate method measures and optimizes distances between components rather than the distances between vertices. We define the distance between components $C_k$ and $C_l$ as the average of distances between the corresponding vertices, similar to average linkage in hierarchical clustering analysis
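The average-linkage definition above translates directly into code. In this sketch the function name and the representation of components as lists of vertex indices are assumptions.

```python
def component_distances(components, d):
    """Average pairwise vertex dissimilarity between every pair of components.

    components: list of lists of vertex indices
    d:          square matrix of vertex dissimilarities
    """
    n = len(components)
    dist = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(k + 1, n):
            total = sum(d[i][j] for i in components[k] for j in components[l])
            dist[k][l] = dist[l][k] = total / (len(components[k]) * len(components[l]))
    return dist
```

The matrix is computed once, before the placement of components is optimized, and has one row per component rather than one per vertex.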

The task is then to find the positions in a two-dimensional plane, in which the distance between every pair of component centers $\mathbf{c}_k$ and $\mathbf{c}_l$ matches the given $d_{kl}$ as closely as possible. This approach is much faster than the simulation from the previous section since the computation of all pairwise distances at each step of optimization is replaced by a single such computation in (10). This translates the problem of placing the components into the familiar multidimensional scaling problem (MDS). There exist many efficient solutions of the MDS problem, such as, for instance, SMACOF
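To make the MDS step concrete, the following is a bare-bones, unweighted SMACOF iteration (repeated Guttman transforms) written with the standard library only. It is an illustrative sketch, not the implementation used in the experiments; the iteration count, the seed and the random initialization are assumptions.

```python
import math
import random


def smacof(d, iterations=300, seed=0):
    """Place len(d) points in the plane so that their Euclidean distances
    approximate the target distances d (unweighted SMACOF)."""
    n = len(d)
    rng = random.Random(seed)
    x = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        new = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dist = math.hypot(x[i][0] - x[j][0], x[i][1] - x[j][1])
                ratio = d[i][j] / dist if dist > 1e-12 else 0.0
                # Row i of the Guttman transform B(X) X, accumulated pairwise
                new[i][0] += ratio * (x[i][0] - x[j][0])
                new[i][1] += ratio * (x[i][1] - x[j][1])
        x = [[px / n, py / n] for px, py in new]
    return x
```

Applied to the matrix of component distances, the returned coordinates would serve as the component centers.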

By considering only the centers of components, MDS ignores their sizes, which can cause the components to overlap. This can be fixed by introducing a scaling factor between the global coordinate system and the internal coordinate systems of components by replacing (4) by

The scaling factor is equal for all components and should be such that the components are as large as possible without too much overlap. A simple rule of thumb is to use the ratio between the average size of components

where

and

For rotation of components we use the original vertex-wise definition of force (3) computed in the scaled coordinate system (11). We apply the same procedure as in the exact simulation, except that we only compute the rotation without the translation. To avoid ending up in local minima, we use simulated annealing where the component can also rotate in the "wrong direction", with the probability of doing so decreasing with time. Although this optimization recomputes the pairwise distances between all vertices at each step, it is not overly time consuming since it requires only a small number of iterations.
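The rotation step can be sketched as follows. The component centers stay fixed, each component turns by an angle proportional to its net torque, and with a probability that shrinks over time the turn is reversed. The linear cooling schedule, the parameters `eta` and `p0`, the unit spring constant and the dictionary layout of a component are assumptions made for this illustration.

```python
import math
import random


def place(comp):
    """Global positions of one component's vertices (rotate, then translate)."""
    cx, cy = comp["c"]
    cos_p, sin_p = math.cos(comp["phi"]), math.sin(comp["phi"])
    return [(cx + cos_p * x - sin_p * y, cy + sin_p * x + cos_p * y)
            for x, y in comp["v"]]


def stress(comps, d):
    """Half the sum of squared spring extensions between different components."""
    total = 0.0
    for k, a in enumerate(comps):
        for b in comps[k + 1:]:
            for i, (gx, gy) in zip(a["ids"], place(a)):
                for j, (hx, hy) in zip(b["ids"], place(b)):
                    total += 0.5 * (math.hypot(hx - gx, hy - gy) - d[i][j]) ** 2
    return total


def torque(comp, others, d):
    """Net torque about the component's center from springs to other components."""
    cx, cy = comp["c"]
    t = 0.0
    for i, (gx, gy) in zip(comp["ids"], place(comp)):
        for other in others:
            for j, (hx, hy) in zip(other["ids"], place(other)):
                dist = math.hypot(hx - gx, hy - gy) or 1e-9
                f = (dist - d[i][j]) / dist
                fx, fy = f * (hx - gx), f * (hy - gy)
                t += (gx - cx) * fy - (gy - cy) * fx
    return t


def anneal_rotations(comps, d, iterations=300, eta=0.1, p0=0.3, seed=0):
    """Rotate components in place; translation is deliberately left out."""
    rng = random.Random(seed)
    for step in range(iterations):
        p_wrong = p0 * (1 - step / iterations)  # decreases linearly to zero
        for k, comp in enumerate(comps):
            inertia = sum(x * x + y * y for x, y in comp["v"])
            if inertia == 0:
                continue  # a single vertex has no orientation
            others = comps[:k] + comps[k + 1:]
            delta = eta * torque(comp, others, d) / inertia
            if rng.random() < p_wrong:
                delta = -delta  # occasional move in the "wrong direction"
            comp["phi"] += delta
```

Because the probability of a reversed move reaches zero at the end of the schedule, the last iterations behave like plain descent on the stress.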

In the remainder of the paper we only show layouts optimized by the approximate method.

Data

The performance of the proposed algorithm was assessed on four different networks (N1, N2.1, N2.2 and N3) showing relations between genes which were most differentially expressed in the leukemia gene expression data set

Based on different means to estimate the gene similarity, we have defined four distinct gene networks:

• N1 - biological function similarity score: the similarity of genes relates to their biological functions and was calculated based on their membership in canonical biological pathways using the Jaccard index

• N2.1 - Huttenhower similarity score: the similarity between genes as computed by

**The network (N2.1) of the most differentially expressed genes from the leukemia data set.** The similarity matrix of the chosen genes was taken from the recently published work of Huttenhower

• N2.2 - Huttenhower similarity score: the same similarity scores and threshold as in N2.1 were used (the Huttenhower

**The network (N2.2) of the most differentially expressed genes from the leukemia data set as the network in Figure 3, but including the isolated vertices (genes not connected to any other gene), in order to observe the similarity of all the differentially expressed genes**.

• N3 - protein-protein interaction network (Figure

**The network (N3) of genes from the leukemia data set**. Vertices are connected based on their protein interactions from the MIPS database. Only the interactions with the confidence level equal to 1 are shown. The (dis)similarity matrix was added from a different data source and relates to the genes' biological functions. The individual components and their clusters are named according to the prevailing Gene Ontology annotation. Blue lines are drawn in the background, connecting each component with its two most similar components. Line widths correspond to component similarity.
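The Jaccard index used for the N1 similarity score, and the thresholding that turns similarities into edges, can be sketched as below. The gene and pathway names in the usage note are hypothetical, and connecting pairs whose similarity is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(pathways, threshold=0.7):
    """Edges between genes whose pathway memberships are similar enough.

    pathways: dict mapping a gene name to the set of pathways it belongs to
    """
    genes = sorted(pathways)
    return [(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if jaccard(pathways[g], pathways[h]) >= threshold]
```

For instance, `similarity_network({"geneA": {"p1", "p2"}, "geneB": {"p1", "p2"}, "geneC": {"p3"}})` links only `geneA` and `geneB`, leaving `geneC` as an isolated vertex.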

The average local clustering coefficient

Basic characteristics of the networks used in experiments, describing the average local clustering coefficient and the number of vertices, edges and components

| network | vertices | edges | components | clustering coeff. |
|---|---|---|---|---|
| N1 | 72 | 73 | 28 | 0.985 |
| N2.1 | 240 | 223 | 54 | 0.864 |
| N2.2 | 858 | 223 | 672 | 0.864 |
| N3 | 132 | 121 | 41 | 0.852 |
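The average local clustering coefficient can be computed as in the sketch below. The table lists the same coefficient for N2.1 and N2.2, which differ only in isolated vertices, so such vertices apparently do not enter the average. Skipping every vertex with fewer than two neighbors, as done here, is an assumption that goes one step further.

```python
def average_local_clustering(adjacency):
    """Mean local clustering coefficient over vertices with >= 2 neighbors.

    adjacency: dict mapping each vertex to the set of its neighbors
    """
    coefficients = []
    for nbrs in adjacency.values():
        k = len(nbrs)
        if k < 2:
            continue  # coefficient undefined; vertex left out of the average
        nbrs = list(nbrs)
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nbrs[b] in adjacency[nbrs[a]])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients) if coefficients else 0.0
```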

Results and Discussion

The goal of FragViz is to find the network layout in which the arrangement of components uncovers new insights on relations between them and their constituents. We evaluated the method in an experimental study that considered FragViz visualization of the leukemia gene networks N1, N2.1, N2.2 and N3. For additional assistance to the domain expert, the network components were named according to their most specific term from biological process or molecular function aspect of Gene Ontology

The leukemia gene network (N1)

Our goal was to obtain a clear visualization relating the most important genes and their biological functions for two major types of acute leukemia, yielding insight and valuable clues about the disrupted biological processes and pathways in leukemic cells. Solid vertices in Figure

FragViz allows for the exploration of biological processes related to acute myeloid and acute lymphoblastic leukemia on different levels, from specific to more general ones. In Figure

For example, the "guanylate cyclase activity", "nucleotide metabolic process", "RNA polymerase activity", and "DNA replication" components in Figure

The Huttenhower similarity network (N2.1 and N2.2)

The N1 and both N2 networks contain the same 1,025 differentially expressed genes from the leukemia data set. However, in N2.1 and N2.2 a combined gene distance score was used, computed from multiple biological data sources (

As in the N1 network, most of the graph components in N2 networks (Figures

For example, the genes in components "spliceosomal snRNP biogenesis", "tRNA aminoacylation for protein translation", "sequence-specific DNA binding" and the nearest genes in the component "protein binding" participate in processes of cell proliferation. All these genes have higher expression in ALL samples. Excessive cell proliferation is a characteristic of all leukemic cells. However, previous studies

Since the distance information is used to adjust the position of unconnected components, the layout allows for the exploration of the data on different levels, using genes from a single component or from clusters of biologically related components.

The protein-protein interaction network (N3)

The placement of unconnected components in a fragmented network can be optimized using the vertex distance information from a source other than that used in the inference of network structure. For example, the N3 network (Figure

Several gene products (proteins) that lie close to each other in the FragViz optimized network (Figure

We added an optional component similarity visualization to the network. The similarity between network components is visualized by blue lines in Figure

Performance comparison

Table

Average layout optimization time in seconds for all four networks

| network | FR | FragViz (simulation) | FragViz (approximation) | MDS | Eades 1 | Eades 2 |
|---|---|---|---|---|---|---|
| N1 | 0.4 | 33 | 6 | 36 | 3 | 1 |
| N2.1 | 1.3 | 63 | 6 | 64 | 31 | 2 |
| N2.2 | 8 | 301 | 240 | 320 | 410 | 29 |
| N3 | 1.1 | 76 | 14 | 55 | 8 | 1.5 |

All measurements have been conducted on a desktop PC, with Intel Core 2 Duo 2.20GHz processor and 4 GB of RAM, using the 64-bit Windows 7 OS. The results represent an average over 10,000 runs of the algorithms on the N1-N3 networks, starting from random positions of vertices.

The Fruchterman-Reingold algorithm is by far the fastest, but it uses less data than the others and the resulting projections are much less informative. Running times of Eades 2 are comparable to those of Fruchterman-Reingold. This was expected, as both approaches run on a similar graph. Eades 1 employs a complete graph, which makes it much slower. On the large network N2.2, Eades 1 is even slower than MDS. The running times of the FragViz simulation are similar to those of MDS, which is also expected. The approximate method runs much faster, except for the large network N2.2, where most vertices are unconnected, which essentially reduces the visualization problem to MDS.

Table

Pearson's correlation between elements of the gene distance matrix and the Euclidean distance between the corresponding vertices in the two-dimensional network layout

| network | FR | FragViz (simulation) | FragViz (approximation) | MDS | random | Eades 1 | Eades 2 |
|---|---|---|---|---|---|---|---|
| N1 | 0.311 | 0.391 | 0.380 | 0.415 | 0.007 | 0.173 | 0.215 |
| N2.1 | 0.086 | 0.290 | 0.302 | 0.654 | 0.002 | 0.009 | 0.156 |
| N2.2 | 0.401 | 0.591 | 0.609 | 0.593 | 0.006 | 0.391 | 0.043 |
| N3 | 0.179 | 0.224 | 0.285 | 0.361 | 0.060 | 0.092 | 0.199 |

For all four networks, the correlation coefficients of the two FragViz algorithms are very similar. The correlation was always lower with the FR algorithm and, for three out of four networks, the highest correlation was obtained with MDS. In one of the compared networks (N2.2) MDS performed slightly worse than the approximation, suggesting that MDS got trapped in a local minimum. As expected, when the vertices were arbitrarily placed in the graph, the correlation between the positions of vertices in the graph and their actual distances is close to 0.
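The quality measure reported in the table can be sketched as follows. The function name is hypothetical, and taking every unordered pair of vertices exactly once is an assumption about how the matrix elements were paired with the projected distances.

```python
import math


def layout_correlation(d, positions):
    """Pearson's correlation between given distances d[i][j] and the
    Euclidean distances between the corresponding points of a 2-D layout."""
    given, projected = [], []
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n):
            given.append(d[i][j])
            projected.append(math.hypot(positions[i][0] - positions[j][0],
                                        positions[i][1] - positions[j][1]))
    mean_g = sum(given) / len(given)
    mean_p = sum(projected) / len(projected)
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(given, projected))
    var_g = sum((g - mean_g) ** 2 for g in given)
    var_p = sum((p - mean_p) ** 2 for p in projected)
    return cov / math.sqrt(var_g * var_p)
```

The measure is invariant to uniform scaling of the layout, so it compares methods fairly even when they produce drawings of different sizes. It is undefined when either set of distances is constant.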

Clustered graph approaches (Eades 1 and Eades 2) are in general faster than FragViz, but performed worse in terms of layout quality. Eades 2 performed better than Eades 1 on smaller graphs (N1, N2.1 and N3), whereas Eades 1 had a high correlation for the large network (N2.2). However, the Eades 1 approach is not appropriate for analyzing large fragmented networks as it is prohibitively slow.

Note that the compared algorithms pursue different goals. The tests were run on data suitable for the method presented in this paper, while in other contexts another method could give better results. In particular, clustered graph methods could not be directly applied to the original data, so their results depend on the proposed transformation of the original problem.

Impact of the network fragmentation

We also investigated the behavior of layout optimization methods with respect to the degree of network fragmentation. We constructed 1,000 networks of the most differentially expressed genes from the leukemia data set (visualized in Figure

**Influence of the selected similarity threshold on the layout optimization**. Biological function similarity score was used on input. Horizontal axis measures the connectedness of the network, where 0 represents a complete graph and 1 means the graph has no edges.

FragViz and FR algorithms are equivalent when the network consists of only one component (threshold values lower than 0.1). For the FR algorithm, the correlation decreases when the network gets more fragmented. However, when the fragmentation increases (threshold value greater than 0.2), the correlation score of the FragViz algorithm increases and rises above the best score obtained by the FR algorithm. Correlation for MDS does not depend on the threshold.
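How the similarity threshold controls fragmentation can be illustrated with a small union-find routine that builds the network at a given threshold and returns its components. The function name and the use of greater-than-or-equal for thresholding are assumptions.

```python
def components_at_threshold(similarity, threshold):
    """Connect vertex pairs with similarity >= threshold.

    Returns the number of edges and the list of connected components.
    """
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                edges += 1
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return edges, list(groups.values())
```

Sweeping the threshold from low to high moves the network from a single connected component towards a set of isolated vertices, which is the range explored in the experiment above.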

Alternative approaches

Projections similar to those by FragViz could in principle be obtained with other algorithms (Figure

**The N2.1 network layout optimized with four different methods; two different approaches were used for clustered graph visualization**. In 7.a the network was optimized with the FragViz algorithm. For 7.b a complete weighted graph was first constructed from the original network and similarity matrix. The weights of the network edges were scaled so that the largest weight equalled 1. Virtual edges were added to all unconnected pairs of vertices, with weights inversely linear with the distances from the similarity matrix and scaled to the interval [0, 0.01]. The complete graph was then optimized with the FR algorithm. For 7.c the original network was merged into the dissimilarity matrix: pairs of connected vertices from the original network were given the lowest dissimilarity, 0, while the remaining values from the dissimilarity matrix were scaled down 100-fold to the interval [0.99, 1]. The dissimilarity matrix was then optimized with the MDS algorithm. In 7.d and 7.e we optimized a network using clustered graph visualization. We transformed the original graph

Besides the projection quality issues, FragViz is also faster than the above approaches since it splits the optimization problem into a set of much smaller problems, laying out small individual components and then arranging a small number of components instead of all vertices at once. Using the graph layout optimization algorithms instead of FragViz, as described above, would be slower since these algorithms do not perform well on complete graphs. For MDS, to get similar running times as FragViz, one needs to employ fast heuristic MDS algorithms, which gain speed by somewhat compromising the quality of the projection
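The complete weighted graph described for panel 7.b can be built roughly as follows. The exact linear mapping from distances to virtual edge weights is not specified beyond "inversely linear" and the target interval, so the min-max mapping used here, like the function name and the edge dictionary format, is an assumption.

```python
def complete_graph_weights(n, edges, d, virtual_max=0.01):
    """Weights for a complete graph built from a network and a distance matrix.

    edges: dict mapping (i, j) with i < j to a positive edge weight
    d:     square matrix of vertex distances
    Original edges are rescaled so that the largest weight equals 1; every
    missing pair receives a virtual edge whose weight decreases linearly with
    distance and lies in [0, virtual_max].
    """
    top = max(edges.values())
    weights = {pair: w / top for pair, w in edges.items()}
    missing = [(i, j) for i in range(n) for j in range(i + 1, n)
               if (i, j) not in edges]
    if missing:
        lo = min(d[i][j] for i, j in missing)
        hi = max(d[i][j] for i, j in missing)
        span = (hi - lo) or 1.0
        for i, j in missing:
            weights[(i, j)] = virtual_max * (hi - d[i][j]) / span
    return weights
```

The resulting graph has an edge for every pair of vertices, which is the setting in which force-directed layout algorithms are slowest.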

Conclusions

We have recently witnessed the emergence of large repositories of biomedical research and clinical data. Methods are needed that would allow the domain experts to sieve through these data sets to gather information, reason on the hidden patterns and form plausible hypotheses to be tested in subsequent studies. Here, visualization combined with visual data analytics plays a major role, as it can reveal the data patterns and allow the experts to explore the data.

Visualizations require the development of dedicated algorithms that craft the proper placement of the objects under consideration. Explorative data analysis requires these algorithms to be fast so that responsive interfaces can be built. We have developed a layout optimization technique, FragViz, that meets these requirements and specifically addresses the visualization of fragmented networks, where standard algorithms do not consider similarities between unconnected components.

FragViz is neither faster than all existing algorithms nor more accurate in terms of the match between the given and the projected distances. FragViz is slower than the Fruchterman-Reingold algorithm, which is a direct consequence of considering more information. The resulting vertex distances may match the given distance matrix worse than in multidimensional scaling, a consequence of fixing the layout of the components. This is a design decision: the goal of FragViz is to provide a sensible local picture and a global overview, hence the two-level optimization. It can happen, for instance, that in a chain-like component the two vertices at its ends are weakly related to a common third vertex not belonging to the component. While other layout optimization algorithms would bend the chain, FragViz keeps it straight. Our experiments confirmed the usefulness of the proposed solution. The case study on the leukemia gene networks shows that the derived visualizations may be helpful in uncovering the relations between the components.

The data, networks, their visualizations, and the implementation of the described methods in an open-source analytics framework Orange

Availability and Requirements

**Project name:** Orange FragViz

**Project home page:**

**Operating system:** Platform independent

**Programming language:** Python, C++

**Other requirements:** PyQt, PyQwt, Numpy

**License:** GNU GPL

**Any restrictions on use by non-academics:** none

Authors' contributions

BZ identified the problem and suggested its solution. MS developed and implemented the algorithm, performed the experiments and drafted the manuscript. MM designed and interpreted the case study. JD formulated the optimization problem based on the physical metaphor. All authors co-wrote the article and approved its final version.

Acknowledgements

This work was supported by grants from the Slovenian Research Agency (P2-0209, J2-9699, L2-1112).