Institute of Biochemistry, Center for Structural and Cell Biology in Medicine, University of Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany

Graduate School for Computing in Medicine and Life Sciences, University of Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany

Institute for Neuro- and Bioinformatics, University of Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany

Key Laboratory of Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China

Laboratory for Structural Biology of Infection and Inflammation, c/o DESY, Building 22a, Notkestr. 85, 22603 Hamburg, Germany

Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zu Chong Zhi Rd., Shanghai 201203, China

Abstract

Background

Results of phylogenetic analysis are often visualized as phylogenetic trees. Such a tree can typically only include up to a few hundred sequences. When more than a few thousand sequences are to be included, analyzing the phylogenetic relationships among them becomes a challenging task. The recent frequent outbreaks of influenza A viruses have resulted in the rapid accumulation of corresponding genome sequences. Currently, there are more than 7500 influenza A virus genomes in the database. There are no efficient ways of representing this huge data set as a whole, thus preventing a further understanding of the diversity of the influenza A virus genome.

Results

Here we present a new algorithm, "PhyloMap", which combines ordination, vector quantization, and phylogenetic tree construction to give an elegant representation of a large sequence data set. The use of PhyloMap on influenza A virus genome sequences reveals phylogenetic relationships of the internal genes that cannot be seen when only a subset of sequences is analyzed.

Conclusions

The application of PhyloMap to influenza A virus genome data shows that it is a robust algorithm for analyzing large sequence data sets. It utilizes the entire data set, minimizes bias, and provides intuitive visualization. PhyloMap is implemented in JAVA, and the source code is freely available at

Background

Phylogenetic trees are commonly used as a visualization tool

Higgins used Principal Coordinate Analysis (PCoA)

Here, we present a new method, Phylogenetic Map (PhyloMap), that combines PCoA, vector quantization, and phylogenetic tree construction to visualize a large sequence data set using all the data while still capturing the relationships among the sequences as accurately as possible. Compared to traditional phylogenetic tree analysis, which is practicable only with a maximum of a few hundred sequences, PhyloMap can handle thousands of sequences at one time. PhyloMap first uses PCoA to help depict the main trends and then uses the "Neural-Gas" approach

Influenza A viruses are commonly classified by serological differences in their hemagglutinin (HA) and neuraminidase (NA) proteins. The gene sequences of different HAs or NAs are also significantly divergent and can be easily classified by serological type. However, the recent emergence of the 2009 swine-origin human influenza A (H1N1) virus (S-OIV)

Methods

The PhyloMap algorithm

The input to PhyloMap is a set of aligned sequences, either amino acids or nucleotides. The algorithm involves five steps as shown in Figure


**Flow chart of the PhyloMap algorithm**.

1. Distance Matrix

The idea of ordination is to map the input sequences onto a low-dimensional space so that the distances and relationships of the sequence set are preserved as much as possible. In order to do that, one has to calculate a distance matrix
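As an illustration of this step, the following minimal sketch computes a pairwise distance matrix from aligned sequences. The distance measure is an assumption: it uses the simple p-distance (the fraction of differing positions, skipping columns with a gap), which is not necessarily the measure used in PhyloMap. The function names `p_distance` and `distance_matrix` are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption of
    this sketch, not a statement about the PhyloMap implementation).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances with a zero diagonal."""
    n = len(aligned)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = p_distance(aligned[i], aligned[j])
            dist[i][j] = dist[j][i] = d
    return dist


if __name__ == "__main__":
    for row in distance_matrix(["ACGT", "ACGA", "A-GA"]):
        print(row)
```

The same matrix can later be reused for tree building, so that the ordination and the tree share one distance measure.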

2. Principal coordinate analysis

PCoA was first described by Gower _{ij}, to the similarity matrix

where

The eigenvectors and the eigenvalues of the matrix
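For orientation, the textbook form of Gower's PCoA can be sketched as follows: the squared distances are double-centered into a similarity matrix, and the leading eigenvectors, scaled by the square roots of their eigenvalues, give the coordinates. This is a sketch under assumptions, not the PhyloMap implementation. To stay dependency-free it uses power iteration with deflation in place of a full eigendecomposition, and it assumes the leading eigenvalues are positive and well separated. All function names are hypothetical.

```python
import math


def double_center(dist):
    """Gower centering: B = J A J with a_ij = -0.5 * d_ij ** 2."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # dist is symmetric, so column means equal row means.
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def top_eigenpair(mat, iterations=500):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    n = len(mat)
    vec = [1.0 + 0.1 * i for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        nxt = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        vec = [x / norm for x in nxt]
    mv = [sum(mat[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * w for v, w in zip(vec, mv)), vec  # Rayleigh quotient


def pcoa(dist, dims=2):
    """Coordinates of each item on the first `dims` principal axes."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][axis] = vec[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


if __name__ == "__main__":
    print(pcoa([[0, 3, 4], [3, 0, 1], [4, 1, 0]]))
```

For data sets with thousands of sequences, a library eigensolver would be the more practical choice; the power iteration here only keeps the sketch self-contained.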

3. Vector quantization (Clustering)

The clustering algorithm we choose here is the "Neural-Gas"
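The general Neural-Gas update rule can be sketched as below: for each presented sample, all codebook vectors are ranked by distance and moved toward the sample with a step that decays exponentially with rank, while the learning rate and neighborhood range are annealed. This is a minimal sketch of the generic algorithm, not the PhyloMap code. The parameter values, the annealing schedule, and the `representatives` helper (which picks the data point nearest to each codebook vector) are assumptions made for illustration.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neural_gas(points, n_units, steps=2000, seed=0,
               eps_start=0.5, eps_end=0.01, lam_end=0.01):
    """Return n_units codebook vectors fitted to points (generic Neural-Gas)."""
    rng = random.Random(seed)
    # Initialize the codebook from distinct data points.
    units = [list(p) for p in rng.sample(points, n_units)]
    lam_start = max(n_units / 2.0, 1.0)
    for t in range(steps):
        frac = t / max(steps - 1, 1)
        eps = eps_start * (eps_end / eps_start) ** frac  # learning rate
        lam = lam_start * (lam_end / lam_start) ** frac  # neighborhood range
        x = rng.choice(points)
        # Rank every unit by its distance to the sample (rank 0 = closest).
        ranking = sorted(range(n_units),
                         key=lambda k: squared_distance(units[k], x))
        for rank, k in enumerate(ranking):
            h = eps * math.exp(-rank / lam)
            units[k] = [w + h * (xi - w) for w, xi in zip(units[k], x)]
    return units


def representatives(points, units):
    """Index of the data point closest to each codebook vector (assumed rule)."""
    return [min(range(len(points)),
                key=lambda i: squared_distance(points[i], u))
            for u in units]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    codebook = neural_gas(pts, 2)
    print(codebook, representatives(pts, codebook))
```

In the PhyloMap setting the points would be the PCoA coordinates of the sequences, and the selected indices would identify the sequences passed on to tree construction.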

4. Phylogenetic tree construction

Subsequently, we use the sequences selected by the Neural-Gas to build a phylogenetic tree. The Neighbor-joining (NJ) tree is used in PhyloMap with the same distance measurement used for calculating the distance matrix for PCoA. Other non-distance-based tree building methods can also be used (see the discussion below). The NJ tree is unrooted since we just want to find the major lineages of the sequences rather than to portray the exact evolutionary history.
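The standard Neighbor-joining procedure referred to here can be sketched as follows. The sketch takes a list of labels and a symmetric distance matrix and returns the edges of an unrooted tree. It follows the textbook formulation (Q-criterion, branch-length formulas, distance update) and is not taken from the PhyloMap source; the function name and the `innerN` naming of internal nodes are hypothetical, and input labels are assumed not to collide with those names.

```python
def neighbor_joining(labels, dist):
    """Textbook Neighbor-joining; returns a list of (node, node, length) edges."""
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(labels) if j != i}
         for i, a in enumerate(labels)}
    active = list(labels)
    edges = []
    counter = 0
    while len(active) > 2:
        m = len(active)
        row_sum = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Pick the pair that minimizes the Q-criterion.
        best_pair, best_q = None, None
        for x in range(m):
            for y in range(x + 1, m):
                a, b = active[x], active[y]
                q = (m - 2) * d[a][b] - row_sum[a] - row_sum[b]
                if best_q is None or q < best_q:
                    best_pair, best_q = (a, b), q
        a, b = best_pair
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (row_sum[a] - row_sum[b]) / (2.0 * (m - 2))
        len_b = d[a][b] - len_a
        new = "inner%d" % counter
        counter += 1
        edges.append((new, a, len_a))
        edges.append((new, b, len_b))
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in active:
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        active = [c for c in active if c != a and c != b] + [new]
    a, b = active
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


if __name__ == "__main__":
    matrix = [[0, 5, 9, 9, 8],
              [5, 0, 10, 10, 9],
              [9, 10, 0, 8, 7],
              [9, 10, 8, 0, 3],
              [8, 9, 7, 3, 0]]
    for edge in neighbor_joining(list("abcde"), matrix):
        print(edge)
```

On distances that are not perfectly tree-like, Neighbor-joining can produce zero or negative branch lengths, which any later mapping step would need to handle.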

5. Mapping the phylogenetic tree onto the PCoA result

The core algorithm of PhyloMap is to map the phylogenetic tree onto the two-dimensional coordinates calculated by PCoA. We adopted a multidimensional scaling method (MDS) similar to "Sammon's mapping"

A phylogenetic tree has two types of nodes:

• Leaf nodes: nodes that do not have any children; each node represents a sequence.

• Inner nodes: nodes that have child nodes and a parent node. The root node of the tree can be considered a special inner node that has no parent node.

Each leaf node corresponds to one point in the two-dimensional PCoA result. The positions of these points are fixed: the coordinates of the leaf nodes are predefined and cannot be changed when drawing the tree. If we want to preserve the edge lengths between nodes, only the inner nodes can be moved. Unlike other MDS problems, where the distances from one data point to all other data points are known, in PhyloMap each inner node is constrained by only three other nodes: one parent node and two child nodes.
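The constraint structure described above (fixed leaves, movable inner nodes, each constrained only by its tree neighbors) can be illustrated with a small gradient-descent sketch. The error used here is an assumed Sammon-style stress over tree edges, sum of (L - d)^2 / L divided by the total edge length, where L is the tree edge length and d the 2D distance; it is not necessarily the exact error function of PhyloMap. The function names, step size, and iteration count are likewise hypothetical, and all edge lengths are assumed to be positive.

```python
import math


def edge_stress(pos, edges):
    """Assumed Sammon-style stress over tree edges (node, node, length)."""
    total_len = sum(length for _, _, length in edges)
    err = 0.0
    for a, b, length in edges:
        d = math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
        err += (length - d) ** 2 / length
    return err / total_len


def place_inner_nodes(leaf_pos, inner_init, edges, step=0.5, iterations=500):
    """Move only the inner nodes by gradient descent; leaves stay fixed."""
    pos = {k: tuple(v) for k, v in leaf_pos.items()}
    pos.update({k: tuple(v) for k, v in inner_init.items()})
    total_len = sum(length for _, _, length in edges)
    for _ in range(iterations):
        for node in inner_init:
            gx = gy = 0.0
            for a, b, length in edges:
                if node != a and node != b:
                    continue
                other = b if a == node else a
                dx = pos[node][0] - pos[other][0]
                dy = pos[node][1] - pos[other][1]
                d = math.hypot(dx, dy) or 1e-12  # avoid division by zero
                # Derivative of (length - d)^2 / length w.r.t. the node position.
                coef = 2.0 * (d - length) / (length * d * total_len)
                gx += coef * dx
                gy += coef * dy
            pos[node] = (pos[node][0] - step * gx, pos[node][1] - step * gy)
    return pos


if __name__ == "__main__":
    r = 2.0 / math.sqrt(3.0)
    leaves = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (1.0, math.sqrt(3.0))}
    tree = [("X", "A", r), ("X", "B", r), ("X", "C", r)]
    result = place_inner_nodes(leaves, {"X": (0.3, 0.2)}, tree)
    print(result["X"], edge_stress(result, tree))
```

In this toy case the three edge lengths can all be satisfied at once. With real data the fixed leaves generally make that impossible, which leaves a residual error.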

We first define an error function _{s} similar to "Sammon's mapping":

where _{ij} is the distance between node

The algorithm will then employ gradient descent on the inner nodes to minimize _{s}. The distance _{ij} defined between node _{s} or a plot that is difficult to inspect visually. This is because the leaf nodes cannot move and, hence, all the distance constraints have to be satisfied by the inner nodes. If the inner nodes only explore a small space, which will provide attractive visual results, _{s} might be too large to accurately preserve the tree distances. To solve this problem, we use the Bezier curve _{b} after Bezier curve compensation is defined as:

**NP PhyloMap**.

where

The algorithm can be summarized as follows:

**Input**: tree: _{leaf}; scaling factor:

**Output**: all node coordinates _{node}, corresponding Bezier curve control point _{bezier} and error _{b} after Bezier curve compensation.

1:

2: _{node} := randomly initialize the coordinates of the inner nodes and attach _{leaf}.

3: _{s} := calculate the actual distance matrix using _{node}.

4: **while** _{i} ≤

5: **for each** inner node

6: update the coordinates of the inner node using gradient descent once every five iterations.

7: update the coordinates of the inner node using gradient descent only if there exists at least one edge connected to this node with

8: update _{s} using the new coordinates.

9: **end for each**

10: _{i} := calculate error using equation (3).

11: **end while**

12: **for each**

13: _{bezier} := calculate the Bezier curve control point so that

14: **end for each**

15: _{b} := calculate error using equation (4).
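The Bezier curve compensation in the final steps can be illustrated with a sketch that bends an edge until its drawn length matches a target length. It makes several assumptions that are not confirmed by the text above: a quadratic Bezier curve is used, the control point is placed on the perpendicular bisector of the two end points, the curve length is approximated by a polyline, and the offset is found by bisection. All function names are hypothetical.

```python
import math


def bezier_point(p0, p1, p2, t):
    """Point on the quadratic Bezier curve with control point p1."""
    u = 1.0 - t
    return (u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0],
            u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1])


def bezier_length(p0, p1, p2, segments=200):
    """Polyline approximation of the curve length."""
    total = 0.0
    prev = p0
    for k in range(1, segments + 1):
        cur = bezier_point(p0, p1, p2, k / segments)
        total += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        prev = cur
    return total


def control_point_for_length(p0, p2, target, tol=1e-9):
    """Control point on the perpendicular bisector giving curve length ~ target.

    If the straight line is already at least as long as the target, the
    midpoint is returned, i.e. the edge is drawn straight.
    """
    mid = ((p0[0] + p2[0]) / 2.0, (p0[1] + p2[1]) / 2.0)
    chord = math.hypot(p2[0] - p0[0], p2[1] - p0[1])
    if target <= chord:
        return mid
    if chord > 0:
        normal = (-(p2[1] - p0[1]) / chord, (p2[0] - p0[0]) / chord)
    else:
        normal = (0.0, 1.0)  # arbitrary direction for coincident end points
    lo, hi = 0.0, target  # an offset equal to target always yields length >= target
    while hi - lo > tol:
        h = (lo + hi) / 2.0
        ctrl = (mid[0] + h * normal[0], mid[1] + h * normal[1])
        if bezier_length(p0, ctrl, p2) < target:
            lo = h
        else:
            hi = h
    h = (lo + hi) / 2.0
    return (mid[0] + h * normal[0], mid[1] + h * normal[1])


if __name__ == "__main__":
    ctrl = control_point_for_length((0.0, 0.0), (2.0, 0.0), 3.0)
    print(ctrl, bezier_length((0.0, 0.0), ctrl, (2.0, 0.0)))
```

A curve can only lengthen an edge, so this kind of compensation helps when the 2D distance is shorter than the tree distance; edges that are drawn too long keep their error.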

Influenza A virus genome data

We compiled a data set containing 74,309 sequences of influenza A virus internal proteins as available from the NCBI database

Number of protein sequences used in the data set

| | PB2 | PB1 | PA | NP | M1 | M2 | NS1 | NS2 |
|---|---|---|---|---|---|---|---|---|
| No. of sequences | 8397 | 8577 | 8522 | 8590 | 11258 | 10111 | 9982 | 8872 |
| No. of non-redundant sequences | 4384 | 4022 | 4173 | 2984 | 1496 | 2016 | 3734 | 1650 |

All eight gene products were aligned separately using MUSCLE

Results

PhyloMap reduces the risk of misinterpretation

We have generated the PhyloMap for all influenza A virus internal genes using their protein sequences, i.e. PB2, PB1, PA, NP, M1, M2, NS1, and NS2 (Figures

**PB2 PhyloMap**.

**PB1 PhyloMap**.

**PA PhyloMap**.

**M1 PhyloMap**.

**M2 PhyloMap**.

**NS1 PhyloMap**.

**NS1 PhyloMap excluding Group B**.

**NS2 PhyloMap**.

**NS2 PhyloMap excluding Group B**.

It is obvious that PCoA alone can already identify most of the major lineages; however, without the support of the mapping tree, it fails to portray the distances between some strains. The straight-line distance between "29: A/equine/Sao Paulo/4/1976(H7N7)" and "33: A/smew/Sweden/V820/2006(H5N1)" is short, but if we follow the tree, the distance is substantially longer. The real distance may need another dimension in the PCoA to be displayed. The tree here has served to add more dimensions to the 2D PCoA plot.

While the topology of the tree is defined, different tree-drawing algorithms can generate very different tree representations. The subtrees can be arbitrarily placed by the tree-drawing algorithms

The diversity of influenza A virus internal genes

Six distinct major lineages can be identified from the PhyloMap for all genes, i.e. seasonal human H3N2, seasonal human H1N1, early human, classical swine, equine, and avian viruses. The latter have been further separated into two sublineages (western hemisphere avian lineage and eastern hemisphere avian lineage) in a previous study

The PhyloMap shows similar patterns for PB2, PA, NP, M1, and M2 (Figures

PB1 also shows a pattern very different from other genes. PB1 of human H3N2 was derived from avian strains in 1968 through reassortment

The swine influenza viruses spread throughout the entire PhyloMap, further supporting the idea of swine being a "mixing-vessel"

By observing the first few dimensions of the PCoA results, one can identify the major factors that cause the sequences to differ from one another. We can see that the first dimension in our PCoA results on the internal genes generally reflects the host differences, and the second dimension reflects some of the subtype differences. The third dimension (not shown in the figures) further separates the swine and equine strains from others. The above observations show that the diversity of influenza A virus internal genes is mainly shaped by host differences and virus subtypes. However, subtype and host information alone is still not enough to distinguish major lineages among internal genes. For instance, the human H1N1 strains contain three major lineages: human seasonal H1N1, early human H1N1, and 2009 pandemic H1N1. These are highlighted in additional files (Additional files

**NP PhyloMap highlights human H1N1 influenza A virus**.

**PB2 PhyloMap highlights human H1N1 influenza A virus**.

**PB1 PhyloMap highlights human H1N1 influenza A virus**.

**PA PhyloMap highlights human H1N1 influenza A virus**.

**M1 PhyloMap highlights human H1N1 influenza A virus**.

**M2 PhyloMap highlights human H1N1 influenza A virus**.

**NS1 PhyloMap highlights human H1N1 influenza A virus**.

**NS2 PhyloMap highlights human H1N1 influenza A virus**.

**Figure legend**. The figure legend for Additional file

PhyloMap helps locate the origin of emerging influenza A viruses

As the main patterns of influenza A internal genes can be clearly seen from the PhyloMap result, one can start to investigate the more subtle relationships in the data by zooming in on certain clusters or adding sequences of interest to the sampling tree. The sequences of the sampling tree found by the Neural-Gas approach minimize the quadratic errors. As a result, they represent the diversity of the data set well. When it comes to finding the origin of a new strain, the sampled sequences can provide a good reference data set that would not miss important lineages. We have mapped the genes of 1918 "Spanish flu" ("**A/Brevig Mission/1/1918(H1N1)**") and S-OIV ("

Discussion

While phylogenetic tree inference methods are relatively well developed, their interpretation relies heavily on visual inspection

Other means of adding more information to ordination such as superimposing a minimal spanning tree and a relative neighborhood graph have been proposed by Guiller

The PCoA used here is a linear dimensionality-reduction technique

In PhyloMap, we use distance-based methods to build the sampling tree. Because the distances are measured in the same way in both PCoA and the phylogenetic tree, the error can be minimized when mapping the tree onto the PCoA result. The sampling tree can also be built with parsimony-based or maximum-likelihood-based methods, but in such cases the edge lengths in the tree and the 2D PCoA result might not be on the same scale. We need to estimate the scaling factor

The accuracy of an inferred phylogenetic tree depends on many factors such as the number of sequences, number of characters (number of aligned positions), and substitution rate. In general, the accuracy of the inferred phylogenetic tree increases as more characters are used

Conclusions

PhyloMap is a robust algorithm for analyzing phylogenetic relationships in large sequence data sets. It can utilize the entire data set and avoids the bias introduced by manual sampling. PhyloMap introduces two data compression techniques (dimensionality reduction and vector quantization) into phylogenetic studies to reduce the data without losing important information. The visualizations generated summarize the main phylogenetic information and overcome the shortcomings of phylogenetic tree construction and ordination analysis when used alone.

There have been only a few studies targeting the phylogenetic diversity of the internal genes of influenza A virus

Research on influenza A viruses has suggested that they are constantly undergoing frequent reassortment

PhyloMap is implemented in JAVA, and the source code is freely available for download at

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JZ designed and implemented the PhyloMap algorithm, analyzed the data, and drafted the manuscript. AMM participated in designing the PhyloMap algorithm. TM evaluated the algorithm and drafted the manuscript. SC participated in cleaning and analyzing the data. JW participated in drafting the manuscript. RH designed the research, analyzed the data, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We acknowledge support by the Graduate School for Computing in Medicine and Life Sciences, University of Lübeck.

Funding: Germany's Excellence Initiative [DFG GSC 235/1]; International Consortium on Antivirals (