Department of Botany, The Field Museum, 1400 South Lake Shore Drive, Chicago, IL 60605-2496, USA

Department of Biology, Box 90338, Duke University, Durham, NC 27708-0338, USA

The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118, Heidelberg, Germany

Department of Biology and Biochemistry, University of Houston, 369 Science & Research Bldg 2, Houston, TX 77204-5001, USA

Abstract

Background

We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord). The method works via ordination of sequence identity or cost score matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method.

Results

Including ambiguous regions coded by means of PICS-Ord increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICS-Ord performs as well as or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantages of a practically unlimited alignment size, increased analytical speed, and the possibility of analyzing PICS-Ord scores together with DNA data in a partitioned maximum likelihood model.

Conclusions

Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICS-Ord codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word- and frequency-based methods, PICS-Ord maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher, which was developed for this study) allows ordered or unordered characters with up to 32 states. A GTR, MK, or ORDERED model can be applied to analyse the PICS-Ord code partition, with GTR performing slightly better than MK and ORDERED.

Availability

An implementation of the PICS-Ord algorithm is available from

Background

Sequence alignment is the most critical step in molecular phylogenetic analysis. It defines homologous sites and putative evolution of site-specific variation

Methods that do not require a single MSA provide one solution to this problem. Direct optimization (DO) optimizes alignments and trees simultaneously under parsimony, likelihood, or in a Bayesian framework

An alternative to DO or to excluding ambiguous regions is the separate analysis of indels and encoding them as non-DNA characters

The solution to the size and performance limitations of step-matrix-based analysis is to transform the multidimensional step matrix into unidimensional scores prior to phylogenetic analysis. This way, computing pairwise alignment scores can be applied to a theoretically unlimited number of OTUs and to ambiguous regions with high length variation and complexity. This is achieved by ordinating the step matrix and dissecting it into perpendicular axes. The axis coordinates for each OTU can then be used to obtain codes to replace the ambiguously aligned regions. The ordination method of choice must accept similarity (identity) or dissimilarity (cost) matrices as input, which excludes principal component analysis (PCA).

Three commonly used methods can ordinate OTUs based on identity or distance matrices: polar ordination (Bray-Curtis), non-metric multidimensional scaling (NMS), and principal coordinates analysis or 'metric multidimensional scaling', PCoA

In this paper, we describe the computational procedure to encode ambiguous regions: (1) compute pairwise distance matrices for ambiguous regions of an alignment, (2) ordinate the distance matrices, and (3) encode the ordination scores and integrate them into a phylogenetic data matrix. Our novel method, PICS-Ord, was tested using three biological and 100 simulated datasets. One biological dataset (100 OTUs, mtSSU) was extracted from a large dataset of over 600 OTUs and three genes (mtSSU, nuLSU, RPB2) of the lichenized fungal family Graphidaceae [

Results

Maximum Parsimony

The three ambiguous regions of the 100-OTU Graphidaceae dataset showed different degrees of congruence with the non-ambiguous alignment portion (Figure

**Correlation between Clustal sequence identity scores of the non-ambiguous alignment portion (x-axis) and each of the ambiguous regions (y-axis)**. Left column: scatterplots of sequence identity scores, with linear correlation tested using the Pearson product-moment correlation coefficient. Right column: same data but categorized to show emerging pattern (1: 70-75%; 2: >75-80%; 3: >80-85%; 4: >85-90%; 5: >90-95%; 6: >95-100%). Box plots indicate mean, standard deviation, and maximum/minimum values.

Recoding of the non-ambiguous alignment portion of 31 OTUs with ARC, INAASE, PICS-Ord with Clustal distances, and PICS-Ord with Ngila distances resulted in partially deviating maximum parsimony topologies ('distortions') when compared to the tree derived from the uncoded, original DNA alignment (Figure

**Maximum parsimony trees computed from the non-ambiguous alignment portion, using original DNA data and data recoding by means of INAASE, ARC, PICS-Ord with Clustal sequence identity and PICS-Ord with Ngila zeta cost scores**. The five major backbone nodes that are also supported in multigene studies are indicated by grey circles. Branches with good or strong support (70% or higher) are indicated by thick lines and branches with weak support (less than 70%) by slightly thickened lines. Exact bootstrap support values for backbone and terminal nodes are indicated in the table in the upper right corner.

All recoding methods resulted in some loss of backbone support, whereas support for terminal nodes remained largely unchanged (Figure

Maximum Likelihood

Maximum likelihood analysis of the 100-OTU Graphidaceae dataset with ambiguous regions either excluded or encoded using PICS-Ord (Ngila with zeta model) resulted in largely congruent topologies, with only one major clade switching positions between analyses (Figure

**Maximum likelihood trees computed from the 100-OTU Graphidaceae dataset with ambiguous regions excluded (left) and recoded using PICS-Ord with Ngila zeta cost scores (right)**. A GTR-Gamma model was applied to the DNA partition and a GTR model for the PICS-Ord code partition (GTR-CAT for rapid bootstrapping in both cases). Bootstrap support values are indicated next to the branches. Grey triangles indicate major clades with different position in both analyses, and black lines indicate clades with internal topology differing between analyses. Short arrows indicate nodes with increased (black) or decreased (grey) support under PICS-Ord and long arrows indicate nodes present either with ambiguous regions excluded (grey) or under PICS-Ord (black).

Simulations

We generated 100 simulated datasets of aligned sequences, each split into five partitions. Partitions 1 and 2 had unambiguous alignments, while partitions 3-5 had different degrees of alignment ambiguity. Sections 1-4 were combined in one analysis, and sections 1, 2, and 5 in another. RAxML analysis of the 100 simulated datasets recovered the best trees when sections 1-4 (1+2+5; results below given in parentheses for each treatment) were treated as pre-aligned without changes, with a mean relative RF value of 2.74% (3.33%) and recovering the true tree 50 (47) times out of 100 (Figure

**Distribution of RF values of recovered tree topologies under different methodological approaches of excluding and including ambiguous sections in the simulated datasets (compared to the true tree from which the simulated datasets were generated); all = all sections pre-aligned, exc = ambiguous sections excluded, PIC = PICS-Ord coding (Ngila zeta model)**. Numbers in upper part of boxes indicate recovered true trees (out of 100). Box plots indicate mean, standard deviation, and maximum/minimum values.

Wilcoxon matched pairs test comparing the RF values of simulated datasets.

| | 1-2, 3-4 pre-aligned | 3-4 excluded |
|---|---|---|
| 3-4 excluded | *** (-) | NA |
| 3-4 PICS-Ord | ** (-) | ** (+) |

| | 1-2, 5 pre-aligned | 5 excluded |
|---|---|---|
| 5 excluded | *** (-) | NA |
| 5 PICS-Ord | -- (-) | * (+) |

Simulated datasets are given as sections 1-2, 3-4 or sections 1-2, 5 under different treatments of ambiguous regions (all included and pre-aligned, or sections 3-4 and 5 excluded or PICS-Ord-recoded). ***/**/* = significant at the 0.001/0.01/0.05 level; -- = not significant; (+)/(-) = topology improved/worse.

The 705-OTU Physciaceae dataset showed 100 nodes at the backbone, genus group, genus, species group, and species level (with at least three samples per species; tree not shown). Eighteen nodes were present under PICS-Ord but absent when ambiguous regions were excluded; of these, nine had support values ranging between 14% and 69% and nine had values ranging between 77% and 100% under PICS-Ord (Figure

**Proportion of increased or decreased support values for 100 backbone, genus group, genus, species group, and species nodes of the 705-OTU Physciaceae dataset analysed under maximum likelihood with ambiguous regions either excluded or recoded with PICS-Ord**. Nodes were divided according to whether PICS-Ord recoding performed better than, identical to, or worse than excluding ambiguous regions. Numbers in parentheses indicate mean difference in support values using PICS-Ord versus the other two methods.

Discussion

Our study shows that ordination of distance matrices, while introducing a small amount of distortion, recovers phylogenetic signal remarkably well. For non-ambiguous data with a 'known' topology derived from uncoded DNA, INAASE and PICS-Ord with Clustal identity scores performed similarly, with most but not all clades recovered accurately. PICS-Ord with Ngila zeta cost scores slightly outperformed both methods, whereas the performance of ARC could be best characterized as fair. Problems with ARC have been reported

Since PCoA ordination is an eigenvector analysis, the eigenvalues can be used to assess the amount of information represented by each ordination axis and be implemented as weight factor. However, if the PICS-Ord codes are used as ordered characters, the coding method encodes the ordination scores proportionally to the amount of variance explained by each axis, and a weighting factor will not markedly affect the overall performance. Weighting of the axes based on eigenvalues is recommended when the codes (equivalent to columns or sites) produced by PICS-Ord are analyzed as unordered characters or in a GTR model under maximum likelihood, although tests (results not shown) did not suggest marked changes in topology or support with unweighted or weighted PICS-Ord codes. One might also consider weighting to balance the influence of DNA versus PICS-Ord characters in a partitioned dataset. However, in general this will not be necessary. The number of code columns (sites) retained by PICS-Ord for each ambiguous region depends on the number of different sequence motifs present, with a maximum number corresponding to the number of OTUs. In our experience, only about 25-35% of sites will have positive eigenvalues and about 15-25% will be retained after removing invariant sites. The first ambiguous region each of the 100-OTU Graphidaceae, the 706-OTU Physciaceae, and the 1814-OTU Parmeliaceae dataset retained 20, 172, and 320 sites, respectively. In addition, only the first few axes will be clade-informative, that is they contain structure largely congruent with clades resolved by non-coded DNA, and hence increase clade support, whereas the higher axes tend to be 'near-constant'. In a typical dataset of 100-1000 OTUs, the number of sites retained by PICS-Ord for each ambiguous region that are 'clade-informative', will be roughly 5-25. 
In ITS datasets containing roughly 450 unambiguously aligned nucleotide sites, the 'clade-informative' PICS-Ord axes, assuming 2-3 ambiguous regions, would therefore add roughly 15-75 sites, replacing originally ambiguous portions of roughly 100-150 bases in length.
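The eigenvalue-based weighting mentioned above can be illustrated with a minimal sketch. It assumes that the weight of an axis is simply its positive eigenvalue divided by the sum of all positive eigenvalues; the function name `axis_weights` is hypothetical and is not part of the PICS-Ord implementation.

```python
def axis_weights(eigenvalues):
    """Return the proportion of variance represented by each retained axis.

    Assumption: only axes with positive eigenvalues are retained, and the
    weight of an axis is its eigenvalue divided by the sum of retained ones.
    """
    positive = [value for value in eigenvalues if value > 0]
    if not positive:
        raise ValueError("no positive eigenvalues to weight")
    total = sum(positive)
    return [value / total for value in positive]


# Hypothetical eigenvalues from an ordination; the negative one is discarded.
weights = axis_weights([6.0, 3.0, 1.0, -0.5])
```

Such proportions could then be passed as character weights to the phylogenetic program of choice; how weights are supplied depends on that program and is not covered by this sketch.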

The usefulness of including ambiguous regions in phylogenetic analyses and the performance of the corresponding recoding method can be evaluated using two criteria: improved confidence (statistical support) and improved topology (phylogenetic accuracy). Topology can be judged indirectly: when two different methods applied to the same dataset result in topological differences, but under certain conditions the topologies converge, this can be seen as improvement towards phylogenetic accuracy, as long as the resolution does not decrease and no novel topologies appear

The simulation study showed that excluding ambiguous regions resulted in significantly worse topologies and that including them by means of PICS-Ord allowed the recovery of a substantial part of the phylogenetic signal contained therein. The most accurate topologies were obtained when analyzing the simulated datasets unchanged ('as is'); however, since in real biological data we cannot know the true alignment, the inclusion of ambiguous regions by means of recoding, rather than excluding them, is the next best option. In PICS-Ord, recoding ambiguous regions is based on a single optimal solution for each pairwise alignment given Ngila's model of log-affine gap costs, and the transformation of these pairwise alignments into distances reduces the risk of misinterpretation of positional homologies compared to frequency-based methods such as ARC.

The potential power of recovering phylogenetic signal contained in ambiguous regions is shown in our analysis of the 100-OTU Graphidaceae dataset. The topology and support obtained when including ambiguous regions of the mtSSU gene by means of PICS-Ord matches the topology and support obtained by a three-gene tree [unpubl. data] better than the topology based on exclusion of ambiguous regions. Published 2-gene and 3-gene phylogenies of Graphidaceae [

During our study, we made some preliminary comparisons (results not shown) between PICS-Ord and direct optimization methods such as POY, BAli-Phy, PRANK, and SATé

PICS-Ord thus offers a simple and cheap-to-compute alternative to direct optimization and recoding methods such as INAASE

The modularity of PICS-Ord allows for flexible parameter settings, including transition:transversion ratio and gap penalties similar to those of INAASE when calculating simple pairwise cost scores in Ngila

While PICS-Ord recoding was here applied to DNA data, the underlying method can be used to incorporate any kind of multidimensional distance matrix as unidimensional columns in a phylogenetic dataset and hence simplify the analytical approach and considerably increase computational speed.

Conclusions

PICS-Ord offers a simple and fast method to recode regions in multiple sequence alignments that exhibit low alignment confidence scores ('ambiguous regions') and include them as a separate partition in phylogenetic analyses. PICS-Ord can deal with datasets of practically unlimited size and the codes can be analyzed under maximum likelihood and Bayesian approaches, thus eliminating the disadvantages of previously available methods of ambiguous region coding while retaining the relative accuracy of distance-based recoding methods. The incorporation of Ngila allows for a variety of models of indel evolution to be implemented in the coding process, including a power-law zeta model. PICS-Ord is especially useful for phylogenetic analyses that use ribosomal genes (mitochondrial small subunit, mtSSU; nuclear internal transcribed spacer, ITS), as these genes are difficult to align even across closely related taxa; it is therefore a useful alternative to computationally intensive methods that optimize alignments and trees simultaneously. For typical mtSSU and ITS datasets or other multiple sequence or protein alignments that contain portions aligned with low confidence but containing phylogenetic signal, PICS-Ord coding will substantially improve topology and increase support compared to excluding such portions from the analysis.

Methods

Biological and simulated datasets and delimitation of ambiguous regions

Three datasets of real biological data were analyzed. One dataset was extracted from a larger dataset of the lichen family Graphidaceae that originally consisted of three genes (nuLSU, mtSSU,

The delimitation of ambiguous regions is in itself a difficult task

After initial multiple alignment using ClustalW2

In addition to the three biological datasets, we generated 100 simulated datasets using DAWG 1.2

Computing, ordinating, and coding distance and cost score matrices (PICS-Ord)

Ambiguous regions (biological datasets) and partitions containing indels (simulated datasets) were subjected to pairwise alignment to derive distance and cost score matrices. The alignment algorithms implemented in ClustalW2

In addition to simple sequence identity and cost matrices, Ngila 1.3 was applied to find the most likely alignment between two homologous sequences and its log-likelihood score _{k }

Distance and cost score matrices derived via ClustalW and Ngila were subjected to principal coordinates analysis (PCoA). PCoA is found as a stand-alone application in the freely available executables PCO.exe [

Since fractional ordination scores cannot be used in phylogenetic analyses, we encoded the ordination scores obtained from axes with positive eigenvalues as integers. For each axis, the maximum and minimum score ($S_{\max}$, $S_{\min}$) and the range ($S_{\mathrm{Range}} = S_{\max} - S_{\min}$) were computed. The maximum range $S_{\mathrm{Range(max)}}$ across all axes was retained; usually it corresponded to the first axis, more rarely to axes of higher order (because axis variance is determined by both range and dispersion). For each individual OTU, its axis coordinates $S_{\mathrm{OTU}}$ were then rescaled using the following equation: $S_{\mathrm{rescaled}} = (S_{\mathrm{OTU}} - S_{\min})/S_{\mathrm{Range(max)}}$, which transformed all original scores into values ranging between 0.00 and 1.00. Integer scores $\mathrm{INT}_{\mathrm{OTU}}$ were subsequently computed by multiplying $S_{\mathrm{rescaled}}$ by 9.99, subtracting 0.495, and rounding to the closest integer value, resulting in 10-state ordered integer scores ranging from 0 to 9. The rescaling by multiplication with 9.99 and subtraction of 0.495 ensures that each integer code represents a nearly equal range of 1.0 prior to rounding. We also explored other scoring schemes by comparing uncoded DNA with recoded data, including 4-state ordered integer scores and 20-state unordered integer scores, and found that 10-state ordered integer scores performed best in terms of preserving phylogenetic signal contained in uncoded DNA.
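The rescaling and integer coding described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function name `encode_axes`, the list-of-lists input layout (rows are OTUs, columns are ordination axes), and the example scores are assumptions made for this sketch. Rounding to the closest integer is implemented as rounding half up.

```python
import math


def encode_axes(scores):
    """Recode ordination scores as 10-state ordered integers (0 to 9).

    scores: list of rows, one per OTU; each row holds that OTU's coordinates
    on the retained axes (assumed to be axes with positive eigenvalues).
    """
    n_axes = len(scores[0])
    columns = [[row[axis] for row in scores] for axis in range(n_axes)]
    minima = [min(column) for column in columns]
    # Largest range across all axes; every axis is rescaled by this value.
    max_range = max(max(column) - min(column) for column in columns)
    if max_range == 0:
        raise ValueError("all ordination scores are constant")

    coded = []
    for row in scores:
        coded_row = []
        for axis, score in enumerate(row):
            rescaled = (score - minima[axis]) / max_range  # between 0 and 1
            value = rescaled * 9.99 - 0.495
            coded_row.append(int(math.floor(value + 0.5)))  # round half up
        coded.append(coded_row)
    return coded


# Hypothetical scores for three OTUs on two axes.
example = [[-0.5, -0.1], [0.5, 0.15], [0.02, 0.02]]
codes = encode_axes(example)
```

Because every axis is divided by the largest range across all axes, an axis with a small spread uses only the lower integer states, which is how the coding stays proportional to the variation represented by each axis.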

For the 100-OTU Graphidaceae dataset, we used a simple approach to assess the level of congruence and potential homoplasy between each of the ambiguous regions and the non-ambiguous alignment portion. For all 100 OTUs, Clustal pairwise sequence identity scores were computed for each ambiguous region of the alignment and for the non-ambiguous portion. The resulting distance matrices were plotted against each other and the degree of linear correlation was assessed by means of the Pearson product-moment correlation coefficient as implemented in STATISTICA 6.0.
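A minimal sketch of such a comparison is given below, assuming that the two pairwise score matrices are symmetric and that each pair of OTUs is counted once (upper triangle only). The function names and the toy matrices are hypothetical; the original analysis was run in STATISTICA, not with this code.

```python
import math


def upper_triangle(matrix):
    """Collect each pairwise value once (i < j) from a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def matrix_correlation(matrix_a, matrix_b):
    """Correlate the pairwise scores of two matrices over the same OTUs."""
    return pearson_r(upper_triangle(matrix_a), upper_triangle(matrix_b))


# Toy identity matrices (percent) for three OTUs; values are illustrative only.
non_ambiguous = [[100, 90, 80], [90, 100, 70], [80, 70, 100]]
ambiguous = [[100, 80, 60], [80, 100, 40], [60, 40, 100]]
r = matrix_correlation(non_ambiguous, ambiguous)
```

Pairwise values drawn from the same matrix are not independent of each other, so the coefficient is best read as a descriptive measure of congruence.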

Comparative analysis of coding methods using a non-ambiguously aligned biological dataset

To compare the output of coding methods with original, non-coded DNA data, we used a subset of 31 OTUs of the Graphidaceae dataset and the non-ambiguous portion of the alignment, trimmed to 720 positions. The subset of 31 OTUs (30 ingroup plus one outgroup) was chosen to accommodate the limitations of INAASE, which can only handle up to 32 distinct sequence patterns per alignment portion. The alignment was divided into 12 portions of 60 positions each, and each portion was subjected to recoding using: (1) INAASE cost scores (step matrix) with a transition:transversion:gap ratio of 1:1:1; (2) ARC; (3) PICS-Ord with Clustal pairwise identity scores (default ratio of 1:1:1); and (4) PICS-Ord with Ngila pairwise log likelihood cost scores (zeta power-law model with default settings); the latter two ordinated with uncorrected PCoA retaining axes with positive eigenvalues only and rescaled as ordered 10-state integer codes. The encoded datasets resulted in 12 characters (step matrices) for INAASE, 276 for ARC, 204 for PICS-Ord with Clustal scores, and 141 for PICS-Ord with Ngila scores, as compared to 364 parsimony informative sites in the original DNA matrix. The original DNA alignment and all encoded datasets were subjected to maximum parsimony analysis in PAUP* 4.0b10

Comparative analysis of PICS-Ord coding versus ambiguous regions excluded or automatically aligned

Using the 100-OTU Graphidaceae dataset, the 705-OTU Physciaceae dataset, and the simulated datasets, we performed a comparative analysis of the multiple alignments as follows: (1) ambiguous regions excluded, and (2) ambiguous regions encoded using PICS-Ord. For option (1), we used the non-ambiguous portions of the two biological datasets and the non-ambiguous partitions 1-2 of the simulated datasets. For option (2), ambiguous regions (biological datasets) or partitions (simulated data) were pairwise aligned and Ngila log likelihood cost scores were computed under the zeta power-law model (default settings). The cost score matrices were ordinated using PCoA without correcting for non-metricity and all axes with positive eigenvalues were retained. Ordination scores were rescaled to 10-state ordered integer codes.
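The ordination step can be sketched with a classical PCoA, here written with NumPy as an illustration. The sketch assumes a symmetric matrix of pairwise distances or cost scores as input, applies no correction for non-metricity, and simply discards axes whose eigenvalues are not positive. The function name `pcoa`, the tolerance, and the toy matrix are assumptions; the study itself used dedicated PCoA software.

```python
import numpy as np


def pcoa(distances, tolerance=1e-9):
    """Classical principal coordinates analysis of a symmetric distance matrix.

    Returns (coordinates, eigenvalues) for axes with positive eigenvalues only.
    No correction for negative eigenvalues (non-metricity) is applied.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Gower double-centering of the squared distances.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    b = (b + b.T) / 2.0  # remove tiny asymmetries caused by rounding
    eigenvalues, eigenvectors = np.linalg.eigh(b)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    keep = eigenvalues > tolerance * np.abs(eigenvalues).max()
    coordinates = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])
    return coordinates, eigenvalues[keep]


# Toy example: three OTUs lying on a line at positions 0, 3, and 4.
toy = [[0.0, 3.0, 4.0], [3.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
coords, eigvals = pcoa(toy)
```

For the toy matrix a single axis is retained, and the distances between the returned coordinates reproduce the input distances. With real cost score matrices, negative eigenvalues are expected and the corresponding axes are dropped.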

The biological datasets were analyzed under maximum likelihood using the most recent version 7.2.6 of RAxML [

Phylogenetic inferences on simulated datasets were conducted using the SSE3-vectorized version of RAxML 7.2.6

To compute the topological distances of all resulting trees to the true tree, we used the respective RAxML option (-f r) to obtain the relative Robinson-Foulds (RF) distance
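For readers unfamiliar with the measure, the relative RF distance can be illustrated with the following sketch. It is not RAxML code: trees are represented as nested tuples of taxon names, only non-trivial bipartitions are compared, and the distance is normalised by 2(n - 3), the maximum for two fully resolved unrooted trees with n taxa. The function names and the toy trees are assumptions made for this example.

```python
def leaf_set(tree):
    """Return the set of taxon names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Each split is stored as the side that does not contain a fixed reference
    taxon, so the result does not depend on where the tree is rooted.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - leaves if reference in leaves else leaves
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return leaves

    walk(tree)
    return splits


def relative_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by 2(n - 3)."""
    taxa = leaf_set(tree_a)
    if taxa != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    if len(taxa) < 4:
        raise ValueError("at least four taxa are required")
    differing = bipartitions(tree_a) ^ bipartitions(tree_b)
    return len(differing) / (2 * (len(taxa) - 3))


# Toy trees with five taxa; the second differs from the first in one split.
true_tree = (("A", "B"), ("C", "D"), "E")
inferred = (("A", "B"), ("C", "E"), "D")
distance = relative_rf(true_tree, inferred)
```

A value of 0 means that the inferred tree has the same topology as the true tree; multiplying by 100 gives the percentage scale used for the mean relative RF values reported in the Results.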

PICS-Ord Implementation

A reference implementation of PICS-Ord is available from

Authors' contributions

RL provided the idea of the novel methodology presented here and performed initial computations under parsimony, as well as the individual Ngila and PCoA analyses and ordination scores recoding and ML analyses of the 706-OTU and 1814-OTU biological datasets. BPH and RL further developed the idea and made analytical comparisons with alternative ambiguous region coding methods. AS developed an updated version of RAxML (7.2.6 and higher) to allow for phylogenetic analysis of ambiguous region codes using a mixed model under maximum likelihood, and performed ML analyses, including computation of RF values, on the simulated datasets and part of the biological datasets. RAC developed an updated version of Ngila for pairwise alignment score matrices to be directly analyzed by PCoA, generated the simulated datasets, and wrote an R script for automated recoding using the PICS-Ord approach. All authors contributed equally to writing the manuscript and read and approved the final manuscript.

Acknowledgements

This study was elaborated in the framework of phylogenetic studies of the lichen family Graphidaceae (= Thelotremataceae) and the basidiolichen-containing family Hygrophoraceae. Both studies received support from the National Science Foundation under the titles 'Phylogeny and Taxonomy of Ostropalean Fungi, with Emphasis on the Lichen-forming Thelotremataceae' (NSF-DEB 0516116 to The Field Museum; PI H.T. Lumbsch; Co-PI R. Lücking) and 'Phylogenetic Diversity of Mycobionts and Photobionts in the Cyanolichen Genus