Abstract
Background
We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICSOrd). The method works via ordination of sequence identity or cost scores matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method.
Results
Including ambiguous regions coded by means of PICSOrd increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICSOrd performs equal to or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantage of a practically unlimited alignment size and increased analytical speed and the possibility of PICSOrd scores to be analyzed together with DNA data in a partitioned maximum likelihood model.
Conclusions
Advantages of PICSOrd over step matrixbased ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICSOrd codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word and frequencybased methods, PICSOrd maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICSOrd codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher that was developed for this study) allows up to 32state ordered or unordered characters. A GTR, MK, or ORDERED model can be applied to analyse the PICSOrd codes partition, with GTR performing slightly better than MK and ORDERED.
Availability
An implementation of the PICSOrd algorithm is available from http://scit.us/projects/ngila/wiki/PICSOrd webcite. It requires both the statistical software, R http://www.rproject.org webcite and the alignment software Ngila http://scit.us/projects/ngila webcite.
Background
Sequence alignment is the most critical step in molecular phylogenetic analysis. It defines homologous sites and putative evolution of sitespecific variation [15]. However, sequence portions in multiple sequence alignments (MSA) may have low alignment confidence, that is they are ambiguously aligned (often called 'ambiguous regions') due to a variable number of indels, and thus different alignment solutions with identical cost scores are possible [6,7]. Such portions are usually excluded from further analysis [810], as methodologies often only work on a single MSA, and ambiguities require subjective prioritization of a single alignment solution. Lengthvariable regions also have an increased probability of homoplastic evolution. However, it is recognized that ambiguously aligned portions do carry substantial phylogenetic signal [1113].
Methods that do not require a single MSA provide one solution to this problem. Direct optimization (DO) optimizes alignments and trees simultaneously under parsimony, likelihood, or in a Bayesian framework [1421]. However, while DO can handle uncertainty in alignments due to sequence length variation, it is computationally intensive and support is usually calculated by sampling of alternative alignments and trees, as credible intervals, posterior probabilities, or Bremer support. These are not directly comparable to bootstrap support, which in DO is only possible by creating pseudoalignments, which defies the purpose of DO. Furthermore, it is also disputed whether alignments with optimized cost or likelihood scores under a given setting will return optimized tree topologies [2226].
An alternative to DO or to excluding ambiguous regions is the separate analysis of indels and encoding them as nonDNA characters [13]. The simplest approach is to encode residues by sequence length (each length receiving a different code) or coding them as identical or different without measuring the degree of identity or difference [27]. Tools such as SNAP [28,29] and MAFFT [3032] include options to encode short indels (gaps). More complex methods dealing with larger gaps include frequencybased, motifbased, and pairwise distancebased methods. Frequencybased methods such as 'ambiguous regions coding', ARC [33] calculate the relative frequencies of bases and base pairs and their spatial distribution within sequences of ambiguously aligned regions. Motif or 'word'based methods search for common 'substrings' in ambiguously aligned sequences [34]. Applications include ARC and 'Nlocal decoding' [35,36]. The only pairwise distancebased method appears to be 'integration of ambiguously aligned sequences', INAASE [37], which encodes each region as a single character with a corresponding step matrix. INAASE recovers phylogenetic signal contained in ambiguous regions rather accurately, whereas ARC produces topologies that do not always agree with the underlying signal [38]. The size of step matrices is limited to a fixed number of states: 32 states in PAUP* 4.0b10 (as an unsigned integer has 32 bits), 64 states on newer CPUs [39], therefore INAASE cannot be applied to alignments that include a large number of taxa or complex and highly variable ambiguous regions. In addition, the search time (including bootstrapping) is increased about 10 to 100 times depending on the number and size of the step matrices included (unpubl. data based on comparative analysis of datasets on singleprocessor APPLE Aluminium G4 Powerbook and DELL Inspiron 1720 laptop computers).
The solution to the size and performance limitations of stepmatrixbased analysis is to transform the multidimensional step matrix into unidimensional scores prior to phylogenetic analysis. This way, computing pairwise alignment scores can be applied to a theoretically unlimited number of OTUs and to ambiguous regions with high length variation and complexity. This is achieved by ordinating the step matrix and dissecting it into perpendicular axes. The axis coordinates for each OTU can then be used to obtain codes to replace the ambiguously aligned regions. The ordination method of choice must accept similarity (identity) or dissimilarity (cost) matrices as input, which excludes principal component analysis (PCA).
Three commonly used methods can ordinate OTUs based on identity or distance matrices: polar ordination (BrayCurtis), nonmetric multidimensional scaling (NMS), and principal coordinates analysis or 'metric multidimensional scaling', PCoA [40,41]. NMS has become the default ordination method for ecological data [4143], but the ordination is computed via an iterative numerical procedure that starts from a random starting configuration. Therefore, sample points that have identical original data scores (e.g., identical sequences), and hence should fall on exactly the same point in the ordination diagram, will instead slightly disperse. This would affect sequence ordination, since identical sequences would result in slightly different ambiguous region codes. Polar ordination and PCoA do not exhibit this problem, and the latter method has the advantage of being an eigenvector analysis, which calculates the degree of variance explained by each axis [40,41,44]. PCoA has been applied to sequence ordination, although mostly for visualizing sequence and tree spaces [13,45]; http://pbil.univlyon1.fr/mva/pco.php webcite). Because of its properties, it is the default method for the algorithm proposed here.
In this paper, we describe the computational procedure to encode ambiguous regions: (1) compute pairwise distance matrices for ambiguous regions of an alignment, (2) ordinate the distance matrices, and (3) encode the ordination scores and integrate them into a phylogenetic data matrix. Our novel method, PICSOrd, was tested using three biological and 100 simulated datasets. One biological dataset (100 OTUs, mtSSU) was extracted from a large dataset of over 600 OTUs and three genes (mtSSU, nuLSU, RPB2) of the lichenized fungal family Graphidaceae [[46,47]; unpubl. data], whereas the second dataset represented 706 OTUs and one partial gene (ITS) of the family Physciaceae. The third dataset, representing 1814 OTUs and one partial gene (ITS) of the lichen family Parmeliaceae, was used to assess computational speed for large datasets. Both similarity matrices created by ClustalW [48] and Ngila [49] were tested.
Results
Maximum Parsimony
The three ambiguous regions of the 100OTU Graphidaceae dataset showed different degrees of congruence with the nonambiguous alignment portion (Figure 1). The phylogenetic signal of region 1 correlated better with that of the nonambiguous alignment portion compared to regions 2 and 3. For regions 2 and 3, identity scores between ambiguous sequences were on average higher than expected when identity scores between nonambiguous alignment portions were low, suggesting some degree of homoplasy through a saturation effect in short ambiguous sequences.
Figure 1. Correlation between Clustal sequence identity scores of the nonambiguous alignment portion (xaxis) and each of the ambiguous regions (yaxis). Left column: scatterplots of sequence identity scores, with linear correlation tested using Spearman rank correlation. Right column: same data but categorized to show emerging pattern (1: 7075%; 2: >7580%; 3: >8085%; 4: >8590%; 5: >9095%; 6: >95100%). Box plots indicate mean, standard deviation, and maximum/minimum values.
Recoding of the nonambiguous alignment portion of 31 OTUs with ARC, INAASE, and PICSOrd with CLUSTAL, and PICSOrd with Ngila distances, resulted in partially deviating maximum parsimony topologies ('distortions') when compared to the tree derived from the uncoded, original DNA alignment (Figure 2). ARC and PICSOrd with CLUSTAL distances resulted in comparatively high relative RF (Robinson Foulds) distance values of nearly 40%, whereas INAASE and PICSOrd with Ngila distance gave better values near 25%. INAASE recoding produced two conflicts (conflictive topology with bootstrap support 70% or higher) compared to the uncoded DNA topology: Fissurina was resolved as paraphyletic and the GraphisChroodiscus clade was nested within the ChapsaLeucodectonThelotrema clade. ARC recoding also exhibited two conflicts: nonmonophyly of the ChapsaLeucodectonThelotrema clade and Leucodecton being nested within the AmpliotremaFibrillithecisMelanotremaMyriotremaOcellularia clade. PICSOrd recoding with Clustal distances resulted in one conflict, viz. paraphyly of the DiorygmaGlyphisPlatygrammeSarcographa clade, whereas. PICSOrd recoding with Ngila distances did not show any conflict with the topology derived from the uncoded DNA.
Figure 2. Maximum parsimony trees computed from the nonambiguous alignment portion, using original DNA data and data recoding by means of INAASE, ARC, PICSOrd with Clustal sequence identity and PICSOrd with Ngila zeta cost scores. The five major backbone nodes that are also supported in multigene studies are indicated by grey circles. Branches with good or strong support (70% or higher) and indicated by thick lines and branches with weak support (less than 70%) by slightly thickened lines. Exact bootstrap support values for backbone and terminal nodes are indicated in the table in the upper right corner.
All recoding methods resulted in some loss of backbone support, whereas support for terminal nodes remained largely unchanged (Figure 2). ARC did not support the large sister clade of the basal split, which received absolute support (100%) in uncoded DNA analysis and good support (8688%) with the other recoding methods. Genus group support was generally lower with encoded data and especially under ARC and PICSOrd. Genus support was high for all recoding methods and especially using PICSOrd with Ngila distances, with the exception of INAASE which recovered Diorygma with lower support and did not recover the otherwise strongly supported Fissurina. The best overall recoding performance (least amount of distortion) was thus found with PICSOrd using Ngila distances. The performance of ARC was fairly poor, with high relative RF value, conflictive topology, and absence of support for the largest backbone node.
Maximum Likelihood
Maximum likelihood analysis of the 100OTU Graphidaceae dataset with ambiguous regions either excluded or encoded using PICSOrd (Ngila with zeta model) resulted in largely congruent topologies, with only one major clade switching positions between analyses (Figure 3). Three further major clades had internal topologies changed between analyses: the AmpliotremaFibrillithecisMelanotremaMyriotremaOcellularia clade, the DiorygmaGlyphisPhaeographisPlatygrammeSarcographa clade, and the ChapsaLeucodectonThelotrema clade. However, this topological conflict was not supported, except for the clustering of Chapsa and Leucodecton under PICSOrd. The two analyses involved 36 backbone and terminal nodes of interest. Of these, 14 nodes had absolute support (100%) in both cases (Figure 3). Fourteen further nodes had increased support under PICSOrd, with an average increase of 23% for five backbone and genus group nodes and 6% for nine terminal genus nodes. Especially notable was the increase under PICSOrd from 40% to 79% for the ChroodiscusDiploschistes node and from 57% to 96% for the FibrillithecisMyriotrema node, two nodes that are supported in multigene studies (Rivas Plata et al., in prep.). For two genus group nodes (AmpliotremaOcellularia and PhaeographisPlatygrammeSarcographa), support decreased slightly under PICSOrd (average of 7%), whereas the two major backbone nodes of the large sister clade to FissurinaDyplolabia showed substantial decrease in support under PICSOrd (71% to 36% and 60% to 41%, respectively). The two latter nodes are not recovered in multigene studies. The remaining four nodes appeared in one of the two analyses only: with ambiguous regions excluded, Glyphis clustered with the PhaeographisPlatygrammeSarcographa clade and Leucodecton with Thelotrema, in both cases lacking support (41% and 34%, respectively), whereas under PICSOrd, Glyphis clustered with Diorygma (53% support) and Leucodecton with Chapsa (83% support); the latter topologies are congruent with multigene studies. Inclusion of ambiguous regions under PICSOrd thus did not only result in overall increased bootstrap support, but also in topologies that are more in accordance with multigene studies using mtSSU, nuLSU, and RPB2 [[46,47]; unpubl. data].
Figure 3. Maximum likelihood trees computed from the 100OTU Graphidaceae dataset with ambiguous regions excluded (left) and recoded using PICSOrd with Ngila zeta cost scores (right). A GTRGamma model was applied to the DNA partition and a GTR model for the PICSOrd code partition (GTRCAT for rapid bootstrapping in both cases). Bootstrap support values are indicated next to the branches. Grey triangles indicate major clades with different position in both analyses, and black lines indicate clades with internal topology differing between analyses. Short arrows indicate nodes with increased (black) or decreased (grey) support under PICSOrd and long arrows indicate nodes present either with ambiguous regions excluded (grey) or under PICSOrd (black).
Simulations
We generated 100 simulated datasets of aligned sequences, each split into five partitions. Partitions 1 and 2 had unambiguous alignments, while 35 had different degrees of alignment ambiguity. Sections 14 were combined in one analysis, while 1, 2, and 5 in another. RAxML analysis of the 100 simulated datasets recovered the best trees when sections 14 (1+2+5; results below given in parentheses for each treatment) were treated as prealigned without changes, with a mean relative RF value of 2.74% (3.33%) and recovering the true tree 50 (47) times out of 100 (Figure 4). Excluding ambiguous sections 35 resulted in less accurate topologies on average (relative RF = 4.93%): 56 (63) datasets gave identical trees compared to prealigned sections 14 (12, 5), seven (14) datasets resulted in improved topology, but 37 (23) datasets showed worse topology. Recoding ambiguous sections using PICSOrd on average improved topologies under all three employed substitution models and particularly under GTR (RF = 3.81% and 3.96%, respectively): 59 (59) datasets gave identical trees compared to ambiguous sections excluded for sections 34 (5) encoded, 29 (27) datasets resulted in improved topology, and 12 (14) datasets gave a worse topology. Thus, recoding ambiguous sections under PICSOrd resulted in identical or improved topologies compared to excluding ambiguous sections in 88 (86) out of 100 cases. All differences were statistically significant using a Wilcoxon matched pairs test (Table 1).
Figure 4. Distribution of RF values of recovered tree topologies under different methodological approaches of excluding and including ambiguous sections in the simulated datasets (compared to the true tree from which the simulated datasets were generated); all = all sections prealigned, exc = ambiguous sections excluded, PIC = PICSOrd coding (Ngila zeta model). Numbers in upper part of boxes indicate recovered true trees (out of 100). Box plots indicate mean, standard deviation, and maximum/minimum values.
Table 1. Wilcoxon matched pairs test comparing the RF values of simulated datasets.
The 705OTU Physciaceae dataset showed 100 nodes at the backbone, genus group, genus, species group, and species level (with at least three samples per species; tree not shown). Eighteen nodes were present under PICSOrd but absent when ambiguous regions were excluded; of these, nine had support values ranging between 14% and 69% and nine had values ranging between 77% and 100% under PICSOrd (Figure 5). A total of 36 nodes had higher support under PICSOrd compared to both alternative treatments, and the differences ranged between 21.5% and 54% for 15 nodes, between 11% and 19.5% for ten nodes, and between 2% and 10% for 11 nodes. For another ten nodes, PICSOrd gave higher support values compared to one of the alternative methods and identical values compared to the other, with an average increase of 4.4%. Nineteen nodes behaved identically for all three methods, with support values of 100%. For seven nodes, PICSOrd gave higher support values compared to one alternative method (average increase 8.9%) but lower values compared to the other (average decrease 4.2%). The remaining ten nodes showed lower support values for PICSOrd compared to both alternative methods, with an average decrease of 6.0% (Figure 5). PICSOrd thus showed overall increased support for 64 nodes and overall decreased support for only ten nodes, with an average increase over all nodes of 10.3% and a maximum increase for one node versus ambiguous regions excluded of 78%. All nodes with increased support under PICSOrd or which appeared only under PICSOrd correspond to clades and taxa that are supported in twogene or multigene studies including also mtSSU and nuLSU [[46,47]; unpubl. data].
Figure 5. Proportion of increased or decreased support values for 100 backbone, genus group, genus, species group, and species nodes of the 705OTU Physciaceae dataset analysed under maximum likelihood with ambiguous regions either excluded or recoded with PICSOrd. Nodes were divided according to whether PICSOrd recoding performed better than, identical to, or worse than excluding ambiguous regions. Numbers in parentheses indicate mean difference in support values using PICSOrd versus the other two methods.
Discussion
Our study shows that ordination of distance matrices, while introducing a small amount of distortion, recovers phylogenetic signal remarkably well. For nonambiguous data with a 'known' topology derived from uncoded DNA, INAASE and PICSOrd with Clustal identity scores performed similarly, with most but not all clades recovered accurately. PICSOrd with Ngila zeta cost scores slightly outperformed both methods, whereas the performance of ARC could be best characterized as fair. Problems with ARC have been reported [38] and are based on the fact that the recoding method used in ARC is not distancebased but encodes sequences based on length and relative frequency of individual bases and base pairing patterns [33]. Under certain, usually rare circumstances, this can lead to erroneous codes, as the following example illustrates: consider sequences (1) TTGGCCAACCGGTT, (2) AACCGGTTGGCCAA, and (3) AGCCAGCTGGCTAA. Sequences (2) and (3) are more similar to one other, differing in four transitions only, whereas sequences (1) and (2) are dissimilar. However, because they have similar base and base pair frequencies, ARC will encode sequences (1) and (2) as being more similar to one other and sequences (2) and (3) as being dissimilar, as the ARC codes demonstrate: (1) 00000000000000000000000, (2) 01001100000000100001001, and (3) 01001011111111211112112. Therefore, ARC may not only recover topologies in conflict with nonambiguous portions of the alignment but also in conflict with the phylogenetic signal contained in the ambiguous regions. Distancebased methods avoid this problem. INAASE has been shown to perform well when the dataset is sufficiently small, recovering phylogenetic signal with great accuracy, even though the actual number of codes is very small, with a single character representing each ambiguous region [37,38]. For large datasets with over 32 distinct sequence patterns in ambiguous regions, PICSOrd with Ngila zeta cost scores offers a good and fast alternative. Zeta cost scores slightly outperformed simple identity and cost scores in our analysis, confirming the results of previous studies [25,49].
Since PCoA ordination is an eigenvector analysis, the eigenvalues can be used to assess the amount of information represented by each ordination axis and be implemented as weight factor. However, if the PICSOrd codes are used as ordered characters, the coding method encodes the ordination scores proportionally to the amount of variance explained by each axis, and a weighting factor will not markedly affect the overall performance. Weighting of the axes based on eigenvalues is recommended when the codes (equivalent to columns or sites) produced by PICSOrd are analyzed as unordered characters or in a GTR model under maximum likelihood, although tests (results not shown) did not suggest marked changes in topology or support with unweighted or weighted PICSOrd codes. One might also consider weighting to balance the influence of DNA versus PICSOrd characters in a partitioned dataset. However, in general this will not be necessary. The number of code columns (sites) retained by PICSOrd for each ambiguous region depends on the number of different sequence motifs present, with a maximum number corresponding to the number of OTUs. In our experience, only about 2535% of sites will have positive eigenvalues and about 1525% will be retained after removing invariant sites. The first ambiguous region each of the 100OTU Graphidaceae, the 706OTU Physciaceae, and the 1814OTU Parmeliaceae dataset retained 20, 172, and 320 sites, respectively. In addition, only the first few axes will be cladeinformative, that is they contain structure largely congruent with clades resolved by noncoded DNA, and hence increase clade support, whereas the higher axes tend to be 'nearconstant'. In a typical dataset of 1001000 OTUs, the number of sites retained by PICSOrd for each ambiguous region that are 'cladeinformative', will be roughly 525. In ITS datasets containing roughly 450 unambiguouslyaligned nucleotide sites, the 'cladeinformative' PICSOrd axes, assuming 23 ambiguous regions, would therefore add roughly about 1575 sites, replacing originally ambiguous portions of roughly 100150 bases in length.
The usefulness of including ambiguous regions in phylogenetic analyses and the performance of the corresponding recoding method can be evaluated using two criteria: improved confidence (statistical support) and improved topology (phylogenetic accuracy). Topology can be judged indirectly: when two different methods applied to the same dataset result in topological differences, but under certain conditions the topologies converge, this can be seen as improvement towards phylogenetic accuracy, as long as the resolution does not decrease and no novel topologies appear [50,51]. Ambiguous regions likely contain homoplastic phylogenetic signal which could mask the signal contained in nonambiguous portions of the alignment. A simple way to test this is to plot distance matrices obtained from nonambiguous and ambiguous regions against each other. If there is an acceptable level of congruence, one would expect that inclusion of ambiguous regions by means of a coding method should improve support and/or topology. The best way of testing these criteria is through the use of simulation studies [52,53]. However, simulated data are typically not as 'messy' as real biological data, and only a combined approach including biological and simulated data allowed us to assess the performance of our novel recoding method.
The simulation study showed that excluding ambiguous regions resulted in significantly worse topologies and that including them by means of PICSOrd allowed the recovery of a substantial part of the phylogenetic signal contained therein. The most accurate topologies were obtained when analyzing the simulated datasets unchanged ('as is'); however, since in real biological data we cannot know the true alignment, the inclusion of ambiguous regions by means of recoding, rather than excluding them, is the next best option. In PICSOrd, recoding ambiguous regions is based on a single optimal solution for each pairwise alignment given NGILA's model of logaffine gap costs, and the transformation of these pairwise alignments into distances reduces the risk of misinterpretation of positional homologies compared to frequencybased methods such as ARC.
The potential power of recovering phylogenetic signal contained in ambiguous regions is shown in our analysis of the 100OTU Graphidaceae dataset. The topology and support obtained when including ambiguous regions of the mtSSU gene by means of PICSOrd matches the topology and support obtained by a threegene tree [unpubl. data] better than the topology based on exclusion of ambiguous regions. Published 2gene and 3gene phylogenies of Graphidaceae [[46,47]; unpubl. data] recovered the Fissurina, Ocellularia, Phaeographis, and Thelotrema clades with strong support (90100% maximum likelihood bootstrap and 0.981.00 posterior probability). In our approach, PICSOrd recoding for these clades increased support by 9% for the Fissurina clade and by 1820% for the Ocellularia and Thelotrema clades. Graphis was supported sister to the Ocellularia clade when ambiguous regions were excluded but that support disappeared when using PICSOrd recoding; in 2gene and 3gene phylogenies, Graphis does not appear as sister to the Ocellularia clade. Similarly, Wirthiotrema appeared sister to the Thelotrema clade when ambiguous regions are excluded, but sister to a clade including Diploschistes under PICSOrd, which is more in line with published 2gene and 3gene phylogenies. This indicates that the phylogenetic signal contained in the ambiguous portion of the mtSSU gene is congruent with the phylogenetic signal contained in other genes (nuLSU, RPB2) and therefore should not be excluded from phylogenetic analysis. The predictive power contained in ambiguous portions of the mitochondrial small subunit ribosomal DNA and in the nuclear internal transcribed spacer ribosomal DNA, especially at lower hierarchical level, make these genes promising candidates for DNA barcoding in Ascomycota [5456].
During our study, we made some preliminary comparisons (results not shown) between PICSOrd and direct optimization methods such as POY, BALiPhy, PRANK, and SATè [1420,57]. Although PICSOrd recoding introduces a slight distortion of the original data, we did not find that topological accuracy was increased when using DO instead of PICSOrd recoding, either with the simulated or with the biological datasets. In fact, some of the DO methods tested consistently returned less accurate topologies, even if the alignment and tree scores were improved under variable cost settings. In addition, inference time was substantially increased for all DO methods compared to analyzing mixed datasets including PICSOrd codes under maximum likelihood in RAxML, by a factor of ten to one hundred or even more depending on the method. The detailed results of our comparison with DO methods will be presented in a forthcoming publication.
PICSOrd thus offers a simple and cheaptocompute alternative to direct optimization and recoding methods such as INAASE [37], when phylogenetic trees are derived from fixed multiple alignments with substantial ambiguous portions. For smaller datasets capable of being handled by INAASE (maximum of 32 or 64 sequence patterns per ambiguous region), INAASE and PICSOrd coding with Clustal give fairly similar results, and INAASE might be the preferred method, since the step matrices reflect the actual sequence distances before ordination. However, INAASE can only be implemented with parsimony analysis and not within a Bayesian or maximum likelihood framework. Also, PICSOrd with Ngila zeta costs scores outperformed INAASE, and application of a powerlaw model of indel evolution was shown to be superior to other methods for sequences with variable indels [25,49]. A criticism of PICSOrd might be that the resulting codes are abstract entities and do not directly correspond to DNA or phenotype data. However, since the ordination axes are perpendicular to each other and the ordination space is a reflection of the original pairwise distance matrix space, the PICSOrd codes can be interpreted as mathematically independent components of the original sequences' distances, thus fulfilling two important requirements for their phylogenetic analysis: mathematical independence and reflecting the original distance space. Besides easy computation of ambiguous region codes and seamless integration of data partitions (no userdefined step matrices are required), PICSOrd allows for a practically unlimited number of OTUs and sequence patterns to be analyzed.
The modularity of PICSOrd allows for flexible parameter settings, including transition:transversion ratio and gap penalties similar to those of INAASE when calculating simple pairwise cost scores in Ngila [25,49]. Alternative distance measures other than those provided by Clustal (identity) or Ngila (simple or zeta cost scores) are also conceivable, such as those based on 'Nlocal decoding' [35,36]. Another possibility for finetuning PICSOrd lies in the number of ordination axes selected for recoding and in the way the principal coordinates are encoded. This allows for adjustments of PICSOrd codes with respect to the relative length of ambiguous regions within a given alignment. The fact that PICSOrd codes are simple integer values permits combined analysis of DNA and PICSOrd code partitions in a Bayesian framework [58], with up to 6state ordered codes, and under maximum likelihood with RAxML 7.2.6, using a GTR or MK model or characters as ORDERED, with up to 32 states. This was previously impossible with mixed letter/integer codes or codes representing userdefined step matrices. In addition to the improved topological accuracy, a further argument for using PICSOrd versus direct optimization or INAASE is computational speed: recoding the ambiguous region in our 705OTU dataset on a dualcore INTEL processor took two minutes, and analysis of the partitioned dataset under maximum likelihood in RAxML on the same machine required about 36 hours including rapid bootstrapping (100 replicates). For the 1814OTU Parmeliaceae dataset, recoding took about 35 minutes for each region and maximum likelihood analysis including rapid bootstrapping (100 replicates) in RAxML lasted eight days. This is comparable to the time a 50100OTU dataset would have taken to be analysed on the same processor under maximum parsimony in PAUP with ambiguous regions included as INAASE step matrices. Computational speed can further be substantially increased when running the software on multiprocessor computers and web servers [59]. Since PICSOrd uses a set of integer codes to represent each ambiguous region for a given OTU, another problem of INAASE is avoided: the limited number of available symbols when coding an entire ambiguous region as a single character.
While PICSOrd recoding was here applied to DNA data, the underlying method can be used to incorporate any kind of multidimensional distance matrix as unidimensional columns in a phylogenetic dataset and hence simplify the analytical approach and considerably increase computational speed.
Conclusions
PICSOrd offers a simple and fast method to recode regions in multiple sequence alignments that exhibit low alignment confidence scores ('ambiguous regions') and include them as separate partition in phylogenetic analyses. PICSOrd can deal with datasets of practically unlimited size and the codes can be analyzed under maximum likelihood and Bayesian approaches, thus eliminating the disadvantages of previously available methods of ambiguous region coding while retaining the relative accuracy of distancebased recoding methods. The incorporation of Ngila allows for a variety of models of indel evolution to be implemented in the coding process, including a powerlaw zeta model. PICSOrd is especially useful for phylogenetic analyses that use ribosomal genes (mitochondrial small subunit, mtSSU; nuclear internal transcribed spacer, ITS), as these genes are difficult to align even across closely related taxa, and is therefore a useful alternative to computationally intensive methods that optimize alignments and trees simultaneously. For typical mtSSU and ITS datasets or other multiple sequence or protein alignments that contain portions aligned with low confidence but containing phylogenetic signal, PICSOrd coding will substantially improve topology and increase support compared to excluding such portions from the analysis.
Methods
Biological and simulated datasets and delimitation of ambiguous regions
Three datasets of real biological data were analyzed. One dataset was extracted from a larger dataset of the lichen family Graphidaceae that originally consisted of three genes (nuLSU, mtSSU, RPB2) and 280 morphological characters for over 600 OTUs [[47]; unpubl. data]. The extracted sample dataset comprised 100 OTUs representing eight major clades and 19 subclades. Only the mitochondrial small subunit ribosomal DNA (mtSSU) was used for this study, as it exhibited substantial ambiguity. The mtSSU alignment had a total length of 963 positions, with 738 sites corresponding to nonambiguous portions. The second and third dataset comprised 705 and 1814 OTUs of the lichen families Physciaceae and Parmeliaceae, respectively, both representing the internal transcribed spacer (ITS) of the nuclear ribosomal DNA. The Physciaceae alignment had a total length of 564 sites, with 446 corresponding to nonambiguous portions, whereas the Parmeliaceae alignment had 634 sites, with 456 corresponding to nonambiguous portions.
The delimitation of ambiguous regions is in itself a difficult task [13,37]. Several methods have been proposed, such as 'culling' and 'elision' [60,61]. However, these methods are very conservative and usually identify portions of nonambiguous alignments as ambiguous. Recent approaches include HoT (Heads or Tails), which uses an approach of comparing sequences aligned in original or reversed order [6,62,63], and the GUIDANCE scores [7,64]. Here, we used GUIDANCE through the web server at http://guidance.tau.ac.il webcite, plus a manual approach to assess alignment confidence scores and delimit ambiguous regions.
After initial multiple alignment using ClustalW2 [48,65] and MAFFT [3032], we identified 5baselong conserved flanks of highly lengthvariable ambiguous regions using the likelihood of a given nsized base combination to have evolved by chance; a conserved 5baselong motif was determined if the maximum pairwise distance across all OTUs was 2.0 (with a cost of 0.5 for transitions and 1 for transversions). The GUIDANCE scores and the 'manual' method returned remarkably similar results: for the 100OTU Graphidaceae dataset, GUIDANCE delimited regions with low alignment confidence as positions 29126, 513542, and 698807 in the GUIDANCEMAFFT alignment, whereas our manual delimitation, which was used in subsequent analyses, resulted in positions 28127, 514545, and 699809. The three ambiguous regions recognized for that dataset showed length variation of 30 to 94 bases (region 1), 4 to 31 bases (region 2), and 6 to 100 bases (region 3), respectively. For the 705OTU Physciaceae dataset, two ambiguous regions were identified, with length variation of 973 (region 1) and 2245 (region 2), respectively. The 1814OTU Parmeliaceae alignment had three ambiguous regions, with length variation of 5272 (region 1), 2746 (region 2), and 1933 (region 3)
In addition to the three biological datasets, we generated 100 simulated datasets using DAWG 1.2 [52]. Each dataset consisted of thirty sequences that were evolved along a phylogeny reconstructed from the 738 nonambiguous sites of a 30OTU ingroup subset of the 100OTU Graphidaceae dataset (the same subset as used below in the comparison of ambiguous region coding methods). GTR+Gamma model parameters were estimated from the real sequence data and used to generate the simulated datasets. The sequences were divided into five partitions. Partition 1 and 2 contained no indels and were each 400 residues long; partition 2 had a substitution rate twice that of partition 1. Partitions 3, 4, and 5 contained indels at a rate of 1 insertion and 2 deletions per 20 substitutions. Indel lengths were randomly generated from a powerlaw distribution with a slope of 1.6 and a maximum length of 30 residues. The root sequences of partitions 3 and 4 had a length of 100 nucleotides, with partition 3 having a substitution rate twice that of partition 1 and partition 4 thrice that of partition 1; the root sequences of partition 5 had a root length of 200 nucleotides, maximum indel length of 50 residues, and substitution rate twice that of section 1. The resulting partitions 12 were unambiguously alignable, whereas partitions 35 simulated ambiguous regions of increased length variation and complexity.
Computing, ordinating, and coding distance and cost score matrices (PICSOrd)
Ambiguous regions (biological datasets) and partitions containing indels (simulated datasets) were subjected to pairwise alignment to derive distance and cost score matrices. The alignment algorithms implemented in ClustalW2 [48,65], through the web server at the European Bioinformatics Institute (EBI; http://www.ebi.ac.uk/Tools/clustalw2/index.html webcite), were used to derive pairwise sequence identities, as the percentage of nongapped sites matching in both sequences. To derive distances from the identity scores, the latter were divided by 100 and subtracted from 1. INAASE 3.0 [37] was employed to calculate pairwise cost scores under a given transition:transversion:gap ratio. INAASE encodes each distinct sequence pattern (state) using a singledigit number or letter code and computes a multidimensional stepmatrix with the distances between each state. This has the limitation that only up to 32 (64) states can be handled by phylogenetic analysis programs such as PAUP on a 32bit (64bit) computer. As an alternative, we employed ARC [33], which is a nonpairwise alignment method based on frequencies of bases and base patterns in each individual sequence.
In addition to simple sequence identity and cost matrices, Ngila 1.3 was applied to find the most likely alignment between two homologous sequences and its loglikelihood score [25,49,66]. Ngila finds optimal alignments between pairs of sequences using a substitution cost matrix and logaffine gap costs (C_{k }= a + bk + clog(k)). These costs can be specified directly by the user (the 'cost' model), producing distance scores corresponding to those of INAASE if the same cost scheme is used, or can be derived from an evolutionary model (the 'zeta' model). The zeta model uses a Kimura2parameter model [67] to calculate substitution costs and a zeta powerlaw model to compute logaffine gap costs, which have been shown to be superior to other models of indel evolution [25,49]. We tested both the default options and the freeendgaps option under Ngila to derive the pairwise distance matrices, and used the latter for the analyses presented here. The freeendgaps option allows gaps at the start and end of the alignment to have lower or no cost compared to other gaps. This is useful when one expects that the end points of a sequence pair are not necessarily homologous, assumed to be the case in the highly lengthvariable regions of the ribosomal DNA, especially the ITS1 and ITS2 portions (corresponding to ambiguous regions 1 and 2 in the Physciaceae and Parmeliaceae datasets), which are excised and degraded during the transcription process.
Distance and cost score matrices derived via ClustalW and Ngila were subjected to principal coordinates analysis (PCoA). PCoA is found as a standalone application in the freely available executables PCO.exe [[68]; http://www.stat.auckland.ac.nz/~mja/Programs.htm webcite] and DistPCoA.exe [[69]; http://www.esapubs.org/archive/mono/M069/001 webcite] and in packages such as R 2.9.2 [[70]; http://www.rproject.org webcite] and the commercial XLSTATPro 2009 http://www.xlstat.com webcite. Since PCoA is an eigenvector analysis, it offers correction for negative eigenvalues if a distance matrix is not metric. Distance matrices derived from sequence identities and cost scores fall under this category. However, the correction is not mandatory, as axes with negative eigenvalues occur at higher orders and hence can be omitted. Our initial tests including correction for negative eigenvalues showed that correction resulted in undesired distortion of the original data, since the number of axes with nonzero eigenvalues will be higher than the number of different sequence patterns in the ambiguous portion. We therefore employed PCoA without correction and retained axes with positive eigenvalues only.
Since fractional ordination scores cannot be used in phylogenetic analyses, we encoded the ordination scores obtained from axes with positive eigenvalues as integers. For each axis, the maximum and minimum score (S_{max}, S_{min}) and the range (S_{Range }= S_{max } S_{min}) were computed. The maximum range S_{Range(max) }across all axes was retained; usually it corresponded to the first axis, more rarely to axes of higher order (because axis variance is determined by both range and dispersion). For each individual OTU, its axis coordinates S_{OTU }were then rescaled using the following equation: S_{rescaled }= (S_{OTU } S_{min})/S_{Range(max)}, which transformed all original scores into values ranging between 0.00 and 1.00. Integer scores INT_{OTU }were subsequently computed by multiplying S_{rescaled }with 9.99, subtracting 0.495, and rounding to the closest integer value, resulting in 10state ordered integer scores ranging from 0 to 9. The rescaling by multiplication with 9.99 and subtraction of 0.495 ensures that each integer code represents a nearly equal range of 1.0 prior to rounding. We also explored other scoring schemes by comparing uncoded DNA with recoded data, including 4state ordered integer scores and 20state unordered integer scores, and found that 10state ordered integer scores performed best in terms of preserving phylogenetic signal contained in uncoded DNA.
For the 100OTU Graphidaceae dataset, we used a simple approach to assess the level of congruence and potential homoplasy between each of the ambiguous regions and the nonambiguous alignment portion. For all 100 OTUs, Clustal pairwise sequence identity scores were computed for each ambiguous region of the alignment and for the nonambiguous portion. The resulting distance matrices were plotted against each other and the degree of linear correlation was assessed by means of the Pearson productmoment correlation coefficient as implemented in STATISTICA 6.0.
Comparative analysis of coding methods using a nonambiguously aligned biological dataset
To compare the output of coding methods with original, noncoded DNA data, we used a subset of 31 OTUs of the Graphidaceae dataset and the nonambiguous portion of the alignment, trimmed to 720 positions. The number of 31 OTUs (30 ingroup plus one outgroup) was chosen to accommodate the limitations of INAASE, which can only handle up to 32 distinct sequences patterns per alignment portion. The alignment was divided into 12 portions of 60 positions each, and each portion was subjected to recoding using: (1) INAASE cost scores (step matrix) with a transition: transversion:gap ratio of 1:1:1; (2) ARC; (3) PICSOrd with Clustal pairwise identity scores (default ratio of 1:1:1); and (4) PICSOrd with Ngila pairwise log likelihood cost scores (zeta powerlaw model with default settings); the latter two ordinated with uncorrected PCoA retaining axes with positive eigenvalues only and rescaled as ordered 10state integer codes. The encoded datasets resulted in 12 characters (step matrices) for INAASE, 276 for ARC, 204 for PICSOrd with Clustal scores, and 141 for PICSOrd with Ngila scores, as compared to 364 parsimony informative sites in the original DNA matrix. The original DNA alignment and all encoded datasets were subjected to maximum parsimony analysis in PAUP* 4.0b10 [39], using a heuristic search with tree bisectionreconnection (TBR) branch swapping, the MulTrees option in effect, and 100 random addition sequence replicates. Encoded characters were treated as ordered under PICSOrd. Bootstrapping was performed with 100 bootstrap and 100 random addition sequence replicates. Tree drawing was carried out in MESQUITE 2.7 [71]. Trees were compared using the relative RobinsonFoulds distance [72,73] as well as clade support values and the presence/absence of specific clades.
Comparative analysis of PICSOrd coding versus ambiguous regions excluded or automatically aligned
Using the 100OTU Graphidaceae dataset, the 705OTU Physciaceae dataset, and the simulated datasets, we performed a comparative analysis of the multiple alignments as follows: (1) ambiguous regions excluded, and (2) ambiguous regions encoded using PICSOrd. For option (1), we used the nonambiguous portions of the two biological datasets and the nonambiguous partitions 12 of the simulated datasets. For option (2), ambiguous regions (biological datasets) or partitions (simulated data) were pairwise aligned and Ngila log likelihood cost scores were computed under the zeta powerlaw model (default settings). The cost score matrices were ordinated using PCoA without correcting for nonmetricity and all axes with positive eigenvalues were retained. Ordination scores were rescaled to 10state ordered integer codes.
The biological datasets were analyzed under maximum likelihood using the most recent version 7.2.6 of RAxML [[73]; http://wwwkramer.in.tum.de/exelixis/software.html webcite]. This version allows for the combined analysis of DNA and nonDNA characters, the latter under a general MK or GTR model or as ORDERED. The DNA datasets from options (1) and (2) were analyzed under a GTR+G model. The mixed datasets from option (3) were partitioned and the DNA partition was analyzed under a GTR+G model, whereas the encoded partition was subjected to the GTR model with equal rates and fixed (equal) character state frequencies. Rapid bootstrapping was performed under GTRCAT with 100 replicates.
Phylogenetic inferences on simulated datasets were conducted using the SSE3vectorized version of RAxML 7.2.6 [74]. On each dataset we conducted 20 ML searches on randomized stepwise addition parsimony trees under the GAMMA model of rate heterogeneity to obtain the bestknown ML tree; ML optimization is NPhard [75]. Datasets were analyzed as follows: (a) partitions 14 or 12+5 prealigned as simulated (unchanged), (b) partitions 12 prealigned as simulated and 35 excluded (corresponding to option 1 above), (c) partitions 12 prealigned as simulated and 34 or 5 encoded using PICSOrd with Ngila log likelihood cost scores under the zeta powerlaw model (default settings; corresponding to option 3 above). For the PICSOrdencoded partitions of the alignment, we compared GTR, MK, and ORDERED multistate models as implemented in RAxML 7.2.6. All characters were treated as unweighted. We also tested joint and perpartition branch length estimates for the alignment partitions and found that inferences using joint branch length estimates across all partitions yielded slightly more accurate trees. Conducting 20 ML tree searches with RAxML under the most complex model with respect to computational complexity (PICSOrd multistate partitions analyzed under a GTR model) took on average eight minutes on a single AMD Shanghai core running at 2.7 GHz.
To compute the topological distances of all resulting trees to the true tree, we used the respective RAxML option (f r) to obtain the relative RobinsonFoulds (RF) distance [72,73]. We used Phyutility [76] for some of the batch data manipulations, including batch file format conversions. Tree drawing was carried out in MESQUITE 2.7 [71].
PICSOrd Implementation
A reference implementation of PICSOrd is available from http://scit.us/projects/ngila/wiki/PICSOrd/ webcite. It requires the statistical software package, R (R Development Core Team, 2009), and the alignment software, Ngila [49]. It is flexible but defaults to using Ngila's zeta cost scores to construct distance matrices, and R's cmdscale function to ordinate these matrices. It supports nucleicacid and protein data and can produce encodings using up to 64 characters. It integrates well with RAxML [73].
Authors' contributions
RL provided the idea of the novel methodology presented here and performed initial computations under parsimony, as well as the individual Ngila and PCoA analyses and ordination scores recoding and ML analyses of the 706OTU and 1814OTU biological datasets. BPH and RL further developed the idea and made analytical comparisons with alternative ambiguous region coding methods. AS developed an updated version of RAxML (7.2.6 and higher) to allow for phylogenetic analysis of ambiguous region codes using a mixed model under maximum likelihood, and performed ML analyses, including computation of RF values, on the simulated datasets and part of the biological datasets. RAC developed an updated version of Ngila for pairwise alignment score matrices to be directly analyzed by PCoA, generated the simulated datasets, and wrote an R script for automated recoding using the PICSOrd approach. All authors contributed equally to writing the manuscript and read and approved the final manuscript.
Acknowledgements
This study was elaborated in the framework of phylogenetic studies of the lichen family Graphidaceae (= Thelotremataceae) and the basidiolichencontaining family Hygrophoraceae. Both studies received support from the National Science Foundation under the titles 'Phylogeny and Taxonomy of Ostropalean Fungi, with Emphasis on the Lichenforming Thelotremataceae' (NSFDEB 0516116 to The Field Museum; PI H.T. Lumbsch; CoPI R. Lücking) and 'Phylogenetic Diversity of Mycobionts and Photobionts in the Cyanolichen Genus Dictyonema, with Emphasis on the Neotropics and the Galapagos Islands' (NSFDEB 0841405 to George Mason University, Virginia; PI J. Lawrey, CoPIs R. Lücking and P. Gillevet). Additional material for these phylogenetic studies that was used in part for the present paper was obtained through the NSFfunded projects 'TICOLICHEN  The Costa Rican Biodiversity Inventory' (NSFDEB 0206125 to The Field Museum; PI R. Lücking) and 'Neotropical Epiphytic Microlichens  An Innovative Inventory of a Highly Diverse yet Little Known Group of Symbiotic Organisms' (NSFDEB 715660 to The Field Museum; PI R. Lücking). Brendan P. Hodkinson was funded by a Graduate Fellowship from the Mycological Society of America and a Doctoral Dissertation Improvement Grant from the National Science Foundation entitled 'A Phylogenetic Characterization of the Lichen Microbiome' (NSFDEB 1011504 to Duke University; PI F. Lutzoni, CoPI B. P. Hodkinson). Reed Cartwright was funded by NLM grant LM01000901 to D. Graur and G. Landan. Marti Anderson (Massey University, Auckland, New Zealand) and Bruce McCune (Oregon State University) are warmly thanked for discussions and advice regarding principal coordinates analysis, and François Lutzoni (Duke University) for discussing methods of ambiguous region coding. Eimy Rivas Plata (University of IllinoisChicago), Matthew Nelsen (University of Chicago), James Lawrey (George Mason University), and four anonymous reviewers reviewed an earlier version of this manuscript.
References

Jones DT: Progress in protein structure prediction.
Curr Opin Struct Biol 1997, 7:37787. PubMed Abstract  Publisher Full Text

Phillips A, Janies D, Wheeler W: Multiple sequence alignment in phylogenetic analysis.
Mol Phyl Evol 2000, 16:317330. Publisher Full Text

Wheeler WC: Homology and the optimization of DNA sequence data.
Cladistics 2001, 17:S3S11. PubMed Abstract  Publisher Full Text

Hall BG: Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences.
Mol Biol Evol 2005, 22:792802. PubMed Abstract  Publisher Full Text

Ogden TH, Rosenberg MS: Multiple sequence alignment accuracy and phylogenetic inference.
Syst Biol 2006, 55:314328. PubMed Abstract  Publisher Full Text

Landan G, Graur D: Heads or tails: a simple reliability check for multiple sequence alignments.
Mol Biol Evol 2007, 24:13801383. PubMed Abstract  Publisher Full Text

Penn O, Privman E, Landan G, Graur D, Pupko T: An alignment confidence score capturing robustness to guidetree uncertainty.
Mol Biol Evol 2010. PubMed Abstract  Publisher Full Text

Olsen GJ: Phylogenetic analysis using ribosomal RNA.
Methods Enzymol 1988, 164:793838. PubMed Abstract  Publisher Full Text

Olsen GJ, Woese CR: Ribosomal RNA: a key to phylogeny.
FASEB J 1993, 7:113123. PubMed Abstract  Publisher Full Text

Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments.
Syst Biol 2007, 56:564577. PubMed Abstract  Publisher Full Text

Lee MSY: Unalignable sequences and molecular evolution.
Trends Ecol Evol 2001, 16:681685. Publisher Full Text

Dwivedi B, Gadagkar SR: Phylogenetic inference under varying proportions of indelinduced alignment gaps.
BMC Evol Biol 2009, 9:211. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Redelings BD, Suchard MA: Robust inferences from ambiguous alignments. In Sequence Alignment: Methods, Models, Concepts, and Strategies. Edited by Rosenberg MS. University of California Press, Berkeley; 2009:209271.

Wheeler WC: Optimization alignment: the end of multiple sequence alignment in phylogenetics?
Cladistics 1996, 12:19. Publisher Full Text

Edgar RC, Sjölander K: SATCHMO: sequence alignment and tree construction using hidden Markov models.
Bioinformatics 2003, 19:14041411. PubMed Abstract  Publisher Full Text

Wheeler WC, Gladstein D, De Laet J: POY, version 3.0.11. American Museum of Natural History, New York; 2003.

Fleissner R, Metzler D, Haeseler A: Simultaneous statistical multiple alignment and phylogeny reconstruction.
Syst Biol 2005, 54:548561. PubMed Abstract  Publisher Full Text

Lunter G, Miklos I, Drummond A, Jensen J, Hein J: Bayesian coestimation of phylogeny and sequence alignment.
BMC Bioinformat 2005, 6:83. BioMed Central Full Text

Redelings BD, Suchard MA: Joint Bayesian estimation of alignment and phylogeny.
Syst Biol 2005, 54:401418. PubMed Abstract  Publisher Full Text

Loytynoja A, Goldman N: Phylogenyaware gap placement prevents errors in sequence alignment and evolutionary analysis.
Science 2008, 320:16321635. PubMed Abstract  Publisher Full Text

Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T: Rapid and accurate largescale coestimation of sequence alignments and phylogenetic trees.
Science 2009, 324:15611564. PubMed Abstract  Publisher Full Text

Ogden TH, Rosenberg MS: Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW + PAUP*.
Syst Biol 2007, 56:182193. PubMed Abstract  Publisher Full Text

Lehtonen S: Phylogeny estimation and alignment via POY versus Clustal + PAUP*: a response to Ogden and Rosenberg (2007).
Syst Biol 2008, 57:653657. PubMed Abstract  Publisher Full Text

Kjer K, Gillespie JJ, Ober KA: Opinions on multiple sequence alignment, and an empirical comparison of repeatability and accuracy between POY and structural alignment.
Syst Biol 2007, 56:133156. PubMed Abstract  Publisher Full Text

Cartwright RA: Problems and solutions for estimating indel rates and length distributions.
Mol Biol Evol 2009, 26:473480. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wheeler WC, Giribet G: Phylogenetic hypotheses and the utility of multiple sequence alignment. In Sequence Alignment: Methods, Models, Concepts, and Strategies. Edited by Rosenberg MS. University of California Press: Berkeley, CA; 2009.

González D, Cubeta MA, Vilgalys R: Phylogenetic utility of indels within ribosomal DNA and betatubulin sequences from fungi in the Rhizoctonia solani species complex.

Price EW, Carbone I: SNAP: workbench management tool for evolutionary population genetic analysis.
Bioinformatics 2005, 21:402404. PubMed Abstract  Publisher Full Text

Aylor DL, Price E, Carbone I: SNAP: Combine and Map modules for multilocus population genetic analysis.
Bioinformatics 2006, 22:13991401. PubMed Abstract  Publisher Full Text

Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucl Acids Res 2002, 30:30593066. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program.
Brief Bioinf 2008, 9:286298. Publisher Full Text

Katoh K, Asimenos G, Toh H: Multiple alignment of DNA sequences with MAFFT.
Meth Mol Biol 2009, 537:3964. Publisher Full Text

Miadlikowska J, Lutzoni F, Goward T, Zoller S, Posada D: New approach to an old problem: gaprich regions from ITS and rDNA largesubunit are incorporated into phylogenetic analyses to resolve the Peltigera canina species complex.
Mycologia 2003, 95:11811203. PubMed Abstract  Publisher Full Text

Karlin S, Ladunga I: Comparisons of eukaryotic genomic sequences.
Proc Natl Acad Sci USA 1994, 91:1283212836. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Didier G, Laprevotte I, Pupin M, Hénaut A: Local decoding of sequences and alignmentfree comparison.
J Comput Biol 2006, 13:14651476. PubMed Abstract  Publisher Full Text

Didier G, Grossmann MM, Devauchelle A, Laprevotte I: Comparing sequences without using alignments: application to HIV/SIV subtyping.
BMC Bioinformatics 2007, 8:113. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lutzoni F, Wagner P, Reeb V, Zoller S: Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyzes without violating positional homology.
Syst Biol 2000, 49:628651. PubMed Abstract  Publisher Full Text

Lawrey JD, Lücking R, Sipman HJM, Chaves JL, Redhead SA, Bungartz F, Sikaroodi M, Gillevet PM: High concentration of basidiolichens in a single family of agaricoid mushrooms (Basidiomycota: Agaricales: Hygrophoraceae).
Mycol Res 2009, 113:11541171. PubMed Abstract  Publisher Full Text

Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4 beta 10. Sinauer Associates, Sunderland, Massachusetts; 2003.

Legendre P, Legendre L: Numerical Ecology. 2nd edition. Elsevier Science BV, Amsterdam; 1998.

Zuur AF, Ieno EN, Smith GM: Principal coordinate analysis and nonmetric multidimensional scaling. In Analysing Ecological Data. Edited by Zuur AF, Ieno EN, Smith GM. Springer, New York; 2007:259264.

McCune B, Grace JB, Urban DL: Analysis of Ecological Communities. MjM Software, Gleneden Beach, Oregon; 2002.

McCune B, Mefford MJ: PCORD. Multivariate Analysis of Ecological Data.

Minchin PR: An evaluation of the relative robustness of techniques for ecological ordination.
Vegetatio 1987, 69:89107. Publisher Full Text

Hillis DM, Heath TA, John KS: Analysis and visualization of tree space.
Syst Biol 2005, 54:471482. PubMed Abstract  Publisher Full Text

Staiger B, Kalb K, Grube M: Phylogeny and phenotypic variation in the lichen family Graphidaceae (Ostropomycetidae, Ascomycota).
Mycol Res 2006, 110:765772. PubMed Abstract  Publisher Full Text

Mangold A, Martın MP, Lücking R, Lumbsch HT: Molecular phylogeny suggests synonymy of Thelotremataceae within Graphidaceae (Ascomycota: Ostropales).

Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal × version 2.0.
Bioinformatics 2007, 23:29472948. PubMed Abstract  Publisher Full Text

Cartwright RA: Ngila: global pairwise alignments with logarithmic and affine gap costs.
Bioinformatics 2007, 23:14271429. PubMed Abstract  Publisher Full Text

Cunningham CW: Is congruence between data partitions a reliable predictor of phylogenetic accuracy? Empirically testing an iterative procedure for choosing among phylogenetic methods.
Syst Biol 1997, 46:464478. PubMed Abstract  Publisher Full Text

Hipp AL, Hall JC, Sytsma KJ: Congruence versus phylogenetic accuracy: revisiting the incongruence length difference (ILD) test.
Syst Biol 2004, 53:8189. PubMed Abstract  Publisher Full Text

Cartwright RA: DNA assembly with gaps (DAWG): Simulating sequence evolution.
Bioinformatics 2005, 21(Suppl 3):iii31iii38. PubMed Abstract  Publisher Full Text

Fletcher W, Yang Z: INDELible: A flexible simulator of biological sequence evolution.
Mol Biol Evol 2009, 26:18791888. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hebert PD, Stoeckle MY, Zemlak TS, Francis CM: Identification of birds through DNA barcodes.
PLoS Biol 2004, 2(10):e312. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Moritz C, Cicero C: DNA barcoding: promise and pitfalls.
PLoS Biol 2004, 2:15291531. Publisher Full Text

Meier R: DNA sequences in taxonomy: opportunities and challenges. In The New Taxonomy. Edited by Wheeler Q. CRC Press, Boca Raton; 2008.

Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions.
Proc Natl Acad Sci 2005, 102:1055710562. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ronquist F, Huelsenbeck JP: MRBAYES 3: Bayesian phylogenetic inference under mixed models.
Bioinformatics 2003, 19:15721574. PubMed Abstract  Publisher Full Text

Stamatakis A, Hoover P, Rougemont J: A fast bootstrapping algorithm for the RAxML webservers.
Syst Biol 2008, 57:758771. PubMed Abstract  Publisher Full Text

Gatesy J, DeSalle R, Wheeler W: Alignmentambiguous nucleotide sites and the exclusion of systematic data.
Mol Phyl Evol 1993, 2:152157. Publisher Full Text

Wheeler WC, Gatesy J, DeSalle R: Elision: a method for accommodating multiple molecular sequence alignments with alignmentambiguous sites.
Mol Phyl Evol 1995, 4:19. Publisher Full Text

Landan G, Graur D: Local reliability measures from sets of cooptimal multiple sequence alignments.

Hall BG: How well does the HoT score reflect sequence alignment accuracy?
Mol Biol Evol 2008, 25:15761580. PubMed Abstract  Publisher Full Text

Penn O, Privman E, Ashkenazy H, Landan G, Graur D, Pupko T: GUIDANCE: a web server for assessing alignment confidence scores.
Nucl Acids Res 2010, 38:W23W28. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thompson JD, Higgins DG, Gibson TJ: Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice.
Nucl Acids Res 1994, 22:46734680. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cartwright RA: Logarithmic gap costs decrease alignment accuracy.
BMC Bioinformatics 2006, 7:527. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kimura M: A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences.
Mol Evol 1980, 16:111120. Publisher Full Text

Anderson MJ: PCO: a FORTRAN computer program for principal coordinate analysis. Department of Statistics, University of Auckland, New Zealand; 2003.

Legendre P, Anderson MJ: Distancebased redundancy analysis: testing multispecies responses in multifactorial ecological experiments.
Ecol Monogr 1999, 69:124. Publisher Full Text

R Development Core Team: R: A language and environment for statistical computing. [http://www.Rproject.org] webcite
R Foundation for Statistical Computing, Vienna, Austria; 2009.

Maddison WP, Maddison DR: Mesquite: a modular system for evolutionary analysis. Version 2.7. [http://mesquiteproject.org] webcite
2009.

Robinson DF, Foulds LR: Comparison of phylogenetic trees.
Math Biosci 1981, 53:131147. Publisher Full Text

Stamatakis A: RAxMLVIHPC: Maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 2006, 22:26882690. PubMed Abstract  Publisher Full Text

Berger SA, Stamatakis A: Accuracy and performance of single versus double precision arithmetics for maximum likelihood phylogeny reconstruction.
Proceedings of PBC09, Parallel Biocomputing Workshop, Wroclaw, Poland 2009.

Roch S: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard.
IEEE Trans Comp Biol Bioinform 2006, 3:9294. Publisher Full Text

Smith SA, Dunn C: Phyutility: a phyloinformatics utility for trees, alignments, and molecular data.
Bioinformatics 2008, 24:715716. PubMed Abstract  Publisher Full Text