Department of Plant and Microbial Biology, 461 Koshland Hall, University of California, Berkeley, CA 94720-3102, USA

Abstract

Background

In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and / or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles.

Results

We compare the speed and accuracy of MUSCLE with CLUSTALW, Progressive POA and the MAFFT script FFTNS1, the fastest previously published program known to the author. Accuracy is measured using four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three variants that offer highest accuracy (MUSCLE with default settings), highest speed (MUSCLE-fast), and a carefully chosen compromise between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest algorithm on all test sets, achieving average alignment accuracy similar to CLUSTALW in times that are typically two to three orders of magnitude less. MUSCLE-fast is able to align 1,000 sequences of average length 282 in 21 seconds on a current desktop computer.

Conclusions

MUSCLE offers a range of options that provide improved speed and / or alignment accuracy compared with currently available programs. MUSCLE is freely available at

Background

Multiple alignments of protein sequences are important in many applications, including phylogenetic tree estimation, secondary structure prediction and critical residue identification. Many multiple sequence alignment (MSA) algorithms have been proposed; for a recent review, see

Current methods

While multiple alignment and phylogenetic tree reconstruction have traditionally been considered separately, the most natural formulation of the computational problem is to define a model of sequence evolution that assigns probabilities to all possible elementary sequence edits and then to seek an optimal directed graph in which edges represent edits and terminal nodes are the observed sequences. This graph makes the history explicit (it can be interpreted as a phylogenetic tree) and implies an alignment. No tractable method for finding an optimal graph is known for biologically realistic models, and simplification is therefore required. A common heuristic is to seek a multiple alignment that maximizes the SP score (the summed alignment score of each sequence pair), which is NP complete ^{N}) in the sequence length

Progressive alignment

**Progressive alignment.** Sequences are assigned to the leaves of a binary tree. At each internal (i.e., non-leaf) node, the two child profiles are aligned using profile-profile alignment (see Figure 2). Indels introduced at each node are indicated by shaded background.

Profile-profile alignment

**Profile-profile alignment.** Two profiles (multiple sequence alignments) X and Y are aligned to each other such that columns from X and Y are preserved in the result. Columns of indels (gray background) are inserted as needed in order to align the columns to each other. The score for aligning a pair of columns is determined by the profile function, which should assign a high score to pairs of columns containing similar amino acids.

Implementation

The basic strategy used by MUSCLE is similar to that used by PRRP

Algorithm overview

MUSCLE has three stages. At the completion of each stage, a multiple alignment is available and the algorithm can be terminated.

Stage 1: draft progressive

The first stage builds a progressive alignment.

1.1 Similarity measure

The similarity of each pair of sequences is computed, either using

1.2 Distance estimate

A triangular distance matrix is computed from the pair-wise similarities.

1.3 Tree construction

A tree is constructed from the distance matrix using UPGMA or neighbor-joining, and a root is identified.

1.4 Progressive alignment

A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment of all input sequences at the root.
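The branching-order traversal can be illustrated with a minimal sketch. It is not MUSCLE's implementation: the tree is assumed to be given as nested Python tuples with sequence labels at the leaves, and `alignment_order` is a hypothetical helper that only lists which two groups of sequences would be aligned at each internal node, children before their parent.

```python
from typing import List, Tuple, Union

# Assumed representation: a leaf is a sequence label, an internal node is (left, right).
Tree = Union[str, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> List[str]:
    """Return the sequence labels under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def alignment_order(tree: Tree) -> List[Tuple[List[str], List[str]]]:
    """List the (left group, right group) pairs in the order they are aligned.

    Children are visited before their parent, so the last entry is the
    profile-profile alignment performed at the root.
    """
    if isinstance(tree, str):
        return []
    left, right = tree
    steps = alignment_order(left) + alignment_order(right)
    steps.append((leaves(left), leaves(right)))
    return steps


if __name__ == "__main__":
    for left_group, right_group in alignment_order((("A", "B"), ("C", "D"))):
        print(left_group, "<->", right_group)
```

In a real progressive aligner each step would call a profile-profile alignment routine on the two child alignments; here only the visiting order is shown.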

Stage 2: improved progressive

The second stage attempts to improve the tree and builds a new progressive alignment according to this tree. This stage may be iterated.

2.1 Similarity measure

The similarity of each pair of sequences is computed using fractional identity computed from their mutual alignment in the current multiple alignment.

2.2 Tree construction

A tree is constructed by computing a Kimura distance matrix and applying a clustering method to this matrix.

2.3 Tree comparison

The previous and new trees are compared, identifying the set of internal nodes for which the branching order has changed. If Stage 2 has executed more than once, and the number of changed nodes has not decreased, the process of improving the tree is considered to have converged and iteration terminates.

2.4 Progressive alignment

A new progressive alignment is built. The existing alignment of each subtree for which the branching order is unchanged is retained; new alignments are created for the (possibly empty) set of changed nodes. When the alignment at the root is completed, the algorithm may terminate, return to step 2.1 or go to Stage 3.

Stage 3: refinement

The third stage performs iterative refinement using a variant of tree-dependent restricted partitioning

3.1 Choice of bipartition

An edge is deleted from the tree, dividing the sequences into two disjoint subsets (a bipartition). Edges are visited in order of decreasing distance from the root.

3.2 Profile extraction

The profile (multiple alignment) of each subset is extracted from the current multiple alignment. Columns containing no residues (i.e., indels only) are discarded.
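As a small illustration of profile extraction, the sketch below pulls a subset of rows out of a multiple alignment and discards columns that contain only indels in that subset. The function name `extract_profile`, the row-index interface and the use of `-` as the indel symbol are assumptions made for the example, not part of MUSCLE.

```python
from typing import List, Sequence


def extract_profile(msa: Sequence[str], rows: Sequence[int], gap: str = "-") -> List[str]:
    """Return the sub-alignment for `rows`, dropping indel-only columns.

    `msa` is a list of equal-length aligned strings and `rows` is a
    non-empty list of row indexes (one side of the bipartition).
    """
    subset = [msa[r] for r in rows]
    width = len(subset[0])
    # Keep a column only if at least one selected sequence has a residue there.
    keep = [c for c in range(width) if any(row[c] != gap for row in subset)]
    return ["".join(row[c] for c in keep) for row in subset]


if __name__ == "__main__":
    alignment = ["MQ-TIF", "M--TIF", "MQA-IF"]
    print(extract_profile(alignment, [0, 1]))  # third column is indel-only here
    print(extract_profile(alignment, [2]))
```

The indexes of the discarded columns are a by-product of the same loop, which is useful if the original alignment path needs to be recovered later.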

3.3 Re-alignment

The two profiles obtained in step 3.2 are re-aligned to each other using profile-profile alignment.

3.4 Accept/reject

The SP score of the multiple alignment implied by the new profile-profile alignment is computed. If the score increases, the new alignment is retained, otherwise it is discarded. If all edges have been visited without a change being retained, or if a user-defined maximum number of iterations has been reached, the algorithm is terminated, otherwise it returns to step 3.1. Visiting edges in order of decreasing distance from the root has the effect of first re-aligning individual sequences, then closely related groups

Algorithm elements

In the following, we describe the elements of the MUSCLE algorithm. In several cases, alternative versions of these elements were implemented in order to investigate their relative performance and to offer different trade-offs between accuracy, speed and memory use. Most of these alternatives are made available to the user via command-line options. Four benchmark datasets have been used to evaluate options and parameters in MUSCLE: BAliBASE

Objective score

In its refinement stage, MUSCLE seeks to maximize an objective score, i.e. a function that maps a multiple sequence alignment to a real number which is designed to give larger values to better alignments. MUSCLE uses the

Gap penalties in the SP score

**Gap penalties in the SP score.** This figure shows a multiple alignment of three sequences

Progressive alignment

Progressive alignment requires a rooted binary tree in which each sequence is assigned to a leaf. The tree is created by clustering a triangular matrix containing a distance measure for each pair of sequences. The branching order of the tree is followed in postfix order (i.e., children are visited before their parent). At each internal node, profile-profile alignment is used to align the existing alignments of the two child subtrees, and the new alignment is assigned to that node. A multiple alignment of all input sequences is produced at the root node (Figure

Similarity measures

We use the term

$$F = \sum_{\tau} \min [n_X(\tau), n_Y(\tau)] \,/\, [\min (L_X, L_Y) - k + 1]. \qquad (1)$$

Here _{X}, _{Y }are the sequence lengths, and _{X}(_{Y}(^{Binary}, so-called because it reduces the

$$F^{\mathrm{Binary}} = \sum_{\tau} \delta_{XY}(\tau) \,/\, [\min (L_X, L_Y) - k + 1]. \qquad (2)$$

Here, _{XY}(
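A k-mer similarity of this kind can be sketched as follows. The sketch assumes that the similarity is the number of k-mers shared between the two sequences divided by the number of k-mers in the shorter sequence, and that the binary variant counts each distinct k-mer at most once. The function names and the default `k = 3` are illustrative choices, and no compressed alphabet is applied.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def _normalizer(x: str, y: str, k: int) -> int:
    """Number of k-mers in the shorter sequence (assumed denominator)."""
    n = min(len(x), len(y)) - k + 1
    if n <= 0:
        raise ValueError("both sequences must contain at least one k-mer")
    return n


def kmer_similarity(x: str, y: str, k: int = 3) -> float:
    """Shared k-mer count, counting repeated k-mers up to their minimum multiplicity."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(n, cy[kmer]) for kmer, n in cx.items())
    return shared / _normalizer(x, y, k)


def kmer_similarity_binary(x: str, y: str, k: int = 3) -> float:
    """Binary variant: a k-mer contributes 1 if present in both sequences."""
    shared = len(set(kmer_counts(x, k)) & set(kmer_counts(y, k)))
    return shared / _normalizer(x, y, k)


if __name__ == "__main__":
    print(kmer_similarity("MQTIFMQT", "MQTIF"))
    print(kmer_similarity_binary("AAAA", "AAAA", k=2))
```

Because no alignment is computed, the cost per pair is linear in the sequence lengths, which is the motivation for using such a measure in the first stage.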

Distance measures

Given a similarity value, we wish to estimate an additive distance measure. An additive distance measure d(A, B) between two sequences A and B satisfies d(A, B) = d(A, C) + d(C, B) for any third sequence C, assuming that A, B and C are all related. Ideal but generally unknowable is the

$$d_{\mathrm{Kimura}} = -\log_e (1 - D - D^2/5) \qquad (3)$$

For

$$d_{\mathrm{kmer}} = 1 - F. \qquad (4)$$
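The Kimura correction can be written as a short function. This is a sketch under the assumption that the quantity entering the formula is the fraction of differing aligned positions, i.e. one minus the fractional identity, as in Kimura's original correction. The name `kimura_distance` is illustrative, and the sketch simply raises an error where the logarithm is undefined rather than applying any special treatment to highly diverged pairs.

```python
import math


def kimura_distance(fractional_identity: float) -> float:
    """Kimura-corrected distance from a fractional identity in [0, 1].

    Assumption: the formula is applied to the observed difference
    d_obs = 1 - identity. The logarithm is undefined once
    1 - d_obs - d_obs**2 / 5 <= 0 (d_obs above roughly 0.85).
    """
    d_obs = 1.0 - fractional_identity
    arg = 1.0 - d_obs - d_obs * d_obs / 5.0
    if arg <= 0.0:
        raise ValueError("sequences too divergent for the Kimura formula")
    return -math.log(arg)


if __name__ == "__main__":
    for identity in (1.0, 0.9, 0.7, 0.5):
        print(identity, round(kimura_distance(identity), 4))
```

The corrected distance grows faster than the observed difference, reflecting multiple substitutions at the same site.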

Tree construction

Given a distance matrix, a binary tree is constructed by clustering. Two methods are implemented: neighbor-joining

$$d^{\mathrm{Avg}}_{PC} = (d_{LC} + d_{RC})/2. \qquad (5)$$

We can take the minimum rather than the average:

$$d^{\mathrm{Min}}_{PC} = \min [d_{LC}, d_{RC}]. \qquad (6)$$

Following MAFFT, we also implemented a weighted mixture of minimum and average linkage:

$$d^{\mathrm{Mix}}_{PC} = (1 - s)\, d^{\mathrm{Min}}_{PC} + s\, d^{\mathrm{Avg}}_{PC}, \qquad (7)$$

where
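The three linkage rules can be compared with a small clustering sketch. It is a naive illustration rather than MUSCLE's code: distances are held in a dictionary keyed by unordered pairs, the tree is returned as nested tuples, sequence labels are assumed to be distinct, and the mixing weight `s = 0.1` is a placeholder value chosen for the example.

```python
from itertools import combinations


def avg_linkage(d_lc: float, d_rc: float) -> float:
    """Distance from new parent P to cluster C: average of the two child distances."""
    return (d_lc + d_rc) / 2.0


def min_linkage(d_lc: float, d_rc: float) -> float:
    """Distance from new parent P to cluster C: minimum of the two child distances."""
    return min(d_lc, d_rc)


def mix_linkage(d_lc: float, d_rc: float, s: float = 0.1) -> float:
    """Weighted mixture of minimum and average linkage (s is an assumed weight)."""
    return (1.0 - s) * min_linkage(d_lc, d_rc) + s * avg_linkage(d_lc, d_rc)


def cluster(labels, dist, linkage=avg_linkage):
    """Naive agglomerative clustering; returns a rooted binary tree as nested tuples.

    `dist` maps frozenset({a, b}) to the distance between items a and b.
    """
    nodes = list(labels)
    d = dict(dist)
    while len(nodes) > 1:
        # Join the closest pair of current clusters.
        left, right = min(combinations(nodes, 2), key=lambda pair: d[frozenset(pair)])
        parent = (left, right)
        rest = [n for n in nodes if n not in (left, right)]
        for other in rest:
            d[frozenset((parent, other))] = linkage(
                d[frozenset((left, other))], d[frozenset((right, other))]
            )
        nodes = rest + [parent]
    return nodes[0]


if __name__ == "__main__":
    dist = {
        frozenset(("A", "B")): 1.0,
        frozenset(("A", "C")): 4.0,
        frozenset(("B", "C")): 6.0,
    }
    print(cluster(["A", "B", "C"], dist))
```

Searching all pairs for the minimum in every iteration makes this sketch cubic in the number of sequences; it is meant to show the linkage update, not an efficient implementation.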

Sequence weighting

Conventional wisdom holds that sequences should be weighted to correct for the effects of biased sampling from a family of related proteins; however, there is no consensus on how such weights should be computed. MUSCLE implements the following sequence weighting schemes: none (all sequences have equal weight), Henikoff

Profile functions

In order to apply pair-wise alignment methods to profiles, a scoring function must be defined for a pair of profile positions, i.e. a pair of multiple alignment columns. This function is the profile analog of a substitution matrix; see for example _{i }the background probability of _{ij }the joint probability of _{ij }the substitution matrix score, ^{x}_{i }the observed frequency of ^{x}_{G }the observed frequency of gaps in that column, and ^{x}_{i }the estimated probability of observing

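$$\mathrm{PSP}^{xy} = \sum_i \sum_j f^x_i f^y_j S_{ij}. \qquad (8)$$

Here $f^x_i$ is the observed frequency of amino acid $i$ in column $x$ and $S_{ij}$ is the substitution matrix score. Note that $S_{ij} = \log (p_{ij} / p_i p_j)$, where $p_i$ is the background probability of $i$ and $p_{ij}$ the joint probability of $i$ and $j$, so

$$\mathrm{PSP}^{xy} = \sum_i \sum_j f^x_i f^y_j \log (p_{ij} / p_i p_j). \qquad (9)$$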

PSP is the function used by CLUSTALW and MAFFT. It is a natural choice when attempting to maximize the SP objective score: if gap penalties are neglected, maximizing PSP maximizes SP under the constraint that columns in each profile are preserved. (This follows from the observation that the contribution to SP from a pair of sequences in the same profile is the same for all alignments allowed under the constraint). MUSCLE implements PSP functions based on the 200 PAM matrix of

$$\mathrm{LA}^{xy} = \log \sum_i \sum_j \alpha^x_i \alpha^y_j \, p_{ij} / p_i p_j. \qquad (10)$$

LE is defined as follows:

$$\mathrm{LE}^{xy} = (1 - f^x_G)(1 - f^y_G) \log \sum_i \sum_j f^x_i f^y_j \, p_{ij} / p_i p_j. \qquad (11)$$

The MUSCLE LE function uses probabilities computed from VTML 240. Note that estimated probabilities _{G}) is the _{i }must be normalized to sum to one if indels are present (otherwise the logarithm becomes increasingly negative with increasing numbers of gaps even when aligning conserved or similar residues). The occupancy factors are introduced to encourage more highly occupied columns (i.e., those with fewer gaps) to align, and are found to significantly improve accuracy. We avoid these complications in the PSP score by computing frequencies in a 21-letter alphabet (amino acids + indel), and by defining the substitution score of an amino acid to an indel to be zero. This has the desired effect of down-weighting column pairs with low occupancies, and can also be motivated by consideration of the SP function. If gap penalties are ignored, then this definition of PSP preserves the optimization of SP under the fixed-column constraint by correctly accounting for the reduced number of residue pairs in columns containing gaps.
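The PSP score of a column pair, with frequencies taken over the 21-letter alphabet and amino acid to indel substitutions scored as zero, can be sketched as below. The dictionary-based column representation, the function name `psp` and the tiny substitution table are assumptions made for illustration; the scores are toy values, not the matrices used by MUSCLE.

```python
from typing import Dict, Tuple

Column = Dict[str, float]  # symbol -> frequency; '-' is the indel symbol


def psp(col_x: Column, col_y: Column,
        subst: Dict[Tuple[str, str], float], gap: str = "-") -> float:
    """Sum of f_x[i] * f_y[j] * S[i, j] over amino acid pairs.

    Pairs involving an indel score zero, so they are skipped. Because the
    frequencies include the indel symbol, columns with many gaps are
    down-weighted automatically.
    """
    total = 0.0
    for a, fa in col_x.items():
        if a == gap:
            continue
        for b, fb in col_y.items():
            if b == gap:
                continue
            total += fa * fb * subst[(a, b)]
    return total


if __name__ == "__main__":
    # Toy symmetric scores for two amino acids (illustrative values only).
    toy = {("A", "A"): 4.0, ("S", "S"): 4.0, ("A", "S"): 1.0, ("S", "A"): 1.0}
    print(psp({"A": 1.0}, {"A": 0.5, "S": 0.5}, toy))
    print(psp({"A": 0.5, "-": 0.5}, {"A": 1.0}, toy))
```

The second call shows the occupancy effect: the half-empty column scores half as much as a fully occupied column of the same residue.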

Gap penalties

We call the first indel in a gap its _{o }in Y and the gap-close to _{c}. The penalty for this gap is _{o}) + _{c}) + _{X }and _{Y}. If a constant _{X }+ _{Y})/2 to the score of any possible alignment, and the set of optimal alignments is therefore unchanged. Given a scoring scheme with substitution matrix _{ij }and extension penalty _{ij }= _{ij }+ 2^{y}_{o }be the number of gap-opens in column ^{y}_{c }be the number of gap-closes in column

Position-specific gap penalties

**Position-specific gap penalties.** An alignment of two profiles X and Y. Gaps in sequences

^{y}_{o}) (1 + _{w}(

^{y}_{c}) (1 + _{w}(

Here, _{w}(^{y}_{o}) is motivated by considering the SP score of the alignment. The gap penalty contribution to SP for a pair of sequences (^{y}_{o}) therefore corrects the gap-open contribution to the SP score due to pre-existing gaps in Y. (It should be noted that even with this correction, there are other issues related to gaps and PSP still does not exactly optimize SP under the fixed-column constraint). The increased penalty in hydrophobic windows is designed to discourage gaps in buried core regions where insertions and deletions are less frequent. Note that MUSCLE treats open and close positions symmetrically, in contrast to CLUSTALW, which treats the open position specially and may therefore tend to produce, in word processing terms, left-aligned gaps with a ragged right margin.

Terminal gaps

A

Tree comparison

In progressive alignment, two subtrees will produce identical alignments if they have the same set of sequences at their leaves and the same branching orders (topologies). We exploit this observation to optimize the progressive alignment in Stage 2 of MUSCLE, which begins by constructing a new tree. Unchanged subtrees are identified, and their alignments are retained (Figure ^{A}, take the ids of its two child nodes ^{A }and ^{A }and use them as indexes into a lookup table pointing to nodes in ^{A }is equivalent to a node ^{B }in ^{A }is equivalent to a node ^{B}, and (b) ^{B }and ^{B }have the same parent ^{B}, then assign ^{B }the same id as ^{A}, to which it is equivalent. When the traversal is complete, a node

Tree comparison

**Tree comparison.** Two trees are compared in order to identify those nodes that have the same branching orders within subtree rotation (white). If a progressive alignment has been created using the old tree, then alignments at these nodes can be retained, as the same result would be produced at those nodes by the new tree. New alignments are needed at the changed (black) nodes only.
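One simple way to find unchanged subtrees is sketched below. It swaps in a different technique from the id-based traversal described in the text: each subtree is given a canonical signature (an unordered pair of its children's signatures), so two subtrees match exactly when they have the same leaves and the same branching order up to rotation. The nested-tuple tree representation and all function names are assumptions made for the example, and leaf labels are assumed to be distinct.

```python
def _walk(tree, acc):
    """Post-order walk; appends (node, signature) for every internal node."""
    if isinstance(tree, str):
        return tree  # a leaf is its own signature
    left, right = tree
    # frozenset ignores child order, so rotated subtrees get equal signatures.
    sig = frozenset((_walk(left, acc), _walk(right, acc)))
    acc.append((tree, sig))
    return sig


def split_nodes(old_tree, new_tree):
    """Return (unchanged, changed) internal nodes of `new_tree`.

    A node is unchanged if a subtree with the same leaves and the same
    branching order (up to rotation) exists in `old_tree`.
    """
    old_acc, new_acc = [], []
    _walk(old_tree, old_acc)
    _walk(new_tree, new_acc)
    old_sigs = {sig for _, sig in old_acc}
    unchanged = [node for node, sig in new_acc if sig in old_sigs]
    changed = [node for node, sig in new_acc if sig not in old_sigs]
    return unchanged, changed


if __name__ == "__main__":
    old = ((("A", "B"), "C"), ("D", "E"))
    new = ((("B", "A"), "D"), ("C", "E"))
    unchanged, changed = split_nodes(old, new)
    print("keep alignments at:", unchanged)
    print("re-align at:", changed)
```

Hashing nested signatures is convenient but not the most economical approach for very large trees; the point is only that alignments at the `unchanged` nodes can be reused.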

Defaults, optimizations and complexity analysis

We now discuss the default choices of algorithm elements in the MUSCLE program and analyze their complexity.

Complexity of CLUSTALW

It is instructive to consider the complexity of CLUSTALW. This is of intrinsic interest as CLUSTALW is currently the most widely used MSA program and, to the best of our knowledge, its complexity has not previously been stated correctly in the literature. It is also useful as a baseline for motivating some of the optimizations used in MUSCLE. The CLUSTALW algorithm can be described by the same steps as Stage 1 above. The similarity measure is the fractional identity computed from a global alignment, clustering is done by neighbor-joining. Global alignment of a pair of sequences or profiles is computed using the Myers-Miller linear space algorithm ^{2}) time in the typical sequence length ^{2}) pairs, it is therefore O(^{2}^{2}) time and O(^{2 }+ ^{2}) space and O(^{4}) time, at least up to CLUSTALW 1.82, although O(^{3}) time is possible; see e.g. _{P}_{P}) time and space in the number of sequences in the profile _{P }and the profile length _{P}, then uses Myers-Miller to align the profiles in O(_{P}) space and O(_{P}^{2}) time. There are _{P }is O(_{P }is O(^{2}) in both space and time. This analysis is summarized in Table

Complexity of CLUSTALW. Here we show the big-O asymptotic complexity of the elements of CLUSTALW as a function of

| **Step** | **O(Space)** | **O(Time)** |
| --- | --- | --- |
| Distance matrix | ^{2 }+ | ^{2}^{2} |
| Neighbor joining | ^{2} | ^{4} |
| Progressive (one iteration) | _{P }+ _{P }= ^{2} | _{P }+ _{P}^{2 }= ^{2 }+ ^{2} |
| Progressive (total) | ^{2} | ^{3 }+ ^{2} |
| TOTAL | ^{2 }+ ^{2} | ^{4 }+ ^{2} |

Initial distance measure

One might expect (a) that a more accurate distance measure would lead to a more accurate final alignment due to an improved tree, and (b) that errors due to a less accurate distance measure might be eliminated by allowing Stage 2 to iterate more times. Neither of these expectations is supported by our test results (unpublished). Allowing Stage 2 to iterate more than once with the goal of further improving the tree gave no significant improvement with any distance measure. Possibly, the tree is biased towards the MSA that was used to estimate it, and the MSA is biased by the tree used to create it, making it hard to achieve improvements. The most accurate measure on a pair of sequences is presumably the fractional identity ^{Binary }gave slightly reduced accuracy scores even when Stage 2 was allowed to iterate. The default choice in MUSCLE is therefore to use the Dayhoff alphabet in step 1.1 and to execute Stage 2 once only. While the impact on the average accuracy of the final alignment due to the different options is not understood, we observe that a better alignment of a pair of sequences is often obtained from a multiple alignment than from a pair-wise alignment, due to the presence of intermediate sequences having higher identities. It is therefore plausible that ^{2}^{2}^{2}) in CLUSTALW. For a typical

Clustering

MUSCLE implements both UPGMA and neighbor-joining. We found UPGMA to give slightly better benchmark scores than neighbor-joining; UPGMA is therefore the default option. We expect neighbor-joining to give a better estimate of the correct evolutionary tree (see e.g. ^{4}) time, although this can be reduced to O(^{3}). UPGMA is naively O(^{3}) time as the minimum of an ^{2 }matrix must be found in each of ^{2}) time by maintaining a vector of pointers to the minimum value in each row of the matrix. We are again fortunate to find that the most accurate method is also the fastest.

Neighbor-joining and UPGMA trees for progressive alignment

**Neighbor-joining and UPGMA trees for progressive alignment.** Here we show the same set of four sequences and the order in which they will be aligned according to a neighbor-joining tree (above) and a UPGMA tree (below). Notice that

Additive profiles

**Additive profiles.** The profile functions in MUSCLE require amino acid frequencies for each column. Here we show the alignment of two profiles X and Y, giving a new profile Z. Note that the count $n^Z_i$ for amino acid $i$ in a column of Z is the sum of the counts in the aligned columns of X and Y: $n^Z_i = n^X_i + n^Y_i$. In terms of frequencies, this becomes $f^Z_i = N^X f^X_i / N^Z + N^Y f^Y_i / N^Z$, where $N^X$, $N^Y$, $N^Z$ are the number of sequences in X, Y and Z respectively. Therefore, given a suitable sequence weighting scheme, it is possible to compute frequencies in Z from the frequencies in X and Y. This avoids the step of building an explicit multiple alignment for Z in order to compute frequencies, as done in CLUSTALW and MAFFT.
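The additive property can be checked with a few lines of code. In this sketch a column is a dictionary of symbol frequencies and the profile sizes are plain sequence counts; with sequence weighting, the counts would be replaced by the total weight of each profile. The function name `merge_frequencies` is illustrative.

```python
from typing import Dict


def merge_frequencies(freq_x: Dict[str, float], n_x: float,
                      freq_y: Dict[str, float], n_y: float) -> Dict[str, float]:
    """Frequencies of a parent column from two aligned child columns.

    `n_x` and `n_y` are the numbers of sequences (or total weights) in the
    child profiles; the result is their size-weighted average.
    """
    n_z = n_x + n_y
    symbols = set(freq_x) | set(freq_y)
    return {
        a: (n_x * freq_x.get(a, 0.0) + n_y * freq_y.get(a, 0.0)) / n_z
        for a in symbols
    }


if __name__ == "__main__":
    # Column of X built from sequences "A", "A"; column of Y from "A", "S".
    print(merge_frequencies({"A": 1.0}, 2, {"A": 0.5, "S": 0.5}, 2))
```

The merged result is the same as counting residues directly in the four-sequence column, without ever constructing that column.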

Dynamic programming

The textbook algorithm for pair-wise alignment with affine penalties employs three dynamic programming matrices; see e.g. _{X}, _{Y}. We use the following notation: $X_x$ is the $x$th position of X, $X^x$ the first $x$ positions of X, $S_{xy}$ the substitution score (or profile function) for aligning $X_x$ to $Y_y$, $g^X_x$ the score for a gap-open in Y that is aligned to $X_x$, $t^X_x$ the score for a gap-close aligned to $X_x$ (and likewise $g^Y_y$, $t^Y_y$ for gaps in X), $\mathcal{A}_{xy}$ the set of all alignments of $X^x$ to $Y^y$, $M_{xy}$ the score of the best alignment in $\mathcal{A}_{xy}$ ending in a match (i.e., $X_x$ and $Y_y$ are aligned), $D_{xy}$ the score of the best alignment ending in a delete relative to X ($X_x$ is aligned to an indel) and $I_{xy}$ the score of the best alignment ending in an insert ($Y_y$ is aligned to an indel). A match is preceded by either a match, delete or insert, so:

$$M_{xy} = S_{xy} + \max \{ M_{x-1,y-1},\; D_{x-1,y-1} + t^X_{x-1},\; I_{x-1,y-1} + t^Y_{y-1} \} \qquad (14)$$

We assume that a center parameter has been added to $S_{xy}$ such that the gap extension penalty is zero. By considering all possible lengths for the final gap,

$$D_{xy} = \max_{k<x} \, [ M_{ky} + g^X_{k+1} ]. \qquad (15)$$

Here, $k$ is the last position of X that is aligned to a position of Y, so the final gap opens at $X_{k+1}$. Extracting the special case $k = x - 1$,

$$D_{xy} = \max \{ \max_{k<x-1} [ M_{ky} + g^X_{k+1} ],\; M_{x-1,y} + g^X_x \}. \qquad (16)$$

Hence,

$$D_{xy} = \max \{ D_{x-1,y},\; M_{x-1,y} + g^X_x \}. \qquad (17)$$

Similarly,

$$I_{xy} = \max \{ I_{x,y-1},\; M_{x,y-1} + g^Y_y \}. \qquad (18)$$

Let the outer loop iterate over increasing $x$ and the inner loop over increasing $y$. For fixed $x$, define vectors $M^{curr}_y = M_{xy}$, $M^{prev}_y = M_{x-1,y}$, $D^{curr}_y = D_{xy}$, $D^{prev}_y = D_{x-1,y}$, $I^{curr}_y = I_{xy}$ and $I^{prev}_y = I_{x-1,y}$. Now we can re-write (14), (17) and (18) to obtain the following recursion relations:

$$M^{curr}_y = S_{xy} + \max \{ M^{prev}_{y-1},\; D^{prev}_{y-1} + t^X_{x-1},\; I^{prev}_{y-1} + t^Y_{y-1} \} \qquad (19)$$

$$D^{curr}_y = \max \{ D^{prev}_y,\; M^{prev}_y + g^X_x \} \qquad (20)$$

$$I^{curr}_y = \max \{ I^{curr}_{y-1},\; M^{curr}_{y-1} + g^Y_y \}. \qquad (21)$$

An $L_X \times L_Y$ matrix is needed for the trace-back that produces the final alignment.
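The row-by-row recursion can be sketched as a score-only routine. This is an illustration under stated assumptions rather than MUSCLE's implementation: gap-open and gap-close scores are passed as per-position lists, the extension penalty is zero, terminal gaps are penalized like internal gaps, and one previous-row and one current-row vector is kept for each of the match, delete and insert states. The names `align_score` and `match_scores` are hypothetical, and the trace-back matrix is omitted.

```python
NEG = float("-inf")


def match_scores(x, y, match=2.0, mismatch=-1.0):
    """Toy substitution scores S[i][j] for two plain sequences."""
    return [[match if a == b else mismatch for b in y] for a in x]


def align_score(S, g_x, t_x, g_y, t_y):
    """Best global alignment score using rolling row vectors.

    S[i][j]: score for aligning X position i+1 to Y position j+1.
    g_x[i], t_x[i]: gap-open / gap-close scores at X position i+1.
    g_y[j], t_y[j]: the same for Y.
    """
    len_x, len_y = len(S), len(S[0])
    # Row x = 0: nothing of X consumed yet.
    m_prev = [0.0] + [NEG] * len_y
    d_prev = [NEG] * (len_y + 1)
    i_prev = [NEG] + [g_y[0]] * len_y
    for x in range(1, len_x + 1):
        m_curr = [NEG] * (len_y + 1)
        d_curr = [NEG] * (len_y + 1)
        i_curr = [NEG] * (len_y + 1)
        d_curr[0] = max(d_prev[0], m_prev[0] + g_x[x - 1])
        for y in range(1, len_y + 1):
            # A match closes any gap that ended at the previous positions.
            from_d = d_prev[y - 1] + t_x[x - 2] if x >= 2 else NEG
            from_i = i_prev[y - 1] + t_y[y - 2] if y >= 2 else NEG
            m_curr[y] = S[x - 1][y - 1] + max(m_prev[y - 1], from_d, from_i)
            # Extend a delete for free, or open one after a match.
            d_curr[y] = max(d_prev[y], m_prev[y] + g_x[x - 1])
            # Extend an insert for free, or open one after a match.
            i_curr[y] = max(i_curr[y - 1], m_curr[y - 1] + g_y[y - 1])
        m_prev, d_prev, i_prev = m_curr, d_curr, i_curr
    return max(m_prev[len_y],
               d_prev[len_y] + t_x[len_x - 1],
               i_prev[len_y] + t_y[len_y - 1])


if __name__ == "__main__":
    x, y = "ACGT", "AT"
    S = match_scores(x, y)
    print(align_score(S, [-1.0] * len(x), [-1.0] * len(x),
                      [-1.0] * len(y), [-1.0] * len(y)))
```

Only six vectors of length one more than the length of Y are held for scores, so score memory is linear; recovering the alignment itself still requires storing trace-back information for every cell.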

Inner loop

The inner-most dynamic programming loop, which computes the profile function, deserves careful optimization. We will consider the case of PSP; similar optimizations are possible for LE. $\mathrm{PSP} = \sum_i \sum_j f^x_i f^y_j S_{ij} = \sum_i f^x_i A^y_i$, where $A^y_i = \sum_j f^y_j S_{ij}$. The vector $A^y_i$ is used $L_X$ times, and it therefore pays to compute it once and cache it. Observe that a typical profile column contains far fewer than 20 different amino acids. We sort the frequencies in decreasing order; the summation $\sum_i f^x_i A^y_i$ is terminated if a frequency $f^x_i = 0$ is encountered. This typically reduces the time spent in the summation, especially when sequences are closely related. As with $A^y_i$, the sort order is computed once and cached. Observe that the roles of the two profiles are not symmetrical. It is most efficient to choose X, for which frequency sort orders are computed, to be the profile with the lowest amino acid diversity when averaged over columns. With this choice, the summation terminates earlier on average than if the other profile is identified as X. Note that out of

Diagonal finding

Many alignment algorithms are optimized for speed, typically at some expense in average accuracy, by using fast methods to identify regions of high similarity between two sequences, which appear as diagonals in the similarity matrix. The alignment path is then constrained to include these diagonals, reducing the area of the dynamic programming matrix that must be computed. MAFFT uses the fast Fourier transform to find diagonals. MUSCLE uses a different technique which we have previously shown

Additive profiles

Both the PSP and LE profile functions are defined in terms of amino acid frequencies and position-specific gap penalties. The data structure representing a profile is a vector of length _{P }in which each element contains frequencies for each amino acid type and a few additional values related to gaps. We call this data structure a _{P }multiple alignment containing letters and indels. For _{P}) = O(^{2 }+ _{P}) procedure in both time and space, giving a significant advantage for

Sequence weighting

For the frequencies in the parent profile vector to be a linear combination of the child frequencies, the weight assigned to a sequence must be the same in the child and parent profiles. This requirement is not satisfied, for example, by the Henikoff or PSI-BLAST schemes, which compute weights based on a multiple alignment. We therefore choose the CLUSTALW scheme, which computes a fixed weight for each sequence from edge lengths in the tree.

Gap representation

To compute gap penalties, we need the frequencies $f_o$ of gap opens and $f_c$ of gap closes in each position. In the case of the LE profile function, we additionally require the gap frequency $f_G$. These can be accommodated by storing $f_o$, $f_c$ and $f_e$ in the profile vector, where $f_e$ is the frequency of gap-extensions in the column (meaning that indels are found in a given sequence in the column, the preceding column and in the following column; i.e., a gap-close is not counted as an extension). These three values suffice to compute the total frequency $f_G$ of indels, as needed for the occupancy factor in the profile function, as follows:

$$f_G = f_o + f_c + f_e. \qquad (22)$$

Now consider the problem of computing the occupancy frequencies in the parent profile vector, given only the child occupancy frequencies and the trace-back path for the alignment. Consider first a diagonal edge in the path, i.e. an edge that does not open or extend a gap, following another diagonal edge. In this case, the occupancy frequencies are computed similarly to amino acid frequencies (as a sum in which a child frequency is weighted according to the total weight of the sequences in its profile). For horizontal or vertical edges, i.e. edges that open or extend gaps, the parent occupancy frequencies can be computed by considering the effect of the new column of indels (Figure _{o}, _{c }and _{e }are sufficient for their values in the parent profile vector to be computed in O(_{P}) time from the child profile vectors and alignment path.

Occupancy frequencies in additive profiles

**Occupancy frequencies in additive profiles.** Here we show an alignment of profiles X and Y giving Z. A column C of indels (shaded background) has been inserted at position

Construction of the root alignment

By avoiding the use of profile matrices, the complexity of a single progressive alignment iteration is reduced from O(_{P }^{2 }+ _{p}) space and O(_{P}^{2 }+ _{P}) time to O(_{P}^{2}) = O(^{2 }+ _{P }log ^{2 }log _{P}) = O(^{2}) space for storage of the paths. This is expensive for large

E-strings

An alignment path can be considered as an operator on a pair of sequences that inserts indels into those sequences such that their lengths become equal. Conventionally, an alignment path is represented as a vector of three symbols representing edges in the graph, say M, D and I (for match, delete and insert, i.e. a diagonal, horizontal or vertical edge). Note that indels in one sequence are inserted only by Ds, indels in the other are inserted only by Is. Define an e-string _{P}, the length of the alignment path. Now consider the effect of applying two consecutive e-strings ("multiplying" them). This can be expressed as a third e-string, which can be efficiently computed in O(|

E-strings

**E-strings.** (1) The effect of the e-string operator <3,-1,2> on the sequence MQTIF. A positive number

Root alignment construction

**Root alignment construction.** Here we show the same progressive alignment as Figure 1. Each edge in the tree is labeled with the e-string for its side of the alignment at the parent node. The e-string needed to insert indels into a sequence in the root alignment can be determined by multiplying e-strings along the path to the root. For example, for sequence LSF, the root e-string is <3,-1,1>*<1,-1,2> = <1,-1,1,-1,1>.
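The e-string operations in these examples can be sketched in code. The sketch assumes that a positive entry n copies the next n symbols unchanged and a negative entry -n inserts n indels, which reproduces both captioned examples. The product is computed here by expanding the e-strings column by column, a simple approach chosen for clarity rather than the efficient method described in the text; the function names are illustrative.

```python
from itertools import islice


def apply_estring(estring, seq, gap="-"):
    """Apply an e-string to a sequence: n > 0 copies n symbols, n < 0 inserts |n| indels."""
    out, pos = [], 0
    for n in estring:
        if n > 0:
            out.append(seq[pos:pos + n])
            pos += n
        else:
            out.append(gap * -n)
    return "".join(out)


def _expand(estring):
    """One flag per output column: True = original symbol, False = inserted indel."""
    return [n > 0 for n in estring for _ in range(abs(n))]


def multiply(second, first):
    """E-string equivalent to applying `first` and then `second`."""
    stream = iter(_expand(first))
    merged = []
    for n in second:
        if n > 0:
            merged.extend(islice(stream, n))  # pass through columns produced by `first`
        else:
            merged.extend([False] * -n)       # new indels added by `second`
    # Compress runs of flags back into signed run lengths.
    result = []
    for is_symbol in merged:
        step = 1 if is_symbol else -1
        if result and (result[-1] > 0) == is_symbol:
            result[-1] += step
        else:
            result.append(step)
    return result


if __name__ == "__main__":
    print(apply_estring([3, -1, 2], "MQTIF"))
    root = multiply([3, -1, 1], [1, -1, 2])
    print(root, apply_estring(root, "LSF"))
```

Multiplying along the path from a leaf to the root yields a single e-string per sequence, so each sequence is expanded only once when the root alignment is written out.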

Brenner's method

Steven Brenner (personal communication) observed that a multiple alignment can alternatively be obtained by aligning each sequence to the root profile. This requires O(_{P}^{2}) time, giving a lower asymptotic complexity in _{P}. This method gives opportunities for errors relative to the "exact" e-string solution (when a sequence misaligns to its copy in the profile), but can also lead to improvements by allowing the sequence to correctly align to conserved motifs that were not apparent when the sequence was added. (Note the resemblance to the refinement stage, which begins by re-aligning individual sequences to the rest). The chances for error are reduced by constraining the alignment to forbid gaps in the root profile. Our tests show that this method gives comparable average accuracy to the e-string solution but is slower for up to at least a few thousand sequences, and e-strings are therefore used by default.

Refinement complexity

In the following, we assume that an explicit multiple alignment (profile matrix) of all sequences is maintained, and determine the complexity of each step in Stage 3. Step 3.1 determines the bipartition induced by deleting an edge from the tree. This is O(_{P}) time and space. Step 3.3 performs profile-profile alignment, which is O(_{P}^{2}) time and space. Step 3.4 computes the SP score, which is O(^{2}_{P}) time and O(_{P}) space (discussed in more detail shortly). A single iteration of Stage 3 is thus O(^{2}_{P }+ _{P}^{2}) time and O(_{P }+ _{P}^{2}) space. There are O(^{3}_{P }+ _{P}^{2}) time and O(_{P }+ _{P}^{2}) space, which is O(^{4 }+ ^{3}^{2}) time and O(^{2 }+ ^{2}) space. Assuming that a fixed maximum number of iterations of Stage 3 is imposed, this is also the total complexity of refinement. We now consider optimizations of the refinement stage.

Anchor columns

A multiple alignment can be divided vertically at high-confidence (^{2}) factor in dynamic programming. This strategy has been used by several previous algorithms, including PRRP

SP score

Notice that computation of the SP score dominates the time complexity of refinement and of MUSCLE overall, introducing O(^{4}) and O(^{3}^{a }to the SP score from amino acids; gap penalties require special treatment. Let

SP^{a }= Σ_{x }Σ_{s }Σ_{t >s }

Define _{i }[_{s }_{i }[

SP^{a }= Σ_{x }Σ_{i }_{i }Σ_{j }_{j>i }_{x }Σ_{i }(_{i}^{2 }- _{i})
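The saving from working with per-column counts instead of sequence pairs can be demonstrated with a short sketch. It covers only the amino acid contribution to the SP score, ignores gap penalties and sequence weighting, and assumes a symmetric scoring function. The function names and the toy scoring rule are illustrative.

```python
from collections import Counter
from itertools import combinations


def sp_pairwise(msa, score, gap="-"):
    """Amino acid part of the SP score by visiting every pair of sequences."""
    total = 0
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            if a != gap and b != gap:
                total += score(a, b)
    return total


def sp_from_counts(msa, score, gap="-"):
    """Same quantity computed from per-column letter counts.

    For each column, n_a * (n_a - 1) / 2 pairs of identical letters score
    score(a, a), and n_a * n_b pairs of different letters score score(a, b).
    """
    total = 0
    for column in zip(*msa):
        counts = Counter(c for c in column if c != gap)
        letters = sorted(counts)
        for idx, a in enumerate(letters):
            n_a = counts[a]
            total += (n_a * n_a - n_a) // 2 * score(a, a)
            for b in letters[idx + 1:]:
                total += n_a * counts[b] * score(a, b)
    return total


if __name__ == "__main__":
    msa = ["MQTIF", "MQ-IF", "MKTIW"]
    toy = lambda a, b: 2 if a == b else -1
    print(sp_pairwise(msa, toy), sp_from_counts(msa, toy))
```

The pairwise version grows quadratically with the number of sequences, while the count-based version needs one pass to count each column plus a loop bounded by the alphabet size.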

Frequencies are computed as:

^{x}_{i }= _{i }[

Using frequencies,

For simplicity, we have neglected sequence weighting; it is straightforward to show that (26) applies unchanged if weighting is used. Note that (23) is O(^{2}_{P}) but (25) and (26) are O(_{P}). For ^{g }be the contribution of gap penalties to SP, so SP = SP^{a }+ SP^{g}. It is natural to seek an O(_{P}) expression for SP^{g }analogous to (26), but to the best of our knowledge no solution is known. Note that in MUSCLE refinement, the absolute value of the SP score is not needed; rather, it suffices to determine the difference in the SP scores before and after re-aligning a pair of profiles. Let SP(_{s }Σ_{t>s }SP(

SP = Σ_{s∈X }Σ_{t∈X:t>s }SP(_{s∈Y }Σ_{t∈Y:t>s }SP(_{s∈X }Σ_{t∈Y }SP(

Note that the intra-profile terms are unchanged in any alignment that preserves the columns of the profile intact, which is true by definition in profile-profile alignment. This follows by noting that any indels added to align the profiles are guaranteed to be external gaps with respect to any pair of sequences in the same profile. It therefore suffices to compute the change in the inter-profile term:

SP_{XY }= Σ_{s∈X }Σ_{t∈Y }SP(

This reduces the average time by a factor of about two. We can further improve on this by noting that in the typical case, there are few or no changes to the alignment. This suggests computing the change in SP score by looking only at differences between the two alignments. Let _{- }be the alignment path before re-alignment and _{+ }the path after re-alignment. The change in alignment can be specified as the set of edges in _{- }or _{+}, but not both; i.e., by considering a path to be a set of edges and taking the set symmetric difference Δ_{- }∪ _{+}) - (_{- }∩ _{+}). The path _{+ }after re-alignment is available from the dynamic programming traceback. The path _{- }before re-alignment can be efficiently computed in O(_{P}) time. Note that in order to construct the profile of a subset of sequences extracted from a multiple alignment, those columns that contain only indels in that subset must be deleted. The set of such columns in both profiles is therefore available as a side effect of profile construction, and this set immediately implies _{-}. It is a simple O(_{P}) procedure to compute Δ_{- }and _{+}. Note that SP^{a }is a sum over columns, and there is a one-to-one correspondence between columns and edges in ^{a }can therefore be computed as a sum over columns in Δ_{-}, reducing the time complexity from O(_{P}) to O(^{g}. We say that a gap G ^{g }is unchanged. It therefore suffices to consider penalties for gaps in Γ, again with negative signs for edges from _{-}. The construction of Γ is straightforward in O(_{P}) time. Finally, a sum over pairs in Γ is needed, reducing the O(^{2}) component to the smallest possible set of terms.

Dimer approximation

We next describe an approximation to SP that can be computed in O(_{P}) time. Define a two-symbol alphabet {X, -} in which X represents any amino acid and - is the indel symbol. There are four dimers in this alphabet: XX, X-, -X and --, which we denote by no-gap, gap-open, gap-close and gap-extend respectively. Re-write a multiple alignment in terms of these dimers, adopting the convention that dimer ^{g }of an aligned pair of dimers, written as ^{g }can be computed by summing substitution scores over all pairs of sequences. We can now apply Equation 26, re-interpreting the frequency vectors

**Dimers in the {X,-} alphabet.** Gap penalties for the sequence pair (

**Problem dimer pair.** The aligned dimer pair -X ↔ -- causes a problem because its gap penalty contribution cannot be computed without additional information. Note that the first column of indels is external; after this column is discarded, different penalties may be needed, as these two examples show.
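To make the dimer re-writing concrete, the sketch below maps one aligned sequence onto the {X, -} alphabet and labels each column with its dimer name. Two conventions are assumptions made here for illustration and are not taken from the text: the dimer at a column is the symbol in the previous column followed by the symbol in the current column, and the position before the first column is treated as a residue. The function names are hypothetical.

```python
from typing import List

# Dimer names as given in the text for the {X, -} alphabet.
DIMER_NAMES = {
    "XX": "no-gap",
    "X-": "gap-open",
    "-X": "gap-close",
    "--": "gap-extend",
}


def to_gap_alphabet(aligned: str) -> str:
    """Replace every residue by X and keep every indel as '-'."""
    return "".join("-" if symbol == "-" else "X" for symbol in aligned)


def dimers(aligned: str) -> List[str]:
    """Return one dimer per column (previous symbol + current symbol).

    Assumption: a virtual residue 'X' precedes the first column, so a gap
    that starts in column 1 is reported as gap-open.
    """
    reduced = to_gap_alphabet(aligned)
    padded = "X" + reduced
    return [padded[k:k + 2] for k in range(len(reduced))]


def dimer_labels(aligned: str) -> List[str]:
    """Return the dimer name for each column of an aligned sequence."""
    return [DIMER_NAMES[d] for d in dimers(aligned)]
```

Under these assumptions, the number of gap-open dimers in a row equals the number of gaps in that sequence, which is why per-column dimer counts carry the information needed to approximate gap penalties.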

**Dimer substitution matrix.** This matrix specifies the contribution to the total gap penalty for a pair of sequences for each possible pair of aligned dimers. Here,

Evaluation of profile functions

We have previously attempted a systematic comparison of profile functions _{F}(_{G}(_{F }is the discriminator plot for F as a function of

**Discrimination plot for PP2.** The

**Discrimination plot for PP.** This is similar to Figure 13, except that the database was generated from the PP test set. Here we see an ambiguous result as the discrimination plots for LE and PSP intersect.

Complexity of MUSCLE

The complexity of MUSCLE is summarized in Table _{P }= O(

Complexity of MUSCLE. Here we show the big-O asymptotic complexity of the elements of MUSCLE as a function of

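| **Step** | **O(Space)** | **O(Time)** |
|---|---|---|
| K-mer distance matrix | N^{2} + L | N^{2}L |
| UPGMA | N^{2} | N^{2} |
| Progressive (one iteration) | L_{P}^{2} = N^{2} + L^{2} | L_{P}^{2} = N^{2} + L^{2} |
| Progressive (root alignment) | NL_{P} = N^{2} + NL | NL_{P} log N = N^{2} log N + NL log N |
| Progressive (N iterations + root) | N^{2} + L^{2} | N^{3} + NL^{2} |
| Refinement (one edge) | NL_{P} + L_{P}^{2} = N^{2} + L^{2} | N^{2}L_{P} + L_{P}^{2} = N^{3} + L^{2} |
| Refinement (N edges) | N^{2} + L^{2} | N^{4} + NL^{2} |
| TOTAL | N^{2} + L^{2} | N^{4} + NL^{2} |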

Results

MUSCLE provides a variety of options that offer different trade-offs between speed and accuracy. In the following, we report speed and accuracy results for three sets of options: (1) the full MUSCLE algorithm including Stages 1, 2 and 3 with default options; (2) Stages 1 and 2 only, using default options (MUSCLE-prog); and (3) Stage 1 only using the fastest possible options (MUSCLE-fast), which are as follows: ^{Binary }is used as a distance measure (Equation 2), the PSP profile function is used, and diagonal finding is enabled.

Alignment accuracy

In Tables

Accuracy scores. The average accuracy, measured by the

| **Method** | **PREFAB** | **BAliBASE** | **SABmark** | **SMART** |
|---|---|---|---|---|
| MUSCLE | 0.648 | 0.896 | 0.430 | 0.856 |
| MUSCLE-prog | 0.634 | 0.883 | 0.427 | 0.837 |
| FFTNS1 | 0.619 | 0.844 | 0.376 | 0.815 |
| MUSCLE-fast | 0.616 | 0.849 | 0.396 | 0.813 |
| CLUSTALW | 0.588 | 0.860 | 0.404 | 0.823 |
| POA-blast | 0.577 | 0.839 | 0.352 | 0.788 |
| POA | 0.576 | 0.834 | 0.280 | 0.797 |

CPU times. The total CPU time required to create all alignments in each test set, measured in seconds on a 2.5 GHz Pentium 4 desktop computer.

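| **Method** | **PREFAB** | **BAliBASE** | **SABmark** | **SMART** |
|---|---|---|---|---|
| MUSCLE-fast | 540 | 11 | 45 | 30 |
| FFTNS1 | 720 | 16 | 70 | 46 |
| MUSCLE-prog | 3,000 | 52 | 429 | 180 |
| MUSCLE | 11,000 | 81 | 1,500 | 560 |
| POA-blast | 11,000 | 90 | 290 | 670 |
| CLUSTALW | 15,000 | 160 | 210 | 480 |
| POA | 24,000 | 130 | 380 | 880 |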
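The relative speeds implied by the CPU time table can be checked with a short calculation. The sketch below copies the table values as reported; the dictionary name and the helper function `speedup` are hypothetical and serve only to illustrate the arithmetic.

```python
# Total CPU seconds per test set, copied from the CPU time table.
CPU_SECONDS = {
    "MUSCLE-fast": {"PREFAB": 540, "BAliBASE": 11, "SABmark": 45, "SMART": 30},
    "FFTNS1": {"PREFAB": 720, "BAliBASE": 16, "SABmark": 70, "SMART": 46},
    "MUSCLE-prog": {"PREFAB": 3000, "BAliBASE": 52, "SABmark": 429, "SMART": 180},
    "MUSCLE": {"PREFAB": 11000, "BAliBASE": 81, "SABmark": 1500, "SMART": 560},
    "POA-blast": {"PREFAB": 11000, "BAliBASE": 90, "SABmark": 290, "SMART": 670},
    "CLUSTALW": {"PREFAB": 15000, "BAliBASE": 160, "SABmark": 210, "SMART": 480},
    "POA": {"PREFAB": 24000, "BAliBASE": 130, "SABmark": 380, "SMART": 880},
}


def speedup(method: str, baseline: str, test_set: str) -> float:
    """Return how many times faster `method` is than `baseline` on one test set."""
    return CPU_SECONDS[baseline][test_set] / CPU_SECONDS[method][test_set]


if __name__ == "__main__":
    for test_set in ("PREFAB", "BAliBASE", "SABmark", "SMART"):
        ratio = speedup("MUSCLE-fast", "CLUSTALW", test_set)
        print(f"{test_set}: MUSCLE-fast is {ratio:.1f}x faster than CLUSTALW")
```

On the SMART set, for instance, `speedup("MUSCLE-fast", "CLUSTALW", "SMART")` is 480 / 30 = 16. These ratios describe the benchmark sets only; they say nothing about how the programs scale with larger numbers of sequences.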

Execution speed

To compare speeds for a larger number of sequences, we created a test set by using PSI-BLAST to search the NCBI non-redundant protein sequence database for hits to dienoyl-CoA isomerase (1dci in the Protein Data Bank _{P }is O(

**Execution time as a function of N.** This plot shows the execution time as a function of

Conclusions

MUSCLE demonstrates improvements in accuracy and reductions in computational complexity by exploiting a range of existing and new algorithmic techniques. While the design, as is typical for practical multiple sequence alignment tools, arguably lacks elegance and theoretical coherence, useful improvements were achieved through a number of factors. The most important of these were the selection of heuristics, close attention to details of the implementation, and careful evaluation of the impact of different elements of the algorithm on speed and accuracy. MUSCLE enables high-throughput applications to achieve average accuracy comparable to the most accurate tools previously available, which we expect to be increasingly important in view of the continuing rapid growth in sequence data.

Availability and requirements

MUSCLE is a command-line program written in a conservative subset of C++. At the time of writing, MUSCLE has been successfully ported to 32-bit Windows, 32-bit Intel architecture Linux, Solaris, Macintosh OS X and the 64-bit HP Alpha Tru64 platform. MUSCLE is donated to the public domain. Source code and executable files are freely available at