Centre for Genomic Regulation, Dr. Aiguader 88, 08003 Barcelona, Spain

Departament de Matemàtica Aplicada I, ETSEIB, Universitat Politècnica de Catalunya, Avinguda Diagonal 647, 08028 Barcelona, Spain

Abstract

Background

A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. By contrast, no methods exist for simulating DNA MSAs directly from the transition matrices. Moreover, existing software is restricted to time-reversible models and is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates on different lineages).

Results

We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of desired length.

Conclusion

The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree.

Background

The package _{e} assigned to each edge _{e} correspond to the conditional probabilities

The shape of the substitution matrices (in all cases sum of rows is equal to 1 and the entries are nonnegative) for

For these models, the algorithms given in

In the continuous-time models, the process is often assumed to be

Other powerful software include the Bayesian phylogenetic methods of

It is worth pointing out that the strand symmetric model and the general Markov model considered in

As shown in

Implementation

./GenNon-h treefilename outputfilename length modelname

For instance, if ‘tree.txt’ is a text file containing a 5-taxon phylogenetic tree in Newick format:

((species1:0.01,species2:0.2,species3:0.3):0.5,species4:0.4,species5:0.7),

then the following command line input

./GenNon-h tree.txt data.fa 10000 k81

generates an MSA of length 10,000 nt evolving on the tree given in ‘tree.txt’ under the
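To make the input format concrete, the following minimal C++ sketch reads the leaf names and their branch lengths from a Newick string such as the one above. It is an illustration only and not the parser used by GenNon-h: the function name `readLeafEdges` and the type `LeafEdge` are hypothetical, and the sketch assumes that every leaf carries a numeric branch length and that the string contains no whitespace or quoted labels.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One (leaf name, branch length) pair per terminal edge (hypothetical type).
using LeafEdge = std::pair<std::string, double>;

// Collects the terminal edges of a Newick tree in order of appearance.
// Assumption: every leaf is written as name:length with no whitespace.
std::vector<LeafEdge> readLeafEdges(const std::string& newick) {
    assert(!newick.empty());
    std::vector<LeafEdge> leaves;
    std::size_t i = 0;
    while (i < newick.size()) {
        const char c = newick[i];
        // A leaf label starts right after '(' or ',' unless a new subtree opens.
        const bool startsLabel = (c == '(' || c == ',') &&
                                 i + 1 < newick.size() && newick[i + 1] != '(';
        if (!startsLabel) {
            ++i;
            continue;
        }
        const std::size_t colon = newick.find(':', i + 1);
        if (colon == std::string::npos) break;  // no further branch lengths
        const std::string name = newick.substr(i + 1, colon - i - 1);
        std::size_t used = 0;  // number of characters consumed by the length
        const double length = std::stod(newick.substr(colon + 1), &used);
        leaves.emplace_back(name, length);
        i = colon + 1 + used;  // continue right after the parsed number
    }
    return leaves;
}
```

Applied to the 5-taxon example, this sketch returns five pairs, from species1 with length 0.01 to species5 with length 0.7; the internal edge of length 0.5 is skipped because it does not end in a leaf.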

The algorithm proceeds as follows:

Input: a discrete-time Markov model

Step 1: generate a DNA sequence _{0} of length

Step 2: for each edge _{e} of the type _{e} of edge _{e} such that _{e} and repeat the procedure. We limited the number of trials to 1000 before the simulations require a re-start; however, in practice a

Step 3: we let _{0} evolve according to the corresponding Markov process on

Output: a multiple sequence alignment and the substitution matrices used for its simulation.
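Steps 1 and 3 can be illustrated with a minimal C++ sketch. It is not the GenNon-h source code: the names `sampleRoot`, `evolveAlongEdge` and `isRowStochastic`, the A, C, G, T ordering of states, the tolerance `eps` and the choice of `std::mt19937` as random number generator are all assumptions made for illustration. The sketch draws a root sequence from a given nucleotide distribution and then lets it evolve along one edge by replacing each site with a nucleotide drawn from the row of the edge's substitution matrix indexed by the parent nucleotide.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <string>

using Distribution = std::array<double, 4>;
using SubstitutionMatrix = std::array<Distribution, 4>;

// State ordering A, C, G, T is an assumption of this sketch.
const std::string kAlphabet = "ACGT";

// True when all entries are nonnegative and every row sums to 1 (up to eps).
bool isRowStochastic(const SubstitutionMatrix& m, double eps = 1e-9) {
    for (const auto& row : m) {
        double sum = 0.0;
        for (double p : row) {
            if (p < 0.0) return false;
            sum += p;
        }
        if (std::fabs(sum - 1.0) > eps) return false;
    }
    return true;
}

// Step 1 (sketch): draw a root sequence of the given length from pi.
std::string sampleRoot(const Distribution& pi, std::size_t length,
                       std::mt19937& rng) {
    std::discrete_distribution<int> draw(pi.begin(), pi.end());
    std::string root(length, 'A');
    for (char& c : root) c = kAlphabet[draw(rng)];
    return root;
}

// Step 3 (sketch): evolve a parent sequence along one edge with matrix m.
// Entry m[i][j] is read as P(child = j | parent = i).
std::string evolveAlongEdge(const std::string& parent,
                            const SubstitutionMatrix& m, std::mt19937& rng) {
    assert(isRowStochastic(m));
    std::array<std::discrete_distribution<int>, 4> rows;
    for (std::size_t i = 0; i < 4; ++i) {
        rows[i] = std::discrete_distribution<int>(m[i].begin(), m[i].end());
    }
    std::string child(parent.size(), 'A');
    for (std::size_t k = 0; k < parent.size(); ++k) {
        const std::size_t from = kAlphabet.find(parent[k]);
        assert(from != std::string::npos);  // only A, C, G, T are supported
        child[k] = kAlphabet[rows[from](rng)];
    }
    return child;
}
```

Calling `evolveAlongEdge` once per edge, from the root towards the leaves, yields one sequence per node; the sequences at the leaves form the alignment. With the identity matrix the child sequence equals the parent, which is a convenient sanity check.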

The output files comprise both a FASTA file with a multiple sequence alignment simulated on
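As an illustration of the alignment output, the sketch below serializes named sequences in FASTA format: a header line starting with `>` followed by the sequence. The function name `toFasta` and the line width of 60 characters are assumptions of this sketch; the exact layout written by GenNon-h may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Serializes an alignment as FASTA text. The wrapping width is a
// hypothetical parameter chosen for this sketch.
std::string toFasta(const std::vector<std::string>& names,
                    const std::vector<std::string>& sequences,
                    std::size_t width = 60) {
    assert(names.size() == sequences.size());
    assert(width > 0);
    std::ostringstream out;
    for (std::size_t i = 0; i < names.size(); ++i) {
        out << '>' << names[i] << '\n';
        // Write the sequence in chunks of at most `width` characters.
        for (std::size_t pos = 0; pos < sequences[i].size(); pos += width) {
            out << sequences[i].substr(pos, width) << '\n';
        }
    }
    return out.str();
}
```

The returned string can be written to a file such as ‘data.fa’ with a standard output stream.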

Results and discussion

The C++ implementation of

**A zipped (extension .zip) file containing the C++ implementation of GenNon-h.**


In order to test the speed of

| **Model** | Time |
| --- | --- |
| | 2.6s |
| | 2.6s |
| | 2.5s |
| | 2.6s |
| | 3.0s |

The simulated data saved in the output files together with the parameters used for its simulation are suited for hypothesis testing in a variety of biological applications.

Conclusions

The possibility of generating nonhomogeneous (for any of the models above) or nonstationary (for the

Availability and requirements

**Project name**:

**Project home page**:

**Operating systems**: Platform independent

**Programming language**: C++

**Other requirements**: GNU gcc compiler, version 1.47.0 of the boost library (

**Distributed under** the GNU General Public License

Abbreviations

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

AK created and tested the software, established a platform for its usage and drafted part of the manuscript. MC conceived the project and drafted part of the manuscript. Both authors read and approved the final manuscript and declare no conflicts of interest.

Acknowledgements

Both authors were partially supported by Generalitat de Catalunya, 2009 SGR 1284. MC is partially supported by Ministerio de Educación y Ciencia MTM2009-14163-C02-02.