Abstract
Background
In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity. Explicit representations for the number of some of these alignments have already been obtained.
Results
We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences and a new formula for a class of reduced alignments.
Conclusions
A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods.
AMS Subject Classification
Primary 92B05, 33C20, secondary 39A14, 65Q30
Keywords:
DNA sequence; Alignment; Difference equation
Background
Let us consider a DNA sequence as a mathematical string
x = x_{1}x_{2}⋯x_{n},
where x_{i}∈{A,G,C,T} is one of the four nucleotides, i=1,2,…,n, i.e. A denotes adenine, C cytosine, G guanine and T thymine. The sequence x then has length n.
Our main goal is to compare the sequence x with another DNA sequence
y = y_{1}y_{2}⋯y_{m}
of length m, to measure the similarity between both strings and also to determine their residue-residue correspondences.
Sequence comparison and alignment is a central and crucial tool in molecular biology. For example, Pairwise Sequence Alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid) [1].
For some recent developments and directions we refer the reader to [2-7], and to [8] for a general review of different alignment methods.
To align the sequences CGT and ACTT, one can use EMBOSS Needle for nucleotide sequences [9], which creates an optimal global alignment of the two sequences using the Needleman-Wunsch algorithm.
Following Lesk [10], in order to compare the amino acids appearing at corresponding positions in two sequences, their correspondences must be assigned; a sequence alignment is the identification of these residue-residue correspondences. For some references on sequence alignment we refer the reader to [10-16].
To compare two sequences, there are three main possibilities, leading to three different numbers of alignments [10,11,13]:
1. The total number of alignments, denoted by f(n,m), which was obtained in [13].
2. A gap in one sequence is followed by a gap in the other sequence, as in Alignments 1 and 2 for the sequences x=CGT and y=ACTT (see Tables 1 and 2 below). Considering these two alignments as equivalent to Alignment 3 (see Table 3), which has no gap in those positions, we obtain the number of reduced alignments, denoted by h(n,m), and obviously h(n,m)<f(n,m). This case has been solved in [11], and we give here another representation in terms of hypergeometric series.
3. In the interesting case where Alignments 1 and 2 are equivalent to each other but different from Alignment 3, we have a number of reduced alignments g(n,m), where h(n,m)<g(n,m)<f(n,m). This last case is new and we present an explicit formula for g.
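To make the first count concrete, the following minimal Python sketch enumerates every total alignment of x=CGT and y=ACTT by brute force. It assumes the usual definition of an alignment as a list of columns, each pairing a residue with a residue or a residue with a gap, and never a gap with a gap. The names `alignments`, `show` and `GAP` are hypothetical helpers introduced only for this illustration; the reduced counts h and g are not computed here.

```python
from typing import Iterator, List, Tuple

GAP = "-"  # symbol used for a gap (illustrative choice)


def alignments(x: str, y: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield every total alignment of x and y as a list of columns.

    Assumption: a column is (residue, residue), (residue, gap) or
    (gap, residue); a column of two gaps is never produced.
    """
    if not x and not y:
        yield []
        return
    if x and y:  # align the two leading residues with each other
        for rest in alignments(x[1:], y[1:]):
            yield [(x[0], y[0])] + rest
    if x:  # leading residue of x faces a gap in y
        for rest in alignments(x[1:], y):
            yield [(x[0], GAP)] + rest
    if y:  # leading residue of y faces a gap in x
        for rest in alignments(x, y[1:]):
            yield [(GAP, y[0])] + rest


def show(columns: List[Tuple[str, str]]) -> str:
    """Render an alignment as two stacked rows."""
    top = "".join(col[0] for col in columns)
    bottom = "".join(col[1] for col in columns)
    return top + "\n" + bottom


all_alignments = list(alignments("CGT", "ACTT"))
print(len(all_alignments))      # 129 alignments in total
print(show(all_alignments[0]))  # first alignment found: CGT- over ACTT
```

Brute-force enumeration is only practical for very short sequences, which is why closed formulas for these counts are of interest.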
Results and discussion
Number of f(x,y) alignments
The total number of alignments f(n,m) satisfies the following recurrence relation [13]
with initial conditions f(n,0)=f(0,m)=1 for n,m=1,2,3,…. The solution of the above partial difference equation is given by
(see formula (10) in [13]) and the generating function [17,18] is
Therefore the coefficients f(n,m) in the expansion
are given in terms of a hypergeometric series by
This relation seems to be new in this form. Here, the generalized hypergeometric series is defined as (see e.g. [19, Chapter 16])
{}_{p}F_{q}(a_{1},…,a_{p}; b_{1},…,b_{q}; z) = \sum_{k=0}^{\infty} \frac{(a_{1})_{k}⋯(a_{p})_{k}}{(b_{1})_{k}⋯(b_{q})_{k}} \frac{z^{k}}{k!},
and (A)_{k}=A(A+1)⋯(A+k−1), with (A)_{0}=1, denotes the Pochhammer symbol. It is assumed that b_{j}≠−k for k=0,1,2,… in order to avoid singularities in the denominators. If one of the parameters a_{j} equals a negative integer, then the sum becomes a terminating series.
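As an illustration of how such quantities can be computed exactly, here is a minimal Python sketch. It rests on two assumptions that are not taken from the displayed formulas of this text: (i) the total count follows the standard recurrence f(n,m)=f(n−1,m)+f(n,m−1)+f(n−1,m−1) with f(n,0)=f(0,m)=1, and (ii) the classical identity f(n,m) = 2F1(−n,−m;1;2), which may differ in form from the hypergeometric representation given in the paper. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def count_alignments(n: int, m: int) -> int:
    """Total number of alignments via dynamic programming.

    ASSUMPTION: f(i,j) = f(i-1,j) + f(i,j-1) + f(i-1,j-1),
    with f(i,0) = f(0,j) = 1 (standard recurrence, assumed here).
    """
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]


def pochhammer(a, k: int) -> Fraction:
    """(a)_k = a (a+1) ... (a+k-1), with (a)_0 = 1."""
    result = Fraction(1)
    for i in range(k):
        result *= a + i
    return result


def hyp_terminating(a_params, b_params, z) -> Fraction:
    """Exact value of a terminating generalized hypergeometric series.

    Requires at least one upper parameter to be a non-positive integer,
    so that only finitely many terms are non-zero.
    """
    stops = [-a for a in a_params if a <= 0 and a == int(a)]
    if not stops:
        raise ValueError("no non-positive integer upper parameter")
    total = Fraction(0)
    for k in range(int(min(stops)) + 1):
        numerator = Fraction(1)
        for a in a_params:
            numerator *= pochhammer(a, k)
        denominator = Fraction(1)
        for b in b_params:
            denominator *= pochhammer(b, k)
        total += numerator / denominator * Fraction(z) ** k / factorial(k)
    return total


print(count_alignments(3, 4))               # 129
print(hyp_terminating([-3, -4], [1], 2))    # 129
```

Exact rational arithmetic is used so that the comparison between the recurrence and the series involves no rounding; for large n and m the integers grow quickly but remain exact in Python.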
Number of h(x,y) alignments
In this case, the recurrence relation for the h(n,m) coefficients is [11]
with initial conditions h(n,0)=h(0,m)=1. Therefore, the generating function [17,18] is
and the coefficients in the expansion
are given by
where
The above coefficients can be written in terms of (terminating) hypergeometric series as
Number of g(x,y) alignments
As indicated before, the main aim of this paper is to give an explicit representation in this case. The recurrence relation for the g(n,m) coefficients is [11]
with initial conditions g(n,0)=g(0,m)=1. Thus, the generating function [17,18] is
Theorem 1. The coefficients α_{n,m} in the expansion
are explicitly given by
where
and [x] denotes the integer part of x.
Proof. If we expand,
we have two summands to be computed, namely
In order to compute the first sum (12) let us introduce
Therefore, the summation to be done reads as
where U, V, A and B must be computed in terms of the initial indices.
The product of binomials can be simplified to
Thus,
and then
Finally, the summation reads as
where
A similar computation with the second summand (13) leads to the final result.
Some numerical values are g(10,10)=2003204, g(50,50)=2.71972×10^{34}, g(100,100)=7.55997×10^{69}, and we note that g(n,n)>10^{80} for n≥115. This last inequality is relevant since 10^{80} is an estimate of the number of protons in the universe [13].
Conclusions
A unified approach for a wide class of alignments between two DNA sequences has been provided. Our approach also gives an explicit formula that fills a gap in the theory of sequence alignment. The formula is computable and, if complemented by software development, will provide a deeper insight into the theory of sequence alignment and give rise to new comparison methods. In the future, it may also be used to obtain explicit formulas for, and to compute, the number of total, reduced, and effective alignments for multiple sequences.
Methods
We have performed a number of numerical computations to compare our formulae with the Mathematica® [20] command Coefficient applied to the series expansion of (1), on a MacBook Pro featuring a 45 nm “Penryn” 2.66 GHz Intel “Core 2 Duo” processor (P8800), with two independent processor cores on a single silicon chip, and 8 GB of 1066 MHz DDR3 SDRAM (PC3-8500). Our approach is considerably faster: for example, g(100,100) is computed in Mathematica® in 0.125165 seconds using the new formulas presented in this paper, whereas the Mathematica® command Coefficient needs 99.167659 seconds.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Each of the authors, HA, IA, JJN and AT, contributed equally to each part of this study, and read and approved the final version of the manuscript.
Acknowledgements
The authors are grateful to Prof. Marko Petkovšek for helpful comments. The work of I. Area has been partially supported by the Ministerio de Economía y Competitividad of Spain under grant MTM2012-38794-C02-01, co-financed by the European Community fund FEDER. J.J. Nieto also acknowledges partial financial support by the Ministerio de Economía y Competitividad of Spain under grant MTM2010-15314, co-financed by the European Community fund FEDER.
References

1. The European Bioinformatics Institute: Pairwise Sequence Alignment. http://www.ebi.ac.uk/Tools/psa/
2. Orobitg M, Lladós J, Guirado F, Cores F, Notredame C: Scalability and accuracy improvements of consistency-based multiple sequence alignment tools. In EuroMPI. Edited by Dongarra J, Blas JG, Carretero J. New York, USA: ACM International Conference Proceeding Series; 2013:259-264.
3. Orobitg M, Cores F, Guirado F, Roig C, Notredame C: Improving multiple sequence alignment biological accuracy through genetic algorithms. J Supercomput 2013, 65(3):1076-1088.
4. Montañola A, Roig C, Guirado F, Hernández P, Notredame C: Performance analysis of computational approaches to solve multiple sequence alignment. J Supercomput 2013, 64(1):69-78.
5. Zhong C, Zhang S: Efficient alignment of RNA secondary structures using sparse dynamic programming. BMC Bioinformatics 2013, 14:269.
6. Veeneman BA, Iyer MK, Chinnaiyan AM: Oculus: faster sequence alignment by streaming read compression. BMC Bioinformatics 2012, 13:297.
7. Chaisson M, Tesler G: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): theory and application. BMC Bioinformatics 2012, 13:238.
8. Löytynoja A: Alignment methods: Strategies, challenges, benchmarking, and comparative overview. In Evolutionary Genomics. Methods in Molecular Biology. Edited by Anisimova M. New York, USA: Humana Press; 2012:203-235.
9. The European Bioinformatics Institute: Pairwise Sequence Alignment (Nucleotide). http://www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html
10. Lesk AM: Introduction to Bioinformatics. Oxford, UK: Oxford University Press; 2002.
11. Andrade H: Análise matemática dalgunhos problemas no estudo de secuencias biolóxicas. PhD thesis, Universidade de Santiago de Compostela, Departamento de Análise Matemática (2013)
12. Bai F, Zhang J, Zheng J: Similarity analysis of DNA sequences based on the EMD method. Appl Math Lett 2011, 24(2):232-237.
13. Cabada A, Nieto JJ, Torres A: An exact formula for the number of alignments between two DNA sequences. DNA Sequence (continued as Mitochondrial DNA) 2003, 14:427-430.
14. Eger S: Sequence alignment with arbitrary steps and further generalizations, with applications to alignments in linguistics.
15. Morgenstern B: A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Appl Math Lett 2002, 15(1):11-16.
16. Zhang J, Wang R, Bai F, Zheng J: A quasi-MQ EMD method for similarity analysis of DNA sequences. Appl Math Lett 2011, 24(12):2052-2058.
17. Srivastava HM, Manocha HL: A Treatise on Generating Functions. Ellis Horwood Series: Mathematics and its Applications. Chichester: Ellis Horwood Ltd.; 1984.
18. Wilf HS: Generatingfunctionology. Wellesley, MA: A K Peters Ltd.; 2006.
19. Abramowitz M, Stegun IA: Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications Inc.; 1966.
20. Wolfram Research I: Mathematica, Version 9.01. Champaign, Illinois: Wolfram Research, Inc.; 2013.