Department of Computer Science, University of California Los Angeles, Los Angeles, California 90095, USA

Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California 90095, USA

Abstract

Background

DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies, however, observe an encoded form of the sequence rather than each DNA base individually. The encoded DNA sequence may contain technical errors, so encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence.

Results

Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized

Conclusions

The novel generalized

Background

DNA sequence comparison is a well-studied problem in biology and bioinformatics

The central advantage of the two-base encoding scheme is that the false discovery rate of a single nucleotide polymorphism (SNP) is reduced, since two specific adjacent errors are required to produce a SNP call. In fact, only one-fourth of all adjacent errors would result in a false call. This significantly reduces the probability of falsely observing a SNP, with current machines exhibiting a color read error rate less than 5%. Nevertheless, the currently implemented two-base encoding is not the only possible encoding. Therefore a generalized

Results and Discussion

Simulations were performed to explore the power and performance of

Plotted in Figure

**Power of k-base encoding**. Power is calculated as the fraction of reads that align correctly. 10,000 simulated reads from the E. coli genome were generated.

To assess the power of

Power of

| **k** | **Power (0 SNPs)** | **Power (1 SNP)** | **Power (2 SNPs)** |
|---|---|---|---|
| 1 | 0.877 | 0.847 | 0.820 |
| 2 | 0.931 | 0.824 | 0.706 |
| 3 | 0.963 | 0.876 | 0.784 |
| 4 | 0.964 | 0.911 | 0.834 |
| 5 | 0.965 | 0.911 | 0.840 |

Power is calculated as the fraction of reads that align correctly. 10,000 simulated 50 bp reads from the E. coli genome were generated with an estimated real-world error rate.

The false positive SNP discovery rate is evaluated for 25, 50, and 75 base-pair reads (Figure

**False SNP discovery rate for k-base encoding**. The false positive SNP discovery rate is calculated as the fraction of reads that have a SNP call after alignment when no SNP call is expected. 10,000 simulated reads from the E. coli genome were generated.

**False negative SNP discovery rate for k-base encoding**. The false negative SNP discovery rate is calculated as the fraction of reads that do not call a SNP after alignment when a SNP call is expected. 10,000 simulated reads from the E. coli genome were generated.

To illustrate the flexibility of

**Flexibility of scoring systems for 5-base encoding**. Power of each scoring system evaluated for 5-base encoding. 1,000 simulated reads from the E. coli genome were generated.

The performance time of

Performance of

| **k** | **Time in s (0 SNPs)** | **Time in s (1 SNP)** | **Time in s (2 SNPs)** |
|---|---|---|---|
| 1 | 7 | 7 | 7 |
| 2 | 65 | 65 | 65 |
| 3 | 403 | 346 | 403 |
| 4 | 2178 | 2166 | 2178 |
| 5 | 23464 | 23460 | 23466 |

Performance time (in seconds) of

Nevertheless, this exponential increase in running time could be significantly reduced at the cost of completeness by using methods initially adopted for protein similarity search and sequence comparison

Conclusions

The generalized

Currently a two-base encoding system is used by ABI SOLiD sequencing technology. Some other next-generation sequencing technologies could also adopt an encoding system to improve their performance and accuracy. Furthermore, algorithms that perform multiple sequence alignment or local reassembly could also utilize the power of the encoding scheme presented here. It is interesting to note that error correction utilizing encoded DNA sequence could be performed if single bases or sets of bases were observed more than once. Utilizing various encoding schemes, this error correction would necessarily not rely on a target DNA reference comparison, thereby eliminating the expensive exponential increase in time for higher order encodings (larger

Methods

Generalized

Given an _{1}, ..., _{n}, it is the goal of the proposed algorithm to minimize the edit distance between c and some regular DNA sequence _{1}, ..., _{m }given a set of valid edit operators Σ. The DNA alphabet is assumed to be Λ = {_{1}, _{2}) and Δ(_{1}, _{2}) corresponding to the color substitution scoring and base substitution scoring functions respectively. To model insertions and deletions, affine gap penalties are used whereby a score of _{1 }≠ _{2 }and for any colors _{1 }≠ _{2 }that 0 ≤ Δ(_{1}, _{1}), Δ(_{1}, _{2}) < 0, 0 ≤ ∏(_{1}, _{1}), ∏(_{1}, _{2}) < 0,

To illustrate the encoding and decoding method used by this technology, let x = x_{1}, ..., x_{n} be a DNA sequence. To encode a DNA sequence, the function Φ^{k}(B_{1}, ..., B_{k}) is defined to return a color C_{k} using the bases B_{1}, ..., B_{k}, where B_{i-1} occurs before B_{i} in the sequence. For example, to encode the DNA sequence x_{1}, ..., x_{n}, first a known start adaptor p = p_{1}, ..., p_{k-1} ∈ Λ^{k-1} is assumed. Next, the function Φ^{k} is iteratively applied to the concatenation of p and x: c_{1} = Φ^{k}(p_{1}, ..., p_{k-1}, x_{1}), c_{2} = Φ^{k}(p_{2}, ..., p_{k-1}, x_{1}, x_{2}), ..., c_{n} = Φ^{k}(x_{n-k+1}, ..., x_{n}). The adaptor sequence p is known in practice and is used in the physical chemistry of the sequencer (for

The encoding function Φ^{k }(_{1}, ..., _{k}) transforms each base _{i }into an integer representation (^{k}(_{1}, ..., _{k}) =

To decode an encoded sequence, the function Γ^{k}(_{1}, ..., _{k-1}, _{k }using the encoded color _{1}, ..._{k-1}. To compute Γ^{k}(_{1}, ..., _{k-1}, _{1}, ..., _{n }with a known start adaptor _{1}, ..._{k-1 }∈ Λ^{k-1}, Γ^{k }is iteratively used. The decoded sequence will be _{1 }= Γ^{k}(_{1}, ..., _{k-1}, _{1}), _{2 }= Γ^{k}(_{1}, ..., _{k-2}, _{1}, _{2}), ..., _{n }= Γ^{k}(_{n-k+1}, _{n-1}, _{n}). Without the start adaptor ^{k-1 }possible decodings of the encoded sequence.
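The iterative encoding and decoding described above can be sketched in Python. The concrete definition of Φ^{k} is not legible in this section, so the sketch assumes one particular rule: each base is mapped to an integer (A = 0, C = 1, G = 2, T = 3) and a color is the XOR of the k integers in the current window. For k = 2 this assumed rule coincides with the two-base color code used by SOLiD. The names `encode`, `decode`, `BASE_TO_INT`, and `INT_TO_BASE` are illustrative only.

```python
# Assumed integer representation of the DNA alphabet (|Λ| = 4).
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}
INT_TO_BASE = "ACGT"


def encode(seq, adaptor):
    """Encode bases into colors, with k = len(adaptor) + 1.

    Assumption: a color is the XOR of the k base integers in the window.
    """
    window = [BASE_TO_INT[b] for b in adaptor]  # the k-1 preceding bases
    colors = []
    for base in seq:
        window.append(BASE_TO_INT[base])
        color = 0
        for value in window:
            color ^= value
        colors.append(color)
        window.pop(0)  # slide the window forward by one base
    return colors


def decode(colors, adaptor):
    """Invert encode(): recover each base from its color and the k-1 previous bases."""
    window = [BASE_TO_INT[b] for b in adaptor]
    bases = []
    for color in colors:
        value = color
        for previous in window:
            value ^= previous
        bases.append(INT_TO_BASE[value])
        window.append(value)
        window.pop(0)
    return "".join(bases)


colors = encode("ACGT", "T")   # two-base encoding with start adaptor "T"
print(colors)                  # [3, 1, 3, 1]
print(decode(colors, "T"))     # ACGT
```

Under this assumed rule, decoding with a different adaptor yields a different base sequence, which is why the start adaptor must be known to obtain a unique decoding.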

This encoding function has two useful properties. First, if one base in x is changed to obtain a new DNA sequence _{i }≠ _{i}: _{i}, ..., _{i+k-1}. Let the _{i}. The following constraint is made to prefer a base change and

In this case, it is assumed that _{j }≠ _{i }≠

The Algorithm

Suppose that a color sequence _{1}, ..., _{n }with a known adaptor ^{k-1 }is to be aligned to a reference sequence _{1}, ..., _{m}. To search over all possible base substitutions, base insertions, base deletions, and color substitutions, a recursive formula is defined that is evaluated repeatedly in the dynamic programming algorithm.

Intuitively, Equation 2 is filling in an ^{k-1 }sub-cells. It is interesting to observe for _{i }aligning to a base _{i }in the reference sequence

An alignment that begins or ends with a deletion is ignored, since a sequence must span the break-point for the deletion to be observable (with respect to

If the color match scores are the same (∀_{i}, _{i}) = ∏(_{j}, _{j})) and all color mismatch scores are the same (∀_{i}, _{j}) = ∏(_{k}, _{l}), then Equation 2 can be simplified. The recursive rule for the

This modification forces any color substitution to be at the beginning or end of any inserted bases in

Various initializations are possible, and the alignment of the entire encoded DNA sequence ^{k}(_{i}) and ^{k}(_{i}), so that the local alignment spans the entire encoded sequence and insertions are allowed at the beginning of any alignment. Notice that if there were any color errors within the beginning of an insertion, they are aligned such that they occur at the end of the insertion. ^{k-1}, and ^{k-1}. These initializations enforce that the starting adaptor is

This algorithm is in fact finding the shortest path on a graph with the nodes defined by the sub-cells of the matrix, and the edges weighted and defined by the recursive rules. To analyze the time complexity, it is observed that given the ^{k-1 }sub-cells. For each ^{k-1 }+ |Λ|^{k-1 }× 3 × |Λ| + |Λ|^{k-1 }× 3 × |Λ| = 2 × |Λ|^{k-1}(1 + 3 × |Λ|). In practice, |Λ| = 4 and therefore various maxima must be computed over 26 × 4^{k-1 }values. From this analysis, it is clear that the running time of this algorithm is ^{k}), which unfortunately scales exponentially with respect to the length of the encoding
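The per-cell count stated above, 2 × |Λ|^{k-1}(1 + 3 × |Λ|), can be evaluated directly to see how quickly the work grows with k. The following is only a small arithmetic sketch; the function name `values_per_cell` is hypothetical.

```python
def values_per_cell(k, alphabet_size=4):
    """Number of values over which maxima are taken per matrix cell,
    using the expression 2 * |Λ|^(k-1) * (1 + 3 * |Λ|) from the text."""
    return 2 * alphabet_size ** (k - 1) * (1 + 3 * alphabet_size)


for k in range(1, 6):
    # With |Λ| = 4 this equals 26 * 4^(k-1): 26, 104, 416, 1664, 6656.
    print(k, values_per_cell(k))
```

Each increment of k multiplies the per-cell work by |Λ|, which is the exponential dependence on k noted in the analysis.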

Simulations

Simulations were performed to assess the power and performance of

To allow for insertions and deletions, the original sequence (before applying errors and variants), with an additional 10 bp before and after, is used as the reference or target DNA sequence. In accordance with Equation 1, ϵ = -50, ∏(C_{1}, C_{2}) = -125 (C_{1} ≠ C_{2}), ∏(C_{1}, C_{1}) = 0, Δ(B_{1}, B_{2}) = -150 (B_{1} ≠ B_{2}), and Δ(B_{1}, B_{1}) = 50. With these values, the optimization in Equation 3 can be applied. To model real-world error rates, the simulated error rates were learned from a run of an ABI SOLiD sequencer (50 base pairs), using the aligned reads to calculate the 2-base encoding error, which is inherently dependent on the decoding algorithm used (Homer et al., 2009). The error rate was not uniform across sequencing positions, so a separate color error rate was estimated for each position in the 50-color reads. The observed error rate for each sequencing position was: 0.014, 0.005, 0.006, 0.007, 0.006, 0.006, 0.006, 0.008, 0.008, 0.008, 0.007, 0.006, 0.009, 0.009, 0.009, 0.009, 0.008, 0.015, 0.015, 0.012, 0.012, 0.011, 0.021, 0.021, 0.018, 0.019, 0.014, 0.037, 0.033, 0.031, 0.029, 0.022, 0.055, 0.052, 0.051, 0.043, 0.036, 0.087, 0.084, 0.076, 0.071, 0.060, 0.125, 0.118, 0.118, 0.108, 0.092, 0.179, 0.175, 0.184. To evaluate various scoring schemes for 5-base encoding, only 1,000 simulated test sequences were used due to running time limitations. For these evaluations a dual quad-core Intel Xeon E5420 machine at 2.5 GHz, with 32 GB of RAM and 2 TB of RAID 0 disk space, was used, although the hardware requirements of the algorithm beyond CPU power are negligible for any modern computer.
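As an illustration of how the per-position error rates listed above could drive a simulation, the sketch below corrupts a 50-color read position by position. It is not the authors' simulation code. It assumes four colors and that an erroneous color is drawn uniformly from the three other colors; the name `add_color_errors` is hypothetical.

```python
import random

# Observed color error rate per sequencing position (50 positions), as listed in the text.
COLOR_ERROR_RATES = [
    0.014, 0.005, 0.006, 0.007, 0.006, 0.006, 0.006, 0.008, 0.008, 0.008,
    0.007, 0.006, 0.009, 0.009, 0.009, 0.009, 0.008, 0.015, 0.015, 0.012,
    0.012, 0.011, 0.021, 0.021, 0.018, 0.019, 0.014, 0.037, 0.033, 0.031,
    0.029, 0.022, 0.055, 0.052, 0.051, 0.043, 0.036, 0.087, 0.084, 0.076,
    0.071, 0.060, 0.125, 0.118, 0.118, 0.108, 0.092, 0.179, 0.175, 0.184,
]


def add_color_errors(colors, rates, rng, num_colors=4):
    """Return a copy of `colors` in which position i is corrupted with probability rates[i]."""
    noisy = []
    for color, rate in zip(colors, rates):
        if rng.random() < rate:
            # Assumption: the wrong color is chosen uniformly among the other colors.
            color = rng.choice([c for c in range(num_colors) if c != color])
        noisy.append(color)
    return noisy


rng = random.Random(0)  # fixed seed so the sketch is reproducible
read = [rng.randrange(4) for _ in range(50)]
noisy_read = add_color_errors(read, COLOR_ERROR_RATES, rng)
print(sum(a != b for a, b in zip(read, noisy_read)), "color errors introduced")
```

Because the rates rise toward the end of the read, errors introduced this way concentrate in the last positions.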

Scoring constraints for 5-base encoding

Various scoring schemes were evaluated for 5-base encoding. For notational convenience, for all colors _{1 }≠ _{2 }and bases _{1 }≠ _{2 }let _{1}, _{2}), _{1}, _{1}), _{1}, _{2}), and _{1}, _{1}). Consider the scoring scenarios that satisfy one of the following constraints:

1. 5

2. 5

3. 4

4. 3

5. 2

6.

Intuitively, these scenarios try to decide if a given set of color errors should be preferred if they can be explained by a SNP and possibly other color errors. For example, the first scenario always prefers calling color errors over anything that can be explained by a SNP. The second scenario will prefer to explain the encoding with a SNP if it results in no color errors, but does not prefer to explain the encoding with a SNP if it is accompanied by any color errors. In the extreme, the last scenario would prefer to explain all color errors as a combination of a SNP and possibly color errors.

Nevertheless, given the assumptions that

For each of the above constraints, a color error score is given that satisfies the constraint under the previously defined base match, base substitution, and color match scores. The previously used score of -150 is also included to illustrate that there is flexibility even within these constraints to tune the scoring scheme.
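The trade-off behind these scenarios can be made concrete with a small calculation. The sketch assumes that an alignment score is simply the sum of per-position base scores and color scores, and it uses the values read from the Simulations section (base match 50, base substitution -150, color match 0, color error -125). It compares two explanations of the same k changed colors in a read of n bases with no gaps: one SNP with no color errors, or k color errors with no SNP. The function names are hypothetical.

```python
def snp_explanation_score(n, base_match=50, base_mismatch=-150):
    """Score when the k changed colors are explained by one SNP and no color errors."""
    # Color match score is 0 in the text, so colors contribute nothing here.
    return (n - 1) * base_match + base_mismatch


def color_error_explanation_score(n, k, base_match=50, color_mismatch=-125):
    """Score when the same k changed colors are explained by k color errors and no SNP."""
    return n * base_match + k * color_mismatch


n = 50
for k in range(2, 6):
    snp = snp_explanation_score(n)
    errors = color_error_explanation_score(n, k)
    print(k, snp, errors, "SNP preferred" if snp > errors else "color errors preferred")
```

Under these assumptions the SNP explanation wins whenever k × (color error score) is less than (base substitution score minus base match score), so making the color error score less negative shifts the preference toward color errors.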

Authors' contributions

BM and NH conceived of

Acknowledgements

This research was partially supported by the University of California Systemwide Biotechnology Research and Education Program GREAT Training Grant 2007-10 (to NH), the NIH Neuroscience Microarray Consortium (U24NS052108), and a grant from the NIMH (R01 MH071852). We would also like to thank members of the Nelson Lab (Zugen Chen, Hane Lee, Bret Harry, Jordan Mendler, and Brian O'Connor) for input and computational infrastructure support. Finally, we would like to thank the anonymous reviewers for their very insightful and helpful comments and suggestions.