School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore

The Key Laboratory of Embedded System and Service Computing, Ministry of Education; Tongji University, Shanghai 200092, China

Department of Mathematics, National University of Singapore, Singapore

Abstract

Background

The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach.

Results

In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as

Conclusions

We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.

Background

Single nucleotide polymorphisms (SNPs) are a common type of DNA sequence variation that occurs when a single nucleotide base is altered at a specific locus. They are among the most important genetic factors that contribute to human disease and biological functions. However, discovering novel SNPs is a scientifically challenging task. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry.

The SNP discovery approach based on base-specific cleavage and mass spectrometry usually adopts the data-acquisition procedure summarized below. First, a target sample DNA sequence is PCR-amplified using primers that incorporate the T7 promoter sequences. Then, the PCR products are transcribed in vitro and subsequently digested with the endonuclease RNase A in four base-specific cleavage reactions. Each reaction cleaves the sample sequence to completion at every locus where a specific base is found. Finally, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is applied to the cleavage products, resulting in four measured mass spectra, each corresponding to one base-specific cleavage reaction.
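The cleavage step can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the laboratory protocol or the authors' implementation: the function names `cleave`, `compomer`, and `spectra` are hypothetical, the cut base is dropped from each fragment, fragments are represented by their base counts rather than by masses, and identical compositions are merged as if they produced a single peak.

```python
from collections import Counter

BASES = "ACGT"

def cleave(sequence, cut_base):
    """Simulate complete cleavage of `sequence` at every occurrence of `cut_base`.

    Assumption: the cut base itself is discarded, so each fragment contains
    only the three remaining base types. Empty fragments are dropped.
    """
    return [frag for frag in sequence.split(cut_base) if frag]

def compomer(fragment):
    """Return the base composition of a fragment as counts of (A, C, G, T)."""
    counts = Counter(fragment)
    return tuple(counts.get(base, 0) for base in BASES)

def spectra(sequence):
    """Map each cut base to the sorted set of compomers it produces.

    Assumption: fragments with the same composition are indistinguishable,
    so duplicates are merged.
    """
    return {
        base: sorted({compomer(frag) for frag in cleave(sequence, base)})
        for base in BASES
    }

if __name__ == "__main__":
    for base, comps in spectra("ATGATAC").items():
        print(base, comps)
```

Under these assumptions, cleaving `ATGATAC` at every T yields the fragments `A`, `GA`, and `AC`. The four entries returned by `spectra` play the role of the four complementary spectra described above.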

Since each cleavage product is expected to be composed of the three non-cleavage bases, it is fairly straightforward to calculate its base composition from its measured mass signal. With all these base compositions in hand, the task of discovering SNPs in the sample sequence is left to a computational solution. In principle, this computational solution should integrate the four complementary base-specific mass spectra and then identify those SNPs that necessarily account for the unanticipated base compositions (i.e., corresponding to the measured mass signal changes as compared with an

**Schematic outline**. The SNP discovery approach using base-specific cleavage and mass spectrometry.

The early proof-of-concept studies on the above SNP discovery approach using base-specific cleavage and mass spectrometry were presented in

In this paper, we study two new combinatorial optimization problems to exploit the full potential of the above SNP discovery approach. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as

Methods

Preliminaries

Let ^{l}

Given a string

**Example 1 **

Problem formulation

Let _{H }

**Definition 2 **_{H }

**Definition 3 **_{H }

The only difference between the above two problem formulations is that one requires

**Example 4 **

The measured mass spectra of a sample sequence are rarely perfect in practice. Some peaks may actually represent noise, while some true signal peaks may be missing. The problem

We note that several computational problems in the literature are more or less related to the problems introduced above. In

Results

An exact dynamic programming algorithm for

In this subsection, we shall describe an exact dynamic programming algorithm for solving the

Let us start with some preliminary definitions and notations. For a string

**Example 5 **

**Example 6 **

**Examples.** Table columns: strings, I-compatible, L-compatible, R-compatible. Strings listed: ATGATAC, ATGCTAC, ACATGCT, TACATTA, CTACATTA.

This table shows whether each of the given strings is I-compatible, L-compatible, or R-compatible with

For each compomer _{i}C_{j }G_{k}T_{l }a non-cut base whose composition value is zero, then we have

Then, let

**Example 7 **

Given a string

**Lemma 8 **

Suppose we have an input instance

- all its substrings are I-compatible with

- it begins with a string from

- it ends with the given string

To compute

Then, let

**Example 9 ****∈ **_{A}.

To compute

Note that the minimization in the above is taken over all those strings ^{′ }

**Theorem 10 **

Let the string ^{′ }

On the other hand, let ^{′}

- If

- If

- If ^{′}

- If

In conclusion, for every internal cleavage fragment of ^{′}^{′}

Note that computing each entry

**Corollary 11 **

The NP-hardness of

This subsection is dedicated to proving that the

**Definition 12 (The general form of the 3-partition problem) **

The 3-partition problem is strongly NP-complete
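For readers less familiar with the problem, the following Python sketch checks a 3-partition instance by exhaustive search. It uses the standard textbook statement (split 3m integers into m triples, each summing to a bound B), which may differ in detail from Definition 12. The function name `three_partition` and its parameters are hypothetical. The search takes exponential time in the worst case, as expected for an NP-complete problem, so it is only suitable for tiny instances.

```python
def three_partition(values, target):
    """Return a list of triples that each sum to `target`, or None if impossible.

    Assumption: textbook 3-partition, i.e. `values` holds 3m integers that
    must be split into m disjoint triples of equal sum `target`.
    """
    if len(values) % 3 != 0 or sum(values) != target * (len(values) // 3):
        return None

    def solve(items):
        # Base case: nothing left to place, so the partition is complete.
        if not items:
            return []
        # The first (largest) remaining item must belong to some triple;
        # try every pair of partners for it.
        first, rest = items[0], items[1:]
        for i in range(len(rest)):
            for j in range(i + 1, len(rest)):
                if first + rest[i] + rest[j] == target:
                    leftover = [x for k, x in enumerate(rest) if k not in (i, j)]
                    result = solve(leftover)
                    if result is not None:
                        return [(first, rest[i], rest[j])] + result
        return None

    return solve(sorted(values, reverse=True))

if __name__ == "__main__":
    print(three_partition([20, 23, 25, 30, 49, 45, 27, 30, 30, 40, 22, 19], 90))
    print(three_partition([1, 1, 1, 2, 2, 5], 6))
```

The first call has a valid partition into four triples summing to 90. The second has none, because no pair of remaining values can complete the triple containing 5.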

**Definition 13 (The restricted variation of the 3-partition problem) **

There are two constraints imposed in the above restricted variation of the 3-partition problem. The first one limits _{i }

**Theorem 14 **

- Let Σ = {G, T}.

- Let ^{B+2}T)^{m}

- Let

First, we check whether this construction can be done in polynomial time in the size of the input instance of the 3-partition problem. Since the restricted variation of the 3-partition problem is strongly NP-complete, we may encode the integers in unary so that the size of the input instance is Θ(

Next, we show that every feasible solution _{H }_{i }are distinct, all such cleavage fragments shall be pairwise non-overlapping. Thus, the string _{T}. By construction, we also know that the string _{H }(

Now, we are going to show that there exists a valid partition for the input instance of the 3-partition problem if and only if there exists an optimal solution ^{′ }_{H }(

Suppose that

1.

2. **for **

3. **for **

4.

5. **end**

6. **end**

7.

As one can easily check, the resulting string ^{′}_{H }(

Conversely, suppose that the string _{H}(

1.

2.

3.

4. **for **

5. **if **

6.

7.

8. **if **

9.

10.

11. **end**

12.

13. **else**

14.

15. **end**

16. **end**

It follows from the earlier discussions that _{H }(_{T}(^{′}_{T}(

Extensions to edit distance

Naturally, we may extend our previous problem formulations to the edit distance (i.e., the Levenshtein distance). The resulting two new problems are formally defined as follows.

**Definition 15 (The **

**Definition 16 (The **
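To make the change of distance measure concrete, the sketch below computes both a substitution-only distance and the Levenshtein distance between two strings. It assumes that the subscript H in the earlier formulations denotes the Hamming distance, which the text does not state explicitly here. The function names `hamming_distance` and `edit_distance` are hypothetical, and the code illustrates the distance measures only, not the authors' algorithm for the extended problems.

```python
def hamming_distance(a, b):
    """Count mismatched positions; defined only for strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic programme.

    prev[j] holds the distance between the processed prefix of `a`
    and the first j characters of `b`.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(hamming_distance("ATGATAC", "ATGCTAC"))  # one substitution
    print(edit_distance("ATGATAC", "ATGTAC"))      # one deletion
```

A single substitution counts as one under both measures, whereas a single insertion or deletion changes the string length and can only be scored by the edit distance.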

These extensions make it possible to detect not only base substitutions but also base insertions and deletions. Hence, they would permit mutation discovery in DNA sequences (see

**Extensions to edit distance**. The analysis results for the problems

Conclusions

To exploit the full potential of the SNP discovery approach using base-specific cleavage and mass spectrometry, in this paper we have studied two new combinatorial optimization problems, called

Although we cannot change the inherent complexity of our proposed dynamic programming algorithm for the

Authors' contributions

XC conceived the study. All authors contributed to the problem analysis, and read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We would like to thank Yuguang Mu and Kai Tang for introducing us to the problem of SNP discovery using base-specific cleavage and mass spectrometry. X.C.'s research was supported by the Singapore National Medical Research Council grant (CBRG11nov091) and a College of Science Collaborative Research Award at NTU. Q.W.'s research was supported by the National Science Foundation for Young Scientists of China (61103066). L.Z.'s research was supported by the Singapore MOE AcRF Tier 2 grant (R-146-000-134-112).

This article has been published as part of