This article is part of the supplement: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012)

Open Access Proceedings

Two combinatorial optimization problems for SNP discovery using base-specific cleavage and mass spectrometry

Xin Chen1*, Qiong Wu1,2, Ruimin Sun1 and Louxin Zhang3

Author Affiliations

1 School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore

2 The Key Laboratory of Embedded System and Service Computing, Ministry of Education; Tongji University, Shanghai 200092, China

3 Department of Mathematics, National University of Singapore, Singapore

BMC Systems Biology 2012, 6(Suppl 2):S5  doi:10.1186/1752-0509-6-S2-S5

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1752-0509/6/S2/S5


Published: 12 December 2012

© 2012 Chen et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach.

Results

In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a>, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a>, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem and also show that the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem is NP-hard by a reduction from a restricted variation of the 3-partition problem.

Conclusions

We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.

Background

Single nucleotide polymorphisms (SNPs) are a common type of DNA sequence variation that occurs when a single nucleotide base is altered at a specific locus. They are among the most important genetic factors that contribute to human disease and biological functions. However, discovering novel SNPs is a scientifically challenging task. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry [1-3].

The SNP discovery approach based on base-specific cleavage and mass spectrometry usually adopts the data-acquisition procedure summarized below. First, a target sample DNA sequence is PCR-amplified using primers that incorporate the T7 promoter sequences. Then, the PCR products are transcribed in vitro and subsequently digested with the endonuclease RNase A in four base-specific cleavage reactions. Each reaction cleaves the sample sequence to completion at every locus where a specific base is found. Finally, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is applied to the cleavage products, resulting in four measured mass spectra, each corresponding to one base-specific cleavage reaction.

Since each cleavage product is expected to consist only of the three non-cleavage bases, it is fairly straightforward to calculate its base composition from its measured mass signal. With all these base compositions in hand, the task of discovering SNPs in the sample sequence is left to a computational solution. In principle, this computational solution should integrate the four complementary base-specific mass spectra, and then identify those SNPs that necessarily account for the unanticipated base compositions (i.e., those corresponding to measured mass signals that differ from the in silico predicted mass spectra of a reference sequence). See Figure 1 for a schematic outline of the SNP discovery approach using base-specific cleavage and mass spectrometry.

Figure 1. Schematic outline. The SNP discovery approach using base-specific cleavage and mass spectrometry.

The early proof-of-concept studies on the above SNP discovery approach were presented in [3-5], where, however, the identification of SNPs was done by visual inspection. Shortly afterwards, two automated computational solutions were developed [1,2]: one was implemented in the proprietary MassARRAY™ SNP Discovery software package from Sequenom, Inc., and the other in the software package RNaseCut, which is freely available online [6]. In particular, the solution in [1] mainly comprises two separate procedures. It first computes all potential SNPs that could give rise to each unanticipated base composition and then scores them by taking into account the mass spectrometry data from the four base-specific cleavage reactions. Thus, the four base-specific cleavage reactions are integrated only in the second step. Such an integration strategy is far from optimal because, at the very least, it assumes in the first step that the occurrences of potential SNPs are independent.

In this paper, we study two new combinatorial optimization problems to exploit the full potential of the above SNP discovery approach. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a>, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a>, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. Then, we present an exact dynamic programming algorithm for solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem and also show that the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem is NP-hard by a reduction from the restricted variation of the 3-partition problem [7,8].

Methods

Preliminaries

Let s ∈ Σ* denote a string over the four-base alphabet <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M3">View MathML</a>. The length of s is denoted by |s|, the i-th base of s by s[i], and the substring of s from the i-th base to the j-th base by s[i, j], for 1 ≤ i ≤ j ≤ |s|. We use ε to denote the empty string, so that |ε| = 0. The concatenation of two strings s and t is denoted by s · t, and the concatenation of l copies of a string s is denoted by s^l.

Given a string s and a cut base <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M4">View MathML</a>, a cleavage fragment refers to a substring of s that does not contain x and that cannot be extended on either side without crossing a base x. Formally, the substring s[i, j] is a cleavage fragment with respect to the cut base x if the following three conditions are satisfied: (i) s[i − 1] = x if i ≠ 1, (ii) s[j + 1] = x if j ≠ |s|, and (iii) s[k] ≠ x, ∀k ∈ [i, j]. In addition, the empty string ε is a cleavage fragment if there exists i ∈ [1,|s| − 1] such that s[i] = s[i + 1] = x. Given a cleavage fragment, we use <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M5">View MathML</a> to denote its base composition of i As, j Cs, k Gs, and l Ts. In [1], this base composition is termed a compomer of the string s with respect to the cut base x. The whole set of compomers is hence called the compomer spectrum of the string s with respect to the cut base x. Finally, let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M6">View MathML</a> be the collection of the four compomer spectra of the string s, each generated with one cut base.

Example 1 Let s := ACATGCTACATTA. Then, the string s contains four cleavage fragments with respect to the cut base A: C, TGCT, C, and TT. With respect to the cut base T, it instead contains five cleavage fragments: ACA, GC, ACA, ε, and A. Their respective compomer spectra are <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M7">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M8">View MathML</a>. Note that each compomer appears in a compomer spectrum at most once.
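The definitions above can be illustrated with a short Python sketch that recomputes Example 1. The function names and the representation of a compomer as a tuple of counts (#A, #C, #G, #T) are assumptions made for this illustration only; they are not notation from the paper.

```python
from collections import Counter

ALPHABET = "ACGT"

def cleavage_fragments(s, x):
    """List the cleavage fragments of s with respect to cut base x, left to right."""
    if not s:
        return []
    pieces = s.split(x)
    # split() also yields an empty piece before a leading x and after a
    # trailing x; the definition only admits the empty fragment between two
    # adjacent cut bases, so those two boundary pieces are dropped.
    if s.startswith(x):
        pieces = pieces[1:]
    if s.endswith(x):
        pieces = pieces[:-1]
    return pieces

def compomer(fragment):
    """Base composition (#A, #C, #G, #T) of a fragment."""
    counts = Counter(fragment)
    return tuple(counts[b] for b in ALPHABET)

def compomer_spectrum(s, x):
    """Set of compomers of s for cut base x (each compomer is kept once)."""
    return {compomer(f) for f in cleavage_fragments(s, x)}

s = "ACATGCTACATTA"
print(cleavage_fragments(s, "A"))   # ['C', 'TGCT', 'C', 'TT']
print(cleavage_fragments(s, "T"))   # ['ACA', 'GC', 'ACA', '', 'A']
print(sorted(compomer_spectrum(s, "A")))
```

Because the spectrum is a set, the two fragments C for the cut base A contribute a single compomer, which matches the remark that each compomer appears at most once.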

Problem formulation

Let dH <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9">View MathML</a> denote the Hamming distance between two strings s and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> of equal length. It measures the minimum number of substitutions required to transform one string into the other. Given a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M10">View MathML</a> of an unknown string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> (i.e., the sample DNA sequence experimented) which can in principle be generated from a mass spectrometry experiment, and a string s (i.e., the reference DNA sequence) which is believed to differ from the unknown string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> by a number of substitutions only, we formulate below two combinatorial optimization problems for SNP discovery.

Definition 2 <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M12">View MathML</a>Given a string s and a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M13">View MathML</a>, find a string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a>such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M14">View MathML</a>, for all <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a> and dH <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9">View MathML</a>is minimized.

Definition 3 <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M16">View MathML</a>Given a string s and a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M13">View MathML</a>, find a string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a>such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M17">View MathML</a>, for all <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a>and dH <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M9">View MathML</a>is minimized.

The only difference between the above two problem formulations is that one requires <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M14">View MathML</a> and the other requires <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M17">View MathML</a>, for all the cut bases. Once the string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> is found, it is easy to identify the SNPs in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a>, i.e., those base substitutions that transform <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> into s.

Example 4 In this example, we let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M18">View MathML</a>for simplicity. Given the string s:= ATAAT and the set <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M19">View MathML</a>of compomer spectra (of an unknown string) where

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M20">View MathML</a>

The feasible solutions to the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a>problem for the above instance include the strings such as ATATA, TATAT, TTATT, ATATT, and ATTAT. Their respective Hamming distances to the input string s are 2, 3, 2, 1, and 1. The string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> = TTAAT is not a feasible solution because the compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M21">View MathML</a> but <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M22">View MathML</a> so that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M23">View MathML</a>.

The feasible solutions to the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M24">View MathML</a> problem for the above instance include the strings such as TTATA, TATTA, ATATT, and ATTAT. Their respective Hamming distances to the input string s are 3, 5, 1, and 1. The string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> = TTAAT is not a feasible solution because the compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M25">View MathML</a> but <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M26">View MathML</a> so that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M27">View MathML</a>.
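To make the two formulations concrete, the following Python sketch solves both by exhaustive enumeration. It is only a brute-force illustration, exponential in |s|, and is not the dynamic programming algorithm of this paper. The measured spectra are generated from a hypothetical sample string ATTAT over the two-letter alphabet {A, T}; this instance, the mode names "subset" and "superset", and all function names are assumptions of the sketch, not the instance of Example 4.

```python
from collections import Counter
from itertools import product

def spectrum(s, x, alphabet):
    """Compomer spectrum of s for cut base x, as a set of count tuples."""
    pieces = s.split(x)
    if s.startswith(x):   # no fragment precedes a leading cut base
        pieces = pieces[1:]
    if s.endswith(x):     # no fragment follows a trailing cut base
        pieces = pieces[:-1]
    return {tuple(Counter(p)[b] for b in alphabet) for p in pieces}

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def is_feasible(t, measured, alphabet, mode):
    """mode 'subset': predicted within measured; 'superset': predicted covers measured."""
    for x in alphabet:
        predicted = spectrum(t, x, alphabet)
        if mode == "subset" and not predicted <= measured[x]:
            return False
        if mode == "superset" and not predicted >= measured[x]:
            return False
    return True

def brute_force(s, measured, alphabet, mode):
    """Return (distance, string) of a feasible string closest to s, or None."""
    best = None
    for cand in product(alphabet, repeat=len(s)):
        t = "".join(cand)
        if is_feasible(t, measured, alphabet, mode):
            d = hamming(s, t)
            if best is None or d < best[0]:
                best = (d, t)
    return best

alphabet = "AT"
reference = "ATAAT"
hidden_sample = "ATTAT"   # hypothetical sample, used only to build the input
measured = {x: spectrum(hidden_sample, x, alphabet) for x in alphabet}
print(brute_force(reference, measured, alphabet, "subset"))
print(brute_force(reference, measured, alphabet, "superset"))
```

On this toy instance the reference itself is infeasible under both formulations, and more than one string attains the minimum distance of 1, so an optimal solution need not be unique.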

The measured mass spectra of a sample sequence are rarely perfect in practice. Some peaks may actually represent noise, while some true signal peaks may be missing. The problem <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> is formulated so that its computational solution is robust against noisy peaks but susceptible to missing peaks (i.e., there is a good chance of recovering the sample sequence even if some noisy peaks are present in the measured mass spectra, but the chance becomes much smaller if some true signal peaks are missing). In contrast, the problem <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> is formulated so that its computational solution is robust against missing peaks but susceptible to noisy peaks.
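The asymmetry between the two formulations can be checked on a toy instance. The Python sketch below builds exact spectra from a hypothetical sample string, then adds one spurious peak or removes one true peak, and tests whether the true sample stays feasible under each containment direction. The sample string, the chosen peaks, and the function names are assumptions of this illustration.

```python
from collections import Counter

ALPHABET = "AT"  # two-letter alphabet, for brevity only

def spectrum(s, x):
    """Compomer spectrum of s for cut base x, as a set of (#A, #T) tuples."""
    pieces = s.split(x)
    if s.startswith(x):
        pieces = pieces[1:]
    if s.endswith(x):
        pieces = pieces[:-1]
    return {tuple(Counter(p)[b] for b in ALPHABET) for p in pieces}

def contained_in_measured(t, measured):
    """First formulation: every predicted compomer of t is a measured one."""
    return all(spectrum(t, x) <= measured[x] for x in ALPHABET)

def covers_measured(t, measured):
    """Second formulation: every measured compomer is predicted by t."""
    return all(spectrum(t, x) >= measured[x] for x in ALPHABET)

sample = "ATTAT"  # hypothetical true sample sequence
exact = {x: spectrum(sample, x) for x in ALPHABET}

# One spurious peak (composition A3) in the spectrum for cut base T.
noisy = {x: set(peaks) for x, peaks in exact.items()}
noisy["T"].add((3, 0))

# One true peak (composition T2) lost from the spectrum for cut base A.
missing = {x: set(peaks) for x, peaks in exact.items()}
missing["A"].discard((0, 2))

print(contained_in_measured(sample, noisy), covers_measured(sample, noisy))
print(contained_in_measured(sample, missing), covers_measured(sample, missing))
```

With the noisy spectra the true sample remains feasible only under the first containment direction, and with the incomplete spectra only under the second, in line with the discussion above.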

We note that several computational problems in the literature are more or less related to the problems introduced above. In [9], a so-called sequencing from compomers problem was studied which, like the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem, also aims to reconstruct the sample sequence from a given collection of compomer spectra, but without the help of a reference sequence. In [10], the spectral alignment problem differs from the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem mainly in that it explores short-read sequencing data rather than mass/compomer spectra data, which may have wide implications for the subsequent algorithm design and complexity analysis. Moreover, in [1], a so-called SNP discovery from mass spectrometry problem was defined in a similar way to the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem. However, it has only a single compomer as input, as opposed to the collection of four complementary compomer spectra used in the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem.

Results

An exact dynamic programming algorithm for <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a>

In this subsection, we describe an exact dynamic programming algorithm for solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem. Without loss of generality, we may assume in the remainder of this section that every base of Σ occurs in the optimal solution to a given instance of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem. Consequently, only those feasible solutions that contain all the bases of Σ need to be considered when we search for the optimal solution. In case some base x does not occur in the optimal solution <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a>, note that it becomes relatively easy to find <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> since we would have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M28">View MathML</a> and |s'| = |s|. See below for definitions of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M29">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M30">View MathML</a>.

Let us start with some preliminary definitions and notations. For a string s, a cleavage fragment s[i, j] is called internal if neither i = 1 nor j = |s|, left-ended if i = 1, or right-ended if j = |s|. In addition, an empty cleavage fragment ε is always considered internal. Given a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>, we call a string I-compatible if the compomers of its internal cleavage fragments are all contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a> (under the respective cut base). A string is called L-compatible (resp. R-compatible) if it is I-compatible and if the compomers of its left-ended (resp. right-ended) cleavage fragments are all contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a> as well.

Example 5 Consider the string s given in Example 1. The four cleavage fragments of s with respect to the cut base A are all internal. Among the five cleavage fragments of s with respect to the base T, the first cleavage fragment ACA is left-ended, the last cleavage fragment A is right-ended, and the other three cleavage fragments in the middle are all internal.
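The classification in Example 5 can be reproduced with a small Python sketch. The function name and the representation of each fragment as a triple (fragment, left_ended, right_ended) are assumptions of this illustration; a fragment is internal when both flags are false, and a fragment spanning the whole string has both flags set.

```python
def classified_fragments(s, x):
    """Return (fragment, left_ended, right_ended) for each cleavage fragment
    of s with respect to cut base x, from left to right."""
    out = []
    start = 0  # 0-based index at which the current fragment starts
    # A sentinel cut base appended to s closes the final fragment.
    for pos, base in enumerate(s + x):
        if base == x:
            frag = s[start:pos]
            left = start == 0
            right = pos == len(s)
            # An empty piece at either end of s is not a fragment: it only
            # means that s starts or ends with the cut base.
            if frag or not (left or right):
                out.append((frag, left, right))
            start = pos + 1
    return out

s = "ACATGCTACATTA"
for x in "AT":
    for frag, left, right in classified_fragments(s, x):
        kind = "internal" if not (left or right) else \
               "+".join(k for k, on in (("left-ended", left), ("right-ended", right)) if on)
        print(x, repr(frag), kind)
```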

Example 6 Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M32">View MathML</a>be a collection of compomer spectra where

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M33">View MathML</a>

We show in Table 1 whether each of the given strings is I-compatible, L-compatible, or R-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>.

Table 1. Examples.
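The three compatibility notions can be sketched in Python as follows. Since the spectra of Example 6 are not reproduced in this text, the sketch uses the spectra of the Example 1 string ACATGCTACATTA as a stand-in input; that choice, the function names, and the tuple encoding (#A, #C, #G, #T) of a compomer are assumptions of this illustration.

```python
from collections import Counter

ALPHABET = "ACGT"

def classified_fragments(s, x):
    """(fragment, left_ended, right_ended) for each cleavage fragment of s."""
    out, start = [], 0
    for pos, base in enumerate(s + x):      # sentinel closes the last fragment
        if base == x:
            frag, left, right = s[start:pos], start == 0, pos == len(s)
            if frag or not (left or right):  # drop empty pieces at the ends
                out.append((frag, left, right))
            start = pos + 1
    return out

def compomer(fragment):
    counts = Counter(fragment)
    return tuple(counts[b] for b in ALPHABET)

def spectra_of(s):
    """Collection of the four compomer spectra of s, keyed by cut base."""
    return {x: {compomer(f) for f, _, _ in classified_fragments(s, x)}
            for x in ALPHABET}

def _all_in(s, spectra, keep):
    """True if every fragment selected by keep(left, right) has its compomer
    in the spectrum of the respective cut base."""
    return all(compomer(f) in spectra[x]
               for x in ALPHABET
               for f, left, right in classified_fragments(s, x)
               if keep(left, right))

def is_i_compatible(s, spectra):
    return _all_in(s, spectra, lambda left, right: not left and not right)

def is_l_compatible(s, spectra):
    return is_i_compatible(s, spectra) and _all_in(s, spectra, lambda left, right: left)

def is_r_compatible(s, spectra):
    return is_i_compatible(s, spectra) and _all_in(s, spectra, lambda left, right: right)

spectra = spectra_of("ACATGCTACATTA")   # stand-in for a measured collection
for t in ("ACAT", "CTGT", "ACATGCTACATTA"):
    print(t, is_i_compatible(t, spectra), is_l_compatible(t, spectra),
          is_r_compatible(t, spectra))
```

For instance, ACAT is I-compatible and L-compatible with these spectra but not R-compatible, because its right-ended fragment T under the cut base A has no matching compomer.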

For each compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M34">View MathML</a> in a given collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>, we use <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M35">View MathML</a> to denote the set of strings that (i) consist of i As, j Cs, k Gs, and l Ts, (ii) contain exactly three distinct bases (i.e., three bases in the set Σ \ {x}), and (iii) are I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>. It is easy to check that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M36">View MathML</a>. In particular, if there exists in A_iC_jG_kT_l a non-cut base whose composition value is zero, then we have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M37">View MathML</a> so that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M38">View MathML</a>. Furthermore, we may define the following set

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M39">View MathML</a>

Then, let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M40">View MathML</a>. Analogously, we may define <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M41">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M42">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M43">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M44">View MathML</a>for the L-compatible strings and the R-compatible strings, respectively. Clearly, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M45">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M46">View MathML</a>, for all x ∈ Σ.

Example 7 Consider the collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>given in Example 6. For the compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M47">View MathML</a>, we have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M48">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M49">View MathML</a>. For the compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M50">View MathML</a>, we have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M51">View MathML</a>.
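As a rough illustration of the string set attached to a compomer, the Python sketch below enumerates all arrangements of a given base composition and keeps those that satisfy conditions (i) to (iii) stated earlier. It enumerates permutations naively, which is only practical for short fragments. The input spectra are again taken from the Example 1 string as a stand-in, because the spectra of Example 6 are not reproduced in this text; the function names and the tuple encoding (#A, #C, #G, #T) are assumptions of the sketch.

```python
from collections import Counter
from itertools import permutations

ALPHABET = "ACGT"

def internal_fragments(s, x):
    """Internal cleavage fragments of s for cut base x (the empty fragment
    between two adjacent cut bases is included)."""
    pieces = s.split(x)
    # The first and last pieces touch an end of s (or are empty boundary
    # pieces), so only the pieces strictly in between are internal.
    return pieces[1:-1]

def compomer(fragment):
    counts = Counter(fragment)
    return tuple(counts[b] for b in ALPHABET)

def spectra_of(s):
    """Four compomer spectra of s, keyed by cut base."""
    spectra = {}
    for x in ALPHABET:
        pieces = s.split(x)
        if s.startswith(x):
            pieces = pieces[1:]
        if s.endswith(x):
            pieces = pieces[:-1]
        spectra[x] = {compomer(p) for p in pieces}
    return spectra

def is_i_compatible(s, spectra):
    return all(compomer(f) in spectra[x]
               for x in ALPHABET for f in internal_fragments(s, x))

def candidate_strings(comp, x, spectra):
    """Strings with composition comp that use all three non-cut bases, do not
    contain x, and are I-compatible with spectra."""
    counts = dict(zip(ALPHABET, comp))
    if counts[x] != 0 or any(counts[b] == 0 for b in ALPHABET if b != x):
        return set()
    letters = "".join(b * counts[b] for b in ALPHABET)
    arrangements = {"".join(p) for p in permutations(letters)}
    return {t for t in arrangements if is_i_compatible(t, spectra)}

spectra = spectra_of("ACATGCTACATTA")          # stand-in input collection
print(sorted(candidate_strings((0, 1, 1, 2), "A", spectra)))
```

Here the compomer with one C, one G, and two Ts under the cut base A yields only some of its twelve arrangements: TGCT is kept, whereas CTGT is rejected because its internal fragment G under the cut base T has no matching compomer.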

Given a string t which could be a potential cleavage fragment with respect to the cut base x (i.e., the string t does not contain any base x), we say that a string s begins with the string t if t · x is a prefix of s · x, and that a string s ends with the string t if x · t is a suffix of x · s. The following lemma is useful for designing a dynamic programming algorithm for solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem. Its easy proof is omitted. Recall that our discussion in this section is limited to the feasible solutions containing all the bases of Σ.

Lemma 8 A string s' of length |s| is a feasible solution to the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem if and only if

- all the substrings of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> are I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>,

- <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> begins with a string in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M52">View MathML</a> for some <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a>, and

- <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M11">View MathML</a> ends with a string in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M53">View MathML</a> for some <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a>.

Suppose we have an input instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M54">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem. Given a string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M55">View MathML</a> where <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a>, we define <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M56">View MathML</a> to be the minimum Hamming distance between the prefix of s of length i and any string satisfying the following three conditions:

- all its substrings are I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>,

- it begins with a string from <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M57">View MathML</a> for some y ∈ Σ, and

- it ends with the given string t.

To compute <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M58">View MathML</a>, we first find in the string x · t the rightmost position k at which the base (x · t)[k] occurs for the first time in x · t. Formally, we may write

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M59">View MathML</a>

Then, let x':= (x · t)[k], p := (x · t)[1, k - 1], and q := (x · t)[k,| x · t|]. Note that x' ≠ x and the string p contains all the bases of Σ except x'.

Example 9 Let t := CGTT and let the cut base be x = A. Then, x · t = ACGTT, k = 4, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M60">View MathML</a>, p = ACG, and q = TT.
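The decomposition of x · t into x', p and q can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_at_last_new_base` is hypothetical and not part of the original algorithm description, and positions are reported 1-based to match the text.

```python
def split_at_last_new_base(x: str, t: str):
    """Return (k, x_prime, p, q) for the string x·t, with a 1-based k.

    k is the rightmost position whose base has not occurred earlier in x·t.
    """
    xt = x + t
    seen = set()
    k = 0
    for pos, base in enumerate(xt, start=1):
        if base not in seen:
            seen.add(base)
            k = pos  # rightmost first occurrence found so far
    x_prime = xt[k - 1]  # x' := (x·t)[k]
    p = xt[:k - 1]       # p  := (x·t)[1, k-1]
    q = xt[k - 1:]       # q  := (x·t)[k, |x·t|]
    return k, x_prime, p, q


# Example 9: x = A, t = CGTT
print(split_at_last_new_base("A", "CGTT"))  # (4, 'T', 'ACG', 'TT')
```

Since x · t = p · q, the lengths always satisfy |p| + |q| = |t| + 1.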

To compute <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M61">View MathML</a>, we now use the following recurrence relation

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M62">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M63">View MathML</a>

Note that the minimization in the above is taken over all those strings <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M64">View MathML</a> in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M65">View MathML</a> which have p as the suffix. If there is no such string in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M65">View MathML</a>, then we let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M66">View MathML</a>. The initial conditions for the recurrence relation are given as follows:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M67">View MathML</a>

Theorem 10 Let s' be the string that leads to

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M68">View MathML</a>

then s' would be an optimal solution to the input instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M69">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M70">View MathML</a> problem.

Proof: For the correctness of the above dynamic programming algorithm, we need to show that (i) every feasible solution of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem would be essentially evaluated by the dynamic programming algorithm, and (ii) every string evaluated by the dynamic programming algorithm must be a feasible solution of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem.

Let the string s' be a feasible solution. Consider a cleavage fragment t of s' that contains all the bases of Σ except its corresponding cut base x. Clearly, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M71">View MathML</a> and t is the suffix of a substring s'[1, i] for some integer i. Without loss of generality, we can further suppose that t ≠ s'[1, i]. To show (i), what we mainly need to show is that there exists a string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M72">View MathML</a> such that p is the suffix of t' and t' is the suffix of the substring s'[1, i − |q|], where x', p, and q are computed for the string t as described earlier. Indeed, we can find the string t' as follows. First, let (i' − 1) be the position of the last occurrence of the base x' in the substring s'[1, i − |t|]; if there is no such occurrence, we let i' = 1. Then, we assign <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M73">View MathML</a>. Obviously, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74">View MathML</a> is the suffix of s'[1, i − |q|]. Because s'[i − |t|] = x and x' ≠ x, we have i' ≤ i − |t|. It then follows from p = s'[i − |t|, i − |q|] that p shall be the suffix of t'. 
Since p contains all the bases of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M75">View MathML</a> except x', so does t'. Moreover, t' is a cleavage fragment of s' with respect to the cut base x' because we have either s'[i' − 1] = x' or i' = 1 on the left end of t' and s'[i − |q| + 1] = x' on the right end of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74">View MathML</a>. By Lemma 8, we can see that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M76">View MathML</a>. For the reader's convenience, we demonstrate in the following example how to find <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74">View MathML</a> from t. Let s' = ACATGCTACATTA, t = s'[4, 7] = TGCT, i = 7, x = A, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a> be the one as given in Example 6. Note that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M77">View MathML</a>. Further, for the given string t = TGCT, we have x' = C, p = ATG, and q = CT. Then, we obtain that i' = 3 and then t' = s'[3, 7 − 2] = s'[3, 5] = ATG. 
It is easy to check that p is the suffix of t', t' is the suffix of the substring <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M78">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M79">View MathML</a>.
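The construction of t' from t in the argument for (i) can be traced with a short Python sketch. The function name `find_predecessor_fragment` is hypothetical, positions are 1-based as in the text, and the sketch only covers the index arithmetic (it does not check membership in any compatibility set).

```python
def find_predecessor_fragment(s_prime: str, i: int, t: str, x: str):
    """Return (i_prime, t_prime) for a fragment t = s'[i-|t|+1, i] with cut base x."""
    xt = x + t
    # Locate k: the rightmost position of x·t holding a first occurrence.
    seen, k = set(), 0
    for pos, base in enumerate(xt, start=1):
        if base not in seen:
            seen.add(base)
            k = pos
    x_prime = xt[k - 1]
    q_len = len(xt) - k + 1                  # |q|

    prefix = s_prime[: i - len(t)]           # s'[1, i - |t|]
    last = prefix.rfind(x_prime)             # 0-based index, or -1 if absent
    # (i' - 1) is the 1-based position of that occurrence; i' = 1 if there is none.
    i_prime = last + 2 if last != -1 else 1
    t_prime = s_prime[i_prime - 1 : i - q_len]   # s'[i', i - |q|]
    return i_prime, t_prime


# Worked example from the proof: s' = ACATGCTACATTA, t = TGCT, i = 7, x = A
print(find_predecessor_fragment("ACATGCTACATTA", 7, "TGCT", "A"))  # (3, 'ATG')
```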

On the other hand, let s' be a string evaluated by the dynamic programming algorithm. So, the string s' must begin with a string in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M80">View MathML</a> for some <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a> and end with a string in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M81">View MathML</a> for some <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M82">View MathML</a>. Consider a cleavage fragment t of s' that was used to construct the string s' during the backtracking procedure of the algorithm. Clearly, the string t contains all the bases of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M83">View MathML</a> except its corresponding cut base x. Moreover, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M84">View MathML</a> and t is the suffix of a substring s'[1, i] for some integer i. 
Without loss of generality, we can further suppose t ≠ s'[1, i] and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M85">View MathML</a>, so that s'[i − |t|] = s'[i + 1] = x. Let t' be the string considered next to the string t during the backtracking procedure of the algorithm. Thus, we have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M86">View MathML</a> such that p is the suffix of t' and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M74">View MathML</a> is the suffix of the substring s'[1, i − |q|], where x', p, and q are computed for the string t as described earlier. More specifically, there exists i' such that t' = s'[i', i − |q|] and s'[i' − 1] = s'[i − |q| + 1] = x' if i' ≠ 1. To show (ii), by Lemma 8 and also by backward induction, what we mainly need to show is that the extended substring s'[i', |s'|] is I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>, given that the substring s'[i − |t| + 1, |s'|] is already I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>. To this end, we consider any internal cleavage fragment s'[j, k] of s'[i', |s'|] with respect to the cut base x″ = s'[j − 1] = s'[k + 1]. By definition of the internal cleavage fragment, we have j ≥ i' + 1 and k ≤ |s'| − 1. 
In the following we distinguish four cases:

- If j ≥ i − |t| + 2, then s'[j, k] is an internal cleavage fragment of s'[i − |t| + 1, |s'|]. Since s'[i − |t| + 1, |s'|] is already assumed to be I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>, the base composition of s'[j, k] shall also be contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M87">View MathML</a>.

- If j = i − |t| + 1, then x″ = x, which further implies that k = i and s'[j, k] = t. Since <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M88">View MathML</a>, the base composition of s'[j, k] shall be contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M89">View MathML</a>.

- If j ≤ i − |t| and k ≥ i − |q|, then s'[i − |t|, i − |q|] is a substring of s'[j, k]. Since s'[i − |t|, i − |q|] contains all the bases of Σ, the string s'[j, k] cannot be a cleavage fragment (as a cleavage fragment must not contain its corresponding cut base). Therefore, the case where j ≤ i − |t| and k ≥ i − |q| cannot occur.

- If k ≤ i − |q| − 1, then s'[j, k] is an internal cleavage fragment of t' = s'[i', i − |q|]. Since <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M90">View MathML</a>, the base composition of s'[j, k] shall be contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M91">View MathML</a>.

In conclusion, for every internal cleavage fragment of s'[i', |s'|], its base composition is contained in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a> under the respective cut base. Therefore, the extended substring s'[i', |s'|] is still I-compatible with <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M31">View MathML</a>.
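The property established by the case analysis (every internal cleavage fragment has its base composition in the spectrum of its cut base) can be checked directly on small strings. The Python sketch below rests on several assumptions that are not taken from the paper: compomers are represented as count tuples over the alphabet, the spectra are passed as a dictionary from cut base to a set of such tuples, empty fragments between adjacent cut bases are ignored, and all function names are hypothetical.

```python
from collections import Counter


def compomer(fragment: str, alphabet: str = "ACGT"):
    """Base composition of a fragment as a tuple of counts (assumed representation)."""
    counts = Counter(fragment)
    return tuple(counts[b] for b in alphabet)


def internal_fragments(s: str, x: str):
    """Yield the non-empty x-free substrings of s flanked by x on both sides."""
    cuts = [idx for idx, base in enumerate(s) if base == x]
    for left, right in zip(cuts, cuts[1:]):
        if right - left > 1:  # assumption: skip empty fragments
            yield s[left + 1:right]


def is_i_compatible(s: str, spectra: dict, alphabet: str = "ACGT") -> bool:
    """Check that every internal fragment's compomer lies in its cut base's spectrum."""
    return all(
        compomer(f, alphabet) in spectra.get(x, set())
        for x in alphabet
        for f in internal_fragments(s, x)
    )


s_example = "ACATGCTACATTA"
print(list(internal_fragments(s_example, "A")))  # ['C', 'TGCT', 'C', 'TT']
```

A checker of this kind is mainly useful for validating the output of a faster algorithm on small inputs.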

Note that computing each entry <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M92">View MathML</a> of the dynamic programming table may take time <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M93">View MathML</a>, where <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M94">View MathML</a>. Hence, the above dynamic programming algorithm runs in time <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M95">View MathML</a>. In the worst case, we may have <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M96">View MathML</a>, that is, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M97">View MathML</a> is in the factorial order of the input problem size. In practice, however, we would expect <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M97">View MathML</a> to remain small enough to be manageable, because cleavage fragments are usually of small size. 
Therefore, the above dynamic programming algorithm could be a practically feasible solution to the problem <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a>, especially when compared to the brute-force algorithm which needs to examine all the possible strings s'. For the special case where <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M98">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> is actually an easy problem, as we can see from the above that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M99">View MathML</a>.

Corollary 11 The above dynamic programming algorithm can solve the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M100">View MathML</a> problem in polynomial time when <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M101">View MathML</a>.
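For comparison, the brute-force baseline mentioned above (examining all possible strings s') can be sketched as follows. This is not the authors' dynamic programming algorithm; it is a hypothetical reference implementation in which feasibility is supplied by the caller as a predicate `is_feasible`, since the exact feasibility test depends on definitions given earlier in the paper. It enumerates |Σ|^|s| candidates and is therefore only usable for very short strings.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def brute_force_min_snps(s: str, is_feasible, alphabet: str = "ACGT"):
    """Return (distance, candidate) minimising the Hamming distance to s,
    or None if no candidate of length |s| is feasible."""
    best = None
    for cand in product(alphabet, repeat=len(s)):
        c = "".join(cand)
        if is_feasible(c):
            d = hamming(s, c)
            if best is None or d < best[0]:
                best = (d, c)
    return best


# Toy predicate (an assumption for illustration only): the string must contain a G.
print(brute_force_min_snps("AAC", lambda c: "G" in c, alphabet="ACG"))  # (1, 'AAG')
```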

The NP-hardness of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a>

This subsection is devoted to proving that the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem is NP-hard. We begin with a brief introduction to the 3-partition problem.

Definition 12 (The general form of the 3-partition problem) Given a multiset of positive integers <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M102">View MathML</a> where n = 3m and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M103">View MathML</a>, can we partition the multiset <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> into m multisets <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M105">View MathML</a>, such that the sum of each multiset is equal to B?

The 3-partition problem is strongly NP-complete [7]. Therefore, it remains NP-complete even when the integers in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> and the integer B are encoded in unary. In this case, the size of a problem instance is Θ(nB). In contrast, it becomes O(n log B) when using the binary encoding of integers.

Definition 13 (The restricted variation of the 3-partition problem) Given a set of positive integers <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M106">View MathML</a> where n = 3m, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M107">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M108">View MathML</a>, can we partition the set <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> into m subsets <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M109">View MathML</a>, such that the sum of each subset is equal to B?

There are two constraints imposed in the above restricted variation of the 3-partition problem. The first one requires <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> to be a set, so that all the integers in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> are distinct. The second one restricts all the integers in <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> to lie strictly between <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M110','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M110">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M111','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M111">View MathML</a>, which subsequently forces every subset <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M112','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M112">View MathML</a> to consist of exactly three elements. Interestingly, this restricted variation of the 3-partition problem remains strongly NP-complete [8], just like the general form of the 3-partition problem. 
Note that the second constraint <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M113','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M113">View MathML</a> was actually not imposed in [8]. However, it can easily be enforced by adding B to each a_i and then multiplying B by 4.
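The transformation just described can be illustrated with a small Python sketch. It makes two assumptions that should be read as such: the bounds are taken to be the usual B/4 < a_i < B/2 of the 3-partition problem (the exact formula is not reproduced in this text), and every a_i is assumed to be smaller than B. The function names are hypothetical.

```python
def restrict_instance(a, B):
    """Add B to each a_i and multiply B by 4, as described in the text."""
    return [ai + B for ai in a], 4 * B


def within_bounds(a, B) -> bool:
    """Check B/4 < a_i < B/2 for all a_i (assumed form of the second constraint)."""
    return all(B < 4 * ai and 2 * ai < B for ai in a)


# Toy instance with m = 2 and B = 12: {1, 2, 9} and {3, 4, 5} both sum to 12.
a, B = [1, 2, 9, 3, 4, 5], 12
new_a, new_B = restrict_instance(a, B)
print(new_a, new_B)                 # [13, 14, 21, 15, 16, 17] 48
print(within_bounds(new_a, new_B))  # True
```

Under the stated assumptions, a triple sums to B in the original instance exactly when the shifted triple sums to 4B, and distinctness of the integers is preserved by the shift.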

Theorem 14 The <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M114','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M114">View MathML</a> problem is NP-hard, even when <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M115">View MathML</a>.

Proof: We prove it by a reduction from the above restricted variation of the 3-partition problem. As an input for 3-partition, we are given a set of distinct positive integers <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M116','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M116">View MathML</a> where n = 3m, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M117">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M118','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M118">View MathML</a>. Then, we construct an instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M119','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M119">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem as follows:

- Let Σ = {G, T}.

- Let s be the string such that s · T = (G^{B+2}T)^m. That is, let s · T be the concatenation of m copies of the fragment GG · · · GT, where each fragment consists of (B + 2) consecutive base Gs followed by one base T. Note that |s| = m(B + 3) − 1 = mB + 3m − 1.

- Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M120','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M120">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M121','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M121">View MathML</a> so that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M122','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M122">View MathML</a>.
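The string s of the second step can be built directly from m and B. The following Python sketch only covers that step (the compomer spectra of the third step are not reproduced here), and the function name `build_reference_string` is hypothetical.

```python
def build_reference_string(m: int, B: int) -> str:
    """Return s such that s·T is m copies of (B + 2) Gs followed by one T."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (("G" * (B + 2) + "T") * m)[:-1]  # drop the final T to obtain s


s = build_reference_string(2, 12)
print(len(s), s.count("T"))  # 29 1
```

For m = 2 and B = 12 this gives |s| = mB + 3m − 1 = 29, with m − 1 = 1 base T, in agreement with the length stated in the construction.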

First, we check whether this construction can be done in polynomial time in the size of the input instance of the 3-partition problem. Since the restricted variation of the 3-partition problem is strongly NP-complete, we may encode the integers in unary so that the size of the input instance is Θ(nB). In the above reduction, we can easily see that the first step can be done in constant time, the second step in time O(mB), and the third step in time O(n log B). Therefore, the total time needed for the construction is O(nB), which is polynomial in the size of the input instance of the 3-partition problem.

Next, we show that every feasible solution s″ to the reduced instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M123">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem is such that (i) <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M124">View MathML</a>, (ii) s″ contains exactly 3m − 1 base Ts, and (iii) d_H(s, s″) ≥ 2m. For each compomer <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M125">View MathML</a>, there exists at least one cleavage fragment <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M126">View MathML</a> in s″ that is obtained with respect to the cut base T. Since all the integers a_i are distinct, all such cleavage fragments shall be pairwise non-overlapping. Thus, the string s″ contains at least <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M127">View MathML</a> base Gs and at least n − 1 = 3m − 1 base Ts. On the other hand, since |s| = mB + 3m − 1, the string s″ consists of exactly mB + 3m − 1 bases. 
Therefore, we can deduce that s″ contains exactly 3m − 1 base Ts and further that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M128">View MathML</a> cannot have any other compomer than those in CT. By construction, we also know that the string s contains exactly m − 1 base Ts, which hence implies that dH (s, s″) ≥ 2m.

Now, we show that there exists a valid partition for the input instance of the 3-partition problem if and only if there exists an optimal solution s' for the reduced instance of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem such that dH(s, s') = 2m.

Suppose that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> can be partitioned into m subsets <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M129','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M129">View MathML</a> such that, for each subset <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M130','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M130">View MathML</a>, its size is three and its integer elements add up to exactly B, that is, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M131">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M132','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M132">View MathML</a>. Then, we use the following procedure to find the string s':

1.    <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M133','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M133">View MathML</a>;

2.    for i = 1 to m

3.      for j = 1 to 3

4.       <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M134','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M134">View MathML</a>; // append the string <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M135','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M135">View MathML</a> to s'

5.      end

6.    end

7.    s' := s'[1, |s'| − 1]; // remove the last base T

As one can easily check, the resulting string s' is such that |s'| = mB + 3m − 1, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M136','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M136">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M137','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M137">View MathML</a>. Therefore, s' is a feasible solution to the reduced instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M138','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M138">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem. On the other hand, since <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M139','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M139">View MathML</a>, we can deduce that s'[k] = s[k] if s'[k] = G or s[k] = T; otherwise, s'[k] ≠ s[k], ∀k ∈ [1, mB + 3m − 1]. Therefore, dH(s, s') = |{k : s'[k] ≠ s[k]}| = |s| − |{k : s'[k] = s[k]}| = mB + 3m − 1 − |{k : s'[k] = G}| − |{k : s[k] = T}| = mB + 3m − 1 − mB − m + 1 = 2m. 
It hence follows that s′ is indeed an optimal solution to the reduced instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M140','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M140">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem.
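
The construction above can be checked on a small instance with the following Python sketch. It rests on two assumptions that are not spelled out in the visible text: that s consists of m runs of (B + 2) base Gs separated by single base Ts (consistent with |s| = mB + 3m − 1 and with s containing exactly m − 1 base Ts), and that step 4 appends a run of a base Gs followed by one base T for each integer a. The function names and the example triples are illustrative only, and the sketch does not verify that the input is a valid instance of the restricted 3-partition problem.

```python
def reference_string(m, B):
    # Assumed form of s: m runs of (B + 2) Gs joined by single Ts,
    # so that |s| = m(B + 2) + (m - 1) = mB + 3m - 1.
    return "T".join("G" * (B + 2) for _ in range(m))


def string_from_partition(triples):
    # Assumed reading of steps 1-7: append G^a T for each integer a,
    # subset by subset, then remove the last base T.
    s_prime = "".join("G" * a + "T" for triple in triples for a in triple)
    return s_prime[:-1]


def hamming(u, v):
    # Hamming distance between two strings of equal length.
    assert len(u) == len(v)
    return sum(x != y for x, y in zip(u, v))


# Hypothetical instance with m = 2 and B = 12.
triples = [(1, 5, 6), (2, 3, 7)]
m, B = len(triples), sum(triples[0])
s = reference_string(m, B)
s_prime = string_from_partition(triples)
print(len(s), len(s_prime), hamming(s, s_prime))  # 29 29 4
```

On this instance both strings have length mB + 3m − 1 = 29 and differ in exactly 2m = 4 positions, in agreement with the calculation above.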

Conversely, suppose that the string s' is an optimal solution to the reduced instance <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M141">View MathML</a> of the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem such that dH(s, s') = 2m. Then, we use the following procedure to find a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M142','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M142">View MathML</a> of A:

1.    s := s · T; s':= s' · T;

2.    i := 1; j := 1;

3.    <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M143','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M143">View MathML</a>

4.    for k = 1 to mB + 3m

5.      if s'[k] = T

6.       <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M144','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M144">View MathML</a>

7.       j++;

8.       if s[k] = T

9.        i++; j := 1;

10.        <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M145','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M145">View MathML</a>

11.       end

12.       <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M146','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M146">View MathML</a>

13.      else

14.       <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M147','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M147">View MathML</a>

15.      end

16.    end

It follows from the earlier discussions that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M148','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M148">View MathML</a> and also that s' contains exactly 3m − 1 base Ts. Furthermore, since dH(s, s') = 2m, we can deduce that s'[k] = s[k] if s[k] = T, ∀k ∈ [1, mB + 3m − 1]. Notice that s[k] = T if and only if k is a multiple of (B + 3), that is, k = i(B + 3) ∈ [1, mB + 3m − 1] for some integer i. Therefore, s'[k] = T if k = i(B + 3) ∈ [1, mB + 3m − 1] for some integer i, which subsequently implies that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M149','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M149">View MathML</a>, for each i ∈ [1, m]. Note that s[(i − 1)(B + 3) + 1, i(B + 3) − 1] is a substring of s that consists of (B + 2) base Gs; it is located either strictly between two consecutive base Ts or strictly between one base T and one end of the string s. Since CT(s[(i − 1)(B + 3) + 1, i(B + 3) − 1]) ⊆ CT(s'), we can let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M150','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M150">View MathML</a> such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M151','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M151">View MathML</a>. 
Since <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M152','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M152">View MathML</a>, we can deduce that j = 3; hence <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M153','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M153">View MathML</a>. Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M154','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M154">View MathML</a>, for all i ∈ [1, m]. Then, we can see that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M155','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M155">View MathML</a> is a partition of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M104">View MathML</a> such that the sum of integers in each subset is equal to B.
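
The converse direction can be illustrated by a short Python sketch of the scanning procedure. Several of its steps are not visible in the text, so the sketch reflects one plausible reading and should be taken as an assumption: a counter tracks the length of the current run of base Gs in s', each base T in s' emits that length as an integer of the current subset, and a base T occurring in s at the same position closes the subset. The function name and the example strings are hypothetical.

```python
def partition_from_string(s, s_prime):
    # Append a sentinel T to both strings so that the last run is closed.
    s, s_prime = s + "T", s_prime + "T"
    subsets, current, run = [], [], 0
    for a, b in zip(s, s_prime):
        if b == "T":
            current.append(run)          # a run of Gs in s' ends here
            run = 0
            if a == "T":                 # s has a T here too: subset is complete
                subsets.append(current)
                current = []
        else:
            run += 1                     # extend the current run of Gs
    return subsets


# Hypothetical instance with m = 2 and B = 12.
s = "T".join("G" * 14 for _ in range(2))
s_prime = "T".join("G" * a for a in [1, 5, 6, 2, 3, 7])
print(partition_from_string(s, s_prime))  # [[1, 5, 6], [2, 3, 7]]
```

Each recovered subset has three elements summing to B = 12, as the argument above requires.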

Extensions to edit distance

Naturally we may extend our previous problem formulations to the edit distance (i.e., Levenshtein distance). The resulting two new problems are formally defined as follows.

Definition 15 (The <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem) Given a string s and a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M156','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M156">View MathML</a>, find a string s' such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M157','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M157">View MathML</a>, for all x ∈ Σ, and dE(s, s') is minimized.

Definition 16 (The <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem) Given a string s and a collection of compomer spectra <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M158','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M158">View MathML</a>, find a string s' such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M159','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M159">View MathML</a>, for all <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M15">View MathML</a>, and dE(s, s') is minimized.

These extensions make it possible to detect not only base substitutions but also base insertions and deletions. Hence, they would permit mutation discovery in DNA sequences (see [1]). In Additional file 1, we show that both <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> are theoretically NP-hard, together with an exact dynamic programming algorithm for solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem.
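
To make the objective dE concrete, the following Python sketch computes the Levenshtein distance between two strings with the standard textbook dynamic program. It only evaluates the distance for a given pair of strings; it is not the algorithm of Additional file 1, and the function name is our own.

```python
def levenshtein(u, v):
    # prev[j] holds the edit distance between the current prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, start=1):
        curr = [i]  # distance from u[:i] to the empty string
        for j, y in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from u
                curr[j - 1] + 1,         # insert y into u
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


# A single base deletion: not expressible as a Hamming distance,
# because the two strings have different lengths.
print(levenshtein("GGTGG", "GTGG"))  # 1
```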

Additional file 1. Extensions to edit distance. The analysis results for the problems <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M160','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M160">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M161','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M161">View MathML</a> are presented. See "Additional file 1.pdf".

Conclusions

To exploit the full potential of the SNP discovery approach using base-specific cleavage and mass spectrometry, in this paper we have studied two new combinatorial optimization problems, called <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a>, respectively. We believe that any efficient solution to either problem could offer a more seamless integration of information in four complementary base-specific reactions than previously done in [1,2], thereby improving the capability of the underlying biotechnology (i.e., base-specific cleavage and mass spectrometry) for sensitive and accurate SNP discovery.

Although we cannot change the inherent complexity of our proposed dynamic programming algorithm for the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem, we believe that, by improving and optimizing its implementation, the running time can be reduced significantly, to an extent suitable for practical use. On the other hand, the NP-hardness result indicates that, in the most general situation, solving the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem exactly in polynomial time is impossible unless P = NP. In more realistic situations where only very few SNPs (e.g., two or three) occur in a target sample sequence, however, the problem can be tackled quite easily, e.g., by an exhaustive search. In future work, we shall try to prove that the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M1">View MathML</a> problem is NP-hard and to develop an efficient heuristic algorithm for the <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S5/mathml/M2">View MathML</a> problem for practical use.
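
The exhaustive search mentioned above could look roughly like the following Python sketch, given here under stated assumptions rather than as an implementation we have evaluated. Candidates are enumerated in order of increasing Hamming distance from s, up to a bound k, and the first candidate accepted by a feasibility test is returned. The predicate is_feasible is a placeholder for the spectrum containment check of the problem and is not implemented here; all names are hypothetical. For a string of length n, the number of candidates at distance d is C(n, d)·3^d, so the approach is only reasonable for very small k.

```python
from itertools import combinations, product


def hamming_ball(s, k, alphabet="ACGT"):
    # Yield (d, t) for every string t with Hamming distance d <= k from s,
    # in order of increasing d.
    for d in range(k + 1):
        for positions in combinations(range(len(s)), d):
            # At each chosen position, try every base different from the original.
            choices = [[c for c in alphabet if c != s[p]] for p in positions]
            for bases in product(*choices):
                t = list(s)
                for p, c in zip(positions, bases):
                    t[p] = c
                yield d, "".join(t)


def exhaustive_search(s, is_feasible, k):
    # Return (distance, candidate) for the first feasible candidate found,
    # which has minimum Hamming distance to s, or None if none exists within k.
    for d, candidate in hamming_ball(s, k):
        if is_feasible(candidate):
            return d, candidate
    return None


# Toy feasibility test standing in for the spectrum check (an assumption):
# accept any string containing at least two base Ts.
print(exhaustive_search("GGGG", lambda t: t.count("T") >= 2, 3))  # (2, 'TTGG')
```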

Authors' contributions

XC conceived the study. All authors contributed to the problem analysis, and read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We would like to thank Yuguang Mu and Kai Tang for introducing us to the problem of SNP discovery using base-specific cleavage and mass spectrometry. X.C.'s research was supported by the Singapore National Medical Research Council grant (CBRG11nov091) and a College of Science Collaborative Research Award at NTU. Q.W.'s research was supported by the National Science Foundation for Young Scientists of China (61103066). L.Z.'s research was supported by the Singapore MOE AcRF Tier 2 grant (R-146-000-134-112).

This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.

References

  1. Bocker S: SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry.

    Bioinformatics 2003, 19(Suppl 1):i44-53.

  2. Krebs S, Medugorac I, Seichter D, Forster M: RNaseCut: a MALDI mass spectrometry-based method for SNP discovery.

    Nucleic Acids Research 2003, 31(7).

  3. Stanssens P, Zabeau M, Meersseman G, Remes G, Gansemans Y, Storm N, Hartmer R, Honisch C, Rodi CP, Bocker S, van den Boom D: High-throughput MALDI-TOF discovery of genomic sequence polymorphisms.

    Genome Research 2004, 14:126-133.

  4. Hartmer R, Storm N, Bocker S, Rodi CP, Hillenkamp F, Jurinke C, van den Boom D: RNase T1 mediated base-specific cleavage and MALDI-TOF MS for high-throughput comparative sequence analysis.

    Nucleic Acids Research 2003, 31(9).

  5. Honisch C, Raghunathan A, Cantor CR, Palsson BO, van den Boom D: High-throughput mutation detection underlying adaptive evolution of Escherichia coli-K12.

    Genome Research 2004, 14(12):2495-2502.

  6. RNaseCut webpage [http://www.vetmed.uni-muenchen.de/gen/forschung.html]

  7. Garey MR, Johnson DS: Complexity results for multiprocessor scheduling under resource constraints.

    SIAM Journal on Computing 1975, 4:397-411.

  8. Hulett H, Will TG, Woeginger GJ: Multigraph realizations of degree sequences: Maximization is easy, minimization is hard.

    Operations Research Letters 2008, 36(5):594-596.

  9. Bocker S: Sequencing from compomers: Using mass spectrometry for DNA de novo sequencing of 200+ nt.

    Journal of Computational Biology 2004, 11(6):1110-1134.

  10. Pevzner PA, Tang HX, Waterman MS: An Eulerian path approach to DNA fragment assembly.

    Proceedings of the National Academy of Sciences of the United States of America 2001, 98(17):9748-9753.