Abstract
With the remarkable development in inexpensive sequencing technologies and supporting computational tools, we have the promise of medicine being personalized by knowledge of the individual genome. Current technologies provide high throughput, but short reads. Reconstruction of the donor genome is based either on de novo assembly of the (short) reads, or on mapping donor reads to a standard reference. While such techniques demonstrate high success rates for inferring 'simple' genomic segments, they are confounded by segments with complex duplication patterns, including regions of direct medical relevance, like the HLA and the KIR regions.
In this work, we address this problem with a method for assessing the quality of a predicted genome sequence for complex regions of the genome. This method combines two natural types of evidence: sequence similarity of the mapped reads to the predicted donor genome, and distribution of reads across the predicted genome. We define a new scoring function for readtogenome matchings, which penalizes for sequence dissimilarities and deviations from expected read location distribution, and present an efficient algorithm for finding matchings that minimize the penalty. The algorithm is based on a formal problem, first defined in this paper, called Coverage Sensitive manytomany mincost bipartite Matching (CSM). This new problem variant generalizes the standard (onetoone) weighted bipartite matching problem, and can be solved using network flows. The resulting Javabased tool, called SAGE (Scoring function for Assembled GEnomes), is freely available upon request. We demonstrate over simulated data that SAGE can be used to infer correct haplotypes of the highly repetitive KIR region on the Human chromosome 19.
Introduction
The inexorable drop in costs and rise in throughput of DNA sequencing is driving a future in which every individual person will have their genome sequenced, perhaps multiple times in their lifetimes [1]. Current high throughput technologies produce sequenced read fragments from donor genomes, which are then used for inferring the complete genomic sequence. The main algorithmic approaches for inferring a donor genome from a set of its sequenced reads are either based on de novo assembly [2,3], i.e. producing a parsimonious superstring that approximately contains most reads as its substrings, or based on mapping approaches [46], in which the algorithm takes the read set and a previously sequenced reference genome (or a set of reference genomes), maps the reads to the reference, and uses the identified similarities and variations in order to predict the donor genome.
While the accuracies of sequencing technologies keep improving and their usage costs keep decreasing, many of them still produce reads of relatively short lengths. Reconstruction of repetitive genomic regions using the mentioned approaches is considered more challenging, due to the fact that short reads may be denovo assembled, or mapped to the reference, in multiple ambiguous manners. The difficulty even increases for diploid genomes, limiting the investigation of many important genomic regions, such as the killer cell immunoglobulin like receptor (KIR) region (located in humans within the 1Mb Leucocyte Receptor Complex 19q13.4, see Figure 1b), the 3.6Mbp Human Leucocyte Antigen (HLA) region and others, which exhibit highly repetitive sequences and extensive polymorphisms.
Here, we address the problem of assessing the quality of a donor genome prediction given the set of its sequenced reads, confronting difficulties related to genomic regions of repetitive nature. We present a prediction quality measure a prediction quality measure which is independent of the approach used for generating the prediction. It combines scoring penalties related to both (a) imperfect alignments of the reads to the predicted region, and (b) deviations between the expected and actual read coverage of segments of the region. Our tool differs from previous ones which compare predictions to a known reference. For example, tools that evaluate the quality of denovo assemblies [7] rely on comparing assembled genomes to known references. Mapping tools [8,9] can be used to provide a naive scoring function comparable to SAGE by summing up the best alignment score of each read. This naive scoring function only optimizes the alignment of the reads and does not take into account read coverage. Our results show the advantage of simultaneously optimizing the combined alignment and coverage score by comparing our tool to the naive approach.
In order to evaluate the new cost function, we applied it to the KIR, a hypervariable region known to be important for the immediate immune response in humans and higher mammals [10]. The KIR region is challenging to reconstruct from sequence read fragments due to its variable gene architecture (Figure 1a) and repetitive nature (Figure 1b). We show that our scoring function allows us to correctly identify KIR haplotype templates in diploid genomes, differentiating correct predictions form incorrect ones based on their computed score, while the naive approach fails in many cases to predict the correct template.
Our cost function for evaluating donor genome predictions is based on a new variant of a bipartite matching problem, entitled Coverage Sensitive manytomany mincost bipartite Matching (CSM), which is a manytomany generalization of the classical mincost (or maxweight) bipartite matching problem [11,12]. The formal definition of the CSM problem is given in the next section. While in general CSM is NPHard (see Additional File 1), we show a special "convexed" case for which CSM can be efficiently solved by reducing it to a network flow problem, similar to many other variants of bipartite matching problems [12]. Optimal matching/flow algorithms were recently used by several related works to predict structural variations between genomes. Examples to such works include [13], in which mincost flow was used to call copy number variations between a reference and a donor genome, [14], which used maximumweight matching in order to reconstruct breakpoint sequences in long genomic insertions, and [15], which used maximumflow in order to apply a postprocess refinement of simultaneous detection of structural variations in multiple genomes.
Additional file 1. NPhardness of CSM.
Format: PDF Size: 116KB Download file
This file can be viewed with: Adobe Acrobat Reader
Coverage Sensitive manytomany mincost bipartite Matching (CSM)
The CSM problem is a manytomany generalization of the classical mincost bipartite matching problem [12]. We describe the problem in an abstract setting, and cast it to a read alignment problem in the next section.
Consider arbitrary sets X and Y. A manytomany matching (henceforth a matching) between X and Y is a set M of pairs {(x, y) ∈ X × Y} (see Figure 2, (a), (b), (c). The coverage of an element x ∈ X with respect to a matching M is c_{M }(x) = {y : (x, y) ∈ M}. Symmetrically, c_{M }(y) = {x : (x, y) ∈ M} for y ∈ Y .
Figure 2. Matching instance and its reduction to a cost flow network. (a) A bipartite graph corresponding to sets X and Y. In our particular application, X represents a set of reads and Y represents a set of genomic segments, where the expected coverage of each read is one and segments are expected to be uniformly covered. Each read x ∈ X potentially maps to multiple segments, illustrated by the edges in the graph. An edge (x, y) has the weight w_{m }(x, y), reflecting the best similarity between read x and a substring of of the genome starting at segment y. (b) and (c) depict two possible matchings. In (b), one of the y segments is covered by four reads, while the other two segments are covered by one read each. In (c), each segment is covered by two reads. It is possible that the matching in (b) is better in terms of sequence similarity, though is unrealistic in terms of segment coverage, which would make the matching in (c) preferable. (d) The corresponding network. Each pair of consecutive layers is a bipartite graph with capacities c and costs w' as described.
A coverage sensitive matching cost function (henceforth a cost function) w for X and Y assigns matching costs w_{m }(x, y) for every pair (x, y) ∈ X × Y , and coverage costs w_{c }(z, i) for every z ∈ X ∪ Y and every integer i ≥ 0. The cost of a matching M between X and Y with respect to w is given by
The CSM problem
Input: A Matching Instance (X, Y, w) consisting of sets X, Y, and cost function w.
Note that CSM is a generalization of classical problems in combinatorics. For example, consider the problem of finding a maximum (partial onetoone) matching on a bipartite graph G with vertex shores X, Y, and an edge set E. This problem can be solved by solving CSM on the input X, Y using the following costs: set w_{c }(z, 0) = w_{c }(z, 1) = 0, and w_{c }(z, i) = ∞ for all z ∈ X ∪ Y, i >1; set w_{m }(x, y) = 1 for (x, y) ∈ E and otherwise set w_{m }(x, y) = ∞. Similarly, CSM can also be used for solving the minimum/maximum weight variants of the bipartite matching problem. However, CSM is NPhard in general (see Additional File 1), and therefore we do not expect to solve the general instance efficiently.
CSM with convex coverage costs
Let (X, Y, w) be a matching instance. We say that w has convex coverage costs if for every element z ∈ X ∪ Y and every integer i >0, . We show here that CSM with convex coverage costs can be reduced to the polytime solvable mincost integer flow problem [11].
For x ∈ X, denote d_{x }= {y : w_{m }(x, y) <∞}, and similarly d_{y }= {x : w_{m }(x, y) <∞} for y ∈ Y . Denote and . The reduction builds the flow network N = (G, s, t, c, w'), where G is the network graph, s and t are the source and sink nodes respectively, and c and w' are the edge capacity and cost functions respectively. The graph G = (V, E) is defined as follows (Figure 2d).
• V = X ∪ Y ∪ C^{X }∪ C^{Y }∪ {s, t}, where the sets , , and {s, t} contain unique nodes different from all nodes in X and Y . Note that we use the same notations for elements in X and Y and their corresponding nodes in V, where ambiguity can be resolved by the context.
• E = E_{1 }∪ E_{2 }∪ E_{3 }∪ E_{4 }∪ E_{5}, where
and
The capacity function c assigns infinity capacities to all edges in E_{1 }and E_{5 }and unit capacities to all edges in E_{2}, E_{3 }and E_{4}. The cost function w' assigns zero costs to edges in E_{1 }and E_{5}, costs w_{c }(x, i)  w_{c }(x, i  1) to edges , costs w_{c }(y, i)  w_{c }(y, i  1) to edges , and costs w_{m }(x, y) to edges (x, y) ∈ E_{3}. For E' ⊆ E, denote . An integer flow in N is a function f : E → {0, 1, 2, . . .}, satisfying that f(e) ≤ c(e) for every e ∈ E (capacity constraints), and for every v ∈ V \ {s, t} (flow conservation constraints). The cost of a flow f in N is defined by .
In what follows, let (X, Y, w) be a matching instance where w has convex coverage costs, and let N be its corresponding network. Due to the convexity requirement, for every x ∈ X and every integer i >0, Similarly, for every y ∈ Y and every integer i >0, , and we get the following observation:
Observation 1. Series of the form and are nondecreasing. Consequentially, for every and , and similarly for and .
Given a flow f in N, define the matching M_{f }= {(x, y) : (x, y) ∈ E_{3}, f(x, y) = 1}. Denote and . Since for edges e ∈ E_{1 }∪ E_{5 }we have that w'(e) = 0, and since for edges e ∈ E_{2 }∪ E_{3 }∪ E_{4 }we have that f(e) ∈ {0, 1} (due to capacity constraints), we can write
Given a noninfinity cost matching M between X and Y, define the flow f_{M }in N as follows:
• For every (x, y) ∈ E_{3}, f (x, y) = 1 if (x, y) ∈ M, and otherwise f(x, y) = 0;
• For every if c_{M }(x) ≤ i, and otherwise ;
• For every , if c_{M }(y) ≤ i, and otherwise ;
It is simple to assert that f_{M }is a valid flow in N (satisfying all capacity and flow conservation constraints), and that .
Claim 1. For every flow f in N,
Proof. From flow conservation constraints for every x ∈ X, where in particular by definition we have that Therefore, it follows from Observation 1 that for every x ∈ X, and similarly it may be shown that for every y ∈ Y. Hence,
□
Denote , and note that Δ depends only on the instance (X, Y, w) and not on any specific matching.
Claim 2. For every matching M between × and Y, w'(f_{M}) = w(M)  Δ.
Proof. For x ∈ X, we have that and similarly for y ∈ Y. Therefore,
□
Claim 3. Let f* be a minimum cost flow in N. Then, M_{f* }is a minimum cost matching between X and Y, and CSM(X, Y, w) = w'(f*) + Δ.
Proof. Since f* is a minimum cost flow in N, , thus . Let M be a matching between X and Y. Again, from the optimality of f*, w'(f*) ≤ w'(f_{M}) and so , and in particular . Thus, is a minimum cost matching for (X, Y, w), and so .
□
Constructing CSM instance from read mapping data
Consider a set of reads and a prediction of the genomic sequence (henceforth, the "prediction") from which the reads were extracted. It is assumed that the sequencing procedure produces reads with some sequencing error probability, and that read extraction positions along the genome adhere to some expected distribution. The probability for extracting a read starting at a given position may depend on the sequential context at this position and its location along the genome. Given such probabilities, it is possible to compute for a given segment of the prediction an expected amount of extracted reads starting within this segment. Such an amount of expected reads will be referred to here as the expected coverage of the segment. Hence, we can argue that the reads well support the prediction in case it is possible to assign to each read a position within the prediction, from which it was presumably extracted, in a manner that (a) each read sequence approximately matches the substring of the prediction starting at the assigned position, and (b) for every segment of the prediction, the amount of reads assigned to positions within this segment does not deviate significantly from the expected coverage of the segment. On the other hand, when no such position assignment can be found, it is suggestive that the prediction exhibits some variation with respect to the true genome.
Given a predicted region, a mapping between the reads and the prediction is a function that assigns to each read a set of positions in the region from which it is possible to extract the read (with some allowed amount of sequencing errors). Software tools for producing such mappings exist (e.g. Bowtie [8]) and are widely used. Ideally, if the prediction is in fact the correct genomic sequence from which the reads were extracted, and this region is nonrepetitive, it is expected that a mapping would assign to each read a unique position that is the true position from which it was extracted. Nevertheless, when the sequence contains repeats, and sequencing errors are not negligible, it is expected that some of the reads will be mapped to multiple positions (due to the repeats), while others may not be mapped to any position (due to sequencing errors). Given a mapping between the reads and the region, we define a readtogenome matching as a function that selects for each read at most one corresponding position among its set of positions given by the mapping, from which it was presumably extracted. A readtogenome matching better supports the prediction the more reads it matches to the genome, the higher the similarity is between reads and their matching positions, and the smaller the deviation is between the expected coverage and the coverage implied by the matching positions.
The quality of a readtogenome matching can be naturally evaluated using the CSM formalism described in the previous section. A matching instance (X, Y, w) can be generated, choosing X to be the set of reads, and Y to be a partition of the prediction into segments (where each element in Y corresponds to a segment in the partition). For each read x ∈ X and each segment y ∈ Y, w_{m }(x, y) is set to the best sequence similarity score between x and a substring of the prediction starting at y (such similarity scores may be generated by tools such as Bowtie [8]), or set to ∞ if no substring starting at y is similar to x. The coverage cost function for a read x ∈ X sets w_{c }(x, 0) to some penalty added to the score in case x is unmatched, sets w_{c }(x, 1) to 0 (no penalty is added when x participates in the matching), and w_{c }(x, i) for i >1 to ∞ (a matching in which a read is assigned to more than one position is illegal, and has an infinite cost). For a segment y ∈ Y, it is possible to compute the expected coverage c_{y }of y, and generate a convex score function f(i) whose minimum point is at i = c_{y}, and set w_{c }(y, i) = f (i) for every nonnegative integer i. The cost of an optimal matching for this instance can then serve as a quality measure for the prediction.
Implementation
We implemented the CSM algorithm as a java based tool named SAGE, a Scoring function for Assembled GEnomes. The inputs to SAGE are a set of reads, R, mapped to a genomic template, G, in the BAM format [16] along with a parameter file containing alignment costs, unmatched read penalty, genome segmentation, expected segment coverage values, and a choice of coverage cost functions (currently linear and polynomial cost functions).
Results
We tested SAGE on the hypervariable KIR region. The KIR region, while variable, is tightly organized and contains between 8 and 14 genes, and 2 pseudogenes (Figure 1a) [17]. The genes are organized into two adjacent regions, each bordered by two anchoring genes/pseudogenes: KIR3DL3 and 3DP1 for the centromeric region; 2DL4 and 3DL2 for the telomeric region. Variability within KIR is expressed in the form of changing gene numbers, genecopy numbers, and gene polymorphisms. There are two broad types of KIR haplotypes Type A and Type B that are distinguished by their gene content. Type A haplotypes are characterized by the absence of the following genes: {KIR2DL5, 2DS1, 2DS2, 2DS3, 2DS5, 3DS1}, while Type B haplotypes contain one or more of these genes [18]. Type B haplotypes can be split further into different subtypes, characterized by the gene content on the centromericside and telomericside. The various (sub)types of KIR haplotypes are denoted by {A, AB, BA1, BA2, BA2X, Bdel, B}. However, the typing is incompletely developed, and is likely to change as more data is acquired.
To test the effectiveness of SAGE on a variety of haplotype types, we simulated reads from 27 known KIR haplotypes using GemSIM [19] with an error model learned from pairedend (100 × 2)bp reads generated by Illumina GA IIx with TrueSeq SBS Kit v5GA [19]. The 27 haplotype templates were taken from the IPDKIR database [20]. The sequences of these templates were obtained experimentally by first separating the two haplotypes of an individual using fosmidpools, determining the gene content and architecture of each haplotype using STS assays, and then finally sequencing the individual fosmids [21].
Before we ran SAGE, we mapped each read set, R, back to each template, G, using Bowtie. We ran Bowtie under the 'a' option with all other parameters set to the default, in order to obtain a set of all possible mapping locations and their corresponding alignment costs for each read, which was used as input into SAGE. The mapping position of a pairedend read was set to be the genomic index to which the first character of the first subread was aligned. The alignment cost for a complete (100 × 2)bp pairedend read varied between 0 and 180, with 0 corresponding to identity. When two pairedends mapped in a concordant manner, the total alignment cost for the read was calculated by adding the alignment cost of both pairedends. When a pairedend did not have a concordant mate, suggestive of incorrect architecture, the alignment cost was further penalized by adding a cost of 90, which is the maximum penalty for one pairedend. The unmatched read penalty was constant for all reads and set to 100. On the other side, the genome G was partitioned to segments of fixed length of 1000bp (except for the last segment which may be shorter), with expected coverage per segment given by (with the appropriate adjustment for the last segment), where R and G denote the number of reads and the length of the genome, respectively. To allow for natural variation in coverage, the quadratic function, f(i) = (λ  i)^{2}, was chosen as the segment coverage cost function.
To the best of our knowledge, SAGE is the first tool that scores templates given a set of reads. As there is no competing tool, we compared SAGE results against a naive approach that ignores coverage and sums up the best alignment score for each read to obtain a total score for each read set and template. The scores obtained by this approach will be referred to as the Bowtie scores below.
Haploid templates
As a first pass, we tested SAGE's ability to score haploid templates. We scored each of the 27 read sets against each of the 27 templates using SAGE. A visualization of the scores are shown in Figure 3, where the templates are organized by sequence similarity so that templates of the same type/subtype are clustered together. Note that the matrix is not symmetric. Each row corresponds to the scores of a single read data set against a collection of haploid templates. As can be seen, SAGE always gets the topscore for the correct template. Moreover, the other templates from the same subtype get progressively weaker scores. Major haplotypes fall within distinct blocks, but the scores also suggest a hierarchy within the subtypes that can be studied further.
Figure 3. Scoring simulated reads against haploid templates using SAGE. Each row contains the colorcoded percentage from the topscore of a readset mapped against 27 genomic templates. Black: topscore; Red: within 5% of topscore; Orange: ≤ 10%; Yellow: ≤ 20%; White: >20% below topscore. Sequences are ordered along the rows and columns so that sequences with the same (sub)type are adjacent to each other. Templates of the same type are indicated by the blue boxes, and those of the same subtype by light blue boxes.
Dipolid templates
To test scoring on more realistic templates, we simulated reads from 9 diploid individuals whose pair of haploid templates were obtained experimetally in Pyo et al. [21] and are in the IPDKIR database [20]. The 9 diploid templates from this study fell into one of 6 combination of subtypes. We scored each of the 9 simulated read sets against each of the 9 diploid templates using SAGE. In all but one case, SAGE (Figure 4a) and Bowtie (Figure 4b) predicted the correct diploid template of the donor. Furthermore, SAGE is better at predicting the subtype of the donor template than Bowtie. When the donor template is not in database, as is usually the case in practice, SAGE will give a better score to templates that are more similar to the donor while Bowtie may not. For example, row 3 of Figure 4(a, b) show the scores when the donor template is of type A and BA1. Both SAGE and Bowtie correctly gave the best score to the diploid template G085A/BA1. However, the template with the next best SAGE score was also of subtype A/BA1, while the template with the next best Bowtie score was of subtype A/BA2.
Figure 4. Scoring simulated reads against diploid templates. Each row of a matrix represent scores from the same read sets mapped to different prediction templates. The scores are normalized so that the second best score in each row is equal to 1 and the worst score is equal to 0. We normalize with respect to the second best score since it would be used to predict the haplotype in the absence of the best scoring (and presumably correct) template. Furthermore, the entries are colorcoded accordingly Black: topscore; Red: second topscore; Orange: within 10% of second topscore; Light Orange: ≤ 20%; Yellow: ≤ 30%; White: >30% below topscore. Both matrices are ordered according to template subtypes. Templates of the same type are indicated by the blue boxes. (a) SAGE scores (b) Bowtie scores.
In general, coverage plays an important role in determining the correct haplotype. Figure 5(be) show the coverage plots when reads from donor template G085A/BA1 are mapped to a template of the same subtype (F06A/BA1) and a template of a different subtype (FH13A/BA2) using SAGE and Bowtie. When mapped to templates of the same subtype (Figure 5(b, d)), the coverage plots for both SAGE and Bowtie show less variance when compared to the coverage plots of the other templates (Figure 5(c, e)). Bowtie does not take into account variance of coverage and scores the template of a different subtype (FH13A/BA2) higher than the template of the same subtype (F06A/BA1). On the contrary, SAGE penalizes for the variance in coverage, and correctly predicts the subtype of the donor. Furthermore, if several possible mappings of a read are given, SAGE can be used to determine the best mapping. In Figure 5(b, c), we see less variability in the coverage plots from SAGE's matching compared against those of Bowtie's matching (Figure 5(d, e)). Therefore, even if Bowtie is able to determine the correct donor template, it may not output the correct mapping.
Figure 5. Coverage plots for reads sampled from G085A/BA1 templates. (a) genomic architecture of of G085A/BA1, FH06A/BA1, and FH13A/BA2. SAGE coverage plots when reads are extracted from G085A/BA1 and mapped to (b) FH06A/BA1 and (c) FH13 A/BA2. Bowtie coverage plots when reads are extracted from G085A/BA1 and mapped to (d) FH06A/BA1 and (e) FH13A/BA2.
Running time
For a dataset with n reads and a total of m read mapping locations, SAGE scales as O(nm + n^{2 }log n). Thus, on our datasets with haploid genomes of average length 166Kbp (166 1000bpsegments), and ~ 24,900 reads, SAGE ran in 21 seconds. The running time increased to 210 seconds for the average diploid genome (~ 332 1000bpsegments, ~ 49,800 reads). Running times were recorded using a 4 core Intel 2.66GHz processor with 9Gb of RAM.
Discussion and conclusions
To the best of our knowledge, SAGE is the first tool that scores predicted donor templates given a set of sequenced reads. Our results on the KIR region show that SAGE can be used to predict the subtype of the donor KIR template, and can be directly used for haplotyping this region. Furthermore, SAGE scores the correct template higher than even templates of the same subtype.
While we focused our attention on the KIR region, SAGE is general enough to be applied to any complex region. It is also possible to implement many different scoring functions, which would allow the user to obtain optimal matchings according to his own custom scores. For example, read unmatching penalties may be constant for all reads, or may be readspecific. A motivation for read specific costs is in the case where the sequencing phase produces some sequencing qualities for reads, and it is possible to "pay" less when not matching reads of lower sequencing quality. Similarly, it is possible to choose a segmentation of the prediction in which all segments are of the same length, and uniform coverage is assumed, or one with variable segment lengths and possibly different coverage cost functions for each segment. A motivation to such complex segmentation is e.g. in the case where one tries to identify a specific structural variation, such as a deletion of a segment of specific length around a specific region of the prediction. Setting lengths of segments in the examined region to the expected deletion length can increase the likelihood that an optimal matching would not add artifact matchings of reads to a long segment spanning the deleted segment, in order to compensate for low coverage of the deleted segment. Lastly, by using different coverage cost functions, it is possible to decide the rate in which penalty increases due to deviations of expected coverage, which may grow linearly, polynomially, exponentially, or based on other probabilistic models, as long as the function satisfies the convexity requirement.
Future work would involve extending the use of SAGE on real data. Some challenges in dealing with real data include obtaining the set of reads extracted from the region of interest (especially when sequencing data is likely taken from the whole genome) and providing the expected coverage. If we know the parameters of the sequencing run, we could use the target read coverage as the expected coverage; however, if that is unknown, we may be able to estimate the expected coverage from the number of reads we need to map to the region. For example, if we assume a uniform distribution of coverage, then the expected coverage per segment is simply the total length of the segment multiplied by .
Although haplotype analysis of the KIR region is medically relevant, the genomic complexity (i.e. repetitive nature and variable gene architecture) of this region makes it difficult to do a complete analysis. Indeed, the possible subtypes of this region have not been completely characterized. Thus, reconstruction of this region and other complex regions of the genome remain a worthwhile problem. SAGE takes the first step in reconstructing complex regions of the genome by providing a scoring function for predicted templates based on their similarity to the true donor. Therefore, it might be possible to obtain a complete reconstruction of the donor genome by iteratively refining predicted donor templates until SAGE scores are optimized. Furthermore, SAGE can also be applied for scoring denovo assemblies and for comparing the accuracies of different assemblers.
List of abbreviations used
CSM: Coverage Sensitive manytomany mincost Matching; SAGE: Scoring function for Assembled Genomes; KIR: Killer cell Immunoglobulinlike Receptor; HLA: Human Leucocyte Antigen
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SZ and CL developed and implemented the method. CL and SK designed the experiments. CL performed the experiments. CL, SZ, and VB wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The research was supported by grants from the NIH (RO1HG004962, U54 HL108460), and the NSF (CCF1115206). CL was supported by an NSF graduate fellowship.
Declarations
Publication of this article was supported by NIH (RO1HG004962, U54 HL108460) and NSF (CCF1115206).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
References

Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing.
Genome Research 2010, 20(2):265272. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly.
Proc Natl Acad Sci USA 2001, 98(17):97489753. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system.
Genome Res 2004, 14(4):721732. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kidd JM, et al.: Mapping and sequencing of structural variation from eight human genomes.
Nature 2008, 453:5664. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mills RE, et al.: Mapping copy number variation by populationscale genome sequencing.
Nature 2011, 470:5965. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms.
Genome Res 2012, 22(3):557567. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Langmead B, Salzberg S: Fast gappedread alignment with Bowtie 2.
Nature methods 2012, 9(4):357359. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform.
Bioinformatics 2010, 26(5):589595. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hsu KC, Chida S, Geraghty DE, Dupont B: The killer cell immunoglobulinlike receptor (KIR) genomic region: geneorder, haplotypes and allelic polymorphism.
Immunol Rev 2002, 190:4052. PubMed Abstract  Publisher Full Text

Edmonds J, Karp R: Theoretical improvements in algorithmic efficiency for network flow problems.
Journal of the ACM (JACM) 1972, 19(2):248264. Publisher Full Text

Matching Theory, Volume 29 of Annals of Discrete Mathematics. Amsterdam: NorthHolland. 1986.

Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads.
Genome research 2010, 20(11):16131622. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hajirasouliha I, Hormozdiari F, Alkan C, Kidd J, Birol I, Eichler E, Sahinalp S: Detection and characterization of novel sequence insertions using pairedend nextgeneration sequencing.
Bioinformatics 2010, 26(10):12771283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hormozdiari F, Hajirasouliha I, McPherson A, Eichler E, Sahinalp S: Simultaneous structural variation discovery among multiple pairedend sequenced genomes.
Genome research 2011, 21(12):22032212. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools.
Bioinformatics 2009, 25(16):20782079. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Middleton D, Gonzelez F: The extensive polymorphism of KIR genes.
Immunology 2010, 129:819. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marsh SG, Parham P, Dupont B, Geraghty DE, Trowsdale J, Middleton D, Vilches C, Carrington M, Witt C, Guethlein LA, Shilling H, Garcia CA, Hsu KC, Wain H: Killercell immunoglobulinlike receptor (KIR) nomenclature report, 2002.
Tissue Antigens 2003, 62:7986. PubMed Abstract  Publisher Full Text

McElroy KE, Luciani F, Thomas T: GemSIM: general, errormodel based simulator of nextgeneration sequencing data.
BMC Genomics 2012, 13:74. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Robinson J, Waller MJ, Stoehr P, Marsh SG: IPDthe Immuno Polymorphism Database.
Nucleic Acids Res 2005, 33:D523526. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pyo CW, Guethlein LA, Vu Q, Wang R, AbiRached L, Norman PJ, Marsh SG, Miller JS, Parham P, Geraghty DE: Different patterns of evolution in the centromeric and telomeric regions of group A and B haplotypes of the human killer cell Iglike receptor locus.
PLoS ONE 2010, 5(12):e15115. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krumsiek J, Arnold R, Rattei T: Gepard: a rapid and sensitive tool for creating dotplots on genome scale.
Bioinformatics 2007, 23(8):10261028. PubMed Abstract  Publisher Full Text