≥S-similar sequences measured in the human genome. One million query intervals, each 1000 bases long, were randomly sampled from the genome. Each query interval was searched against the human genome to determine the number of non-overlapping 1000 base intervals in the genome that are ≥S-similar to the query. The cumulative distribution of the number of target intervals that are (A) ≥1-similar, (B) ≥5-similar, (C) ≥10-similar, and (D) ≥20-similar to these 1 million query intervals is shown. Each panel uses minimum anchor lengths k = 15, 20, and 25 and indel rate δ = 0.15. From this, one may interpret the number of intervals that must be searched when mapping a read using anchors. For example, when mapping with a minimum of a single 25 base match, 80% of the queries match at most 100 other intervals in the genome with at least one 25 base match (point X). At the other extreme, the top 3% of queries map to over 1 million other intervals with at least one match (point Y), due to the high repeat content of the genome. This indicates that 80% of sequences may be correctly mapped to the human genome using a single 25 base match by searching only 100 candidates; however, for full sensitivity many more candidates must be searched. Points P and Q contrast the fraction of intervals that have 100 or fewer matches in the genome when matching using 1 or more anchors versus 20 or more anchors, for an anchor length of 15. Only 20% of the samples are limited to 100 or fewer additional matching intervals with at least 1 anchor (point P), whereas 97.5% of the samples have 100 or fewer matches when requiring at least 20 anchors in a match (point Q).
Chaisson and Tesler BMC Bioinformatics 2012 13:238 doi:10.1186/1471-2105-13-238
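The points X, Y, P, and Q above are read off a cumulative distribution of per-query match counts. As a minimal sketch of that reading, assuming the number of ≥S-similar target intervals has already been counted for each query, the fraction of queries at or below a given count can be computed as follows. The function name `fraction_at_most` and the toy counts are hypothetical illustrations and are not data or code from the paper.

```python
from bisect import bisect_right


def fraction_at_most(counts, threshold):
    """Empirical CDF: fraction of queries with <= threshold matching intervals."""
    ordered = sorted(counts)
    # bisect_right returns how many sorted values are <= threshold.
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical toy data: one entry per query interval, giving the number of
# other genome intervals that are >=S-similar to it (NOT values from the paper).
toy_counts = [0, 3, 12, 40, 75, 90, 100, 100, 5000, 2_000_000]

# Analogue of point X: fraction of queries with at most 100 matching intervals.
at_most_100 = fraction_at_most(toy_counts, 100)

# Analogue of point Y: fraction of queries with more than 1 million matches.
over_million = 1 - fraction_at_most(toy_counts, 1_000_000)

print(at_most_100, over_million)
```

In this toy data, 8 of the 10 queries have 100 or fewer matching intervals, and 1 of the 10 has more than 1 million, which mirrors how the 80% and 3% figures in the caption are obtained from the real distribution.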