Sketch and example of the TAPyR procedure.
(a) Sketch of the seed strategy employed by TAPyR. In this schematic, three seeds are chosen, and seven matches of these seeds are found on the reference genome: three for the first seed, two for the second, and two for the third. These occurrences are sorted by genome position and scanned from left to right. Multiple seed matches are formed by extending the current partial match with the next occurrence if the coherence criteria are met. Otherwise, the current multiple match is stored as a potential candidate and a new one is started. In this example, we finish with five potential candidates for extension, indicated by the dashed boxes. The largest candidate(s), i.e. the multiple seed occurrences that span the most bases, are chosen for extension. In this case, that should be .
(b) A more concrete example, in which we have a read of length 15, originally read from position 101 of the genome, with one insertion at position 5 and one substitution at position 9. The algorithm starts searching the index from the beginning of the read, but cannot continue beyond the fourth a character. At this point, we have the first seed s1 = aaaa, which occurs at position 101 in the genome. The next character of the read is skipped, and the search continues from position 6, which is the beginning of the second seed. Seed s2 = ccct happens to have an accidental occurrence at position 201, which is unrelated to the actual position of the read in the genome. Again, we skip the next (mismatched) character of the read and restart at position 11. This time the search reaches the end of the read and yields the last seed s3 = gggtt, occurring at position 110. These three occurrences are now sorted according to their position in the genome, and it turns out that the occurrences of s1 and s3 form a coherent multiple seed occurrence of combined length 9. The other candidate consists of the occurrence of s2 alone, which is not chosen for extension since it is shorter (length 4). The space between the two seeds is then filled using dynamic programming, and the correct mapping position (101) is returned along with the final alignment.
Fernandes et al. BMC Bioinformatics 2011 12:163 doi:10.1186/1471-2105-12-163
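The chaining step in panel (b) can be illustrated with a minimal Python sketch. It is not TAPyR's implementation: the caption does not spell out the coherence criteria, so the sketch assumes that two consecutive occurrences are coherent when they appear in read order and their diagonals (genome position minus read position) differ by at most a hypothetical `max_shift`. The names `SeedHit`, `is_coherent`, `chain_seed_hits`, and `combined_length` are likewise invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedHit:
    name: str        # seed label, e.g. "s1"
    read_pos: int    # 1-based start of the seed in the read
    genome_pos: int  # 1-based start of the occurrence in the genome
    length: int      # seed length in bases


def is_coherent(prev, nxt, max_shift):
    """Assumed criterion: same order in read and genome, similar diagonal."""
    if nxt.read_pos < prev.read_pos + prev.length:
        return False  # next seed does not follow the previous one in the read
    shift = (nxt.genome_pos - nxt.read_pos) - (prev.genome_pos - prev.read_pos)
    return abs(shift) <= max_shift


def chain_seed_hits(hits, max_shift=2):
    """Scan occurrences in genome order and group them into candidates."""
    candidates, current = [], []
    for hit in sorted(hits, key=lambda h: h.genome_pos):
        if current and not is_coherent(current[-1], hit, max_shift):
            candidates.append(current)  # store the finished candidate
            current = []                # and start a new one
        current.append(hit)
    if current:
        candidates.append(current)
    return candidates


def combined_length(candidate):
    """Total number of read bases covered by the seeds of a candidate."""
    return sum(h.length for h in candidate)


# Seed occurrences from panel (b): s1 = aaaa, s2 = ccct, s3 = gggtt.
hits = [
    SeedHit("s1", read_pos=1, genome_pos=101, length=4),
    SeedHit("s2", read_pos=6, genome_pos=201, length=4),
    SeedHit("s3", read_pos=11, genome_pos=110, length=5),
]
candidates = chain_seed_hits(hits)
best = max(candidates, key=combined_length)
print([h.name for h in best], combined_length(best), best[0].genome_pos)
# prints: ['s1', 's3'] 9 101
```

With these inputs the scan visits s1, s3, and s2 in genome order. The s1 and s3 occurrences differ by one diagonal, which matches the single insertion, so they are chained. The s2 occurrence breaks read order and becomes a separate candidate of length 4. The dynamic programming fill between the chained seeds is outside the scope of this sketch.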