Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka, Japan

Abstract

Background

With the advent of next-generation sequencers, the growing demand to map short DNA sequences to a genome has promoted the development of fast algorithms and tools. The tools commonly used today are based on either a hash table or the suffix array/Burrows–Wheeler transform. These algorithms are best suited to finding the genome positions of exactly matching short reads. However, they have limited capacity to handle mismatches. To find n-mismatches, they require ^{n}

Results

We propose a hash-based method for genome mapping that reduces the number of hash references for finding mismatches without increasing the size of the hash table. The method regards DNA subsequences as words on the Galois extension field GF(2^{2}), and each word is encoded to a code word of a perfect Hamming code. The perfect Hamming code defines equivalence classes of DNA subsequences. Each equivalence class includes the subsequences whose corresponding words on GF(2^{2}) are encoded to the same code word. The code word is used as a hash key to store these subsequences in a hash table. Specifically, the method reduces by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence.

Conclusions

This paper shows that a perfect Hamming code can reduce the number of hash references for hash-based genome mapping. As the computation time to calculate code words is far shorter than a hash reference, our method is effective in reducing the computation time to map short DNA sequences to a genome. The amount of data that DNA sequencers generate continues to increase and more accurate genome mappings are required. Thus our method will be a key technology for developing faster genome mapping software.

Background

The history of bioinformatics has been dominated by the search for faster sequence alignment methods. Beginning with dynamic programming for protein and genome sequence alignment, many algorithms have been proposed. Hash tables are used in the series of FASTA programs

The emergence of next-generation sequencing technology has changed the demands for alignment speed. A so-called next-generation sequencer can read far more base pairs than a conventional sequencer: more than two billion short DNA sequences in a single run. For such a large number of sequences, BLAST tools are too slow to map the sequences to target genomes. Therefore, researchers have called for a faster approach that is focused on mapping short fragments.

To meet this demand, more than 25 software programs designed for mapping short DNA sequences onto genomes have been developed. These are classified into two categories according to their algorithms, which are either hash-based or suffix array/Burrows–Wheeler transform (BWT)-based

These algorithms are effective for mapping short sequences to genome positions of perfect matches and one-base mismatches, but are inefficient for mapping to positions for two or more-base mismatches. In general, they require ^{n}

In the proposed method, DNA subsequences are divided into equivalence classes by using a perfect Hamming code. Each equivalence class includes subsequences whose corresponding words on GF(2^{2}) are encoded to the corresponding code word of the perfect Hamming code. The code word is used as a hash key to store these subsequences in a hash table. A perfect Hamming code is a special case of a Hamming code, known in the field of coding theory

Hash-based genome-mapping algorithms use hash tables. A hash table is an array indexed by hash values generated from hash keys. Thus, a hash table is an implementation of an associative array. There are two methods for mapping short reads onto genomes using hash tables. One is to store subsequences of the genome and their positions in a hash table and the other is to store subsequences of short reads. As there is no essential difference between their hash usages, we use the former method for the following explanation.

The hash-based methods prepare a hash table whose keys and values represent subsequences of length

There are three methods to find the n-mismatch genome positions of a subsequence of length

1. Refer to all n-mismatch subsequences.

Prepare a hash table whose key length is

2. Store n-mismatch positions in the hash table.

For each position of the subsequence of the genome, store the position

3. Use pigeonhole principle; combine hash table and another method.

Generate a hash table whose key length is ⌊

Figure


**Hash tables for three methods.** Three methods to find genome positions of 1-mismatch from the subsequence AAGT. Genome position 1000 is ACGT, which is the 1-mismatch of the subsequence. The first method refers to the hash table 16 times. The second method refers to the table just once, but the table is 16-fold larger. The third method refers to the table three times. After getting position 1002 from the hash table, the method elongates the alignment toward the front of the sequence.
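The first method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy genome, the 0-based positions, and the function names `build_table`, `within_mismatches`, and `lookup` are all assumptions made for this example. For a subsequence of length 5 the enumeration yields 16 sequences within one mismatch and 106 within two, which agrees with the #words column of the summary table later in the paper.

```python
from itertools import combinations, product

BASES = "ACGT"

def build_table(genome, length):
    """Map every subsequence of the given length to its (0-based) genome positions."""
    table = {}
    for pos in range(len(genome) - length + 1):
        table.setdefault(genome[pos:pos + length], []).append(pos)
    return table

def within_mismatches(seq, n):
    """Return all sequences whose Hamming distance from seq is at most n."""
    found = {seq}
    for k in range(1, n + 1):
        for positions in combinations(range(len(seq)), k):
            # At each chosen position, try the three other nucleotides.
            choices = [[b for b in BASES if b != seq[p]] for p in positions]
            for subs in product(*choices):
                chars = list(seq)
                for p, b in zip(positions, subs):
                    chars[p] = b
                found.add("".join(chars))
    return found

def lookup(table, seq, n):
    """Method 1: one hash reference per candidate sequence."""
    hits = set()
    for key in within_mismatches(seq, n):
        hits.update(table.get(key, []))
    return sorted(hits)

genome = "TTACGTAAGGACGTT"  # toy genome, made up for illustration
table = build_table(genome, 4)
print(lookup(table, "AAGT", 1))
print(len(within_mismatches("AAAAA", 1)), len(within_mismatches("AAAAA", 2)))
```

The cost of this approach is visible in `lookup`: the number of hash references equals the number of candidate sequences, which grows quickly with the number of allowed mismatches.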

These methods are effective when

We propose a method that reduces the number of hash references needed to find the genome positions of 2 or more mismatches without enlarging the hash table. To realize the method, a 4-ary perfect Hamming code is used.

Results

Perfect Hamming codes as hash keys

Idea

We first describe the main idea of the proposed method. We define a graph whose nodes are all the subsequences of length ^{l}^{l}^{l}


**Relationship among 16 subsequences.** Graphical depiction of subsequence “AAAAA” and 15 adjacent subsequences. Each node describes a subsequence and each edge indicates that the terminal nodes are of one nucleotide difference. The 15 subsequences are divided into five groups according to the position of the different nucleotide.

The features of this hash table are as follows: (1) The number of entries in the hash table does not increase because each subsequence is stored only once. (2) Using this hash table, we can reduce the number of hash references to find the genome positions of subsequences of 1 or more-mismatches.

We explain how to reduce the number of hash references needed to find 1-mismatches by using an example. Let the length of the subsequence be 5, the hash table be as described above, and


**1-mismatch sequences of a non-code word sequence, “CAAAA”**. Idea behind finding 1-mismatch sequences of the sequence “CAAAA”. The two circles indicate the equivalence classes. The sequence “CAAAA” belongs to an equivalence class whose center is “AAAAA” that holds 3 of the 15 1-mismatch sequences. Two of the other sequences, “CAAAT” and “CAAGA”, belong to an equivalence class whose center is “CAAGT”.

The requirements for the establishment of the equivalence classes need to be determined. At a minimum, the length of the subsequence

This shows that 3

It is not clear that the above equation is a sufficient condition for constructing equivalence classes. Even if it is, two problems still remain: how to construct the equivalence classes and how to calculate the center words from a given subsequence. Perfect Hamming codes provide solutions to both these problems.
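The necessary condition on the subsequence length can be checked numerically. The sketch below assumes, as we read the counting argument, that each equivalence class holds one center and all of its 1-mismatch neighbors, that is 1 + 3l sequences for length l, so that 1 + 3l must divide the total number 4^l of sequences. The function name `feasible_lengths` is hypothetical.

```python
def feasible_lengths(max_length, alphabet=4):
    """Lengths l for which balls of radius 1 could tile all alphabet**l words."""
    lengths = []
    for l in range(1, max_length + 1):
        ball = 1 + (alphabet - 1) * l  # a center plus its 1-mismatch neighbors
        if alphabet ** l % ball == 0:  # the classes must partition all words
            lengths.append(l)
    return lengths

print(feasible_lengths(100))
```

Under this assumption only a few lengths qualify, among them 5 and 21, the two lengths used throughout the paper.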

Perfect Hamming code

A perfect Hamming code (PHC) is a Hamming code that satisfies the equation of the Hamming bound,

where

The condition for a ^{k}

where (0, 1, α, α^{2}) are the elements in the Galois field GF(2^{2}). To get a code word ^{T}

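Addition and multiplication on GF(2^{2})

| + | 0 | 1 | α | α^{2} | × | 0 | 1 | α | α^{2} |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | α | α^{2} | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | α^{2} | α | 1 | 0 | 1 | α | α^{2} |
| α | α | α^{2} | 0 | 1 | α | 0 | α | α^{2} | 1 |
| α^{2} | α^{2} | α | 1 | 0 | α^{2} | 0 | α^{2} | 1 | α |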
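The arithmetic of GF(2^{2}) is small enough to implement with two tiny functions. In this sketch the four elements 0, 1, α, α^{2} are encoded as the integers 0, 1, 2, 3 (an encoding chosen for this example, with α^{2} = α + 1), so that addition becomes a bitwise XOR and multiplication adds exponents of α modulo 3.

```python
# Encoding assumed for this example: 0 -> 0, 1 -> 1, alpha -> 2, alpha^2 -> 3.
NAMES = ["0", "1", "a", "a^2"]   # "a" stands for alpha
EXP = [1, 2, 3]                  # alpha^0, alpha^1, alpha^2
LOG = {1: 0, 2: 1, 3: 2}         # inverse of EXP

def gf4_add(a, b):
    """Addition in GF(4); subtraction is the same operation."""
    return a ^ b

def gf4_mul(a, b):
    """Multiplication in GF(4) via exponents of alpha (alpha^3 = 1)."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 3]

# Print both operation tables row by row.
for a in range(4):
    sums = " ".join(NAMES[gf4_add(a, b)].ljust(3) for b in range(4))
    prods = " ".join(NAMES[gf4_mul(a, b)].ljust(3) for b in range(4))
    print(NAMES[a].ljust(3), "|", sums, "|", prods)
```

Because both operations reduce to an XOR or a three-entry lookup, computing with words on this field is cheap compared with a memory access into a large hash table.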

The code word is calculated from a received word as follows.

1. Calculate the syndrome

2. If the syndrome

3. Find a column

4. Subtract

For example, assume the word ^{2}0) is received. The code word of

As ^{T}^{2} × (1 ^{2} times the fourth column of ^{2}0) from z:

The code word of
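The decoding steps can be sketched as standard single-error syndrome decoding. The parity-check matrix `H` below is an assumption: it is one valid matrix for a 4-ary (5,3) Hamming code, chosen for this example so that “CAAGT” is a code word, and it is not necessarily the matrix used by the authors. Field elements 0, 1, α, α^{2} are encoded as 0, 1, 2, 3.

```python
EXP, LOG = [1, 2, 3], {1: 0, 2: 1, 3: 2}

def gf4_mul(a, b):
    """Multiplication in GF(4); addition and subtraction are XOR."""
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

# Assumed parity-check matrix; its five columns are pairwise independent.
H = [[2, 1, 3, 1, 0],
     [3, 1, 2, 0, 1]]

def syndrome(word):
    """Step 1: multiply the parity-check matrix by the received word."""
    result = []
    for row in H:
        acc = 0
        for h, z in zip(row, word):
            acc ^= gf4_mul(h, z)
        result.append(acc)
    return tuple(result)

def codeword(word):
    """Return the code word within Hamming distance 1 of the received word."""
    s = syndrome(word)
    if s == (0, 0):                      # Step 2: already a code word
        return tuple(word)
    for i in range(len(word)):           # Step 3: find column i and scalar a
        for a in (1, 2, 3):
            if (gf4_mul(a, H[0][i]), gf4_mul(a, H[1][i])) == s:
                fixed = list(word)
                fixed[i] ^= a            # Step 4: subtract a at position i
                return tuple(fixed)
    raise ValueError("syndrome matches no column; H is not a valid matrix")

print(syndrome((2, 2, 2, 3, 0)), codeword((2, 2, 2, 3, 0)))
```

With this matrix the word (α α α α^{2} 0) has a syndrome equal to α^{2} times the fourth column, so the fourth digit is corrected. Every nonzero syndrome is a scalar multiple of exactly one column, which is what makes the code perfect.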

The (

PHC and DNA subsequence

DNA sequences are composed of four nucleotides: adenine, cytosine, guanine and thymine. Let these correspond one-to-one to the elements of the Galois field GF(2^{2}). Then, DNA sequences correspond to words on the Galois field. Without loss of generality, let (A, C, G, T) correspond to (0, 1, α, α^{2}). The sequence “GGGTA” is expressed as the word (α α α α^{2} 0), and the word (10000) represents the DNA sequence “CAAAA”.
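The correspondence is a simple lookup in both directions. In this sketch the field elements 0, 1, α, α^{2} are encoded as the integers 0, 1, 2, 3, which is an encoding assumed for the example; the names `to_word` and `to_dna` are hypothetical.

```python
# (A, C, G, T) <-> (0, 1, alpha, alpha^2), with alpha = 2 and alpha^2 = 3.
BASES = "ACGT"
TO_GF4 = {base: value for value, base in enumerate(BASES)}

def to_word(seq):
    """Translate a DNA sequence into a word on GF(4)."""
    return tuple(TO_GF4[base] for base in seq)

def to_dna(word):
    """Translate a word on GF(4) back into a DNA sequence."""
    return "".join(BASES[digit] for digit in word)

print(to_word("GGGTA"))
print(to_dna((1, 0, 0, 0, 0)))
```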

This correspondence relationship and the PHC enable us to build the equivalence classes described in Section ^{2}) and is a code word of 4-ary (5,3)-PHC. From the properties of PHC, all the words whose Hamming distances from the code word (00000) are 1 are error-corrected to the code word, and they are adjacent nodes of “AAAAA”. Additional File

**DNA sequences and their code words.** All the 5-mer DNA sequences and their code words on 4-ary (5,3)-perfect Hamming codes.
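A listing of this kind can be regenerated by decoding every 5-mer. The sketch below is not the authors' script: the parity-check matrix `H` is an assumed example of a 4-ary (5,3) Hamming code (chosen so that “CAAGT” is a code word), so individual centers may differ from those in the additional file, while the class structure of 64 classes with 16 sequences each does not depend on that choice.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"                      # A, C, G, T <-> 0, 1, alpha, alpha^2
EXP, LOG = [1, 2, 3], {1: 0, 2: 1, 3: 2}
H = [[2, 1, 3, 1, 0],               # assumed parity-check matrix
     [3, 1, 2, 0, 1]]

def gf4_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

def center(seq):
    """Return the code word (as DNA) of the class that seq belongs to."""
    word = [BASES.index(base) for base in seq]
    s = [0, 0]
    for r in range(2):
        for h, z in zip(H[r], word):
            s[r] ^= gf4_mul(h, z)
    if s != [0, 0]:
        for i in range(5):
            for a in (1, 2, 3):
                if [gf4_mul(a, H[0][i]), gf4_mul(a, H[1][i])] == s:
                    word[i] ^= a    # correct the single differing digit
    return "".join(BASES[d] for d in word)

# Group all 4^5 = 1024 subsequences by their center.
classes = defaultdict(list)
for letters in product(BASES, repeat=5):
    seq = "".join(letters)
    classes[center(seq)].append(seq)

print(len(classes), sorted(set(len(members) for members in classes.values())))
print(classes["AAAAA"])
```

Each class is exactly a center plus its fifteen 1-mismatch neighbors, which is why every subsequence has one and only one hash key.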


Algorithms

We propose a hash table for genome mapping whose hash keys are code words of PHC. Then we show its use and efficiency in finding genome positions of n-mismatches and n-gaps. Following is a description of the notation used in this section.

Preparing the hash table

There are two ways to construct hash tables for mapping short DNA sequences onto a genome. One uses subsequences of the genome as hash keys to store their genome positions in a hash table. The other uses subsequences of the short DNA sequences as hash keys. Because both of these use DNA subsequences as hash keys, our method can be applied to either. In the following, we use the former in the explanation.

The hash table of our method uses the representative subsequence of the equivalence classes as hash keys. The representative is the code word on PHC. Without loss of generality, the information digits of the code words can be used as the hash key. Given an (^{2}), let


**Proposed hash table.** Entries of proposed hash tables. Subsequence “AAAAA” is used as the key for storing the genome positions of “ACAAA” and “AAATA” because “AAAAA” is the center of the equivalence class that “ACAAA” and “AAATA” belong to. The left hash table uses the center subsequence itself as a hash key. The right one uses the short code of the center subsequence as a hash key, where the short code is the information digits of the code. Short codes are described in Section
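The construction of such a table can be sketched as follows, using the center subsequence itself as the key (the left-hand variant in the figure). The toy genome, the 0-based positions, the function names, and the parity-check matrix `H` are assumptions made for this example, not details taken from the authors' software.

```python
BASES = "ACGT"                      # A, C, G, T <-> 0, 1, alpha, alpha^2
EXP, LOG = [1, 2, 3], {1: 0, 2: 1, 3: 2}
H = [[2, 1, 3, 1, 0],               # assumed parity-check matrix
     [3, 1, 2, 0, 1]]

def gf4_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

def center(seq):
    """Return the code word (as DNA) of the class that seq belongs to."""
    word = [BASES.index(base) for base in seq]
    s = [0, 0]
    for r in range(2):
        for h, z in zip(H[r], word):
            s[r] ^= gf4_mul(h, z)
    if s != [0, 0]:
        for i in range(5):
            for a in (1, 2, 3):
                if [gf4_mul(a, H[0][i]), gf4_mul(a, H[1][i])] == s:
                    word[i] ^= a
    return "".join(BASES[d] for d in word)

def build_phc_table(genome, length=5):
    """Store each genome position once, under the center of its subsequence."""
    table = {}
    for pos in range(len(genome) - length + 1):
        table.setdefault(center(genome[pos:pos + length]), []).append(pos)
    return table

genome = "TTACAAAGGAAATACC"         # toy genome, made up for illustration
table = build_phc_table(genome)
print(table["AAAAA"])               # positions of "ACAAA" and "AAATA"
```

Since several subsequences share one key, a hit still has to be compared with the genome sequence at the returned position to tell which member of the class was found. With the matrix assumed here the last two digits of a code word are determined by the first three, so the first three digits alone could serve as a shorter key.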

Searching for n-mismatches

In this section we describe how to find genome positions of 1- and 2-mismatch subsequences of a given subsequence

Let

The number of keys |_{n}_{1}(_{5}_{1} × 3, the method reduces the number of hash keys to 41.4% (= 6.625/16) and 47.7% (= 30.53/64), respectively. We summarize these values in the “ratio” column of Table _{1}(_{2}(

Summary of our methods for lengths 5, 21, and 10 to refer to 1- and 2-mismatch and 1- and 2-gap sequences

| length | condition | #keys | #words | ratio | | |
|---|---|---|---|---|---|---|
| 5 | 1-mismatch | 6.625 | 16 | 41.4% | 1 + 15 | 1 + 15^{2} + 54^{3} |
| 5 | 2-mismatches | 27.25 | 106 | 25.7% | 1 + 15 + 90^{2} + 210^{3} + 180^{4} | 1 + 15 + 90^{2} + 170^{3} + 156^{4} |
| 5 | 1-gap | 3.25 | 4 | 81.3% | 4 + 12 | 4 + 60 |
| 5 | 2-gaps | 10 | 16 | 62.5% | 16 + 36^{2} | – ∗^{1} |
| 21 | 1-mismatch | 30.53 | 64 | 47.7% | 1 + 63 | 1 + 63^{2} + 1710^{3} |
| 21 | 2-mismatches | 611.31 | 1954 | 31.3% | 1 + 63^{2} + 4410^{3} + 34020^{4} | 1 + 63^{2} + 5650^{3} + 31500^{4} |
| 21 | 1-gap | 3.81 | 4 | 95.3% | 4 + 60 | 4 + 252 |
| 21 | 2-gaps | 13.87 | 16 | 86.7% | 16 + 84^{2} | 16 + 48^{2} |
| 10: Serialize | 1-mismatch | 12.25 | 31 | 39.5% | 1 + 30^{2} | 1 + 30^{2} + 538^{3} + 1089^{4} + 1620^{5} ∗^{2} |
| 10: Parallelize | 1-mismatch | 13.25 | 31 | 44.1% | 1 + 30 | 1 + 30^{2} + 108^{3} ∗^{3} |

∗^{1} :^{2}: neither the first half nor the second half is a code word. The reference formula when one of the two halves is a code word is 1 + 30^{2} + 267^{2} + 684^{3} + 810^{4}. ∗^{3}: neither the first half nor the second half is a code word. The reference formula when one of the two halves is a code word is 1 + 30^{2} + 42^{2} + 54^{3}.

First, we analyze _{1}(

Case 1:

Case 2:


**Fifteen 1-mismatch subsequences of s when s is not a code word**. Nodes represent subsequences and edges indicate a Hamming distance between two nodes of 1, namely, the relation of 1-mismatch. Edge labels indicate the position of the different digit (nucleotide).

The remaining 12 words belong to six equivalence classes. Assume that word _{H}_{1}(

Because the proportions of Case 1 and Case 2 are respectively 1/16 and 15/16, the expected number of keys in _{1}(

Next, we show an algorithm to calculate the set of hash keys _{1}(

_{1}(

1. _{1}(

2. _{1} (

3. for _{1}(_{1}(

4. return (_{1}(

The algorithm calculates code words 16 times when ^{2}) can be calculated as binary operations.
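The algorithm above can be sketched in Python by collecting the code words of a word and of its fifteen 1-mismatch neighbors, which takes 16 code word calculations for length 5. The function name `hash_keys_1` and the parity-check matrix `H` are assumptions for this example; field elements 0, 1, α, α^{2} are encoded as 0, 1, 2, 3.

```python
from itertools import product

EXP, LOG = [1, 2, 3], {1: 0, 2: 1, 3: 2}
H = [[2, 1, 3, 1, 0],               # assumed parity-check matrix
     [3, 1, 2, 0, 1]]

def gf4_mul(a, b):
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

def codeword(word):
    """Return the code word within Hamming distance 1 of word."""
    s = [0, 0]
    for r in range(2):
        for h, z in zip(H[r], word):
            s[r] ^= gf4_mul(h, z)
    fixed = list(word)
    if s != [0, 0]:
        for i in range(5):
            for a in (1, 2, 3):
                if [gf4_mul(a, H[0][i]), gf4_mul(a, H[1][i])] == s:
                    fixed[i] ^= a
    return tuple(fixed)

def hash_keys_1(word):
    """Hash keys needed to reach word and all of its 1-mismatch words."""
    keys = {codeword(word)}
    for i in range(len(word)):
        for a in (1, 2, 3):         # adding a nonzero element changes digit i
            neighbor = list(word)
            neighbor[i] ^= a
            keys.add(codeword(neighbor))
    return keys

sizes = [len(hash_keys_1(w)) for w in product(range(4), repeat=5)]
print(sorted(set(sizes)), sum(sizes) / len(sizes))
```

Run over all 1024 words of length 5, the key sets have size 1 for code words and 7 otherwise, and the average reproduces the expected value of 6.625 keys quoted in the text.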

By using the set of hash keys _{1}(

where

For Case 1, because

For Case 2, there are 7 hash keys in _{1}(

Each of the equivalence classes of the other hash keys has two 1-mismatch sequences, five 2-mismatch sequences and nine 3-mismatch sequences. The reference formula of _{1}(

Finally, the reference formula for the Case 2 is:

The reference formula shows the proposed method searches many 2- and 3-mismatch sequences. We discuss this feature in Section

The above algorithm and analysis can be applied to the word length 21. The numbers of hash keys are 1 and 31 for Case 1 and Case 2, respectively. Using the rate of occurrences of Case 1 and Case 2, 1/64 and 63/64, respectively, the expected number of hash keys is 30.53. The reference formulas are shown in Table
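The expected numbers of hash keys follow directly from the case proportions and the per-case key counts given in the text. The short sketch below only redoes that arithmetic with exact fractions; the function name `expected_keys` is hypothetical.

```python
from fractions import Fraction

def expected_keys(cases):
    """Weighted average of key counts; cases is a list of (proportion, keys)."""
    return sum(proportion * keys for proportion, keys in cases)

# Length 5: Case 1 (code word) needs 1 key, Case 2 needs 7 keys.
length5 = expected_keys([(Fraction(1, 16), 1), (Fraction(15, 16), 7)])
# Length 21: Case 1 needs 1 key, Case 2 needs 31 keys.
length21 = expected_keys([(Fraction(1, 64), 1), (Fraction(63, 64), 31)])

print(float(length5), float(length5 / 16))    # keys, ratio to 16 words
print(float(length21), float(length21 / 64))  # keys, ratio to 64 words
```

The two ratios correspond to the 41.4% and 47.7% entries of the summary table.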

To refer to all the entries of 2-mismatches, our method requires 27.25 hash keys, which is 25.7% of the number of subsequences with 2 or fewer mismatches, when the length of the subsequences is 5. Figure _{2}(_{2}(_{1}(_{2}(_{1}(_{2}(_{2}(_{2}(


**Two-mismatch subsequences of s when s is a code word**. Dark-gray nodes are 2-mismatches whose Hamming distance from

In Case 2,


**Two-mismatch subsequences of s when s is not a code word**. The equivalence classes they belong to are classified into four types.

Type 1: 2-mismatch subsequences belonging to this type are neighbors of

Type 2: for some _{H}

Type 3: for some _{H}

Type 4: for some _{H}

Types 1 and 2 are included in _{1}(_{1}(

In the same manner, we can analyze the hash keys _{2}(

and the formula when

Searching for n-gaps

When aligning a DNA subsequence to a genome, there are three types of gaps: gaps in the short DNA sequence, gaps in the genome sequence, and gaps in both. Our method can reduce the number of hash keys needed to refer to gaps in short DNA sequences. Given a subsequence with gaps

Let

When a subsequence

If no

In the same way, when the length is 21, the expected number of hash keys is

Next, we consider a subsequence with two gaps. Let _{1}(_{H}_{H}

For example, if

Length of subsequence

The code length of the PHC is restricted to 5 or 21 in practice. This is inconvenient. Therefore, we next explain ways to elongate the code length. There are four ways to do so. The first way is to simply add nucleotides before and/or after the code words. Two other ways are serialization and parallelization of PHCs. Serialization concatenates two or more PHCs, and the serialized code words are used as the hash key. Parallelization uses two or more hash tables and utilizes the pigeonhole principle, where the pigeons are mismatches or gaps and the holes are the regions without mismatches and gaps. The fourth way is a combination of the three. Figure


**Three ways to elongate the length of subsequences**.

Let _{1}_{2} be a sequence of length 10 and _{1} and _{2} be the subsequences of length 5. The hash key used in the serialization to store _{1})_{2}). In this case, each equivalence class holds 256 subsequences. To refer to the entry of a sequence _{1}_{2} and the 30 1-mismatch sequences, the set of hash keys is:

and the expected number of hash keys is 12.25. To prove this, we consider four cases:

Case 1 _{1} and _{2} are both code words.

Case 2 _{1} is a code word, but _{2} is not.

Case 3 _{2} is a code word, but _{1} is not.

Case 4 neither _{1} nor _{2} is a code word.

In Case 1, use _{1}_{2} as a hash key; this can refer to all the 1-mismatches. The reference formula in this case is the square of the reference formula of length 5, (1 + 15^{2}.

In Case 2, we need to consider two subcases based on the position of the 1-mismatch. When the position is in the first half of _{1}_{2} can refer to all of them. When the position is in the second half, the second half of the hash keys becomes one of the seven words in _{1}(_{2}). Therefore, a set of hash keys is:

Because _{2} ∈ _{1}(_{2}), the hash keys in the second subcase include the hash keys in the first subcase. Therefore, the number of hash keys |_{1}(

In Case 3, similar to Case 2, the set of hash keys is:

and the reference formula is the same as that of Case 2.

In Case 4, the set of hash keys is the union of {_{1}_{1}(_{2})} and {_{2}|_{1}(_{1})}, which correspond to 1-mismatch in the first half and 1-mismatch in the second half, respectively. Because both of these include _{1})_{2}), the number of hash keys is 13 (= 7 + 7 – 1). The reference formula is:

The proportions of Cases 1 through 4 are 1/256, 15/256, 15/256, and 225/256, respectively, so the expected number of hash keys is 12.25.

The parallelization requires two hash tables and each subsequence is stored in both hash tables. Therefore, the total size of the hash tables is twice that of serialization. The hash keys are _{1})_{2}: the first half is PHC, and _{1}_{2}): the second half is PHC. Consequently, two types of equivalence classes are used in the parallelization and each equivalence class holds 16 subsequences. The set of hash keys for 1-mismatch sequences is:

where each set corresponds to one of the two hash tables. The expected number of keys is 13.25.

Let us consider the four cases, which are the same as those for serialization. In case 1, use _{1}_{2} as the hash key for two hash tables and all the entries of 1-mismatch are referred to. In this case, the reference formula is (1 + 15_{1}_{2} is stored in both hash tables, one is subtracted in the formula. Though the number of hash keys appears to be one, it is used twice. Thus the number of hash keys |_{1}(

In Case 2, the hash keys are:

The total number of hash keys is 8 and the reference formula is:

Case 3 is similar to Case 2. In this case the hash keys are:

In Case 4, the hash keys when the first half includes the mismatch are {_{2}|_{1}(_{1})} and those for the second half are {_{1}_{1}(_{2})}. The number of hash keys is 14 and the reference formula is:

The proportions of the cases are the same as for serialization, and the expected number of hash keys is 13.25.
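Both expected values for length 10 can be rechecked from the four case proportions. In this sketch the per-case key counts are taken from the case analysis above: 13 and 14 keys in Case 4 and 8 keys in Cases 2 and 3 of the parallelization are stated explicitly, while the 7 keys for Cases 2 and 3 of the serialization and the 2 keys for Case 1 of the parallelization are our reading of "one of the seven words" and "it is used twice".

```python
from fractions import Fraction

# Proportions of Cases 1 to 4: both halves are code words, only the first,
# only the second, and neither.
PROPORTIONS = [Fraction(1, 256), Fraction(15, 256),
               Fraction(15, 256), Fraction(225, 256)]

SERIAL_KEYS = [1, 7, 7, 13]      # hash keys per case, serialization
PARALLEL_KEYS = [2, 8, 8, 14]    # hash keys per case, parallelization

def expected(keys):
    """Expected number of hash keys over the four cases."""
    return sum(p * k for p, k in zip(PROPORTIONS, keys))

print(float(expected(SERIAL_KEYS)), float(expected(PARALLEL_KEYS)))
```

Parallelization costs exactly one more key in every case, which is why its expected value is larger by one.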

Discussion

To search genome positions of

The increasing demand to map massive amounts of short DNA sequences to genomes is inevitable. Because the number of short sequences is enormous, it is difficult to ensure finding all genome positions of 1-mismatches in a practical computation time. Therefore, faster methods are required and the proposed method is a step in that direction. We have shown that the proposed method can reduce the number of keys necessary to find the genome positions of n-mismatches. The main idea behind the method is to classify the subsequences into equivalence classes using PHC. Because equivalence classes contain multiple subsequences, our method can increase the density of the hash table over that of the usual method. That is to say, our method can use longer subsequences.

For example, the human genome is about 3G bases long. When it is stored with subsequences of length 21 in the usual way, the density of the hash table is 3 × 10^{9}/4^{21} ≈ 0.07%. On the other hand, with a hash table built by our proposed method using the (21,18)-PHC, the density is 3 × 10^{9}/[# of equivalence classes] = 3 × 10^{9}/4^{18} ≈ 4.4%. The efficiency of genome-mapping programs is sensitive to the length of the subsequence, and the longer the better, for a given density of the hash table. Therefore, the proposed method has an advantage from this point of view.
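The density comparison is a one-line calculation. The sketch below takes the genome size as exactly 3 × 10^9 bases, which is an assumption (the text only says about 3G), and counts one entry per genome position; the names are hypothetical.

```python
GENOME_SIZE = 3 * 10**9          # assumed: "about 3G bases"

def density(entries, key_space):
    """Fraction of possible hash keys that are occupied, at most one entry each."""
    return entries / key_space

plain = density(GENOME_SIZE, 4**21)   # one key per 21-mer
phc = density(GENOME_SIZE, 4**18)     # one key per (21,18)-PHC code word

print(f"plain 21-mer keys: {plain:.4%}")
print(f"PHC code word keys: {phc:.4%}")
print(f"gain: {phc / plain:.0f}x")
```

The gain is a factor of 4^3 = 64 regardless of the genome size, because each equivalence class of the (21,18)-PHC holds 64 subsequences.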

In describing the effectiveness of the proposed method, we assume that the computation time for a code word is far shorter than that of a hash reference. In practice, the calculation of the syndrome using the parity-check matrix of the Hamming code is very fast, even on GF(2^{2}), and so it is easy to calculate the code word from a subsequence. Also, the calculation is small enough to be executed within a CPU cache. On the other hand, the hash table is larger than the CPU caches. Some hash tables exceed the size of main memory because the number of entries is almost equal to the length of the target genome. A hash reference is thus clearly slower than the calculation of a code word. Consequently, the advantage of reducing the hash references outweighs the cost of the additional work to calculate code words. With these advantages, our method will help to implement faster genome mapping programs.

Conclusions

This paper shows that a perfect Hamming code can reduce the number of hash references for hash-based genome mapping. The method encodes subsequences to perfect Hamming codes on GF(2^{2}) and uses them as hash keys. It can reduce by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence. As the amount of data that DNA sequencers generate continues to increase and more accurate genome mappings are required, our method will help to develop faster genome mapping software.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YT provided the key idea and the mathematical analyses. SS and HM provided valuable help on the topics. All authors discussed the results and commented on the manuscript.

Acknowledgements

This work was partially supported by KAKENHI (22680023) and (22310125).

This article has been published as part of