Australian Centre for Plant Functional Genomics, The University of Adelaide, Urrbrae, SA 5064, Australia

Phenomics and Bioinformatics Research Centre, University of South Australia, Mawson Lakes, SA 5095, Australia

ACRF South Australian Cancer Genome Facility, Centre for Cancer Biology, SA Pathology, Adelaide, SA 5000, Australia

School of Molecular and Biomedical Science, University of Adelaide, Adelaide, SA 5000, Australia

Abstract

Background

Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology; however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors due to the high number of nucleotides being sequenced. Distinguishing sequencing errors from true biological variants is a challenging task, and it is even more difficult for organisms without a reference genome.

Results

We have developed a method for the correction of sequencing errors in data from the Illumina Solexa sequencing platforms. It does not require a reference genome and is of relevance for microRNA studies, unsequenced genomes, variant detection in ultra-deep sequencing and even for RNA-Seq studies of organisms with sequenced genomes where RNA editing is being considered.

Conclusions

The derived error model is novel in that it allows different error probabilities for each position along the read, in conjunction with different error rates depending on the particular nucleotides involved in the substitution, and does not force these effects to behave in a multiplicative manner. The model provides error rates which capture the complex effects and interactions of the three main known causes of sequencing error associated with the Illumina platforms.

Background

The combination of a high read depth and the highly expressed nature of some sequences can result in some reads occurring millions of times in a next generation sequencing data set. For these situations, even very low error rates may still result in the presence of a multitude of sequence variants. Distinguishing these variants from true biological variants is a technological and computational challenge. In many species, this difficulty is compounded by the lack of an available reference genome.

The importance of identifying and correcting sequence errors has been highlighted by the recent discussion prompted by the report of the presence of widespread differences between the human genome (DNA) and reads derived from the corresponding RNA

When the genome of an organism has not been sequenced and assembled, identifying possible sequencing errors is considerably more difficult, which necessitates the development of alternative analysis methods.

Sequencing errors arising from the use of Illumina sequencers, on which we concentrate, can occur for a variety of reasons. One source of error is a phenomenon referred to as crosstalk, which occurs when the emission frequencies of the dyes used in sequencing machines overlap.

This overlap can lead to confusion of the nucleotide G with nucleotide T, and of A with C

The issue of sequencing errors is so ubiquitous that being able to detect and correct them is essential in many areas of molecular biology, particularly in the identification of miRNAs. In

This work is typical of procedures that rely on the availability of a reference genome and many methods and software packages have been developed for the detection and/or correction of sequencing errors in this setting

A different approach, that does not rely on the existence of a sequenced genome, was adopted in

where $p_{\text{error}}$ is the overall probability of an error, $p_{\text{error}}$(

A probabilistic model for predicting the occurrence of sequencing errors in short RNA reads proposed in

**Connected subgraph of sequences.** A connected subgraph of sequences of length 21 from an Illumina HiSeq data set. The most abundant sequence in this subgraph occurred 45,484 times and is represented by the largest node (filled circle).

**A larger connected subgraph.** A connected subgraph of sequences of length 21 from an Illumina GA data set. The most abundant sequence in this subgraph occurred 165,504 times in the data set. The size of the nodes (filled circles) representing each sequence is proportional to their abundance. The edges connect sequences that vary in one position only.

**Number of vertices plotted against sequence abundance.** Number of vertices for each parent node (Y) plotted against abundance (X) for sequences of length 21. The theoretical curve given by the function ^{X}] (

where

In the following sections we present a method for modelling sequencing errors, extending the graph-based approach described in

Methods

We extend the approach of

The number of reads per sample and lane ranged from 5 to 41 million, and several sequencing runs were performed approximately two years apart. The first samples were 36-base reads run on the Illumina Genome Analyzer (GeneWorks Pty. Ltd., c. 2009) and the second were 50-base reads from the Illumina HiSeq with Illumina TruSeq v3 reagents (Australian Genome Research Facility Ltd., December 2011). We use the term individual data set for the sequence data from a particular lane; lanes are numbered from 1 to 8 according to their physical location on the flow cell. We refer to lanes 4 and 5 as the innermost lanes and lanes 1 and 8 as the outermost.

Data processing and graph construction

Processing began by trimming the 3′ adaptors from the sequences. A number of mismatches to the adaptor was allowed, depending on the length of the matching section, as described in

Graphs were constructed, according to the model of

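**Illumina GA lane 2**

| **Length** | **Frequency of subgraph sizes: 1** | **2-20** | **20-40** | **40 +** | **Largest** |
|---|---|---|---|---|---|
| 20 | 33,170 | 2,530 | 27 | 17 | 992 |
| 21 | 132,373 | 11,992 | 170 | 105 | 1,048 |
| 22 | 86,118 | 7,078 | 63 | 32 | 387 |
| 23 | 171,287 | 9,714 | 79 | 47 | 296 |
| 24 | 1,277,008 | 101,108 | 1,264 | 757 | 2,030 |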

Excluding adaptor trimming, the graphs are created in approximately 40 minutes for a file of 6 million 35-base reads (on a single processor of a PC running 32-bit Windows XP with 3.45 GB of RAM). Our algorithm was not parallelised, but could be, which would greatly reduce the processing time. A similar amount of time is required to build the error models and correct the graphs. The full source code is publicly available
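The graph construction described in this section can be illustrated with a minimal sketch. It assumes only what the figure captions state: nodes are distinct sequences, and an edge joins two sequences of the same length that differ at exactly one position. The function names `build_graph` and `connected_subgraphs` are hypothetical, and this is not the authors' published implementation.

```python
from collections import Counter, defaultdict

def build_graph(reads):
    """Return (abundance per distinct sequence, set of one-mismatch edges)."""
    counts = Counter(reads)
    # Bucket sequences by (masked position, prefix, suffix): two sequences
    # share a bucket exactly when they differ at most at that position.
    buckets = defaultdict(list)
    for seq in counts:
        for i in range(len(seq)):
            buckets[(i, seq[:i], seq[i + 1:])].append(seq)
    edges = set()
    for group in buckets.values():
        for idx, a in enumerate(group):
            for b in group[idx + 1:]:
                edges.add((a, b) if a < b else (b, a))
    return counts, edges

def connected_subgraphs(counts, edges):
    """Group sequences into connected subgraphs by depth-first search."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in counts:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components
```

Bucketing by masked position avoids comparing every pair of sequences, which matters when a lane contains millions of reads; whether the original software uses the same strategy is not stated here.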

Error model

More reliable error statistics can be extracted from sequences that appear a large number of times and have many sequence variants. Hence, for this purpose, we have chosen to select a subset of large subgraphs based on a user-defined threshold on the minimum number of nodes. These large subgraphs are then used to build a model of the error rate. Furthermore, to exclude as many graphs containing a true biological variant as possible, we introduce an additional series of thresholds,

where $n_{\text{parent}}$ is the number of times the parent sequence appears in the data set and $n_{\text{total}}$ is the sum of the frequencies of sequences in the subgraph. Starting from graphs satisfying the highest parental abundance threshold, we analyse the children of the most abundant sequence, recording the abundances of the sequence and each child sequence, the position along the read where the child differs from the parent, and the nucleotide substitution that has occurred.
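As an illustration of this selection step, the sketch below assumes that the threshold is applied to the ratio of the parent abundance to the total subgraph abundance (the "parental abundance ratio" referred to later in this section). The names `parent_ratio`, `child_records` and `select_graphs`, and the representation of a subgraph as a dictionary from sequence to abundance, are hypothetical.

```python
def parent_ratio(subgraph):
    """Return (parent sequence, parent abundance / total subgraph abundance)."""
    parent = max(subgraph, key=subgraph.get)  # most abundant sequence
    return parent, subgraph[parent] / sum(subgraph.values())

def child_records(subgraph):
    """List (position, parent base, child base, child abundance) per child.

    Positions are 1-based. Only sequences that differ from the parent at
    exactly one position are treated as its children.
    """
    parent, _ = parent_ratio(subgraph)
    records = []
    for seq, abundance in subgraph.items():
        if seq == parent or len(seq) != len(parent):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(parent, seq)) if a != b]
        if len(diffs) == 1:
            i = diffs[0]
            records.append((i + 1, parent[i], seq[i], abundance))
    return records

def select_graphs(subgraphs, min_nodes, threshold):
    """Keep large subgraphs whose parent dominates the total abundance."""
    return [g for g in subgraphs
            if len(g) >= min_nodes and parent_ratio(g)[1] >= threshold]
```

For example, a subgraph `{"ACGT": 90, "ACGA": 6, "CCGT": 4}` has a parental abundance ratio of 0.9 and yields two child records, one at position 1 (A read as C) and one at position 4 (T read as A).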

From this information we calculate, for each graph, a probability of error for each combination of nucleotide substitution pattern and position along the read. We use a weighted average of all the individual probabilities, weighted by the abundance of the parent sequence, to determine our overall probability estimates. For example, given the per-graph estimates for position $i$ and substitution pattern $j$, we would calculate our probability estimate using the following formula

where

Using this estimate, and assuming that our data may be modelled as coming from a binomial distribution
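The abundance-weighted average described above can be sketched as follows. The function name `weighted_estimate` is hypothetical, and the sketch takes the per-graph probabilities as given, since how each per-graph probability is computed is not restated here.

```python
def weighted_estimate(estimates, parent_abundances):
    """Average per-graph error probabilities, weighting each estimate by the
    abundance of its graph's parent sequence."""
    if len(estimates) != len(parent_abundances):
        raise ValueError("one weight is needed per estimate")
    total_weight = sum(parent_abundances)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(w * p for p, w in zip(estimates, parent_abundances))
    return weighted_sum / total_weight
```

With two graphs giving estimates of 0.001 and 0.003 and parent abundances of 3,000 and 1,000, the combined estimate is (3 + 3) / 4,000 = 0.0015, closer to the estimate from the more abundant parent.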

We perform these calculations, as described in the preceding paragraph, beginning at a parental abundance ratio of 90% and working downwards. While the higher thresholds provide more reliable estimates, the number of graphs selected is small, so not every possible nucleotide substitution is seen at every position along the reads. Thus, we employ an iterative process to fill gaps in our estimates with probabilities derived from the subset of graphs with the next highest proportion threshold. In this way, we derive error probability estimates for all or most of the nucleotide substitutions at each position along the read. We found that exponential curves provided a satisfactory fit to the data and the best theoretical fit to the expected error increase due to the phasing phenomenon. Consequently, we fitted exponential curves to these error estimates for each substitution type between positions 2 and 24. This helped to further eliminate any effects of outliers (i.e., true biological variants) that were not rectified in the previous steps, and provided values for substitution-position combinations that were not observed in previous steps. An example of this, for Illumina GA data and the case of A being misread as C, is shown in Figure
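The iterative gap-filling step can be sketched as a merge that always prefers the estimate from the highest available threshold. The data layout (a dictionary from threshold to a dictionary keyed by position and substitution) and the name `fill_gaps` are assumptions made for illustration only.

```python
def fill_gaps(estimates_by_threshold):
    """Merge estimates across thresholds, highest threshold first.

    estimates_by_threshold maps a parental abundance threshold (e.g. 0.9)
    to {(position, substitution): probability}. A value from a lower
    threshold is used only where no higher threshold supplied one.
    """
    merged = {}
    for threshold in sorted(estimates_by_threshold, reverse=True):
        for key, probability in estimates_by_threshold[threshold].items():
            merged.setdefault(key, probability)
    return merged
```

Combinations that remain missing after the lowest threshold are left out of the result; in the procedure described above they would be supplied by the fitted exponential curves.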

**Example model fit.** Data points and fitted model for the probability of an A being misread as a C, for **(a)** an Illumina GA data set and **(b)** an Illumina HiSeq data set.
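A curve of the form $Ae^{bx}$ can be fitted to the per-position estimates in several ways. The sketch below uses ordinary least squares on the logarithm of the probabilities, which is an assumption: the fitting procedure actually used is not specified in this section, and a non-linear fit on the original scale would weight the points differently. The name `fit_exponential` is hypothetical.

```python
import math

def fit_exponential(positions, probabilities):
    """Fit p = A * exp(b * x) by least squares on log(p); return (A, b).

    Zero probabilities are skipped because their logarithm is undefined.
    """
    points = [(x, math.log(p)) for x, p in zip(positions, probabilities) if p > 0]
    if len(points) < 2:
        raise ValueError("at least two positive estimates are needed")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx                       # slope on the log scale
    a = math.exp(mean_y - b * mean_x)   # intercept, back-transformed
    return a, b
```

Applied to points generated exactly from $A = 1.4 \times 10^{-4}$ and $b = 0.11$ at positions 2 to 24, the function recovers both parameters to within floating-point error.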

Our method does not assume that position and error type effects work multiplicatively. Our generalisation to account for these effects is simply

where $R$ is the nucleotide substitution pattern $N_1 \rightarrow N_2$. Note that we do not enforce that

and hence are able to model non-multiplicative effects.
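One way to see what a non-multiplicative model allows is to compare two substitution types across positions. Under a factorised model, the ratio of their error probabilities would be the same at every position; with a separate exponential curve per substitution type it is not. The sketch below uses the values of $A$ and $b$ reported for Illumina GA lane 2 (A → C and A → G) in the parameter table; the function names and the `"A>C"` key format are hypothetical.

```python
import math

# (A, b) for two substitution types, Illumina GA lane 2, positions 2 to 24.
PARAMS = {"A>C": (1.4e-4, 0.11), "A>G": (2.0e-4, 0.04)}

def error_prob(position, substitution, params=PARAMS):
    """Evaluate the fitted curve A * exp(b * position) for one substitution."""
    a, b = params[substitution]
    return a * math.exp(b * position)

def ratio(position):
    """Ratio of A>C to A>G error probabilities at a given position."""
    return error_prob(position, "A>C") / error_prob(position, "A>G")

# A multiplicative model would force ratio(2) == ratio(24).
change = ratio(24) / ratio(2)  # equals exp((0.11 - 0.04) * 22)
```

Here `change` is well above 1, so A → C errors grow faster along the read than A → G errors, an interaction between position and substitution type that a product of a position effect and a substitution effect cannot represent.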

The model described above is used to find and correct sequencing errors by comparing the observed sequence abundances with those predicted by the model. Statistical hypothesis testing is used for this purpose, with the null hypothesis being that a given sequence is a sequencing error. Sequences for which the null hypothesis is rejected are classified as true biological variants; the remaining sequences are classified as sequencing errors.
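The test described above can be sketched with a one-sided binomial tail probability. Several details are assumptions made for illustration: that the number of trials is the parent abundance, that the test is one-sided, and that the significance level `alpha` is 0.05. The function names are hypothetical and no correction for multiple testing is applied.

```python
import math

def log_binom_pmf(k, n, p):
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def upper_tail(k, n, p):
    """P(X >= k): chance of k or more erroneous copies arising by error alone."""
    if k <= 0:
        return 1.0
    total = sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, total)

def classify(child_count, parent_count, error_prob, alpha=0.05):
    """Null hypothesis: the child sequence is a sequencing error of its parent."""
    p_value = upper_tail(child_count, parent_count, error_prob)
    return "variant" if p_value < alpha else "error"
```

With a parent seen 10,000 times and a modelled error probability of 0.001, about 10 erroneous copies are expected; a child seen 12 times is consistent with error, whereas a child seen 100 times is far too abundant and would be classified as a variant.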

Results and discussion

Modelled error rate results for a selection of data sets are shown in the figure and table below, where the fitted curves have the form $Ae^{bx}$.

**Modelled error rates.** Modelled error rates from **(a)** an Illumina GA data set (lane 2), **(b)** an Illumina GA data set (lane 4) and **(c)** an Illumina HiSeq data set (lane 2).

Illumina GA

The G → T substitution error rate, which is the highest in the Illumina GA data sets (Figures

Probabilities for position 1 and exponents of the fitted exponential curves, $Ae^{bx}$, for positions 2 to 24 for the data sets corresponding to Figure

| **Error** | **Illumina GA lane 2: Position 1** | **Illumina GA lane 2: A** | **Illumina GA lane 2: b** | **Illumina GA lane 4: Position 1** | **Illumina GA lane 4: A** | **Illumina GA lane 4: b** | **Illumina HiSeq lane 2: Position 1** | **Illumina HiSeq lane 2: A** | **Illumina HiSeq lane 2: b** |
|---|---|---|---|---|---|---|---|---|---|
| A → C | 1.4E-03 | 1.4E-04 | 0.11 | 8.2E-04 | 2.2E-04 | 0.08 | 2.7E-04 | 3.7E-05 | 0.06 |
| A → G | 5.1E-04 | 2.0E-04 | 0.04 | 4.8E-04 | 1.7E-04 | 0.07 | 4.1E-04 | 1.5E-04 | 0.03 |
| A → T | 4.1E-04 | 3.4E-05 | **0.15** | 2.1E-04 | 5.8E-05 | **0.12** | 2.0E-04 | 5.6E-05 | 0.03 |
| C → A | **2.8E-03** | **4.3E-04** | 0.07 | 7.9E-04 | **3.9E-04** | 0.07 | 6.6E-05 | 6.9E-05 | 0.05 |
| C → G | 4.2E-04 | 8.0E-05 | 0.05 | 2.9E-04 | 5.3E-05 | 0.09 | 1.6E-04 | 5.2E-05 | 0.04 |
| C → T | 6.3E-04 | 2.1E-04 | 0.07 | 5.9E-04 | 1.9E-04 | 0.08 | 6.2E-04 | 3.1E-04 | -0.01 |
| G → A | 4.3E-04 | 1.6E-04 | 0.05 | 3.7E-04 | 2.0E-04 | 0.03 | 6.1E-04 | 4.7E-04 | -0.08 |
| G → C | 5.1E-04 | 1.4E-04 | 0.10 | 7.8E-04 | 1.3E-04 | 0.09 | 6.9E-05 | 3.1E-04 | -0.11 |
| G → T | 1.5E-03 | 3.5E-04 | 0.10 | **1.0E-03** | 3.3E-04 | 0.08 | **1.2E-03** | **7.7E-04** | -0.13 |
| T → A | 3.6E-04 | 7.4E-05 | 0.08 | 2.4E-04 | 1.0E-04 | 0.06 | 1.4E-04 | 5.4E-05 | 0.05 |
| T → C | 6.1E-04 | 3.5E-04 | 0.04 | 5.6E-04 | 3.6E-04 | 0.04 | 5.1E-04 | 1.4E-04 | 0.02 |
| T → G | 3.3E-04 | 2.8E-04 | 0.05 | 3.4E-04 | 2.7E-04 | 0.08 | 1.3E-04 | 2.0E-05 | **0.15** |

By comparing Figure

Illumina HiSeq

The error profiles of the sequenced reads from lane 2 of the Illumina HiSeq data (Figure

Evaluation

To address the difficult matter of evaluation we undertook two benchmarking analyses. Firstly, we applied our model to a simulated data set and secondly we checked the performance of our model by correcting reads from an organism with a known reference genome.

In our creation of a simulated data set, for the sake of comparison, we used the error probabilities from each position and transition that were found in the Illumina GA lane 2 data set. We then took a data set of short RNA reads thought to contain no sequencing errors and randomly simulated errors based on the given error rates and the corresponding binomial distributions. The data set was processed by our method in the same way as the other data sets. The resulting error model parameters are summarised in Table
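The simulation step can be sketched as follows, assuming that at each position of each read at most one substitution is drawn, with probabilities supplied by a function of position and substitution. Drawing independently for every copy of a sequence gives binomially distributed error counts across copies, which is how the binomial distributions mentioned above are interpreted here. The names `simulate_errors` and `error_prob` are hypothetical.

```python
import random

BASES = "ACGT"

def simulate_errors(read, error_prob, rng):
    """Return a copy of `read` with random substitution errors.

    error_prob(position, from_base, to_base) gives the probability that
    from_base at the 1-based position is misread as to_base.
    """
    out = []
    for position, base in enumerate(read, start=1):
        new_base = base
        if base in BASES:  # leave ambiguous calls such as N unchanged
            u = rng.random()
            cumulative = 0.0
            for other in BASES:
                if other == base:
                    continue
                cumulative += error_prob(position, base, other)
                if u < cumulative:
                    new_base = other
                    break
        out.append(new_base)
    return "".join(out)
```

A usage example with a fixed seed for reproducibility: `simulate_errors("ACGTACGT", lambda pos, a, b: 1e-3, random.Random(0))`. In practice `error_prob` would look up the position 1 probabilities and evaluate the fitted curves for positions 2 onwards.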

**Simulated data**

Probabilities for position 1 and exponents of the fitted exponential curves, $Ae^{bx}$, for positions 2 to 24 for the simulated data set. The corresponding figure is shown in Additional file

| **Error** | **Position 1** | **A** | **b** |
|---|---|---|---|
| A → C | 1.4E-03 | 1.8E-04 | 0.10 |
| A → G | 6.8E-04 | 2.3E-04 | 0.04 |
| A → T | 4.6E-04 | 4.6E-05 | **0.15** |
| C → A | **3.0E-03** | **4.7E-04** | 0.07 |
| C → G | 4.3E-04 | 8.0E-05 | 0.08 |
| C → T | 8.3E-04 | 2.4E-04 | 0.06 |
| G → A | 4.9E-04 | 2.0E-04 | 0.06 |
| G → C | 5.3E-04 | 1.4E-04 | 0.11 |
| G → T | 1.8E-03 | 2.7E-04 | 0.14 |
| T → A | 4.1E-04 | 1.2E-04 | 0.06 |
| T → C | 5.9E-04 | 2.9E-04 | 0.06 |
| T → G | 3.9E-04 | 4.5E-04 | 0.02 |

**Modelled error rates.** Modelled error rates from a data set with simulated errors according to the pattern found in the data set of Figure

To further evaluate our model we studied HiSeq reads from a publicly available PhiX data set (SRA accession number SRS267273; SRX101468)

Sequence counts comparing our model predictions of correct and erroneous sequences to results obtained by mapping the sequences to the corresponding genome.

| **Genome mapping** | **Model prediction: correct sequences** | **Model prediction: erroneous sequences** |
|---|---|---|
| Exact match | 10115 | 8 |
| 1 mismatch | 2779 | 64137 |
| 2 mismatches | 164 | 17636 |
| 3 mismatches | 14 | 3217 |
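The counts above can be summarised with a short calculation. It assumes that every sequence with one or more mismatches to the PhiX genome is a sequencing error and that exact matches are correct; the function names are hypothetical.

```python
# Mismatches to the genome -> (predicted correct, predicted erroneous).
COUNTS = {0: (10115, 8), 1: (2779, 64137), 2: (164, 17636), 3: (14, 3217)}

def error_detection_rate(counts):
    """Fraction of mismatching sequences that the model flags as erroneous."""
    flagged = sum(err for mm, (_, err) in counts.items() if mm > 0)
    total = sum(cor + err for mm, (cor, err) in counts.items() if mm > 0)
    return flagged / total

def exact_match_retention(counts):
    """Fraction of exactly matching sequences that the model keeps as correct."""
    correct, erroneous = counts[0]
    return correct / (correct + erroneous)
```

Of the 87,947 sequences with at least one mismatch, 84,990 are flagged as erroneous, a rate of 96.64%, which agrees with the figure quoted in the Conclusions. Of the 10,123 exactly matching sequences, 10,115 (99.92%) are retained as correct.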

Conclusions

We have proposed a model of sequencing errors that is flexible enough to incorporate known sources of error intrinsic to the Illumina sequencing technologies and that does not rely on the availability of a reference genome for error detection. We have demonstrated the advantages of using a non-factorisable model, particularly necessitated by the presence of accumulated T fluorophores in the Illumina GA data, and other unknown non-multiplicative effects in the Illumina HiSeq data. The method described herein is potentially applicable not only to short RNA reads but also to other sequencing activities where a reliable sequenced genome is not available, such as in the field of metagenomics, where a mixed sample containing reads from many organisms is sequenced, or when trying to distinguish sequencing errors from single nucleotide polymorphisms. While, as discussed in the results section, our model performs well in identifying sequencing errors (our method identifies at least 96.64% of errors in the example PhiX data set), we note that our model may not account for some errors that arise before the sequences enter a flow cell, e.g. during reverse transcription or library amplification. These errors may lack a highly abundant parent sequence and thus are difficult to identify without a reference genome.

A possible direction to improve this model is to include the investigation of the role of single and multiple preceding or following bases in determining error rates. The inclusion in the model of error prone positions, such as those reported in

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JS performed the analyses, participated in the design and drafted the manuscript. AS conceived of the study, participated in the design and analyses and helped to draft the manuscript. UB participated in the design and analyses and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Professor Stan Miklavcic for feedback on the manuscript, Dr Chris Brien for useful statistical discussions, Mr John Toubia for technical assistance and Dr Bu-Jun Shi for providing the data used for analysis. This work was supported through funding from the Australian Research Council, Grains Research and Development Corporation, and the Government of South Australia.