Analysis of quality raw data of second generation sequencers with Quality Assessment Software

Ramos, Rommel TJ; Carneiro, Adriana R; Baumbach, Jan; Azevedo, Vasco; Schneider, Maria PC; Silva, Artur

doi:10.1186/1756-0500-4-130

Technical Note
Open access
Published: 18 April 2011

Analysis of quality raw data of second generation sequencers with Quality Assessment Software

Rommel TJ Ramos¹,
Adriana R Carneiro¹,
Jan Baumbach³,
Vasco Azevedo²,
Maria PC Schneider¹ &
…
Artur Silva¹

BMC Research Notes volume 4, Article number: 130 (2011) Cite this article

8489 Accesses
17 Citations
Metrics details

Abstract

Background

Second generation technologies have advantages over Sanger; however, they have resulted in new challenges for the genome construction process, especially because of the small size of the reads, despite the high degree of coverage. Independent of the program chosen for the construction process, DNA sequences are superimposed, based on identity, to extend the reads, generating contigs; mismatches indicate a lack of homology and are not included. This process improves our confidence in the sequences that are generated.

Findings

We developed Quality Assessment Software, with which one can review graphs showing the distribution of quality values from the sequencing reads. This software allow us to adopt more stringent quality standards for sequence data, based on quality-graph analysis and estimated coverage after applying the quality filter, providing acceptable sequence coverage for genome construction from short reads.

Conclusions

Quality filtering is a fundamental step in the process of constructing genomes, as it reduces the frequency of incorrect alignments that are caused by measuring errors, which can occur during the construction process due to the size of the reads, provoking misassemblies. Application of quality filters to sequence data, using the software Quality Assessment, along with graphing analyses, provided greater precision in the definition of cutoff parameters, which increased the accuracy of genome construction.

Background

The introduction of second-generation genome sequencing has reduced the cost and time required for genome construction; this method generates large amounts of data and increased sequencing coverage when compared to the dideoxy terminal Sanger method [1]. However, this new methodology reduces the size of the readings and has brought challenges to the genome assembly process, such as a need to develop efficient algorithms to reconstruct the genome [2]. Several examples of programs suitable for genome assembly from short reads are Velvet [3], Edena [4], SHARCGS [5], VCAKE [6], ALLPATHS [7], Euler-SR [8], and Quality-value guided Short Read Assembler (QSRA) [9]. All of them involve a process of connecting overlapping DNA sequences; however, only QRSA considers the quality of the reads during the assembly process.

Regardless of the assembly method used, data preparation is necessary. One step in this preparation is the quality filter, whenever readings are taken with a lower phred quality [10]. Independent of the genome construction system, it is necessary to prepare the data. One of the steps in data preparation is a quality filter, with which reads with low phred quality are removed. This improves the alignment of the sequences to avoid problems due to mismatches [11]. Li et al. (2010) observed a 50% decrease in alignment errors when bases screened for quality were used; this is an important part of the preparation required for producing accurate results.

The cutoff value for read quality affects the coverage and especially the quality of sequencing. Very stringent parameters can reduce the coverage of the genome and hinder the assembly process. Also, using poor-quality bases that are products of mismatches can lead to less accurate results. To address this problem, we developed the software Quality Assessment (QA), with which one can review graphs showing the distribution of quality values from the sequencing reads, including the average quality, and the accumulated quality for each of the bases; this information can be used to estimate the coverage and quantity of the readings that pass through the quality filter.

Input format

QA receives two files as input: the first with standard-only Phred quality values for each base of a read, and the second containing the sequences in nucleotides or color space (SOLiD). The input files must have equal size sequences such as those generated by the SOLiD and Illumina platforms in order to be used for the generation of quality graphs.

Sample Data

The data that we tested with this software were obtained from sequencing of Corynebacterium pseudotuberculosis (Cp162) and Exiguobacterium antarcticum (B7) with SOLiD system, using a library of fragments with readings of 35 base pairs (bp) and a mate-pair library with 25 bp for each tag, F3 and R3, respectively [12]. We obtained 21,102,241 readings from the Cp162 data, and 44,171,676 and 45,024,226 readings, from the B7 tags F3 and R3, respectively.

The estimated genome coverage was obtained using the formula C = (n * L)/S, where C is the estimated coverage, n is the number of readings, L is the size of the reads and S is the expected size of the genome [13]. The expected sizes for the genomes used in this study were defined based on phylogenetically-related organisms deposited in Genbank. For Cp162, a size of 2.3 mega bases (Mb) was obtained based on Corynebacterium pseudotuberculosis FRC41 (CP002097), and for B7, about 3 Mb was obtained based on Exiguobacterium sibiricum 255-15 (CP001022).

Implementation

The software was developed in JAVA programming language http://java.sun.com/, using the paradigm of object orientation and the graph library Swing http://java.sun.com/docs/books/tutorial/uiswing. Input is raw files from the sequencing machine (multifasta format): (i) files containing the quality values of phred for the readings [14] and (ii) sequences in color space [15] or nucleotide format; this information is solicited only at the time that the quality filter is applied to the data. The software (Figure 1) offers an option in which the size of the expected reads is informed, and when processing is finished it generates a log that shows the multifasta-file-sequence formatting problems: invalid characters, blank lines, and reads that are not of the expected size; these are eliminated in the processing. Optionally, the software can be run without the graphing interface; however, in this option it is not possible to estimate the coverage of the genome, and the phred quality values need to be previously defined.

The raw data file, which includes information on the quality of the sequences, is processed, and the frequencies of the mean and median values for each base are stored in a hash table, to be used to calculate the estimated coverage of the sequencing and for the generation of the graphs that show the distribution of the base quality values and means, using the library JfreeCharthttp://www.jfree.org/jfreechart/.

Applying the filter to the raw data files requires a large memory; for this reason, after the first file is generated, the memory reserved for the execution of the process is liberated to the operational system through the Java language resource know as garbage collection, run by the program itself. The filtered files are stored in the same original directories, with the extension.new added to each file name.

Results and Discussion

The mean quality of each of the 35 sequence bases from the Cp162 data can be observed in Figure 2a; 17 of these gave a mean quality equal to or greater than phred 20, while the terminal bases of the reads had a mean quality of less than 20 [16]. Figure 3a shows the frequency of the quality values of the 35^th base of Cp162, with phred 5 being the most common value, which influences the reduction in mean quality observed for this base in Figure 2a. When a cut off filter of phred 20 was applied, the number of reads was reduced by about 43% (Table 1), resulting in a sequence coverage of 172×. Based on the data in Table 2, application of a filter with phred 23 values would give sequence coverage above 100× and a high degree of accuracy of the reads, which would reduce the possibility of misassemblies [17].

Table 1 Results of applying the Phred quality filter

Full size table

Table 2 Genome coverage analysis for different Phred quality values

Full size table

In the case of IB7 with the tags F3 and R3, both with 25 bp, the mean quality of the bases was above 20, except for the terminal bases shown in Figures 2b and 2c, with tag R3 presenting better quality than F3, except at the 12^th base. The terminal bases, Figures 3b and 3c, had the highest frequencies of quality levels at phred 5 and 26 for F3 and R3, respectively, which allows more stringent filters to be applied to tag R3, without excessive loss of coverage, when compared to F3 (Table 2). After applying the quality filter to F3 and R3, with a cutoff at phred 20, the percentage reads discarded for F3 was greater than for R3 (Table 1).

Defining quality values above phred 20 reduced the coverage of the sequences, but it increased data quality, as can be seen in Table 2, in which phred values of 23 and 25 were used as quality cutoff values.

To apply the filter, one can use mean or median quality values observed in the read. When the mean is used, low quality bases can provoke elimination of the read, which can reduce the coverage of the sequencing, though it will also increase the quality [17]. Variation in the quality values of the bases does not influence the median, which could result in a tendency to accept low quality bases, increasing the probability of errors in the genome construction process.

In Figure 4a, we can see a reading with a median value of 23 and a mean of 19; consequently, if we apply a quality filter with phred 20 based on the median, the read would be considered as having six bases with quality below 10. In Figure 4b, if the same filter were applied, the read would be discarded, even though it has a mean quality value of 22.

Conclusions

Applying a quality filter to raw sequencing data is required in order to reduce sequence construction error, given that the methodologies available for constructing genomes are based on sequence alignment, in which a wrong base can cause a mismatch, making alignment impossible.

The software Quality Assessment allows the operator to visualize quality graphs of the bases in the reads and estimate the coverage based on means or medians, making it possible to select more precise cutoff parameters, reducing the possibility of eliminating high-quality reads or including low-quality reads, which increases the accuracy of the process of constructing genomes from second-generation sequencers.

Availability and Requirements

Project name: QA - Quality Assessment

Project home page: http://qualevaluato.sourceforge.net

Operating system(s): Platform independent

Programming language: Java

Other requirements: Java JDK 1.6 or higher

License: GNU GPL

Restrictions for use: Permission must be obtained from the author for non-academic/non-public use.

References

Bentley S: Taming the Next-gen Beast. Nature Reviews Microbiology. 2010, 8: 161-10.1038/nrmicro2322.
Article PubMed CAS Google Scholar
Schuster SC: Next-generation sequencing transform today's biology. Nat Method. 2008, 5: 16-18. 10.1038/nmeth1156.
Article CAS Google Scholar
Zerbino DR, Birney E: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008, 18: 821-829. 10.1101/gr.074492.107.
Article PubMed CAS PubMed Central Google Scholar
Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De Novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research. 2008, 18: 802-809. 10.1101/gr.072033.107.
Article PubMed CAS PubMed Central Google Scholar
Dohm J, Lottaz C, Borodina T, Himmelbaurer H: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research. 2007, 17: 1697-1706. 10.1101/gr.6435207.
Article PubMed CAS PubMed Central Google Scholar
Jeck W, Reinhardt J, Baltrus D, Hickenbotham M, Magrini V, Mardis E, Dangl J, Jones C: Extending assembly of short DNA sequences to handle error. BMC Bioinformatics. 2007, 23: 2942-2944.
Article CAS Google Scholar
Butler J, MacCallum L, Kleber M, Shlyakhter I, Belmonte M, Lander E, Nusbaum C, Jaffe D: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research. 2008, 18: 810-820. 10.1101/gr.7337908.
Article PubMed CAS PubMed Central Google Scholar
Chaisson M, Pevzner P: Short read fragment assembly of bacterial genomes. Genome Research. 2008, 18: 324-330. 10.1101/gr.7088808.
Article PubMed CAS PubMed Central Google Scholar
Bryant DW, Wong W-K, Mockler TC: QSRA - a quality-value guided de novo short read assembler. BMC Bioinformatics. 2009, 10: 69-10.1186/1471-2105-10-69.
Article PubMed PubMed Central Google Scholar
Ewing B, Hillier L, Wendl M, Green P: Base-Calling of Automated Sequencer Traces Phred. I. Using Accuracy Assessment. Genome Research. 1998, 8: 175-185.
Article PubMed CAS Google Scholar
Smith A, Xuan Z, Zhang M: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008, 9: 128-10.1186/1471-2105-9-128.
Article PubMed PubMed Central Google Scholar
Pandey V, Nutter R, Prediger E: Applied SOLiD System: Ligation-Based Sequencing. Next Generation Genome Sequencing: Towards Personalized Medicine. Edited by: Janitz M. 2008, Berlin: Wiley, 1: 29-41. 1
Chapter Google Scholar
Lande E, Waterman M: Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 1988, 2: 231-239. 10.1016/0888-7543(88)90007-9.
Article Google Scholar
Ewing B, Green P: Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. BMC Bioinformatics. 1998, 8: 186-194.
CAS Google Scholar
SOLiD Sequencing and 2-Base Encoding. [http://www.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057810.pdf]
Gordon D, C A, P G: A graphical tool for sequence finishing. Genome Research. 1998, 8: 195-202.
Article PubMed CAS Google Scholar
Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Briefings Bioinformatics. 2010, 11: 181-197. 10.1093/bib/bbp046.
Article Google Scholar

Download references

Acknowledgements

This work was part of the Rede Paraense de Genômica e Proteômica supported by Fundação de Amparo a Pesquisa do Estado do Pará. V.A.C.A., A.S. and A.C were supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). R.T.J.R. acknolwedges support from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). We are also grateful to Silvanira Barbosa for her help with sequencing of samples.

Author information

Authors and Affiliations

Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, PA, Brazil
Rommel TJ Ramos, Adriana R Carneiro, Maria PC Schneider & Artur Silva
Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil
Vasco Azevedo
Institute for Genome Research and Systems Biology. Center for Biotechnology. Germany Institute for Genome Research, Bielefeld University, Bielefeld, Germany
Jan Baumbach

Authors

Rommel TJ Ramos
View author publications
You can also search for this author in PubMed Google Scholar
Adriana R Carneiro
View author publications
You can also search for this author in PubMed Google Scholar
Jan Baumbach
View author publications
You can also search for this author in PubMed Google Scholar
Vasco Azevedo
View author publications
You can also search for this author in PubMed Google Scholar
Maria PC Schneider
View author publications
You can also search for this author in PubMed Google Scholar
Artur Silva
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Artur Silva.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RTJR developed the program, made the performance tests and contributed to data analysis. ARC tested the program and contributed to the analysis of the results. JB provided critical advice on analysis outputs. MPCS, VA and AS coordinated the work and finalized the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Ramos, R.T., Carneiro, A.R., Baumbach, J. et al. Analysis of quality raw data of second generation sequencers with Quality Assessment Software. BMC Res Notes 4, 130 (2011). https://doi.org/10.1186/1756-0500-4-130

Download citation

Received: 06 January 2011
Accepted: 18 April 2011
Published: 18 April 2011
DOI: https://doi.org/10.1186/1756-0500-4-130

Analysis of quality raw data of second generation sequencers with Quality Assessment Software