The Proteogenomic Mapping Tool

Sanders, William S; Wang, Nan; Bridges, Susan M; Malone, Brandon M; Dandass, Yoginder S; McCarthy, Fiona M; Nanduri, Bindu; Lawrence, Mark L; Burgess, Shane C

doi:10.1186/1471-2105-12-115

Software
Open access
Published: 22 April 2011

The Proteogenomic Mapping Tool

William S Sanders^1,5,
Nan Wang²,
Susan M Bridges^3,5,
Brandon M Malone^3,5,
Yoginder S Dandass^3,5,
Fiona M McCarthy^4,5,
Bindu Nanduri^4,5,
Mark L Lawrence^4,5 &
…
Shane C Burgess^4,5

BMC Bioinformatics volume 12, Article number: 115 (2011) Cite this article

6883 Accesses
32 Citations
Metrics details

Abstract

Background

High-throughput mass spectrometry (MS) proteomics data is increasingly being used to complement traditional structural genome annotation methods. To keep pace with the high speed of experimental data generation and to aid in structural genome annotation, experimentally observed peptides need to be mapped back to their source genome location quickly and exactly. Previously, the tools to do this have been limited to custom scripts designed by individual research groups to analyze their own data, are generally not widely available, and do not scale well with large eukaryotic genomes.

Results

The Proteogenomic Mapping Tool includes a Java implementation of the Aho-Corasick string searching algorithm which takes as input standardized file types and rapidly searches experimentally observed peptides against a given genome translated in all 6 reading frames for exact matches. The Java implementation allows the application to scale well with larger eukaryotic genomes while providing cross-platform functionality.

Conclusions

The Proteogenomic Mapping Tool provides a standalone application for mapping peptides back to their source genome on a number of operating system platforms with standard desktop computer hardware and executes very rapidly for a variety of datasets. Allowing the selection of different genetic codes for different organisms allows researchers to easily customize the tool to their own research interests and is recommended for anyone working to structurally annotate genomes using MS derived proteomics data.

Background

Expressed proteins provide experimental evidence that genes in the genome are being transcribed and translated to produce a protein product. Recently, a new structural genome annotation method, proteogenomic mapping, has been developed that uses identified peptides from experimentally derived proteomics data to identify functional elements in genomes and to improve genome annotation [1, 2]. Initially used for the structural annotation of prokaryotic genomes, proteogenomic mapping is rapidly gaining traction in eukaryotic genome annotation projects with larger genomes as a complementary method [3, 4].

Proteogenomic mapping can identify potential new genes or corrections to the boundaries of predicted genes by using peptide matches against the genome that do not match against the predicted proteome to generate expressed Protein Sequence Tags (ePSTs) [2]. When aligned with the genome and combined with the published structural annotation, these ePSTs are indicative of translation throughout the genome and can serve to supplement traditional structural genome annotation methods [3–5].

While a number of research groups are becoming increasingly active in the field of proteogenomic mapping [1–5], there is a lack of published and standardized tools to rapidly and exactly map identified peptides back to the genome translated in all 6 reading frames. To our knowledge, there is only one comparable tool, PepLine [6], which utilizes a de novo based spectral identification methodology. In contrast our tool is implemented to work with the output from LC MS/MS combined with database search based spectral identification algorithms. PepLine uses peptide sequence tags (PSTs), short spectral match translations of 3-4 amino acids with flanking matches on either end for searches against the genome, where our tool works with peptides derived from MS/MS databases searches. While PepLine's use of PSTs allows the direct searching of spectra against the genome, a staged search method of searching spectra identified against database searches is an alternative.

Implementation

The Proteogenomic Mapping Pipeline is free to obtain and use, written completely in Java, and available for all common computer platforms. It is licensed under the GNU GPLv3 license making it completely open source and making the source code and implementation methodology available to the end user [7]. We have endeavored to make this tool as easy to use as possible and have provided both a command line version and a graphical user interface (GUI) for all common platforms.

Data Input and Customization

The GUI is shown in Figure 1 and takes as input from the user 3 files: a FASTA file of the peptides to be searched, a FASTA file containing the nucleic acid sequences the peptides are to be mapped against (typically the genome), and a file containing the genetic code to use based on the format of the National Center for Biotechnology Information's (NCBI) toolkit for genetic codes [8]. Furthermore, FASTA output from the splice site prediction tool GeneSplicer [9] can optionally be provided. If present, the splice sites given in that file are used instead of the default splice sites. The user is also required to provide a file name and location for the three output files that will be generated.

To generate the FASTA file of the peptides to be searched, it is expected that the user will have performed spectral matching for their MS dataset of interest against databases generated from both the proteome and the genome translated in all six reading frames and confirm these peptide identifications using a peptide validation strategy. After validation, the unique peptide identifications resulting from a database search against the genome that are not contained among the proteome peptide identifications should be used as the list of peptides to be searched.

The command line version of the Proteogenomic Mapping Pipeline allows the same inputs as the GUI to be specified as command line arguments and can be run on standard computer platforms (Windows, Linux, Unix, MacOS). An example of using the command line version of the program is included in the README file provided with the application.

The application translates the nucleotide database to protein in all 6 reading frames using the genetic code selected by the user (we provide the most common genetic codes from NCBI [8] which are represented in NCBI's standard format for genetic codes in the genetic_code_table file included with the application) and maps the peptides to the translated genome using the Aho-Corasick string searching algorithm to provide rapid and exact matches of peptides to the genome [10, 11]. The Aho-Corasick string matching algorithm [10] quickly locates all occurrences of keywords within a text string. The algorithm consists primarily of two phases. In the first, a finite state machine is constructed from the set of keywords. The time to construct this machine and its memory requirements are linearly proportional to the sum of the lengths of the keywords. The second phase consists of running the state machine using the text string as input. This phase takes time linearly proportional to the length of the text string. Thus, the time to run the entire algorithm is proportional to the sum of the length of the keywords and the length of the text string. In our case, the peptides for which to search are the keywords, and the reference genome against which to search is the text string.

ePST Generation

Once a peptide has been mapped to a nucleotide sequence, the reverse translated peptide is used to create an expressed Protein Sequence Tag (ePST) [2]. Figure 2 illustrates the ePST generation process for prokaryotes and Figure 3 shows both options for the ePST generation process in eukaryotes. For prokaryotes, the reverse translated peptide is extended in the 3' direction to an in-frame stop codon. In the 5' direction, the first in-frame stop-codon upstream of the peptide (5' stop) is identified and the peptide is extended to the first in-frame start downstream from this 5' stop before the start of the peptide. In the case that no in-frame start occurs between the 5' stop and the start of the peptide, the start of the peptide is used as the start of the ePST. The process is more complex for eukaryotes due to splicing. For eukaryotes, the peptides can be extended to produce ePSTs using three different approaches. In the first approach, the peptide is extended downstream to the first in-frame stop or splice site signal [12] and upstream until the first in-frame start, in-frame stop, or splice site signal. We have found that this approach often generates ePSTs that are far longer than typical exons. We speculate that this is because the potential new ORFs identified by this approach do not have a canonical splice site signal. While the application does default to using canonical splice site signals, our second approach includes the option of using predictions from GeneSplicer [9], a computational splice site prediction tool, by allowing the user to select to input GeneSplicer output for use instead of the canonical splice site signals. A third option is to extend the peptide upstream and downstream by a nucleotide length to be specified by the user.

Output File Description

Three output files are produced by the application. The first file is a FASTA file containing the ePSTs generated for the dataset. The second file is a more detailed tab separated text file containing the original peptide's identification, the peptide sequence, the FASTA header for the nucleotide sequence containing the match, the mapping start and end locations for the reverse translated peptide, the strand the nucleotide match, the reading frame of the match, the reverse translated peptide sequence, a longer nucleotide sequence extending from the 5' in-frame stop codon immediately upstream of the peptide to the 3' in-frame stop codon immediately downstream of the peptide, the ePST nucleotide sequence and the start and stop locations of the ePST on the nucleotide sequence, the length of the ePST, and the translated ePST. The third file is a GFF3 file containing the ePSTs generated for the dataset to provide researchers with a file format they can quickly load into genome browsers for data visualization.

Example Datasets

To test our implementation we acquired previously published proteogenomic mapping datasets for a number of organisms. For a relatively small example data set, we selected a proteogenomic mapping dataset for the channel catfish virus [5]. This small dataset contains 407 unique peptide identifications, of which 17 peptides did not map to the predicted proteome of the virus, but do map to novel open reading frames in the viral genome, and expression of several of these genes was confirmed by RT-PCR. Our example dataset consists of a FASTA file of these 17 peptides and the reference genome (NC_001493.1) for the channel catfish virus. For bacterial examples, proteomics datasets from three different microorganisms [13] were used to test our application. For a eukaryotic example, a previously published proteomics dataset generated from chicken serum was utilized for testing [14]. Table 1 details the number of unique peptides and the number of unique peptides mapping uniquely to the genomic database search contained in each of these five datasets.

Table 1 Example Dataset Statistics

Full size table

Results and Discussion

The output from the Proteogenomic Mapping Pipeline matches the previously published results against our test dataset [5], and our output provides additional information that not only places the mapped peptides on the appropriate nucleotide strand but also includes the reading frame in which the match occurs. Table 2 gives a list of the peptides and corresponding ePSTs for this dataset. We have also successfully tested this tool for proteogenomic mapping in previously published bacterial [13] and eukaryotic datasets [2, 14]. Table 3 provides runtime analysis for each of our five test datasets, and demonstrates that the Proteogenomic Mapping Tool scales well for increasingly large datasets.

Table 2 Channel Catfish Virus Peptides and ePSTs.

Full size table

Table 3 Runtime Analysis For Example Datasets.

Full size table

Possible future updates to this application include parallelization of the searches against the genome in all 6 reading frames, and the introduction of better thread support to improve performance further on today's modern increasingly multi-core processors.

Conclusions

The Proteogenomic Mapping Pipeline provides a standalone tool that facilitates a streamlined mapping of peptides to a target genome for structurally genome annotation through the use of proteomics. This software can be used on a variety of current operating systems and is its ability to use a variety of genetic codes makes it easily customizable for researchers performing proteogenomic mapping in a variety of prokaryotes, eukaryotes, and viruses.

Availability and requirements

Project name: The Proteogenomic Mapping Pipeline

Project home page: http://www.agbase.msstate.edu/tools/pgm/

Operating system(s): Windows XP, Vista (x86), Vista(x64), Linux, MacOS

Programming languages: Java

Other requirements: Java

License: GNU GPLv3 [7]

Any restrictions to use by non-academics: None

References

Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 2004, 4: 59–77. 10.1002/pmic.200300511
Article CAS PubMed Google Scholar
McCarthy FM, Cooksey AM, Wang N, Bridges SM, Pharr GT, Burgess SC: Modeling a whole organ using proteomics: the avian bursa of Fabricius. Proteomics 2006, 6: 2759–2771. 10.1002/pmic.200500648
Article CAS PubMed Google Scholar
Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP: Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci USA 2008, 105: 21034–21038. 10.1073/pnas.0811066106
Article PubMed Central CAS PubMed Google Scholar
Sevinsky JR, Cargile BJ, Bunger MK, Meng F, Yates NA, Hendrickson RC, Stephenson JL Jr: Whole genome searching with shotgun proteomic data: applications for genome annotation. J Proteome Res 2008, 7: 80–88. 10.1021/pr070198n
Article CAS PubMed Google Scholar
Kunec D, Nanduri B, Burgess SC: Experimental annotation of channel catfish virus by probabilistic proteogenomic mapping. Proteomics 2009, 9: 2634–2647. 10.1002/pmic.200800397
Article CAS PubMed Google Scholar
Ferro M, Tardif M, Reguer E, Cahuzac R, Bruley C, Vermat T, Nugues E, Vigouroux M, Vandenbrouck Y, Garin J, Viari A: PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J Proteome Res 2008, 7: 1873–1883. 10.1021/pr070415k
Article CAS PubMed Google Scholar
The GNU General Public License version 3[http://www.gnu.org/copyleft/gpl.html]
NCBI Genetic Code Table[ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt]
Pertea M, Lin X, Salzberg SL: GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res 2001, 29: 1185–1190. 10.1093/nar/29.5.1185
Article PubMed Central CAS PubMed Google Scholar
Aho AV, Corasick MJ: Efficient String Matching: An Aid to Biblographic Search. Communications of the ACM 1975, 18: 333–340. 10.1145/360825.360855
Article Google Scholar
Dandass YS, Burgess SC, Lawrence M, Bridges SM: Accelerating string set matching in FPGA hardware for bioinformatics research. BMC Bioinformatics 2008, 9: 197. 10.1186/1471-2105-9-197
Article PubMed Central PubMed Google Scholar
Wu Q, Krainer AR: AT-AC pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Mol Cell Biol 1999, 19: 3225–3236.
PubMed Central CAS PubMed Google Scholar
Nanduri B, Wang N, Lawrence ML, Bridges SM, Burgess SC: Gene model detection using mass spectrometry. Methods Mol Biol 604: 137–144.
Corzo A, Kidd MT, Koter MD, Burgess SC: Assessment of dietary amino acid scarcity on growth and blood plasma proteome status of broiler chickens. Poult Sci 2005, 84: 419–425.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

This research was funded in part by NIH grant 1R24GM079326-01A1, USDA grant 20053560017688 07010072, and NSF EPSCoR grant EPS-0903787. We also thank Dusan Kunec for providing us access to the channel catfish virus proteomics data.

Author information

Authors and Affiliations

Department of Biochemistry & Molecular Biology, Mississippi State University, MS, USA
William S Sanders
Department of Computer Science, University of Southern Mississippi, MS, USA
Nan Wang
Department of Computer Science & Engineering, Mississippi State University, MS, USA
Susan M Bridges, Brandon M Malone & Yoginder S Dandass
Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, MS, USA
Fiona M McCarthy, Bindu Nanduri, Mark L Lawrence & Shane C Burgess
Institute for Genomics, Biocomputing, and Biotechnology, Mississippi State University, MS, USA
William S Sanders, Susan M Bridges, Brandon M Malone, Yoginder S Dandass, Fiona M McCarthy, Bindu Nanduri, Mark L Lawrence & Shane C Burgess

Authors

William S Sanders
View author publications
You can also search for this author in PubMed Google Scholar
Nan Wang
View author publications
You can also search for this author in PubMed Google Scholar
Susan M Bridges
View author publications
You can also search for this author in PubMed Google Scholar
Brandon M Malone
View author publications
You can also search for this author in PubMed Google Scholar
Yoginder S Dandass
View author publications
You can also search for this author in PubMed Google Scholar
Fiona M McCarthy
View author publications
You can also search for this author in PubMed Google Scholar
Bindu Nanduri
View author publications
You can also search for this author in PubMed Google Scholar
Mark L Lawrence
View author publications
You can also search for this author in PubMed Google Scholar
Shane C Burgess
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Susan M Bridges.

Additional information

Authors' contributions

FMM and SCB developed the initial procedure for generating ePSTs. BN and ML developed the algorithms for generating ePSTs for prokaryotes. SMB and NW designed the pipeline and NW implemented the initial Perl version of the pipeline. WSS and BMM implemented the current Java version and the graphical user interface. YSD implemented the string matching module. WSS and SMB drafted the document. BMM developed the software documentation. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Sanders, W.S., Wang, N., Bridges, S.M. et al. The Proteogenomic Mapping Tool. BMC Bioinformatics 12, 115 (2011). https://doi.org/10.1186/1471-2105-12-115

Download citation

Received: 23 June 2010
Accepted: 22 April 2011
Published: 22 April 2011
DOI: https://doi.org/10.1186/1471-2105-12-115

The Proteogenomic Mapping Tool