PCP-ML: Protein characterization package for machine learning

Eickholt, Jesse; Wang, Zheng

doi:10.1186/1756-0500-7-810

Technical Note
Open access
Published: 18 November 2014

BMC Research Notes volume 7, Article number: 810 (2014)

Abstract

Background

Machine Learning (ML) has a number of demonstrated applications in protein prediction tasks such as protein structure prediction. To speed further development of machine learning based tools and their release to the community, we have developed a package which characterizes several aspects of a protein commonly used for protein prediction tasks with machine learning.

Findings

A number of software libraries and modules exist for handling protein-related data. The package we present in this work, PCP-ML, is unique in its small footprint and emphasis on machine learning. Its primary focus is on characterizing various aspects of a protein through sets of numerical data, which can then be used with machine learning tools and/or techniques. PCP-ML is very flexible in how the generated data are formatted and, as a result, is compatible with a variety of existing machine learning packages. Given its small size, it can be directly packaged and distributed with community-developed tools for protein prediction tasks.

Conclusions

Source code and example programs are available under a BSD license at http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. The package is implemented in C++ and accessible as a Python module.

Findings

Machine Learning (ML) techniques have been successfully applied to a variety of protein-related classification tasks. In particular, machine learning has proven quite useful in the area of protein structure prediction and has resulted in the development of a number of tools and applications. These include the prediction of a protein’s secondary structure [1, 2], residue solvent accessibility [1], residue-residue contacts and contact maps [3, 4], residue order/disorder [5, 6], fold recognition [7] and protein model quality [8]. These tools, while useful in their own right, also form part of larger protein structure tools and tertiary structure prediction pipelines (e.g., MULTICOM [9] and I-TASSER [10]).

In general, machine learning methods work on a feature space which characterizes an object or event. The methods attempt to learn some meaningful relation between elements in the feature space and/or a map between the feature space and classifications. For most protein prediction tasks, the primary feature space is the protein’s sequence and/or data directly derived from it (e.g., a sequence profile). As machine learning techniques are mathematical models, the sequence data (e.g., FASTA files, multiple sequence alignments, etc.) must be read in and then converted to a meaningful numerical format.

Here we present PCP-ML, a package of methods that characterize a protein for machine learning tasks. The package can be used in any protein prediction problem in which the input is the protein’s primary sequence, although we have tailored it to protein structure prediction tasks in particular. Our package was inspired by existing sequence libraries such as Bio++ [11], Biopython [12] and SeqAn [13], but differs in its focus on machine learning, its compact size and its additional functionality for protein structure prediction. It provides a stable, expandable and lightweight set of methods that can be used when developing machine-learning-based tools in structural bioinformatics and, to the best of our knowledge, it is the first library or package of its type. The package is written in C++ but is accessible as a Python module, which allows for rapid prototyping in a scripting language. At the same time, because of its limited scope and size, PCP-ML is much more amenable to being embedded in an application or tool than many existing libraries. The primary purpose of our software package is to provide a concise, tested set of functions that can be used to generate feature files for existing machine learning tools (such as SVM^light [14] or NNrank [15]) or to serve as a built-in component of a stand-alone protein structure prediction tool. Note that PCP-ML itself does not provide any functionality to train prediction tools; rather, it focuses on the pre-processing and data access phase, converting protein sequence data into a format that can be used with machine learning. Figure 1 illustrates how PCP-ML could be incorporated into a prediction pipeline, either as a built-in component or as a stand-alone feature generation program feeding into an off-the-shelf machine learning toolkit.

Methods

The design and functionality of PCP-ML are based on our experience with machine learning in protein structure prediction tasks as well as a survey of methods documented in the literature. We have divided PCP-ML into three main components: Parsers, Characterizers and Encoders, and Feature Writers/Generators. Table 1 summarizes the majority of the methods available in each component and Figure 2 depicts how data flows between the Parsers, Characterizers, Encoders and Feature Writers. To see how the components are used in practice, see Additional file 1, which contains sample scenarios for using PCP-ML (in both C++ and Python).

Table 1 Major methods provided by each component of PCP-ML

Parsers

As almost all protein structure prediction tasks start with the protein’s sequence and sequence profile, PCP-ML provides several methods to parse FASTA files, anchored multiple sequence alignments (MSAs), output from DSSP [16], and position specific scoring matrices (PSSMs) from PSI-BLAST [17]. Many higher level prediction tasks also make use of predicted secondary structure and predicted solvent accessibility. Therefore, we have included parsers for common output formats for these types of predictions. In particular, PCP-ML can read files generated from SSPro [1] and PSIPRED [2].
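As a rough illustration of the kind of parsing described above, the following Python sketch reads FASTA-formatted text into a dictionary. It is not PCP-ML's own parser or API; the function name `parse_fasta` and the example record are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; the header is everything after '>'.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header] += line.upper()
    return records


# Hypothetical two-line record, used only to exercise the function.
example = """>demo_protein hypothetical example
MKTAYIAKQR
QISFVKSHFS
""".splitlines()

records = parse_fasta(example)
```

Parsers for PSSMs, DSSP output or secondary structure predictions would follow the same pattern of reading a text format into plain sequences or numeric arrays, although each format has its own layout.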

Here, we note that we do not include a parser for PDB files (i.e., the common format used by the Protein Data Bank [18] for protein structures). Our rationale is that for most prediction tasks, the structural information that would be contained in a PDB file is not available, and hence a PDB parser is not needed for the production of end-user protein structure prediction tools. Interested readers may find a PDB parser included with Biopython [12] or ESBTL [19].

Characterizers and encoders

The input to machine learning methods is numerical and, as a result, it is necessary to encode data such as secondary structure (SS), solvent accessibility (SA) and amino acid (AA) type. One approach is to convert each SS, SA and AA type to a vector of length 3, 2 and 20, respectively. In each vector, all of the values are 0 except for one value of 1, whose position in the vector signifies the type (e.g., 100 represents a helix while 001 encodes a coil). This type of encoding is often referred to as one-hot encoding, or orthogonal encoding [20], and allows for a numerical conversion without arbitrarily imposing an ordering on the types. PCP-ML contains three methods for one-hot encoding.
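The orthogonal encoding described above can be sketched in a few lines of Python. This is an illustrative sketch rather than PCP-ML's implementation: the function names are hypothetical, and the symbol ordering within each alphabet is an assumption (helix, strand, coil was chosen so that a helix maps to 100 and a coil to 001, as in the text).

```python
from typing import List

# Alphabets are illustrative; the ordering within each one is an assumption.
SS_TYPES = "HEC"                   # helix, strand, coil -> vectors of length 3
SA_TYPES = "be"                    # buried, exposed     -> vectors of length 2
AA_TYPES = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids -> length 20


def one_hot(symbol: str, alphabet: str) -> List[int]:
    """Return a vector of zeros with a single 1 at the symbol's position."""
    if len(symbol) != 1 or symbol not in alphabet:
        raise ValueError(f"unknown symbol: {symbol!r}")
    return [1 if symbol == s else 0 for s in alphabet]


def encode_sequence(sequence: str, alphabet: str) -> List[int]:
    """Concatenate the one-hot vectors of every position in a sequence."""
    features: List[int] = []
    for symbol in sequence:
        features.extend(one_hot(symbol, alphabet))
    return features


helix = one_hot("H", SS_TYPES)            # [1, 0, 0]
coil = one_hot("C", SS_TYPES)             # [0, 0, 1]
window = encode_sequence("HC", SS_TYPES)  # [1, 0, 0, 0, 0, 1]
```

Because every type is equidistant from every other type in this representation, no artificial ordering (such as helix < strand < coil) is introduced.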

There are a number of ways to characterize a protein’s sequence and the PCP-ML package includes many of these. Perhaps the most obvious is to represent amino acid residues by numerical values stemming from statistical studies on experimentally determined structures. Included in PCP-ML are pair-wise contact potentials [21], beta sheet pairing potentials [22], and hydrophobicity [23]. We also included the Atchley factors for each amino acid [24]. These factors represent each amino acid type in a five dimensional space in which similar amino acids are grouped together and the proximity of any two amino acids is a measure of their similarity.
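A per-residue property lookup of this kind might be sketched as follows. Note the substitution: the example uses the widely known Kyte-Doolittle hydropathy values as a stand-in, not the scale of [23] that PCP-ML includes, and the function name `hydrophobicity_profile` is hypothetical.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy values, used here only as a stand-in scale.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence: str) -> List[float]:
    """Map each residue to its scale value (raises KeyError if unknown)."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence]


profile = hydrophobicity_profile("MKV")
```

Vector-valued descriptors such as the Atchley factors would work the same way, with each residue mapped to a list of five numbers instead of a single value.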

Proteins can also be characterized by their content. PCP-ML contains methods which calculate the percent content of a protein by secondary structure type (i.e., helix, beta sheet, or coil/loop), solvent accessibility (i.e., buried or exposed) or amino acid residue type. This information is a way to characterize a protein globally (i.e., irrespective of residue index). This approach can also be applied at the residue level using an anchored MSA or PSSM. Using either an MSA or PSSM, it is possible to calculate the relative frequency of each type of AA at a position in the sequence as well as the amount of information contained at that position.
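The content-based characterizations above can be illustrated with a short Python sketch. The function names and the toy alignment are hypothetical, and the information measure shown (log2 of the alphabet size minus the Shannon entropy of the column) is one common definition that may differ from the one PCP-ML uses.

```python
import math
from collections import Counter
from typing import Dict, Sequence

AA_TYPES = "ACDEFGHIKLMNPQRSTVWY"


def composition(symbols: str, alphabet: str) -> Dict[str, float]:
    """Fraction of positions occupied by each symbol of the alphabet.

    Symbols outside the alphabet (e.g., gaps) are ignored. Multiply the
    fractions by 100 to obtain percentages.
    """
    counts = Counter(s for s in symbols if s in alphabet)
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in alphabet}
    return {s: counts[s] / total for s in alphabet}


def column_frequencies(msa: Sequence[str], position: int) -> Dict[str, float]:
    """Relative frequency of each amino acid in one alignment column."""
    column = "".join(row[position] for row in msa)
    return composition(column, AA_TYPES)


def information_content(freqs: Dict[str, float]) -> float:
    """log2(alphabet size) minus the Shannon entropy of the column, in bits."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - entropy


ss_content = composition("HHEC", "HEC")  # global secondary structure content

msa = ["MKV", "MRV", "M-I", "MKL"]       # toy anchored alignment
col0 = column_frequencies(msa, 0)        # fully conserved column
col1 = column_frequencies(msa, 1)        # variable column with one gap
```

A fully conserved column carries the maximum information under this definition, while a column in which all 20 residue types are equally frequent carries none.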

Finally, a protein can be characterized by patterns or correlations in sequence data. Thus, we have included in PCP-ML methods to calculate the information contained in a vector or the correlation or similarity between two vectors. We also mention here that most methods have the option of returning scaled values such that the feature values are between 0 and 1. This is required by some machine learning methods. Table 2 provides a brief summary of the functionality provided by each characterizer in PCP-ML.
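As a sketch of the vector-level measures mentioned above, the following Python functions compute the Pearson correlation and cosine similarity between two equal-length vectors and rescale a feature vector to the range 0 to 1. The function names are hypothetical, and min-max scaling is only one way of obtaining values between 0 and 1; PCP-ML's own scaling may differ.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Linearly rescale values so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```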

Table 2 Description of each Characterizer contained in PCP-ML

Feature generators/writers

The input format for standard machine learning packages (e.g., SVM^light, NNrank, etc.) varies but typically consists of a text file in which each line represents a training or classification example. Some packages also require the features to be numbered. PCP-ML provides feature writers which can print out features (optionally with feature numbers) and/or save them to a file. This functionality allows users to create stand-alone feature generation programs that they can package with standard machine learning programs, or to tie feature generation directly into their tools. Note that it is difficult to accommodate the file formats of all machine learning packages. The feature writers we developed allow a user to print the features along with feature numbers and/or the labels/targets themselves. How the targets and feature numbers are written can be easily modified via the parameters passed to the feature writing functions.
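To make the one-example-per-line layout concrete, here is a minimal Python sketch of a feature writer. The function names and parameters are hypothetical and do not reflect PCP-ML's actual interface; with numbering enabled, the output follows the general `target index:value` layout used by SVM^light, and with numbering disabled it is a plain space-separated row.

```python
import io
from typing import Iterable, Optional, Sequence, TextIO


def format_example(features: Sequence[float],
                   target: Optional[float] = None,
                   numbered: bool = True,
                   start_index: int = 1) -> str:
    """Format one example as a single line of text."""
    if numbered:
        parts = [f"{i}:{v:g}" for i, v in enumerate(features, start=start_index)]
    else:
        parts = [f"{v:g}" for v in features]
    if target is not None:
        parts.insert(0, f"{target:g}")  # the label/target leads the line
    return " ".join(parts)


def write_examples(rows: Iterable[Sequence[float]],
                   targets: Iterable[float],
                   handle: TextIO,
                   numbered: bool = True) -> None:
    """Write one line per example to an open text handle."""
    for features, target in zip(rows, targets):
        handle.write(format_example(features, target, numbered) + "\n")


# Write two toy examples to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_examples([[0.5, 0, 1], [0.25, 1, 0]], [1, -1], buffer)
output = buffer.getvalue()
```

Passing an open file instead of the in-memory buffer would produce a feature file on disk; exposing numbering and targets as parameters mirrors the flexibility described above.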

Conclusions

PCP-ML is a software package to characterize proteins for machine learning applications in protein structure prediction as well as more general protein related prediction tasks. It provides a number of functions that allow for rapid prototyping and testing of methods and easy deployment of developed tools. The package can be used to create feature generation programs compatible with popular machine learning tools or compiled into stand-alone applications. As an open source project, it is freely available to the community and can be modified and extended as needed.

Availability and requirements

Project name: PCP-ML

Project home page: http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/

Operating System(s): Linux, Mac OS X

Programming Language: C++, Python

Other requirements: C++ compiler

License: BSD

Any restrictions to use by non-academics: None

PCP-ML is written in C++ and is available both as source code and as a Python module from http://mlid.cps.cmich.edu/eickh1jl/tools/PCPML/. At the site, users can also find examples, a tutorial and additional documentation, and can learn about porting the package to other languages such as Perl or Octave.

References

1. Cheng J, Randall AZ, Sweredoski MJ, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005, 33: W72-W76. 10.1093/nar/gki396.
2. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.
3. Di Lena P, Nagata K, Baldi P: Deep architectures for protein contact map prediction. Bioinformatics. 2012, 28: 2449-2457. 10.1093/bioinformatics/bts475.
4. Eickholt J, Cheng J: Predicting protein residue-residue contacts using deep networks and boosting. Bioinformatics. 2012, 28: 3066-3072. 10.1093/bioinformatics/bts598.
5. Walsh I, Martin AJM, Di Domenico T, Tosatto SCE: ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012, 28: 503-509. 10.1093/bioinformatics/btr682.
6. Eickholt J, Cheng J: DNdisorder: predicting protein disorder using boosting and deep networks. BMC Bioinform. 2013, 14: 88. 10.1186/1471-2105-14-88.
7. Cheng J, Baldi P: A machine learning information retrieval approach to protein fold recognition. Bioinformatics. 2006, 22: 1456-1463. 10.1093/bioinformatics/btl102.
8. Wang Z, Eickholt J, Cheng J: APOLLO: a quality assessment service for single and multiple protein models. Bioinformatics. 2011, 27: 1715-1716. 10.1093/bioinformatics/btr268.
9. Li J, Deng X, Eickholt J, Cheng J: Designing and benchmarking the MULTICOM protein structure prediction system. BMC Struct Biol. 2013, 13: 2. 10.1186/1472-6807-13-2.
10. Xu D, Zhang J, Roy A, Zhang Y: Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins. 2011, 79 (Suppl 10): 147-160.
11. Dutheil J, Gaillard S, Bazin E, Glémin S, Ranwez V, Galtier N, Belkhir K: Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinform. 2006, 7: 188. 10.1186/1471-2105-7-188.
12. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25: 1422-1423. 10.1093/bioinformatics/btp163.
13. Döring A, Weese D, Rausch T, Reinert K: SeqAn: an efficient, generic C++ library for sequence analysis. BMC Bioinform. 2008, 9: 11. 10.1186/1471-2105-9-11.
14. Joachims T: Advances in Kernel Methods. Edited by: Schölkopf B, Burges CJC, Smola AJ. 1999, Cambridge, MA, USA: MIT Press, 169-184.
15. Cheng J, Wang Z, Pollastri G: A neural network approach to ordinal regression. IEEE Int. Jt. Conf. Neural Networks 2008 IJCNN 2008 IEEE World Congr. Comput. Intell. 2008, 1279-1284.
16. Kabsch W, Sander C: Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22: 2577-2637. 10.1002/bip.360221211.
17. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
18. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
19. Loriot S, Cazals F, Bernauer J: ESBTL: efficient PDB parser and data structure for the structural and geometric analysis of biological macromolecules. Bioinformatics. 2010, 26: 1127-1128. 10.1093/bioinformatics/btq083.
20. Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. 2001, Cambridge: MIT Press.
21. Glaser F, Steinberg DM, Vakser IA, Ben-Tal N: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins. 2001, 43: 89-102. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H.
22. Zhu H, Braun W: Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting distance geometry calculations of beta-sheet formation in proteins. Protein Sci. 1999, 8: 326-342.
23. Monera OD, Sereda TJ, Zhou NE, Kay CM, Hodges RS: Relationship of sidechain hydrophobicity and α-helical propensity on the stability of the single-stranded amphipathic α-helix. J Pept Sci. 1995, 1: 319-329. 10.1002/psc.310010507.
24. Atchley WR, Zhao J, Fernandes AD, Drüke T: Solving the protein sequence metric problem. Proc Natl Acad Sci U S A. 2005, 102: 6395-6400. 10.1073/pnas.0408677102.

Acknowledgements

This work was supported in part by a start-up grant from Central Michigan University to JE and an NSF Mississippi EPSCoR seed grant GM006278 to ZW.

Author information

Authors and Affiliations

Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859, USA
Jesse Eickholt
School of Computing, University of Southern Mississippi, Hattiesburg, MS 39406, USA
Zheng Wang

Corresponding author

Correspondence to Jesse Eickholt.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JE and ZW conceived of the toolset, and JE implemented the algorithms and website. Both authors wrote, edited and approved the manuscript.

Electronic supplementary material

Additional file 1: A stand-alone webpage with an example use of PCP-ML. (ZIP 5 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Eickholt, J., Wang, Z. PCP-ML: Protein characterization package for machine learning. BMC Res Notes 7, 810 (2014). https://doi.org/10.1186/1756-0500-7-810

Received: 27 January 2014
Accepted: 31 October 2014
Published: 18 November 2014
DOI: https://doi.org/10.1186/1756-0500-7-810
