Open Access Methodology article

A new computational strategy for predicting essential genes

Jian Cheng (1,2), Wenwu Wu (1,3), Yinwen Zhang (1,2), Xiangchen Li (1,2), Xiaoqian Jiang (1,2), Gehong Wei (4,5)* and Shiheng Tao (1,2,4)*

Author Affiliations

1 College of Life Science, State Key Laboratory of Crop Stress Biology for Arid Areas, Northwest A&F University, Yangling, Shaanxi, China

2 Bioinformatics Center, Northwest A&F University, Yangling 712100, Shaanxi, China

3 Key Laboratory of Food Safety Research, Institute for Nutritional Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China

4 College of Life Science, Northwest A&F University, Yangling 712100, Shaanxi, China

5 College of Science, Northwest A&F University, Yangling 712100, Shaanxi, China

BMC Genomics 2013, 14:910. doi:10.1186/1471-2164-14-910

Published: 21 December 2013

Abstract

Background

Determining the minimum gene set for cellular life is one of the central goals in biology. Genome-wide identification of essential genes has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models that integrate gene features have recently been developed and used as an alternative means of transferring gene essentiality annotations between organisms.

Results

We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) multicollinearity among gene features and (ii) diverse, and even contrasting, correlations between gene features and gene essentiality within and among different species. To address these issues, we developed a novel model, the feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and a genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with that of other popular models, such as the support vector machine, the Naïve Bayes model, and the logistic regression model, by applying FWM to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction.
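To illustrate why multicollinearity matters for a Naïve Bayes classifier, and how per-feature weights can counteract it, the following is a minimal Python sketch. It is not the FWM implementation: the abstract does not specify how FWM derives its weights from logistic regression and the genetic algorithm, so the weights here are set by hand. The binary features, the Laplace smoothing, the toy data, and the function names `fit_log_ratios` and `weighted_log_odds` are all assumptions made for illustration.

```python
import math


def fit_log_ratios(X, y, alpha=1.0):
    """Estimate, for each binary feature, log P(x | essential) / P(x | non-essential).

    X is a list of rows of 0/1 feature values, y a list of 0/1 labels
    (1 = essential). alpha is a Laplace smoothing constant (an assumption).
    Returns the log prior odds and one {value: log-ratio} table per feature.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    ratios = []
    for j in range(len(X[0])):
        table = {}
        for v in (0, 1):
            pos = sum(1 for row, label in zip(X, y) if label == 1 and row[j] == v)
            neg = sum(1 for row, label in zip(X, y) if label == 0 and row[j] == v)
            p_pos = (pos + alpha) / (n_pos + 2 * alpha)
            p_neg = (neg + alpha) / (n_neg + 2 * alpha)
            table[v] = math.log(p_pos / p_neg)
        ratios.append(table)
    prior = math.log((n_pos + alpha) / (n_neg + alpha))
    return prior, ratios


def weighted_log_odds(x, prior, ratios, weights):
    """Log odds of essentiality, with each feature's evidence scaled by its weight.

    With all weights equal to 1 this is the ordinary Naive Bayes log odds.
    """
    return prior + sum(w * r[v] for v, r, w in zip(x, ratios, weights))


# Toy data (hypothetical): feature 1 is an exact copy of feature 0,
# i.e. the two features are perfectly collinear.
X = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 0, 1],
     [0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 1]

prior, ratios = fit_log_ratios(X, y)
gene = [1, 1, 0]

# Plain Naive Bayes counts the duplicated evidence twice.
plain = weighted_log_odds(gene, prior, ratios, [1.0, 1.0, 1.0])
# Hand-set weights (placeholder for learned ones) silence the redundant feature.
reweighted = weighted_log_odds(gene, prior, ratios, [1.0, 0.0, 1.0])

print("plain log odds:     ", plain)
print("reweighted log odds:", reweighted)
```

In this toy case the unweighted score is inflated by exactly the contribution of the duplicated feature; giving that feature a weight of zero removes the double counting. How FWM actually chooses its weights is described in the full article, not in this sketch.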

Conclusions

FWM can markedly improve the accuracy of essential gene prediction and may be used as an alternative method for other classification tasks. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and to the discovery of new drug targets.

Keywords:
Essential genes; Naïve Bayes; Support vector machine; Gene essentiality