Abstract
Background
A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments in Next-Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can yield data that are less discriminative than the actual expression levels themselves would be, were they directly observable.
Results
Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered, and NGS-based classification is compared to SAGE-based classification and to classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification.
Conclusions
Having a high number of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.
Background
In recent years, modern high-throughput sequencing technologies have become one of the essential tools for measuring the number of transcripts of each gene in a cell population, or even in individual cells. Such information can be used to detect differential gene expression due to different treatments or phenotypes. In our case, we are interested in using gene-expression measurements to classify phenotypes into one of two classes. The accuracy of classification depends on the manner in which the phenomena are transformed into data by the measurement technology. We consider the effects of Next-Generation Sequencing (NGS) and Serial Analysis of Gene Expression (SAGE) on gene-expression classification using currently accepted measurement modeling. The classification-accuracy problem has previously been addressed for the LC-MS proteomics pipeline, where state-of-the-art modeling is more refined, the purpose being to characterize the effect of various noise sources on classification accuracy [1].
NGS technology provides a discrete counting measurement for gene-expression levels. In particular, RNA-Seq sequences small RNA fragments (mRNA) to measure gene expression. When a gene is expressed, it produces mRNAs. An RNA-Seq experiment randomly shears and converts the RNA fragments to cDNAs, sequences them, and finally outputs the results in the form of short reads [2,3]. After obtaining those reads, a typical part of a processing pipeline is to map them back to a reference genome to determine the gene-expression levels. The number of reads mapped to a gene on the reference genome defines the count data, which is a discrete measure of the gene-expression level. Two popular models for statistical representation of the discrete NGS data are the negative binomial [4,5] and the Poisson [6]. The negative binomial model is more general because it can mitigate the overdispersion issues associated with the Poisson model; however, with the relatively small number of samples available in most current NGS experiments, it is difficult to accurately estimate the dispersion parameter of the negative binomial model. Therefore, in this study we choose to model the NGS data processing pipeline through a transformation via a Poisson model, for the purposes of phenotype classification.
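The overdispersion point can be illustrated with a short simulation: for the same mean, a Poisson variable has variance equal to its mean, while a negative binomial variable has variance mean + φ·mean². The dispersion φ and all numerical values below are illustrative choices, not parameters from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = 50.0

# Poisson: variance equals the mean.
pois = rng.poisson(mean, size=200_000)

# Negative binomial with dispersion phi: variance = mean + phi * mean**2.
phi = 0.1                          # illustrative dispersion value
n = 1.0 / phi                      # NB "number of successes" parameter
p = n / (n + mean)                 # NB success probability, mean = n(1-p)/p
nb = rng.negative_binomial(n, p, size=200_000)

# Both have mean ~50, but the NB variance is ~300 versus ~50 for Poisson.
print(pois.var(), nb.var())
```

With only a handful of biological replicates, φ must itself be estimated from the data, which is the practical difficulty noted above.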
SAGE technology produces short continuous sequences of nucleotides, called tags. After a SAGE experiment is done, one can measure the expression level of a particular region/gene of interest on the genome by counting the number of tags that map to it. SAGE is very similar to RNA-Seq in nature and in terms of statistical modeling. The SAGE data processing pipeline is traditionally modeled as a Poisson random vector [7,8]. We follow the same approach for synthetically generated SAGE-like data sets.
Our overall methodology is to generate three different types of synthetic data: (1) actual gene-expression concentration, called MVNGC, from a multivariate normal (Gaussian) model formulated to model various aspects of gene-expression concentration [9]; (2) Poisson-transformed MVNGC data, called NGS-reads, with specifications that resemble NGS reads; and (3) Poisson-transformed MVNGC data, called SAGE-tags, where the characteristics of the data model SAGE data. The classification results for these three types of data sets indicate that MVNGC misclassification errors are lower compared to data subjected to the transformations that produce either NGS-reads or SAGE-tags data. Moreover, classification using RNA-Seq synthetic data outperforms classification using SAGE data when the number of reads is in an acceptable range for an RNA-Seq experiment. The better performance is attributed to the significantly higher genome coverage associated with the RNA-Seq technology.
Next-generation sequencing technologies
Next-Generation Sequencing refers to a class of technologies that sequence millions of short DNA fragments in parallel, at a relatively low cost. The length and number of the reads differ based on the technology. Currently, there are three major commercially available platforms/technologies for NGS: (1) Illumina, (2) Roche, and (3) Life Technologies [10,11]. The underlying chemistry and technique used in each platform is unique and affects the output. In this paper, we focus on Illumina sequencers and use the term NGS to refer to this platform. High-throughput sequencing and NGS are used interchangeably.
The specific application of NGS to RNA sequencing is called RNA-Seq [2], which is a high-throughput measurement of the gene-expression levels of thousands of genes simultaneously, represented by discrete expression values for regions of interest on the genome (e.g., genes). NGS has many advantages when compared to the available microarray expression platforms. NGS does not depend on prior knowledge about regions of expression to measure a gene [11], whereas microarray probes are designed based on known sequences [12]. Background correction, probe design and spot filtering, which are typical for microarray-based technology, are no longer problematic due to the different nature of NGS technology. RNA-Seq enables scientists to discover new splice junctions [13], owing to its flexibility and independence from pre-designed probes. The prediction of absolute copy-number variation (CNV) is another great advantage of RNA-Seq, which allows scientists to identify large segments of deletions/duplications in a genome with respect to another (a reference) genome [14]. Detecting single-nucleotide polymorphisms (SNPs) is yet another application of RNA-Seq [3,15]. Furthermore, it has been shown that RNA-Seq can detect spike-in RNA controls with good accuracy and reproducibility [16].
The RNA-Seq process starts with samples that are randomly sheared to generate millions of small RNA fragments. These fragments are then converted to cDNA and adapter sequences are ligated to their ends. The length of a fragment can vary between approximately 30 bp and 110 bp. The Illumina system provides flowcells for sequencing by synthesis and its reversible terminator chemistry [2,3,10]. A flowcell is an eight-channel glass slide and each channel is commonly referred to as a lane. The size-selected fragments with the adapters attach to the flowcell surface inside the lanes and generate clusters of the same fragment through bridge amplification. Following the bending and attaching of both ends of the fragment to the surface, the strand duplicates. This process is repeated many times and results in a cluster of identical fragments. After the cluster generation step, a pool of floating nucleotides is added to the flowcell along with DNA polymerase to incorporate into the single-strand fragments in each cluster. Each nucleotide incorporation produces a unique fluorescent label and images are captured after each addition. Finally, image processing and base calling determine the base at each cycle of the sequencing. Each cluster produces a read whose length equals the number of cycles. Each RNA-Seq experiment produces millions of reads, depending on the input RNA material, length of the reads, desired coverage of the reference genome, number of samples per lane, etc. Following the sequencing experiment, the expression levels of the genes are estimated by mapping the reads to the reference genome. Many algorithms have been developed for this task, including ELAND [3], Bowtie [17], BWA [18], MAQ [19], SHRiMP [20], mrFast [14], mrsFAST [21], and SOAP [22]. After the gene expressions are determined, they can be used in further analyses, such as SNP detection or detecting differentially expressed genes.
The entire RNA-Seq sample processing pipeline, from generating reads to calling gene expressions, can involve different sources of error; e.g., base-calling is a probabilistic procedure and the quality scores assigned to each base of the reads carry small probabilities of being wrong. The reliability of reference genomes and mapping algorithms is an additional issue that needs attention. Thus, the entire RNA-Seq sample processing pipeline should be considered from a probabilistic point of view. In this study, we model the above-mentioned errors as a noise term in the model of the sample processing pipeline.
Discussion
In this section, we consider three specific models for “actual” gene-expression data, RNA-Seq count data and SAGE tag data. The models are used to synthetically generate data for the simulation experiments described in this paper. The performances of different classification schemes are analyzed and compared across these three synthetically generated types of data.
A common assumption in modeling the original mRNA expressions is that they follow a multivariate Gaussian distribution [9,23,24]. Starting with this assumption, we model and generate the RNA-Seq and SAGE data by applying a specific nonlinear Poisson transformation to the mRNA expression model. All data are synthetically generated according to a biologically relevant model to emulate real experimental situations, where the number of features/genes is very large, usually tens of thousands, and only a small number of sample points is available. Knowing the full distributional characteristics of the synthetic data makes it possible to measure classification performance as described by the respective error rates.
MVNGC model
The model proposed in [9] uses a block-based structure for the covariance matrix, which is a standard tool for modeling groups of interacting variables where there is negligible interaction between the groups. In genomics, genes within a block represent a distinct pathway and are correlated, whereas genes not in the same group are uncorrelated [25,26]. Sample points are drawn randomly and independently from two equally likely classes, 0 and 1, each sharing the same D features. There are also c equally likely subclasses in class 1, with different parameters for the probability distribution of the features. The subclasses model scenarios typically seen as different stages or subtypes of a cancer. Each sample point in class 1 belongs to one and only one of these subclasses. Features are categorized into two major groups: markers and non-markers. Markers resemble genes associated with a disease or a condition related to the disease, and they have different class-conditional distributions for the two classes.
Markers can be further categorized into two different groups: global and heterogeneous markers. Global markers take on values from D_{gm}-dimensional Gaussian distributions, with one set of parameters for the sample points from class 0 and another for the points from class 1. Heterogeneous markers, on the other hand, are divided into c subgroups of size D_{hm}, each associated with one of the c mutually exclusive subclasses within class 1. Therefore, a sample point that belongs to a specific subclass has D_{hm} heterogeneous markers distributed as a D_{hm}-dimensional Gaussian with subclass-specific parameters. The same D_{hm} heterogeneous markers for the sample points belonging to other subclasses, as well as for points in class 0, follow a different D_{hm}-dimensional Gaussian distribution. We assume that the global and heterogeneous markers have a similar structure for their covariance matrices. Therefore, we represent the covariance matrices of these two types of markers by a common form Σ for class 0 and class 1, respectively, where the class-conditional parameters can differ. For this structure, we assume that Σ is block-diagonal:

Σ = diag(Σ_{ρ}, Σ_{ρ}, …, Σ_{ρ}),
where Σ_{ρ} is an l×l matrix with 1 on the diagonal and ρ off the diagonal. Note that the dimension of Σ is different for the global and heterogeneous markers. Furthermore, we assume that the mean vectors for the global markers and the heterogeneous markers possess the same structure, denoted by μ_{0}=m_{0}×(1,1,…,1) and μ_{1}=m_{1}×(1,1,…,1) for class 0 and class 1, respectively, where m_{0} and m_{1} are scalars. The non-markers are also divided into two groups: high-variance and low-variance non-markers. The D_{hv} non-markers belonging to the high-variance group are uncorrelated, and their distributions are described by a two-component Gaussian mixture in which m_{0} and m_{1}, together with the component variances, take values equal to the means and variances of the markers, respectively, and the mixing weight p is a random value uniformly distributed over [0, 1]. The D_{lv} low-variance non-markers are uncorrelated and have identical one-dimensional Gaussian distributions [9,27]. Figure 1 represents the block-based structure of the model.
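The block-based covariance and constant mean vectors described above can be sketched as follows. The block count, block size l, ρ and m_{0} below are illustrative values, not the settings of Table 1.

```python
import numpy as np

def block_covariance(n_blocks, l, rho, sigma2=1.0):
    """Block-diagonal covariance: each l-by-l block has 1 on the
    diagonal and rho off the diagonal; distinct blocks are uncorrelated."""
    block = np.full((l, l), rho)
    np.fill_diagonal(block, 1.0)
    D = n_blocks * l
    cov = np.zeros((D, D))
    for b in range(n_blocks):
        cov[b * l:(b + 1) * l, b * l:(b + 1) * l] = sigma2 * block
    return cov

cov = block_covariance(n_blocks=3, l=5, rho=0.8)

rng = np.random.default_rng(1)
m0 = 0.5                              # illustrative scalar mean
mu0 = m0 * np.ones(cov.shape[0])      # mu_0 = m_0 x (1, 1, ..., 1)
x = rng.multivariate_normal(mu0, cov, size=10)   # 10 class-0 sample points
```

Within a block every pair of features has correlation ρ, while features in different blocks are uncorrelated, matching the pathway interpretation above.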
NGS-reads and SAGE-tags models
In NGS-type experiments, the gene-expression levels are measured by discrete values giving the number of reads that map to the respective gene. Several statistical models have been proposed for representing NGS data; those models are based on either negative binomial or Poisson distributions [4-6]. In what follows, we denote the read count for gene j in sample point i by X_{i,j}, where it is assumed that in an NGS experiment each lane has a single biological specimen belonging to either class 0 or class 1. Furthermore, we select a model where the number of reads for each gene is generated from a Poisson distribution with a known parameter. We calculate the expected number of reads (the mean of the Poisson distribution) from the generalized linear model [28]:

log E[X_{i,j} | s_{i}] = log s_{i} + λ_{i,j} + θ_{i,j},    (1)
where λ_{i,j} is the jth gene-expression level in lane i. The term θ_{i,j} represents technical effects that might be associated with an experiment. The term log s_{i} is an offset, where s_{i}, referred to as the “sequencing depth” in the statistical community [29], is a major factor in the transformation from expression levels to read data. It accounts for the different total numbers of reads produced by each lane and plays an important role in normalizing the specimens across the flowcell. The trimmed mean of M-values (TMM) [30], quantile normalization [28], and the median count ratio [4] are three commonly used methods for estimating the sequencing depth. Equation (1) represents the expected value of the reads, conditioned on the sequencing depth, as a linear combination of the factors that affect its value: the depth of sequencing, the gene-expression level and a general noise term. Therefore, it can be used to model the expected number of reads, as the mean of the Poisson distribution in our synthetic data generation pipelines. Rewriting equation (1) yields

E[X_{i,j} | s_{i}] = s_{i} · exp(λ_{i,j} + θ_{i,j}),    (2)
indicating that if λ_{i,j} and θ_{i,j} are normally distributed, then exp(λ_{i,j}+θ_{i,j}) will have a lognormal distribution. Therefore, for a given s_{i}, the mean of X_{i,j} is lognormally distributed. This phenomenon has been previously reported in microarray studies, where the means of expression levels have been shown to have lognormal distributions [23,31]. Furthermore, we assume that the offset s_{i} is random and uniformly distributed [29]. Because the term θ_{i,j} represents the unknown technical effects associated with the experimentation, we assume that it follows a Gaussian distribution with zero mean and a variance set by the coefficient of variation (COV):

θ_{i,j} ~ N(0, (COV × λ_{i,j})²).    (3)
COV aims to model the unknown errors that can occur during an NGS/SAGE experiment, including base-calling, mapping reads to the reference genome, etc. The term E[X_{i,j} | s_{i}] serves as the single parameter of a Poisson distribution that models the NGS/SAGE processes by generating random non-negative integers, as the read-count or tag-number data, having expected value equal to the right-hand side of equation (2).
To generate synthetic datasets for the purposes of our simulation study, we proceed as follows: for a sample point i in the experiment, we first randomly generate a number s_{i} from a uniform distribution, U(α, β), where α > 0 and β > α. Then, a random sample point, λ_{i,j}, is drawn from the MVNGC model and its value is perturbed with θ_{i,j}, which is drawn randomly according to its distribution defined in (3). Using (2), the mean of the Poisson distribution is calculated for each sample point i and gene j, and a single random realization of X_{i,j} is generated. The processes of generating count data for RNA-Seq reads and SAGE tag numbers are very similar, but the total number of reads per NGS run is significantly greater than the number of tags in a SAGE experiment. Therefore, we only change α and β to get the desired number of read counts or tags. We also assume that the SAGE experiments always have a fixed range for the total number of tags, whereas RNA-Seq has a variety of ranges for the total read counts. By having different ranges for the read counts, we can compare the classification performance resulting from the NGS-reads and SAGE-tags models under different experimental settings.
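The generation steps above can be sketched as follows. The exact form of the variance of θ_{i,j} is our reading of the COV definition, and all numerical settings are illustrative rather than the values of Table 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def to_read_counts(lam, alpha, beta, cov):
    """Transform one sample's expression levels lam (length-D vector)
    into Poisson read counts with mean s_i * exp(lam + theta), per
    equation (2).  alpha/beta bound the uniform sequencing-depth draw;
    theta is the zero-mean Gaussian technical-noise term whose spread
    is controlled by COV (variance formula assumed, not quoted)."""
    s_i = rng.uniform(alpha, beta)                 # sequencing depth s_i
    theta = rng.normal(0.0, cov * np.abs(lam))     # technical effects
    mean = s_i * np.exp(lam + theta)               # equation (2)
    return rng.poisson(mean)                       # one realization per gene

lam = rng.normal(1.0, 0.5, size=1000)   # illustrative stand-in for an MVNGC draw
counts = to_read_counts(lam, alpha=2.0, beta=3.75, cov=0.05)
```

Only α and β change between the SAGE-tags and NGS-reads settings; the rest of the transformation is shared, as described above.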
Classification schemes
The setup for the classification problem is determined by a joint feature-label probability distribution F_{XY}, where X is a random feature vector and Y ∈ {0,1} is the unknown class label of X. In the context of genomics, the feature vector usually comprises the expression levels of many genes, and the class labels are the different types or stages of disease to which the sample points belong. A classifier rule model is defined as a pair (Ψ, Ξ), where Ψ is a classification rule, possibly including feature selection, and Ξ is a training-data error estimation rule. In a typical classification task, a random training set of size n is drawn from F_{XY} and the goal is to design a classifier ψ_{n}, which takes X as the input and outputs a label Y. The true classification error of ψ_{n} is given by ε_{n} = P(ψ_{n}(X) ≠ Y). The error estimation rule Ξ provides an error estimate, ε̂_{n}, for ψ_{n}.
In this study, we consider linear discriminant analysis (LDA), three-nearest-neighbors (3NN) and the radial basis function support vector machine (RBF-SVM) as the classification rules, and report the true error of the classifiers. We implement t-test feature selection (as a part of the classification rule) before the classifier design procedure to select the d ≤ D features with the highest t-scores. The training set with d features is then used to design the classifier. The same d features are also used for finding the true error of the designed classifier.
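The t-test feature selection used here amounts to ranking features by the two-sample t-statistic and keeping the top d. The following sketch uses synthetic data with marker positions chosen for illustration.

```python
import numpy as np

def ttest_select(X, y, d):
    """Rank features by the pooled-variance two-sample t-statistic and
    return the indices of the d largest absolute scores."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    pooled = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    t = (m1 - m0) / np.sqrt(pooled * (1.0 / n0 + 1.0 / n1))
    return np.argsort(-np.abs(t))[:d]

rng = np.random.default_rng(3)
n, D = 60, 200                      # small sample, many features
y = np.repeat([0, 1], n // 2)
X = rng.normal(0.0, 1.0, size=(n, D))
X[y == 1, :5] += 3.0                # features 0..4 are the true markers
selected = ttest_select(X, y, d=5)
```

With a strong enough mean shift, the five planted markers dominate the ranking; with weaker shifts and small n, noisy features can displace them, which is exactly the failure mode discussed in the Results.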
LDA is the plug-in rule for the Bayes classifier when the class-conditional densities are multivariate Gaussian with a common covariance matrix [32]. The sample means and the pooled sample covariance matrix are estimated from the training data and plugged into the discriminant function. If the classes are equally likely, LDA assigns x to class 1 if and only if

(μ̂_{1} − μ̂_{0})^{T} Σ̂^{−1} (x − (μ̂_{0} + μ̂_{1})/2) > 0,
where μ̂_{y} is the sample mean for class y ∈ {0,1}, and Σ̂ is the pooled sample covariance matrix.
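The plug-in LDA rule amounts to estimating the two sample means and the pooled covariance and thresholding the discriminant at zero. A minimal sketch on illustrative Gaussian data:

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in LDA: sample means and pooled sample covariance matrix."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    S = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(S, mu1 - mu0)       # S^{-1} (mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)
    return w, b

def lda_predict(X, w, b):
    # Assign class 1 iff the linear discriminant is positive.
    return (X @ w + b > 0).astype(int)

rng = np.random.default_rng(4)
X0 = rng.normal(0.0, 1.0, size=(100, 3))    # class 0
X1 = rng.normal(2.0, 1.0, size=(100, 3))    # class 1, shifted mean
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)
w, b = lda_fit(X, y)
acc = (lda_predict(X, w, b) == y).mean()
```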
3NN is a special case of the kNN rule (with k = 3), which is a nonparametric classification rule based on the training data. The kNN classifier assigns a label, 0 or 1, to a point x according to the majority of the labels of the k training data points nearest to it. To avoid tied votes in binary classification problems, an odd number is usually chosen for k.
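A minimal kNN sketch (Euclidean distance, majority vote) on a tiny illustrative dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """kNN rule: each test point takes the majority label of its k
    nearest training points (Euclidean distance); an odd k avoids
    ties in binary problems."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]     # indices of k neighbors
    votes = y_train[nearest].sum(axis=1)           # number of class-1 votes
    return (votes > k // 2).astype(int)

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([[0.05], [1.15]]), k=3)
```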
A support vector machine finds a maximal-margin hyperplane for a given set of training sample points. If it is not possible to linearly separate the data, one can introduce slack variables into the optimization procedure that allow mislabeled sample points, and solve the dual problem. Alternatively, one can transform the data and project it into a higher-dimensional space, where it becomes linearly separable; the equivalent classifier back in the original feature space will generally be nonlinear [33,34]. If a Gaussian radial basis function is used as the kernel function, the corresponding classifier is referred to as RBF-SVM.
Simulation setup and parameters
Figure 2 presents an overview of the simulations performed in the study. In this section we provide the setup and list of parameters used in the study. The analysis of the results follows in the next section.
Figure 2. Three different types of data are generated: (1) MVNGC; (2) NGS-reads; and (3) SAGE-tags. Data sets are generated according to the respective statistical models. Then, the data are fed to the feature selection, classification and error estimation module.
To emulate real-world scenarios involving thousands of genes with only a few sample points, we choose D = 20,000 as the number of features and n ∈ {60, 120, 180} as the number of sample points available for each synthetic feature-label distribution. Because there is no closed-form expression for calculating the true error of the designed classifiers, we generate a large independent test sample of size n_{t} = 3000, with samples divided equally between the two classes. Because the RMS between the true and estimated error when using independent test data is bounded above by 1/(2√n_{t}), this test sample size provides an error-estimate RMS of less than 0.01. Once the training and test data are generated, we normalize the training data so that each feature is zero-mean and unit-variance across all sample points in both classes. We then apply the same normalization coefficients from the training data to the test set. The normalized data are used in the feature selection, classifier design and error calculation. The parameter settings for the MVNGC, SAGE-tags and NGS-reads models are provided in Table 1.
Table 1. MVNGC, SAGE-tags and NGS-reads model parameters
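The normalization step, in which zero-mean unit-variance coefficients are fitted on the training data and then reused on the test set, can be sketched as follows (array sizes are illustrative):

```python
import numpy as np

def zscore_fit(X_train):
    """Per-feature mean and standard deviation from the training data only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0              # guard against constant features
    return mu, sd

def zscore_apply(X, mu, sd):
    return (X - mu) / sd

rng = np.random.default_rng(5)
X_train = rng.normal(10.0, 4.0, size=(120, 6))
X_test = rng.normal(10.0, 4.0, size=(3000, 6))

mu, sd = zscore_fit(X_train)
Z_train = zscore_apply(X_train, mu, sd)
Z_test = zscore_apply(X_test, mu, sd)   # same coefficients: no test-set leakage
```

Fitting the coefficients on the training data alone keeps the test set independent, which is what makes the RMS bound on the error estimate valid.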
As explained in the previous section, the datasets generated from the MVNGC model are transformed into the NGS-reads and SAGE-tags datasets through equation (2) and Poisson processes. We only need to properly set the parameters COV, α, and β to get the desired number of read counts or tags. We assume that the parameter COV can take on two values, 0.05 and 0.1, representing two different levels of noise and unknown errors in the experiment. In its current state, RNA-Seq technology can provide different numbers of reads per sample, depending on many factors, such as the quality of the sample, the desired coverage, sample multiplexing, etc. In this study, we examine a variety of ranges for RNA-Seq experiments. We start with a very low total number of RNA-Seq reads, which may not match real-world experiments but is necessary for comparing the SAGE-tags and NGS-reads models at similar coverage. Furthermore, demonstrating the classification results for the NGS-reads model with wide ranges introduces extra internal variability, which makes interpretation of the results rather difficult. Table 1 lists the NGS-reads ranges considered in this study. In a typical SAGE experiment, one expects 50K to 100K tags [35,36]. Using trial and error, we have found that choosing the parameters α = 2.0 and β = 3.75 in the distribution of s_{i} makes the observed number of tags usually fall within this range. Similarly, the parameters α and β are chosen to meet the range requirements of the NGS-reads model.
Our goal is to study the performance of traditional classification schemes on different sources of random samples; thus, we take a Monte Carlo approach: we generate 10,000 random training sets of size n from the MVNGC model, transform them into the corresponding NGS-reads and SAGE-tags samples, and apply the classification schemes to each training sample. By taking such an approach, we aim to compare the classification performance of the three pipelines in terms of the true error of the designed classifiers. We also study the effect of the NGS-reads and SAGE-tags transformations on the performance of a simple t-test biomarker discovery method, where we report the probability that the global markers are recovered when d ≪ D features are selected in the feature-selection step.
Results
The probability of recovering a certain number of global markers after t-test feature selection can be approximated empirically by the percentage of experiments (out of 10,000 independent experiments) that detect that number of markers. This probability depends on the size of the training data sample, the quality of the features, and the underlying joint probability distribution of the features. Here, we only show the results for the MVNGC model, with d = 10, in Table 2. Tables 3, 4, 5 and 6 present the corresponding results for NGS-reads for different combinations of COV and ρ: {0.05, 0.4}, {0.05, 0.8}, {0.1, 0.4} and {0.1, 0.8}, respectively. In each table, we also report the corresponding results for the SAGE-tags model in a row with the NGS-reads range of [50K–100K]. A successful feature selection should identify as many global markers as possible; however, with small sample sizes noisy features may sometimes appear to be good among the 20,000 features, worsening the situation. As the number of sample points increases, we expect to get better results for the feature selection, and this is exactly what we see in Tables 2, 3, 4, 5 and 6. Another important observation in Tables 3, 4, 5 and 6 is the effect of the total number of reads and of COV for the NGS-reads models. As the number of reads increases, more global markers are likely to be picked up, until the number of reads reaches a threshold beyond which no further improvement is observed. Similarly, for smaller COV the probability of selecting more global markers also increases.
Table 2. Percentage of the MVNGC experiments recovering a certain number of global markers after feature selection
Table 3. Percentage of the NGS-reads experiments recovering a certain number of global markers after feature selection for COV=0.05 and ρ=0.4
Table 4. Percentage of the NGS-reads experiments recovering a certain number of global markers after feature selection for COV=0.05 and ρ=0.8
Table 5. Percentage of the NGS-reads experiments recovering a certain number of global markers after feature selection for COV=0.1 and ρ=0.4
Table 6. Percentage of the NGS-reads experiments recovering a certain number of global markers after feature selection for COV=0.1 and ρ=0.8
The true error of a designed classifier measures the generalization capability of the classifier on a future sample point. Given a set of training sample points and a classification rule, one needs the full feature-label probability distribution to calculate the true error of the classifier designed on the training set. In this study, we find the true error of a classifier with a large independent sample. Because the training sample is random, the true error is a random variable with its own probability distribution. Therefore, to demonstrate the actual performance of the designed classifiers, we present the estimated probability density function (PDF) of the true error for each classification rule, distribution model and data generation model (MVNGC, NGS-reads and SAGE-tags).
We report the true errors of the designed classifiers for two different feature-selection schemes and the classification rules. In the first scheme, no feature selection is performed and the first d global markers are used directly for classification. In the second scheme, t-test feature selection is performed before a classifier is designed, to select the best d features. Figures 3, 4 and 5 show the salient findings of this study. We present the PDF of the true error for the different classification rules trained on 60 and 180 sample points when the correlation among features in the same block is high (ρ = 0.8) and COV = 0.05. Distributions with higher and tighter densities around lower true errors indicate better classifier performance. If the PDFs can be approximated as univariate Gaussians, then good classification performance amounts to a smaller mean and variance. In all cases, for similar sample sizes and similar settings of ρ and COV, MVNGC outperforms the two other data types. This is not surprising, since it is considered the ground truth. Also, it is evident that a larger sample size gives better classifiers with smaller variance, as illustrated by the distribution of the true error.
Figure 3. Probability density estimate of the true error for the LDA classification rule, without feature selection (left column, d=10 global markers are directly used) and with t-test feature selection (right column), ρ=0.8, COV=0.05, for SAGE-tags and the following total numbers of reads for NGS-reads: (a,b); (c,d); (e,f); (g,h).
Figure 4. Probability density estimate of the true error for the 3NN classification rule, without feature selection (left column, d=10 global markers are directly used) and with t-test feature selection (right column), ρ=0.8, COV=0.05, for SAGE-tags and the following total numbers of reads for NGS-reads: (a,b); (c,d); (e,f); (g,h).
Figure 5. Probability density estimate of the true error for the RBF-SVM classification rule, without feature selection (left column, d=10 global markers are directly used) and with t-test feature selection (right column), ρ=0.8, COV=0.05, for SAGE-tags and the following total numbers of reads for NGS-reads: (a,b); (c,d); (e,f); (g,h).
Figures 3, 4 and 5 show that, utilizing the best d features (with or without feature selection) in the model, SAGE-tags and NGS-reads with small ranges of total reads yield similar classification performance (or even much worse when the range is smaller), and both are inferior to the ground truth. This may suggest that one should not expect improvements in classification performance when gene-expression levels are processed through an NGS-reads pipeline. However, our main objective here is to show that as one increases the total number of reads for the NGS-reads model, improvement can be achieved and the error rates decrease. This conclusion confirms the intuition provided by a simple calculation of the increase in separability of two Poisson random variables with means proportional to the number of reads. Notice, however, that the modeling of the sequencing pipeline introduces randomness into the means of the respective Poisson parameters describing the individual gene reads. Moreover, our focus is not so much on the separability of the two classes/phenotypes as on the classification performance, measured by the classification error, and its dependence on the ground truth (MVNGC model), sample size, sequencing depth, and feature vectors. Our goal in modeling NGS-reads with small ranges is to demonstrate the performance of SAGE and RNA-Seq when their data are similar. This result suggests that having a larger number of reads in RNA-Seq experiments could compensate for the errors that can occur during the NGS-reads sample processing pipeline. Here we have shown the results only for COV = 0.05, since in our observations changing COV has little effect on the distribution of the true error for both the SAGE-tags and NGS-reads models, other than slightly increasing the variance of the PDFs.
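The simple separability calculation mentioned above can be made explicit: if the two classes produce counts with means s·λ_{0} and s·λ_{1}, the standardized separation (s·λ_{1} − s·λ_{0})/√(s·λ_{0} + s·λ_{1}) grows like √s. The values of λ_{0} and λ_{1} below are illustrative.

```python
import numpy as np

lam0, lam1 = 5.0, 6.0          # illustrative per-class expression levels
depths = [1, 10, 100]          # increasing sequencing depth s

# Difference of Poisson means divided by the std of the difference,
# sqrt(s*lam0 + s*lam1): this grows proportionally to sqrt(s).
seps = [(s * lam1 - s * lam0) / np.sqrt(s * lam0 + s * lam1) for s in depths]
print([round(v, 2) for v in seps])
```

Multiplying the depth by 100 thus multiplies the separation by 10, which is the intuition behind recommending the highest feasible coverage, subject to the caveat in the text that the pipeline also randomizes the Poisson means themselves.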
The effect of feature selection (right columns) on performance is most evident when the sample size is small. In that case, the variance of the true error is so large that drawing any conclusion from a small sample would be futile. Another interesting observation is that, on average, the 3NN and RBF-SVM classification rules outperform LDA for both NGS-reads and SAGE-tags data, supporting the view that linear classifiers are not the best choice for a highly non-Gaussian model, as is the case after the data have gone through the pipeline. The best expected true-error rates across all settings are achieved by the RBF-SVM classification rule, even for small samples, and with notably small variance.
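Why nonlinear rules can beat a linear one on non-Gaussian class structure can be seen in a toy sketch (unrelated to the paper's actual data model): on XOR-like two-class data, a nearest-class-mean rule, which is what LDA reduces to under a shared spherical covariance, is near chance, while 3NN recovers the structure. All parameters here are illustrative.

```python
import random

def make_xor(n, rng, noise=0.4):
    # four Gaussian blobs at (+-1, +-1); label = sign of x*y (XOR structure)
    data = []
    for _ in range(n):
        cx, cy = rng.choice([-1, 1]), rng.choice([-1, 1])
        x = (cx + rng.gauss(0, noise), cy + rng.gauss(0, noise))
        data.append((x, 1 if cx * cy > 0 else -1))
    return data

def knn_predict(train, x, k=3):
    # 3NN: majority vote among the k nearest training points
    neigh = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2
                                        + (p[0][1] - x[1]) ** 2)[:k]
    return 1 if sum(lab for _, lab in neigh) > 0 else -1

def nearest_mean_predict(train, x):
    # LDA with a shared spherical covariance reduces to nearest class mean
    dist = {}
    for lab in (-1, 1):
        pts = [p for p, l in train if l == lab]
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        dist[lab] = (x[0] - mx) ** 2 + (x[1] - my) ** 2
    return min(dist, key=dist.get)

rng = random.Random(7)
train, test = make_xor(200, rng), make_xor(500, rng)
err_knn = sum(knn_predict(train, x) != y for x, y in test) / len(test)
err_lin = sum(nearest_mean_predict(train, x) != y for x, y in test) / len(test)
print(err_knn, err_lin)  # the nonlinear rule wins on this geometry
```

Both class means sit near the origin, so any linear boundary misclassifies roughly one of each class's two blobs, whereas local voting does not.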
Conclusions
In this paper, we model gene-expression levels as a multivariate Gaussian distribution that statistically captures the real mRNA levels within the cells. The newly developed DNA/RNA sequencing technologies, referred to as NGS, and their predecessor, SAGE, generate discrete measurements of gene expression. The count data from these technologies can be modeled by a Poisson distribution, so the multivariate Gaussian gene expressions are passed through a Poisson filter to model the NGS and SAGE technologies. The three categories of data are subjected to feature selection and classification, the objective being to evaluate the performance of the NGS technologies in classification. Our simulations show that the best performance, in terms of classification error, is achieved when the gene expressions are used directly for classification. The NGS-reads model generates considerably higher coverage for genes and can outperform SAGE in classification when the experiment generates a large number of reads. Even though the NGS process involves a variety of error sources, its high volume of reads for a specific gene can lower the chance of misclassification. Thus, if the goal of the study is to distinguish the samples of interest, it is important to use the highest possible coverage of the genome when performing an RNA-Seq analysis. Nevertheless, one must recognize that, as is typical with nonlinear transformations, the NGS pipeline transforms the original Gaussian data in a way that increases classification difficulty. As more refined modeling of the NGS pipeline becomes available, further study needs to be performed, as has been done for the LC-MS proteomics pipeline [1], to determine which segments of the pipeline are most detrimental to classification and what, if anything, can be done to mitigate the degradation.
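The Gaussian-through-Poisson measurement chain can be sketched as follows. This is a simplified illustration, not the paper's simulation code: the log-expressions are drawn independently here (whereas the MVNGC model includes correlation), the exponentiation is used only to keep concentrations positive, and all parameter values are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    # Knuth's multiplication method; adequate for the moderate means here
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def ngs_counts(mu, sigma, depth, rng):
    """One simulated sample: Gaussian log-expressions -> positive
    concentrations -> Poisson read counts whose means are proportional
    to the sequencing depth (the 'Poisson filter')."""
    logexpr = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    conc = [math.exp(v) for v in logexpr]
    total = sum(conc)
    return [poisson(depth * c / total, rng) for c in conc]

rng = random.Random(1)
mu = [1.0, 2.0, 0.5, 1.5]   # hypothetical per-gene log-expression means
sigma = [0.3] * 4
sample = ngs_counts(mu, sigma, depth=500, rng=rng)
print(sample, sum(sample))  # counts sum to roughly the sequencing depth
```

Raising `depth` tightens the relative Poisson noise on each count, which is the mechanism behind the coverage recommendation above; the randomness injected into the Poisson means by the expression draw is what the direct use of expressions avoids.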
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
NG proposed the main idea, planned and structured the study, and worked on the simulations and the manuscript. MRY contributed to the main idea, the simulations, the structure of the study, and the manuscript. CDJ and II helped conceive the study and revised the manuscript. ERD contributed to the formulation of the main idea and to the steps of the study, and revised the manuscript. All authors read and approved the final manuscript.
References

Sun Y, Braga-Neto UM, Dougherty ER: Modeling and systematic analysis of the LC-MS proteomics pipeline.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008, 5(7):621-628.

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(6):53-59.

Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol 2010, 11(10):R106.

Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26:139-140.

Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18(9):1509-1517.

Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 2007, 23(21):2881-2887.

Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data.

Hua J, Waibhav T, Dougherty ER: Performance of feature selection methods in the classification of high-dimensional data. Pattern Recognit 2009, 42(3):409-424.

Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008, 9:387-402.

Auer PL, Doerge RW: Statistical design and analysis of RNA sequencing data. Genetics 2010, 185(2):405-416.

Sun W: A statistical framework for eQTL mapping using RNA-seq data. Biometrics 2011, 68:1-11.

Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008, 321:956-960.

Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, Sahinalp SC, Gibbs RA, Eichler EE: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 2009, 41:1061-1067.

Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan DA, Homer N, Huentelman MJ: Identification of genetic variants using barcoded multiplexed sequencing. Nature Methods 2008, 5:887-893.

Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B: Synthetic spike-in standards for RNA-seq experiments. Genome Res 2011, 21(9):1543-1551.

Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10:R25.

Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25:1754-1760.

Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18:1851-1858.

Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 2009, 5(5):e1000386.

Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler E, Sahinalp SC: mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods 2010, 7:576-577.

Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5):713-714.

Attoor S, Dougherty ER, Chen Y, Bittner ML, Trent JM: Which is better for cDNA-microarray-based classification: ratios or direct intensities? Bioinformatics 2004, 20(16):2513-2520.

Dalton LA, Dougherty ER: Application of the Bayesian MMSE error estimator for classification error to gene-expression microarray data. Bioinformatics 2011, 27(13):1822-1831.

Dougherty ER: Validation of computational methods in genomics. Curr Genomics 2007, 8:1-19.

Shmulevich I, Dougherty ER: Genomic Signal Processing. Princeton: Princeton University Press; 2007.

Yousefi MR, Hua J, Dougherty ER: Multiple-rule bias in the comparison of classification rules. Bioinformatics 2011, 27:1675-1683.

Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.

Li J, Witten DM, Johnstone IM, Tibshirani R: Normalization, testing, and false discovery rate estimation for RNA-sequencing data.

Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data.

Hoyle DC, Rattray M, Jupp R, Brass A: Making sense of microarray data distributions. Bioinformatics 2002, 18(4):576-584.

Duda RO, Hart PE, Stork DG: Pattern Classification. New York: Wiley; 2001.

Boser BE, Guyon IE, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory 1992, 144-152. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3818

Wang SM: Understanding SAGE data. Trends Genet 2006, 23:42-50.

Bianchetti L, Wu Y, Guerin E, Plewniak F, Poch O: SAGETTARIUS: a program to reduce the number of tags mapped to multiple transcripts and to plan SAGE sequencing stages. Nucleic Acids Res 2007, 35(18):e122.