Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

A Bayesian nonparametric method for prediction in EST analysis

Antonio Lijoi1, Ramsés H Mena2* and Igor Prünster3

Author Affiliations

1 Department of Economics and Quantitative Methods, University of Pavia, 27100 Pavia and Institute for Applied Mathematics and Information Technology, National Research Council, 20133 Milan, Italy

2 Research Institute for Applied Mathematics and Systems, National Autonomous University of Mexico, Mexico City, A.P. 20-726, Mexico

3 Department of Statistics and Applied Mathematics and ICER, University of Turin, 10122 Turin and Carlo Alberto College, 10024 Moncalieri, Italy

For all author emails, please log on.

BMC Bioinformatics 2007, 8:339  doi:10.1186/1471-2105-8-339


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/8/339


Received:27 February 2007
Accepted:14 September 2007
Published:14 September 2007

© 2007 Lijoi et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Expressed sequence tags (ESTs) analyses are a fundamental tool for gene identification in organisms. Given a preliminary EST sample from a certain library, several statistical prediction problems arise. In particular, it is of interest to estimate how many new genes can be detected in a future EST sample of given size and also to determine the gene discovery rate: these estimates represent the basis for deciding whether to proceed sequencing the library and, in case of a positive decision, a guideline for selecting the size of the new sample. Such information is also useful for establishing sequencing efficiency in experimental design and for measuring the degree of redundancy of an EST library.

Results

In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries, previously studied with frequentist methods, are analyzed in detail.

Conclusion

The Bayesian nonparametric approach we undertake yields valuable tools for gene capture and prediction in EST libraries. The estimators we obtain do not feature the kind of drawbacks associated with frequentist estimators and are reliable for any size of the additional sample.

Background

Expressed Sequence Tags (ESTs) are generated by partially sequencing randomly isolated gene transcripts that have been converted into cDNA. From their introduction in Adams et al. [1], ESTs have played an important role in the identification, discovery and characterization of organisms as they provide an attractive and efficient alternative to full genome sequencing. The resulting transcript sequences and their corresponding abundances are the main focus of interest providing the identification and level of expression of genes. Important issues to be addressed in terms of design of a future study are: (1) a comparison among different cDNA libraries of the same organism with the aim of detecting the least redundant library; (2) the determination of an "optimal" number of genes to be sequenced by the experimenter. Indeed, these issues are relevant since, despite the novel advances in technology, see [2], sequencing is still expensive and therefore suitable cost-effectiveness thresholds must be established. This suggests that there is the need for assessing the relative redundancy of various libraries prepared from the same organism in order to detect which one yields new genes at a higher rate. Indeed, there are 'normalization' protocols which aim at making the frequencies of genes in the library more uniform thus typically improving the discovery rate. However, performing such protocols is also expensive. Hence, the decision, whether to proceed with sequencing of a non-normalized library or to resort to a normalization procedure, has to balance carefully the involved costs: such a decision is necessarily based on statistical estimates of the coverage of the given sample, of the expected number of new genes in a future sample and on the future discovery rate. Note that ideally one would like to sequence the smallest possible portion of the library and, based on the outcome, predict the tentative future sequencing well beyond the size of the given dataset.

The practical issues discussed above naturally translate in statistical problems which, given an initial sample of EST, can be described as follows:

a) Coverage: Coverage can be seen as the proportion of genes in the library represented in the initial sample or, equivalently, the probability that a new read will not produce a new gene. The coverage estimate provides a first description of redundancy of the library.

b) Expected number of new genes: Having observed an initial sample of size n generated from the cDNA library and estimated its coverage, prediction of outcomes of further reads is in order. The first question to answer is: 'How many new unique genes are expected to be detected in an additional EST dataset of targeted size m?' Such estimates provide, then, an overall measure of redundancy of the library with reference to a further EST survey.

c) Discovery rate: In addition to the expected number of genes in a future sample of size m, it is also important to establish the rate at which the probability of discovering a new gene decays as more and more reads are recorded. In other words, interest lies in determining the probability that the (n + m + 1)-th read leads to a new gene, given the observed initial sample of size n and regardless of the experimental outcome yielded by the m intermediate draws. The availability of the discovery rate as a function of the size of the future sample m, represents then a pointwise predictive measure of the evolution of redundancy as the sequencing ideally proceeds.

Note that the combination of the measures under b) and c) provides a natural guideline for selecting the size of a future sample m. Supposing the targeted number of new genes is j, the estimator in b) guides the selection of the minimum sample size <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M1">View MathML</a> which should lead to j new genes. Then, one can resort to the discovery rate: in case it is relatively low around <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M1">View MathML</a>, it may be convenient to reduce the size of the future sample in a way that the discovery rate does not fall below a threshold suggested by the problem at issue. In such a case, one obviously would not be able to achieve the targeted number j of newly observed genes: costs considerations may indeed suggest that, with such a low discovery rate, sequencing is too expensive. On the other hand, if the discovery rate around <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M1">View MathML</a> is still relatively high, one may decide to enlarge the survey size. Moreover, the information conveyed by b) and c) is useful in comparing libraries and, again, it is worth considering these estimates together. Indeed, suppose we have to compare two libraries and that, for a fixed size m of the additional sample, library 1 yields a larger expected number of new genes but a lower discovery rate in comparison with library 2. If the sample size m is increased to m + m', for m' sufficiently large, the comparison between the two libraries can lead to different conclusions in the sense that a larger number of new genes is predicted for library 2. This happens because library 1 features a lower discovery rate, which implies that, within the additional m' draws, the expected number of new genes is lower for library 1. With reference to 'normalization' protocols, this means that the decision whether to carry it out or not should also depend on the foreseen sample size. For instance, the normalized Mastigamoeba balamuthi data we analyze exhibit a higher discovery rate, with respect to the non-normalized one, for small m. But, since the discovery rate has a faster decay, it appears that, already for moderately large m, the effect of the 'normalization' is exhausted producing fewer number of new genes. The three questions raised above can be seen as particular instances of classical species sampling problems: indeed, in the present context each species takes on the meaning of gene and the population is given by the library. Species problems appear in a variety of different applied situations such as astronomy, ecology, linguistics, machine learning, population biology. We now briefly recall well-known estimation methods which have recently been applied to EST data and then outline the key ideas of our Bayesian nonparametric approach.

Estimation methods

The main frequentist tools, that are useful for inference on the cDNA library properties described in the previous section, are based on the theory set forth in Good [3] and Good and Toulmin [4], where nonparametric estimators for the sample coverage and the expected number of new species to be detected in a future sample of size m, given the initial sample, are provided. The estimator of the sample coverage in [3] coincides with the proportion of distinct species represented by at least two units in the sample. Good attributes the original idea to Turing and this explains why it is usually referred to as Turing estimator. The popular Good-Toulmin estimator for the number of new species to be observed in a future sample is derived in [4] and, as a by-product, an evaluation of the discovery probability is achieved. Recently, the interest in species sampling problems has remarkably grown, mainly due to their importance in genomics. Indeed, Mao [5] studies various properties of the Good-Toulmin estimator and shows that it can be also viewed as a nonparametric empirical Bayes estimator. In [6], the authors suggest a parametric variation of the Good-Toulmin estimator. An alternative to it is presented in [7], where the detection of ESTs from each gene in EST sequencing is modeled by means of a Poisson process whose intensity is governed by some unknown distribution. It is to be noted that all frequentist nonparametric approaches lead to reliable estimates for the number of new genes in an additional sample only if its size is not too large. For instance, if the size of the additional survey m is larger than the initial sample n, it is well-known that the Good-Toulmin predictor can become a monotone decreasing function of m: this leads to the paradox of predicting fewer new genes by enlarging the additional sample size m. Even the nonparametric alternative proposed in [7] yields reliable results only when m ≤ 2n. This fact is also outlined in [8]. Hence, if one wishes to predict the number of new genes for large m, one needs to resort to a parametric framework. As we will see, the relative dimension of m with respect to n is not an issue in a Bayesian nonparametric framework, and the expected number of new genes that will be discovered in m further reads is monotone increasing with respect to m.

The application of Bayesian methods in this area of research is, to the authors' knowledge, quite modest even if the Bayesian learning scheme is very well suited for making predictions with EST data. An early contribution, based on a model for sampling from a finite population, is provided by Hill [9] where posterior estimates of the coverage are obtained. However, computational problems do not allow, in this approach, a direct and effective evaluation of the expected number of new species in a future sample. Recently, Lijoi et al. [10] have proposed new Bayesian nonparametric estimators for the problems a)-c) mentioned above. The prior distribution they employ is induced by a family of exchangeable Gibbs random partitions. See Pitman [11] for an interesting review of recent advances and applications of the theory of Gibbs random partitions. Their application to a Bayesian inferential framework is very useful since they provide a general scheme which encompasses some of the most notable nonparametric priors such as the Dirichlet and the two parameter Poisson-Dirichlet process. In this paper, we apply the general formulas derived in [10] with the two parameter Poisson-Dirichlet process as prior distribution. It will be seen that the resulting expressions can be evaluated exactly and do not need for any supplementary simulation scheme. Moreover, such a Bayesian approach does not incur in any problem for large values of m since all possible behaviors of future EST data are incorporated in the probabilistic model.

Results and Discussion

EST Datasets

The datasets we analyze consist of ESTs samples obtained from cDNA libraries from two different organisms: the amitochondriate protist Mastigamoeba balamuthi (non-normalized and normalized libraries, where the normalized library was prepared from the non-normalized library) and Naegleria gruberi libraries, prepared from cells grown under different culture conditions, aerobic and anaerobic. These data sets have been previously analyzed in [6], where a full account of their preparation is detailed. It is worth mentioning that our approach assumes full-length cDNA clones and high quality sequence reads.

Therefore, possible errors associated with the clustering procedure are not considered. For the statistical identification and evaluation of types of clustering errors one may incur in EST sequencing, the reader is referred to Wang et al. [12].

Specifically, each EST survey consists of n reads with k unique genes and corresponding frequencies n1,...,nk, i.e. ni is the number of tags displaying the i-th gene in the initial sample of size n. Clearly, <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M2">View MathML</a>. The reads can equivalently be clustered according to their level of expression, that is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M3">View MathML</a>

(1)

where I(A) is an indicator of A: I(A) = 1 if A is true and 0 otherwise. Note that s represents the maximum level of expression among unique genes in the sample and that the number of positive rl's is typically smaller than s.

Table 1 summarizes the four EST samples using the compact notation set in (1). For example, the survey of the naeglaria aerobic library produces n = 959 reads with k = 473 unique genes, which are clustered into 17 levels of expression 1, 2,...,12, 16, 17, 18, 27, 55. For the first level we have r1 = 346, meaning that 346 genes appear just once, that is n1 = n2 = … = n346 = 1. For the second level r2 = 57 implies that 57 genes appear twice and, hence, n347 = n348 = … = n403 = 2 and so on up to r55 = 1, which means that 1 gene is represented 55 times yielding n473 = 55.

Table 1. EST surveys information clustered into levels of expression

Coverage, estimation of the number of new genes and discovery rate

We applied the Bayesian nonparametric method, detailed in the following section, to these datsets and obtained the following results. Denote the unknown proportion of genes (in the whole library) belonging to the i-th class by pi. Then, the coverage of the initial sample of size n is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M4">View MathML</a>

(2)

which is precisely the proportion of unique genes represented in the initial sample. Our estimates for the coverage are 0.47 and 0.45 for the non-normalized (n = 715) and normalized (n = 363) Mastigamoeba, respectively. This means that, by virtue of the 'normalization', an initial sample of about half the size produces almost the same coverage. Moreover, we get 0.64 and 0.49 for the aerobic (n = 959) and anaerobic (n = 969) Naegleria, respectively: clearly, the tissue cultured aerobically achieves a remarkably higher coverage with an initial sample of the same size, which may lead to conclude that the sequencing of the aerobic tissue is more effective. However, such a finding could also be the consequence of a higher redundancy in the aerobic library. Finally, it is worth noting that our results for the coverage match exactly the ones obtained in [6], where the frequentist estimator described in [3] was exploited.

Turning attention to predicting the outcomes of future sequencing for the libraries at issue, we focus on the expected number of new genes in an additional sample of size m and on the discovery rate. The first index provides an overall measure of redundancy with respect to the additional sample of size m, whereas the discovery rate predicts the trend at which the discovery probability decays as more and more reads come in. If one adopts a Bayesian nonparametric approach, these quantities can be estimated rigorously and exactly since such an approach is naturally designed for prediction. In contrast note that, as already anticipated, the Good-Toulmin estimator becomes highly variable and unstable if the size of the additional sample m is larger than the size of the initial sample n. In particular, the Good-Toulmin estimator often produces negative values as estimates for the number of new genes if m ∈ (n, 2n) and almost always behaves badly for m > 2n. Such a phenomenon can be seen in Figure 1 for the two Naegleria libraries. In order to overcome these problems, frequentist methods typically give up the flexibility of the nonparametric approach and resort to parametric models, whose fit can be a delicate issue. For instance, Susko and Roger [6] resort to an approximated version of the Good-Toulmin estimator which assumes a parametric model for the expression levels rl.

thumbnailFigure 1. Expected number of new genes: comparison with Good-Toulmin estimator. Expected number of new genes in an additional sample for the Naegleria gruberi aerobic and anaerobic libraries arising from the application of the Good-Toulmin estimator and of the Bayesian nonparametric estimator.

In order to give a complete picture, it is important to accompany our point estimates with the 95% highest posterior density intervals, which represent the Bayesian counterpart to frequentist confidence intervals (see [13]). In Tables 2 and 3 the results arising by the application of the Bayesian nonparametric method are displayed.

Table 2. Estimates for the Mastigamoeba libraries

Table 3. Estimates for the Naeglaria libraries

As for the Mastigamoeba libraries, an interesting phenomenon takes place: the survey of the normalized library has achieved almost the same coverage (0.45) as the non-normalized library (0.47), but when considering an additional sample it exhibits a significantly faster decay in the discovery rate. Figure 2 compares the discovery rate for the two libraries. It is worth pointing out that our estimates predict that the discovery rates associated to both libraries coincide for m = 125 yielding a discovery probability of 0.508. For larger m the non-normalized exhibits a higher discovery rate. This implies that at some point also the estimates for the expected number of new genes in the additional sample will coincide: indeed, this is estimated to happen for m = 270, for which 137 new genes are predicted to be identified from both libraries. Hence, for m > 270 the expected number of new genes is systematically higher for the non-normalized library. For instance, if m = 1089, just 477 new genes are expected for the normalized library and 510 for the non-normalized. Taking m larger, at some point even the highest posterior density intervals will not overlap anymore. Such a behavior hints toward the fact that, in deciding whether to perform a 'normalization' protocol, the sizes of the samples to be drawn from the libraries is a variable to be taken into account.

thumbnailFigure 2. Discovery rate. Bayesian nonparametric estimates of the discovery rate associated to the non-normalized and normalized Mastigamoeba libraries.

As for the Naegleria libraries the behavior is apparent in the sense that the anaerobic library systematically produces more new genes and the discovery probability is sensibly higher at the considered levels of m. Note that the aerobic library presents a slightly slower decay rate but an extremely large m is required for matching the expected number of genes of the anaerobic one. Figure 3 displays the estimated decay rate of the discovery probability for both libraries with the corresponding 95% highest posterior density intervals.

thumbnailFigure 3. Expected number of new genes. Expected number of new genes in an additional sample and corresponding 95 % highest posterior density intervals for the Naegleria gruberi aerobic and anaerobic libraries arising from the application of the Bayesian nonparametric method.

A Bayesian nonparametric methodology

The primary aim of the Bayesian approach to inference is prediction and Bayesian methods are tailored for conveying the available information into prediction. In particular, for EST sequencing, the main problem of frequentist methods is represented by the difficulty of incorporating not yet observed unique genes into the model. This can then produce unpleasant behaviors of estimators such as the one exhibited by the Good-Toulmin estimator discussed before. In contrast, the Bayesian nonparametric approach naturally incorporates the fact that further sequencing will feature new unique genes and leads to consistent predictions.

In our framework we are going to consider a sample of n EST data yielding Kn distinct gene species with corresponding frequencies N = (N1,...,<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M5">View MathML</a>). Clearly Kn ∈ {1,...,n} and <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M6">View MathML</a>. Our basic model is the so-called Pitman's sampling formula [14] which consists of a probability distribution for Kn and the frequencies N of the form

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M7">View MathML</a>

(3)

where σ ∈ (0, 1), θ > 0, n = (n1,...,nk) and (a)n = a(a + 1) … (a + n - 1) is the ascending factorial with (a)0 ≡ 1. Formula (3) is a generalization of the famous Ewens' sampling formula [15] which can be recovered by letting σ tend to zero and it represents a important formula in modern probability theory. See [11]. Recently, it has found many interesting applications for bacterial taxonomy [16], clustering of microarray gene expression data [17], mixture models [18], linguistics [19], among others.

In a Bayesian nonparametric setting, one alternatively obtains model (3) by selecting the two parameter Poisson-Dirichlet process as a prior for the genes proportions within the library. This clearly makes the sequence of tags exchangeable, thus implying that the order of appearance of the tags does not influence probability assessments. Such an assumption, which constitutes the Bayesian analog of the frequentist assumption of independent and identically distributed data, is clearly reasonable in the context of EST sequences. Note that we implicitly assume that the sequence of tags can be extended to infinity. However, the size of the library represents an upper bound for the number of unique genes that will be observed and it is always finite: hence, all the estimates we are going to obtain will be finite.

As mentioned before, the Bayesian nonparametric approach has the advantage of yielding, in a straightforward way, predictive distributions for future observations given the data. Considering Pitman's sampling formula, the probability of detecting a new gene from a future observation, given a sample of n tags containing k distinct genes, is

(θ + )/(θ + n)(4)

whereas the probability of re-observing the j-th unique gene coincides with

(nj - σ)/(θ + n)   j = 1,...,k.(5)

See [11]. Hence, the coverage coincides with

1 - (θ + )/(θ + n).(6)

As already pointed out, in the analysis of ESTs one is also interested in evaluating: (i) the expected number of new genes that will be recorded in a further sample of size m and (ii) the discovery probability, which is the probability of observing a new gene in the (n + m + 1)-th draw, given the initial sample of size n. The basis for deriving estimators for these quantities is represented by the distribution of the number of new genes to be observed in an additional sample given the initial sample. Such a posterior probability, which can be seen as the predictive distribution for the outcome of additional m reads, is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M8">View MathML</a>

(7)

See [10] for details on its derivation. From (7) Bayes estimators (under quadratic loss function) for both the expected number of new genes and the discovery probability have been obtained, within general Gibbs random partition models, in [10]. The expected number of new genes observed in a future sample of size m coincides with

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M9">View MathML</a>

(8)

and the discovery probability turns out to be equal to

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M10">View MathML</a>

(9)

Moreover, the highest posterior density intervals can be derived in quite a straightforward way from (7). The only point left to discuss concerns the specification of the parameters (σ, θ). In order to avoid subjective inputs in the model, (σ, θ) is fixed according to an empirical Bayes rule which consists in choosing σ and θ that maximize (3) corresponding to the observed sample (k, n1,...,nk), i.e.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M11">View MathML</a>

(10)

Figure 4 provides the contour plots corresponding to the two Naegleria gruberi datasets: the parameters maximizing (3) turned out to be (<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M12">View MathML</a>) = (0.67, 46.3) for the aerobic case and (<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M12">View MathML</a>) = (0.66, 155.5) for the anaerobic case. On the other hand, for the two Mastigamoeba balamuthi datasets (normalized and non-normalized) (10) yields (<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M12">View MathML</a>) = (0.7, 57) and (<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M12">View MathML</a>) = (0.77, 46), respectively. These parameters have been used for computing the estimators (8) and (9) for the 4 datasets, whose results are reported in the results and discussion section.

thumbnailFigure 4. Contour plots of Pitman's sampling formula. Contour plots of Pitman's sampling formula, as a function of (σ, θ), corresponding to the two Naegleria gruberi datasets: aerobic (right) and anaerobic (left).

It is worth pointing out how the structure of the data influences the choice of the parameters (σ, θ). Indeed, the value of θ is linked to the number of distinct genes observed in the n-sample: the larger k/n the larger <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M13">View MathML</a>. On the other hand, the value of σ is determined by the configuration of the frequencies n1,...,nk. Moreover, one may note that, for a given value of θ, the expected number of new genes in (8) is an increasing function of σ : as σ increases one expects that a larger number of new genes is going to be observed in a further m-sample. This is also confirmed by the behavior of <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M14">View MathML</a> as σ varies. Figure 3 suggests an almost linear increase of <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M14">View MathML</a>, as a function of m, and accordance with linearity is higher the closer σ is to 1. In contrast, when σ is low and close to 0 the function is concave and <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M14">View MathML</a> increases at a lower rate as σ increases.

In order to further evaluate the goodness of the estimation method, we implemented a cross-validation procedure. Given a dataset of size n, this consisted in randomly drawing without replacement a sub-sample of size n/2. On the basis of this sub-sample, we have fixed, according to (10), the values of (σ, θ). Then, we have computed the expected number of new genes in the remaining part of the original sample using (8) and checked whether the estimated number of new genes is close to the actually observed number of new genes. The outcome of such an analysis is satisfactory in the sense that, in the great majority of the performed experiments, the 95% highest posterior density intervals capture the actual number of new genes. Consider first the normalized Mastigamoeba library and a sub-sample of size n = 2 ≈ 182. For instance, when such a sub-sample consists of k = 144 distinct genes with expression levels r1 = 116, r2 = 24, r3 = 1, r4 = 2, r7 = 1, then the actual number of new genes in the remainder of the original sample is j* = 104. Basing on these data, the Bayesian nonparametric estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M15">View MathML</a> yields an estimate of 101 with 95% highest posterior density interval (87, 117). As for the non-normalized Mastigamoeba library, sub-samples of size n/2 ≈ 358 were taken. For instance, a sub-sample consisting of k = 261 distinct genes with expression levels r1 = 213, r2 = 31, r3 = 4, r4 = 7, r5 = 1, r7 = 2, r8 = 3, implies that the remainder contains j* = 199 new genes, the number to be predicted. In such a case, the Bayesian nonparametric estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M16">View MathML</a> predicts 210 new genes with 95% highest posterior density interval (187, 237).

Conclusion

In this paper we have presented a Bayesian nonparametric approach, which relies on Pitman's sampling formula, for prediction problems arising in sequencing of EST libraries. This provides a fully probabilistic model which conveys, in a statistically rigorous way, the available information into prediction. No parametric assumption is made and the prior is fixed using an empirical Bayes approach. The resulting estimators are applied to four EST libraries and lead to interesting and coherent predictions of the outcome of additional sequencing. The arising information is of great value for researchers providing guidelines in: establishing the quality of a certain library; deciding whether to perform a normalization protocol; choosing whether to proceed with sequencing from a certain library; determining the size of an additional EST survey etc.

It is important to remark that our Bayesian nonparametric approach does not feature problems usually exhibited by frequentist methods. In particular, no ad-hoc adjustments or introduction of parametric components is necessary for predicting future reads if their number is larger than the initial survey. Finally, the estimators presented here can be easily adapted to take into account joint data from multiple libraries leading to Bayesian analogs of the estimators set forth in [6].

Methods

Here we briefly describe how the estimators in (8) and (9) are derived by simplifying the expressions provided in [10]. In particular, one finds out that <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M17">View MathML</a>, where the <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M18">View MathML</a>'s are displayed in (7) and can be deduced from (8) in [10]. The further simplification yielding the expression of <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M14">View MathML</a> in (8) is obtained by observing that (θ + 1)n-1/(θ + 1)n + m-1 = 1/(θ + n)m and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M19">View MathML</a>

As far as the determination of (9), note that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M20">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M21">View MathML</a> (1) is the probability of observing a new gene at the (n + m + 1)-th draw given the in the previous sample, of size n + m, there have been detected k + j distinct genes. Hence, by virtue of the prediction structure associated with the two parameter Poisson-Dirichlet process as outlined in Section 3, one has <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M21">View MathML</a> (1) = (θ + (k + j)σ)/(θ + n + m). From this one deduces

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M22">View MathML</a>

and one obtains the expression in (9) since <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/339/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/339/mathml/M23">View MathML</a>.

Competing interests

The author(s) declares that that there are no competing interests.

Authors' contributions

The three authors participated in equal way in the development of the proposed methodology and the writing of the paper. All authors read and approved the final manuscript.

Acknowledgements

The authors are grateful to E. Susko who kindly provided the EST data analyzed in the paper. Moreover, useful comments and remarks from two referees are gratefully acknowledged. A. Lijoi and I. Prünster are partially supported by the Italian Ministry of University and Research, grants 2006/134525 and 2006/133449, respectively. R.H. Mena acknowledges support from CONACYT-México, grant J50160-F.

References

  1. Adams M, Kelley J, Gocayne J, Mark D, Polymeropoulos M, Xiao H, Merril C, Wu A, Olde B, Moreno R, Kerlavage A, McCombe W, Venter J: Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome Project.

    Science 1991, 252:1651-1656. PubMed Abstract | Publisher Full Text OpenURL

  2. Emrich S, Barbazuk W, Li L, Schnable P: Gene discovery and annotation using LCM-454 transcriptome sequencing.

    Genome Res 2007, 17:69-73. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Good IJ: The population frequencies of species and the estimation of population parameters.

    Biometrika 1953, 40:237-264. OpenURL

  4. Good IJ, Toulmin GH: The number of new species, and the increase in population coverage, when a sample is increased.

    Biometrika 1956, 43:45-63. OpenURL

  5. Mao CX: Prediction of the conditional probability of discovering a new class.

    J Amer Statist Assoc 2004, 99:1108-1118. Publisher Full Text OpenURL

  6. Susko E, Roger AJ: Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys.

    Bioinformatics 2004, 20:2279-2287. PubMed Abstract | Publisher Full Text OpenURL

  7. Wang JPZ, Lindsay BG, Cui L, Wall PK, Marion J, Zhang J, dePamphilis CW: Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries.

    BMC Bioinformatics 2005, 6:300. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  8. Mao CX: Estimating species accumulation curves and diversity indices.

    Statistica Sinica 2007, in press. OpenURL

  9. Hill BM: Posterior moments of the number of species in a finite population and the posterior probability of finding a new species.

    J Amer Statist Assoc 1979, 74:668-673. Publisher Full Text OpenURL

  10. Lijoi A, Mena R, Prünster I: Bayesian nonparametric estimation of the probability of discovering new species.

    Biometrika 2007, in press. OpenURL

  11. Pitman J: Combinatorial Stochastic Processes. Lecture Notes in Mathematics 1875. Berlin: Springer; 2006. Publisher Full Text OpenURL

  12. Wang JPZ, Lindsay BG, Cui L, Wall PK, Miller WC, dePamphilis CW: EST clustering error evaluation and correction.

    Bioinformatics 2004, 20:2973-2984. PubMed Abstract | Publisher Full Text OpenURL

  13. Bernardo JM, Smith AFM: Bayesian theory. Chichester: Wiley; 1994. OpenURL

  14. Pitman J: Exchangeable and partially exchangeable random partitions.

    Probab Theory Related Fields 1995, 102:145-158. Publisher Full Text OpenURL

  15. Ewens WJ: The sampling theory of selectively neutral alleles.

    Theor Popul Biol 1972, 3:87-112. PubMed Abstract | Publisher Full Text OpenURL

  16. Gyllenberg M, Koski T: Probabilistic models for bacterial taxonomy.

    Int Statist Review 2001, 69:249-276. Publisher Full Text OpenURL

  17. Zhaohui S: Clustering microarray gene expression data using weighted Chinese restaurant process.

    Bioinformatics 2006, 22:1988-1997. PubMed Abstract | Publisher Full Text OpenURL

  18. Ishwaran H, James LF: Gibbs sampling methods for stick-breaking priors.

    J Amer Statist Assoc 2001, 96:161-173. Publisher Full Text OpenURL

  19. Teh YW: A hierarchical Bayesian language model based on Pitman-Yor processes.

    Proceedings of the Annual Meeting of the Association for Computational Linguistics 2006., 44 OpenURL