Boosting accuracy of automated classification of fluorescence microscope images for location proteomics

Huang, Kai; Murphy, Robert F

doi:10.1186/1471-2105-5-78

Research article
Open access
Published: 18 June 2004

Boosting accuracy of automated classification of fluorescence microscope images for location proteomics

Kai Huang^1,3 &
Robert F Murphy^1,2,3

BMC Bioinformatics volume 5, Article number: 78 (2004) Cite this article

12k Accesses
103 Citations
Metrics details

Abstract

Background

Detailed knowledge of the subcellular location of each expressed protein is critical to a full understanding of its function. Fluorescence microscopy, in combination with methods for fluorescent tagging, is the most suitable current method for proteome-wide determination of subcellular location. Previous work has shown that neural network classifiers can distinguish all major protein subcellular location patterns in both 2D and 3D fluorescence microscope images. Building on these results, we evaluate here new classifiers and features to improve the recognition of protein subcellular location patterns in both 2D and 3D fluorescence microscope images.

Results

We report here a thorough comparison of the performance on this problem of eight different state-of-the-art classification methods, including neural networks, support vector machines with linear, polynomial, radial basis, and exponential radial basis kernel functions, and ensemble methods such as AdaBoost, Bagging, and Mixtures-of-Experts. Ten-fold cross validation was used to evaluate each classifier with various parameters on different Subcellular Location Feature sets representing both 2D and 3D fluorescence microscope images, including new feature sets incorporating features derived from Gabor and Daubechies wavelet transforms. After optimal parameters were chosen for each of the eight classifiers, optimal majority-voting ensemble classifiers were formed for each feature set. Comparison of results for each image for all eight classifiers permits estimation of the lower bound classification error rate for each subcellular pattern, which we interpret to reflect the fraction of cells whose patterns are distorted by mitosis, cell death or acquisition errors. Overall, we obtained statistically significant improvements in classification accuracy over the best previously published results, with the overall error rate being reduced by one-third to one-half and with the average accuracy for single 2D images being higher than 90% for the first time. In particular, the classification accuracy for the easily confused endomembrane compartments (endoplasmic reticulum, Golgi, endosomes, lysosomes) was improved by 5–15%. We achieved further improvements when classification was conducted on image sets rather than on individual cell images.

Conclusions

The availability of accurate, fast, automated classification systems for protein location patterns in conjunction with high throughput fluorescence microscope imaging techniques enables a new subfield of proteomics, location proteomics. The accuracy and sensitivity of this approach represents an important alternative to low-resolution assignments by curation or sequence-based prediction.

Background

High-throughput methods for proteomics have become necessary in order to characterize the tens of thousands of proteins expressed in each multicellular organism. Although high-throughput methods for structural proteomics [1] and expression proteomics [2] are available and a number of large-scale projects have been initiated to make use of them, far less attention has been paid to automating the determination of the subcellular locations of proteins.

As an alternative to experimental determination, various approaches to predicting subcellular location from protein sequences have been described. These have used amino acid composition [3, 4], N-terminal signal peptide sequences [5], functional domain composition [6, 7], and combined evidence from sequence properties and expression levels [8]. The limitation in these approaches to date has been the absence of a systematic, high-resolution framework for describing subcellular location. That is, since only a small number (4–12) of broad location classes have been used to train these systems, they cannot distinguish the many different distributions that proteins can have within a single organelle (or the many combinations of these distributions that may occur). The performance of these approaches cannot be significantly improved unless a large number of subcellular location patterns are determined experimentally.

The spatial distribution of each protein in a cell type can be detected by various methods for fluorescently tagging a protein followed by fluorescence microscope imaging. In recent years, automated classification systems that can be used to assign proteins to the major subcellular location classes have been described [9–11]. The combination of such methods with high throughput imaging systems [12] holds enormous promise for automating the systematic determination of this critical aspect of protein function. The information that can be obtained from such systems can potentially be used to train improved systems for predicting location from primary sequence.

High-throughput automated recognition of protein subcellular distributions requires both robust image features and accurate classifiers. Several different image feature sets have been derived for 2D and 3D fluorescence microscope images [9, 11, 13, 14], which capture information such as texture, object morphology and geometrical properties. Advanced feature reduction techniques have been used to select the most discriminative features from these feature sets [15]. Ten major subcellular location patterns in 2D images and eleven in 3D images, including two that cannot be discriminated by visual inspection, can be distinguished using these features.

Our group has considered two variations on this recognition problem, one in which only gray level images of the distribution of a single protein are available, and the other in which a parallel image of the DNA distribution in the same cell is also available. Our initial work made use of an artificial neural network for classification. With this classifier, we have achieved 86% average classification accuracy on 2D HeLa cell images without parallel DNA images and 88% accuracy with them [14]. For 3D HeLa cell images, we have achieved 91% average accuracy for the same patterns [11]. The goal of the work described here was to improve upon these results using state-of-the-art classifiers and new features. The classifiers evaluated in this paper are described briefly below.

Support Vector Machines (SVM)

Support Vector Machines [16] are linear classifiers that find a hyperplane between two classes which maximizes the minimum distance between the hyperplane and the data points (the maximum margin hyperplane). The hyperplane can be sparsely represented by a small amount of data lying on the boundary of the maximum margin. Theoretically, maximum-margin classifiers can minimize the classification structural risk, the upper bound on the error expected for a test set [17].

SVM were first characterized in a two-class learning environment. However, they can be extended to solve multi-class learning problems by using one of three methods.

Max-win classification [17] trains N support vector machines, each of which separates class i from non-i. The predicted class from the machine generating the highest output score is selected as the prediction.
Pair-wise classification [18] trains a separate binary classifier for distinguishing each possible pair of classes. Each of these binary classifiers is given a vote and the class with the most votes is selected as the predicted class.
DAG (Directed Acyclic Graph) classification [19] puts the binary classifiers trained in the pair-wise classification method into nodes of a rooted binary DAG. The output of each SVM is used to trace the graph until a leaf node that specifies the final prediction is reached. The rationale behind this method is that achieving a maximum margin classifier at each node of the classifier graph guarantees an upper bound on the generalization error [19].

Ensemble Methods

All current machine learning models have some constraints or local minima in certain application domains given limited training data [20], resulting in the fact that different learning models might perform better for different application domains. Therefore it is useful to build an ensemble expert that can combine the outputs from different classifiers in a way such that the local limitations of each individual model can be overcome in the ensemble by other models that do not have the same constraints. In practice, ensemble methods have been shown to perform better than any of the base classifiers if the output errors of each base classifier are not fully dependent [20–23]. However, it is still an open question whether an ensemble model consisting of base classifiers with independent errors performs better than one consisting of base classifiers with partially dependent errors [24]. Many different ways of creating ensemble classifiers have been described [20, 25] and we summarize some of these approaches here.

AdaBoost

AdaBoost is an example of ensemble methods that create new and improved base classifiers by iteratively manipulating the training dataset [26]. It generates base classifiers, such as a neural network or decision tree, at each training iteration. Every newly created base classifier is fed an adjusted distribution in which each data point in the training set is assigned a weight computed from its classification result for the previous base classifier. The rationale behind Adaboost is to force a new base classifier to focus on the data points that were incorrectly classified previously such that those hard examples can be better classified in the new base classifier [26].

Bagging

Bagging, which stands for bootstrap aggregation, is another way of manipulating training data for ensemble methods. In each bagging iteration, the training examples are bootstrapped (resampled with replacement), to generate a different training set for the base classifier [20]. On average, 63.2% of the total original training examples are retained in each bootstrapped set [20]. The rational behind bagging is that unstable base classifiers, such as neural networks and decision trees whose behavior could be significantly changed by small fluctuations in the training dataset, are more likely to be stabilized after being trained with different input data and combined afterwards [20, 25].

Mixtures-of-Experts

An ensemble method can also use a divide-and-conquer strategy to separate the training examples into many partitions, train a local base classifier (an expert) for each partition, and then combine the outputs of these local experts in a supervised-learning way [25, 27]. Mixtures-of-Experts [27–29] is one kind of these ensemble methods.

Mixtures-of-Experts models the target data generation process as [27]:

where Y stands for the target data, X is the input data, and Z is a representation of hidden experts that relate an input data point to a target. The target generation process can be regarded as two steps [27, 28]: one is to assign every input data point to a specific expert and the other is to compute the target data given the input data point and its related expert. A gating network is used in Mixtures-of-Experts to model P(Z | X), the first step above, and local expert networks are used to model P(Y | X, Z), which is the second step above [27].

Majority-voting Ensembles

The majority-voting ensemble classifier is one of the simplest ensemble forms that can combine the outputs of multiple classifiers. Instead of using the complicated schemes described above, we can simply choose the predicted class by plurality from a classifier pool [20, 25]. Assuming that the errors made by the classifiers are not highly correlated, the samples that are not accurately classified by one classifier have a good chance to be correctly classified by a plurality of the other classifiers. The majority-voting ensemble classifier overcomes the difficulty of choosing one classifier from a pool of classifiers with similar performance. Different weighting schemes have been described to form the plurality prediction from the classifier pool [20, 30].

Results and Discussion

Subcellular Image Datasets

In order to demonstrate the feasibility of recognizing the major subcellular structures in fluorescence microscope images of single cells, we have previously created large collections of 2D and 3D immunofluorescence images of HeLa cells [9, 11]. The subcellular location patterns in these collections include endoplasmic reticulum (ER), the Golgi complex, lysosomes, mitochondria, nucleoli, actin microfilaments, endosomes, microtubules, and nuclear DNA. We designed sets of numerical features to describe the pattern in the images, which we term SLF (for Subcellular Location Features). When these features were used to train neural network classifiers, from 73–99% of the images could be correctly classified (depending on the pattern) [11, 14]. Figure 1 depicts the most typical 2D and 3D image of each pattern from correctly classified images using these neural network classifiers. (The most typical image was selected using TypIC [31], an algorithm that ranks images by their distance to the centroid of the feature space.) The goal of the work described here was to improve on the previous performance.

Comparison of Classifier Characteristics

Our first goal was to measure the performance of the different classifiers described above with the SLF that we have used previously. Before embarking on this, we here try to provide some insight into the differences between the classifiers. For this purpose, we can describe them using three characteristics. The first characteristic is the complexity of the decision boundaries that a classifier can generate. Figure 2 illustrates differences in this characteristic for a pair of Golgi proteins, giantin and gpp130, that are difficult to distinguish computationally (and essentially impossible to distinguish visually [14]). There is significant overlap in the feature distributions for these classes but for illustration we selected the most discriminative pair of features for the two Golgi patterns using stepwise discriminant analysis (SDA) from the combination of SLF7 and the 6 DNA features (SLF7DNA). These were the fraction of fluorescence not in any object (SLF7.79) and the convex hull eccentricity (SLF1.16). The values of these features for each image are plotted in Figure 2 along with colored regions illustrating the boundary between the two classes found by three different classifiers. While the boundaries for the neural network and mixture-of-experts classifiers are nearly linear, the boundary for the exponential-rbf-kernel SVM is much more convoluted. The accuracy of the classifiers increases with the extent of curvature, from 68.6% for the neural network, to 71.5% for the mixture-of-experts, to 75% for the exponential-rbf-kernel SVM.

A second classifier characteristic is the dependence of performance on the size of the training set. This can be divided into two parts: ability to learn from limited training data, and insensitivity to the presence of outliers when training data is plentiful. Both of these can only be described in relative terms by comparing the performance of different classifiers on the same data. Figure 3 shows a comparison of the major classifiers for both 2D and 3D images as a function of the amount of training data provided. For low numbers of training images, performance is similar for all 2D classifiers but varies significantly for the 3D classifiers. In particular, bagging does poorly for low numbers of training images while the majority-voting ensemble does quite well. As the amount of training data is increased, all but one of the classifiers improve their accuracy monotonically. The exception is the neural network classifier, for which the accuracy decreases. We presume that this is due to a heightened chance of including outlier images as the size of the training set increases.

The third characteristic is the sensitivity of performance to the presence of uninformative features. This is illustrated in Figure 4, in which classification accuracy is displayed as a function of increasing numbers of features. For this purpose, the features in SLF13 and SLF10 were ranked in order of their discriminative ability by SDA. Classifiers were trained using increasing numbers of the sorted features, and when these were exhausted additional features (which can be regarded as noisy or less informative features) were randomly included from the 90-dimensional SLF7DNA and 28-dimensional SLF9. For the 2D images (Figure 4A), all classifiers except neural networks handled the increasing number of input features with little impairment from uninformative features. The decrease in accuracy for neural networks at high numbers of features, along with the large fluctuations in their performance, implies that feature selection is essential for neural network classification in the SLF7DNA feature space. For 3D images (Figure 4B), neural networks performed the best and the SVM classifier showed a significant drop in accuracy as the number of features increased. When compared to Figure 4A, the stability of neural networks as well as the instability of SVM in the SLF9 feature space implies that the ability of a classifier to adapt to more noisy features depends on the feature space itself. As will be shown later, image set classification is much less dependent on the feature space.

Table 1 summarizes the above three characteristics for all major classifiers. It also shows the information content (the number of parameters times the bits required for each parameter) of the model that each classifier builds from the training data. We note that the classifier with the lowest information content, the neural network, is also the most effective classifier at learning from limited training data.

Table 1 Comparison of the characteristics of the classifiers used in this study. The results are derived from the data in Figures 2-4.

Full size table

Evaluation of Eight Classifiers

With this background, we proceeded to evaluate the different classifiers using two different feature sets for 2D HeLa images, namely SLF8 and SLF13, and two different feature sets for 3D HeLa images, namely SLF10 and SLF14. Two feature sets were used for each image collection so that performance with and without DNA features could be compared. Eight classifiers, including a one-hidden-layer neural network, linear-kernel SVM, rbf-kernel SVM, exponential-rbf-kernel SVM, polynomial-kernel SVM, AdaBoost, Bagging, and Mixtures-of-Experts, were evaluated by 10-fold cross validation.

Table 2 shows the results for each classifier with its optimal parameters for each of the four feature sets. To address the statistical significance of the experiments, we conducted a paired t-test between the 10-fold cross validation results of each of the eight classifiers and those for the neural network classifier topology we described previously. Previous results have shown that the inclusion of features calculated from a parallel DNA image can improve classification accuracy. The parallel DNA channel provides an additional landmark for locating protein subcellular distributions. For instance, a nucleolar protein would be expected to overlap completely with the DNA channel, while a mitochondrial protein would be localized around but not in the nucleus. As expected, both SLF13 and SLF10 performed better than their no-DNA counterparts SLF8 and SLF14, respectively. However, the benefit from the DNA features was much larger for 3D image classification (about 6%) than for 2D image classification (about 1%). The reason for this is unclear, but could be due either to the greater amount of information present in 3D images, or to the presence of redundancy between the non-DNA and DNA features for 2D but not 3D images.

Table 2 Comparison of eight classifiers for 2D and 3D image classification. The average classification accuracy on test data (from 10-fold cross-validation) is shown for the optimal parameters settings (shown in parentheses) for each classification approach. The parameters are: nhu – number of hidden nodes in neural network, stop-fract – the fraction of the training data used to stop neural network training, C – error penalty in SVMs, sigma – kernel variance in SVMs, nboost – total number of iterations in AdaBoost, nbag – total number of iterations in Bagging, nhug – number of hidden nodes in the gating network of Mixtures-of-Experts, and ne – total number of experts in Mixtures-of-Experts. The accuracies across the 10-fold cross-validation trials were compared to those for the previously described neural network configuration (nhu = 20, stop-fract = 0.3) by a paired t-test (88% for SLF13, 86% for SLF8, 93% for SLF10, and 84% for SLF14). The best performances are underscored and highlighted for each feature set. *CPU times listed for each classifier are for training and testing for all images in each cross-validation trial (training times include times calculating features), which were measured on an Athlon 1.7 GHz processor with 1.5 GB memory running Redhat Linux 7.1.

Full size table

The various parameters selected for each classifier on different feature sets suggest that the right configuration of a classifier is highly dependent on the input data. Given limited data, a classifier can only be optimized to an extent that its performance is maximized on this specific set of data. However, we can still observe some common trends in parameter selection. The pairwise method for combining binary SVMs was never selected in our experiments, although the pairwise method was reported to give good performance in some application domains. There is an assumption in this method that the binary classifier should generate an unknown prediction for classes that it cannot recognize. In other words, a binary classifier trained between classes a and b should give equal weights to both a and b given a data point from a third class c. However, this assumption seemed not to hold in our experiments. None of the pairwise methods gave an average accuracy over 20% for any of the four feature sets (data not shown). On the contrary, both maxwin and DAG methods gave much better results and it is hard to tell which one of the two was more advantageous. The close performance of linear-kernel SVMs with other nonlinear-kernel SVMs in most experiments suggested that the classification boundaries in the original input spaces of all four feature sets were not far from linear. This can be confirmed by the optimal choices of the polynomial degrees in the polynomial-kernel SVMs. A polynomial degree of 2 was selected for all four feature sets. The smaller the degree, the more linear the final classification boundary would be. SVMs, ranked in the top two for all four feature sets, performed generally very well among all classifiers in both 2D and 3D classifications. Ensemble methods displayed more varieties in their performances. Although two ensemble methods, Mixtures-of-Experts and AdaBoost, performed the best on the feature set SLF13 and SLF8 respectively, these methods did not perform well in 3D classifications. Among the ensemble methods, AdaBoost showed the most consistent performance.

Table 2 also shows cpu times for training and testing each classifier. Although the training and testing time of a classifier can be affected by the number of features, the data itself, and the implementation details, we observe some trends from the experimental results. As expected, the more hidden nodes in a neural network, the longer it will take to train, which can be observed in neural networks, Adaboost, Bagging, and Mixtures-of-Experts. It is also obvious that longer iterations of boosting, bagging, and larger number of experts in Mixtures-of-Experts account for more time in training classifiers. SVMs are the fastest to train among all eight classifiers, although relatively slow in the test phase. Despite various changes in its parameters, the training time of SVM stays consistent for each kernel function. Neural network is the fastest classifier in the testing phase in three out of the four experiments conducted. Three ensemble methods also show faster performance in testing than SVMs, partially because the building blocks of these methods in the experiments are also neural networks.

To determine the significance of improvements in performance by individual classifiers over our previous neural network approach, we conducted a paired t-test [32], on the 10-fold cross validation results of each classifier against those of the previously configured neural networks. From Table 2, we can see that three, one, and two of the eight individually configured classifiers performed significantly better than the previous neural network classifiers on SLF13, SLF8, and SLF14 feature sets respectively. None of the eight classifiers could perform statistically better than the previous neural network classifier on SLF10. If a single classifier has to be selected for each of the four feature sets, a support vector machine classifier with exponential-rbf kernel function can be used given its reasonable accuracy and speed, although some fine tuning of its parameters might need to be conducted.

Optimal Majority-voting Ensembles of Eight Classifiers

More improvement might be achieved by taking a majority-voting ensemble model of all possible combinations of the eight tested classifiers. Since all eight classifiers were trained differently and were based on either different kernel functions or different theoretical justifications, their error can be regarded as mostly independent, which makes constructing a larger ensemble model possible. Table 3 shows the statistics of pairwise-classifier-error correlation coefficients between all 8 classifiers on six different feature sets. As expected, all pairs of classifiers did not show strong error correlation. The mean error correlation coefficients from all six experiments were less than 0.15.

Table 3 Analysis of error correlation between different classifiers. For each of the classifiers listed in Table 2, a list of all test images that were incorrectly classified was generated. The correlation coefficient between the incorrectly classified images was calculated for each pair of classifiers, and the minimum, maximum, mean and standard deviation of these correlations were calculated for all pairs. This process was repeated for each feature set.

Full size table

There are many possible voting schemes that can be employed in a majority-voting ensemble. Some experiments have shown that a relatively simple method such as unweighted majority voting works as well as those more complicated trainable weighted voting methods [30]. Therefore, we constructed an unweighted majority-voting ensemble of all possible combinations of the 8 classifiers for each feature set. Table 4 shows the optimal majority-voting classifiers found for each feature set. The accuracies on both SLF8 and SLF13 feature sets were improved by 1% by combining three classifiers for each: exponential-rbf-kernel SVM, AdaBoost, and Bagging for SLF8; rbf-kernel SVM, AdaBoost, and Mixtures-of-Experts for SLF13. Less than 1% improvement, however, were observed on the SLF10 and SLF14 feature sets. Two of the top three classifiers for each feature set (Table 2) were selected in all optimal majority-voting ensembles. The same paired t-test was conducted between the majority-voting classifier and the previously configured neural network classifier for each feature set. Statistically significant improvements were obtained for all four feature sets. Compared to the individual optimal classifiers listed in Table 2 for each feature set, the majority-voting classifiers for SLF13 also showed statistically significant improvement. Although only a marginal improvement was observed by using a majority-voting classifier for each feature set in general, it eliminates the subjective errors resulting from having to choose a single classifier for a classification problem.

Table 4 Improvement in classification accuracy using majority voting ensembles. Optimal unweighted majority-voting ensemble classifiers were formed by selecting classifiers from all 8 classifiers for each feature set listed and the average classification accuracy for 10-fold cross-validation was calculated. A paired-t test was performed for each ensemble classifier against the previous neural network classifier for each feature subset (SLF15 and SLF16 were compared against the previous classifier for SLF8 and SLF13, respectively). Each ensemble classifier was also compared against the optimal classifier for each feature set listed in Table 2 (SLF15 and SLF16 were compared with the individual optimal classifiers for SLF8 and SLF13, respectively).

Full size table

Assuming we were able to select the best classifier for each individual image, we can calculate the upper bounds of classification accuracy on the current feature sets. Over 95% of all images can be correctly classified by at least one of the eight tested classifiers for each of the four feature sets. SLF10, containing only 9 features, gave the closest performance to its upper bound among the feature sets. The accuracy upper bounds presumably represent the fraction of cells whose patterns were distorted by mitosis, cell death or acquisition errors.

Inclusion of New Texture Features

Having presumably identified the limits of classification accuracy using the existing SLF, we next examined whether adding new features could improve these accuracies. In looking for new features, we noted that eleven of the features selected for SLF8 and nine of the features selected for SLF13 were Haralick texture features [14]. To explore the possibility that other types of texture features might provide additional descriptive power, a wavelet-based multiresolution filtering technique was used to derive two additional sets of texture features, namely Gabor texture features [33] and Daubechies 4 wavelet features [34]. Gabor texture features were reported to outperform a number of other wavelet transformations in a task of analyzing images with natural textures [35, 36].

60
Gabor texture features and 30 Daubechies 4 wavelet features were added to the current 2D image features. In addition to the six features derived from parallel DNA images and the original 84 features from the SLF7 feature set, we ended up with 180 features in total for 2D image description. We have shown that feature reduction can significantly improve classification accuracy and reduce the training and testing time [15]. Among a group of eight different feature reduction methods, stepwise discriminant analysis (SDA) was rated as the best [15]. We therefore applied SDA on both the 180-feature set including 6 DNA features and the 174-feature set without DNA features. 53 ranked features were returned from the 180-feature set and 51 features from the 174-feature set. The optimal majority-voting ensemble classifiers for SLF13 and SLF8 listed in Table 4 were applied to the 53-feature set and 51-feature set respectively to evaluate the sequential inclusion of these ranked features. Figure 5 shows the sequential inclusion of the top-ranked features in both sets evaluated by the ensemble classifiers. The top 47 out of 53 features selected by SDA from the 180-feature set gave the highest accuracy on the optimal majority-voting ensemble classifier from SLF13, and we defined these as a new feature set, SLF16. The top 44 features selected by SDA from the 174-feature set gave the highest accuracy on the optimal majority-voting ensemble classifier from SLF8. We defined a new feature set containing these 44 features as SLF15.

Both SLF16 and SLF15 were then evaluated using all eight classifiers. Table 3 shows the pairwise-classifier-error correlation coefficients for these two new feature sets. Again, all 8 classifiers gave few dependent errors on both feature sets. Optimal majority-voting ensemble classifiers were created and found to include neural network, linear-kernel SVM, exponential-rbf-kernel SVM, polynomial-kernel SVM, and AdaBoost for SLF16, and rbf-kernel SVM, exponential-rbf-kernel SVM, and polynomial-kernel SVM for SLF15 (Table 4). We achieved a 92.3% average accuracy for 2D image classification by using SLF16 and 91.5% by using SLF15. These two majority-voting classifiers for SLF16 and SLF15 performed statistically better than the previously configured neural networks and the individual classifiers for SLF13 and SLF8 respectively. The benefits of including the new texture features can be represented by a 2% improvement on classifying 2D protein fluorescence microscope images both with and without DNA features. Table 4 also showed that the accuracy upper bounds for SLF16 and SLF15 are higher than those of SLF13 and SLF8 feature sets respectively.

To gain insight into the basis for the improvement, we compared the distributions in the two feature spaces of those images that were misclassified by the neural network classifier using SLF13 but were correctly classified by the ensemble classifier using SLF16. The ratio of the average distance of these images from their class mean over the average distance of all images from their class mean changed from 0.08 in the SLF13 feature space to 0.07 in that of SLF16. Thus the new features apparently moved the outlier images close enough to their class means that they could be correctly classified.

Improved distinction between similar classes

That the best classifiers described above represent improvement over previous results not only in average classification accuracy but also in the ability to discriminate similar patterns is shown by the confusion matrixes in Tables 5A and 5B. Compared to the best previous results on 2D and 3D classification [11, 14], the recognition accuracies on most patterns are significantly improved. The Golgi proteins giantin and Gpp130, which cannot be distinguished by visual inspection [14], can be distinguished over 82% of the time in 2D images using SLF16 and 96% in 3D images using SLF10. These are 12% and 15% higher than the previous best performance for 2D and 3D images respectively. The transferrin receptor pattern (TfR), which has a similar distribution to that of the lysosomal protein LAMP2 (Lam), was the least accurately classified pattern in both 2D and 3D. Its recognition was improved by 6% compared to previous results for 2D images, but not at all for 3D images. This suggests the need for future work to improve its recognition.

Table 5 Confusion matrix for 2D HeLa cell (A) and 3D HeLa cell (B) images using optimal majority-voting ensemble classifier with feature set SLF16 (A) and SLF10 (B) respectively. (Due to rounding, each row may not sum to 100). The average accuracy was 92.3% for 2D images (A) and 95.8% for 3D images (B).

Full size table

The better performance achieved from the new features and the ensemble classifiers can be further seen by inspecting images that were misclassified by the original neural network classifier but could be correctly classified using the new texture features (Figure 6). We note that the images in Figure 6 are much less typical of the patterns expected for the major organelles than those in Figure 1. For instance, mitochondria typically locate around the nucleus (Figure 1E) but the image shown in Figure 6E includes staining over the nucleus. The same is true for the tubulin pattern. The blurry image shown in Figure 6I is much less typical compared to the branched pattern shown in Figure 1I. Presumably, the new features that rely on texture are not confused by these visible differences. As a further example, the nucleolin pattern shown in Figure 6F only contains one big object, while that of Figure 1F includes a few round objects. Although the morphological and geometric features would be very different for these images, similar texture information can be observed in both images. Furthermore, the relatively independent errors (Table 3) among the classifiers of a majority-voting ensemble contribute to a more robust prediction. For instance, linear-kernel SVM, one of the five classifiers in the best performing ensemble classifier of SLF16 (Table 4), predicted the image of the transferrin receptor pattern in Figure 6 as tubulin, while all other classifiers in the ensemble made the accurate prediction. This error would not be avoided if the linear-kernel SVM was selected as the only classifier. Therefore, the accurate recognition of these less typical images can be attributed to the new texture features that capture more frequency information from the fluorescence distribution and the ensemble classifier that enriches the prediction confidence by combining outputs from multiple classifiers.

Figure 7 shows example 2D images that could not be correctly classified by any of the eight classifiers using the new feature set SLF16. The giantin and gpp130 images in Figure 7 show patterns that are much more diffuse than typical Golgi images. This is presumably due to the onset of Golgi breakdown prior to mitosis. The LAMP2 and mitochondria patterns in Figure 7 may also be affected by mitosis (or perhaps cell death) given the extensive fluorescence in the nucleus and cytoplasm that are not usually observed (Figure 1). In future work, we plan to seek more robust features that can minimize the effect of mitotic changes and capture the common information that exists between the images in Figure 7 and those in Figure 1. Of course, we expect that some images that are perturbed by cell death and other experimental artifacts may never be correctly classified.

Tradeoffs between classification accuracy and computational cost for feature sets

To further investigate the utility of our classification systems, we examined the processing time ("cost") of each feature. Image preprocessing steps including background subtraction, cropping, thresholding, and filtering need to be conducted before calculating features for 2D images. For 3D images, similar preprocessing steps were employed except that cropping is replaced by seeded watershed segmentation [11]. Since many features are related to each other, we divided the calculation of each feature to two parts, the setup cost and the computation cost. For instance, SLF7.80–7.84 are object skeleton features [14] that all require object finding as the setup cost before calculating each individual feature. Therefore, the total cost of SLF7.80–7.84 is the common setup cost plus the cost of calculating each individual feature. Since Gabor and Daubechies wavelet features involve decomposing an input image with filters that have different scales, we consider using them as a group for image classification. Table 6 shows the time costs and performances of the six feature sets used in our experiments as well as other processing costs. The large increases in costs from SLF13 to SLF16 and SLF8 to SLF15 are from the computation of both Gabor and Daubechies wavelet features. In 3D image classification, preprocessing and segmentation dominate the total costs of a feature set. Given the relationship between performance and computation cost for each of the six feature sets, a practical system can trade performance against cost and select the optimal feature set for a given purpose.

Table 6 Processing costs for calculating feature sets as well as the performance of each feature set. The costs were averaged on ten random images selected from all classes of both 2D and 3D fluorescence microscope images. Costs shown for feature sets include preprocessing (and segmentation) costs.

Full size table

Classification of sets of images

Previous work has shown that classifying sets of images by plurality voting can dramatically increase the overall accuracy of recognizing subcellular patterns [9]. We therefore evaluated two strategies of image set classification by using the six feature sets. Firstly, different image set sizes were tested for the six feature sets. Figure 8A shows the average performance of the majority-voting classifier for each feature set (Table 4) over 1000 random trials of image sets drawn from each class in the test set. The dominant predicted class in an image set was taken as the output, while random choice was made if several classes tied. The more images in a set, the more accurate the classifier will be. The smallest numbers of images in a set required to reach an average accuracy of 99% are 7, 9, 5, 6, 7, and 7 for SLF13, SLF8, SLF10, SLF14, SLF16, and SLF15 respectively. 3D image classification requires fewer images in a set to achieve 99% classification accuracy than for 2D. Secondly, we tested different numbers of features (in increments of 3) selected from each feature set given a fixed image set size of 10. The results are shown in Figure 8B. To achieve an average 99% classification accuracy in a 10-image set, the top 24, 30, 6, 6, 9, 9 features of SLF13, SLF8, SLF10, SLF14, SLF16, and SLF15 need to be included respectively. Again, 3D image classification shows more advantages in image sets classification than 2D image sets classification. The benefits of SLF16 and SLF15 can be represented by a large reduction of the numbers of features required to achieve an average 99% accuracy compared to SLF13 and SLF8 respectively. Unlike the fluctuation in classifier performance on individual images given increasing number of features in different feature spaces (Figure 4), the performance on image sets approaches a plateau as the number of features increases for all feature spaces. Image sets are often easily acquired in biological imaging experiments; therefore classifying sets of images has practical feasibility and can dramatically enhance system performance. By using only 9 and 6 features for 2D and 3D image sets that have 10 images, we can achieve essentially perfect classification accuracy.

Conclusions

This paper addresses a supervised learning problem in the domain of protein subcellular location determination. Through employing new classifiers and features in our system, we reduced the error rates of our previous system by one third to one half for 2D and 3D image classification. The current system is able to give 92–93% classification accuracy on 10 major eukaryotic subcellular location patterns from 2D fluorescence microscope images with DNA features (compared to 88% previously) and 91–92% accuracy without DNA features (compared to 86% previously). Around 96% classification accuracy can be expected using the current system to recognize 11 major eukaryotic subcellular location patterns from 3D images with DNA features (compared to 91% previously) and 89–90% accuracy without DNA features. Patterns that cannot be distinguished by visual examination, Giantin and Gpp130, can now be distinguished 96% of the time in 3D images. In addition, essentially perfect results can be achieved by classifying sets of images using a small number of features. These results are the best currently available for the protein subcellular location domain.

This fast and automated image-based approach can be combined with high-throughput fluorescence microscope imaging techniques, forming a new subfield of proteomics which we have termed location proteomics. By taking fluorescence microscope images of all proteins expressed in a certain cell type, we can cluster proteins by their location similarity and create a Subcellular Location Tree organizing location families [37]. The system also has possible applications in medicine, including identification of proteins whose location patterns change as a result of disease and which may be either diagnostic markers or therapeutic targets.

Methods

Training of the Classifiers

Throughout this section, K represents the total number of classes, F stands for the number of input features to each classifier, and P denotes the total number of parameters of a given model. We define the information content of each classifier, similar to the minimum description length [32], as the number of bits that need to be learned and delivered in order to evaluate the classifier on test datasets. Since each parameter is represented by a floating point number, the total number of bits is calculated as 32P.

Support Vector Machine

We tested SVM with four different kernel functions as well as the three multi-class expansion methods. The kernel functions used were a linear kernel, a polynomial kernel, a Gaussian radial basis kernel, and an exponential radial basis kernel. The optimal values of the parameters for the last three kernels as well as the error penalty C for each kernel were determined for a given dataset by 10-fold cross validation, and the highest scoring kernel was chosen. A Matlab SVM toolbox was downloaded from http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/.

The parameters of an SVM classifier include the support vectors represented in the feature space, the weights associated with each support vector plus the bias, kernel parameters used in the kernel function, and the scale factor introduced by a multi-class expansion method. Let S_idenote the number of support vectors in the ith model, k denote the number of kernel parameters, and m denote the number of models that need to be saved for a given multi-class expansion method. The total number of parameters of an SVM is then:

where m = K for the maxwin strategy and m = K * (K - 1)/2 for the DAG and pairwise strategies. Except for the linear kernel function that has only one kernel parameter (penalty C), all other three kernel functions have two kernel parameters (Table 2).