Reasearch Awards nomination

Email updates

Keep up to date with the latest news and content from BMC Systems Biology and BioMed Central.

This article is part of the supplement: Selected articles from the 24th International Conference on Genome Informatics (GIW2013)

Open Access Research

Shrinkage regression-based methods for microarray missing value imputation

Hsiuying Wang1, Chia-Chun Chiu2, Yi-Ching Wu1 and Wei-Sheng Wu2*

Author Affiliations

1 Institute of Statistics, National Chiao Tung University, 1001 University Road, 300 Hsinchu, Taiwan (R. O. C

2 Department of Electrical Engineering, National Cheng Kung University, No.1 University Road, 701 Tainan, Taiwan

For all author emails, please log on.

BMC Systems Biology 2013, 7(Suppl 6):S11  doi:10.1186/1752-0509-7-S6-S11


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1752-0509/7/S6/S11


Published:13 December 2013

© 2013 Wang et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Missing values commonly occur in the microarray data, which usually contain more than 5% missing values with up to 90% of genes affected. Inaccurate missing value estimation results in reducing the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, the regression-based methods are very popular and have been shown to perform better than the other types of methods in many testing microarray datasets.

Results

To further improve the performances of the regression-based methods, we propose shrinkage regression-based methods. Our methods take the advantage of the correlation structure in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. Besides, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation in six testing microarray datasets than the existing regression-based methods do.

Conclusions

Imputation of missing values is a very important aspect of microarray data analyses because most of the downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.

Keywords:
missing value; imputation; microarray; regression

Background

Nowadays microarray technique has become an important and useful tool in functional genomics research. This high throughput technique allows the characterization of the gene expression of the whole genome by measuring the relative transcript levels of thousands of genes in various experimental conditions or time points [1]. Microarray data analyses have been widely used to investigate various biological processes such as the cell cycle process [2-8] and the stress response [9,10].

Although the microarray technology has been developed for more than a decade, typical microarray data still contain more than 5% missing values with up to 90% of genes affected [11]. Missing values could be generated by various reasons, including technological failures, administrative error, insufficient resolution, image corruption, dust or scratches on the slide [12]. As many downstream analysis methods (such as gene clustering, disease classification and gene network reconstruction) require complete datasets, missing value estimation becomes an important pre-processing step in the microarray data analysis [11-13].

The missing values in the microarray dataset are traditionally estimated by repeating the microarray experiments or simply replacing the missing values with zero or the row average (the average expression over the experimental conditions). Because these approaches are either time-consuming or leading to serious estimation errors, more advanced missing value imputation methods are needed to solve the missing value problems. In 2001, Troyanskaya et al. published the first two missing value imputation algorithms based on the k-nearest neighbors (kNNimpute) and the singular value decomposition (SVDimpute) [12]. Since then, a lot of missing value imputation methods have been proposed such as Bayesian principal component analysis (BPCA) [14], Gaussian mixture clustering imputation (GMCimpute) [11], conditional ordered list imputation [15], random-forest-based imputation [16] and so on.

Among the existing missing value imputation methods, the regression-based methods are very popular and contain many algorithms, including least squares imputation (LSimpute) [17], local least squares imputation (LLSimpute) [18], sequential local least squares imputation (SLLSimpute) [19], and iterated local least squares imputation (ILLSimpute) [13]. LSimpute estimates the missing values in the target gene by using a weighted average of the k estimates from the k most similar genes. Each estimate is attained by constructing a single regression model of the target gene by a similar gene. LLSimpute represents the target gene as a linear combination of k similar genes by a multiple regression model and uses the regression coefficients to estimate the missing values. SLLSimpute modifies the LLSimpute by estimating the missing values sequentially from the gene containing the fewest missing values and partially utilizing these estimated values. ILLSimpute modifies the LLSimpute by not choosing the similar genes with a fixed number k but defining the similar genes as the genes whose distances from the target gene are less than a distance threshold and then runs LLSimpute iteratively.

In this study, we focus on the regression-based methods because these methods have been shown to have better performances than the other existing methods in many testing microarray datasets [20,21]. To further improve the performance of the regression-based methods, we propose shrinkage regression-based methods which use a shrinkage estimator to replace the least square estimator for the estimation of the regression coefficients in the regression model. The shrinkage estimator such as the James-Stein estimator has been shown to dominate the least square estimator in many statistical models [22,23]. By adopting our new regression coefficients in the regression-based methods, we showed that an improvement on missing value estimation in six testing microarray datasets could be achieved.

Methods

In this study, we propose using the well-known shrinkage estimation approach to improve three existing regression-based methods (LLSimpute [18], SLLSimpute [19], and ILLSimpute [13]) for missing value estimation. We call our proposed methods the shrinkage regression-based methods (see Figure 1). In the following subsections, we first introduce the shrinkage estimation approach and then describe the proposed shrinkage LLSimpute, shrinkage SLLSimpute, and shrinkage ILLSimpute.

thumbnailFigure 1. The shrinkage regression-based methods.

Shrinkage estimation approach

One of the shrinkage estimators, the James-Stein estimator, for the normal distribution is introduced here. Suppose that Y1, Y2, ..., Yk are independent normal random variables and these k random variables all have a common known variance, but their means are unknown and different. Let Yi ~ N(θi, σ2) and Y = (Y1, ..., Yk). Then we have Y ~ N(θ, σ2I), where θ = (θ1, ..., θk) and I is a k × k identity matrix. Let d(Y) = (d1(Y), ..., dk(Y)) be an estimator of θ. Under the squared error loss function

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M1">View MathML</a>

(1)

we are interested in finding estimators of θ such that the mean squared error EY [L (θ, d(Y))] is minimized. An intuitive estimator of θ is Y (i.e. <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M2">View MathML</a>). However, Stein [22] showed that when k ≥ 3, there exists other estimators with smaller mean squared error than the intuitive estimator Y. For k ≥ 3, under the squared error loss, the intuitive estimator Y is dominated by the estimator

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M3">View MathML</a>

(2)

where <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M4">View MathML</a>[23]. The estimator in (2) is called the James-Stein estimator in the literature [23]. With the form in (2), the James-Stein estimator of θi is

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M5">View MathML</a>

(3)

It is worth noting that the estimator of θi in (3) depends on not only the random variable Yi, but also the other variables Y1, ..., Yi-1, Yi + 1, ..., Yk because of the term <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M6">View MathML</a>. On the contrary, the intuitive estimator <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M7">View MathML</a> does not use the other variables Y1, ..., Yi - 1, Yi + 1, ..., Yk but only uses Yi to estimate θi. It has been shown that estimators using other variables' information provide more accurate estimation for θ than the intuitive estimator does [22]. In fact, except for the estimator in (3), the estimators of the form

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M8">View MathML</a>

(4)

all have uniformly smaller mean squared error than the intuitive estimator Yi, for k ≥ 3 and 0 <c <2 (k - 2). Among all the estimators of the form in (4), the estimator in (3) has the minimized mean squared error. The shrinkage estimation approach has also been shown to have good performance in interval estimation [24,25]. Based on the James-Stein estimator in (3), we developed shrinkage regression-based imputation methods.

Notations

In a typical microarray data matrix, the rows are the genes under investigation and the columns are the experimental conditions or time points. The microarray data matrix is obtained by performing a series of experiments on the same set of genes. We use <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M9">View MathML</a> to represent a microarray data matrix with m genes and n experiments, and assume <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M10">View MathML</a> which is true for microarray data. In the matrix G, a row <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M11">View MathML</a> represents the expressions of the ith gene in n experiments:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M12">View MathML</a>

(5)

where <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M13">View MathML</a> denotes the transpose of a column vector gi. If there is a missing value in the lth position of the ith gene, we denote it as <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a>, i.e. Gi,l = gil = <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a>.

Shrinkage local least squares imputation (Shrinkage LLSimpute)

In the LLSimpute method [18], a target gene with missing values is represented as a linear combination of k similar genes. Rather than using all genes in the dataset, only k genes with high similarity to the target gene are used. The procedure of selecting k similar genes is as follows. Suppose that the target gene is the first gene and has a missing value <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a> in the first position, i.e. <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a>= g11 in the matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M15">View MathML</a>. The Pearson correlation coefficient is used to find the k similar genes. These k similar genes are called the k-nearest neighbor genes, which have the k largest absolute values of the Pearson correlation coefficients. The Pearson correlation coefficient <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M16">View MathML</a> between the target gene and the jth gene is defined as

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M17">View MathML</a>

(6)

where <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M18">View MathML</a> and σj denote the average and the sample standard deviation of the vector (gj2, ..., gjn). When computing the correlation coefficients, gj1 is not used because it corresponds to the position of the missing value in the target gene. Based on these selected k-nearest neighbor genes, a matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M19">View MathML</a> and two vectors <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M20">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M21">View MathML</a> can be formed as follows

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M22">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a>is the missing value in the target gene g1 and <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M23">View MathML</a> are the k-nearest nieghbor genes of the target gene g1. Each row of matrix A consists of the last n - 1 elements of one k-nearest neighbor gene <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M24">View MathML</a>, 1 ≤ i k. The elements of the vector b comprise of the first elements of all these k-nearest neighbor genes and the elements of the vector w are the last n - 1 elements of the target gene g1. With the matrix A, and the vectors b and w, the least squares problem is formulated in LLSimpute as

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M25">View MathML</a>

(7)

Solving the above problem, the least square regression coefficients <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M26">View MathML</a> are acquired as

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M27">View MathML</a>

(8)

In the LLSimpute, the missing value is then estimated by

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M28">View MathML</a>

(9)

In this study, we want to improve the performance of LLSimpute by adjusting the regression coefficients in (8). Our shrinkage LLSimpute associates the LLSimpute method with the shrinkage estimator to impute the missing values. Our method replaces the regression coefficient estimators <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M29">View MathML</a> in (8) by the shrinkage estimator, and then use the new estimator to estimate the missing value <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M14">View MathML</a>in (9). However, we found that applying the existing shrinkage estimator in (3) did not always improve the performance of LLSimpute. Therefore, we tested different forms of the shrinkage coefficient estimators and conceived a feasible coefficient estimator to improve the LLSimpute method. We proposed using the shrinkage regression coefficients

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M30">View MathML</a>

(10)

to replace the conventional coefficients in (8), where σ2 is the variance of the coefficients <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M31">View MathML</a> S is the norm of the coefficients (i.e. <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M32">View MathML</a>), k is the row number of the matrix A and <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M33">View MathML</a> is the column number of the matrix A, which equals n - 1 in this case. Finally, the missing value is estimated as

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M34">View MathML</a>

(11)

where <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M35">View MathML</a>.

Shrinkage sequential local least squares imputation (Shrinkage SLLSimpute)

In the LLSimpute, it does not use the information of genes with missing values since the existence of missing values hinders the use of the other observed values of that gene. In the SLLSimpute method, it estimates the missing values sequentially from the gene containing the fewest missing values and partially utilizes these estimated values. The details of SLLSimpute [19] is described as follow. First, the microarray matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M36">View MathML</a> is divided into two submatrices: a complete matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M37">View MathML</a> consisting of genes without missing values and an incomplete matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M38">View MathML</a> consisting of genes with missing values. In the incomplete matrix G2, the genes are sorted by their missing rates. The first gene has the smallest missing rate and the last gene has the largest missing rate. The missing rate is calculated by

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M39">View MathML</a>

(12)

where ci is the number of missing values in i-th gene. The imputation is executed sequentially from the first gene of G2. That is, the first gene of G2 which has the smallest missing rate is selected as the target gene firstly. Then LLSimpute is applied to estimate the missing values in the target gene by finding the k-nearest neighbour genes from the complete matrix G1 and then using the formula in (9) to estimate the missing values. After filling all the missing values in the target gene, it is moved to G1. Then the second gene of G2 is selected as the target gene and repeat the same process again. By moving the genes whose missing values have been imputed to the complete matrix, the previous target genes with imputed values can be utilized for the missing value estimation of the following target gene. However, too many missing values in a gene will result in big estimation error and reusing a gene with too many imputed values will reduce the imputation performance. Therefore, only the genes with missing rates less than a threshold r0 are reused, where r0 is set as the average missing rate of all genes containing missing values, i.e.,

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M40">View MathML</a>

(13)

By a similar argument as for the shrinkage LLSimpute, we apply the shrinkage estimator to SLLSimpute. The shrinkage SLLSimpute adjusts the coefficients of the regression model by the formula in (10) and use the formula in (11) to estimate the missing values.

Shrinkage iterated local least squares imputation (Shrinkage ILLSimpute)

LLSimpute and SLLSimpute methods select k-nearest neighbor genes for a target gene, where k is a fixed number. However, in the ILLSimpute method [13], it does not fix the number of similar genes selected. Alternatively, it defines the similar genes as the genes whose distances to the target genes are less than a distance threshold <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M41">View MathML</a>. The rationale of using a distance threshold rather than using a fixed number of similar genes is that some of the k-nearest neighbor genes are already far away from the target gene and are not very similar to the target gene.

The procedure of ILLSimpute is as follows. In the first iteration, missing values of each target gene are filled with the row average. Then a distance threshold <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M42">View MathML</a> is used to select the similar genes of each target gene. Finally, LLSimpute method is used to estimate the missing values of each target gene. In the later iteration, ILLSimpute method uses the imputed results from the previous iteration to reselect the similar genes of each target gene (using the same distance threshold) and applies LLSimpute method to re-estimate the missing values.

By a similar argument as for the shrinkage LLSimpute, we apply the shrinkage estimator to ILLSimpute. The shrinkage ILLSimpute adjusts the coefficients of the regression model by the formula in (10) and use the formula in (11) to estimate the missing values.

Results and Discussion

We conducted several experiments to compare the performances of our shrinkage regression-based methods and the original regression-based methods under different scenarios. In the first subsection, we introduce the benchmark datasets. In the second subsection, we describe how we measure the performance of various imputation methods. In the following three subsections, we report the comparison results for different number of similar genes used, different missing rates, and different noise levels. Finally, we further compare the performances of our shrinkage regressioni-based methods and three existing non-regression-based methods.

Datasets

Considering the effects of dataset selection and types of microarray experiments on the performance of an imputation method, six representative datasets (three non-time series and three time series) were used in our simulations. They were Ogawa's data from the study of phosphophate accumulation and poly-phosphophate metabolism (denoted as Ogawa, non-time series) [26], Bohen's follicular lymphomas data (denoted as BohenSH, non-time series) [27], the data from a lymphoma study (denoted as Lymphoma, non-time series) [28], the data from Brauer's experiments which studied the physiological response to glucose limitation in batch and steady-state cultures of yeasts (denoted as Brauer05, time series) [29], and Shapira's oxidative stress data (denoted as Shapira04A and Shapira04B, time series) [30]. We divided Shapira's data into two datasets because the authors used one kind of oxidative chemical in the experiment in Shapira04A, but they used another kind of oxidative chemical in the experiment in Shapira04B. The six microarray datasets were used as benchmark datasets in numerical experiments to compare the performances of our shrinkage regression-based methods and the original regression-based methods. Each dataset was processed by deleting the genes with missing values to generate a complete data matrix, and the details of these datasets were listed in Table 1.

Table 1. Benchmark datasets.

The performance measure

A common criterion used to compare the performances of different imputation methods is the normalized root mean squared error (NRMSE) [11-13,17-19]. From a microarray dataset, we can obtain an original data matrix M0 with m genes and n experiments, and then we can construct a complete matrix <a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M43">View MathML</a> by deleting the genes with missing values. After the complete data matrix M1 is established, we randomly select a specific percentage of the elements of M1 and regard these elements as missing values. Then we estimate the missing values using various imputation methods and compare their performances using NRMSE which is shown below:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/7/S6/S11/mathml/M44">View MathML</a>

(14)

where yguess and yans are vectors whose elements are the estimated values by an imputation method and the known answers for all missing entries, respectively.

Performance comparison for different k values

A parameter k, the number of similar genes used, has to be determined before using two regression-based methods (LLSimpute and SLLSimpute). Since the performance of both algorithms is known to be affected by the k value used and different microarray datasets may have different optimal k values [18,19], we tested several possible k values (50, 100, 150, 200, 250 and 300) on six benchmark datasets. Table 2 listed the optimal k values for LLSimpute and SLLSimpute on each of the six benchmark datasets. Another regression-based method (ILLSimpute) does not have the parameter k and therefore was not considered in this numerical experiment.

Table 2. The optimal k value for each benchmark dataset.

For each of the six benchmark dataset, we also compared the performances of the proposed shrinkage regression-based methods and the original regression-based methods for several possible k values (50, 100, 150, 200, 250 and 300). In our numerical experiments, missing rate for each benchmark dataset was set to be 5%. Namely, for each dataset, we randomly removed 5% entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds and the average NRMSE of these five simulations was used to compare the performances of different imputation methods.

As shown in Figure 2, the proposed shrinkage LLSimpute outperforms LLSimpute for all k values and all benchmark datasets. Similarly, the proposed shrinkage SLLSimpute outperforms SLLSimpute for all k values and all benchmark datasets (see Figure 3). The simulation results suggest that utilizing a shrinkage estimation approach to adjust the coefficients of the regression model can improve the performances of the original regression-based methods.

thumbnailFigure 2. Performance comparison between shrinkage LLS (shr_LLS) and LLS for different k values.

thumbnailFigure 3. Performance comparison between shrinkage SLLS (shr_SLLS) and SLLS for different K values.

Performance comparison for different missing rates

In real applications, different microarray data may have different missing rates to be imputed. It is informative to know how an imputation method performs for different missing rates. Therefore, we compared the performances of the shrinkage regression-based methods and the original regression-based methods on the microarray data with different missing rates (1%, 5%, 10%, 15% and 20%). Namely, for each of the six benchmark dataset, we randomly removed x% (x = 1, 5, 10, 15 or 20) entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds and the average NRMSE of these five simulations was used to compare the performances of different imputation methods. Note that the optimal k value used for each benchmark dataset was listed in Table 2.

Figure 4 shows that the proposed shrinkage LLSimpute outperforms LLSimpute for all missing rates and all benchmark datasets. Figure 5 shows that the proposed shrinkage SLLSimpute outperforms SLLSimpute for all missing rates and all benchmark datasets. Figure 6 shows that the proposed shrinkage ILLSimpute outperforms ILLSimpute for all missing rates and all benchmark datasets. The simulation results suggest that utilizing a shrinkage estimation approach to adjust the coefficients of the regression model can improve the performances of the original regression-based methods.

thumbnailFigure 4. Performance comparison between shrinkage LLS (shr_LLS) and LLS for different missing rates.

thumbnailFigure 5. Performance comparison between shrinkage SLLS (shr_SLLS) and SLLS for different missing rates.

thumbnailFigure 6. Performance comparison between shrinkage ILLS (shr_ILLS) and ILLS for different missing rates.

Performance comparison for different noise levels

In real applications, different microarray data may contain different levels of noises. It is informative to know how an imputation method performs for different levels of noises inherent in the microarray data. Therefore, we compared the performances of the shrinkage regression-based methods and the original regression-based methods on the microarray data with different noise levels. For each of the six benchmark dataset, we added Gaussian noises with different levels into the data. The magnitudes of the noises were set in terms of the standard deviations ranging from 0 to 0.25 with a step size 0.05. In our numerical experiments, missing rate for each benchmark dataset was set to be 5% and the optimal k value used for each benchmark dataset was listed in Table 2. Namely, for each dataset (after adding Gaussian noises into the data), we randomly removed 5% entries of the complete matrix to generate a matrix with missing values, and then estimated the missing values using the shrinkage and the original regression-based methods. The same procedure was run for five independent rounds and the average NRMSE of these five simulations was used to compare the performance of different imputation methods.

Figure 7 shows that the proposed shrinkage LLSimpute outperforms LLSimpute for all noise levels and all benchmark datasets. Figure 8 shows that the proposed shrinkage SLLSimpute outperforms SLLSimpute for all noise levels and all benchmark datasets. Figure 9 shows that the proposed shrinkage ILLSimpute outperforms ILLSimpute for all noise levels and all benchmark datasets. The simulation results suggest that utilizing a shrinkage estimation approach to adjust the coefficients of the regression model can improve the performances of the original regression-based methods.

thumbnailFigure 7. Performance comparison between shrinkage LLS (shr_LLS) and LLS for different noise levels.

thumbnailFigure 8. Performance comparison between shrinkage SLLS (shr_SLLS) and SLLS for different noise levels.

thumbnailFigure 9. Performance comparison between shrinkage ILLS (shr_ILLS) and ILLS for different noise levels.

Performance comparison with three existing non-regression-based methods

We have shown that our shrinkage regression-based methods perform better than the existing regression-based methods. Still, it would be interesting to know whether our shrinkage regression-based methods provide more accurate missing value imputation than the existing non-regression-based methods do. Therefore, we compared the performances of our shrinkage regression-based methods and three existing non-regression-based methods (kNNimpute [12], SVDimpute [12], and BPCA [14]) on the six benchmark microarray datasets. As shown in Figures 10, 11, 12, the proposed shrinkage regression-based methods outperform these three existing non-regression-based methods for almost all missing rates and all benchmark datasets. Taken together, our shrinkage regression-based methods are competitive alternatives to the existing methods for microarray missing value imputation.

thumbnailFigure 10. Performance comparison between shrinkage LLS (shr_LLS) and three non-regression-based methods for different missing rates.

thumbnailFigure 11. Performance comparison between shrinkage SLLS (shr_SLLS) and three non-regression-based methods for different missing rates.

thumbnailFigure 12. Performance comparison between shrinkage ILLS (shr_ILLS) and three non-regression-based methods for different missing rates.

Conclusions

Imputation of missing values is a very important aspect of microarray data analyses because most of downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. In this study, regression-based methods associated with a shrinkage estimation approach are proposed to estimate missing values in the microarray data. Our methods take the advantage of the correlation structure existing in the microarray data and select similar genes for the target gene by Pearson correlation coefficients. Besides, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and apply the new coefficients of the regression model to estimate missing values. Simulation results show that the proposed shrinkage regression-based methods provide more accurate missing value estimation for various types of datasets than the original regression-based methods do. Since our proposed methods can be applied to modify any kind of regression-based methods and can provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WSW conceived the research topic and provided essential guidance. HW developed the alogirthm. CCC did all the simulations. HW, CCC, YCW, and WSW wrote the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

This study was supported by the National Cheng Kung University and Taiwan National Science Council NSC 99-2628-B-006-015-MY3 and NSC 101-2118-M-009-006-MY2.

Declarations

The full funding for the publication fee came from Taiwan National Science Council and College of Electrical Engineering and Computer Science, National Cheng Kung University.

This article has been published as part of BMC Systems Biology Volume 7 Supplement 6, 2013: Selected articles from the 24th International Conference on Genome Informatics (GIW2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/7/S6.

References

  1. Schena M, Shalon D, Davis R, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

    Science 1995, 270:467-470. PubMed Abstract | Publisher Full Text OpenURL

  2. Wu W, Li W, Chen B: Computational reconstruction of transcriptional regulatory modules of the yeast cell cycle.

    BMC Bioinformatics 2006, 7:421. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  3. Rowicka M, Kudlicki A, Tu B, Otwinowski Z: High-resolution timing of cell cycle-regulated gene expression.

    Proc Natl Acad Sci USA 2007, 104:16892-16897. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Wu W, Li W, Chen B: Identifying regulatory targets of cell cycle transcription factors using gene expression and ChIP-chip data.

    BMC Bioinformatics 2007, 8:188. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  5. Futschik M, Herzel H: Are we overestimating the number of cell-cycling genes? The impact of background models on time-series analysis.

    Bioinformatics 2008, 24:1063-1069. PubMed Abstract | Publisher Full Text OpenURL

  6. Wu W, Li W: Systematic identification of yeast cell cycle transcription factors using multiple data sources.

    BMC Bioinformatics 2008, 9:522. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  7. Siegal-Gaskins D, Ash J, Crosson S: Model-based deconvolution of cell cycle time-series data reveals gene expression details at high resolution.

    PLoS Comput Biol 2009, 5:e1000460. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Wang H, Wang Y, Wu W: Yeast cell cycle transcription factors identification by variable selection criteria.

    Gene 2011, 485:172-176. PubMed Abstract | Publisher Full Text OpenURL

  9. Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes.

    Mol Biol Cell 2000, 11:4241-4257. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Wu W, Li W: Identifying gene regulatory modules of heat shock response in yeast.

    BMC Genomics 2008, 9:439. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. Ouyang M, Welsh W, Georgopoulos P: Gaussian mixture clustering and imputation of microarray data.

    Bioinformatics 2004, 20:917-923. PubMed Abstract | Publisher Full Text OpenURL

  12. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R: Missing value estimation methods for DNA microarrays.

    Bioinformatics 2001, 17:520-525. PubMed Abstract | Publisher Full Text OpenURL

  13. Cai Z, Heydari M, Lin G: Iterated local least squares microarray missing value imputation.

    J Bioinform Comput Biol 2006, 4:935-957. PubMed Abstract | Publisher Full Text OpenURL

  14. Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data.

    Bioinformatics 2003, 19:2088-2096. PubMed Abstract | Publisher Full Text OpenURL

  15. Yu T, Peng H, Sun W: Incorporating nonlinear relationships in microarray missing value imputation.

    IEEE/ACM Trans Comput Biol Bioinform 2011, 8:723-731. OpenURL

  16. Stekhoven D, Bühlmann P: MissForest-non-parametric missing value imputation for mixed-type data.

    Bioinformatics 2012, 28:112-118. PubMed Abstract | Publisher Full Text OpenURL

  17. Bø T, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods.

    Nucleic Acids Res 2004, 32:e34. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Kim H, Golub G, Park H: Missing value estimation for DNA microarray gene expression data: local least squares imputation.

    Bioinformatics 2005, 21:187-198. PubMed Abstract | Publisher Full Text OpenURL

  19. Zhang X, Song X, Wang H, Zhang H: Sequential local least squares imputation estimating missing value of microarray data.

    Comput Biol Med 2008, 38:1112-1120. PubMed Abstract | Publisher Full Text OpenURL

  20. Celton M, Malpertuy A, Lelandais G, de Brevern A: Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments.

    BMC Genomics 2010, 11:15. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  21. Brock G, Shaffer J, Blakesley R, Lotz M, Tseng G: Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.

    BMC Bioinformatics 2008, 9:12. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  22. Stein C: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution.

    Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1956, 1:197-206. OpenURL

  23. James W, Stein C: Estimation with quadratic loss.

    Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1961, 1:361-379. OpenURL

  24. Wang H: Brown's paradox in the estimated confidence approach.

    The Annals of Statistics 1999, 27:610-626. Publisher Full Text OpenURL

  25. Wang H: Improved confidence estimators for the multivariate normal confidence set.

    Statistica Sinica 2000, 10:659-664. OpenURL

  26. Ogawa N, DeRisi J, Brown P: New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis.

    Molecular Biology of the Cell 2000, 11:4309-4321. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  27. Bohen S, Troyanskaya O, Alter O, Warnke R, Botstein D, Brown P, Levy R: Variation in gene expression patterns in follicular lymphoma and the response to rituximab.

    Proc Natl Acad Sci USA 2003, 100:1926-1930. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson J, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.

    Nature 2000, 403:503-511. PubMed Abstract | Publisher Full Text OpenURL

  29. Brauer M, Saldanha A, Dolinski K, Botstein D: Homeostatic adjustment and metabolic remodeling in glucose-limited yeast cultures.

    Mol Biol Cell 2005, 16:2503-2517. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. Shapira M, Segal E, Botstein D: Disruption of yeast forkhead-associated cell cycle transcription by oxidative stress.

    Mol Biol Cell 2004, 15:5659-5669. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL