Abstract
Background
Mutual information is a measure of similarity between two variables. It has been widely used in various application domains including computational biology, machine learning, statistics, image processing, and financial computing. Previously used simple histogram based mutual information estimators lack the precision in quality compared to kernel based methods. The recently introduced Bspline function based mutual information estimation method is competitive to the kernel based methods in terms of quality but at a lower computational complexity.
Results
We present a new approach to accelerate the Bspline function based mutual information estimation algorithm with commodity graphics hardware. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) programming model to design and implement a new parallel algorithm. Our implementation, called CUDAMI, can achieve speedups of up to 82 using double precision on a single GPU compared to a multithreaded implementation on a quadcore CPU for large microarray datasets. We have used the results obtained by CUDAMI to infer gene regulatory networks (GRNs) from microarray data. The comparisons to existing methods including ARACNE and TINGe show that CUDAMI produces GRNs of higher quality in less time.
Conclusions
CUDAMI is publicly available opensource software, written in CUDA and C++ programming languages. It obtains significant speedup over sequential multithreaded implementation by fully exploiting the compute capability of commonly used CUDAenabled lowcost GPUs.
Background
Mutual information (MI) is used to measure the mutual dependence between two random variables in information theory. As an information theoretic approach, it has been used in various areas including physics [1], image processing [2,3], speech recognition [4], and bioinformatics [5,6]. An advantage of MI compared to many other similarity measures (such as Pearson correlation), is its capability to detect nonlinear correlations between the two variables [7]. In [1], a recursive method for calculating MI is presented and used on dynamical systems and chaotic data. An overview about MI used in medical imaging applications is presented in [2]. An MI application to find features for audiovisual speech recognition tasks is proposed in [4]. Zhou et al. [5] used MI for gene clustering to determine gene regulations. In [6], MI is used to measure nonlinear relationships between the expressions of two genes. Because of its inherent algorithmic complexity, the reverse engineering or inference of gene regulatory networks (GRNs) from geneexpression profile (microarray) data remains a big challenge in system biology [79]. Approaches such as relevance networks [10], informationtheoretic methods [11], and Bayesian networks [12,13] have been used to infer GRNs. However, due to their high computational complexities, these methods are very timeconsuming and cannot be used to process large datasets. Daub et al. [7] proposed a Bspline function based MI estimation algorithm which can achieve accuracy comparable to kernelbased MI approaches. Since the Bspline based approach has a lower computational complexity, it is widely used in practice. However, it is still highly timeconsuming for big gene expression datasets. Since availability and size of such datasets is growing rapidly, finding fast solutions is of high importance to research in this area.
In this paper, we present a new approach to accelerate Bspline function based MI estimation using the CUDA programming model. We take advantage of shared memory for fast I/O operations to gain efficiency. We further use double precision floating point arithmetic as well as an efficient partitioning scheme to overcome the GPU device memory limitation for big datasets. We evaluate our implementation for a number of microarray datasets. It achieves speedups up to 82 on an Nvidia Tesla C2050 GPU compared to the publicly available multithreaded implementation by Daub et al. [7] running on an Intel i7 quadcore CPU. We use our MI results to infer GRNs by replacing the timeconsuming MI calculation in ARACNE [14]. The results show that the runtime needed for inferring GRNs is much shorter and quality of the resulting network is better by using our MI values compared to the original ARACNE.
The paper presented by Wilson et al. [15] is similar to the approach presented in this paper since it also uses CUDAenabled GPUs to accelerate Bspline based MI estimation. However, it only implements the weighting matrix computation on the GPU. Weighting matrix calculation is only one step in the Bspline based MI estimation algorithm. The remaining steps are performed on the CPU, thus limits the possible speedup. The solution presented in this paper overcomes this limitation by accelerating all steps of the Bspline based MI estimation using multiple CUDA kernels. Our experiments therefore show significant higher speedups than the ones reported in [15].
Bspline Based Mutual Information Estimator
Definition of Mutual Information
Mutual information (MI) is used as a quantity to measure the dependence of two discrete random variables. For a random variable X with values over a finite set of M possible values (or states) {x_{1}, x_{2}, ..., x_{M}}, the Shannon entropy H(X) is defined as follows:
where p(x_{i}) is the probability of the value x_{i}. The Shannon entropy is a measure of the uncertainty of X. If the outcome of a measurement of X is completely determined, then p(x_{i}) = 1 and the entropy will be 0. The joint entropy H (X, Y ) of two discrete variables (X and Y ) consisting of states {x_{1}, ..., x_{M}} and {y_{1}, ... y_{M }} can be defined by the following equation:
where p(x_{i}, y_{j}) is the joint probability between two states x_{i }and y_{j}. The joint entropy denotes the quantity of entropy contained in a joint system of two variables. The MI for X and Y can then be defined by:
From Eqn. (3), we can see that the MI is zero when X and Y are independent
Estimating Mutual Information for Continuous Data
MI can be easily estimated for discrete data. In order to estimate MI for continuous data (or measurements) which are supplied by many practical applications, a common method is to divide the continuous values into R discrete bins {a_{1}, ..., a_{R}}. Each bin may contain several data points. Assume the continuous dataset consists of M measurements {x_{1}, ..., x_{M}}. The indicator function θ_{j }is used to count the number of measurements within each bin a_{j }(j = 1, ..., R) [7]. The probability for each bin then equals to the number of measurements within the bin divided by the total number of measurements (see Eqn. (4)).
The indicator function θ_{j }is a binary function defined as:
The joint probability p'(a_{j}, b_{k}) for two random variables with measurements {x_{1}, ..., x_{M}} and {y_{1}, ..., y_{M }} and two given bins a_{j }and b_{k}, can be calculated as follows:
After estimating the probabilities using Eqns. (4) and (6), we can compute MI using Eqns. (1) to (3). In the simple binning method mentioned above, each continuous value is assigned to exactly one bin. Values near to the border of a bin can be shifted to neighboring bins by small fluctuations. Thus, the MI result is strongly affected by noise. In order to overcome this shortcoming of the simple binning method, Daub et al. [7] introduced the Bspline approach. In Daub's approach, each measurement can be assigned to multiple bins with weights given by Bspline functions.
Bspline Functions
We use the same knot vector in the Bspline function as in [7]. A knot vector t_{i }with R bins and the spline order k is defined in Eqn. (7). The spline order k should meet the condition k ∈ {1, ..., R  1}
The degree of the polynomial functions is determined by the spline order k. The Bspline function is defined as follows.
where i ∈ [1, R] and z ∈ [0, R  k + 1] is the domain interval of the Bspline function. Note that continuous values should be normalized to fit into the domain interval. Assume we have M continuous data points in the dataset, x_{1}, ..., x_{M }, the normalization procedure is as follows.
1. Find the minimum and maximum value x_{min }and x_{max }among all M data points.
2. Compute the normalized value z_{i}, (i = 1, ..., M) for the continuous value using
where R is the number of bins used in the Bspline algorithm and k is the spline order of the Bspline function.
Sequential MI estimator
Our parallel CUDA algorithm (described in Methods) extensively uses the concept of the weighting matrix of each given random variable, which is defined as follows.
Definition Weighting Matrix (WM): Consider a random variable X = {x_{1}, x_{2}, ..., x_{M}}, R bins {a_{1}, ..., a_{R}}, and the Bspline function order k. The weighting matrix for X is an M × R matrix denoted as W (X), where W (X)_{i}_{,}_{j }contains weighting coefficient of value x_{i }in bin a_{j}; i.e. W (X)_{i}_{,}_{j }= B_{j}_{,}_{k }(z_{i}) where z_{i }is the Bspline domain normalized value of x_{i }for each 1 ≤ i ≤ M and 1 ≤ j ≤ R.
Based on Eqns. (1) to (8), we can now outline the sequential Bspline based MI estimation algorithm. It consists of two parts: WM(X, R, k) and Single_MI (X, Y, R, k), which are described in Algorithm 1 and Algorithm 2.
Algorithm 1: W M(X, R, k)
Input: Random variable X = {x_{1}, ..., x_{M }}, number of bins R, BSpline order k
Output: W (X)
foreach i, 1 ≤ i ≤ M do
Calculate the normalized variable z_{i}, (i = 1, ..., M) using Eqn. (9);
foreach j, 1 ≤ j ≤ R do
Calculate the weighting coefficient B_{j}_{,}_{k }(z_{i}) using Eqn. (7) and (8) with the normalized value
z_{i };
end
end
Output W (X)_{i}_{,}_{j }= B_{j}_{,}_{k }(z_{i})
Algorithm 2: Single_M I (X, Y, R, k)
Input: Random variable measurements X = {x_{1}, ..., x_{M}} and Y = {y_{1}, ...,y_{M}}, number of bins R, BSpline order k
Output: M I (X, Y )
Call W M (X, R, k) to get W (X);
Call W M (Y, R, k) to get W (Y);
foreach j, 1 ≤ j ≤ R do
Calculate the probability of each bin for X and Y ;
end
Calculate the self entropy H (X) and H (Y);
foreach k, 1 ≤ k ≤ R do
foreach l, 1 ≤ l ≤ R do
Calculate joint probabilities;
end
end
Calculate the joint entropy;
Calculate the mutual information using Eqn. (3);
The Single_MI algorithm shows how the MI for one pair of random variables is calculated. Practical applications of MI usually have a large number of random variables as input, where the mutual information of each pair of variables needs to be computed. For example, in this paper we are interested in the analysis of gene expression data, where the input consists of N genes Ω = {X_{1}, ..., X_{N}}, where N is typically a few thousands. For each gene X_{i }we have M gene expression measurements; i.e. X_{i }= {x_{i}_{1}, ..., x_{iM }}. We then want to calculate M I (X_{i}, X_{j}) for all 1≤ i ≤ N  1, i < j ≤ N. The resulting matrix of pairwise MI values can be used as input to a subsequent clustering algorithm. The algorithm to calculate all pairwise MI values is outlined in Algorithm 3.
Algorithm 3: Pairwise_MI (Ω, R, k)
Input: N Random Variables Ω = {X_{1}, ..., X_{N}} consisting of M measurements each; i.e., X_{i }= {x_{i}_{1}, ..., x_{iM }} for all 1 ≤ i ≤ N; number of bins: R; BSpline order: k.
Output: M I (X_{i}, X_{j}) for all 1 ≤ i ≤ N  1, i < j ≤ N
foreach i, 1 ≤ i ≤ N do
W M (X_{i}, R, k);
end
foreach i, 1 ≤ i ≤ N  1 do
foreach j, i ≤ j ≤ N do
Call Single _MI (X_{i}, X_{j}, R, k) to calculate MI for this gene pair using Algorithm 2;
end
end
Complexity Analysis
The most time consuming step of Algorithm 1 is the inner forloop. Assuming that the evaluation of a Bspline function call takes O(k) time, the time complexity of this step is O(M × R × k). We further need O (M × R) space to store the output weighting matrix. Since W M (X_{i}, R, k) is called N times in Algorithm 3, this leads to a time complexity of O (N × M × R × k) and space complexity of O (N × M × R) for the first forloop. The nested forloop of Algorithm 3 calls Algorithm 2 O(N^{2}) times. The time consuming part of Algorithm 2 is determined by the nested forloop which has time complexity O (M × R^{2}). Thus, the overall time complexity of Algorithm 3 is O (N^{2 }× M × R^{2}). Note that k and R are usually significantly smaller than N and M.
CUDA Programming Model
As an extension of C/C++, CUDA (Compute Unified Device Architecture) is used to write scalable multithreaded programs for CUDAenabled GPUs [16]. CUDA programs can be executed on GPUs with NVIDIA's Tesla unified computing architecture. Examples of CUDA enabled GPUs ranging from GeForce 8/9/200, Tesla 800/1000, C1060, C2050, to Quadro FX 3000/4000/5000 series.
CUDA programs contain a sequential part, called the kernel program. The kernel is written in conventional scalar Ccode. It represents the operations to be performed by a single thread and is invoked as a set of concurrently executing threads. These threads are organized in a hierarchy consisting of socalled thread blocks and grids. A thread block is a set of concurrent threads and a grid is a set of independent thread blocks. Each thread has an associated unique ID. Similar to MPI processes, CUDA provides each thread access to its unique ID through corresponding variables. The total size of a grid (dimGrid) and a thread block (dimBlock) is explicitly specified in the kernel functioncall:
Kernel <<< dimGrid, dimBlock >>> (parameter list);
The hierarchical organization into blocks and grids has implications for thread communication and synchronization. Although threads located in different blocks cannot communicate or synchronize directly, threads within a thread block can communicate through a perblock shared memory (PBSM) and may synchronize using barriers. The Tesla architecture supports CUDA applications using a scalable processor array. The array consists of a number of streaming multiprocessors (SM). In the latest Fermi architecture [17], each SM contains 32 scalar streaming processor (SP) cores, which share a PBSM of size up to 48 KB. All threads of a thread block are executed concurrently on a single SM. The SM executes threads in small groups of 32, called warps, in singleinstruction multiplethread (SIMT) fashion. Thus, parallel performance is generally penalized by datadependent conditional branches and improved if all threads in a warp follow the same execution path. There are two types of parallelism supported by CUDA. First, a large number of threads can run in parallel independently. We call this type of parallelism the coarsegrained. Second, multiple threads within each thread blocks can cooperate on some memory spaces (such as the shared memory) simultaneously. For instance, shared memory I/O request made of n addresses can be serviced in a single clock cycle at the same time. Threads in the same thread block can cooperate together by efficiently sharing data and synchronizing their execution to coordinate memory access with other threads. We call this type of parallelism finegrained.
Previous work on using CUDA for computational biology focused on sequence alignment [1821], and molecular dynamics simulations [22]. In this paper we present the parallel Bspline function based MI estimation which can help to infer GRNs using CUDA. So far, parallel MI estimation using Bsplines has been limited to multithreading on multicore CPUs [7] and MPI on distributed memory clusters [8,23]. The results presented in this paper indicate that the CUDA approach can provide higher performance at reasonable hardware cost.
Methods
Parallel MI Estimation using CUDA
Algorithm 4: CUDAbased MI estimation algorithm
Input: N genes, each with M experiments.
Output: Pairwise MI values.
/*Host programs executed on CPU*/
Initialize parameters controlling MI estimation;
Load gene expression data into GPU device memory and launch the kernels;
/*Kernel program executed on GPU*/;
Compute the WM for each gene (Kernel 1);
Check the data integrity of the input data (Kernel 2);
Compute the self entropy for each gene (Kernel 3);
Compute the joint entropy and MI value for each gene pair (Kernel 4);
/*Host programs executed on CPU*/
Read back results to CPU and output;
Algorithm 4 shows the pseudocode of our CUDAbased MI estimation algorithm. There are four CUDA kernel programs in our algorithm. In Kernel 1 the gene expression data is divided evenly into subsets according to the total number of thread blocks. All thread blocks then work in parallel to compute the probabilities for the local gene subset using the Algorithm 1. The probabilities of each gene are stored in the WMs. In Kernel 1, threads in the same thread block process each WM in a finegrained parallel fashion. Before computing the self entropy for each gene using Kernel 3, we use Kernel 2 to check the integrity of the gene expression data first. The execution of Kernel 2 is necessary if there are blank values (caused by missing experiments) in the gene expression data. In Kernel 4, WM pairs are distributed evenly to all thread blocks. Threads in the same thread block then work in a finegrained way to compute the MI value for each gene pair. Figure 1 shows our algorithm framework for CUDAMI.
Figure 1. The algorithm framework for CUDAMI.
Parallel Computation of WM
Kernel 1 is used to compute WMs from gene expression data. We take advantage of the inherent parallelism of WM calculation and design parallel algorithms using the finegrained method. In our algorithm, the gene expression data is first divided evenly into subsets according to the total number of thread blocks. Each thread block then computes WMs for the allocated subsets of gene expression data. Since all elements in the same row of the WM can be computed independently, the WM can be computed row by row in parallel. In order to speed up the I/O operations, the rows are stored in highperformance shared memory arrays. The size of each shared memory array is R values, which equals the number of bins. In the kernel program there are R threads working in the negrained concurrent way to calculate each row of the WM. Assuming there are M rows in a WM, we partition the WM into small sized batches so that each batch can be mapped into the shared memory (see Figure 2). In Figure 2, the size of each batch is Q × R. The WM can then be computed batch by batch. Figure 3 shows that totally Q × R shared memory space is required to calculate each batch.
Figure 2. Assuming there are R bins and M rows in a WM. We partition the WM into small batches so that the size of each batch is Q × R.
Figure 3. There are R bins and Q rows in a WM batch. S_{1}, S_{2}, ..., S_{Q }are shared memory arrays to store each row of the batch. The size of each shared memory array is R values (the number of bins).
In practice, there are R + k  1 threads working in Kernel 1 in the finegrained concurrent way to calculate each row of the WM. Thus there are totally Q × (R + k  1) threads to compute the WM for each gene. It should be noted that the computation of the WM for each gene may span into different thread blocks because the maximum threads for each GPU block is limited and the size of the WM is varying with the size of the experiment M, the bin number R and the spline order k.
The reason for using R + k  1 threads instead of R threads for computing each row of WM is because the Bspline function in Eqn. (8) is recursively defined. In order to compute each row of the WM efficiently, we unwind the recursion relationships in Eqn. (8) into k dependent steps. In each step, R + k  1 threads work in the finegrained concurrent way to calculate the values of the shared memory array. In the first step, the shared memory array is initialized using Eqn. (8). In the subsequent step, the calculation of each cell on the shared memory array follows the dependency relationship in Figure 4. After the last step is done, the first R values of the shared memory array are the final results for the current row calculation. Figure 4 illustrates this method for the parameters k = 3 and R = 10.
Figure 4. Illustration of unwinding the recursive Eqn. (8) into k dependent steps using k = 3 and R = 10 for computing one row of a WM. S is the shared memory array. In each step, 12 threads work together to compute the values of S. After 3 steps, the first 10 values in S is the final result.
After all WMs are computed, Kernel 2 will be invoked to check the data integrity. Kernel 3 then follows the first forloop of Algorithm 2 to compute the self entropy for each gene. WMs and the self entropy calculated in Kernel 1 and 3 will be passed to Kernel 4 to compute the MI values for all gene pairs.
Parallel Computation of MI
According to Algorithm 3, two steps are involved in Kernel 4. Firstly, the joint entropy should be calculated using Eqn. (2) and (6). Secondly, we need to compute the MI values using Eqn (3). Profiling of these two steps for different datasets reveals that more than 99% of the overall runtime is spent on the first step. In order to calculate the joint entropy for a pair of genes, we need to perform matrix multiplication operations between the corresponding WMs using Eqn. (6). We have designed and implemented a tiled matrix multiplication algorithm to carry out this step (see Figure 5). Our method is similar to the matrix multiplication with shared memory example in the CUDA SDK [24]. However, instead of dividing WMs into square matrices as in [24], we divide them into submatrices of size Tile_ width × Bin_ number in order to make full use of the power of shared memory. We call these submatrices the tiles. In Figure 5 each WM is divided into two tiles. Then the matrix multiplication C = A × B can be implemented as C = Tile1_A × Tile1_B + Tile2_A × Tile2_B. Here, C is the joint WM. By dividing WMs into small tiles, we can store them in the fast shared memory and thus improve the performance greatly.
Figure 5. The tiled method to carry out matrix multiplication operations between two WMs.
Algorithm 5: CUDA Kernel 4
Input: WMs for each gene.
Output: Pairwise MI values.
foreach i, ≤ 1 ≤ i ≤ N  1 do
foreach j, i ≤ j ≤ N do
Assign (i, j)th pair of WM to one thread block;
R × R threads in each thread block work in parallel following the two forloops in Algorithm 2 to compute the joint WM by doing tiled matrix multiplication for the current WM pair;
One thread in this block is used to compute the joint entropy and MI;
end
end
In practice, a cyclic procedure is used in the kernel program to implement the tiled matrix multiplications. Each thread block first loads the current tile pair from global memory to shared memory. Then each thread computes one element of the tiled multiplication matrix. In the subsequent iteration, the next tile pair is loaded and multiplied. The tiled multiplication matrix will then be updated. This procedure will continue until all tile pairs are computed. The final tiled multiplication matrix is the joint WM we want to compute. At last, we calculate the MI value for the current gene pair. Algorithm 5 shows the pseudocode of our CUDA implementation of the parallel algorithm for Kernel 4.
Partitioning
Algorithm 6: CUDAbased MI estimation with partition
Input: N genes, each with M experiments.
Output: Pairwise MI values.
/*Host program executed on CPU*/
Initialize parameters controlling MI estimation;
Partition gene expression data into P groups ;
foreach i, 1 ≤ i ≤ P do
/*Kernel program executed on each thread*/
Load ith gene data group into the GPU device memory and compute WM for each gene using Algorithm 1;
Write the WMs to CPU RAM;
end
Partition WMs into Q groups;
foreach i, 1 ≤ i ≤ Q do
/*Kernel program executed on each thread*/
Load ith WM group into the GPU device memory and compute MI values using Kernel 2, Kernel 3 and Kernel 4;
Read MI values for current WM group back to CPU;
Write MI values to les;
end
The space complexity for storing all WMs is O(N × M × R). Using double precision floating point numbers this translates to 8 × N × M × R bytes of memory. The GPU global memory is not sufficient to load all WMs for parallel computation in the kernel for large datasets. Therefore, a method is required to partition the pairwise MI computation into a number of steps, where each step requires only a subset of WMs. By partitioning both the gene expression data and WMs into small groups, we can process large datasets using limited GPU device memory. Our partitioning method is illustrated in Algorithm 6.
The method shown in Algorithm 6 divides the gene expression data and WMs into smaller groups, which can be stored within the GPU global memory. Note that, the memory complexity of O(N × M × R) is dominated by the computational complexity of O(N^{2 }× M × R^{2}). Therefore, the required data transfer time can be completely hidden by the computation time.
Results and Discussion
We have implemented the double precision CUDAMI using CUDA Toolkit 3.0 and evaluated it on the following CUDAenabled hardware:
 Nvidia Tesla C2050: 1.15 GHz engine clock speed, 14 multiprocessors, 3 GB GDDR5 device memory, 48 KB shared memory/multiprocessor.
Tests have been conducted with this card installed in a PC with an Intel QuadCore i7920 2.66 GHz CPU, 12 GB RAM running Linux Fedora 10.
We have used CUDAMI to estimate MI values for different gene expression datasets. Two types of datasets are used in our simulations. One type consists of real datasets, the other type contains randomly generated simulated datasets for testing scalability. The number of genes and experiments in these datasets are shown in Table 1. In this table, the "nne" and "nasc" prefixed datasets are downloaded from. The two Yeast datasets are downloaded from the Eisen Lab website http://rana.lbl.gov/EisenData.htm webcite. They are all real gene expression datasets. To further test the scalability of our CUDAMI, we have additionally used simulated datasets with a varying number of genes and experiments. The simulated datasets were produced by ourselves and are listed as S2000_1000 to S10000_4000 in Table 1. We have compared the performance of our implementation to a widely used Bspline function based sequential MI estimation program  MIBE which is available from the author of [7]. In our tests we have used the double precision and multithreaded version of MIBE with four threads on our quadcore workstation. It is compiled by gcc 4.3.2 with all available compiler optimizations enabled and runs on an Intel QuadCore i7920 2.66 GHz CPU. In our experiments, we used the MIBE default parameters R = 10 and k = 3 for both MIBE and CUDAMI. Both tools use double precision floating point accuracy.
Table 1. Datasets used for performance evaluation.
Table 2 shows the runtime performance of MIBE and CUDAMI for processing different gene expression datasets. From Table 2 we can see that CUDAMI achieves speedups of up to 82 compared to the multithreaded MIBE. Because of the big memory usages of datasets S10000_2000, S10000_3000, and S10000_4000, the partitioning method (discussed later in Methods) is used to process them. Performance using the partitioning method is shown in bold characters in Table 2. From Table 2 we can see that the speedup of CUDAMI improves for a larger number of genes and measurements. There are two reasons for this observation. Firstly, there is higher arithmetic intensity for a larger number of genes and measurements. Secondly, the relative influence of the kernel overhead is reduced for bigger datasets. Table 3 shows the runtime performance of MIBE and CUDAMI for processing the nne4096_911 dataset using different parameters. The value of spline order ranges from 3 to 5 and the number of bins ranges from 10 to 20. From Table 3 we can make two observations:
Table 2. Comparison of runtime (in seconds) between multithreaded MIBE (4 threads) and CUDAMI
Table 3. Comparison of runtimes (in seconds) for processing the nne4096_911 dataset with various parameters.
1. The speedup does not change significantly with a larger value of spline order.
2. The speedup improves significantly with a larger number of bins.
The reason for Observation 1 is that the spline order k does not have much impact on the overall runtime of the MI estimation algorithm. On the contrary because of the quadratic item R^{2 }in the Big O notion, the value of R influences the performance greatly. From Algorithm 5 we can see that in CUDAMI totally R × R threads are used. This means a larger number of threads are used with a larger number of bins. Therefore CUDAMI can work more efficiently with larger number of bins R, which explains Observation 2. We also have compared the runtime performance of CUDAMI to the MPI based TINGe software [23] running with MPI installed on an Intel QuadCore i7920 and eight MPI processes for the datasets shown in Table 1. For all tested datasets except the nasc2048_2996 dataset, MIBE is able to outperform TINGe on the same hardware. Thus, we have decided only to include a runtime comparison between CUDAMI and MIBE in this paper.
Figure 6 shows the CUDAMI runtime pro le for dataset nne4096_911(created using the CUDA profiler). It can be seen that the Kernel 4 ("joint_matrix_mult") takes 61.39% of the total runtime. Kernel 2 ("scan_data") occupies 31:93% of the total runtime. As Kernel 2 is optional, the runtime for the CUDAMI can be further reduced if Kernel 2 is bypassed.
Figure 6. Pie chart of CUDAMI kernel functions using the visual profiler for dataset nne4096_911.
Further profiling of CUDAMI is performed with fixed spline order 3 and different bin numbers, 10, 15, 20. The bin number determines the number of threads used in Kernel 4 and therefore influences efficiency and complexity. The results in Table 4 show that the GPU runtime for Kernel 4 takes 61.39%, 71.9%, 79.38% of the total runtime for different bin number. The amount of floating point operations for the matrix multiplication in Kernel 4 is N^{2 }× M × R^{2}. The GFlops for matrix multiplication in Kernel 4 can then be computed as N^{2 }× M × R^{2}/t where t is the Kernel 4 runtime in Table 4. It can be seen that the GFlops for matrix multiplication is increased by using a larger bin number. This is because CUDAMI works more efficiently with larger number of bins.
Table 4. GFlops and GPU runtime (in Sec.) of Kernel 4 for dataset nne4096_911 with fixed spline order and variable bin number.
We have used the SynTReN [25] software to generate synthetic datasets from known underlying gene networks. Three synthetic networks are generated as shown in Table 5. The networks consist of the same number of genes (250) and a variable number of experiments (500, 900, and 1200). In order to use CUDAMI to infer GRNs, we convert its output into an adjacency matrix. The adjacency matrix is then used by ARACNE [14] to infer GRNs. We call this method "CARACNE". We have used the same parameters for ARACNE and TINGe for inferring GRNs, i.e., the Pvalue for MI threshold is 0.00001 and DPI tolerance is 0.01. TINGe runs on 4 cores using 8 MPI processes. As the underlying network structure is known, we can compute the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) by comparing the output with the known network. In this paper, TP stands for the correctly inferred edges, TN stands for the correctly removed edges, FP represents wrongly added edges, and FN refers to wrongly removed edges. We have calculated sensitivity, specificity and precision as follows: specificity , sensitivity , and precision .
Table 5. Comparison of ARACNE, TINGe and CARACNE using synthetic networks.
The results for inferring GRNs using ARACNE, TINGe, and CARACNE are shown in Table 5. From Table 5 we can see that CARACNE achieves the best performance in terms of both runtime and quality of inferred GRNs.
Conclusions
In this paper we have proposed a CUDAbased parallel algorithm  CUDAMI for accelerating MI estimation using the Bspline function. In order to exploit the GPU's capabilities to accelerate MI estimation, we have used the fast shared memory, finegrained parallelism, and partitioning to implement our algorithm. Our implementation achieves speedups up to 82 compared to the multithreaded MIBE on a modern Intel quadcore. This result indicates that CUDAenabled architectures are a highly efficient hardware platform for this type computation. We also have used the output of CUDAMI to infer GRNs. Our experiments show that compared to ARACNE and TINGe, CUDAMI can achieve better performance in terms of both runtime and inferred GRNs' quality for synthetic datasets.
Availability and requirements
• Project name: CUDAMI
• Project home page: https://sites.google.com/site/liuweiguohome/cudami webcite
• Operating System: Linux
• Programming language: CUDA and C
• Other requirements: CUDA SDK and Toolkits 3.0 or higher, CUDAenabled GPU with at least 3 G memory.
• License: none
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
HS conceptualized the study, carried out the design and implementation of the algorithm, performed benchmark tests, analyzed the results and drafted the manuscript; BS and WL conceptualized the study, participated in the algorithm optimization and analysis of the results and contributed to the revising of the manuscript; WMW conceptualized the study and contributed to the revising of the manuscript. All authors read and approved the final manuscript.
References

Fraser AM, Swinney HL: Independent coordinates for strange attractors from mutual information.

Pluim JPW, Maintz JBA, Viergever MA: Mutualinformationbased registration of medical images: a survey.
IEEE Transactions on Medical Imaging 2003, 22:9861004. PubMed Abstract  Publisher Full Text

Tebmann M, Eisenacher C, Enders F, Stamminger M, Hastreiter P: GPU accelerated normalized mutual information and Bspline transformation. In Eurographics Workshop on Visual Computing for Biomedicine. Eurographics Association; 2008:117124.

Arsic I, Thiran JP: Mutual information eigenlips for audiovisual speech recognition.

Zhou X, Wang X, Dougherty ER: Construction of genomic networks using mutualinformation clustering and reversiblejump MarkovchainMonteCarlo predictor design.
Signal Processing 2003, 83:745761. Publisher Full Text

Zhou X, Wang X, Dougherty ER, Russ D, Suh E: Gene Clustering Based on Clusterwide Mutual Information.
Journal of Computational Biology 2004, 11:147161. PubMed Abstract  Publisher Full Text

Daub CO, Steuer R, Selbig J, Kloska S: Estimating mutual information using Bspline functionsan improved similarity measure for analysing gene expression data.
2004., 5

Zola J, Aluru M, Aluru S: Parallel information theory based construction of gene regulatory networks.

Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements.

Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks.
Bioinformatics 2005, 21(6):754764. PubMed Abstract  Publisher Full Text

D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Mining the gene expression matrix: Inferring gene relationshops from large scale gene expression data.
Second International Workshop on Information Processing in Cells and Tissues 1998, 203212.

Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data.
Journal of Computational Biology 2000, 7:601620. PubMed Abstract  Publisher Full Text

Chen X, Chen M, Ning K: BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network.

Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context.

Wilson J, Dai M, Jakupovic E, Watson S, Meng F: Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problems.

Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: A unified graphics and computing architecture.

Nvidia: NvidiaFermiArchitecture. [http://www.nvidia.com/object/fermi_architecture.html] webcite

Manavski SA, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment.

Schatz MC, Trapnell C, Delcher AL, Varshney A: Highthroughput sequence alignment using graphics processing units.

Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing SmithWaterman sequence database searches for CUDAenabled graphics processing units.

Liu Y, Schmidt B, Maskell DL: CUDASW++2.0: enhanced SmithWaterman protein database search on CUDAenabled GPUs based on SIMT and virtualized SIMD abstractions.

Liu W, Schmidt B, Voss G, MüllerWittig W: Accelerating Molecular Dynamics simulations using Graphics Processing Units with CUDA.
Computer Physics Communications 2008, 179:634641. Publisher Full Text

Zola J, Aluru M, Sarje A, Aluru S: Parallel Information Theory Based Construction of Genomewide Gene Regulatory Networks.
IEEE Transactions on Parallel and Distributed Systems 2010, 21:17211733.

CUDA N: NVIDIA CUDA C Programming Guide Version 3.1.1.
2010.

den Bulcke TV, Leemput KV, Naudts B, van Remortel P, Ma H, Verschoren A, Moor BD, Marchal K: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms.