Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3 Canada

Department of Mathematics and Statistics, 50 Stone Road East, Guelph, Ontario, N1G 2W1 Canada

Abstract

Background

It is desirable in genomic studies to select biomarkers that differentiate between normal and diseased populations based on related data sets from different platforms, including microarray expression and proteomic data. Most recently developed integration methods focus on correlation analyses between gene and protein expression profiles. The correlation methods select biomarkers with concordant behavior across two platforms but do not directly select differentially expressed biomarkers. Other integration methods have been proposed to combine statistical evidence in terms of ranks and p-values, but they do not account for the dependency relationships among the data across platforms.

Results

In this paper, we propose an integration method to perform hypothesis testing and biomarker selection based on multi-platform data sets observed from normal and diseased populations. The types of test statistics can vary across the platforms and their marginal distributions can be different. The observed test statistics are aggregated across different data platforms in a weighted scheme, where the weights take into account the different variabilities of the test statistics. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations.

Conclusion

In both simulation studies and real biological data analyses, our proposed method of multi-platform integration has better control over false discovery rates and higher positive selection rates than the uncombined method. The proposed method is also shown to be more powerful than the rank aggregation method.

Background

In gene expression experiments, the expression levels of thousands of genes are simultaneously monitored to study the underlying biological process. In proteomic data, the protein levels or protein counts are measured for thousands of genes simultaneously. In addition, there are other types of genomic data with different sizes, formats and structures. Each distinct data type, such as gene expression, protein counts, or single nucleotide polymorphisms, provides potentially valuable and complementary information regarding the involvement of a given gene in a biological process. Many biomarkers that play important roles in biological processes behave differently in treatment versus control groups; this phenomenon can be observed consistently across various data platforms. Therefore, integrating related data sets from different sources is crucial to correctly identify the significant underlying biomarkers. Integrative analysis of multiple data types would improve the identification of biomarkers of clinical end points

The practice of combining different data sources to perform classification analysis has been considered in the literature. Efforts to integrate data and improve classification accuracy are widely seen in recent studies

The problem of how to reliably combine data from different experiment platforms to identify significant biomarkers has recently received considerable attention in the bioinformatics literature. The rank aggregation method

The proposed method is similar in spirit to a meta-analysis. Both methods combine statistical evidence across multiple data sets. However, in meta-analysis different data sets are based on the same type of experiments or observational studies, and therefore the measurements are the same variables. Across different data sets, the quality of the data may vary. The goal of meta-analysis is to fully utilize all the information from different data sets and construct a weighted estimate of the effect size. Different weighting schemes are available depending on the statistical models

Methods

The aim of our multi-platform integration method is to select a set of significant biomarkers that are involved in a biological process and thus behave differently in the treatment group and the control group. To combine statistical evidence across platforms, our method requires that analogous hypotheses, based on the features being measured, are formulated for each platform. Each analogous null hypothesis specifies the unrelatedness of the biomarker in that particular experimental setting, and together they imply the unrelatedness of the biomarker to the biological process being investigated. Based on the set of Q analogous hypotheses for Q data sources, we construct Q corresponding test statistics, one for each type of data. The test statistics can be different and tailored to the specific experimental settings. For example, if the microarray experiment has a multifactorial design, the appropriate test statistic can be an F statistic based on an ANOVA test. If the proteomics experiment generates count data for diseased versus normal groups, the appropriate test statistic can be the nonparametric Wilcoxon rank sum statistic. A vector of observed statistics across the platforms is thus obtained. We then randomly permute the samples across the diseased and control groups, permuting the measurements from all platforms. In this way, we obtain an empirical null distribution of the vector of test statistics. In order to pool the randomized values of the statistics across the biomarkers to form the empirical null distribution, we assume data from different biomarkers are independent or have an exchangeable correlation structure. For the validity of the randomization procedure, we assume an exchangeable covariance structure for the measurements within each platform. Finally, we construct a weighted sum of the test statistics across different platforms, with the weights being the inverse of the empirical standard deviation of each statistic, and we determine a set of significant biomarkers based on the aggregated test statistic.

In the following, we demonstrate our method by integrating microarray expression data and proteomic data as an example. We consider two experiments, the first having microarray expression data measured on $n_{1}$ diseased samples and $n_{2}$ control samples and the second having proteomic data measured on $m_{1}$ diseased samples and $m_{2}$ control samples. The objective is to find biomarkers significantly involved in disease development.

Step 1): Define two analogous null hypotheses. For microarray data, the null hypothesis would be $H_{01}$: the gene’s mRNA level is the same in diseased and normal populations; for proteomic data, the null hypothesis would be $H_{02}$: the protein level is the same in diseased and normal populations.

Step 2): Based on the hypotheses, construct two test statistics, $T_{m}$ and $T_{p}$, tailored to each type of data. Consequently, we obtain a vector of two observed statistics $(T_{m},T_{p})^{\prime}$ across the two data platforms. The test statistics can be of any type as long as they summarize information from the data and can be used to assess the statistical significance of the data with respect to the hypotheses. Let $X_{1}$ denote the $n_{1}$ gene expression measurements in the disease group, $X_{2}$ the $n_{2}$ gene expression measurements in the control group, $Y_{1}$ the $m_{1}$ protein measurements in the disease group and $Y_{2}$ the $m_{2}$ protein measurements in the control group,

$$T_{m}=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{S_{X_{1}}^{2}/n_{1}+S_{X_{2}}^{2}/n_{2}}}$$

and

$$T_{p}=\frac{\bar{Y}_{1}-\bar{Y}_{2}}{\sqrt{S_{Y_{1}}^{2}/m_{1}+S_{Y_{2}}^{2}/m_{2}}},$$

where $S^{2}$ denotes the sample variance. The test statistics should be formulated so that a larger test statistic in the positive direction indicates more evidence towards the alternative hypotheses. For example, if Student’s t-statistic is used, then a one-sided alternative hypothesis corresponds to a one-sided t-statistic, whereas the two-sided alternative leads to the absolute value of the t-statistic. Consider $(T_{mi},T_{pi})^{\prime}$,

Step 3): The samples are randomly permuted across diseased and control groups. If the same sample is measured on different platforms, all of its measurements from the different platforms are permuted simultaneously. The simultaneous permutation preserves the dependency relationship among the measurements from different platforms. Based on the random permutations, we obtain an empirical null distribution of the vector $(T_{m},T_{p})^{\prime}$.
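As an illustration of the simultaneous permutation in Step 3, the following Python sketch builds a pooled empirical null distribution of the two statistics. It assumes, for simplicity, that both platforms are measured on the same samples (stored as columns), that NumPy is available, and that an unequal-variance two-sample t-statistic is used on each platform. The function names `two_sample_t` and `permutation_null` are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def two_sample_t(data, is_diseased):
    """Row-wise two-sample t-statistics (diseased minus control).

    data: array of shape (n_biomarkers, n_samples).
    is_diseased: boolean array of length n_samples.
    """
    a, b = data[:, is_diseased], data[:, ~is_diseased]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def permutation_null(expr, prot, is_diseased, n_perm=100, seed=0):
    """Pooled null values of (T_m, T_p), shape (n_perm * n_biomarkers, 2)."""
    is_diseased = np.asarray(is_diseased, dtype=bool)
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        # One relabelling per permutation, shared by both platforms.
        shuffled = rng.permutation(is_diseased)
        null.append(np.column_stack([two_sample_t(expr, shuffled),
                                     two_sample_t(prot, shuffled)]))
    # Pool the permuted statistics across biomarkers.
    return np.vstack(null)
```

Because each permutation applies one relabelling to both platforms, the two statistics of a biomarker stay paired, which is how the sketch keeps the cross-platform dependence in the null distribution.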

Step 4): The aggregated test statistic will be:

$$T_{A}=\frac{T_{m}}{\hat{\sigma}_{m}}+\frac{T_{p}}{\hat{\sigma}_{p}},$$

where $\hat{\sigma}_{m}$ and $\hat{\sigma}_{p}$ are the empirical standard deviations of $T_{m}$ and $T_{p}$ based on the empirical null distribution, and $T_{m}$ and $T_{p}$ are the observed t-statistics or the absolute values of the t-statistics, depending on the direction of the alternative hypotheses. At significance level $\alpha$, we find the threshold $C_{\alpha}$ such that $C_{\alpha}$ is the $100(1-\alpha)$th percentile of $T_{A}$, which can be obtained from the empirical null distribution. Construct a decision line that separates selected significant biomarkers and nonsignificant biomarkers. The resulting separation line is:

$$\frac{T_{m}}{\hat{\sigma}_{m}}+\frac{T_{p}}{\hat{\sigma}_{p}}=C_{\alpha}.$$

All the biomarkers with $(T_{m},T_{p})$ above the separation line will be declared as significantly involved in the disease development.
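The decision line of Step 4 can be sketched in Python as follows. The sketch assumes the permutation-null values of the two statistics are already available as paired NumPy arrays, and it rewrites the line in slope-intercept form for plotting on the plane of the two statistics. The names `decision_line` and `above_line` are hypothetical.

```python
import numpy as np

def decision_line(null_tm, null_tp, alpha=0.05):
    """Return (intercept, slope) of the line T_p = intercept + slope * T_m.

    null_tm, null_tp: paired permutation-null values of the two statistics.
    """
    sigma_m = np.std(null_tm, ddof=1)
    sigma_p = np.std(null_tp, ddof=1)
    # Aggregating paired null values keeps their dependence in the threshold.
    null_ta = null_tm / sigma_m + null_tp / sigma_p
    c_alpha = np.quantile(null_ta, 1 - alpha)
    # T_m/sigma_m + T_p/sigma_p = c_alpha, solved for T_p.
    return sigma_p * c_alpha, -sigma_p / sigma_m

def above_line(tm, tp, intercept, slope):
    """True where the observed pair lies above the decision line."""
    return tp > intercept + slope * tm
```

A biomarker that is moderately extreme on both platforms can fall above this line even when it would not pass either single-platform threshold, which is the intended gain from aggregation.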

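In the more general case, suppose we have Q data platforms with the observed test statistics $(T_{1},\ldots,T_{Q})^{\prime}$. From random permutation, we obtain the joint empirical distribution of this vector of test statistics under the global null hypothesis. Let $\hat{\sigma}_{q}$ denote the empirical standard deviation of $T_{q}$ under this null distribution, $q=1,\ldots,Q$, and let

$$T_{A}=\sum_{q=1}^{Q}\frac{T_{q}}{\hat{\sigma}_{q}}.$$

The resulting critical region will take the form:

$$T_{A}>C_{\alpha},$$

where $C_{\alpha}$ is the $100(1-\alpha)$th percentile of $T_{A}$. Any biomarker with $T_{A}>C_{\alpha}$ will be selected as behaving significantly differently between the diseased group and control group.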
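For the general case with Q platforms, the same computation can be written with matrix operations. This is a minimal sketch under the assumption that the observed statistics are stored as a biomarker-by-platform array and the pooled permutation-null statistics as a second array with the same number of columns; `aggregate_and_select` is a hypothetical name.

```python
import numpy as np

def aggregate_and_select(t_obs, t_null, alpha=0.05):
    """Weighted aggregation over Q platforms.

    t_obs:  (n_biomarkers, Q) observed statistics, already oriented so that
            larger values support the alternative.
    t_null: (n_null, Q) permutation-null statistics, rows kept paired.
    Returns the aggregated statistics, the threshold and the selection mask.
    """
    sigma = t_null.std(axis=0, ddof=1)   # one empirical SD per platform
    weights = 1.0 / sigma                # noisier platforms get less weight
    ta_obs = t_obs @ weights
    ta_null = t_null @ weights
    c_alpha = np.quantile(ta_null, 1 - alpha)
    return ta_obs, c_alpha, ta_obs > c_alpha
```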

Our method aggregates actual values of the test statistics across different data platforms, which preserves more information compared to the rank aggregation method. Moreover, our method assigns different weights to each data set according to the variability of the test statistics: the larger the variation in the test statistic, the smaller the weight assigned to it, and vice versa. The threshold $C_{\alpha}$ is determined based on the empirical null distribution of the aggregated test statistics, which implicitly takes into account the dependency relationships among the test statistics. Furthermore, our method can deal with different data types and formats generated by various experimental settings.

There are two major ways to perform the multiplicity adjustment. The first is the Bonferroni correction. If we wish to control the familywise type I error rate at $\alpha^{*}$, then the individual level $\alpha^{*}$/

Different platforms can be used to test different sub-hypotheses. All of these sub-hypotheses should be concordant in supporting the overall biological hypothesis. For example, the involvement of a gene in disease development can be supported by both mRNA expression level changes and proteomic level changes. In most cases, changes in measurements from different platforms are expected to occur in the same direction. However, our method is also applicable even if the changes are in different directions, as long as the statistical evidence from both sources can be combined. For example, consider $H_{10}$: mRNA is increasing in normal group; $H_{20}$: antibody count is decreasing in normal group. Even though the actual measurements from the two platforms are negatively correlated, we can construct the test statistics $T_{1}$ and $T_{2}$ so that positive values of the statistics support the alternative hypotheses and the weighted average can be used as combined evidence of the involvement of the biomarker in the process.
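One simple way to construct statistics whose positive values support the alternative, regardless of the direction of change on each platform, is to orient each raw t-statistic before aggregation. The helper below is a hypothetical sketch of that idea, not a routine from the original analysis.

```python
import numpy as np

def orient(t, alternative):
    """Orient raw t-statistics so that larger values support the alternative.

    alternative: "greater" keeps the sign, "less" flips it,
                 "two-sided" takes the absolute value.
    """
    t = np.asarray(t, dtype=float)
    if alternative == "greater":
        return t
    if alternative == "less":
        return -t
    if alternative == "two-sided":
        return np.abs(t)
    raise ValueError(f"unknown alternative: {alternative!r}")
```

For two platforms expected to change in opposite directions, one statistic would be passed with `"greater"` and the other with `"less"`, so that both contribute positively to the weighted sum.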

Results

Results on simulated data

In this section, we examine the performance of our proposed method by examining its positive selection rates and false discovery rates under various testing scenarios. We simulate data sets from … For each data set, we assume that … $X_{qi1}$ denotes data from the control group with mean $\mu_{qi1}$ and $X_{qi2}$ denotes data from the diseased group with mean $\mu_{qi2}$. The total number of biomarkers is set to be … $\mu_{qi1} \neq \mu_{qi2}$. The number … $\mu_{qi1} = 0 \times 1_{m}$, $\mu_{qi2} = e \times 1_{m}$, where … $X_{qi}$ is generated from a Poisson(…) … $\lambda_{qi1}$ = … for the control group and $\lambda_{qi2} = \lambda_{qi1}$ + … $g_{1}$ = 100 and $g_{2}$ = 100. Each group is assigned a different effect size

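To compare our multi-platform integration method with the individual platform analysis method, the positive selection rate (PSR) and false discovery rate (FDR) are calculated to assess the performance of each method for selecting the differentially expressed biomarkers:

$$\mathrm{PSR}=\frac{\text{number of correctly selected differentially expressed biomarkers}}{\text{total number of differentially expressed biomarkers}}$$

and

$$\mathrm{FDR}=\frac{\text{number of falsely selected biomarkers}}{\text{total number of selected biomarkers}}.$$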
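In a simulation where the truly differentially expressed biomarkers are known, the two performance measures can be computed as in the sketch below. It assumes the usual definitions, namely PSR as the fraction of truly differentially expressed biomarkers that are selected and FDR as the fraction of selected biomarkers that are not truly differentially expressed; `psr_fdr` is a hypothetical helper name.

```python
import numpy as np

def psr_fdr(selected, truly_de):
    """Positive selection rate and false discovery rate for one data set.

    selected, truly_de: boolean arrays over biomarkers.
    Assumes at least one biomarker is truly differentially expressed.
    """
    selected = np.asarray(selected, dtype=bool)
    truly_de = np.asarray(truly_de, dtype=bool)
    true_pos = np.sum(selected & truly_de)
    psr = true_pos / truly_de.sum()
    # FDR is taken as 0 when nothing is selected.
    fdr = (selected.sum() - true_pos) / max(selected.sum(), 1)
    return psr, fdr
```

Averaging these two values over repeated simulated data sets gives summary rates of the kind reported in the simulation tables.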

Tables

| **Methods** | **Alternative** | **Experiment1** | **Experiment2** | **Measure** | **Multi-platform** | **1st individual** | **2nd individual** |
|---|---|---|---|---|---|---|---|
| Scenario 1: _{1} + _{2} = 200 | Right-side | e = 0.5 for $g_{1}$ = 100; e = 2 for $g_{2}$ = 100 | e = 1.5 for $g_{1}$ = 100; e = 1 for $g_{2}$ = 100 | PSR | 0.7895 (0.0007) | 0.5372 (0.0007) | 0.5588 (0.0010) |
| | | | | FDR | 0.1907 (0.0007) | 0.2680 (0.0013) | 0.2600 (0.0009) |
| | Left-side | e = -0.5 for $g_{1}$ = 100; e = -2 for $g_{2}$ = 100 | e = -1.5 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.7908 (0.0006) | 0.5330 (0.0006) | 0.5556 (0.0012) |
| | | | | FDR | 0.1891 (0.0006) | 0.2673 (0.0009) | 0.2649 (0.0011) |
| | Two-sided | e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100 | e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.6988 (0.0011) | 0.4113 (0.0011) | 0.5403 (0.0010) |
| | | | | FDR | 0.2145 (0.0007) | 0.3202 (0.0016) | 0.2694 (0.0012) |
| Scenario 2: _{1} + _{2} = 200 | Right-side | e = 0.5 for $g_{1}$ = 100; e = 2 for $g_{2}$ = 100 | e = 1.5 for $g_{1}$ = 100; e = 1 for $g_{2}$ = 100 | PSR | 0.9405 (0.0003) | 0.6319 (0.0005) | 0.7819 (0.0007) |
| | | | | FDR | 0.1560 (0.0005) | 0.2410 (0.0009) | 0.2051 (0.0007) |
| | Left-side | e = -0.5 for $g_{1}$ = 100; e = -2 for $g_{2}$ = 100 | e = -1.5 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.9400 (0.0002) | 0.6316 (0.0004) | 0.7871 (0.0006) |
| | | | | FDR | 0.1605 (0.0005) | 0.2419 (0.0007) | 0.2024 (0.0006) |
| | Two-sided | e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100 | e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.9377 (0.0003) | 0.6670 (0.0010) | 0.7327 (0.0007) |
| | | | | FDR | 0.1622 (0.0005) | 0.2270 (0.0009) | 0.2122 (0.0007) |

| **Method** | **Measure** | **Multi-plat** | **1st ind.** | **2nd ind.** | **3rd ind.** | **4th ind.** | **5th ind.** |
|---|---|---|---|---|---|---|---|
| Scenario 1: _{1} + _{2} = 200. Exp1: e = 1.5 for g = 200. Exp2: e = 1.5 for $g_{1}$ = 100; e = 1 for $g_{2}$ = 100. Exp3: e = -0.5 for $g_{1}$ = 100; e = -2 for $g_{2}$ = 100. Exp4: e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100. Exp5: e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.9517 (0.0002) | 0.5601 (0.0012) | 0.4130 (0.0011) | 0.4464 (0.0004) | 0.4213 (0.0010) | 0.4471 (0.0005) |
| | FDR | 0.1572 (0.0004) | 0.2605 (0.0011) | 0.3299 (0.0018) | 0.3108 (0.0009) | 0.3205 (0.0010) | 0.2727 (0.0010) |
| Scenario 2: _{1} + _{2} = 200. Exp1: e = 1.5 for g = 200. Exp2: e = 1.5 for $g_{1}$ = 100; e = 1 for $g_{2}$ = 100. Exp3: e = -0.5 for $g_{1}$ = 100; e = -2 for $g_{2}$ = 100. Exp4: e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100. Exp5: e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.9998 (2.7e-06) | 0.8360 (0.0006) | 0.6655 (0.0010) | 0.5682 (0.0004) | 0.6712 (0.0010) | 0.5699 (0.0008) |
| | FDR | 0.1281 (0.0004) | 0.1898 (0.0006) | 0.2217 (0.0009) | 0.2593 (0.0007) | 0.2314 (0.0007) | 0.2093 (0.0008) |

| **Methods** | **Measure** | **Multi-platform** | **1st individual** | **2nd individual** |
|---|---|---|---|---|
| Experiment1: Continuous; $g_{1}$ = 100; e = 2 for $g_{2}$ = 100. Experiment2: Discrete; _{qn1} = 5, e = 3 for g = 200 | PSR | 0.7356 (0.0008) | 0.5327 (0.0004) | 0.5228 (0.0012) |
| | FDR | 0.1967 (0.0008) | 0.2702 (0.0012) | 0.2763 (0.0012) |

Figure


**Decision lines for comparing methods.** Vertical lines use data from the first individual platform, horizontal lines use data from the second individual platform, and dashed lines use our multi-platform integration method. Circles represent non-differentially expressed biomarkers and triangles represent differentially expressed biomarkers. Plots are based on one simulated data set and 100 permutations.

As we examine a large number of biomarkers, we need to investigate the control of the false discovery rate of the proposed method with regard to multiple hypothesis testing

| **Methods** | **α** | **0.05** | **0.01** | **0.005** |
|---|---|---|---|---|
| | | **40** | **8** | **4** |
| multi-platform | | 224 (6.5547) | 165 (6.0820) | 143 (5.5202) |
| | | 44.8125 (7.3348) | 8.0250 (3.4778) | 3.8375 (2.263) |
| | | 0.1563 (0.0219) | 0.0386 (0.0161) | 0.0214 (0.0125) |
| | | 0.1428 (0.0041) | 0.0388 (0.0014) | 0.0225 (0.0009) |
| 1st individual | | 165 (8.8797) | 107 (5.3066) | 91 (4.9031) |
| | | 50.5125 (8.9101) | 9.9000 (3.4982) | 4.6500 (2.1766) |
| | | 0.2431 (0.0326) | 0.0736 (0.0246) | 0.0406 (0.0183) |
| | | 0.1940 (0.0103) | 0.0600 (0.0030) | 0.0353 (0.0019) |
| 2nd individual | | 197 (7.2442) | 106 (8.2303) | 79 (6.3222) |
| | | 48.9250 (7.1862) | 9.6000 (3.5750) | 5.000 (2.5376) |
| | | 0.1986 (0.0245) | 0.0721 (0.0258) | 0.0506 (0.0251) |
| | | 0.1630 (0.0060) | 0.0607 (0.0048) | 0.0408 (0.0033) |

Results on real data

In this section, we apply our method to data from a study of growth and stationary phase adaptation in … ™)-derived shotgun proteomic data and DNA microarray transcriptome data. To study different growth stages of … ™ system can only analyze four distinct samples in a single experiment, the eight protein samples were distributed across three runs of mass spectrometric (MS) analysis. The protein sample from 11 h was run in three MS experiments, so it serves as a reference. Therefore, protein abundance ratios … ™ software and the inbuilt Paragon™ search engine. Only proteins identified with ≥ 99

For microarray data, total mRNA from the same eight time point samples was isolated and a spotted DNA microarray experiment was conducted. Hybridization was performed using genomic DNA (gDNA) as a reference. The mRNA abundance was obtained as $\log_{2}$[cDNA/gDNA]. To be consistent with the protein data, mRNA abundance data from different samples were processed to calculate $\log_{2}$[cDNA$_{i}$/cDNA$_{7\mathrm{hr}}$] for each sample with respect to the first time point sample. Only genes with both expression and protein values (894 genes) were analyzed. To deal with missing values, we deleted genes that had no mRNA values at all or had at least five missing values in the protein data set. The remaining missing values were imputed using the R package MICE. In total, the number of genes suitable for the subsequent integrative analysis was 886. Based on the growth curve, the time points were divided into two groups: those from 7, 11, 14 and 16 h represented the growth phase and those from 22, 26, 34 and 38 h represented the stationary phase.

The objective of our analysis is now to select the biomarkers that are differentially expressed between the two phases, and we apply our multi-platform integration method to identify them. For the mRNA data, we formulate the null hypothesis as $H_{0}$: the mRNA expression level is the same between the two phases. Similarly, for the protein data, the null hypothesis is formulated as $H_{0}$: the protein ratio is the same between the two phases. For both mRNA data and protein data, two-sided alternatives are considered in the analysis. For each platform, we use Student’s t-statistics to summarize the statistical evidence, which are denoted as $T_{m}$ and $T_{p}$. To obtain the multivariate null distribution, 100 permutations are conducted. The overall correlation between $T_{m}$ and $T_{p}$ is 0.2787. The variances of $T_{m}$ and $T_{p}$ are 3.0489 and 3.6411, respectively. Based on the decision line constructed at the significance level


**Decision lines for real data.** Vertical lines use the mRNA data, horizontal lines use the protein data, and dashed lines use our multi-platform integration method.

Nine differentially expressed genes are identified by our method but not by the other two methods. Among these, we identify biosynthetic enzymes (SCO5080 actVA5, SCO5072 actVIORF1) involved in actinorhodin production. These genes are up-regulated only at late stages of the culture and produce antibiotics during the stationary phase. Expression of two genes encoding malate oxidoreductase (SCO2951) and translation elongation factor G (SCO4661) has been found to be depressed during the stationary phase compared with the growth phase

| **SCO** | **Sanger abbreviation** | **Sanger annotation** | **Sanger category** | **Sanger subcategory** | **TIGR category** | **Related paper*** |
|---|---|---|---|---|---|---|
| SCO1958 | uvrA | ABC excision nuclease subunit A | Macromolecule metabolism | DNA-replication, repair, restr./modific’n | excinuclease ABC, A subunit | |
| SCO2940 | other | putative oxidoreductase | Not classified (included putative assignments) | Not classified (included putative assignments) | xanthine dehydrogenase, putative | |
| SCO2951 | other | putative malate oxidoreductase | Central intermediary metabolisms | Other central intermediary metabolism | malate oxidoreductase | |
| SCO3094 | other | conserved hypothetical protein | hypothetical protein | Conserved in organism other than Escherichia coli | conserved hypothetical protein | |
| SCO4661 | fusA | elongation factor G | Macromolecule metabolism | Proteins - translation and modification | translation elongation factor G | |
| SCO5072 | actVIORF1 | hydroxylacyl-CoA dehydrogenase | Secondary metabolism | PKS | hydroxylacyl-CoA dehydrogenase | |
| SCO5080 | actVA5 | putative hydrolase | Secondary metabolism | PKS | putative hydrolase | |
| SCO6219 | Other | putative ATP/GTP binding protein, putative serine | Protein kinases | Serine/threonine | | |
| SCO6222 | other | putative aminotransferase | Not classified (included putative assignments) | Not classified (included putative assignments) | aminotransferase, class I | |

Discussion

An ongoing problem in proteomics is that extremely small sample sizes often occur, largely due to biological reasons. To investigate the performance of our method in such situations, we consider a case for each platform wherein the control and the diseased groups each have only two measurements. Our method is applied and the simulation results shown in Table

| **Method** | **Measure** | **Multi-plat** | **1st ind.** | **2nd ind.** |
|---|---|---|---|---|
| Scenario 1: Extremely small sample size, two measurements from each group | PSR | 0.3022 (0.0009) | 0.2363 (0.0006) | 0.2179 (0.0007) |
| | FDR | 0.3782 (0.0023) | 0.4436 (0.0025) | 0.4694 (0.0027) |
| Scenario 2: Correlation among platforms set to 0.5; disease and normal groups are independent | PSR | 0.6689 (0.0009) | 0.5365 (0.0008) | 0.5578 (0.0011) |
| | FDR | 0.2255 (0.0008) | 0.2690 (0.0010) | 0.2641 (0.0010) |
| Scenario 3: Non-standardized version of $T_{m}$ and $T_{p}$ | PSR | 0.8142 (0.0009) | 0.5479 (0.0005) | 0.5992 (0.0010) |
| | FDR | 0.1586 (0.0006) | 0.2358 (0.0011) | 0.2235 (0.0010) |

We also consider the situation in which data on the same biomarker from

The proposed method allows different ways of constructing $T_{m}$ and $T_{p}$ as long as they provide summarized statistical evidence for that platform. The Student’s … $T_{m}$ and $T_{p}$ and form a weighted linear sum statistic. To compare the empirical performance of the standardized versus unstandardized versions, we conduct simulations under the setting 1 of Table … $T_{m}$ and $T_{p}$ has a slightly higher PSR and a slightly lower FDR.

An alternative way of combining test statistics across different platforms is to form a multivariate quadratic statistic. Given two platforms, for example, we consider an alternative test statistic

$$T_{\mathrm{quad}}=(T_{m},T_{p})\,\hat{\Sigma}^{-1}\,(T_{m},T_{p})^{\prime},$$

where $\hat{\Sigma}$ is the covariance matrix of $(T_{m},T_{p})$ obtained from the empirical null distribution. Such a multivariate statistic can be used to test the overall null hypothesis against two-sided alternatives, while the weighted linear statistic that we propose can be used to test one-sided alternatives or two-sided alternatives. Thus, our method is more broadly applicable. We further conduct simulations to compare the multivariate quadratic form with our proposed weighted linear statistic for two-sided tests under the setting of scenario 2, Table

| **Method** | **Measure** | **Multi-plat** | **Quadratic** |
|---|---|---|---|
| Quadratic: Exp1: e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100. Exp2: e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 0.9377 (0.0003) | 0.9155 (0.0004) |
| | FDR | 0.1622 (0.0005) | 0.1804 (0.0005) |
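The quadratic alternative discussed in this section can be sketched in Python as follows. The sketch assumes signed statistics, a covariance matrix estimated from pooled permutation-null values, no centering of the statistics, and a threshold taken from the empirical null distribution of the quadratic form itself; `quadratic_select` is a hypothetical name and these choices are illustrative rather than a record of the original implementation.

```python
import numpy as np

def quadratic_select(t_obs, t_null, alpha=0.05):
    """Select biomarkers with a quadratic form t' Sigma^{-1} t.

    t_obs:  (n_biomarkers, Q) observed signed statistics.
    t_null: (n_null, Q) permutation-null statistics, rows kept paired.
    """
    cov_inv = np.linalg.inv(np.cov(t_null, rowvar=False))

    def quad(t):
        # Row-wise quadratic form: sum over j, k of t_ij * A_jk * t_ik.
        return np.einsum("ij,jk,ik->i", t, cov_inv, t)

    threshold = np.quantile(quad(t_null), 1 - alpha)
    return quad(t_obs) > threshold
```

Because the quadratic form ignores the sign of each statistic, it is sensitive to departures in any direction, which is why it suits two-sided testing only.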

Finally, we compare our method with the existing robust rank aggregation method

| **Setting:** | **Experiments** | **Measure** | **Multi-plat** | **RRA** |
|---|---|---|---|---|
| 1. _{1} + _{2} = 100 | Exp1: e = 1.5 for g = 200. Exp2: e = 1.5 for $g_{1}$ = 100; e = 1 for $g_{2}$ = 100. Exp3: e = -0.5 for $g_{1}$ = 100; e = -2 for $g_{2}$ = 100. Exp4: e = -1 for $g_{1}$ = 100; e = 1.5 for $g_{2}$ = 100. Exp5: e = 2 for $g_{1}$ = 100; e = -1 for $g_{2}$ = 100 | PSR | 1.000 (1.98e-6) | 0.7497 (0.0012) |
| | | FDR | 0.2803 (0.0011) | 0.0912 (0.0003) |
| 2. _{1} + _{2} = 200 | Exp1: e = 1.5 for g = 100. Exp2: e = 1.5 for $g_{1}$ = 50; e = 1 for $g_{2}$ = 50. Exp3: e = -0.5 for $g_{1}$ = 50; e = -2 for $g_{2}$ = 50. Exp4: e = -1 for $g_{1}$ = 50; e = 1.5 for $g_{2}$ = 50. Exp5: e = 2 for $g_{1}$ = 50; e = -1 for $g_{2}$ = 50 | PSR | 0.9995 (0.23e-06) | 0.4995 (0.0008) |
| | | FDR | 0.1399 (0.0004) | 0.0823 (0.0004) |
| 3. _{1} + _{2} = 400 | Exp1: e = 1.5 for g = 100. Exp2: e = 1.5 for $g_{1}$ = 50; e = 1 for $g_{2}$ = 50. Exp3: e = -0.5 for $g_{1}$ = 50; e = -2 for $g_{2}$ = 50. Exp4: e = -1 for $g_{1}$ = 50; e = 1.5 for $g_{2}$ = 50. Exp5: e = 2 for $g_{1}$ = 50; e = -1 for $g_{2}$ = 50 | PSR | 0.9992 (2.23e-6) | 0.1133 (0.0002) |
| | | FDR | 0.0402 (0.0001) | 0.0796 (0.0015) |

Conclusion

With the advent of various types of genomic technologies, it is imperative to develop a method that can integrate different types of genomic data to solve biological questions. We develop a general framework for data integration across multiple data platforms. For each data set, a test statistic is formed to summarize the statistical evidence toward the specific null hypothesis tailored to the data platform. The types of test statistics can vary and their marginal distributions can be different. The observed test statistics can then be aggregated across different data platforms. The overall decision is based on the empirical distribution of the aggregated statistic obtained through random permutations. Our method can accommodate different experimental designs and various data types across platforms.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SW, XG, YX, XW and ZF developed the algorithm; SW and YX implemented the algorithm; YX, ZF, and XY performed data analysis; and XG supervised the project. All authors read and approved the final manuscript.

Acknowledgements

The authors are grateful to Dr. Lei Nie for his discussion and comments on our project. The authors are very thankful to the editor, associate editor and three referees. Their comments and suggestions led to a much improved manuscript.