Department of Computer Science and Biomedical Informatics, University of Central Greece, Papasiopoulou 2-4, Lamia 35100, Greece

Abstract

Background

Meta-analysis is a popular methodology in several fields of medical research, including genetic association studies. However, the methods used for meta-analysis of association studies that report haplotypes have not been studied in detail. In this work, methods for performing meta-analysis of haplotype association studies are summarized, compared and presented in a unified framework along with an empirical evaluation of the literature.

Results

We present multivariate methods that use summary-based data as well as methods that use binary and count data in a generalized linear mixed model framework (logistic regression, multinomial regression and Poisson regression). The methods presented here avoid the inflation of the type I error rate that can result from the traditional approach of comparing each haplotype against the remaining ones, and they can be fitted using standard software. Moreover, formal global tests are presented for assessing the statistical significance of the overall association. Although the methods presented here assume that the haplotypes are directly observed, they can be easily extended to allow for uncertainty in the inferred haplotypes by weighting the haplotypes by their probability.

Conclusions

An empirical evaluation of the published literature and a comparison against meta-analyses that use single nucleotide polymorphisms suggest that meta-analyses of haplotypes include approximately half as many studies and produce significant results twice as often. We show that this excess of statistically significant results stems from the sub-optimal method of analysis used and that, in approximately half of the cases, the statistical significance is refuted if the data are properly re-analyzed. Illustrative examples of code are given in Stata and it is anticipated that the methods developed in this work will be widely applied in the meta-analysis of haplotype association studies.

Background

The continuously increasing number of published gene-disease association studies has made it imperative to collect and synthesize the available data

Most of the genetic association studies (and hence the meta-analyses derived from them) are performed using single markers, usually Single Nucleotide Polymorphisms (SNPs). However, the SNP that is under investigation is not always the true susceptibility allele. Instead, it may be a polymorphism which is in Linkage Disequilibrium (LD) with the unknown disease-causing locus

A major problem in haplotype analyses is that in order for the analysis to be performed we need to reconstruct or infer the haplotypes, usually with an approach based on missing data imputation

**A graphical representation of the increasing number of published haplotype-association studies**. A search was performed in PubMed using the terms "haplotype" and "association" from 1997 to 2009. Even though the retrieved list may include review articles, methodological papers or even irrelevant works, the trend is obvious, especially after 2003 when the HapMap project was presented. The search was conducted during December 2009 and thus the count for 2009 may be an underestimate.

This work has two primary goals. First, to perform a detailed literature search and an empirical evaluation of the published studies that report meta-analyses of haplotype associations; and second, to present a concise overview of the statistical methods that could and should be used in such meta-analyses. These two important issues were not previously studied in the literature and the findings are interesting. Even though the methods presented in this work could be derived in a straightforward manner from extending previous works on multivariate meta-analysis

Methods

Methods for haplotype association

Let us assume that the markers under study define r possible haplotypes (for m biallelic markers there are at most 2^{m}). In a case-control study, a cross-tabulation of haplotypes by disease status, that ignores the individuals and counts only the total number of haplotypes observed in the analysis, would result in data arranged in the form of a 2 × r contingency table. We denote by π_{j} = P(y = 1 | z_{j} = 1) the underlying risk (i.e. the probability of being a case) of a person carrying a single copy of the j^{th} haplotype. A reasonable choice would be to consider the most common haplotype (i.e. haplotype 1) as the reference category and create r - 1 dummy variables, with z_{j} = 1 for haplotype j and z_{j} = 0 otherwise.

Cross-tabulation of haplotypes by disease status

| Haplotype (z) | Cases (y = 1) | Controls (y = 0) |
|---|---|---|
| 1 | 89 | 183 |
| 2 | 14 | 26 |
| 3 | 24 | 22 |
| 4 | 3 | 3 |

The haplotype data obtained in a case-control study on 182 Caucasian women concerning the association of p53 haplotypes with breast cancer

This model was proposed initially by Wallenstein and co-workers and as we already mentioned, assumes a multiplicative genetic model of inheritance
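As a minimal illustration of what the haplotype coefficients of such a logistic model represent, the following Python sketch computes the log-odds ratio of each haplotype against the most common one directly from the counts of the cross-tabulation above, together with the usual large-sample variance (the sum of the reciprocals of the four counts involved). In a saturated model with one dummy variable per haplotype, the fitted coefficients equal these empirical log-odds ratios. The function name and the data layout are illustrative assumptions and are not taken from the original Stata code.

```python
import math

# Haplotype counts from the p53 / breast cancer cross-tabulation:
# haplotype -> (count in cases, count in controls)
TABLE1 = {1: (89, 183), 2: (14, 26), 3: (24, 22), 4: (3, 3)}


def log_odds_ratios(counts, reference=1):
    """Return {haplotype: (logOR, variance)} against the reference haplotype."""
    ref_cases, ref_controls = counts[reference]
    results = {}
    for hap, (cases, controls) in counts.items():
        if hap == reference:
            continue
        log_or = math.log((cases * ref_controls) / (controls * ref_cases))
        # Large-sample variance: sum of reciprocals of the four cell counts.
        variance = 1 / cases + 1 / controls + 1 / ref_cases + 1 / ref_controls
        results[hap] = (log_or, variance)
    return results


if __name__ == "__main__":
    for hap, (log_or, var) in log_odds_ratios(TABLE1).items():
        se = math.sqrt(var)
        print(f"haplotype {hap} vs 1: OR = {math.exp(log_or):.2f}, "
              f"95% CI = ({math.exp(log_or - 1.96 * se):.2f}, "
              f"{math.exp(log_or + 1.96 * se):.2f})")
```

The sketch applies no continuity correction, so it assumes that every cell count is positive.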

Alternatively, assuming a multinomial sampling scheme where the total sample size is considered fixed, a multinomial logistic regression model would be appropriate, where the different haplotypes would be the dependent variables. This corresponds to the well-known "retrospective likelihood" (i.e. the likelihood based on the probability of exposure given disease status) applicable in case control studies. In this case, the haplotypes are treated as dependent variables and the case/control status as the predictor in a multinomial (polytomous or polychotomous) logistic regression

By observing that the linear predictor becomes:

it is easy to understand that the β_{j} coefficients obtained by fitting the model are estimates of the log-Odds Ratios (i.e. for comparing haplotype j vs. haplotype 1), equivalent to the respective coefficients of the model in Eq. (1). Obviously, β_{1} = 0 for identifiability, since haplotype 1 is used as the reference category. The particular model was first used for haplotype analysis by Chen and Kao
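The correspondence between the multinomial coefficients and the log-odds ratios can be sketched as follows. The notation is an assumption made for illustration: z denotes the haplotype, y the case/control status, and α_j a haplotype-specific intercept; the symbols may differ from those in the original equations.

```latex
\log\frac{P(z = j \mid y)}{P(z = 1 \mid y)} = \alpha_j + \beta_j\, y,
\qquad j = 2, \ldots, r,
\\[6pt]
\beta_j
= \log\frac{P(z = j \mid y = 1)\,/\,P(z = 1 \mid y = 1)}
           {P(z = j \mid y = 0)\,/\,P(z = 1 \mid y = 0)}
```

The second line is obtained by subtracting the linear predictor at y = 0 from the one at y = 1. Its right-hand side is the log-odds ratio of haplotype j against haplotype 1, which is why the retrospective and prospective parameterizations lead to the same estimates.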

Lastly, assuming that the observed counts are realizations of a Poisson random variable, one can fit log-linear models (Poisson regression), where the dependent variable is the counts and thus, the studies, the type of haplotypes and the case/control status are treated as independent variables. Log-linear models are widely used for haplotype analysis, for instance, for detecting LD

This is the standard saturated model for describing the 2 × r contingency table. The β_{j}'s are the coefficients that correspond to the haplotype by disease interaction and are equivalent to those obtained by fitting the models in Eq. (1) and (2). The overall hypothesis of no association (**β** = **0**) can be tested by performing a multivariate Wald test using the estimated covariance matrix, cov(**β**). Then, the test statistic (score) **β'**cov(**β**)^{-1}**β** will have asymptotically a χ^{2} distribution on r - 1 degrees of freedom.
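The multivariate Wald test mentioned above can be sketched in a few lines of plain Python. The sketch assumes that the vector of estimated coefficients and its covariance matrix are already available (for example, from any of the regression models discussed here) and that the degrees of freedom equal the number of tested coefficients. The helper names `solve`, `chi2_sf` and `wald_test` are hypothetical and written for this illustration only.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2:                       # odd df: start from the df = 1 case
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:                            # even df: start from the df = 2 case
        q, k = math.exp(-x / 2), 2
    while k < df:                    # recurrence from k to k + 2 degrees of freedom
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def wald_test(beta, cov):
    """Return (statistic, df, p-value) for the null hypothesis beta = 0."""
    stat = sum(b * s for b, s in zip(beta, solve(cov, beta)))
    return stat, len(beta), chi2_sf(stat, len(beta))


if __name__ == "__main__":
    # Hypothetical coefficients and covariance matrix, for illustration only.
    print(wald_test([0.10, 0.80, 0.70], [[0.13, 0.02, 0.02],
                                         [0.02, 0.10, 0.02],
                                         [0.02, 0.02, 0.68]]))
```

In practice a numerical library would replace the hand-written linear solver and tail probability; they are spelled out here only to keep the example self-contained.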

Whatever the assumed sampling scheme that gave rise to the data of Table

The methods discussed above are simple applications of the generalized linear model extending the analysis of single markers to haplotypes and assume that i) the haplotype risk follows a multiplicative model of inheritance, ii) the haplotype phase is known and iii) the population is in Hardy-Weinberg Equilibrium (HWE). The genetic model of inheritance can be handled simply by using in the analysis the so-called haplo-genotypes or diplotypes, instead of the genotypes. This is easily performed with all the previously presented methods by using the pairwise combinations of haplotypes (haplotype pairs 1/1, 1/2 and so on). In case-control association studies, however, with the exception of some cases where direct genotyping of the haplotypes is applicable (i.e.

Even though a large body of the genetic epidemiology literature is dedicated to such methods, their application in meta-analysis is problematic since in most cases the original data are not available to the analyst. Thus, in the following sections where the methods for meta-analysis are summarized we also assume that the haplotypes are known. An extension when the posterior probabilities of haplotypes are given from the output of the haplotype inference software would then be straightforward.

Methods for meta-analysis of haplotype association

In this section the methods for meta-analysis are presented. We first discuss simple methods that use summary data; the next sub-section presents more advanced methods that use generalized linear models on grouped or Individual Patient Data (IPD).

Meta-analysis using summary-data

A commonly used approach that is based on traditional methods and uses solely summary data is to consider separately the effect of the j^{th} haplotype against the

with an asymptotic variance given by:

In this notation, n_{ic0} and n_{ic1} are the counts of the remaining haplotypes (excluding haplotype j) in the controls and the cases of the i^{th} study respectively, given by:

In a standard univariate random-effects model we assume that the logarithm of the OR of each study

Thus, the combined logarithm of the Odds Ratio (log

The between-studies variance (τ^{2}) can be easily computed by the non-iterative method of moments proposed by DerSimonian and Laird. Setting τ^{2} = 0 in Eq. (9) corresponds to the well-known fixed-effects estimator with inverse variance weights.
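The univariate approach described in this sub-section can be sketched as follows: one function computes the log-odds ratio of a given haplotype against all remaining haplotypes pooled together, and a second one combines the study-specific estimates with the DerSimonian and Laird method of moments. This is a minimal sketch under stated assumptions: the function names and the list-based data layout are hypothetical, no continuity correction is applied (all counts are assumed positive), and the example studies are invented for illustration.

```python
import math


def one_vs_others(cases, controls, j):
    """logOR and variance for haplotype j against all other haplotypes pooled.

    cases / controls: haplotype counts for one study (index 0 = haplotype 1).
    """
    a, b = cases[j], controls[j]
    c, d = sum(cases) - a, sum(controls) - b   # counts of the remaining haplotypes
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def dersimonian_laird(y, v):
    """Pool effects y with within-study variances v.

    Returns (pooled estimate, standard error, between-studies variance).
    """
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))   # Cochran's Q
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / denom)               # truncated at zero
    w_star = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2


if __name__ == "__main__":
    # Hypothetical studies: (haplotype counts in cases, haplotype counts in controls).
    studies = [([89, 14, 24, 3], [183, 26, 22, 3]),
               ([120, 30, 25, 9], [150, 28, 20, 8]),
               ([60, 12, 18, 4], [75, 15, 10, 6])]
    estimates = [one_vs_others(ca, co, j=2) for ca, co in studies]
    y, v = zip(*estimates)
    print(dersimonian_laird(y, v))
```

Repeating this calculation for every haplotype in turn is exactly what produces the multiple comparisons discussed in the text.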

The particular approach is very easily implemented, intuitive and it can be performed in a standard univariate meta-analysis framework. In the results section we will see that several already published meta-analyses used this method. However, the method has some drawbacks. The most important is that it is prone to an increased type I error rate due to multiple comparisons. Multiple comparisons constitute an important problem in haplotype analysis, especially as the number of haplotypes increases

To overcome the multiple comparisons problem, a straightforward alternative would be to extend the model in a multivariate framework, modelling simultaneously the logORs derived from comparing haplotypes 2, ..., r against the reference haplotype. We denote by **y**_{i} the vector containing the r - 1 logORs of the i^{th} study and by **β** the vector of the overall means, given by:

These logORs similarly to Eq. (5) will be given by:

with an asymptotic variance given by:

In the multivariate random-effects meta-analysis, we assume that **y**_{i }is distributed following a multivariate normal distribution around the true means **β**, according to the marginal model:

In the above model, we denote by **C**_{i }the within-studies covariance matrix:

and by **Σ **the between-studies covariance matrix, given by:

The diagonal elements of **C**_{i} are the study-specific estimates of the variance that are assumed known, whereas the off-diagonal elements correspond to the pairwise within-studies covariances, for instance ρ_{w23}s_{2i}s_{3i} = cov(y_{2i}, y_{3i}). Since the logORs derived for each haplotype are compared against the same reference category, their pairwise covariances will be given
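The structure of the within-studies covariance matrix can be illustrated with a short Python sketch. It relies on the standard large-sample result that two log-odds ratios sharing a reference category have a covariance equal to the sum of the reciprocals of the reference counts in cases and in controls; this is stated here as an assumption, since the original expression is not reproduced in the text. The function name and the data layout are hypothetical.

```python
def within_study_covariance(cases, controls):
    """Covariance matrix of the r-1 logORs (haplotypes 2..r vs haplotype 1).

    cases / controls: haplotype counts for one study (index 0 = reference).
    """
    # Contribution of the shared reference haplotype to every entry.
    shared = 1 / cases[0] + 1 / controls[0]
    r = len(cases)
    cov = [[shared] * (r - 1) for _ in range(r - 1)]
    for j in range(1, r):
        # Diagonal entries add the reciprocals of the haplotype's own counts.
        cov[j - 1][j - 1] += 1 / cases[j] + 1 / controls[j]
    return cov


if __name__ == "__main__":
    for row in within_study_covariance([89, 14, 24, 3], [183, 26, 22, 3]):
        print(["%.4f" % value for value in row])
```

Because the off-diagonal entries depend only on the reference counts, choosing the most common haplotype as the reference keeps the within-studies covariances small.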

We should mention that from standard normal theory it is known that the multivariate test for **β = 0**, based on **β'**cov(**β**)^{-1}**β**, could yield significant results even if all the

The model can be fitted in any statistical package capable of fitting random-effects weighted regression models with an arbitrary covariance matrix, such as SAS (using

Meta-analysis using binary data

In this section, methods that use directly the binary nature of the data, within a generalized linear mixed model (GLMM) are presented. These methods are usually termed IPD methods

Logistic regression

Using the prospective likelihood we can extend the logistic regression model of Eq. (1) in order to incorporate study specific effects and perform a stratified analysis (fixed effects meta-analysis). To do so, we need to introduce _{i }(taking values equal to zero or one) with coefficients _{0}_{i }that are indicators of the study-specific fixed-effects. Thus, the model is a straightforward extension to the model described previously for meta-analysis of genetic association studies for single nucleotide polymorphisms

Here, the β_{j} obtained by fitting the model are the overall estimates of the logORs (i.e. for comparing haplotype j vs. haplotype 1). An overall test for the association of haplotypes with disease can be performed if we denote by **β** the vector of the estimated coefficients and by cov(**β**) its estimated variance-covariance matrix. Then, the test statistic **β**'cov(**β**)^{-1}**β** will have asymptotically a χ^{2} distribution (χ^{2}_{r-1})

This is the analogue to the Cochran's test for heterogeneity in the univariate meta-analysis. The hypothesis can be tested by performing a multivariate Wald test, where the null hypothesis is:

The test statistic can be constructed analogously to the one used for **β**. If we denote by **γ **the vector of the estimated coefficients, by **V **the estimated variance-covariance matrix and by **Rγ = r **the vector of the (

will have asymptotically a χ^{2} distribution

Moreover, the value of this χ^{2} statistic can be used to derive an overall I^{2} index of heterogeneity.

This measure is quite useful, since it enables us to summarize the overall heterogeneity, instead of having to look at multiple indices of heterogeneity arising from multiple haplotype contrasts.
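A single heterogeneity index of this kind can be computed from the overall heterogeneity statistic and its degrees of freedom. The sketch below assumes the usual Higgins and Thompson definition, I^2 = (χ^2 - df)/χ^2 truncated at zero and expressed as a percentage; the exact expression used in the original equation is not shown in the text, so this definition and the function name are assumptions.

```python
def i_squared(chi2, df):
    """I^2 (in percent) from a heterogeneity chi-square statistic and its df."""
    if chi2 <= 0:
        return 0.0
    # Negative values are truncated at zero, as is conventional.
    return max(0.0, 100.0 * (chi2 - df) / chi2)


if __name__ == "__main__":
    # Hypothetical statistic on 10 degrees of freedom.
    print(i_squared(20.0, 10))
```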

In order to account for an additive component of heterogeneity and perform a random-effects logistic regression allowing the haplotype effects to vary between studies, the most suitable way is to introduce a set of study-specific random coefficients, representing the deviation of study's true effect from the overall mean effect for each haplotype. Thus, the model becomes:

In this model, the random terms **β**_{i }are distributed as:

where

The between studies variances and covariances have the same interpretation as the ones obtained by the summary-data methods of Eq. (13) and (15).

Multinomial logistic regression

Alternatively, the model may be parameterized assuming a multinomial sampling scheme utilizing the retrospective likelihood. In this case, an extension of the model of Eq. (2), which incorporates fixed-study effects, would be:

The linear predictor in the above model becomes:

Similar to the model based on prospective likelihood, the variables _{i }are indicators of the study-specific fixed-effects. An overall test for the association of haplotypes with disease (**β = 0**) can be performed similarly to the logistic regression model (

The statistic for heterogeneity (χ^{2}) and the I^{2} index derived from it are identical to the ones presented in Eq. (19) - (21).

A random-effects extension to the model can be formulated if in the above model, we introduce a haplotype-specific random coefficient _{ij }(for haplotypes

and the model is completely specified as a random effects multivariate meta-analysis, with random terms **β**_{i} distributed, as before, as **β**_{i} ~ MVN(**0**, **Σ**). The interpretation of the variances and covariances of the random terms is identical to the ones presented in Eq. (13). A version of this model has been used previously for meta-analysis of genetic association studies involving single nucleotide polymorphisms

Poisson regression

Lastly, we can extend the log-linear model of Eq. (4) in order to perform a fixed effects meta-analysis allowing for the study-specific effects. The major difference compared to the previous approaches lies in the structure of the log-linear model and the interpretation of the main effects and interactions. Having in mind that we want to model a 2 ×

In this model, the coefficients _{j}, _{ij}, _{0}, _{0}_{j }and _{j }correspond to the ones obtained by fitting the models in Eq. (17) and Eq. (15). The overall test for the association of the haplotypes with the disease (**β **= **0**), is known in the context of log-linear models as the test of "

The test with the null hypothesis _{0}: ** γ = 0 **(_{ij }= 0,

In analogy to models in Eq. (22) and (28), a random coefficient for the disease by haplotype interaction can be applied in order to perform a random-effects meta-analysis:

with random terms **β**_{i} distributed, as before, as **β**_{i} ~ MVN(**0**, **Σ**). Similarly to the multinomial logistic regression model, the interpretation of the variances and covariances of the random terms is identical to the ones presented in Eq. (12).

Continuous traits

The methods discussed so far assume we are dealing with a binary trait, usually in a case-control setting. However, continuous traits are not uncommon in genetic association studies and these can be easily accommodated using a linear model (linear regression). For instance, denoting by y_{ij} the continuous trait for a person carrying the j^{th} haplotype in the i^{th} study, the model would be:

The homogeneity of haplotype effects across studies can be subsequently checked using a model with a haplotype x study interaction term:

Finally, a random effects model could be formulated using a linear mixed model:

with random terms **β**_{i} distributed, as before, as **β**_{i} ~ MVN(**0**, **Σ**). Similarly to the previously described models, the interpretation of the variances and covariances of the random terms is identical to the ones presented in Eq. (12). In cases where individual data are not available, the above models could be easily fitted using summary data (mean values and standard deviations) per haplotype.

Implementation

The models presented in this section can be easily fitted in Stata using

A sometimes useful simplification can be made in Eq. (15) if we assume that the between-studies variances are equal, τ_{2} = τ_{3} = ... = τ_{r}, in which case **Σ** reduces to:

Another approximation would be to impose a single between studies correlation, but allow for different between-studies variances

In this work, however, we chose to use a different approximation that can be obtained if the number of random effects is reduced by decomposing the random terms using factor loadings such as: τ_{2}^{2} = λ_{2}^{2}τ^{2}, τ_{3}^{2} = λ_{3}^{2}τ^{2}, ..., τ_{r}^{2} = λ_{r}^{2}τ^{2}, and letting λ_{2} = 1 for identification. Thus, the covariance matrix now becomes:

The particular approximation is conceptually similar to the one used previously for the so-called "genetic model-free approach" for meta-analysis of genetic association studies ^{2 }thus, it is much faster since the factor loadings _{j }with _{Bjj'}) to be equal to ±1 (depending on the sign of _{j}_{j'}). Nevertheless, the between-studies correlations are usually poorly estimated especially when the number of studies is small (<20) and in such cases they are usually estimated to be equal to ±1
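The consequence of the factor-loading decomposition for the between-studies correlations can be checked numerically. The sketch below assumes a one-factor structure in which each entry of the between-studies covariance matrix is the product of two loadings and a common variance; the symbols (loadings, tau2), the function names and the numerical values are assumptions made for illustration. Under this structure every implied correlation equals +1 or -1, according to the sign of the product of the two loadings.

```python
import math


def factor_loading_covariance(loadings, tau2):
    """Between-studies covariance with entries loadings[j] * loadings[k] * tau2."""
    return [[lj * lk * tau2 for lk in loadings] for lj in loadings]


def correlations(sigma):
    """Correlation matrix implied by a covariance matrix."""
    n = len(sigma)
    return [[sigma[j][k] / math.sqrt(sigma[j][j] * sigma[k][k])
             for k in range(n)] for j in range(n)]


if __name__ == "__main__":
    # Hypothetical loadings (the first fixed to 1 for identification) and variance.
    sigma = factor_loading_covariance([1.0, 0.5, -2.0], tau2=0.04)
    for row in correlations(sigma):
        print(["%+.1f" % value for value in row])
```

Only one variance component and the loadings are estimated, which is what makes the approximation faster to fit than a model with an unstructured covariance matrix.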

A final comment has to be made concerning the identifiability of the models presented in the previous sections, especially when it comes to the log-linear models which are the ones that contain the largest number of parameters. Concerning the fixed effects methods, the number of parameters of the saturated model of Eq. (30) is equal to 2

In Additional file

**Stata code for fitting the methods described in the manuscript**. The commands should be run within a Stata do-file.

Results

We initially performed a literature search to identify studies that report meta-analyses of haplotype associations. The initial search in PubMed using the term "haplotype" combined with "meta-analysis" or "collaborative analysis" or "pooled analysis" yielded 282 studies. Of these, 35 studies could have been identified using solely the terms "collaborative analysis" or "pooled analysis" and "haplotype". After careful screening, 207 studies were excluded as irrelevant (they were not meta-analyses of haplotypes) and 36 studies were excluded for various other reasons (family-based studies, meta-analyses of SNPs with the term "haplotype" appearing in the abstract, haplotype analyses in which the term "meta-analysis" appeared in the abstract, etc.). We finally retained 39 published papers containing data for 43 associations. Some studies reported different sets of haplotypes from the same gene (Auburn et al, 2008; Zintzaras et al, 2009), haplotypes from different genes (Thakkinstian et al, 2008), or distinct outcomes measured on different subsets of patients (Kavvoura et al, 2007) and were thus included twice, whereas from studies that reported different outcomes measured on the same set of individuals we kept only one. There were also some pairs of studies that evaluated the same association and from these we kept only the largest one. Ten of the 39 published papers could have been identified using solely the terms "collaborative analysis" or "pooled analysis" coupled with the term "haplotype". The 43 studies and their characteristics are presented in Table

List of the 43 meta-analyses that were used in the empirical evaluation

| ID | Reference | Gene/Locus | Disease/Outcome | SNPs in haplotype | Number of studies | Sample size | Method of analysis | Data availability | Collaborative analysis | Significant results |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | DRD3 | Schizophrenia | 4 | 5 | 7551 | 1 vs. others | No | No | No |
| 2 | | ITGAV | Rheumatoid Arthritis | 3 | 3 | 6851 | N/A | Yes | Yes | Yes |
| 3 | | IL1A/IL1B/IL1RN | Osteoarthritis | 7 | 4 | 2908 | 1 vs. others | No | Yes | Yes |
| 4 | | FRZB | Osteoarthritis | 2 | 10 | 12380 | 1 vs. others | No | Yes | No |
| 5 | | CX3CR1 | CAD | 2 | 6 | 2912 | 1 vs. others | Yes | No | Yes |
| 6 | | ALOX5AP | Stroke | 4 | 5 | 5765 | 1 vs. others | No | No | No |
| 7 | | ALOX5AP | Stroke | 4 | 3 | 3004 | 1 vs. others | No | No | No |
| 8 | | GNAS | Malaria | 3 | 7 | 8154 | 1 vs. others | No | Yes | Yes |
| 9 | | GNAS | Malaria | 7 | 6 | 7632 | 1 vs. others | No | Yes | Yes |
| 10 | | PDLIM5 | Bipolar Disorder | 2 | 3 | 1208 | 1 vs. others | No | No | No |
| 11 | | PDE4D | Stroke | 2 | 4 | 4961 | 1 vs. others | No | No | Yes |
| 12 | | TGFB1 | Renal Transplantation | 2 | 4 | 438 | Pooled | No | No | Yes |
| 13 | | IL10 | Renal Transplantation | 3 | 4 | 348 | Pooled | No | No | No |
| 14 | | 9p21.3 | CAD | 4 | 5 | 7838 | 1 vs. others | No | Yes | Yes |
| 15 | | HLA | SLE | 2 | 3 | 527 | 1 vs. others | No | No | Yes |
| 16 | | CTLA4 | Graves Disease | 2 | 10 | 2564 | 1 vs. others | Yes | Yes | Yes |
| 17 | | CTLA4 | Hashimoto Thyroiditis | 2 | 5 | 1210 | 1 vs. others | Yes | Yes | Yes |
| 18 | | ENPP1 | T2DM | 3 | 3 | 8676 | 1 vs. others | No | No | No |
| 19 | | MTHFR | ALL | 2 | 4 | 894 | Log-linear model | No | No | Yes |
| 20 | | CAPN10 | T2DM | 3 | 11 | 5862 | 1 vs. others | Yes | Yes | Yes |
| 21 | | ADAM33 | Asthma | 5 | 3 | 1899 | Pooled | Yes | No | No |
| 22 | | NRG1 | Schizophrenia | 6 | 11 | 8722 | 1 vs. others | No | No | Yes |
| 23 | | RGS4 | Schizophrenia | 4 | 8 | 7243 | 1 vs. others | No | Yes | No |
| 24 | | ADRB2 | Asthma | 2 | 3 | 2060 | N/A | No | No | Yes |
| 25 | | ESR1 | Fractures | 3 | 8 | 14622 | 1 vs. others | No | Yes | Yes |
| 26 | | VDR | Osteoporosis | 3 | 4 | 2335 | Log-linear model | Yes | No | Yes |
| 27 | | ACE | Alzheimer's Disease | 3 | 4 | 1619 | Pooled | Yes | Yes | Yes |
| 28 | | IGF-I | IGF-I levels | 3 | 3 | 1929 | 1 vs. others | No | Yes | Yes |
| 29 | | TF | Stroke | 2 | 2 | 818 | N/A | No | Yes | No |
| 30 | | FcgammaR | Celiac Disease | 2 | 2 | 1057 | N/A | Yes | Yes | No |
| 31 | | VDR | Fractures | 3 | 9 | 23309 | Logistic regression | No | Yes | No |
| 32 | | G72/G30 | Schizophrenia | 2 | 2 | 1541 | N/A | Yes | Yes | Yes |
| 33 | | VEGF | ALS | 3 | 4 | 1912 | Logistic regression | Yes | Yes | Yes |
| 34 | | BANK1 | Rheumatoid Arthritis | 3 | 4 | 4445 | 1 vs. others | No | Yes | Yes |
| 35 | | CYP19A1 | Endometrial Cancer | 2 | 10 | 13283 | Logistic regression | No | Yes | Yes |
| 36 | | CRP | T2DM | 3 | 3 | 11876 | N/A | No | No | Yes |
| 37 | | 8q24 | Colorectal Adenoma | 4 | 3 | 5385 | Logistic regression | No | Yes | Yes |
| 38 | | CYP1A1 | Lung Cancer | 2 | 13 | 2151 | Pooled | No | Yes | Yes |
| 39 | | TNFA | Prostate Cancer | 5 | 2 | 4881 | Pooled | Yes | Yes | No |
| 40 | | PTGS2 | Prostate Cancer | 4 | 2 | 4881 | Pooled | Yes | Yes | No |
| 41 | | AR | Endometrial Cancer | 5 | 2 | 1424 | Pooled | No | Yes | No |
| 42 | | MGMT | Head and Neck Cancer | 2 | 3 | 1347 | Pooled | No | Yes | No |
| 43 | | SNCA | Parkinson Disease | 2 | 11 | 5344 | 1 vs. others | No | Yes | Yes |

We list the reference, the gene name, the disease, the number of SNPs included in the haplotypes, the number of studies, the total sample size, the method of analysis (N/A: not available), the availability of data, whether the data was collected in a collaborative setting and whether the study reported significant results.

The average number of polymorphisms included in the haplotypes was 3.19 (SD = 1.37, median = 3, range from 2 to 7), whereas the average sample size was 5,017.81 (SD = 4,703.24, median = 3,004, range from 348 to 23,309). The average number of included studies was 5.14 (SD = 3.06, median = 4, range from 2 to 13). Twenty-seven studies (62.79%) were conducted in a collaborative setting, whereas sixteen (37.21%) were performed using data derived from the literature. Twenty-seven of the meta-analyses (62.79%) reported significant results. The majority (22 studies, 51.16%) were analysed under the "1 vs. others" approach using standard summary-based meta-analysis techniques (with fixed or random effects), 11 studies (25.58%) were analysed by pooling the data inappropriately, 6 studies (13.95%) did not report the method or did not perform pooling at all and 4 analyses (9.30%) were performed using a fixed effects logistic regression model. Only 13 studies (30.23%) reported the complete data that suffice for the analysis to be replicated (Table

There was only weak evidence that collaborative meta-analyses contained a larger number of studies compared to literature-based ones (5.67 vs. 4.25), had a larger sample size (5,651 vs. 3,948) and produced significant results more frequently (66.67% vs. 56.25%); these differences did not reach statistical significance (p-values equal to 0.144, 0.256 and 0.506 respectively). The average number of included polymorphisms was also comparable (3.26 vs. 3.06, p-value = 0.654). The thirteen meta-analyses that reported complete data did not differ significantly from the remaining ones in terms of the number of included studies (4.46 vs. 5.43, p-value = 0.345), the number of SNPs in the haplotypes (3.08 vs. 3.23, p-value = 0.735) and the proportion of significant findings (69.23% vs. 60%, p-value = 0.576). The proportion of collaborative analyses was higher, even though this difference did not reach statistical significance (76.92% vs. 56.67%, p-value = 0.216). There was, however, moderate evidence that the total sample size included in the meta-analyses that reported complete data was smaller compared to the meta-analyses that did not (3,040.31 vs. 5,874.73, p-value = 0.069). We also compared the particular database against a database of 55 representative meta-analyses of genetic association studies of SNPs that was used previously in several empirical evaluations. The meta-analyses of haplotypes included approximately half as many studies (p-value of the order of 10^{-4}), whereas the proportion of meta-analyses with significant results was about twice as large (62.8% vs. 27.27%, p-value = 0.0003).

The thirteen studies that reported the data necessary for the analysis to be replicated were subsequently used in order to apply the methods proposed in this work. We used all the methods described in the methods section except for the simpler approach of comparing each haplotype against the others, i.e. Eq. (5). The results are reported in the table below, where we list the p-values of the tests for the overall association (**β** = **0**). For the fixed effects IPD methods we additionally report the p-value of the overall test for heterogeneity (**γ** = **0**). Concerning the results obtained using the IPD methods, we report only the ones obtained from the logistic regression method of Eq. (22) using the parameterization of Eq. (37), which is easier to fit, even though the multinomial logistic regression and the Poisson regression methods would yield similar results. As expected, when the heterogeneity is low (in 8 out of the 13 studies), the random effects methods coincide with their fixed effects counterparts. In general, the methods that use summary data yield slightly different estimates for the ORs compared to the methods that use IPD when there are rare haplotypes (i.e. small counts) or when the total number of subjects is low (data not shown). In 2 out of the 13 studies the multivariate Wald tests for the overall association (**β** = **0**) produce marginally different results compared to the univariate ones.

The results obtained using the methods described in this work on the 13 studies that reported complete data that suffice for the analysis to be replicated

| ID/[reference] | Gene/Locus | Disease/Outcome | SNPs in haplotype | Number of studies | Significant results | Fixed effects: β = 0 (summary data) | Fixed effects: β = 0 (IPD) | Fixed effects: γ = 0 (IPD) | Random effects: β = 0 (summary data) | Random effects: β = 0 (IPD) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ITGAV | Rheumatoid Arthritis | 3 | 3 | Yes^{$$} | 0.2506 | 0.2489 | 0.1564 | 0.3288 | 0.3851 |
| 5 | CX3CR1 | CAD | 2 | 6 | Yes^{$} | 0.0834* | 0.0677* | 0.6263 | 0.0883* | 0.1031* |
| 16 | CTLA4 | Graves Disease | 2 | 10 | Yes | <0.0001 | <0.0001 | 0.0371 | <0.0001 | <0.0001 |
| 17 | CTLA4 | Hashimoto Thyroiditis | 2 | 5 | Yes | 0.0011 | 0.0010 | <0.0001 | 0.0044 | 0.0072 |
| 20 | CAPN10 | T2DM | 3 | 11 | Yes^{$$} | 0.1152 | 0.1036 | 0.6145 | 0.2243 | 0.1655 |
| 21 | ADAM33 | Asthma | 5 | 3 | No | 0.6209 | 0.5508 | 0.4697 | 0.6134 | 0.5503 |
| 26 | VDR | Osteoporosis | 3 | 4 | Yes^{$$} | 0.1458 | 0.3051 | <0.0001 | 0.1480 | 0.5781 |
| 27 | ACE | Alzheimer's Disease | 3 | 4 | Yes | 0.0193 | 0.0218 | 0.8906 | 0.0193 | 0.0223 |
| 30 | FcgammaR | Celiac Disease | 2 | 2 | No | 0.7331 | 0.7335 | 0.9502 | 0.7331 | 0.7336 |
| 32 | G72/G30 | Schizophrenia | 2 | 2 | Yes^{$$} | 0.7790 | 0.7757 | 0.0001 | 0.5750 | 0.6719 |
| 33 | VEGF | ALS | 3 | 4 | Yes^{$} | 0.0437* | 0.0414 | 0.0691 | 0.0716 | 0.0455* |
| 39 | TNFA | Prostate Cancer | 5 | 2 | No | 0.2531 | 0.2515 | 0.6185 | 0.2867 | 0.2511 |
| 40 | PTGS2 | Prostate Cancer | 4 | 2 | No | 0.3560 | 0.3550 | 0.2087 | 0.6573 | 0.4829 |

For either fixed or random effects methods, we list the p-values for the tests for the overall association (**β = 0**) using the summary data based methods and the IPD methods. The results for the IPD methods were obtained from the logistic regression method even though the multinomial logistic regression and the Poisson regression method yield nearly identical results. For the fixed effects IPD methods we also list the p-value of overall test for the heterogeneity (**γ = 0**).

(*): The significance of the multivariate Wald test (**β = 0**) contradicts the univariate one (β_{j} = 0).

(^{$}): The initially claimed statistically significant results are contradicted by either the multivariate or univariate Wald tests (random effects).

(^{$$}): The initially claimed statistically significant results are contradicted by both the multivariate and univariate Wald tests (random effects).

The subsequent re-analysis and the contrasting with the initial reports yielded some important findings. Concerning the four studies that initially reported no significant association

The reasons for these discrepancies deserve further investigation. For instance, in the collaborative meta-analysis for the association of CAPN10 haplotypes with Type 2 Diabetes mellitus

Discussion

Although the studies reporting haplotypes comprise a small fraction of genetic association studies, their number is steadily growing and so there is a need for developing formal methods for combining them in a meta-analysis. In this work, a comprehensive framework for the meta-analysis of haplotype association studies was presented and an empirical evaluation has been performed for the first time in the literature.

The methods proposed in this work are extending previous works in meta-analysis of genetic association studies

The empirical evaluation of the published literature suggests that meta-analyses of haplotypes did not systematically differ from the meta-analyses of genetic association using SNPs in terms of the average sample size, but include approximately half as many studies and produce significant results twice as often. The meta-analyses that reported the complete data did not significantly differ from the remaining studies in terms of the number of included studies, the number of SNPs included in the haplotypes, the proportion of significant findings or the proportion of collaborative analyses. There was, however, moderate evidence that the total sample size included in the meta-analyses that reported complete data was smaller compared to the meta-analyses that did not.

The application of the methods proposed in this work to studies that reported the complete data made clear that approximately half of the significant findings are attributable to the method of analysis used by the primary authors and suffer from an inflated type I error rate. Indeed, for four of the nine studies that reported significant results, these were clearly refuted by the multivariate methodology. Three of these studies used the 1 vs. others approach, which although more powerful, is known to suffer from an increased type I error rate

All the models presented here assume that the haplotypes are directly observed. However, as we have already discussed, the haplotypes are usually inferred and thus, treating them as known quantities may be problematic

The methods proposed in this work clearly outperform the traditional naïve method of meta-analysis of haplotypes, which simply consists of contrasting each haplotype against the remaining ones. This is expected to be more pronounced as the number of possible haplotypes increases, since the type I error rate due to multiple comparisons increases as well

Conclusions

We presented multivariate methods that use summary-based data as well as methods that use binary and count data in a generalized linear mixed model framework (logistic regression, multinomial regression and Poisson regression). The methods presented here are easily implemented using standard software such as Stata, R or SAS, making them easy to apply even by non-experts. In the Additional file

Authors' contributions

PGB conceived the study, performed the analyses and wrote the manuscript.

Acknowledgements

The author would like to thank the two anonymous reviewers for their valuable comments that improved the quality of the manuscript.