Abstract
Multivariate phenotypes underlie complex traits. Thus, instead of using the endpoint trait, it may be statistically more powerful to use a multivariate phenotype correlated to the endpoint trait for detecting linkage. In this study, we develop a reverse regression method to analyze linkage of Kofendrerd Personality Disorder affection status in the New York population of the Genetic Analysis Workshop 14 (GAW14) simulated dataset. When we used the multivariate phenotype, we obtained significant evidence of linkage near four of the six putative loci in at least 25% of the replicates. On the other hand, the linkage analysis based on Kofendrerd Personality Disorder status as a phenotype produced significant findings only near two of the loci and in a smaller proportion of replicates.
Background
A complex trait is usually a function of a multivariate phenotype comprising correlated quantitative variables. Since endpoint traits are usually binary in nature (affected/unaffected) and hence contain minimal information on variation within trait genotypes, it may be statistically more powerful to use a correlated multivariate phenotype for identifying genes for the complex trait. Mapping a multivariate phenotype traditionally uses some function of quantitative values of sibpairs or other sets of relatives as a response variable and marker identitybydescent (IBD) scores as explanatory variables [13]. In these analyses, linkage inferences depend strongly on the assumed probability distributions of the quantitative variables, particularly for likelihoodbased approaches such as variance components [3,4]. We propose a linear regression formulation in which the response and explanatory variables are interchanged, such as that used by Sham et al. [5]. Analyses do not require modeling the covariance structure of the multivariate phenotype vector [24] or any data reduction technique, such as principal components [6]. In this study, we use the proposed method for performing a genomewide scan of a multivariate phenotype vector correlated with Kofendred Personality Disorder (KPD) in the New York population of the simulated dataset of GAW14.
Methods
Data description
For our analysis, we considered data on the KPD status (affected or unaffected), twelve associated binary phenotypes, and genomewide information separately on 416 microsatellite marker loci and 917 single nucleotide polymorphisms (SNPs) with average intermarker distances of 7.5 cM and 3 cM, respectively, distributed over 10 autosomal chromosomes for the New York population. Our method utilizes phenotype and marker data on 50 independent sibships of sizes varying from 2 to 9 and their parental genotypes for IBD computations. We analyzed data on all 100 available replicates.
Constructing the multivariate phenotype
Suppose y_{ijk }denote the phenotypic value of the i^{th }trait for the j^{th }sib in the k^{th} sibship, i = 1, 2, 3, 4, 5; j = 1, 2, ..., n_{k}; k = 1, 2, ..., 50. The twelve phenotypes relate to personality traits and therefore may be associated with the endpoint trait, the affectation status of KPD. Thus, instead of using the KPD status as a phenotype for linkage analysis, it may be statistically more powerful to use a multivariate phenotype comprising some of these personality traits, which are highly correlated to the disease status. In order to select a subset of the twelve traits, which may be used as a surrogate for the endpoint trait, we performed a logistic regression of the KPD disease status on the twelve binary phenotypes. To ensure the independence of our observations, the regression was based on the 100 parents of the 50 sibships.
The logistic model used was:
where z_{jk }is the affectation status of KPD of the j^{th }parent of the k^{th }sibship; δ = 0 or 1 according to whether an individual is affected with KPD or not and x_{ijk }is the phenotypic value of the i^{th }trait of the j^{th }parent of the k^{th }sibship. The test for association between the i^{th }(i = 1, 2, ..., 12) personality trait with KPD is equivalent to testing a_{i }= 0 versus a_{i }≠ 0. We used a level of 0.005 for testing each a_{i }in the 12 tests. We obtained five of the phenotypes to be significantly correlated to the endpoint trait (details are provided in the "Results" section). Thus, the multivariate phenotype we used for our linkage analysis comprises five binary personality traits.
The reverse regression procedure
Sham et al. [5] proposed a regression method that interchanges the phenotype and the marker IBD score variables. We adapted their method for the following linear regression model:
where is the estimated marker IBD score of the first and j^{th }sibs of the k^{th }sibship, j = 1, 2, ..., n_{k}; k = 1, 2, ..., 50; e_{jk }values are random environmental errors assumed to have mean 0 and equal variances. We note here that an advantage of using IBD scores instead of the squared sibpair trait differences as the response variable is that in a sibship of size n_{k}, the marker IBD scores π_{1k,2k}, π_{1k,3k}, ..., π_{1k},n_{k }are independent, but the squared differences in trait values for these sibpairs are not independent. Thus, for a sibship of size n_{k}, we have n_{k } 1 independent. We wish to point out here that our method is not related to parity (i.e., birth order). While analyzing data, we suggest that the sib assigned "1" be chosen at random from the sibship. When we computed multipoint IBD scores, the conversion of recombination distances to physical distances (in cM) on chromosomes was based on the Haldane map function [7].
We define our test for linkage between the locus controlling KPD and the marker locus to be equivalent to testing H_{0 }: β_{1 }= β_{2 }= β_{3 }= β_{4 }= β_{5 }= 0 versus H_{1}: β_{1 }< 0 U β_{2 }< 0 U β_{3 }< 0 U β_{4 }< 0 U β_{5 }< 0. In other words, under no linkage between the two loci, the estimated marker IBD score will not be correlated to the squared difference in sibpair trait values. On the other hand, if the two loci are linked, the estimated marker IBD score will not be correlated to the squared difference in sibpair values for at least one of the correlated traits [1].
The test statistic used is
The above statistic is equivalent to the usual likelihood ratio test (LRT) for normally distributed errors. Under the assumption of normality, the test statistic is distributed asymptotically as a mixture of chisquare distributions. It is very unlikely in practice for the errors to be distributed as normal. Thus, instead of making any assumptions on the distribution of the errors, we use Monte Carlo simulations to obtain the empirical pvalues for the observed value of the test statistic. We generate marker IBD scores at random using the marginal distribution of IBD scores (based on a multiallelic modification of Table V in Haseman and Elston [1] and marker allele frequencies as provided in the dataset) and assign them to the different sibpairs in the regression analysis. The squared differences in the phenotypic values of the sibpairs are conserved and the regression is performed to generate values of the test statistic under the null hypothesis of no linkage.
Because our aim is to show that using the multivariate phenotype vector for the linkage scan is statistically more powerful than using the endpoint trait (KPD status), we also perform the reverse regression analysis using only the KPD status. The regression procedure is identical to the one described above with the test for linkage based on only one parameter, i.e., the regression coefficient associated with the KPD status variable.
Results
Based on the logistic regression, five phenotypes: fear/discomfort with strangers, dislike of jokes told face to face, obsession with entertainers, humor impairment, and uncommunicative, contentless speech patterns were found to be significantly correlated to KPD status. As mentioned earlier, we performed two linkage analyses: one based on a multivariate phenotype vector comprising these five traits and the other based on only the KPD status as a phenotype. We used the statistical package MERLIN 0.10.2 [8] for multipoint IBD computations. The reverse regression method described above was performed at the marker/SNP positions. The test for linkage had level 0.001 (for each marker) and the null distribution of the test statistic was determined using 1,000 MonteCarlo simulations. Since the "answers" were available to us, we considered a linkage peak to be true positive if it is within 10 cM from the true position of the putative locus. The results are provided in Table 1 for the endpoint trait and in Table 2 for the multivariate phenotype in terms of the proportion of replicates where significant linkage peaks were obtained along with the markers within 10 cM of those peaks.
Table 1. Significant linkage peaks and microsatellite markers/SNPs within 10 cM of the peaks based on the KPD status
Table 2. Significant linkage peaks and microsatellite markers/SNPs within 10 cM of the peaks based on the multivariate phenotype vector.
When we used the multivariate phenotype, the linkage analyses based on the 416 microsatellite markers yielded significant peaks on 4 chromosomes: D01S0023 on chromosome 1, D03S0127 on chromosome 3, D05S0173 on chromosome 5, and D09S0347 on chromosome 9. The linkage analyses using the 917 SNP markers yielded significant peaks around the same regions as the peaks corresponding to the microsatellite markers: C01R0051 on chromosome 1, C03R0281 on chromosome 3, C05R0381 on chromosome 5, and C09R0763 on chromosome 9. When we used the endpoint KPD status as our phenotype, we obtained significant peaks only at D03S0127 and C03R0281 on chromosome 3; and D05S0173 and C05R0381 on chromosome 5 for microsatellite markers and SNPs, respectively. It is clear from the tables that not only did the multivariate phenotype approach produce significant linkage findings at more locations, but also the proportions of replicates in which we obtained the significant findings for both microsatellite markers and SNPs were much lower when only the KPD status was used.
Based on the multivariate phenotype, we have been able to detect linkage in at least 25% of the replicates for both microsatellite markers and SNP markers on four chromosomes (1, 3, 5, and 9) very close to the putative trait loci. The proportion of replicates in which we obtained significant linkage findings for the SNPs appears to be marginally lower than that for the microsatellite markers. This can be explained by the fact that since the SNPs are less polymorphic compared with microsatellite markers, the information content at the same marker density is higher with microsatellite markers, leading to more efficient estimation of marker IBD scores. Moreover, we used the same level of significance in our tests of linkage for both microsatellite as well as SNP markers. Since the SNPs are at a much higher density, at the same level of singlemarker significance, the genomewide significance level based on SNPs is higher than that for the microsatellite markers.
Conclusion
Our proposed reverse regression method was able to detect linkage near four of the six putative loci controlling KPD in multiple replicates. We found that our linkage analyses based on the multivariate phenotype comprising five binary traits correlated with KPD was more powerful than those based on only the affectation status of KPD as the phenotype. Thus, using a multivariate phenotype vector comprising traits correlated with the endpoint trait may be a prudent strategy for linkage mapping of a complex trait.
While it is important to compare the power of our method with those of existing methodologies, the structure of the dataset did not permit a valid statistical comparison with most existing methods. The variance components methods like those implemented in MERLIN, GENEHUNTER, SEGPATH, and ACT assume multivariate normality of trait values within pedigrees and are designed for quantitative traits. However, all the personality traits in the dataset were binary in nature and assumption of normality for these traits would not be proper. The package SOLAR has an option of using a threshold model for binary traits [9], but like MERLIN and GENEHUNTER, allows for single traits only. Thus, it was difficult to compare our method with other multivariate methods. While we showed that using the multivariate phenotype yields more power than using only KPD status based on the reverse regression strategy, it is of interest to explore whether our multivariate method is more powerful than standard univariate analyses on KPD status implemented in LINKAGE or GENEHUNTER. However, a direct comparison with LINKAGE is difficult because it is parametric in nature and would yield LOD scores as the linkage statistic. Since our method is completely modelfree, it is not possible to compute LOD equivalents from our statistic. On the other hand, because our analyses involved affected and unaffected individuals, it would not be proper to compare with an analysis involving only affected individuals as implemented in modelfree analyses of GENEHUNTER. We may have missed out on valid comparisons with some other existing methodologies and are currently exploring those possibilities.
The overall level of significance would most likely be a function of the level of significance used in the first stage of our analysis in which we are selecting a subset of phenotypes that are significantly associated with the endpoint trait. The nature of dependence of the two stages is quite complex and it is difficult to obtain exact adjustments of the pvalues in the linkage scan after accounting for the pvalues in the first stage. Extensive simulations to examine this issue are being conducted.
Abbreviations
GAW14: Genetic Analysis Workshop 14
IBD: Identity by descent
KPD: Kofendrerd Personality Disorder
LRT: Likelihood ratio test
SNP: Singlenucleotide polymorphism
Authors' contributions
SG proposed and worked on the methodology. SB optimized the linkage statistics and wrote the computer codes. GB and SP managed the data and implemented the software packages/computer programs for IBD computations, logistic regression, and empirical power computations. PPM coordinated the analysis and participated in writing the manuscript.
Acknowledgements
This work was supported by the Fogarty International Center, NIH, through R01 grant TW00660401. The authors acknowledge the two anonymous referees, whose comments helped to substantially improve the presentation of the manuscript. The authors are also grateful to Anurag Mitra, who implemented some other computer programs.
References

Haseman JK, Elston RC: The investigation of linkage between a quantitative trait and a marker locus.
Behav Genet 1972, 2:319. PubMed Abstract  Publisher Full Text

Amos CI, Elston RC, Bonney GE, Keats BJB, Berenson GS: A multivariate method for detecting genetic linkage, with application to a pedigree with an adverse lipoprotein phenotype.
Am J Hum Genet 1990, 47:247252. PubMed Abstract  PubMed Central Full Text

Almasy L, Blangero J: Multipoint quantitativetrait linkage analysis in general pedigrees.
Am J Hum Genet 1998, 62:11981211. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Amos CI: Robust variancecomponents approach for assessing genetic linkage in pedigrees.
Am J Hum Genet 1994, 54:535543. PubMed Abstract

Sham PC, Purcell S, Cherny SS, Abecasis GR: Powerful regressionbased quantitative trait linkage analysis of general pedigrees.
Ann Hum Genet 2002, 68:15271532. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Elston RC, Buxbaum S, Jacobs KB, Olson JM: Haseman and Elston revisited.
Genet Epidemiol 2000, 19:117. PubMed Abstract  Publisher Full Text

Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors.

Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlinrapid analysis of dense genetic maps using sparse gene flow trees.
Nat Genet 2002, 30:97101. PubMed Abstract  Publisher Full Text

Williams JT, van Eerdewegh P, Almasy L, Blangero J: Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results.
Am J Hum Genet 1999, 65:11341147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text