Department of Biostatistical Sciences, Wake Forest University School of Medicine Winston-Salem, North Carolina 27157, USA

Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham Birmingham, Alabama 35294, USA

Department of Epidemiology University of Alabama at Birmingham Birmingham, Alabama 35294, USA

Department of Biostatistics University of Washington Seattle, WA 98195, USA

Division of Internal Medicine and Department of Epidemiology, University School of Medicine Baltimore, MD 21287, USA

Department of Radiology and Medicine Johns Hopkins University School of Medicine Baltimore, MD 21287, USA

Abstract

Background

Questions remain regarding the utility of self-reported ethnicity (SRE) in genetic and epidemiologic research. It is not clear whether conditioning on SRE provides adequate protection from inflated type I error rates due to population stratification and admixture. We address this question using data obtained from the Multi-Ethnic Study of Atherosclerosis (MESA), which enrolled individuals from 4 self-reported ethnic groups. We compare the agreement between SRE and genetic based measures of ancestry (GBMA), and conduct simulation studies based on observed MESA data to evaluate the performance of each measure under various conditions.

Results

Four clusters are identified using 96 ancestry informative markers. Three of these clusters are well delineated, but 30% of the self-reported Hispanic-Americans are misclassified. We also found that MESA SRE provides type I error rates that are consistent with the nominal levels. More extensive simulations revealed that this finding is likely due to the multi-ethnic nature of the MESA. Finally, we describe situations where SRE may perform as well as a GBMA in controlling the effect of population stratification and admixture in association tests.

Conclusions

The performance of SRE as a control variable in genetic association tests is more nuanced than previously thought, and may have more value than it is currently credited with, especially when smaller replication studies are being considered in multi-ethnic samples.

Background

The use of self-reported race and ethnicity (SRE) in genetic and epidemiologic studies has been much discussed in the literature

Some studies have found SRE to be closely related to an individual's genetically estimated ancestry proportions

We used ancestry informative markers (AIMs) and phenotypic data on left-ventricular hypertrophy (LVH) collected in the context of the Multi-Ethnic Study of Atherosclerosis (MESA) to address two related questions: (1) what agreement is there between SRE and clusters created based on the genotyped AIMs? (2) In multi-ethnic genetic association studies, does SRE provide comparable type I error control to that provided by genetic ancestry background measures, such as individual ancestry proportions or genetic background scores? To address these two questions we compared three sets of measures; SRE, individual ancestry proportion (IAP) estimates obtained using STRUCTURE

LVH is a condition where the ventricular mass increases as existing cells of the LV enlarge or hypertrophy

Results

Agreement between self-reported ancestry and the genetic background scores

Ward's minimum variance method and the K-means clustering algorithm, applied to the 96 AIMs, yielded similar clusters. Here we present only the results from the K-means algorithm. Figure

**Principal component analysis in 2D of the MESA AIMs**. The Hispanic-American group seems to be more heterogeneous than the remaining groups, with a fairly large number of self-identified Hispanics being assigned to the same cluster as the self-identified European-Americans.

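Agreement between self-reported ethnicity and the 4 observed clusters

Rows give the self-reported ethnicity; columns give the assigned ethnic group (AA: African American, CA: Chinese American, EA: European American, HA: Hispanic American).

| Self-reported ethnicity | AA | CA | EA | HA | Total |
|---|---|---|---|---|---|
| African American | 637 | 0 | 8 | 67 | 712 |
| Chinese American | 0 | 717 | 1 | 0 | 718 |
| European American | 1 | 0 | 630 | 81 | 712 |
| Hispanic American | 69 | 3 | 135 | 498 | 705 |
| **Total** | 707 | 720 | 774 | 646 | 2847 |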
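The agreement summarized in this table can be quantified directly from the counts. The sketch below is illustrative only: it computes the unweighted Cohen's kappa (the Methods mention a weighted kappa, whose weights are not given here) and the share of each self-reported group that lands in the matching cluster. The function names are ours.

```python
# Confusion matrix copied from the agreement table.
# Rows: self-reported ethnicity; columns: assigned cluster (order AA, CA, EA, HA).
LABELS = ["AA", "CA", "EA", "HA"]
TABLE = [
    [637, 0, 8, 67],
    [0, 717, 1, 0],
    [1, 0, 630, 81],
    [69, 3, 135, 498],
]


def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square contingency table."""
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance if rows and columns were independent.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)


def row_agreement(table):
    """Share of each self-reported group assigned to the matching cluster."""
    return {LABELS[i]: table[i][i] / sum(table[i]) for i in range(len(table))}


kappa = cohen_kappa(TABLE)
agreement = row_agreement(TABLE)
print(f"kappa = {kappa:.3f}")
for label, share in agreement.items():
    print(f"{label}: {share:.1%} assigned to the matching cluster")
```

The Hispanic-American row is the one that drives the disagreement: 498 of 705 individuals fall in the matching cluster, which corresponds to the roughly 30% misclassification quoted in the abstract.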

Agreement between self-reported ancestry and the ancestry proportion estimates

We also ran the program STRUCTURE to obtain individual ancestry proportion estimates assuming 4 ancestral populations. Let $(q_1, q_2, q_3, q_4)$ denote the ancestry proportion of an individual in the dataset, where $q_i$ is the proportion of ancestry derived from the $i^{th}$ ancestral population.

**Comparison between self-reported ancestry and STRUCTURE-estimated ancestry proportion**. Ideally, without admixture, each group would be represented by just one bar. That is, if the vector $(q_1, q_2, q_3, q_4)$ represents the ancestry score for each individual in the dataset,

Distribution of body surface adjusted (BSA) LV mass and the LV ejection fraction

The summary statistics for the distribution of BSA LV mass and the LV ejection fraction by self-reported ethnicity are given in the table below. The corresponding test for BSA adjusted LV mass has a p-value on the order of 10^{-14}, which shows strong evidence that the mean LV mass is different between the 4 self-reported ethnic groups. The association of SRE with the distribution of the LV ejection fraction is more pronounced than what is observed with adjusted LV mass. Although the observed mean and standard deviation of the LV ejection fraction in each ethnicity appear to be very close, its overall distribution varies across the 4 ethnic groups. The Kruskal-Wallis test comparing this distribution in the 4 ethnic groups has a p-value of 2 × 10^{-24}, which strongly suggests that the distributions are different in the 4 self-reported ethnic groups. These two results confirm the presence of the previously identified ethnic-specific differences in the distribution of both BSA adjusted LV mass and the LV ejection fraction. We also note that after standardizing both phenotypic variables, SRE explains only 3% of the variation in LV mass, while it explains 19.5% of the variation in LV ejection fraction. Therefore, it is expected that variation in ethnicity will play a more important role in determining the level of LV ejection fraction than that of adjusted LV mass.

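Distribution of adjusted LVH and adjusted ejection fraction by self-reported ethnicity

Columns 2 to 5 give BSA adjusted LV mass and columns 6 to 9 give LV ejection fraction (EF), each by self-reported ethnicity.

| Statistic | LV mass: AA | LV mass: CA | LV mass: EA | LV mass: HA | EF: AA | EF: CA | EF: EA | EF: HA |
|---|---|---|---|---|---|---|---|---|
| N | 498 | 591 | 544 | 519 | 498 | 591 | 544 | 519 |
| Mean | 79.9 | 73.8 | 75.8 | 80.7 | 68.4 | 72.3 | 68.5 | 68.7 |
| Standard deviation | 17.0 | 13.6 | 15.5 | 17.6 | 7.6 | 6.1 | 7.4 | 7.5 |
| Minimum | 37.4 | 42.2 | 40.4 | 34.6 | 40.6 | 45.3 | 22.2 | 28.1 |
| Q1 | 68.3 | 64.7 | 64.9 | 68.5 | 63.5 | 68.4 | 64.2 | 64.4 |
| Median | 77.9 | 72.1 | 73.8 | 78.0 | 69.0 | 72.9 | 68.6 | 69.9 |
| Q3 | 89.7 | 81.5 | 85.6 | 89.6 | 73.6 | 76.3 | 73.6 | 73.9 |
| Maximum | 146.6 | 180.4 | 163.6 | 153.1 | 88.1 | 81.6 | 86.6 | 84.4 |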
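As a rough cross-check, the share of LV mass variance attributable to SRE can be approximated from the group sizes, means and standard deviations in this table alone. The sketch below is a back-of-the-envelope one-way ANOVA decomposition on the raw scale, not the standardized analysis described in the text, and the function name is ours. For LV mass it gives a value close to the 3% quoted above; the ejection fraction figure depends on a standardization that cannot be recovered from summary statistics, so it is not attempted here.

```python
# Group sizes, means and standard deviations of BSA adjusted LV mass,
# copied from the summary table (order: AA, CA, EA, HA).
N = [498, 591, 544, 519]
MEAN = [79.9, 73.8, 75.8, 80.7]
SD = [17.0, 13.6, 15.5, 17.6]


def variance_explained(n, mean, sd):
    """One-way ANOVA R^2 (eta-squared) from per-group summary statistics."""
    total_n = sum(n)
    grand_mean = sum(ni * mi for ni, mi in zip(n, mean)) / total_n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, mean))
    # Within-group sum of squares recovered from the sample standard deviations.
    ss_within = sum((ni - 1) * si**2 for ni, si in zip(n, sd))
    return ss_between / (ss_between + ss_within)


r2_lv_mass = variance_explained(N, MEAN, SD)
print(f"Share of LV mass variance explained by SRE: {r2_lv_mass:.1%}")
```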

Association between the AIMs and body surface adjusted LV mass and LV ejection fraction

It is known that population stratification can lead to confounding issues in genetic association studies

Type I error associated with the test for association between LV mass and the 96 AIMs

| Control variable | Average type I error | Standard error | Observed minimum type I error | Observed maximum type I error |
|---|---|---|---|---|
| Individual admixture estimates | 0.033 | 0.006 | 0.0105 | 0.042 |
| Principal components | 0.048 | 0.0074 | 0.021 | 0.052 |
| Self-reported ethnicity | 0.037 | 0.006 | 0.021 | 0.052 |
| Ignoring confounding | 0.320 | 0.022 | 0.221 | 0.357 |

Type I error associated with the test for association between LV ejection fraction and the 96 AIMs

| Control variable | Average type I error | Standard error | Observed minimum type I error | Observed maximum type I error |
|---|---|---|---|---|
| Individual admixture estimates | 0.038 | 0.0058 | 0.021 | 0.042 |
| Principal components | 0.048 | 0.007 | 0.021 | 0.052 |
| Self-reported ethnicity | 0.042 | 0.004 | 0.031 | 0.053 |
| Ignoring confounding | 0.705 | 0.009 | 0.694 | 0.737 |

Note that when the confounding effect is ignored the type I error is about 14 times the nominal rate. However, controlling for Self-reported ethnicity alone in this case is sufficient to keep the type I error rate at its nominal value.
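The "about 14 times" figure is simply the ratio of the average observed type I error to the nominal 0.05 level. A minimal sketch of that arithmetic for both phenotypes, using the averages reported in the two tables (the dictionary layout and names are ours):

```python
NOMINAL = 0.05  # nominal significance level used in the association tests

# Average type I error rates copied from the two tables.
AVERAGE_TYPE_I = {
    "LV mass": {
        "Individual admixture estimates": 0.033,
        "Principal components": 0.048,
        "Self-reported ethnicity": 0.037,
        "Ignoring confounding": 0.320,
    },
    "LV ejection fraction": {
        "Individual admixture estimates": 0.038,
        "Principal components": 0.048,
        "Self-reported ethnicity": 0.042,
        "Ignoring confounding": 0.705,
    },
}

# Ratio of observed to nominal rate; values near or below 1 indicate no inflation.
ratios = {
    trait: {control: rate / NOMINAL for control, rate in rates.items()}
    for trait, rates in AVERAGE_TYPE_I.items()
}

for trait, by_control in ratios.items():
    for control, ratio in by_control.items():
        print(f"{trait:22s} {control:32s} {ratio:5.1f} x nominal")
```

For LV mass the unadjusted analysis is inflated by a smaller factor than for ejection fraction, which is consistent with SRE explaining far less of the variation in LV mass.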

As mentioned above, the second set of simulation studies was designed to better understand when SRE might perform well as a control variable in genetic association tests.

These simulations showed that when the confounder is univariate, that is, when there are exactly 2 ancestral populations, some type I error inflation may still occur when SRE is used as a control variable. We observed this inflation even in the absence of misclassification error. If there is a discontinuity point, as would most likely be the case when a sample of African Americans and European Americans is collected, the bigger the gap in the observed ancestry proportion, the smaller the type I error inflation. The inflation rate therefore depends on the composition of the sample. For example, when the study sample comprises admixed individuals derived from intermating between exactly 2 ancestral populations, the type I error inflation can be very small if their ancestry proportions are near the extreme values (near 0 or 1), and becomes substantial if their ancestry proportions are near 50%. Figure

**Empirical type I error observed when self-reported ethnicity (assuming no misclassification error) and individual ancestry proportion are respectively used as control variables in the association test**. Figure 3a shows the effect of the gap in the distribution of individual admixture on type I error rate when controlling for admixture and ethnicity when the effect size is equal to 0.5. One can observe on figure 3a that there is a slight type I error inflation even when true ethnicity is used instead of the true ancestry proportion. This inflation decreases as the gap in the admixture distribution increases. The admixture distribution in this case is univariate as would be observed for an admixed population derived from intermating between exactly 2 ancestral populations. Figure 3b shows higher type I error inflation rate when the effect size is equal to 1.

Observed type I error rates when controlling for individual ancestry estimate, true ethnicity and self-reported ethnicity assuming various misclassification error rates when admixed population results from intermating between exactly two ancestral populations

| Gap | Misclassification rate | Ethnicity | Admixture | SRE |
|---|---|---|---|---|
| 0.05 | 0.05 | 0.067 | 0.046 | 0.402 |
| 0.05 | 0.075 | 0.087 | 0.041 | 0.46 |
| 0.05 | 0.1 | 0.089 | 0.061 | 0.432 |
| 0.05 | 0.125 | 0.083 | 0.059 | 0.42 |
| 0.05 | 0.15 | 0.071 | 0.047 | 0.427 |
| 0.2 | 0.05 | 0.061 | 0.054 | 0.517 |
| 0.2 | 0.075 | 0.075 | 0.059 | 0.52 |
| 0.2 | 0.1 | 0.077 | 0.051 | 0.518 |
| 0.2 | 0.125 | 0.063 | 0.045 | 0.517 |
| 0.2 | 0.15 | 0.067 | 0.055 | 0.492 |
| 0.35 | 0.05 | 0.058 | 0.05 | 0.631 |
| 0.35 | 0.075 | 0.06 | 0.056 | 0.622 |
| 0.35 | 0.1 | 0.067 | 0.051 | 0.657 |
| 0.35 | 0.125 | 0.06 | 0.052 | 0.62 |
| 0.35 | 0.15 | 0.054 | 0.044 | 0.642 |
| 0.5 | 0.05 | 0.064 | 0.066 | 0.752 |
| 0.5 | 0.075 | 0.061 | 0.063 | 0.767 |
| 0.5 | 0.1 | 0.043 | 0.038 | 0.787 |
| 0.5 | 0.125 | 0.05 | 0.043 | 0.783 |
| 0.5 | 0.15 | 0.062 | 0.064 | 0.765 |

The type I error rates observed for simulation 3 are displayed in Figure

**Observed type I error rate when the sample is multi-ethnic**. Controlling for self-reported ethnicity leads to the preset significance level. It seems the value of ancestry proportions matters less.

We ran additional simulation studies in order to better understand the effect of misclassification errors on the type I error rate. We considered the two scenarios described in figures 5a and 5b. The distribution of allele frequencies in the 3 ancestral populations also affected the degree of confounding: when we changed the allele frequency of marker G2 from (0.2, 0.4, 0.6) to (0.4, 0.2, 0.6) to make it independent but identically distributed with G1, the observed type I error rate dropped from 0.165 to 0.07. We repeated the same experiment by keeping the frequency of G1 fixed and changing the frequency of G2 to (0.1, 0.3, 0.5). The observed type I error was 0.18; as before, changing the order of the first 2 components of the allele frequency vector reduced this type I error to 0.065. The type I error associated with the scenario described in figure 5b remained around the preset threshold of 0.05 independently of the choice of allele frequencies in the 3 ancestral populations.

**Boundaries for simulated ancestry proportions when there are 3 ancestral populations**. Figure 5a shows 4 valid regions, and if one decides to assign ethnicity according to the maximum value of the ancestry proportion vector $(q_1, q_2, q_3)$, it is not exactly clear what the correct ethnicity assignment should be for the individuals whose ancestry proportions fall in region IV. There is no such ambiguity in figure 5b.

**Observed type I error rate when there is misclassification error between SRE and admixture**. We observed significant type I error inflation in figure 6a that is due to the possibility of misclassification error described in figure 5a. Once the ambiguity is removed (figure 5b) and no misclassification error is possible, the observed type I error rate is maintained at the preset significance level. The distribution of allele frequencies in the 3 ancestral populations also appears to affect the degree of confounding that occurs, such that the type I error inflation due to misclassification errors is worse in some cases than in others. For example, when we changed the allele frequency of marker G_{2} from (0.2, 0.4, 0.6) to (0.4, 0.2, 0.6) to make it independent but identically distributed with G_{1}, the observed type I error rate dropped from 0.165 to 0.07. We repeated the same experiment by keeping the frequency of G_{1} fixed and changing the frequency of G_{2} to (0.1, 0.3, 0.5). The observed type I error is now 0.18, and again, by changing the order of the first 2 components of the allele frequency vector, this type I error is reduced to 0.065. The type I error associated with the scenario described in figure 5b remains around the preset threshold of 0.05 independently of the choice of allele frequencies in the 3 ancestral populations.

We also considered the effect of misclassification error between self-reported ethnicity and true ethnicity when the study sample is made up of 4 ancestral groups according to their representation in the MESA sample. We looked at the effect of a discrepancy between true and self-reported ethnicity in each ethnic group separately. We observed a small, almost negligible, type I error inflation when a misclassification rate varying between 5% and 15% occurred in exactly one ethnic group for individuals whose self-reported ethnicity was simulated to be European-American, African-American or Hispanic-American. However, significantly higher type I error inflation follows when there are discrepancies between SRE and true ancestry for individuals whose initial ancestry was Asian. This result makes sense intuitively because, as can be seen in Figure

**Effect of misclassification error in SRE in a multi-ethnic sample**. Not all misclassification errors have the same cost. The SRE Chinese American cluster is so well defined in the MESA data that misclassification errors involving them appear to be more costly in terms of the type I error rate.

Discussion

We focused on the utility of self-reported ethnicity as a control variable in genetic and epidemiologic studies. We used data collected in the MESA for LVH traits, specifically, LV mass and ejection fraction, to illustrate our points. LVH is one of the strongest determinants of cardiovascular outcomes. Ethnic differences in the distribution of both LV mass and the LV ejection fraction have been reported in many studies, and we found significant evidence of an ethnicity related effect on these phenotypes in the MESA sample.

We observed a high degree of agreement between self-reported ethnicity and two GBMAs computed using genotyped ancestry informative markers. The self-reported Hispanic-Americans were by far the most heterogeneous group represented in this dataset. This result is not surprising given the current definition of the term "Hispanic" which refers to a group of individuals who are culturally and genetically quite diverse. It is now well accepted that the ancestry distribution of self-reported Hispanics reflects, at different degrees, the genetic contribution of the three ancestral populations Africans, Europeans and Native American

Another factor that may explain the genetic heterogeneity detected among the self-reported Hispanics may be the lack of individuals from the Native American ancestral groups represented in the sample. The initial panel of ancestry informative markers used in the MESA study was chosen based on their capacity to distinguish between individuals of Chinese, African and European ancestry. This panel might not be adequate to detect subtle variation between individuals who self identified as Hispanics. Following this analysis, a new panel of markers known to be particularly informative for Hispanics was typed in an effort to better understand the observed variation in the estimation of ancestry in this ethnic group. However, judging by the observed type I error rates, it appears that ancestry proportions estimated with the current marker panel work well as control variables in all the association tests that we have considered.

We could not directly evaluate the type I error on the original sample since it is not known which markers are really under the null hypothesis in that sample. Nevertheless, self-reported ethnicity appeared to be as effective a control variable against population stratification and admixture as the genetic background scores and the estimated individual ancestry proportions, since we observed significant agreement between the sets of markers that show significant p-values regardless of the control variable selected for the analysis. The plasmode analysis showed similar results. The type I error was kept at its appropriate level independently of the choice of control variable, and did show significant inflation when none of them was included in the model. We should note that we did observe a stronger correlation between the p-values obtained with the two control variables estimated from the AIMs than between the p-values obtained with either of the genetic based measures and those obtained with SRE.

The simulation study shows that, when the number of ancestral populations is equal to 2, even controlling for an individual's true ethnicity might lead to significant type I error inflation, depending on the composition of the study sample. As the gap value increased, the performance of true ethnicity improved, and the type I error rate got very close to the nominal level when the gap was equal to 0.5. However, controlling for the genetic based measure of ancestry led to the correct type I error rates independently of the value of the gap. It is rarely the case that a study participant will report their ethnicity without errors. Self-reported ethnicity errors may occur for various reasons: some people may not be fully aware of their true ethnicity, while others may identify with one ethnic group despite their admixed background. Therefore, the use of SRE as a control variable when

Methods

The MESA study was designed to investigate the determinants and progression of subclinical cardiovascular diseases in 4 ethnic groups enrolled from six geographic regions in the United States

As part of a MESA genotyping project, ninety-six AIMs were initially genotyped on more than 2,848 participants. These AIMs were drawn from an Illumina proprietary SNP database and were chosen to maximize the difference in allele frequencies between the following pairs of ancestral populations: African vs. Chinese, African vs. European and Chinese vs. European.

Utilizing a plasmode generation approach, we developed a resampling procedure, which generates datasets where the null hypothesis of no association holds between each marker and the phenotypes of interest. A plasmode is similar to a simulated dataset; however, a plasmode dataset uses genotypic and phenotypic information observed in a study to construct pseudo-phenotypes so that the 'truth' of the data generating process is known

In the remainder of this section, we discuss the clustering method that was used to group MESA participants based on their observed genotypes, present the statistical approach retained for testing for association between each AIM and the 2 phenotypes of interest, and describe the simulation procedures in detail.

Classification scheme

We used the 'cluster' and 'tree' procedures in the SAS software (version 9.1) to create 4 clusters based on the principal components computed from the 96 AIMs. We applied Ward's minimum variance method and the K-means clustering algorithm to identify clusters of individuals who have similar ancestry proportions. To assess the agreement between self-reported ancestry and these clusters, we used Cohen's weighted kappa

Genetic association tests

We examined all the available AIMs for association with both body surface-adjusted (BSA) LV mass and LV ejection fraction using 3 different control variables: SRE, ancestry proportion estimates computed using STRUCTURE, and genetic background measures computed using principal component analysis. That is, we implemented multivariate linear regression, including as covariates (1) SRE as a categorical variable, (2) the proportions of African, Chinese and European ancestry estimated using STRUCTURE, or (3) the first 4 principal components computed using the AIMs data. We also included gender, income, education level, smoking history, alcohol use, systolic blood pressure, diastolic blood pressure, body mass index, and waist circumference as covariates in each model. Both analyses are conducted using generalized linear models. They also both fall under the structured association test (SAT) framework, which consists of testing for genetic association while controlling for a genetic based measure of ancestral background
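Written out, the three adjusted models described above have roughly the following form. The notation is illustrative and is an assumption rather than the original's: $Y$ is the phenotype, $G_m$ the genotype at the $m^{th}$ AIM, $X$ the vector of non-genetic covariates, $\mathrm{SRE}_k$ indicator variables for 3 of the 4 self-reported groups, $q_k$ the STRUCTURE ancestry proportions, and $\mathrm{PC}_k$ the principal components.

```latex
\begin{align*}
\text{(1) SRE:} \quad & Y = \beta_0 + \sum_{k=1}^{3} \gamma_k\,\mathrm{SRE}_k + X^{\top}\delta + \beta_m G_m + \varepsilon \\
\text{(2) Ancestry proportions:} \quad & Y = \beta_0 + \sum_{k=1}^{3} \gamma_k\, q_k + X^{\top}\delta + \beta_m G_m + \varepsilon \\
\text{(3) Principal components:} \quad & Y = \beta_0 + \sum_{k=1}^{4} \gamma_k\,\mathrm{PC}_k + X^{\top}\delta + \beta_m G_m + \varepsilon
\end{align*}
```

In each case the association test is a test of $H_0\!: \beta_m = 0$, and the three models differ only in the terms used to absorb ancestry-related confounding.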

Simulation studies

The first set of simulations is a resampling procedure that guarantees that the control variables are being compared under the null hypothesis of no association between each marker and the phenotype of interest. The second set of simulations seeks to expand upon the results observed under the first set of simulations by identifying the conditions under which accounting for ethnicity (either true or self-reported) is likely to provide appropriate type I error control and evaluating the effect of misclassification errors of SRE in genetic association studies.

Resampling procedure

Let […]

Simulation 1

For […] {
    For […] {
        Regress […] the […]-th […]
        For […] {
            compute […] the […]-th […]
        }
        Sort the […]
        Compute a new pseudo-phenotype […]
        For […] {
            If ( […] )
            Regress […]
            Compute the Wald test p-value for the regression coefficient of […]_m
        }
    }
    Compute […]
}
Compute […]_iter […]
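The details of the resampling steps are not recoverable from the listing above, so the sketch below should not be read as the procedure used in the study. It shows one common way to build a pseudo-phenotype under the null, swapped in here purely for illustration: regress the observed phenotype on the control variables, then add randomly permuted residuals back to the fitted values. The resulting phenotype keeps its relationship with ancestry but has no direct relationship with any marker. The function name and the toy data are hypothetical.

```python
import numpy as np


def make_pseudo_phenotype(y, controls, rng):
    """Keep the fitted control-variable effect and shuffle the residuals.

    Any marker-phenotype association that remains after adjusting for
    `controls` is then null by construction.
    """
    X = np.column_stack([np.ones(len(y)), controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    residuals = y - fitted
    return fitted + rng.permutation(residuals)


# Toy data (hypothetical): two ancestry-like controls drive the phenotype.
rng = np.random.default_rng(0)
controls = rng.normal(size=(500, 2))
y = 1.0 + controls @ np.array([2.0, -1.0]) + rng.normal(size=500)
y_null = make_pseudo_phenotype(y, controls, rng)
```

Testing each marker against `y_null` with and without the control variables then gives empirical type I error rates of the kind reported in the Results, since every rejection is a false positive.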

Description of the second set of simulation procedures

As can be seen in Tables

First, we considered the case where the confounder is univariate. This could arise when the study sample comprises admixed individuals born from intermating between exactly two ancestral populations. We evaluated how the performance of SRE as a control variable depends on the distribution of ancestry proportions in the sample. Specifically, we wanted to see how the continuity and the size of the gap in a discontinuous ancestry proportion distribution would affect the performance of SRE. Note that a continuous distribution would have a gap of zero. The gap for a discontinuous distribution is defined as the range of the discontinuity region. For example, admixture is an ongoing process, and a sample of admixed individuals can comprise individuals who are at different stages of the admixture process. Therefore, it is possible to recruit a sample that can be divided into 2 subsets of individuals: one with a very high level of European ancestry and the other with a very low level of European ancestry. If the minimum European ancestry in the first subset is 0.8 and the maximum European ancestry in the second subset is 0.2, then the gap value would simply be (0.8 - 0.2) = 0.6. We also looked at the effect of various misclassification rates in these association tests.

Simulation 2 (univariate ancestry proportion distribution (K = 2))

Let […]

For gap = 0.05 to 0.5 by 0.05 {

    Draw individual ancestry proportion […]

    Set true ethnicity to 1 if […]

    Let […]

    For […] {

        Draw the random variable […]

        If ( […] ) […]

        else change the true ethnicity so that a misclassification occurs.

    }

    Compute […], where $p_{s1}$ and $p_{s2}$ represent the allele frequency of the $s^{th}$ marker in the two ancestral populations, $\mathbf{1}_N$ is a vector of ones, and […]. Draw $G_1$ from […] and $G_2$ from […]; $G_1$ and $G_2$ are independent conditional on the individual ancestry. We will use $G_1$ to generate the trait and $G_2$ to test for genetic association.

    Compute $Y = \alpha_0 + \alpha_1 G_1 + \varepsilon$, where $\varepsilon \sim N(\mathbf{0}, \sigma^2 I)$.

    ## Note that we set […]_{1} and […]

    ## Note that conditional on the individual ancestry, the random variable $G_2$ […]

    Fit the following 3 linear regressions, each of the form $Y = \beta_0 + \beta_1 C + \beta_2 G_2 + \varepsilon$, where the control variable $C$ is in turn the individual ancestry proportion, the true ethnicity, and the self-reported ethnicity.

    Test whether $\beta_2$ is statistically different from 0 at the 0.05 level in each case.

    ## A statistically significant association observed between $G_2$ and the trait is counted as a type I error.

}

Repeat the experiment 10,000 times for each configuration of the gap value and the misclassification rate and count the proportion of times that the parameter _{2 }is statistically significant for each control variable. These proportions for the ancestry proportion and SRE regressions are shown in Figure
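The logic of this simulation can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not a reproduction of the study: ancestry is drawn uniformly outside a central excluded band of width `gap`, true ethnicity is ancestry dichotomized at 0.5 with no misclassification, the allele frequencies, sample size and number of replicates are arbitrary choices of ours, and the Wald test uses a normal approximation.

```python
import math

import numpy as np


def wald_p_value(y, X, j):
    """Two-sided Wald p-value (normal approximation) for OLS coefficient j."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    z = beta[j] / math.sqrt(cov[j, j])
    return math.erfc(abs(z) / math.sqrt(2.0))


def type1_rates(gap, effect=0.5, n=2000, reps=200, seed=1,
                freq_g1=(0.1, 0.9), freq_g2=(0.2, 0.8)):
    """Rejection rates for G2 when adjusting for ethnicity vs. ancestry."""
    rng = np.random.default_rng(seed)
    hits_eth = hits_anc = 0
    for _ in range(reps):
        # Ancestry is uniform outside a central excluded band of width `gap`.
        lower = rng.uniform(0.0, 0.5 - gap / 2.0, n)
        q = np.where(rng.random(n) < 0.5, lower, 1.0 - lower)
        ethnicity = (q > 0.5).astype(float)  # true ethnicity, no misclassification
        # Individual allele frequencies are ancestry-weighted averages.
        p1 = q * freq_g1[0] + (1.0 - q) * freq_g1[1]
        p2 = q * freq_g2[0] + (1.0 - q) * freq_g2[1]
        g1 = rng.binomial(2, p1)
        g2 = rng.binomial(2, p2)
        y = effect * g1 + rng.normal(size=n)  # G2 has no direct effect on y
        ones = np.ones(n)
        hits_eth += wald_p_value(y, np.column_stack([ones, ethnicity, g2]), 2) < 0.05
        hits_anc += wald_p_value(y, np.column_stack([ones, q, g2]), 2) < 0.05
    return hits_eth / reps, hits_anc / reps


if __name__ == "__main__":
    for gap in (0.0, 0.3, 0.6):
        eth, anc = type1_rates(gap)
        print(f"gap={gap:.1f}  ethnicity={eth:.3f}  ancestry={anc:.3f}")
```

Because `g2` never enters the trait model, every rejection is a false positive. Under these assumptions one would expect the ethnicity-adjusted rate to exceed the ancestry-adjusted rate when the gap is small, and the two to converge as the gap widens, which is the pattern described in the Results.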

Simulation 3 (multivariate ancestry proportion distribution (K > 2))

We wanted to determine how SRE, when used as a control variable against population stratification, would perform in a multivariate setting, that is, when the number of ancestral populations is greater than 2. This simulation procedure resembles the previous one, except that the individual ancestry proportions are drawn from a Dirichlet distribution. We used a different set of parameters for each ethnic group in order to create the conditions needed for confounding to occur. The parameters used to generate the Dirichlet distribution can be represented by a 4 × 4 matrix, where the rows represent the expected individual ancestry proportions in each ethnic group, and for a fixed row, the columns represent the expected individual ancestry proportion from each ancestral population.

1) Let _{k }

2) Let the SRE for an individual in the ^{th }

3) We considered 4 cases; the parameters used for the Dirichlet distribution in each case are chosen as follows:

(a) Proportions that are very close to ancestry proportion estimates obtained with STRUCTURE in the MESA sample;

(b) Values near a 4 × 4 identity matrix with diagonal elements equal to 0.97 and off diagonal elements equal to 0.01;

(c) Values very close to what would be observed in equally admixed individuals, that is, all proportions are set to 0.25;

(d) Different ancestry proportions where the contribution of one specific ancestral population is clearly greater in each admixed population. That is the diagonal elements are set to 0.55 and off diagonal elements at 0.15 and made sure that the row and column sums are equal to 1.

4) Let $p_1 = (0.1, 0.5, 0.3, 0.9)$ and $p_2 = (0.05, 0.25, 0.50, 0.75)$ be the frequency of the reference allele of the markers $G_1$ and $G_2$ respectively in each ancestral population. These frequencies were chosen arbitrarily with the only constraint being that they vary greatly between the 4 ancestral populations. Confounding will occur if the distribution of the trait is also different in the 4 ancestral populations.

5) The allele frequencies of $G_1$ and $G_2$ in the admixed population […]

6) Draw $G_1$ and $G_2$ according to these averages.

7) The outcome variable […] $\sigma^2$ represents the pooled variance.

8) The remaining steps are similar to those taken in simulation 2. That is, we fitted the following linear regressions:

1. $Y = \beta_0 + \beta_1 q_1 + \beta_2 q_2 + \beta_3 q_3 + \beta_4 G_2 + \varepsilon$ […]

2. $Y = \gamma_0 + \gamma_1\,\mathrm{SRE} + \gamma_2 G_2 + \varepsilon$ […]

We then tested whether $\beta_4$ and $\gamma_2$ are statistically different from 0 at the 0.05 level in each model, and repeated each experiment 10,000 times. The results of this simulation procedure can be seen in Figure
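The data-generating part of this simulation (steps 3 to 6) can be sketched as follows. The allele frequencies and the case (d) matrix are taken from the text; the Dirichlet concentration, the group size and the use of ancestry-weighted averages for the admixed allele frequencies are assumptions made for illustration.

```python
import numpy as np

# Reference-allele frequencies in the 4 ancestral populations (step 4).
FREQ_G1 = np.array([0.1, 0.5, 0.3, 0.9])
FREQ_G2 = np.array([0.05, 0.25, 0.50, 0.75])

# Case (d): expected ancestry per ethnic group, 0.55 on the diagonal, 0.15 elsewhere.
EXPECTED = np.full((4, 4), 0.15) + 0.40 * np.eye(4)


def simulate_group(expected_row, n, concentration, rng):
    """Draw ancestry vectors and genotypes for one ethnic group."""
    # Dirichlet draws whose mean is `expected_row`; each row of q sums to 1.
    q = rng.dirichlet(concentration * expected_row, size=n)
    p1 = q @ FREQ_G1  # ancestry-weighted allele frequency of G1
    p2 = q @ FREQ_G2  # ancestry-weighted allele frequency of G2
    g1 = rng.binomial(2, p1)
    g2 = rng.binomial(2, p2)
    return q, g1, g2


def simulate_sample(n_per_group=500, concentration=20.0, seed=0):
    """Simulate a 4-group sample; SRE equals the group label (no misclassification)."""
    rng = np.random.default_rng(seed)
    parts = [simulate_group(row, n_per_group, concentration, rng) for row in EXPECTED]
    q = np.vstack([part[0] for part in parts])
    g1 = np.concatenate([part[1] for part in parts])
    g2 = np.concatenate([part[2] for part in parts])
    sre = np.repeat(np.arange(4), n_per_group)
    return q, g1, g2, sre
```

A trait generated from `g1` can then be regressed on `g2` with either the ancestry proportions or the `sre` labels as control variables, as in the two models listed above. A smaller `concentration` spreads individuals further from their group's expected ancestry, which is one way to vary how much within-group heterogeneity SRE fails to capture.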

To better understand when the use of SRE as a control variable may fail, we devised a situation where it may be unclear which ethnicity to assign to individuals whose ancestry proportions take specific values. To facilitate the graphical representation of each scenario, we focused on the case where there are exactly 3 ancestral populations. In this case, an individual's ancestry proportion can be represented by a vector with 3 components, $q_1$, $q_2$ and $q_3$, such that […] $q_1 + q_2 \le 1$, define 3 or 4 specific regions. Figure 5a shows 4 valid regions, and if one decides to assign ethnicity according to the maximum value of the vector $(q_1, q_2, q_3)$, it is not exactly clear what the correct ethnicity assignment should be for the individuals whose ancestry proportions fall in region IV. There is no such ambiguity in Figure 5b.

The simulation steps are similar to those described in simulation 2, except that there are now two gap values: one for $q_1$ and one for $q_2$. The ancestry proportions are each drawn independently from a uniform distribution, which has been rescaled such that the proportions add up to 1. We then excluded the ancestry proportions that fell in the regions defined by the gaps. In Figure 5a, we excluded values of $q_1$ and $q_2$ that fall between 0.1 and 0.3. The range of excluded values went from 0.1 to 0.5 in 5b. As in all previous cases, we considered 2 markers, G1 and G2. We used G1 to simulate the trait, and G2 to test for association with the simulated trait. Any significant association is counted as a type I error. We use the vector (0.4, 0.2, 0.6) as the allele frequency in the 3 ancestral populations for G1 and (0.2, 0.4, 0.6) for G2. We also considered various effect sizes for evaluating the contribution of admixture in the confounding pathway. We also changed the allele frequencies vector to account for the fact that […]_{1} and […]_{2} […] in the model. The error term in all models is drawn from a normal distribution with mean 0 and variance 1 such that the effect size associated with […]_{1} and […]_{2} are equal to

Simulation 4 (effect of misclassification error on SRE when K = 4)

As can be seen in Figure

1) Let _{k }

2) Let the true ethnicity of any individual in the ^{th }

3) Draw _{k }
_{k}
_{k }

That is, (**0.09, 0.02, 0.84, 0.05**) for the African-Americans, (**0.89, 0.02, 0.02, 0.07**) for the European-Americans, (**0.24, 0.05, 0.14, 0.57**) for the Hispanic-Americans and (**0.01, 0.97, 0.01, 0.01**) for the Chinese-Americans.

To evaluate the effect of misclassification errors on the performance of SRE as a control variable, the true ethnicity of a fraction

4) We let

5) The remaining simulation steps are similar to those taken in simulation 3, and are not repeated here.
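The misclassification step itself is truncated in the description above, so the mechanism below is an assumption: a fixed fraction of the individuals in one true ethnic group is chosen at random, and each of them reports one of the other groups instead. The function name, the group sizes and the seed are hypothetical.

```python
import random


def misclassify(labels, group, rate, other_groups, rng):
    """Return a copy of `labels` in which a fraction `rate` of the
    individuals in `group` report one of `other_groups` instead."""
    reported = list(labels)
    members = [i for i, label in enumerate(labels) if label == group]
    n_flip = round(rate * len(members))
    for i in rng.sample(members, n_flip):
        reported[i] = rng.choice(other_groups)
    return reported


# Hypothetical sample with 700 individuals per true ethnic group.
rng = random.Random(42)
true_ethnicity = ["AA"] * 700 + ["CA"] * 700 + ["EA"] * 700 + ["HA"] * 700
sre = misclassify(true_ethnicity, "CA", 0.10, ["AA", "EA", "HA"], rng)

n_changed = sum(t != s for t, s in zip(true_ethnicity, sre))
print(f"{n_changed} of 700 Chinese-American labels were changed")
```

Applying the function to one group at a time, with rates between 5% and 15%, mirrors the design described in the Results, where errors are introduced in exactly one ethnic group per experiment.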

List of abbreviations

MESA: Multi-Ethnic Study of Atherosclerosis; SRE: self-reported ethnicity; GBMA: genetic based measures of ancestry; AIMs: ancestry informative markers; IAP: individual ancestry proportion; GBS: genetic background score; AA: African-Americans; CA: Chinese-Americans; EA: European-Americans; HA: Hispanic-Americans; LVH: left ventricular hypertrophy; LV: left ventricular; MRI: magnetic resonance imaging; BSA: body surface-adjusted; SAT: structured association test.

Authors' contributions

JD conceived the manuscript, conducted the analyses, interpreted the results and drafted the manuscript. DTR and DBA helped conceive the simulation experiments, interpret the results and edit the manuscript. KMR, LKV and MAP helped during the analyses and reviewed the manuscript. DAB, JHY and DKA made the data available, helped interpret the results and reviewed the manuscript. All authors have read and approved the manuscript.

Acknowledgements

We thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutions can be found at