Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle upon Tyne, NE1 3BZ, UK

Abstract

Background

Here we present two new computer tools, PREMIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. PREMIM allows the extraction of child-parent genotype data from standard-format pedigree data files, while EMIM uses the extracted genotype data to perform subsequent statistical analysis. The use of genotype data from the parents as well as from the child in question allows the estimation of complex genetic effects such as maternal genotype effects, maternal-foetal interactions and parent-of-origin (imprinting) effects. These effects are estimated by EMIM, incorporating chosen assumptions such as Hardy-Weinberg equilibrium or exchangeability of parental matings as required.

Results

In application to simulated data, we show that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed and ease of execution.

Conclusions

Together, EMIM and PREMIM provide easy-to-use command-line tools for the analysis of pedigree data, giving unbiased estimates of parental and child genotype relative risks.

Background

Genome-wide association studies have popularized the use of the case/control design to detect effects associated with an individual’s own genotype. However, many diseases (especially those related to pregnancy outcomes) may in fact be due to more complex effects such as maternal genotype effects, maternal-fetal genotype interactions or parent-of-origin (imprinting) effects. To detect such effects it is necessary to collect genotype data from one or both parents of cases, in addition to genotyping the cases themselves. Two existing popular approaches analyse either genetic data from affected offspring and their mothers (case/mother duos), along with an appropriate control sample

Full details and evaluation of the multinomial modelling approach used by EMIM have been described previously

PREMIM: Pedigree file conversion

For each SNP in turn, PREMIM performs a simple algorithm to select from each pedigree the most informative sub-unit of child-parent genotype data. Different pedigree sub-units are chosen in order of preference as listed in Table

| **Order** | **Pedigree sub-unit** |
|---|---|
| 1 | case/parent trio |
| 2 | case/mother duo |
| 3 | case/father duo |
| 4 | case |
| 5 | case parental mating |
| 6 | case mother |
| 7 | case father |
| 8 | control parental mating |
| 9 | control/mother duo |
| 10 | control/father duo |
| 11 | control |
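The order of preference in the table can be sketched in a few lines of Python. This is only an illustration of the ranking, not PREMIM's actual C++ implementation: the function name `select_subunit` and its boolean arguments are hypothetical, and the sketch considers a single child with its two parents at a single SNP.

```python
from typing import Optional


def select_subunit(affected: bool, child: bool, mother: bool, father: bool) -> Optional[str]:
    """Return the most preferred sub-unit for one child at one SNP.

    `affected` is the child's disease status; `child`, `mother` and `father`
    state whether each person's genotype is available at this SNP.
    (Hypothetical interface, used here for illustration only.)
    """
    if affected:
        ordered = [
            ("case/parent trio", child and mother and father),
            ("case/mother duo", child and mother),
            ("case/father duo", child and father),
            ("case", child),
            ("case parental mating", mother and father),
            ("case mother", mother),
            ("case father", father),
        ]
    else:
        ordered = [
            ("control parental mating", mother and father),
            ("control/mother duo", child and mother),
            ("control/father duo", child and father),
            ("control", child),
        ]
    # Walk down the preference list and keep the first available sub-unit.
    for name, available in ordered:
        if available:
            return name
    return None  # nothing usable at this SNP
```

For example, an affected child whose father is ungenotyped at a SNP yields a case/mother duo, whereas an unaffected child with both parents genotyped yields a control parental mating, because that sub-unit ranks above the control duos.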

There are a number of options that may be given to PREMIM. In particular, it is possible to override the default choice of individuals by stating a proband subject for certain pedigrees. These proband subjects are then chosen as cases (with parents where available). This may be useful to avoid possible bias when larger pedigrees have been ascertained on the basis of a specific affected individual. For larger pedigrees, it is also possible to select multiple case/parent trios or multiple control matings from each pedigree, potentially increasing the power to detect genetic effects. This option does have the potential to generate bias (depending on the analysis options chosen

EMIM methodology

The basic principle behind EMIM is simple: to test for the existence of (and estimate) genotype relative risk parameters that increase (or decrease) the probability that a child is affected. By default, PREMIM chooses the minor allele to be considered as the ‘risk’ allele, although this option can be overridden if required. We denote by $R_1$ ($R_2$) the factor by which an individual’s disease risk is multiplied if they possess one (two) risk alleles at a given locus. We denote by $S_1$ ($S_2$) the factor by which an individual’s disease risk is multiplied if their mother possesses one (two) risk alleles at that locus. We denote by $I_m$ ($I_p$) the factor by which an individual’s disease risk is multiplied if they inherit a risk allele from their mother (father). Lastly, to test for mother-child interactions, we denote by $\gamma_{ij}$ the factor by which an individual’s disease risk is multiplied if the mother carries

| **Parameter** | **Description** |
|---|---|
| $R_1$ | Child has one minor allele (child genotype effect) |
| $R_2$ | Child has two minor alleles (child genotype effect) |
| $S_1$ | Mother has one minor allele (maternal genotype effect) |
| $S_2$ | Mother has two minor alleles (maternal genotype effect) |
| $\gamma_{11}$ | Mother has one minor allele and child has one minor allele (mother-child interaction effect) |
| $\gamma_{12}$ | Mother has one minor allele and child has two minor alleles (mother-child interaction effect) |
| $\gamma_{21}$ | Mother has two minor alleles and child has one minor allele (mother-child interaction effect) |
| $\gamma_{22}$ | Mother has two minor alleles and child has two minor alleles (mother-child interaction effect) |
| $I_m$ | The child receives a minor allele from the mother (maternally operating imprinting effect) |
| $I_p$ | The child receives a minor allele from the father (paternally operating imprinting effect) |

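As an example, denote the major and minor alleles by 1 and 2. Then, for a case/parent trio where the genotypes of the mother, father and child are 22, 11 and 12, respectively, the penetrance is modelled as:

$$P(\text{disease} \mid g_m, g_f, g_c) \propto R_1 \, S_2 \, I_m \, \gamma_{21}$$

where $g_m$, $g_f$ and $g_c$ are the genotypes of the mother, father and child. The four factors are the child genotype effect for one risk allele ($R_1$), the maternal genotype effect for two risk alleles ($S_2$), the imprinting effect for a maternally inherited risk allele ($I_m$) and the mother-child interaction effect for two maternal and one child risk allele ($\gamma_{21}$).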
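The way the relative risk factors combine for a given trio can be sketched as follows. This is a minimal illustration under stated assumptions, not EMIM's FORTRAN code: the function `relative_risk`, the dictionary keys (`R1`, `S2`, `Im`, `g21` and so on) and the numerical values are all hypothetical, and the parental origin of each child allele is assumed to be known.

```python
import math


def relative_risk(mother_alleles, from_mother, from_father, params):
    """Product of relative-risk factors for one child.

    mother_alleles: number of risk alleles carried by the mother (0, 1 or 2).
    from_mother / from_father: 1 if the child inherited a risk allele from
        that parent, otherwise 0 (parental origin assumed known).
    params: dict of factors; any factor that is absent is treated as 1.
    """
    child_alleles = from_mother + from_father
    factor = 1.0
    if child_alleles:
        factor *= params.get(f"R{child_alleles}", 1.0)   # child genotype effect
    if mother_alleles:
        factor *= params.get(f"S{mother_alleles}", 1.0)  # maternal genotype effect
    if from_mother:
        factor *= params.get("Im", 1.0)                  # maternal imprinting effect
    if from_father:
        factor *= params.get("Ip", 1.0)                  # paternal imprinting effect
    if child_alleles and mother_alleles:
        # mother-child interaction, indexed (mother, child)
        factor *= params.get(f"g{mother_alleles}{child_alleles}", 1.0)
    return factor


# Illustrative values only (not estimates from the paper).
example = {"R1": 1.5, "R2": 2.25, "S1": 2.0, "S2": 3.0, "Im": 1.8, "g21": 1.2}
# Mother 22, father 11, child 12: the risk allele must come from the mother.
rr = relative_risk(mother_alleles=2, from_mother=1, from_father=0, params=example)
assert math.isclose(rr, 1.5 * 3.0 * 1.8 * 1.2)  # approximately 9.72
```

The product is relative to a baseline risk, which is why only the ratio of penetrances between genotype combinations matters in this sketch.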

EMIM uses a multinomial model to estimate the relative risk parameters on the basis of observed counts of genotype combinations in case/parent trios as shown in Table. The full set of relative risk parameters is ($R_1$, $R_2$, $S_1$, $S_2$, $I_m$, $I_p$, $\gamma_{11}$, $\gamma_{12}$, $\gamma_{21}$, $\gamma_{22}$). A maximum of 7 parameters are estimable, meaning that not all of these parameters can be estimated simultaneously. Cordell et al.

where $\mu_1, \ldots, \mu_6$ (corresponding to mating type stratification parameters as indexed in Table

| **$g_m$**^{a} | **$g_f$**^{a} | **$g_c$**^{a} | **Index of combination** | **Index of CEPG^{b} parental mating type** | **Index of CPG^{c} parental mating type** | **Observed count** |
|---|---|---|---|---|---|---|
| 22 | 22 | 22 | 1 | 1 | 1 | $n_1$ |
| 22 | 12 | 22 | 2 | 2 | 2 | $n_2$ |
| 22 | 12 | 12 | 3 | 2 | 2 | $n_3$ |
| 12 | 22 | 22 | 4 | 2 | 3 | $n_4$ |
| 12 | 22 | 12 | 5 | 2 | 3 | $n_5$ |
| 22 | 11 | 12 | 6 | 3 | 4 | $n_6$ |
| 11 | 22 | 12 | 7 | 3 | 5 | $n_7$ |
| 12 | 12 | 22 | 8 | 4 | 6 | $n_8$ |
| 12 | 12 | 12 | 9 | 4 | 6 | $n_9$ |
| 12 | 12 | 11 | 10 | 4 | 6 | $n_{10}$ |
| 12 | 11 | 12 | 11 | 5 | 7 | $n_{11}$ |
| 12 | 11 | 11 | 12 | 5 | 7 | $n_{12}$ |
| 11 | 12 | 12 | 13 | 5 | 8 | $n_{13}$ |
| 11 | 12 | 11 | 14 | 5 | 8 | $n_{14}$ |
| 11 | 11 | 11 | 15 | 6 | 9 | $n_{15}$ |

^{a} $g_m$, $g_f$, $g_c$ = genotypes of mother, father, child, respectively.

^{b} CEPG = conditional on exchangeable parental genotypes.

^{c} CPG = conditional on parental genotypes.

If any of the subjects are missing, we no longer have 15 genotype counts as shown in Table

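| **$g_m$**^{a} | **$g_c$**^{a} | **Index of combination** | **Observed count** |
|---|---|---|---|
| 22 | 22 | 1 | $n^{\text{duo}}_1 = n_1 + n_2$ |
| 22 | 12 | 2 | $n^{\text{duo}}_2 = n_3 + n_6$ |
| 12 | 22 | 3 | $n^{\text{duo}}_3 = n_4 + n_8$ |
| 12 | 12 | 4 | $n^{\text{duo}}_4 = n_5 + n_9 + n_{11}$ |
| 12 | 11 | 5 | $n^{\text{duo}}_5 = n_{10} + n_{12}$ |
| 11 | 12 | 6 | $n^{\text{duo}}_6 = n_7 + n_{13}$ |
| 11 | 11 | 7 | $n^{\text{duo}}_7 = n_{14} + n_{15}$ |

^{a} $g_m$, $g_c$ = genotypes of mother, child, respectively.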

where

In practice, at any given SNP, we observe genotype counts (some of which may equal 0) for the following types of unit: case/parent trios (15 possible genotype combinations); parents of cases (9 possible genotype combinations); case/mother duos (7 possible combinations); case/father duos (7 possible combinations); mothers of cases (3 possible combinations); fathers of cases (3 possible combinations); cases (3 possible combinations). The data for each unit creates a table corresponding to a (possibly collapsed) version of Table
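The collapsing of the 15 trio cells into smaller tables, and the numbers of genotype combinations quoted above, can be checked with a short sketch. It is an illustration only: the names `TRIO_CELLS`, `MEMBER` and `collapse` are hypothetical, and the sketch sums counts over the unobserved family members rather than reproducing EMIM's likelihood calculation.

```python
from collections import Counter

# The 15 (mother, father, child) genotype combinations, in index order 1..15.
TRIO_CELLS = [
    ("22", "22", "22"), ("22", "12", "22"), ("22", "12", "12"),
    ("12", "22", "22"), ("12", "22", "12"), ("22", "11", "12"),
    ("11", "22", "12"), ("12", "12", "22"), ("12", "12", "12"),
    ("12", "12", "11"), ("12", "11", "12"), ("12", "11", "11"),
    ("11", "12", "12"), ("11", "12", "11"), ("11", "11", "11"),
]
MEMBER = {"mother": 0, "father": 1, "child": 2}


def collapse(trio_counts, keep):
    """Sum the 15 trio counts over every family member not listed in `keep`."""
    idx = [MEMBER[name] for name in keep]
    out = Counter()
    for cell, n in zip(TRIO_CELLS, trio_counts):
        out[tuple(cell[i] for i in idx)] += n
    return dict(out)


# Using the cell index itself as a dummy count makes the sums easy to read.
duo = collapse(range(1, 16), ["mother", "child"])
print(duo[("12", "12")])  # cells 5 + 9 + 11 = 25

# Number of distinct genotype combinations for each type of unit.
for keep in (["mother", "father"], ["mother", "child"], ["father", "child"],
             ["mother"], ["father"], ["child"]):
    print(keep, len(collapse([0] * 15, keep)))
```

The loop reports 9 combinations for parental matings, 7 for each kind of duo and 3 for a single individual, matching the counts listed in the paragraph above.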

By default, EMIM assumes ‘mating symmetry’, $P(g_m = a, g_f = b) = P(g_m = b, g_f = a)$, with mating type stratification parameters $\mu_1, \ldots, \mu_6$ (see Table

1. A model that assumes parental allelic exchangeability (PAE) ($\mu_4 = \mu_3$)

2. A model that assumes Hardy-Weinberg equilibrium (HWE) and random mating, estimating a single allele frequency parameter in place of the six mating type stratification parameters.
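To illustrate how a single allele frequency can stand in for six mating type parameters under the second model, the following sketch computes the probabilities of the six unordered parental mating types under HWE and random mating. The function name `hwe_mating_type_probs` is hypothetical, and the scaling EMIM applies to its own stratification parameters may differ from these raw probabilities.

```python
import math


def hwe_mating_type_probs(q):
    """Probabilities of the six unordered parental mating types.

    q is the frequency of allele 2; genotype frequencies follow HWE and
    parents are assumed to mate at random.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p = 1.0 - q
    geno = {"22": q * q, "12": 2.0 * p * q, "11": p * p}
    # Same order as the CEPG mating type index (1 to 6).
    types = [("22", "22"), ("22", "12"), ("22", "11"),
             ("12", "12"), ("12", "11"), ("11", "11")]
    probs = {}
    for a, b in types:
        pr = geno[a] * geno[b]
        if a != b:
            pr *= 2.0  # either parent may carry either genotype
        probs[(a, b)] = pr
    return probs


probs = hwe_mating_type_probs(0.3)
assert math.isclose(sum(probs.values()), 1.0)
```

All six values are functions of `q` alone, which is the sense in which one parameter replaces six in this sketch.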

In addition to these more restricted models, a less restricted ‘conditional on parental genotypes’ (CPG) ($\mu_1, \ldots, \mu_9$, see Table

EMIM reads in genotype data from input files created by PREMIM. In addition, there are two other files required by EMIM. Firstly, a file ‘emimmarkers.dat’, which provides the minor allele frequencies for each SNP (used as starting values in the maximization algorithm). These can optionally be estimated by PREMIM using the pedigree data, although other (e.g. population-based) sources for this information may be preferred where available. (See

Implementation

PREMIM is written in C++. For a binary pedigree file with 913 pedigrees, 1730 subjects and 45323 SNPs, it takes 19 seconds to process on a Six-Core AMD Opteron™ Processor with 2.6 GHz CPUs. EMIM is written in FORTRAN 77 and makes use of a subroutine MAXFUN, originally written as part of the S.A.G.E.

Results and discussion

Example analysis using simulated data

We used the program SimPed to simulate data with child genotype effects ($R_1 = 1.5$ and $R_2 = 2.25$) at SNP 76 and maternal genotype effects ($S_1 = 2$ and $S_2 = 3$) at SNP 6004. We then used EMIM to test for maternal effects, with and without allowing for child genotype effects (Figure

**Genetic Effects.** Plots of the (**A**) child genetic effects; (**B**) maternal genetic effects; (**C**) maternal genetic effects whilst allowing for child effects; and (**D**) child genetic effects whilst allowing for maternal effects.

A tutorial for this example (with a listing of the required commands) is available on the PREMIM and EMIM website:

Comparison of HWE, PAE, CEPG and CPG likelihoods

The power to detect genetic effects can vary depending on the assumptions made. As a demonstration, we simulated 1000 replicates of data at a single SNP for a sample consisting of 50 of each of the following units: case/parent trios, case/mother duos, case/father duos, control matings, control/mother duos and control/father duos. We assumed either a child genotype effect ($R_2 = 2$), a maternal genotype effect ($S_2 = 2$), or a maternal imprinting effect ($I_m = 1.8$). PREMIM and EMIM were used to estimate the parameters $R_1$, $R_2$, $S_1$, $S_2$ and $I_m$ for each different likelihood assumption and for each set of simulated data. Figure

**Comparison of Likelihood Assumptions in EMIM.** Results from simulated data. (**A-C**) The power of likelihood ratio tests to achieve significance levels (p-values) of 0.001, 0.01 and 0.05 for different simulated effects and likelihood assumptions: HWE - Hardy-Weinberg Equilibrium; CEPG - Conditional on Exchangeable Parental Genotypes; CPG - Conditional on Parental Genotypes; PAE - Parental Allelic Exchangeability. (**D-F**) Box plots on a log-scale of parameter estimates for $R_1$, $R_2$, $S_1$, $S_2$ and $I_m$, assuming HWE. Dotted lines show the true parameter values.
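Power in a simulation study of this kind is usually estimated as the fraction of replicates whose test reaches a given significance level. A minimal sketch of that calculation is shown below; the function name `empirical_power` and the four p-values are made up for illustration and are not results from the paper.

```python
def empirical_power(p_values, alphas=(0.05, 0.01, 0.001)):
    """Fraction of simulated replicates with a p-value below each threshold."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("at least one replicate is required")
    n = len(p_values)
    return {alpha: sum(p < alpha for p in p_values) / n for alpha in alphas}


# Four made-up replicate p-values, for illustration only.
power = empirical_power([0.0005, 0.02, 0.2, 0.04])
print(power)  # {0.05: 0.75, 0.01: 0.25, 0.001: 0.25}
```

With 1000 replicates, as used above, each estimate is simply a count of significant replicates divided by 1000.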

Effect of missing data on power

As a demonstration of the effect that missing data have on power, we performed analyses at a single SNP using simulated data (10,000 replicates, each replicate consisting of 100 case/parent trios and 100 control/parent trios) and assuming a range of probabilities of missing genotype data. We assumed a maternal genotype effect ($S_1 = 1.5$, $S_2 = 2.25$). The expected proportions of pedigree units of different types remaining in the analysis are shown in Figure

**Effect of Missing Genotype Data.** Plots showing the effect as the probability of missing genotype data is increased for data simulated with a maternal genetic effect. **A**: The expected proportions of different types of pedigree unit output by PREMIM from a set of case/parent trios, for different probabilities of missing genotypes. **B**: The expected proportions of different types of pedigree unit output by PREMIM from a set of control trios, for different probabilities of missing genotypes. **C**: Power of EMIM to detect maternal genetic effects (by estimating parameters $S_1$ and $S_2$). **D**: Power of EMIM to detect maternal genetic effects masquerading as child genetic effects (by estimating parameters $R_1$ and $R_2$).
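The expected proportions in panel A can be approximated with a short calculation. The sketch below assumes that each of the three genotypes in a case/parent trio is missing independently with the same probability, which may not match the missingness model actually used in the simulations; the function name `case_trio_unit_proportions` is hypothetical.

```python
import math


def case_trio_unit_proportions(p_missing):
    """Expected share of each sub-unit obtained from case/parent trios.

    Assumes the genotypes of the case, mother and father are each missing
    independently with probability `p_missing` (an assumption of this sketch).
    """
    p = p_missing
    k = 1.0 - p  # probability that a given genotype is present
    return {
        "case/parent trio": k * k * k,      # all three present
        "case/mother duo": k * k * p,       # father missing
        "case/father duo": k * k * p,       # mother missing
        "case": k * p * p,                  # both parents missing
        "case parental mating": p * k * k,  # case missing, parents present
        "case mother": p * k * p,           # only the mother present
        "case father": p * p * k,           # only the father present
        "none": p * p * p,                  # nothing usable
    }


props = case_trio_unit_proportions(0.1)
assert math.isclose(sum(props.values()), 1.0)
assert math.isclose(props["case/parent trio"], 0.729)
```

Under this assumption the share of complete trios falls as the cube of the probability that a genotype is present, so even modest missingness moves many families into the smaller sub-units.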

Buyske

Comparison with MENDEL

Several other software packages exist that allow testing and estimation of genotype relative risk parameters similar to those tested in EMIM. One such package is MENDEL

We used computer simulations (500 replicates each with 200 case/parent trios) to compare the performance of MENDEL and EMIM under three different comparable models:

1. **Model 1**. This model has been used to test for RhD incompatibility U_01 (corresponding to the parametrization of U_01=2.

2. **Model 2**. This model has been used to test for non-inherited maternal antigens (NIMA) on rheumatoid arthritis (RA), with one parameter (U_10) for MFG when the mother has one risk allele and the child has no risk alleles, and two parameters for child effects when the child has one or two risk alleles. In order to compare EMIM with MENDEL under this model, we used PREMIM to reassign which allele should be considered as the risk allele by EMIM. A model equivalent to MENDEL’s NIMA model can then be fit in EMIM by estimating parameters (with respect to the reassigned allele) $R_1$, $R_2$ and $\gamma_{12}$. Data were simulated assuming an MFG effect U_10=2. The power to detect the MFG effect in either MENDEL or EMIM was calculated by considering twice the difference between the negative log likelihood from a model that includes all three parameters ($R_1$, $R_2$ and the MFG parameter) and that from a model where the MFG parameter has been removed.

3. **Model 3.** This MENDEL model is a general MFG test consisting of one relative risk parameter for each of the 7 mother/child genotype combinations. The relative risk parameter denoted U_00 in the MENDEL documentation (corresponding to the situation where the mother and child have no risk alleles) was set to 1 and not estimated to avoid over-parametrization. The other 6 parameters, U_22, U_21, U_12, U_11, U_10, U_01, were estimated. The 6 parameters estimated by EMIM were $R_1$, $R_2$, $S_1$, $S_2$, $\gamma_{11}$ and $\gamma_{22}$. These parameters are not individually equivalent to the 6 MENDEL parameters, but the models as a whole can be shown to be equivalent. Data for this comparison were simulated assuming $R_1 = S_1 = \gamma_{11} = \gamma_{22} = 1.5$ and $R_2 = S_2 = 2.25$.
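The test statistic described for these comparisons, twice the difference between the negative log likelihoods of the null and alternative models, can be converted to a p-value as in the sketch below. The function name `lrt_p_value` is hypothetical, and the closed-form tail probabilities are coded only for 1 and 2 degrees of freedom to keep the example dependency-free.

```python
import math


def lrt_p_value(nll_null, nll_alt, df):
    """P-value for a likelihood ratio test from negative log likelihoods.

    The statistic 2 * (nll_null - nll_alt) is referred to a chi-squared
    distribution with `df` degrees of freedom (only df = 1 or 2 here).
    """
    stat = max(2.0 * (nll_null - nll_alt), 0.0)
    if df == 1:
        # Upper tail of chi-squared(1): P(Z^2 > stat) for standard normal Z.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-squared(2) is an exponential distribution with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# A statistic of 2 * ln(20) on 2 degrees of freedom gives p = 0.05.
assert math.isclose(lrt_p_value(10.0 + math.log(20.0), 10.0, 2), 0.05)
```

Removing a single MFG parameter, as in Model 2, corresponds to `df=1`; a general-purpose implementation would use a chi-squared survival function from a statistics library instead of these two special cases.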

Figures

**Mother-Child Interaction Effects Comparison with MENDEL, RHD.** Plots showing the comparison of EMIM and MENDEL - “option 26, Model 1: RHD” using simulated data. **A**: Plot of the null model log likelihood values calculated using EMIM and MENDEL. **B**: Plot of the full (alternative) log likelihood values calculated using EMIM and MENDEL. **C**: The power to detect a genetic effect for p-values of 0.05, 0.01 and 0.001. **D**: Plot of the MFG parameter estimates calculated using EMIM and MENDEL.

**Mother-Child Interaction Effects Comparison with MENDEL, NIMA.** Plots showing the comparison of EMIM and MENDEL - “option 26, Model 2: NIMA” using simulated data. **A**: Plot of the null model log likelihood values (with child effects fitted, but no MFG interaction effect) calculated using EMIM and MENDEL. **B**: Plot of the full (alternative) log likelihood values (fitting both child effects and MFG effect) calculated using EMIM and MENDEL. **C**: The power to detect the MFG effect for p-values of 0.05, 0.01 and 0.001. **D**: Plot of the MFG parameter estimates calculated using EMIM and MENDEL.

One difference between EMIM and MENDEL was the time taken to perform the analysis, with EMIM running considerably faster than MENDEL. For example, for model 3 (with 200 case/parent trios), PREMIM and EMIM combined took 0.0257 seconds and MENDEL took 6.45 seconds (averaged over 300 runs), making PREMIM and EMIM combined approximately 250 times faster than MENDEL in this example. The same analysis with 400 case/parent trios gave times of 0.0302 seconds for PREMIM and EMIM combined and 14.3 seconds for MENDEL (averaged over 300 runs), showing PREMIM and EMIM to be approximately 472 times faster than MENDEL. A possible reason for the difference in running times is the fact that the extended MFG model

Comparison with LEM

Another program with the capability to analyse complex genetic effects (most notably mother/child/imprinting effects) is LEM

1. **Case/parent trios.** SimPed was used to simulate data at 8000 SNPs for 4000 case/parent trios. Child effects ($R_1 = 1.5$, $R_2 = 2.25$) were simulated at SNP number 1004 and maternal effects ($S_1 = 2$, $S_2 = 3$) were simulated at SNP number 6004. In both EMIM and LEM we tested for maternal effects while allowing for child and maternal imprinting effects (i.e. we compared an alternative 5-parameter model ($R_1$, $R_2$, $S_1$, $S_2$, $I_m$) with a null 3-parameter model ($R_1$, $R_2$, $I_m$)). We calculated the p-value for LEM on the basis of the reported log likelihoods by using the Wald statistic as a $\chi^2$ value with 2 degrees of freedom. (The p-value reported by LEM was not suitable as it is only given to 3 decimal places, which was insufficient for SNPs with p-values less than $10^{-3}$.)

2. **Case/mother duos and control/mother duos.** Again, data were simulated at 8000 SNPs but this time for 2000 case/mother duos and 2000 control/mother duos. Child effects ($R_1 = 1.5$, $R_2 = 2.25$) were simulated at SNP number 1000 and maternal effects ($S_1 = 2$, $S_2 = 3$) were simulated at SNP number 6004. In both EMIM and LEM we tested for maternal and child effects, i.e. we compared a null model with no fitted parameters to an alternative model with parameters ($R_1$, $R_2$, $S_1$, $S_2$).

A comparison of EMIM versus LEM for the case/mother and control/mother duos is shown in Figure. The estimates of $R_1$ and $S_1$ from the two programs are approximately equal, and the estimates of $R_2$ and $S_2$ are also approximately equal, but with more variability.

**Comparison of EMIM and LEM for Child/Mother Duos.** Plots showing the comparison of EMIM and LEM using simulated data for 2000 case/mother duos and 2000 control/mother duos, assuming $R_1 = 1.5$ and $R_2 = 2.25$ at SNP number 1000 and $S_1 = 2$ and $S_2 = 3$ at SNP number 6004. Plots of the **A**: EMIM and **B**: LEM. Plots of the **C**: $R_1$; **D**: $R_2$; **E**: $S_1$ and **F**: $S_2$. **G**: Plot of the

Figure $I_m$. We see that the p-values and parameter estimates provided by the two programs are virtually indistinguishable.

**Comparison of EMIM and LEM for Case/Parent Trios.** Plots showing the comparison of EMIM and LEM using simulated data for 4000 case/parent trios, assuming $R_1 = 1.5$ and $R_2 = 2.25$ at SNP number 1004 and $S_1 = 2$ and $S_2 = 3$ at SNP number 6004. Plots of the **A**: EMIM and **B**: LEM. Plots of the **C**: $R_1$; **D**: $R_2$; **E**: $S_1$; **F**: $S_2$; **G**: $I_m$. **H**: Plot of the

These results indicate that the inference provided by LEM and EMIM is essentially identical. This is as expected given the mathematical equivalence ™ Processor with 2.6 GHz CPUs) or 2 minutes 4 seconds on Windows (using a 2-core Intel™ Processor with 2.93 GHz CPUs), whereas the same analysis in LEM took 16 hours, 52 minutes and 8 seconds on Windows (via the DOS command line). The difference in speed between the two programs for the case/parent trios analysis was not as extreme, with PREMIM/EMIM taking 3 minutes 7 seconds on Linux or 4 minutes 49 seconds on Windows, versus LEM’s time of 63 minutes 58 seconds on Windows. The improved speed for the LEM trios analysis was most likely due to the fact that it took fewer steps than the duos analysis during the likelihood maximization process (possibly on account of the fact that the example parameter file we were using requested the program to switch to using a Newton-Raphson algorithm following 10 iterations of an EM algorithm). It is possible that differences between maximization algorithms and convergence criteria could account for some of the differences in speed between PREMIM/EMIM and LEM; we found it difficult to determine how to obtain precise control over such factors in LEM and were forced to use input files that very closely matched the examples provided by

Conclusions

Here we have presented two new computer tools, PREMIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. The current version of EMIM improves upon the early beta version described in

In application to simulated data, we have shown that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. EMIM does have the advantage of allowing easy implementation of a wider class of models than are most easily implemented in MENDEL and LEM, although the expert MENDEL/LEM user could probably achieve the same model flexibility through judicious choice of parameter restrictions. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed of execution, an advantage that is likely to be all the more important when applying these approaches to large-scale data sets such as those generated in genome-wide association studies. To allow further increases in speed, PREMIM and EMIM also have the advantage of allowing easy parallel processing (e.g. on a computer cluster) by dividing the SNPs to analyse into different batches.
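Dividing the SNPs into batches for parallel runs can be sketched as follows. This is an illustration of the idea only: the function `snp_batches` is hypothetical and is not an option of PREMIM or EMIM, and how each batch is then passed to the programs is left to the user's own job scripts.

```python
def snp_batches(n_snps, n_batches):
    """Split SNP numbers 1..n_snps into contiguous (first, last) ranges.

    Batch sizes differ by at most one SNP.
    """
    if n_snps < 1 or n_batches < 1:
        raise ValueError("n_snps and n_batches must be positive")
    n_batches = min(n_batches, n_snps)  # never create an empty batch
    base, extra = divmod(n_snps, n_batches)
    ranges, start = [], 1
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(snp_batches(8000, 4))
# [(1, 2000), (2001, 4000), (4001, 6000), (6001, 8000)]
```

Because each SNP is analysed independently of the others, the batches can be processed in any order and the results concatenated afterwards.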

Limitations of PREMIM and EMIM include the fact that larger pedigrees are divided into case/parent or control/parent trios (or smaller sub-units) prior to analysis, and the fact that SNPs are analysed one at a time, without borrowing information from neighbouring markers (e.g. on the basis of regional linkage disequilibrium patterns). Methods for dealing with larger pedigrees, valid under the assumptions of random mating and/or Hardy-Weinberg equilibrium (HWE), have been described by

Availability and requirements

**Project name:** EMIM and PREMIM

**Project home page:**

**Operating systems:** Windows and Linux executables; FORTRAN and C++ source code

**Programming language:** FORTRAN and C++

**Other requirements:** None

**Licence:** GNU General Public License

**Any restrictions to use by non-academics:** None

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RH developed the PREMIM software, performed computer simulations and drafted the manuscript. HJC conceived the experiment, developed the EMIM software and revised the manuscript. Both authors read and approved the final manuscript.

Authors’ information

HJC is Professor of Statistical Genetics and a Wellcome Senior Fellow at the Institute of Genetic Medicine, Newcastle University, UK. RH is a Research Associate at the Institute of Genetic Medicine, Newcastle University, UK.

Acknowledgements

This work was supported by the Wellcome Trust (Grant reference 087436) and by the European Community’s 7th Framework Programme contract (‘CHeartED’) HEALTH-F2-2008-223040. Some of the results of this paper were obtained by using the program package S.A.G.E., which was supported by a U.S. Public Health Service Resource Grant (RR03655) from the National Center for Research Resources.