Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205, USA

CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, P.R. China

University of Chinese Academy of Sciences, Beijing 100049, P.R. China

Abstract

Background

ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles.

Results

We develop a new method, iASeq, to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that the allele-specificity of multiple proteins is highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately.

Conclusions

iASeq illustrates the value of integrating multiple datasets in allele-specificity inference and offers a new tool to better analyze ASB.

Background

In a diploid organism, each somatic cell has two copies of the genome. At certain genomic loci, gene expression, DNA methylation, transcription factor (TF) binding or histone modification (HM) can be allele-specific. In other words, the two alleles can behave differently. These phenomena, also known as allele-specific expression (ASE), allele-specific DNA methylation (ASM) and allele-specific binding (ASB, including both allele-specific TF binding and allele-specific histone modifications), can contribute to phenotypic diversity and may play important roles in adaptive evolution.

Early methods for analyzing AS events rely on low-throughput technologies such as real-time quantitative PCR.

ChIP-seq, a technology that couples chromatin immunoprecipitation with high-throughput sequencing, has become the state-of-the-art approach for mapping genome-wide TF binding sites and HMs.

ChIP-seq data in the public domain are growing rapidly. A recently developed database, hmChIP, for instance, has compiled over 450 human and mouse ChIP-seq datasets representing approximately 2000 samples from 140+ different TFs and HMs.

In this article, we present an integrated solution to this problem by developing a new approach, iASeq, for jointly analyzing allele-specificity in multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to describe unknown correlation patterns of allele-specificity among multiple datasets. These patterns can be discovered automatically from the data by fitting the model using an Expectation-Maximization (EM) algorithm. Using the identified correlation patterns, the model allows one to integrate information from multiple datasets to improve the ASB detection. Applying this approach, we analyzed 40 ENCODE

Methods

Data structure

Suppose there are $D$ datasets, and dataset $d$ has $J_{d}$ replicate samples (Figure


**The iASeq model. (a)** An example of the data structure. Each row represents a SNP and each column corresponds to either the reference allele (R) or the non-reference allele (N) read counts from a ChIP-seq sample in a dataset. A dataset can be a TF ChIP-seq experiment or an HM ChIP-seq experiment, and can have multiple replicate samples (Rep). iASeq assumes the following data generating process. **(b)** First, SNPs belong to different classes, and the class label $c_{i}$ of each SNP $i$ is randomly assigned according to a class abundance probability vector **Π**. Given the class label, a configuration [

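After read mapping and data preprocessing (see the Supplemental Methods in the Additional file), for SNP $i$ and replicate $j$ in dataset $d$, let $x_{idj}$ and $y_{idj}$ be the read counts for the reference allele and the non-reference allele, respectively. Let $n_{idj}=x_{idj}+y_{idj}$ be the total read count (see panel (a) of the iASeq model figure). We use a binary variable $a_{id}$ to indicate whether SNP $i$ is skewed to the reference allele ($a_{id}=1$) or not ($a_{id}=0$) in dataset $d$. If $a_{id}=1$, then SNP $i$ is said to be SR in dataset $d$. Similarly, we use $b_{id}$ to indicate whether SNP $i$ is skewed to the non-reference allele (SN) in dataset $d$; $a_{id}$ and $b_{id}$ cannot be equal to one at the same time. If $a_{id}=0$ and $b_{id}=0$, then SNP $i$ is not allele-specific (NS) in dataset $d$.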

**Supplemental Methods.** A PDF file including: data preprocessing procedures; method of moment estimation in the beta distribution; parameter choice for the Dirichlet prior; derivation of the EM algorithm for iASeq; Bayesian Information Criterion for choosing K; data generation procedure in simulation studies; single dataset based EM analysis.


Main intuition and challenge

Our primary goal is to infer for each SNP whether there is allelic imbalance in each dataset. This is equivalent to inferring $a_{id}$ and $b_{id}$, the indicators of skewing toward the reference and the non-reference allele for SNP $i$ in dataset $d$. A simple solution to this problem is to analyze each individual dataset separately, but this approach has low statistical power since the reference and non-reference read counts ($x_{idj}$, $y_{idj}$) are usually small.

If one knows how different datasets are correlated in terms of allelic imbalance, this knowledge may be used to improve the data analysis. For instance, if the allelic imbalance of two proteins A and B is closely correlated, then observing skewed read counts for protein A will provide information for inferring the allelic imbalance of protein B. Integrating the data from both A and B will increase the effective number of reads available for statistical inference, which will then lead to increased statistical power.

In reality, how different proteins are correlated is usually unknown. However, one may learn it by studying the data from many SNPs. Each SNP has three possible states in each dataset: SR, SN and NS. For $D$ datasets, there are $3^{D}$ possible configurations in total. By studying many SNPs, one can learn the relative frequencies (or mixing proportions) of these $3^{D}$ configurations, and these mixing proportions tell how different datasets are correlated. For instance, let $[s_{1},s_{2},\cdots,s_{D}]$ be the skewness configuration of a SNP in the $D$ datasets; the relative frequencies of the $3^{D}$ configurations will tell one the correlation structure in the data. This knowledge can then be used to improve statistical inference at each individual SNP by facilitating information sharing across datasets. For example, if the configuration [

A limitation of this approach is that one has to enumerate all $3^{D}$ AS configurations in order to describe the correlation. As the number of datasets increases, the number of possible configurations increases exponentially. Thus this approach does not scale well with increasing $D$. For example, when $D=40$, $3^{D}>10^{19}$. This simple approach is clearly intractable.
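To make the scaling concrete, the short Python sketch below counts the $3^{D}$ configurations for a few values of $D$. The function name and the chosen values of $D$ are illustrative assumptions, not part of iASeq.

```python
# Each of D datasets can be in one of three states (SR, SN, NS),
# so a SNP has 3**D possible AS configurations.
def n_configurations(num_datasets: int) -> int:
    return 3 ** num_datasets

for d in (2, 5, 10, 20, 40):
    print(f"D = {d:2d}: 3^D = {n_configurations(d):,}")

# Check the order of magnitude quoted in the text for D = 40.
exceeds = n_configurations(40) > 10 ** 19
print("3^40 > 10^19:", exceeds)
```

Python integers have arbitrary precision, so the count is exact even for large $D$.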

To circumvent the difficulty of documenting the frequencies of all $3^{D}$ configurations, iASeq employs a technique that describes the major correlation patterns in the data using a few probability vectors whose values vary from 0 to 1 rather than being dichotomous (i.e., 0 or 1). This approach significantly reduces the model complexity but keeps the flexibility to account for all $3^{D}$ configurations. It scales easily to an increasing number of datasets. The correlation structure in the model can then be used to improve the statistical inference of allelic imbalance at each SNP in each individual dataset.

Probability model

iASeq is based on the Bayesian hierarchical mixture model below, which uses several probability vectors to describe the major correlation patterns among multiple datasets (see panel (b) of the iASeq model figure) instead of enumerating all $3^{D}$ configurations. The observed data are viewed as generated as follows:

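● First, a class label $c_{i}$ is randomly assigned to each SNP $i$ according to the class abundance probability vector **Π**.

● If the class label $c_{i}=0$, then SNP $i$ is a background SNP that is not skewed in any dataset. Otherwise, given the class label, the AS configuration of SNP $i$ in each dataset is generated according to the probability vectors of its class.

● Next, the observed read counts are generated based on the AS configurations specified in the previous step.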
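The generating process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pi`, `V`, `W`, `alpha`, `beta` and `simulate_snp` are hypothetical, each dataset has a single replicate, and the per-state distributions of the reference-read proportion follow the Beta and uniform forms described in the "Data generating distributions" section.

```python
import random

def simulate_snp(pi, V, W, n_reads, alpha, beta, rng):
    """Draw one SNP from a hierarchical mixture (all names hypothetical).

    pi[k]: abundance of class k (class 0 = background).
    V[k][d], W[k][d]: probability of SR / SN in dataset d for class k.
    n_reads[d]: total read count in dataset d (one replicate per dataset).
    alpha, beta: Beta parameters of the background reference-read proportion.
    """
    p0 = alpha / (alpha + beta)          # background mean proportion
    label = rng.choices(range(len(pi)), weights=pi)[0]
    states, counts = [], []
    for d, n in enumerate(n_reads):
        u = rng.random()
        if u < V[label][d]:              # skewed to the reference allele
            state, p = "SR", rng.uniform(p0, 1.0)
        elif u < V[label][d] + W[label][d]:   # skewed to the non-reference allele
            state, p = "SN", rng.uniform(0.0, p0)
        else:                            # not allele-specific
            state, p = "NS", rng.betavariate(alpha, beta)
        x = sum(rng.random() < p for _ in range(n))  # binomial draw of reference reads
        states.append(state)
        counts.append((x, n - x))        # (reference, non-reference) read counts
    return label, states, counts

rng = random.Random(0)
pi = [0.90, 0.05, 0.05]
V = [[0.0] * 4, [0.8] * 4, [0.0] * 4]
W = [[0.0] * 4, [0.0] * 4, [0.8] * 4]
print(simulate_snp(pi, V, W, [20, 15, 30, 10], 20.0, 20.0, rng))
```

In this toy setting class 1 is mostly SR and class 2 mostly SN across all four datasets, which mimics two opposite correlation patterns on top of a background class.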

For SNP _{
d
} into **
X
**

Let **
A
**=(

Organize the probability vectors **
V
**

Based on this model, each SNP class **
V
**

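Our model assumes that the $[a_{id};b_{id}]$s of the same SNP in different datasets are a priori independent conditional on the class membership $c_{i}$. However, $[a_{id};b_{id}]$s from different datasets are not independent marginally if one integrates out the class label $c_{i}$. For example, the marginal probability $P([a_{id};b_{id}]=[1;0])$ is an average of the class-specific probabilities weighted by the class abundances in **Π**, so observing skewing in one dataset changes the plausible class of the SNP and therefore the probability of skewing in the other datasets.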
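A small numerical sketch may help illustrate how conditional independence given the class can still produce marginal dependence. The two-class, two-dataset numbers below are made up for illustration, and `pi` and `v` are hypothetical names for the class abundances and the class-specific SR probabilities.

```python
# Hypothetical example: class 0 is background, class 1 tends to be SR everywhere.
pi = [0.9, 0.1]                    # class abundances
v = [[0.0, 0.0], [0.8, 0.7]]       # v[k][d] = P(SR in dataset d | class k)

# Marginal probability of SR in each dataset (class label integrated out).
marg = [sum(pi[k] * v[k][d] for k in range(2)) for d in range(2)]

# Joint probability of SR in both datasets: independence holds only within a class.
joint = sum(pi[k] * v[k][0] * v[k][1] for k in range(2))

# Probability of SR in dataset 2 once SR has been observed in dataset 1.
cond = joint / marg[0]

print("marginals:", marg)
print("joint:", joint, "product of marginals:", marg[0] * marg[1])
print("P(SR in dataset 2 | SR in dataset 1):", cond)
```

The joint probability is much larger than the product of the marginals, which is the kind of dependence that lets one dataset inform another.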

Data generating distributions

To fully specify the model, one also needs to specify the data generating distributions $P(x_{idj},n_{idj}\mid a_{id},b_{id})=P(n_{idj})\,P(x_{idj}\mid n_{idj},a_{id},b_{id})$. The primary goal of iASeq is to infer whether two alleles are different. We assume that information on allele-specificity is only contained in $P(x_{idj}\mid n_{idj},a_{id},b_{id})$, and therefore the exact form of $P(n_{idj})$, i.e., the marginal probability distribution of the total read count, is irrelevant for our purpose. As such, we mainly focus on modeling the conditional distribution of $x_{idj}$ given $n_{idj}$, $a_{id}$ and $b_{id}$, i.e., the three distributions $f_{idj0}(\cdot)$, $f_{idj1}(\cdot)$ and $f_{idj2}(\cdot)$, one for each of the three states.

iASeq models these distributions hierarchically in two steps. First, $x_{idj}$ is assumed to follow a binomial distribution $x_{idj}\mid n_{idj},p_{idj}\sim Bin(n_{idj},p_{idj})$, where $p_{idj}$ is the probability that a read generated at SNP $i$ in replicate $j$ of dataset $d$ comes from the reference allele. Second, a distribution is assigned to $p_{idj}$ depending on the values of $a_{id}$ and $b_{id}$.

If $a_{id}=0$ and $b_{id}=0$, SNP $i$ is NS in dataset $d$, and $p_{idj}$ follows a Beta distribution $Beta(\alpha_{dj},\beta_{dj})$ with mean $p_{dj0}=\alpha_{dj}/(\alpha_{dj}+\beta_{dj})$. Note that a simpler model for $p_{idj}$ would be to set it to a constant $p_{dj0}$ which reflects the background ratio of read counts between two alleles. However, previous studies have shown that many background SNPs can have $p_{idj}$ slightly different from the average background $p_{dj0}$ even though they do not have biologically meaningful allele-specificity. A constant $p_{dj0}$ is therefore not sufficient to describe the background variation. For this reason, we adopt the Beta distribution to describe $p_{idj}$ instead of setting it to a constant (see the blue lines illustrated for $p_{idj}\mid a_{id}=0,b_{id}=0$ in the iASeq model figure). Ideally, the background mean, $p_{dj0}$, would be equal to 0.5. However, in reality $p_{dj0}$ may be slightly different from 0.5 due to various sources of read mapping biases. For example, allowing the same number of mismatches, reads from the reference allele are more easily mapped back to the reference genome than reads from the non-reference allele. Therefore, in iASeq $p_{dj0}$ may take values different from 0.5. Indeed, it is determined by the parameters $\alpha_{dj}$ and $\beta_{dj}$ in the Beta distribution, which are estimated from the data using a moment matching approach (see the Supplemental Methods in the Additional file). Once estimated, $\alpha_{dj}$, $\beta_{dj}$ and $p_{dj0}$ are treated as fixed and known parameters. Based on the model for $p_{idj}$, we integrate out all possible values of $p_{idj}$ to obtain the distribution of $x_{idj}$ conditional on $a_{id}=0$ and $b_{id}=0$, which is a beta-binomial distribution:

$$P(x_{idj}\mid n_{idj},a_{id}=0,b_{id}=0)=\binom{n_{idj}}{x_{idj}}\frac{B(x_{idj}+\alpha_{dj},\,n_{idj}-x_{idj}+\beta_{dj})}{B(\alpha_{dj},\beta_{dj})}$$

Here $B(\cdot,\cdot)$ is the beta function.

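If $a_{id}=1$ and $b_{id}=0$, SNP $i$ is SR in dataset $d$, and $p_{idj}$ follows a uniform distribution $U[p_{dj0},1]$ (see the dark blue lines illustrated for $p_{idj}\mid a_{id}=1,b_{id}=0$ in the iASeq model figure), where $p_{dj0}=\alpha_{dj}/(\alpha_{dj}+\beta_{dj})$ is defined as above. After integrating out $p_{idj}$, the distribution of $x_{idj}$ conditional on $a_{id}=1$ and $b_{id}=0$ is

$$P(x_{idj}\mid n_{idj},a_{id}=1,b_{id}=0)=\frac{1}{1-p_{dj0}}\int_{p_{dj0}}^{1}\binom{n_{idj}}{x_{idj}}p^{x_{idj}}(1-p)^{n_{idj}-x_{idj}}\,dp$$

If $a_{id}=0$ and $b_{id}=1$, SNP $i$ is SN in dataset $d$, and $p_{idj}$ follows a uniform distribution $U[0,p_{dj0}]$ (see the light blue lines illustrated for $p_{idj}\mid a_{id}=0,b_{id}=1$ in the iASeq model figure). After integrating out $p_{idj}$, the distribution of $x_{idj}$ conditional on $a_{id}=0$ and $b_{id}=1$ is

$$P(x_{idj}\mid n_{idj},a_{id}=0,b_{id}=1)=\frac{1}{p_{dj0}}\int_{0}^{p_{dj0}}\binom{n_{idj}}{x_{idj}}p^{x_{idj}}(1-p)^{n_{idj}-x_{idj}}\,dp$$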
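As a rough illustration, the three conditional distributions can be evaluated with the Python standard library alone. This is a sketch rather than the iASeq code: the function names are hypothetical, and the uniform cases use the identity $\int_{0}^{p_{0}}\binom{n}{x}p^{x}(1-p)^{n-x}dp = P(Y\ge x+1)/(n+1)$ for $Y\sim Bin(n+1,p_{0})$, which holds because the incomplete beta function with integer arguments is a binomial tail probability.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def f_ns(x, n, alpha, beta):
    """NS state: binomial with p ~ Beta(alpha, beta) integrated out (beta-binomial)."""
    return comb(n, x) * exp(log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))

def binom_upper_tail(m, p, k):
    """P(Y >= k) for Y ~ Binomial(m, p); adequate for small to moderate m."""
    return sum(comb(m, j) * p ** j * (1.0 - p) ** (m - j) for j in range(k, m + 1))

def f_sn(x, n, p0):
    """SN state: binomial with p ~ Uniform[0, p0] integrated out."""
    return binom_upper_tail(n + 1, p0, x + 1) / ((n + 1) * p0)

def f_sr(x, n, p0):
    """SR state: binomial with p ~ Uniform[p0, 1] integrated out."""
    return (1.0 - binom_upper_tail(n + 1, p0, x + 1)) / ((n + 1) * (1.0 - p0))

# Example: 12 reads, all from the reference allele, background mean 0.5.
alpha, beta = 20.0, 20.0
p0 = alpha / (alpha + beta)
print("NS:", f_ns(12, 12, alpha, beta))
print("SR:", f_sr(12, 12, p0))
print("SN:", f_sn(12, 12, p0))
```

For strongly reference-skewed counts such as this example, the SR likelihood is the largest and the SN likelihood the smallest, as one would expect.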

Joint probabilities and model fitting

Based on the model above, the complete data likelihood can be derived as:

Define

To infer **Π**,

Conditional on the observed data, the probability of the total read counts **N** is a constant that does not contain parameters of interest; therefore, it is absorbed into a proportionality constant not shown in the formula above. Using this joint posterior, an EM algorithm can be derived to search for the posterior mode.

For the Dirichlet prior, we use

Statistical inference of allele-specificity

The estimated **
Π
**,

Define

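Using $P(a_{id}=1,b_{id}=0\mid\mathbf{X})$

Formula 7 shows that two types of information contribute to $P(a_{id},b_{id}\mid\mathbf{X})$: the read counts observed in dataset $d$ itself, and the information from the other datasets that enters through the class membership of SNP $i$.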
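The sketch below shows one way such posterior probabilities could be computed for a single SNP once the model parameters are fixed. It is an illustration under assumptions, not a transcription of Formula 7: `pi`, `V`, `W` and `posterior_states` are hypothetical names, and `f0`, `f1`, `f2` stand for the likelihoods of a dataset's read counts under the NS, SR and SN states (already multiplied over replicates).

```python
def posterior_states(pi, V, W, f0, f1, f2):
    """Posterior class and state probabilities for one SNP (names hypothetical).

    pi[k]: class abundance; V[k][d], W[k][d]: P(SR), P(SN) given class k.
    f0[d], f1[d], f2[d]: likelihood of dataset d's counts under NS, SR, SN.
    """
    n_class, n_data = len(pi), len(f0)
    # Likelihood of each dataset given the class, with the state summed out.
    mix = [[V[k][d] * f1[d] + W[k][d] * f2[d] + (1.0 - V[k][d] - W[k][d]) * f0[d]
            for d in range(n_data)] for k in range(n_class)]
    # Posterior class probabilities: prior times product over datasets.
    weights = []
    for k in range(n_class):
        w = pi[k]
        for d in range(n_data):
            w *= mix[k][d]
        weights.append(w)
    total = sum(weights)
    post_class = [w / total for w in weights]
    # State posteriors: average the within-class posteriors over the classes.
    post_sr = [sum(post_class[k] * V[k][d] * f1[d] / mix[k][d] for k in range(n_class))
               for d in range(n_data)]
    post_sn = [sum(post_class[k] * W[k][d] * f2[d] / mix[k][d] for k in range(n_class))
               for d in range(n_data)]
    return post_class, post_sr, post_sn

# Dataset 1 shows strong evidence of SR; dataset 2 only weak evidence.
pi = [0.9, 0.1]
V = [[0.0, 0.0], [0.8, 0.8]]
W = [[0.0, 0.0], [0.0, 0.0]]
post_class, post_sr, post_sn = posterior_states(
    pi, V, W, f0=[0.001, 0.05], f1=[0.1, 0.08], f2=[1e-6, 0.01])
print(post_class, post_sr, post_sn)
```

In this toy example the strong signal in the first dataset raises the posterior probability of the SR-prone class, which in turn raises the posterior probability of SR in the second dataset well above what its own weak evidence would give.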

Results

GM12878 data and preprocessing

We collected 40 ENCODE

**Table S1.** Description of all GM12878 ChIP-seq and RNA-seq studies. An Excel file listing the TF or HM name and the number of replicates for each dataset in GM12878 cells.


Reference mapping bias is accounted for in iASeq by the parameter $p_{dj0}$, which models the background skewing probability and is estimated using all reads mapped to heterozygote SNPs in each sample. If there is reference mapping bias, $p_{dj0}$ will take a value different from 0.5 to adjust for the bias. One may remove reference bias before the analysis by masking SNPs in the reference genome during the alignment or by aligning reads to a diploid personal genome. This situation will also be automatically recognized by iASeq through the estimation of $p_{dj0}$ from the data (if there is no bias, $p_{dj0}=0.5$).

The intrinsic bias is a different type of bias. As shown by

We began with 1,704,166 heterozygote SNPs and filtered out 149,996 (8.8%) SNPs with inherent bias. Next, we eliminated SNPs that were not bound by any TF or associated with any HM in any dataset (see Additional file

**Table S2.** Raw read count data for 94,519 analyzed SNPs.


A simulation study

Before applying iASeq to the real data, we first tested its performance in simulations that took into account real data characteristics. Our simulations kept the same design as the real GM12878 ChIP-seq data, with the same number of datasets and the same number of replicates within each dataset, except that the genomic DNA sample was not used, since the truth is known in the simulations and genomic DNA is not needed as a control for potential bias. To create the simulation data, we first applied iASeq to the real GM12878 data to identify 86,353 SNPs that were not skewed in any dataset using $P(c_{i}=0\mid\mathbf{X})$

● Scenario 1: Two types of ASB SNPs (classes 1 and 2) were created in addition to the background SNPs (class 0). The SNP number for class 0, 1, and 2 was 85,069, 4,725 and 4,725 respectively. Thus the true _{
k
}for the three classes was 0.90, 0.05 and 0.05 respectively. SNPs in class 1 were SR in datasets 1 to 30 (i.e., their _{
id
}=1 for _{
id
}=1 for


**Simulation design and patterns discovered by iASeq. (a)** The true ASB patterns in simulation 1. Two patterns were simulated in addition to the background pattern. The two non-background patterns are shown. Each pattern has 4725 SNPs. Each row in the plot represents a SNP class, and each column represents a dataset. Black means skewed, and white means not skewed. **(b)** The BIC values for different class number **(c)** Patterns discovered by iASeq in simulation 1. The plot shows the estimated ** V** and

● Scenario 2: Four correlation patterns (classes 1-4) were created in addition to the background class (class 0). Class 1 and class 2 were the same as in simulation 1. Classes 3 and 4 were two new patterns. SNPs in class 3 were SR in datasets 21-40, and SN in datasets 1-10. Class 4 was the mirror image of class 3. The abundance of the classes 0 to 4 was (0.90,0.025,0.025,0.025,0.025).

Given the simulated [**
B
**

We applied iASeq to both simulations. In both cases, iASeq was able to identify the correct number of SNP classes using BIC (Figures

1. _{
id
}/_{
id
}−_{
d0}|. Here we estimated ^{
′
}is the number of SNPs for which _{
id
}≠0.


**The Receiver Operating Characteristic (ROC) curves for simulations. (a)-(c)** We plot the number of true allele-specific SNPs (i.e., true positives, TP) among the top **(d)** For each ranking method and each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC obtained from the single-dataset based methods, was computed for each dataset. **(e)**-**(g)** Results in three representative datasets from simulation 2. Results in all other datasets were similar. **(h)** The distribution of dAUC in all 40 datasets is shown for simulation 2.

2. _{
d0}was estimated as in the

3. _{
d0}was estimated as in the _{
id
}|_{
id
}∼_{
id
},_{
id
}) and _{
id
}∼_{
d
},_{
d
}) with _{
id
}is used to construct the ranking statistic.

4. _{
d
}and _{
d
}based on the observed data using the method of moments as in iASeq (see Additional file

5. _{
idjp
}(·),_{
d
}, _{
d
}and 1−_{
d
}−_{
d
}for each dataset

Figure _{
d
}(_{
d
}(

In general, the observed differences between iASeq and the

To examine whether iASeq was able to bring improvement in all datasets, we computed the Area under the Receiver Operating Characteristic (ROC) curves (AUC) for each method in each dataset using the top 2000 ranked SNPs. In each dataset, we computed the proportion of improvement in terms of AUC brought by iASeq over the best single-dataset based ranking method (i.e.,
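The relative AUC improvement described here can be sketched as follows. The exact way the area was computed from the top 2000 ranked SNPs is not spelled out in this section, so the sketch assumes the area under the curve of true positives versus rank (unit-width steps); the function names and the toy rankings are hypothetical.

```python
def tp_curve(is_true_in_rank_order, top):
    """Cumulative number of true positives among the top 1..top ranked SNPs."""
    curve, tp = [], 0
    for flag in is_true_in_rank_order[:top]:
        tp += int(flag)
        curve.append(tp)
    return curve

def area_under_curve(curve):
    """Area under the TP-versus-rank curve (assumed definition: sum of heights)."""
    return float(sum(curve))

def d_auc(auc_joint, auc_singles):
    """Proportion of improvement over the best single-dataset AUC."""
    best = max(auc_singles)
    return (auc_joint - best) / best

# Toy rankings: 1 marks a truly allele-specific SNP, listed from top rank down.
joint_rank = [1, 1, 0, 1, 0]
single_ranks = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 0]]

auc_joint = area_under_curve(tp_curve(joint_rank, top=5))
auc_singles = [area_under_curve(tp_curve(r, top=5)) for r in single_ranks]
print("dAUC:", d_auc(auc_joint, auc_singles))
```

A positive value means the joint ranking places more true positives near the top than the best single-dataset ranking does.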

In Figure


**Estimated FDR against true FDR in simulations. (a)-(d)** Results for four representative datasets in simulation 1. **(e)**-**(h)** Results for four representative datasets in simulation 2. Results for all other datasets were similar.

Analysis of real data

Our simulation study demonstrates the ability of iASeq to discover correlation patterns of allele-specificity and improve the detection of skewed SNPs. Next, we applied iASeq to analyze the 41 real datasets (78 samples) in GM12878 cells. In real data, we do not have comprehensive knowledge about the truth. Therefore, unlike simulations, we were not able to assess the FDR estimates. For this reason, we mainly focused on analyzing the correlation patterns of allele-specificity and testing whether iASeq can improve the SNP ranking.

Correlation patterns of allele-specificity

For the two non-background classes, the class abundance $\pi_{k}$ was estimated to be 0.0696 and 0.0691, respectively, suggesting that they cover 6.96% and 6.91% of the analyzed SNPs. Owing to background noise, not all SNPs in these two classes can be confidently detected. At the 0.90 posterior probability cutoff, iASeq reported 1,868 and 2,138 SNPs for classes 1 and 2, respectively (Figure


**Correlation patterns of allele-specificity among different TFs and HMs in GM12878 cells discovered by iASeq. (a)** The BIC values for different class number **(b)** The estimated ** V** and

Figures **
V
**

The coordinated allelic imbalance of different proteins toward the same allele has also been observed in a recent study

Increased power for detecting allele-specificity compared with single dataset analysis

We ranked SNPs based on the posterior probabilities

First, we evaluated different methods by counting how many of their top ranked SNPs were located in the non-pseudoautosomal regions of chromosome X (chrX-npa) (Figure


**The ROC curves with chrX-npa SNPs as gold standard in the GM12878 analysis.** We plot the number of non-pseudoautosomal region X chromosome SNPs, denoted by _{d}(**(a)**-**(g)** Results in 7 representative datasets. **(h)** In each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs for each method. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC from the single-dataset based methods, was computed for each dataset. The distribution of dAUC in all 40 datasets is shown.

Second, we evaluated different methods by using independent RNA-seq data. From RNA-seq, one can identify exonic ASE SNPs and use them as gold standard. We collected two RNA-seq datasets in GM12878, one from the California Institute of Technology (Caltech) and the other from the Yale/Stanford University (Yale) (Additional file

**Supplemental comparison of defining allele-specific SNPs as SNPs that have RNA-seq exonic ASE SNPs in their 1kb neighborhood.** Supplemental Figure


**Figure S1.** ROC curves for GM12878 using Yale Exonic RNA-seq ASE SNPs as gold standard.



**The ROC curves in GM12878 data using Caltech RNA-seq ASE SNPs as gold standard.** We plot _{d}(**(a)**-**(g)** Results in 7 representative datasets. **(h)** In each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs for each method. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC from the single-dataset based methods, was computed for each dataset. The distribution of dAUC in all 40 datasets is shown.

To ensure that the increased statistical power was not completely attributed to X chromosome SNPs, we repeated the benchmark analysis based on RNA-seq using only SNPs in autosomal chromosomes, and we obtained similar results (Figure

**Figure S2.** ROC curves for GM12878 using Yale autosomal exonic RNA-seq ASE SNPs as gold standard.



**The ROC curves in GM12878 data using Caltech RNA-seq autosomal ASE SNPs as gold standard.** We plot _{d}(**(a)**-**(g)** Results in 7 representative datasets. **(h)** In each dataset, we computed the area under the ROC curve (AUC) using the 2000 top ranked SNPs for each method. dAUC, the proportion of improvement of AUC brought by iASeq over the best AUC from the single-dataset based methods, was computed for each dataset. The distribution of dAUC in all 40 datasets is shown.

Comparisons with other methods

Most existing studies on allele-specificity were conducted using in-house data analysis pipelines. A tool developed by Skelly et al.

For the comparison with AlleleSeq, let $T_{d}$ denote the number of ASB SNPs that AlleleSeq reported for each TF dataset $d$, and consider the top $T_{d}$ SNPs ranked by iASeq. We then compared these two methods based on how many of their top $T_{d}$ SNPs were in chrX-npa, and how many of them were associated with exonic ASE SNPs determined by RNA-seq. For the benchmark analysis based on RNA-seq, we associated exonic ASE SNPs with ChIP-seq SNPs using both 10kb and 1kb neighborhoods. We also performed the comparison after excluding the chromosome X SNPs. Table

**Table S3.** Comparison of iASeq and AlleleSeq using Yale RNA-seq exonic ASE SNPs as gold standard.


**Figure S7.** The ROC curves for comparison between AlleleSeq and iASeq.


Column 1: TF name. Column 2: $T_{d}$ is the number of ASB SNPs reported by AlleleSeq. Columns 3-4: the number of non-pseudoautosomal region X chromosome SNPs among the top $T_{d}$ allele-specific SNPs reported by AlleleSeq and iASeq. Column 5: $T_{d}$ is the number of ASB SNPs reported by AlleleSeq that had an exonic SNP within their 10kb neighborhood. Columns 6-7: among the top $T_{d}$ allele-specific SNPs reported by AlleleSeq and iASeq, the number of SNPs that had ≥1 exonic ASE SNP in their 10kb neighborhood according to the Caltech RNA-seq experiment. Column 8: $T_{d}$ is the number of autosomal ASB SNPs reported by AlleleSeq that had an exonic SNP within their 10kb neighborhood. Columns 9-10: among the top $T_{d}$ autosomal allele-specific SNPs reported by AlleleSeq and iASeq, the number of SNPs that had ≥1 exonic ASE SNP in their 10kb neighborhood according to the Caltech RNA-seq experiment. Additional file

| TF | ChrX: $T_{d}$ | ChrX: AlleleSeq | ChrX: iASeq | All Caltech ASE exonic SNPs: $T_{d}$ | All Caltech ASE exonic SNPs: AlleleSeq | All Caltech ASE exonic SNPs: iASeq | Autosomal Caltech ASE exonic SNPs: $T_{d}$ | Autosomal Caltech ASE exonic SNPs: AlleleSeq | Autosomal Caltech ASE exonic SNPs: iASeq |
|---|---|---|---|---|---|---|---|---|---|
| YaleCFOS | 41 | 3 | 4 | 9 | 5 | 3 | 9 | 5 | 3 |
| YaleMYC | 122 | 9 | 22 | 39 | 5 | 10 | 38 | 5 | 10 |
| YaleJUND | 289 | 13 | 31 | 24 | 4 | 8 | 23 | 4 | 7 |
| YaleMAX | 105 | 3 | 18 | 18 | 3 | 1 | 18 | 3 | 2 |
| YalePolIII | 25 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |

Discussion

Interpretation of the correlation patterns

When analyzing the real data in GM12878, iASeq found two non-background AS patterns, representing two opposite directions of allelic imbalance. Since the assignment of reference and non-reference allele depends on the reference genome, whether a SNP is skewed toward reference or non-reference allele

In general, although one may view different allelic imbalance patterns in iASeq as different clusters of SNPs, these clusters only describe the similarities among SNPs in terms of their skewness directions, rather than the similarities in terms of their functions. The direction is defined using the reference/non-reference allele. The reference or non-reference allele for different SNPs can have different meanings (e.g., for one SNP, the maternal allele may be the reference allele, whereas for another SNP the paternal allele may be the reference allele). Therefore within each cluster, even though SNPs have similar skewness pattern, they are not necessarily functionally related to each other. One should not confuse the SNP clusters here with the clusters obtained from the traditional gene expression microarray data analysis, where co-expressed genes in a cluster often have similar functions. In iASeq, the clusters only serve as a tool to describe the correlation structure among different datasets (i.e., proteins), rather than the functional correlation among different SNPs. The correlation patterns among datasets are used by iASeq to inform one how to integrate information across datasets (i.e., which datasets are highly correlated and therefore can borrow information from each other) to improve detection of AS events for each individual SNP and dataset. In order to understand functions of the detected AS events, one needs to further correlate the iASeq results with other information (e.g., one may determine the parent-of-origin of each SNP first and then study various phenomena such as imprinting).

Our observation that different proteins prefer to be skewed in the same direction is consistent with a recent observation reported in

While our results show that most analyzed TFs/HMs tend to be skewed toward the same direction, these results do not imply that these proteins are perfectly correlated in terms of allele-specificity at each and every SNP. In iASeq, the correlation patterns **
V
**

Consistent with **
V
**

In summary, while the correlation patterns in iASeq provide some insights on the correlation of allelic imbalance among different datasets, one should not over-interpret them. The primary goal of these patterns is to describe the correlation structure in the data so that information from different datasets can be shared in a principled way to increase the power of statistical inference. This also points to an important difference between this study and previous studies that reported coordinated allele-specificity among multiple proteins. The previous studies only reported the correlation as a biological finding, but did not provide a statistical method to further utilize the correlation structure to improve the statistical inference. In contrast, iASeq provides a general and rigorous statistical method that utilizes the automatically discovered correlation patterns to increase the statistical power of AS detection. As such, it represents a novel development for the analysis of allele-specificity.

Model, algorithm, and possible extensions

Unlike tools such as AlleleSeq which mainly focus on the preprocessing steps for the AS analysis (e.g., construction of diploid personal genome), iASeq is developed as a general model working downstream of the preprocessing pipelines. The input data for iASeq are the read counts in the format shown in Figure

In iASeq, we used an EM algorithm to find the posterior mode of parameters and carried out statistical inference accordingly. In principle, one may also use a full Bayesian approach and Markov Chain Monte Carlo (MCMC) to perform the posterior inference. However, since MCMC usually takes much longer to run for a big dataset and it is not easy for users to monitor convergence, we decided to use the posterior mode and EM-based approach in our implementation. For analyzing the GM12878 data with 94,519 SNPs, iASeq took 5 hours to run the EM algorithm to fit a single model with

In principle, the statistical model developed in iASeq may also be applied to analyze other types of AS events, such as ASE and ASM. In the future, we plan to improve the model by incorporating information from the spatial correlation among closely located SNPs. For example, for the ASE analysis, one may jointly model SNPs from the same gene, similar to

Implications on future studies

The analysis of AS events using the high-throughput sequencing data frequently faces the problem of low statistical power due to the limited amount of information available at heterozygote SNPs. One way to increase the power is to increase the sequencing depth for one data type (e.g., MYC ChIP-seq). An alternative approach is to spend the same amount of money to generate data for multiple different but related data types (e.g., ChIP-seq for MYC, H3K4me1, H3K4me3, etc.), each with a lower coverage. One can then integrate the multiple datasets to increase the statistical power of allele-specificity analysis. The merit of the second approach is that one can collect multiple different types of information which might be useful for other purposes (e.g., in addition to studying MYC binding using MYC ChIP-seq, one may couple H3K4me1 ChIP-seq data with DNA motif information to locate active enhancers and predict binding sites of other TFs in the genome). If the second approach is used in the study design, then iASeq will offer a flexible, powerful and scalable framework for better analyzing the AS events in the data. As ChIP-seq data continue to grow rapidly, this integrative approach will allow us to use the data more efficiently to characterize the allele-specificity.

Conclusions

In summary, we have proposed a Bayesian hierarchical mixture model, iASeq, to integrate multiple ChIP-seq datasets for analyzing allele-specificity. The primary goal of iASeq is to increase the statistical power of AS detection, and it does so by taking advantage of correlations among datasets. Since the correlation structure may not be known before the data analysis, iASeq learns it from the data automatically. Application of iASeq to the ENCODE GM12878 data shows that the allelic imbalance of most analyzed TFs and HMs has a strong preference to be skewed in the same direction. Analyses of both the simulated and real data show that iASeq improves detection of allele-specificity compared to single-dataset based methods.

Abbreviations

AS: allele-specific; ASB: allele-specific binding, including both allele-specific TF binding and allele-specific histone modifications; ASE: allele-specific expression; ASM: allele-specific DNA methylation; AUC: area under receiver operating characteristic curve; EM: Expectation-Maximization algorithm; FDR: false discovery rate; HM: histone modification; NS: not allele-specific; ROC: receiver operating characteristic curve; SN: skewed to the non-reference allele; SR: skewed to the reference allele; TF: transcription factor.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceive study: HJ. Develop model: HJ, YW. Implementation: YW. Data collection: XL, QW. Data analysis: YW, XL. Write paper: HJ, YW, XL, QW. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the National Institutes of Health grant R01HG006841 to HJ, and by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA01010305) and the Hundred Talents Program of the Chinese Academy of Sciences to QW. The authors would like to acknowledge Dr. Joel Rozowsky and Dr. Mark Gerstein for providing AlleleSeq data on GM12878 for the method comparisons.