Open Access Highly Accessed Research article

dsPIG: a tool to predict imprinted genes from the deep sequencing of whole transcriptomes

Hua Li1,2, Xiao Su3, Juan Gallegos4, Yue Lu5, Yuan Ji6, Jeffrey J Molldrem2 and Shoudan Liang7*

Author Affiliations

1 Shanghai Center for Systems Biomedicine, Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Jiao Tong University, Shanghai, 200240, China

2 Department of Stem Cell Transplantation and Cellular Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA

3 Division of Biostatistics, The University of Texas School of Public Health at Houston, Houston, TX, 77030, USA

4 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA

5 Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA

6 Center for Clinical and Research Informatics, NorthShore University HealthSystem, Chicago, IL, 60201, USA

7 Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA

BMC Bioinformatics 2012, 13:271  doi:10.1186/1471-2105-13-271

Published: 19 October 2012

Additional files

Additional file 1:

Figure S1. Distribution of simulation-generated allelic counts vs. the observed distribution in real data. The red line shows the simulated distribution; the black line shows the observed distribution. The red and black text in the green box at the upper right gives summary statistics for the red and black lines, respectively.

Format: TIFF Size: 148KB

Additional file 2:

Figure S2. Flowchart showing the steps in data simulation and model assessment. In step 2, data generation differs between imprinted and non-imprinted genes for two reasons: (i) imprinted genes express only one allele at the tissue level, whereas non-imprinted genes do not; (ii) the two alleles expressed from non-imprinted genes are sequenced in RNA-Seq with equal probability, whereas imprinted genes have only one allele expressed. This step also assumes that a sequencing error misreads a nucleotide as any of the other three with equal probability. RT-PCR amplification is not shown because we assume that it amplifies both alleles synchronously (for details, see Discussion).

Format: TIFF Size: 373KB
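
The data-generation assumptions stated for Figure S2 can be illustrated with a short simulation. This is only a sketch, not the authors' implementation (their annotated R and C code is in Additional file 11). The function name `simulate_allelic_counts`, its parameters, and the example alleles "A" and "G" are hypothetical and chosen for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"

def simulate_allelic_counts(depth, imprinted, error_rate, alleles=("A", "G"), rng=None):
    """Simulate base counts at one heterozygous SNP in one sample.

    Illustrative assumptions (not taken from the dsPIG code):
    - `depth` reads cover the SNP;
    - an imprinted gene expresses only the first allele in `alleles`;
    - a non-imprinted gene has either allele sequenced with probability 1/2;
    - a sequencing error misreads the true base as one of the other three
      bases with equal probability.
    """
    rng = rng or random.Random()
    counts = Counter({base: 0 for base in BASES})
    for _ in range(depth):
        # Choose the allele that the read originates from.
        true_base = alleles[0] if imprinted else rng.choice(alleles)
        if rng.random() < error_rate:
            # Misread: pick uniformly among the three other bases.
            read = rng.choice([b for b in BASES if b != true_base])
        else:
            read = true_base
        counts[read] += 1
    return dict(counts)
```

Under these assumptions, an imprinted gene with no sequencing error yields reads for a single allele only, while a non-imprinted gene yields roughly equal counts for both alleles.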

Additional file 3:

Table S1. Imprinted genes predicted from mRNA-Seq data of Group I and Group II samples. Abbreviations: rs#, SNP identification number; Chr, chromosome; Str, strand; SS, sample size. “NA” in the “FDR” column means that the FDR could not be estimated from our 20,000 simulations.

Format: DOC Size: 283KB

Additional file 4:

Table S2. FDR values for different sample sizes and allele frequencies. “NA” means that the FDR could not be estimated from our 20,000 simulations.

Format: DOC Size: 78KB

Additional file 5:

Figure S3. Distribution of SNP sample sizes in Group I (9 diverse tissue samples) and Group II (20 cerebellum samples). Group I and Group II had 44,007 and 66,294 SNPs with sample size >0, respectively.

Format: PNG Size: 33KB

Additional file 6:

Figure S4. Functional enrichment analysis of the 57 genes and the 94 genes in Ingenuity® Pathway Analysis. (a) Comparison of enrichment in Bio Functions between the two gene lists. (b) Comparison of enrichment in Canonical Pathways between the two gene lists. The height of each bar in (a) and (b) represents the log-transformed (base 10) p-value calculated from Fisher’s exact test. In (a), the horizontal yellow line is the threshold [i.e., -log10(0.05)] above which bars (p-values) were considered significant; in (b), no bar rises above the threshold, so the threshold line is not shown. (Figure S4 is located in a separate PDF file: “Fig. S4.pdf”.)

Format: PDF Size: 566KB

Additional file 7:

Figure S5. Sensitivity and specificity analysis of dsPIG based on genes with known patterns of allelic expression. (a) Sensitivity analysis based on the validated imprinted genes. (b) Specificity analysis based on the validated non-imprinted genes. In (a) and (b), the solid black lines, which show the numbers of genes identified by dsPIG as “imprinted”, are based on the mRNA-Seq data of Group I samples, and the dotted black lines are based on Group II samples; the red line is the cut-off (0.2) used in this study to predict imprinted genes. (c) ROC curve for dsPIG based on Group I samples. (d) ROC curve for dsPIG based on Group II samples.

Format: PNG Size: 236KB
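
The quantities plotted in Figure S5 can be sketched as follows. This is a generic illustration, not the dsPIG code. It assumes that each gene receives a numeric score and is called “imprinted” when the score is at or above the cut-off; the caption does not state the quantity or direction to which the 0.2 cut-off applies, so both are assumptions, and the function names are hypothetical.

```python
def sensitivity_specificity(scores_imprinted, scores_nonimprinted, cutoff):
    """Return (sensitivity, specificity) at one cut-off.

    Assumption: a gene is called "imprinted" when its score >= cutoff.
    `scores_imprinted` holds scores of validated imprinted genes and
    `scores_nonimprinted` holds scores of validated non-imprinted genes.
    """
    true_pos = sum(s >= cutoff for s in scores_imprinted)
    true_neg = sum(s < cutoff for s in scores_nonimprinted)
    return true_pos / len(scores_imprinted), true_neg / len(scores_nonimprinted)

def roc_points(scores_imprinted, scores_nonimprinted):
    """Return ROC points (false-positive rate, sensitivity), one per distinct score."""
    cutoffs = sorted(set(scores_imprinted) | set(scores_nonimprinted), reverse=True)
    points = [(0.0, 0.0)]  # a cut-off above every score calls nothing imprinted
    for cutoff in cutoffs:
        sens, spec = sensitivity_specificity(scores_imprinted, scores_nonimprinted, cutoff)
        points.append((1.0 - spec, sens))
    return points
```

Sweeping the cut-off from the highest score to the lowest traces the curve from (0, 0) to (1, 1); a single cut-off such as 0.2 corresponds to one point on that curve.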

Additional file 8:

R package (dsPIG, version 3.0) for UNIX.

Format: GZ Size: 1.4MB

Additional file 9:

R package (dsPIG, version 3.0) for Windows.

Format: ZIP Size: 1.8MB

Additional file 10:

Instructions and sample files for the dsPIG R package.

Format: ZIP Size: 164KB

Additional file 11:

The annotated R code and C code for dsPIG used in our study.

Format: ZIP Size: 11KB

Additional file 12:

Figure S6. Simulated (log-transformed) posteriors of genes with biallelic expression in only one sample. Each positive integer (x) on the x-axis (“Sample size”) comprises two parts: 1 sample with biallelic expression and (x-1) samples with imprinted expression. Posteriors were calculated by dsPIG. The dashed line marks the log-transformed prior (0.01). This result is based on 20,000 simulations, with the geometric mean used to average the posteriors.

Format: PNG Size: 4KB
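
The geometric-mean averaging mentioned for Figure S6 can be sketched as below. The caption does not say how dsPIG computes each posterior or exactly which values are averaged, so this example only shows the averaging step on a list of posterior values supplied by the caller; the function name and the example numbers are hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up posteriors for illustration: one low value and one high value.
posteriors = [0.01, 1.0]
averaged = geometric_mean(posteriors)

# Compare with the prior on the log scale (the caption gives a prior of 0.01).
above_prior = math.log(averaged) > math.log(0.01)
```

Because the geometric mean averages on the log scale, a single low posterior pulls the result down more strongly than it would under an arithmetic mean, which may be why one biallelic sample matters in this figure.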

Additional file 13:

Figure S7. Distributions of QS in 32 samples (including 3 breast cancer cell line samples). The x-axis is the number of sequencing tags that covered the SNP site, and the y-axis is the QS. Tissue names are shown at the lower right of each plot, where “Cancer” stands for “breast cancer cell line sample” and “C.” stands for “cerebellum sample”. Dashed lines in the 32 plots mark the cut-off (0.9) for QS.

Format: PNG Size: 407KB