Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Research article

Response network analysis of differential gene expression in human epithelial lung cells during avian influenza infections

Ken Tatebe1, Ahmet Zeytun2, Ruy M Ribeiro3, Robert Hoffmann4, Kevin S Harrod5 and Christian V Forst1*

Author Affiliations

1 Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA

2 Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA

3 Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM, USA

4 Computational Biology Center, Memorial Sloan Kettering Cancer Center, New York, NY, USA

5 Lovelace Respiratory Research Institute, Albuquerque, NM, USA

For all author emails, please log on.

BMC Bioinformatics 2010, 11:170  doi:10.1186/1471-2105-11-170

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/11/170


Received:19 May 2009
Accepted:6 April 2010
Published:6 April 2010

© 2010 Tatebe et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The recent emergence of the H5N1 influenza virus from avian reservoirs has raised concern about future influenza strains of high virulence emerging that could easily infect humans. We analyzed differential gene expression of lung epithelial cells to compare the response to H5N1 infection with a more benign infection with Respiratory Syncytial Virus (RSV). These gene expression data are then used as seeds to find important nodes by using a novel combination of the Gene Ontology database and the Human Network of gene interactions. Additional analysis of the data is conducted by training support vector machines (SVM) with the data and examining the orientations of the optimal hyperplanes generated.

Results

Analysis of gene clustering in the Gene Ontology shows no significant clustering of genes unique to H5N1 response at 8 hours post infection. At 24 hours post infection, however, a number of significant gene clusters are found for nodes representing "immune response" and "response to virus" terms. There were no significant clusters of genes in the Gene Ontology for the control (Mock) or RSV experiments that were unique relative to the H5N1 response. The genes found to be most important in distinguishing H5N1 infected cells from the controls using SVM showed a large degree of overlap with the list of significantly regulated genes. However, though none of these genes were members of the GO clusters found to be significant.

Conclusions

Characteristics of H5N1 infection compared to RSV infection show several immune response factors that are specific for each of these infections. These include faster timescales within the cell as well as a more focused activation of immunity factors. Many of the genes that are found to be significantly expressed in H5N1 response relative to the control experiments are not found to cluster significantly in the Gene Ontology. These genes are, however, often closely linked to the clustered genes through the Human Network. This may suggest the need for more diverse annotations of these genes and verification of their action in immune response.

Background

Techniques such as microarray analysis allow measurements of the differential gene expression in cells for tens of thousands of genes simultaneously. The ability to measure changes in the transcription activity of a cell in response to an external stimulus allows for a system-wide approach in which pathways and sub-networks are analyzed rather than the activity of isolated genes [1]. Further, the development of biological knowledge systems such as the Gene Ontology (GO) [2] have provided a framework in which groups of genes can be classified in three areas: biological processes, molecular function and cellular components. This ontological classification scheme of gene function gives a hierarchical context in which groups of genes can be regarded to determine how closely they are functionally related [3].

A complimentary approach to above classification based on GO is the assessment of molecular functions in the context of known interactions between genes, DNA/RNAs, proteins and small chemicals, as mapped in biochemical interaction maps, pathways and networks [4]. The novel combination of these biochemical networks, along with the classifications provided by the GO, allows important clusters of genes in the cellular response to be identified. It further provides evidence by adjacency and pathway connectivity to assign genes that may not be significantly expressed to the relevant gene clusters.

Here, these methods are used to study properties of avian influenza infection with H5N1 virus, and to compare this infection to another upper respiratory tract infection, namely respiratory syncytial virus (RSV). The H5N1 influenza virus shows remarkably higher virulence than other strains of influenza [5]. A greater understanding of the H5N1 virus is motivated by epidemiological concerns, particularly in-light of the recent emergence of the H1N1 strain, because it is a prime candidate as a future pandemic infection should it mutate or reassort into a form that can be easily contracted by humans [6,7].

As a cross check on the significance analysis, a support vector machine (SVM) algorithm [8] is used to identify which genes show the greatest differences in their expressions between the H5N1 infected and control samples. This ranking is then compared to the significance analysis and genes associated with over represented GO nodes as well as genes that are closely associated with significant genes through interactions in the Human Network. This cross check will allow us to identify nodes that are not properly annotated, thus are not captured by the GO analysis.

Results and Discussion

Results are presented in threefold. First, significant clustering of genes into sub-graphs of the Gene Ontology are identified using the GO enrichment analysis plugin BiNGO [9] for CytoScape [10]. Second, many of the most strongly up and down regulated genes are examined and their possible function in the cellular response of Normal Human Bronchial Epithelial (NHBE) cells to H5N1 is considered with comparison to RSV and Mock infections. Comparisons were made at 8 and 24 hours postinfection for comparison of gene expression changes prior to and following productive viral replication, generally 12-14 hours postinfection in this NHBE culture model. Viral replication was verified by standard plaque assay techniques under the appropriate biocontainment levels. A total of 138 and 213 genes were found to be both significant biologically and statistically post infection at 8 and 24 hours, respectively. Of these, only a small fraction are associated with over represented GO nodes. Third, human response networks are calculated and correlated with GO.

Significant Sub-Graphs

After 8 and 24 hours of H5N1 exposure, 138 and 213 significant genes relative to the controls were identified, respectively. In spite of 138 genes being found at 8 hours, no significant over-representation of any GO nodes was found by BiNGO. At 24 hours after infection, several GO nodes showed over-representation and are listed in Table 1 along with the significant genes that are annotated with the corresponding GO ID. Only nodes that are significant at the 0.05 level using a Bonferroni correction are listed. Using the more liberal Benjamini correction did not yield appreciably different graphs [11]. The genes associated with each of these GO nodes and their functional interactions with other genes as described by our constructed Human Network are shown in Figure 1 (see Additional Files 1 and 2). Most of the genes in the GO clustering and those found associated/related in the Human Network are up-regulated, with a minority being down-regulated. Many of the genes with the greatest change in expression, however, were down regulated such as EGR2, FOS and EGR1 as shown in Table 2 (see Table 3 for a list of gene-definitions).

Table 1. Significant GO nodes

Table 2. Top 3 most strongly regulated genes at 8 and 24 hours

Table 3. Definition of genes referred to in the main text

thumbnailFigure 1. Network of significant genes and GO nodes at 24 hours. The Gene Ontology nodes found to be significant in the BiNGO analysis are shown as rectangles, with the more orange nodes being more statistically significant. The genes associated with the GO nodes are listed in ovals connected by black arrows to the GO nodes. These genes are further connected to other genes in the Human Network via yellow and orange edges. Red ovals indicate up-regulated genes while blue indicates down-regulated genes (see Additional files 1 &2).

Additional file 1. Figure 1 network data. Figure 1 network data in XML format

Format: XML Size: 184KB Download fileOpen Data

Additional file 2. Figure 1 interaction data. Figure 1 interaction data in sif format

Format: SIF Size: 4KB Download fileOpen Data

Another visualization of these results is depicted in Figure 2 (see Additional files 3 &4), which contains a network that used daughters of BiNGO nodes as seeds for the human response network. As a result there are many genes connected to each GO node. The increased number of paths between GO nodes obscures the genes that are at the hub of activity, though FOS, IFI27, STAT1, CXCL11 and CDC6 are rather well connected. This is in contrast to Figure 1, which shows a graph where the significant genes were used as seeds regardless of their GO connections to generate a human response network, and the resulting graph was then merged with the GO nodes found to be significant with BiNGO.

thumbnailFigure 2. Network of genes using daughters of BiNGO nodes as seeds. Genes that are up-regulated are shown in red and those in blue are down-regulated. Any gene found to be a daughter of a GO node found to be significant under the BiNGO analysis was included in a seed set used to generate the resulting network connecting the BiNGO nodes. In this case, the seed genes are not necessarily significant, rather, it is their associated GO node that is significant (see Additional files 3 &4).

Additional file 3. Figure 2 network data. Figure 1 network data in XML format

Format: XML Size: 355KB Download fileOpen Data

Additional file 4. Figure 2 interaction data. Figure 2 interaction data in sif format

Format: SIF Size: 9KB Download fileOpen Data

Interestingly, no significant over-representation of GO nodes can be found among the significantly represented genes during RSV infection. That is, looking at the genes that are both biologically and statistically significant in the RSV but not in the Mock assays. This may indicate that the response to RSV at times earlier than 24 hours does not involve activating genes that are not part of the normal metabolic behavior of the cell. It is possible that there is a more subtle interplay between the gene activations in how the cell orchestrates a response. Alternatively, the effects of early RSV infection could be small such that the cell has not mustered significant changes in gene expression by 24 hours post infection. In any case, using the union of all genes found to be significant in any of the control experiments is rather conservative and may exclude sufficient true positives to make finding significantly over represented GO nodes difficult; however, this also highlights the dramatic response of the cell to infection by H5N1.

Genes of Interest

Although no significant GO nodes are found at 8 hours, there is one sub-network associated with binding in the Gene Ontology that contains a fair number of genes. At 8 hours this sub-network contains a handful of genes including CTGF, CYR61, SDPR and LGALS2. By 24 hours this has expanded to accommodate numerous genes. This parent node common to the genes of this subnetwork is associated with IFIT5, a gene that codes for an interferon induced protein. Other subnetwork that are easily identified by visual inspection include genes involving transcription regulation, with the genes EGR1, JUN, MCM6, FOSB, FOS and ZFP36 being significantly regulated at both 8 and 24 hours, and genes involved with post transcriptional/translational modification such as TAP1, ISG15, and PLK3. There are also a number of kinase/phosphatase related genes, DUSP1, ABCC3, SOCS3. Among these, several of these genes also appear among the top 35 components to the vector normal to the OHP in the SVM analysis. Table 4 lists genes from Table 5 that appear as significant in the 8 and 24 hour experiments. Although most of these genes are not found in the BiNGO analysis, they are generally found to be highly connected to those genes through the Human Network. Many of these genes, including CYR61, EGR1, JUN and DUSP1 are prominently displayed in the response network in Figure 1.

Table 4. Genes common to significance and SVM analyses

Table 5. Components of the Optimal Hyper Plane (OHP) in the SVM analysis

Among the three most dramatically regulated genes at both 8 and 24 hours is the Early Growth Response 1 gene EGR1, which is down-regulated by about an order of magnitude at both 8 and 24 hours. Other studies of epithelial lung cells in mice have shown significant up regulation of this gene in association with lung injury, though on much shorter time scales [12]. Up-regulation of EGR1 and EGR2 have been seen after wounding in mouse studies at several stages of development from embryo to adult, though knockout studies show that neither gene is essential to healing [13]. This suggests a redundancy in systems in the cellular response to trauma. Such backup systems are excluded in this study by the selection criteria as a gene must appear in all three experimental sets to be selected. The timescale for EGR1 up-regulation in these cases is much faster and is reported to be insignificant 90 min post trauma. Similarly, EGR2 was reported to decrease to near zero after 1 hour. EGR1 is a master transcription factor that has been shown to be up-regulated by histamine in human aortic endothelial cells [14]. It is not known whether this effect also exists in pulmonary endothelial cells, but based on these data this seems to be a possibility. It is curious, however, that this study finds EGR1 to be strongly down regulated, in contrast to other studies. The immediate early gene EGR1 is also activated by injury, suggesting that mechanical stress of the cellular membrane and viral infection may have a common factor in the response triggered in the cell [15]. There are also significant differences between an intense, onetime injury and the more gradual onset and constant stress of a viral infection. These differences may help explain the faster timescales of gene activation associated with injury.

With respect to virus response, studies by Djavani et al. (2007) report down-regulation of EGR1 and EGR2 after infection with the lymphocytic choriomeningitis virus (LCMV) in a monkey model [16]. Down-regulation of these genes has been recorded by Djavani et al. after day 1 until the end of their measurements at day 7. This evidence is consistent with our findings. Thus it seems that virus infection triggers different reactions in the same first responder genes compared to wound healing. Interested to report with this respect is a regulatory network identified by Djavani et al. that includes EGR1, EGR2, FOS, JUN and PTGS2 (Figure 3, Additional files 5 &6). In contrast to Djavani et al., PTGS2 and IL1RL1 are up-regulated (Not shown: IL1R1 and IL1R2 are also up-regulated). These results indicate moderately different host responses during different viral infections even for such a compact and highly connected network of major host responding genes.

thumbnailFigure 3. Pathway analysis of major gene products affected by virulent infection. A sub-network of the response network from Figure 2 reveals interactions between the major gene products affected by virulent infections. This network is in accordance with Figure 5 of Djavani et al. (2007) [16]. In contrast to Djavani et al., PTGS2 and IL1RL1 (not shown: as well as IL1R1 and IL1R2) are up-regulated. EGR1, EGR2, FOS, FOSB are transcription factors. PTGS2 encodes prostaglandin-endoperoxide synthase 2, IL1A/B code for interleukin 1 and IL1RL1 belongs to the interleukin 1 receptor family. Color coding is according to Figure 1. Black arrows indicate known connections not realized in this particular response network (see Additional files 5 &6).

Additional file 5. Figure 3 network data. Figure 3 network data in XML format

Format: XML Size: 34KB Download fileOpen Data

Additional file 6. Figure 3 interaction data. Figure 3 interaction data in sif format

Format: SIF Size: 1KB Download fileOpen Data

A study of cellular response in birds to H5N1 [17] reports several genes to be significantly regulated. None of the genes reported match the genes found in this study, though several are similar. Many genes, such as SOCS3, IL17C and STAT1 in this study, are similar to the genes SOCS1, IL17A and STAT3 reported in [17]. These genes show opposite regulation in this study compared to the results in birds. Other genes, such as TLR2 in this study and TLR15 in birds are both found to be activated by exposure to H5N1.

Of the strongly regulated genes, several are also found in the SVM analysis described in the last portion of the Methods section. The genes common to the significance analysis and the top 35 components of the SVM analysis are shown in Table 4. The two largest components of the vector describing the optimal hyperplane are for the genes CYR61 and SOCS3. These genes have been associated with ventilator-induced lung injury in mice, along with CCL2 [18]. It has also been found that androgen receptors enhance STAT3, which in turn regulates transcription of SOCS3 [19]. This system is responsible for leptin regulation and may indicate a change in the energy use of a cell when responding to a pathogen. Variations in levels of CYR61 have also been reported in human tumor cell lines from nervous tissue along with a structurally related gene, CTGF [20].

Over-expression of ZFP36 has been found after wounding keratinocytes in human tissue [21], though this is in contrast to the down regulation observed in this study. The study by [21] also shows FOS and EGR1 to be activated by injury, again with opposite results of the data presented here where both are found to be down regulated. Up regulation of ZFP36 has also been reported between bouts of muscular exercise [22]. The reason for the down-regulation of ZPF36 upon exposure to H5N1 while it is up regulated in response to physical injury is probably due to modulation effects caused by the virus infection.

SVM results don't share any of the genes found by BiNGO analysis. That is, no gene was found to be both associated with a significantly over represented GO node and among the most relevant genes in the SVM analysis. Rather, the genes found by SVM tend to be functionally related to many of the genes found in significant GO clusters. These genes include JUN, FOS, FOSB, EGR1, DUSP1, and CYR61. This distinction is likely due to SVM selecting based on differential expression and without regard to relative statistical significance or gene relationships. Curiously, the CCL2 gene that features as a prominent hub in the human network graph is not found to be significant in either the BiNGO or the SVM analysis.

The synthesized networks enable us to determine the regulations (the genes located in up- and down-stream of the pathways) of the selected genes. For example, SAA (serum amyloid protein) 2 and 4 are important host markers for H5N1 HPAI infection. CXCL9, CXCL 10 and 11, CXCR4 along with IL1A regulates expression of SAA proteins that induce immune response molecules specific for H5N1 infection (OAS1, OAS2, OAS3, OASL, GBP1, GBP2, and TAP1), but not RSV or mock infection. Figure 4 shows response networks at 8 h and 24 h after infection between differentially expressed genes of H5N1 and RSV (see Additional files 7, 8, 9 and 10). Already at 8 h, CXCL9 is 3.8 times more expressed in H5N1 than in RSV. This ratio increases to 18.5 at 24 h. Similar increases can be identified for CXCL10 and CXCL11. The relative expression between H5N1 and RSV for these genes change from 2.3 and 2.0 at 8 h to 6.8 and 10.5 at 24 h, respectively. With respect to OAS genes, relative expression of OASL changes from 2.9 to 6.1 between 8 h and 24 h. Interesting to note at 24 h is the ICAM1 triggered two-pronged expressed cascade to the FOS/JUN pair as well as to the chemokine ligand family (CCL, CXCL) and their corresponding receptors (CCR, CXCR). Thus, H5N1 seems to trigger a variety of cytokine response in contrast to RSV infections. The exceptions are CCR3, CXCL13 and CCL19. These chemokines are suppressed by H5N1, indicating a weakened anti-viral response induced by H5N1 infections. The immune responses are further developed into viral specific immune responses by expressing molecules in interferon family. Targeting any of these genes aligned in immune response pathways can boost human anti-viral immune system.

thumbnailFigure 4. Differential expression network between H5N1 and RSV. Response network of differentially expressed genes between H5N1 and RSV. (a) A response network H5N1 versus RSV at 8 h is shown with 54 nodes and 65 interactions. The network was calculated with parameters k = 3 and l = 4.5. (b) A H5N1 versus RSV response network at 24 h is depicted with 53 nodes and 71 interactions. Parameters k = 3 and l = 5 were used for calculation. Color coding is according to Figure 1. Edges with arrows indicate chemical reactions, diamond-shaped edge tips denote activation, circle-shaped tips refer to phorphorylation reactions and non-decorated edges are non-directional interactions by physical binding or by inferrence from the iHOP database (see Additional files 7, 8, 9 and 10).

Additional file 7. Figure 4a network data. Figure 4a network data in XML format

Format: XML Size: 91KB Download fileOpen Data

Additional file 8. Figure 4a interaction data. Figure 4a interaction data in sif format

Format: SIF Size: 2KB Download fileOpen Data

Additional file 9. Figure 4b network data. Figure 4b interaction data in XML format

Format: XML Size: 96KB Download fileOpen Data

Additional file 10. Figure 4b interaction data. Figure 4b interaction data in sif format

Format: SIF Size: 3KB Download fileOpen Data

Conclusions

The expression of approximately 100 genes in human epithelial lung cells is found to be significantly different at 8 and 24 hours after H5N1 exposure compared to normal cellular metabolism or to response to RSV infection. Only those genes found to be significant at 24 hours are significantly clustered in the Gene Ontology with Cell Cycle and various Immune Response nodes being over represented. Although many genes are found to be differentially expressed after exposure to RSV that are not found in either the control or in the H5N1 experiment, no significant clustering was found using the BiNGO analysis. The faster time scales and more intense immunological response may be factors in the virulence of H5N1. These results are consistent with reports from infections caused by other aggressive viruses, such as LCMV. Network responses of major gene products affected by virulent infection seems to be conserved between H5N1 and LCMV infections. Although detailed responses clearly distinguish between different viruses. H5N1 causes host immune response by inducing large number of chemokines at the early stages (24 hour) of infection. However, CXCL13 and CXCL14 that are the keys to develop host anti-viral responses are suppressed. Balance among CXCL chemokines may be important to reduce the H5N1 mediated complications (Figure 4). Future studies looking at time scales from 12-48 hours would be helpful in addition to real-time PCR of the most significant genes found here to better determine their roles in the response to H5N1.

Potential applications of these findings include; (1) the identification early host biomarkers that can be used to detect specifically H5N1 infection during the early stages when the clinical symptoms have not been developed, (2) the selection of host targets to develop therapeutic for H5N1 infection.

Methods

Experiments

The gene expression data of epithelial cells were measured by Miltenyi Biotec using an Agilent DNA chip. We used Agilent 60-mer Whole Human Genome Oligo Microarray (with array numbers 251485014481 and 251485014482 resp.), containing approximately 44 K genes and genes candidates. Raw expression files can be accessed at: http://www4.utsouthwestern.edu/ForstLab webcite. NHBE cells were purchased from Lonza (Allendale, NJ. USA), cultured and differentiated as described previously [23]. Briefly, the cells are propagated by plating 2-3 × 106 cells in a 100 mm collagen coated dish in BEGM media (Lonza, Allendale, NJ. USA) until reaching 70-90% confluency, generally in 3-5 days. NHBEs are detached by low concentration trypsin digestion (Sigma, St. Louis, MO, USA) and plated onto the apical surface of Corning permeable inserts at a cell density of 5 × 105 in 24 mm (6-well) and 2.5 × 105 in 12 mm (12-well) in ALI media (50:50 mix of BEGM media and DMEM-H (Sigma, St. Louis, MO, USA) without antibiotics) in both the apical and basolateral chambers. Retinoic acid final concentration in ALI media is at 50 nM. At confluency, the apical media is removed and the cells exposed to air for 28 d. Basolateral media is changed and apical surface washed with 1× PBS every 48 hr. Differentiation is determined by microscopic observation of ciliary beat and confirmed in one-well by formalin-fixed, paraffin embedded hematoxylin and eosin staining for apical ciliary axonemes under microscopic examination.

The fully differentiated cells were infected with High Pathogenic Avian Influenza A (HPAI) H5N1 A/Hong Kong/483/97 in BSL-3 facility with Multiplicity Of Infection (MOI) of 0.01. The infectivity of the virus was determined with a plaque assay prior to the experiments. Apical compartment were washed with warm PBS and the viruses were added to the growth media. After one-hour incubation at 35°C, unattached viruses were removed from the apical compartment and one ml of fresh media was added. Eight and 24 hour after infection the cells were harvested by adding RLT lysis buffer (Qiagen RNeasy Kit) and total RNA was isolated according to the Manufacturer's instructions. One μg of total RNA was used for microarray analysis. The time-points were chosen to record physiological changes during infections. After 8 h the first viral particle start forming inside of host cells. By 24 hours the newly formed viral particles infect the un-infected neighboring cells.

Differential expression levels of mRNA were measured at 8 and 24 hours after exposure to one of three agents. One agent was the H5N1 virus. The other two, which served as controls, were the more benign Respiratory Syncitial Virus (RSV) and Mock samples consisting of growth medium. Three independent experiments were conducted to measure mRNA levels after exposure to H5N1 at both time points, for a total of 6 H5N1 data sets. Two experiments each were carried out for both controls and at both times, for a total of 8 control data sets.

Vials containing human RNA samples were shipped to Miltenyi Biotech on dry ice. At the company, the samples were further quality checked via Agilent 2100 Bioanalyzer platform, T7-base amplified, Cy3/Cy5-labeled and hybridized to the Agilent Whole Human Genome Oligo Microarrays using Agilent's recommended hybridization chamber and oven. Fluorescence signals of the hybridized Agilent Oligo Microarrays were detected using Agilent's DNA microarray scanner. The Agilent Feature Extraction Software (FES) was used to read out and process the microarray image files. Feature intensities and ratios, including background subtraction and normalization, were determined, outliers rejected and statistical confidences (p-values) calculated (see Additional file 11 for microarray statistics).

Additional file 11. Supplemental Data. Statistics of microarray experiments

Format: PDF Size: 2MB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Expression Data Analysis

Genes in each of the data sets were deemed as significant if they showed at least a two-fold change in expression and met a preset p-value cutoff. How this value is selected is discussed below. Those genes that were significant in all three exposures to H5N1 after 8 or 24 hours but were not significant in any of the control trials were taken as relevant to H5N1 response. Stated formulaically,

(1)

(2)

(3)

where E is the set of genes expressed after H5N1 exposure, C is the set of control genes and S is the set of genes determined to be significantly relevant to H5N1 response uniquely. The subscripts indicate the time point. Thus, the significantly expressed genes relevant to H5N1 response at a given time point of 8 or 24 hours are taken to be the intersection of all genes significant in all three trials, but not significantly expressed in any of the control trials. In these experiments, the number of probe sets differed between the RSV, H5N1 and Mock conditions. In order to avoid false results due to a probe appearing in one condition but not another, only the probes that were common to all three conditions were kept to be processed by the above analysis.

The timescale of response to RSV and H5N1 may differ substantially, thus no attempt is made to distinguish between controls at 8 hours and those at 24 hours. Any genes that are significant in any of the control studies are assumed to be part of normal cellular metabolism and unrelated to a response to H5N1. This is an oversimplification and will probably lead to several genes in the RSV control studies being incorrectly interpreted as being part of the normal cellular metabolism. What remains, however, can then be more reliably identified as being both significant and genuinely a response of the cell to H5N1. This bias should generally select for genes responding to the more virulent nature of H5N1 since the basic response of the cells to the more benign RSV will be interpreted as normal expression through the control.

To find which genes were significantly expressed as a part of the cellular immune response a conservative interpretation of the data is used. Only genes which are found to be significant in all three experimental trials are considered to be genuine. Thus, a gene must be both biologically and statistically significant in all three experiments at 8 or 24 hours post-exposure to be considered relevant. As mentioned above, the control studies are also given a conservative interpretation. If a gene is found to be both biologically and statistically significant in any one of the control studies then it is interpreted as being part of the normal functioning of the cell. This rather conservative interpretation has the effect of revealing only the most significant genes, and hopefully, the genes that are most important in the reactions of the cell that are most specific to H5N1.

Evaluation of Significance

A gene is considered biologically significant if there is at least a 2-fold change in the level of gene expression at the given time frame after exposure in all three repeats of the experiment. Statistical significance is determined by demanding that there be, on average, less than one false positive due to statistical effects in the set of genes deemed to be significant. To do this, only data with p-values below a certain cutoff value are deemed to be statistically significant. This cutoff value is

(4)

where N ≈ 3 × 104 is the number of genes measured. Since the p-value indicates the fraction of random, i.e., insignificant, results that will by chance appear significant, the number of false positives, FP, is assumed to be FP = pcutoff(N - TP), where TP is the number of true positives. Since, in general, TP N it can be approximated that the number of false positives is FP pcutoffN. By setting a cutoff such that FP = 0.5 it can be assumed that, on average, there are less than 0.5 false positives per experiment, i.e., a family-wise error rate of 0.5. Since there are three H5N1 experiments at each time point, and it is required that a result appear significant in all three data sets, the likelihood of a false positive goes down by the cube of the single experiment rate. This is equivalent to a Bonferroni correction, modified for the case where n independent measurements are made for each variable. Thus, to maintain a FP rate of 0.5 the cutoff given in Equation 4 is used. The tacit assumption here is that TP genes will reliably reappear as significant across experiments while FP results will be evenly distributed amongst the insignificant genes randomly. This assumption will become important presently.

In the data sets several genes are listed twice or more. These repeats are due to different oligonucleotide probe sequences being used on the gene chip that code for different parts of the same gene. Genes that appeared more that once were not all necessarily measured to be (in)significant in all instances. In the cases where the results on a single chip differed, a way of evaluating the significance of that gene is needed. In these cases, if a gene appeared as significant by p-value more than once, and its average fold change over all instances was greater than a factor of two, then it was taken to be significant.

Since genes which are repeated have more chances to appear significant the cutoff p-value given above may not be sufficient. If a gene appears beneath the p-value cutoff twice, then the FP rate will be less than p2. As the number of instances of the gene that are not measured to be significant increases, however, there is a greater chance the gene is not significant. To determine the proper outcome the false negative rate, FN, must be known. This quantity is unknown but is tacitly assumed by the above procedure to be FN , making two positive occurrences compelling. The precise FN rate is complicated since the redundant entries represent diffferent probe sequences for the same gene. The average fold change of all instances is used in determining biological significance to avoid biasing the measure. This is not a rigorous statistical treatment, but should provide reasonable results in conjunction with the other data sets and for reasonable values of a FN rate.

Changing the p-value cutoff described above did not appreciably change the numbers of significant genes identified. With respect to the H5N1 response after 8 h and 24 h, a total of 152 and 209 genes, respectively, satisfy Equation 3 when the p-value cut-off is set to unity. These numbers change to 138 and 213 genes when, in addition, the p-value cutoff in Equation 4 is used. The increase from 209 to 213 genes at 24 hours after the inclusion of the p-value cutoff reflects the cutoff eliminating genes from being considered significant in the control study. All but about 20 genes in both time sets remained significant regardless of what p-value cutoff was used in the range pcutoff <p < 1.0.

Gene Ontology

The genes found to be significant in the experiments involving H5N1 exposure are analyzed using the BiNGO package for CytoScape. Here, a list of significantly over-represented Gene Ontology nodes associated with the significant genes is constructed. Similar to enrichment analysis [24], the Gene Ontology is then searched for nodes that are over represented at a statistically significant level. Significance is evaluated using the hypergeometric test given by

where qi gives the probability for a given GO node having i of its f daughter nodes associated with genes from a random list of c genes, and g is the number of GO nodes in the whole Ontology. Thus, the chance of n or more genes randomly being clustered under a given GO node (given that c genes were found to be significant) has a probability of

(5)

Otherwise stated, this is the probability of finding a clustering at least as extreme by random chance, i.e. the p-value of the clustering.

Human Network Reconstruction

To construct a hybrid Homo sapiens interaction and reaction network we have combined protein-protein interactions with directional signal transduction and metabolic reactions. We have been using interaction information from IntAct [25], NetworKin [26] and from Palsson's group, consisting of 37,000 nodes (genes, proteins and small chemicals) as well as 156,000 interactions (gene-protein, protein-protein) and reactions (chemical, protein-phosphorylation, etc). We have also integrated this network with a larger literature based network available from iHOP [27] with 45,041 nodes and 438,567 interactions, which are already about 2/3s of 650,000 interactions predicted by Stumpf et al. (2008) [28]. As a third reference network, the Homo sapiens protein interaction network was downloaded from the BioGRID database version 2.0.39 [29], which was generated from literature curation of protein interaction data. The data set was filtered to include only direct and physical interactions between human proteins and all loops and duplicate edges were removed. Although, duplicate edges from different data sources and different property (e.g., an interaction identified as generic protein-protein interaction in one data-set and predicted as phosphorylation of a protein by a kinase in another data-set) were kept to emphasize the importance/validity of such interactions.

Human Response Network Analysis

The list of genes S8 and S24 are used as seed nodes to create so called response networks using the Human Network. The k-shortest paths between all pairs of seed nodes are found, where k is a positive integer. The length of a path is weighted at each edge by the average absolute log-fold change in expression of the two genes at either end. The parameter k is chosen such that the graphs showed a large number of interaction pathways while remaining easily interpretable. The reader is referred to Cabusora et al. (2005) [30] where the algorithm is described in detail.

Within the Human Network, 37 of the 8 hour nodes and 80 nodes in the 24 hour set were identified. The degree to which each gene is interconnected and with which genes they interact can help identify what role they play in the cell's response to H5N1. These networks can also help identify important centers of activity such as the FOS and EGR1 genes seen in Figure 1.

In the Human Networks generated using the genes found in the BiNGO clustering analysis as seeds it is desirable to know how often particular genes are added when Human Networks are constructed. The probability a given gene will be added to a human network given a random list of seed genes is determined using Monte Carlo simulations for calculating the p-values of the appearance of genes. The p-values calculated are not significantly different for the top several genes for 1000 or 10000 iterations. In general, the distribution is consistent within the first 1000 and afterward statistical noise is reduced allowing better accuracy and resolution of smaller values. Genes with no appearances in the Monte Carlo can only be said to have an estimated p-value of less than 1/interactions. The number of seed genes used in the Monte Carlo simulations was 213, the number found in the significance analysis. The most significant GO nodes and their associated raw p-values are given in Table 6. The p-values for the genes included by the Human Network, not including the seed genes, are listed in Table 7.

Table 6. Most significant GO nodes

Table 7. Genes found in the Human iHOP Network using genes significant at 24 hours as seeds

The network generated from the BiNGO results show several closely related genes to be networked with the genes found in the significance analysis. Interestingly, there is no correlation between the connectivity of a gene in the Human Network and the p-value of that gene appearing in the human network given a random set of seed genes. Neither the first nor the second order connectivity of the nodes is correlated with the gene p-values. How important the higher order connectivities are, however, depends on the maximum path-length allowable. Second, the weights in the expression data will alter the weight of including each gene in a path. Thus, genes that are only moderately well connected but have a low cost of inclusion may appear more frequently than genes that are better connected but with a higher cost of inclusion.

Support Vector Machines

To provide a check against the significance analysis described above, the data are also examined using Support Vector Machines (SVM). SVM is a technique used in machine learning that is designed to classify objects into one of two classes [31]. Each object has a number of parameters describing it that have been measured. In the case presented here the set of measured parameters is the set of gene expression values in one experiment, i.e., where each gene's level of expression constitutes one parameter. Each object, i.e., experiment, can be plotted in Cartesian coordinates of a dimension equal to the number of parameters, with each axis encoding the value of one parameter. I this implementation the gene list is limited to those genes with a p-value of 0.05, or about twice pcutoff or less. This helps avoid fitting to noise. Using this cutoff results in having 14 points plotted in a space of 332 dimensions rather than the total 3 × 104. Thus, each experiment can be represented in this space as a vector whose components describe the degree of regulation for each gene measured. Once all the objects are plotted, it is often possible to draw a hyperplane that divides the space so that all objects of one class are found one one side of the hyperplane and all objects of the other class are on the other side. Such a hyperplane could then be used to classify the results from a gene chip and help determine whether the results indicated the cell had been exposed to H5N1.

In this case the components of the hyperplane itself are examined rather than using it to classify unknown objects. The vector normal to the Optimal Hyper Plane (OHP) separating the two classes of objects is used to determine which genes may be important in cellular response to H5N1. The hyperplane separating the two groups (control and H5N1 exposures) maximizes the margin between the groups, i.e., the distance perpendicular to the OHP separating the nearest members of the two groups. Thus, the largest components of this vector are the directions in which the margin between the control and exposed groups is greatest. A list of the genes whose change in regulation show the greatest margin between the two groups is then compiled. The list of genes used in the SVM analysis is restricted to those with p-values of 0.05 or less. This is to avoid fitting as much statistical noise as possible. In spite of no cutoff for the fold change being required explicitly, the most prominent genes in the SVM analysis have fold changes that are generally more than a factor of two and p-values that are typically much less than the cutoff. The results are shown in Table 5 and can be compared to the results from the analysis described in Evaluation of Significance. Based on the average p-values of the genes listed, the expected number of false positives among the 35 genes listed is only around 0.3.

Authors' contributions

KT carried out the computational biology studies and performed the statistical and network analysis as well as drafted the manuscript. AZ carried out the BSL-2 molecular biology experiments and performed initial gene expression analysis. RH carried out the iHOP interaction prediction and provided the iHOP network. KSH performed the BSL-3 infection experiments and mRNA isolation for expression profiling. KSH, RMR and CVF conceived of the study and participated in its design. CVF further performed additional respone network analysis, helped to draft and finalized the manuscript. All authors have read and have approved the final manuscript.

Acknowledgements

Part of this work was sponsored by DOE/LANL under grant 20070099DR.

References

  1. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network.

    Science 2001, 292:929-934. PubMed Abstract | Publisher Full Text OpenURL

  2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry MJ, Davis AP, Dolinsky K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: Tool for the unification of biology. [http://www.geneontology.org] webcite

    Nature Genetics 2000, 25:25-29. PubMed Abstract | Publisher Full Text OpenURL

  3. Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L: Large-Scale Protein Annotation through Gene Ontology.

    Genome Res 2002, 12:785-794. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Bader GD, Cary MP, Sander C: Pathguide: a pathway resource list. [http://www.pathguide.org] webcite

    Nucleic Acids Res 2006, 1(34):D504-506. Publisher Full Text OpenURL

  5. Horimoto T, Kawaoka Y: Pandemic Threat Posed by Avian Influenza A Viruses.

    Clin Microbiol Rev 2001, 14:129-149. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Webby RJ, Webster RG: Are We Ready for Pandemic Influenza?

    Science 2003, 302(5650):1519-1522. PubMed Abstract | Publisher Full Text OpenURL

  7. Maines TR, Jayaraman A, Belser JA, Wadford DA, Pappas C, Zeng H, Gustin KM, Pearce MB, Viswanathan K, Shriver ZH, Raman R, Cox NJ, Sasisekharan R, Katz JM, Tumpey TM: Transmission and Pathogenesis of Swine-Origin 2009 A(H1N1) Influenza Viruses in Ferrets and Mice.

    Science 2009, 325(5939):484-487. PubMed Abstract | Publisher Full Text OpenURL

  8. Cristianini N, Shawe-Taylor J: An Itroduction to Support Vector Machines and other kernel-based learning methods. Cambridge, United Kingdom: Cambridge University Press; 2000. OpenURL

  9. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks.

    Bioinf 2005, 21(16):3448-3449. Publisher Full Text OpenURL

  10. Shannon1 P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. [http://www.cytoscape.org] webcite

    Genome Res 2003, 13(11):2498-2504. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing.

    J R Statist Soc B 1995, 57:289-300. OpenURL

  12. Yan SF, Fujita T, Lu J, Okada K, Zou YS, Mackman N, Pinsky DJ, Stern DM: Egr-1, a master switch coordinating upregulation of divergent gene families underlying ischemic stress.

    Nat Medicine 2000, 6:1355-1361. Publisher Full Text OpenURL

  13. Grose R, Harris B, Cooper L, Topilko P, Martin P: Immediate Early Genes krox-24 and krox-20 Are Rapidly Up-Regulated After Wouding in the Embryonic and Adult Mouse.

    Developmental Dynamics 2002, 223:371-378. PubMed Abstract | Publisher Full Text OpenURL

  14. Hao F, Tan M, Xu X, et al.: Histamine Induces Egr-1 Expression in Human Aortic Endothelial Cells via the H1 Receptor-Mediated Protein Kinase Cδ-Dependent ERK Activation Pathway.

    The Journal of Biological Chemistry 2008, 283(40):26928-26936. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Grembowicz K, Sprague D, McNeil P: Temporary Disruption of the Plasma Membrane Is Requireed for c-fos Expression in Response to Mechanical Stress.

    Molecular Biology of the Cell 1999, 10:1247-1257. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Djavani MM, Crasta OR, Zapata JC, Fei Z, Folkerts O, Sobral B, Swindells M, Bryant J, Davis H, Pauza CD, Lukashevich IS, Hammamieh R, Jett M, Salvato MS: Early Blood Profiles of Virus Infection in a Monkey Model for Lassa Fever.

    J Virol 2007, 81(15):7960-7973. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Sarmento L, Afonso C, Estevez C, Wasilenko J, pantin Jackwood M: Differential host gene expression in cells infected with highly pathogenic H5N1 avian influenza viruses.

    Vet Immunol and Immunopathol 2008, in press.

    doi:10.1016/j.vetimm.2008.05.021

    OpenURL

  18. Ma S, Grigoryev D, Taylor A, Nonas S, Sammani S, Ye S, Garcia J: Bioinformatic identification of novel early stress response genes in rodent models of lung injury.

    Am J Physiol Lung Cell Mol Physiol 2005, 289:L468-L477. PubMed Abstract | Publisher Full Text OpenURL

  19. Fan W, Yanase T, Nishi Y, Chiba S, Okabe T, Nomura M, Yoshimatsu H, Kato S, Takayanagi R, Nawata H: Functional Potentiation of Leptin-stat3 Signaling by the Androgen Receptor.

    Endocrinology 2008.

    doi:10.1210/en.2009-0431

    PubMed Abstract | Publisher Full Text OpenURL

  20. Martinerie C, Viegas-Pequignot E, Nguyen V, Perbal B: Chromosomal mapping and expression of the human cyr61 gene in tumor cells from the nervous system.

    J Clin pathol: Mol Pathol 1997, 50:310-316. Publisher Full Text OpenURL

  21. Dayem M, Moreilhon C, Turchi L, Magnone V, Christen R, Ponzio G, Barbry P: Early gene expression in wounded human keratinocytes revealed by DNA microarray analysis.

    Comparative and Functional Genomics 2003, 4:47-55. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Hubal M, Chen T, Thompson P, Clarkson P: Inflamatory gene changes associated with the repeated-bout effect.

    Am J Physiol Regul Integr Comp Physiol 2008, 294:R1628-R1637. PubMed Abstract | Publisher Full Text OpenURL

  23. Fulcher ML, Gabriel S, Burns KA, Yankaskas JR, Randell SH: Methods in Molecular Medicine. Volume 107. Humana Press Inc, 2005 chap. Well-Differentiated Human Airway Epithelial Cell Cultures; :183-206. PubMed Abstract OpenURL

  24. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.

    Proc Natl Acad Sci USA 2005, 102(43):15545-15550. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct Open Source Resource for Molecular Interaction Data.

    Nucleic Acids Res 2006, 35:D561-D565. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Linding R, Jensen LJ, Ostheimer GJ, van Vugt MA, Jørgensen C, Miron IM, Diella F, Colwill K, Taylor L, Elder K, Metalnikov P, Nguyen V, Pasculescu A, Jin J, Park JG, Samson LD, Woodgett JR, Russell RB, Bork P, Yaffe MB, Pawson T: Systematic Discovery of In Vivo Phosphorylation Networks.

    Cell 2007, 129:1415-1426. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  27. Hoffmann R, Valencia A: A gene network for navigating the literature.

    Nat Genetics 2004, 36:664. Publisher Full Text OpenURL

  28. Stumpf MPH, Thorne T, de Silva E, Steward R, An HJ, Lappe M: Estimating the size of the human interactome.

    Proc Nat Acad Sci 2008, 105:6959-6964. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  29. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets.

    Nucleic Acids Res 2006, 34:D535-D539. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. Cabusora L, Sutton E, Fulmer A, Forst CV: Differential Network Expression During Drug and Stress Response.

    Bioinformatics 2005, 21(12):2898-2905. PubMed Abstract | Publisher Full Text OpenURL

  31. Cortes C, Vapnik V: Support-Vector Networks.

    Machine Learning 1995, 20:273-297. OpenURL