Email updates

Keep up to date with the latest news and content from BMC Genomics and BioMed Central.

Open Access Research article

Computational analysis of LexA regulons in Cyanobacteria

Shan Li, Minli Xu and Zhengchang Su*

Author Affiliations

Bioinformatics Research Center, Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Bioinformatics Building 351, 9201 University City Blvd., Charlotte, NC 28223, USA

For all author emails, please log on.

BMC Genomics 2010, 11:527  doi:10.1186/1471-2164-11-527

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/11/527


Received:2 February 2010
Accepted:29 September 2010
Published:29 September 2010

© 2010 Li et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The transcription factor LexA plays an important role in the SOS response in Escherichia coli and many other bacterial species studied. Although the lexA gene is encoded in almost every bacterial group with a wide range of evolutionary distances, its precise functions in each group/species are largely unknown. More recently, it has been shown that lexA genes in two cyanobacterial genomes Nostoc sp. PCC 7120 and Synechocystis sp. PCC 6803 might have distinct functions other than the regulation of the SOS response. To gain a general understanding of the functions of LexA and its evolution in cyanobacteria, we conducted the current study.

Results

Our analysis indicates that six of 33 sequenced cyanobacterial genomes do not harbor a lexA gene although they all encode the key SOS response genes, suggesting that LexA is not an indispensable transcription factor in these cyanobacteria, and that their SOS responses might be regulated by different mechanisms. Our phylogenetic analysis suggests that lexA was lost during the course of evolution in these six cyanobacterial genomes. For the 26 cyanobacterial genomes that encode a lexA gene, we have predicted their LexA-binding sites and regulons using an efficient binding site/regulon prediction algorithm that we developed previously. Our results show that LexA in most of these 26 genomes might still function as the transcriptional regulator of the SOS response genes as seen in E. coli and other organisms. Interestingly, putative LexA-binding sites were also found in some genomes for some key genes involved in a variety of other biological processes including photosynthesis, drug resistance, etc., suggesting that there is crosstalk between the SOS response and these biological processes. In particular, LexA in both Synechocystis sp. PCC6803 and Gloeobacter violaceus PCC7421 has largely diverged from those in other cyanobacteria in the sequence level. It is likely that LexA is no longer a regulator of the SOS response in Synechocystis sp. PCC6803.

Conclusions

In most cyanobacterial genomes that we analyzed, LexA appears to function as the transcriptional regulator of the key SOS response genes. There are possible couplings between the SOS response and other biological processes. In some cyanobacteria, LexA has adapted distinct functions, and might no longer be a regulator of the SOS response system. In some other cyanobacteria, lexA appears to have been lost during the course of evolution. The loss of lexA in these genomes might lead to the degradation of its binding sites.

Background

The LexA protein was first characterized as the transcriptional regulator of the SOS response in Escherichia coli [1,2], and later in several other bacteria, including Bacillus subtilis [3,4] and Fibrobacter succinogenes [5]. In fact, the lexA gene is found in almost all eubacterial groups examined so far [5,6]. In E. coli, around 30 genes involved in the SOS response are under the regulation of LexA [2]. Under normal growth conditions, LexA represses the SOS response genes by binding to their promoter regions, and thus blocking their transcription. When DNA is damaged, the binding of RecA to the released single-stranded DNA induces the auto-cleavage of the Ala84-Gly85 peptide bond [7,8] in LexA, thereby inhibiting the dimerization of LexA and preventing its binding to DNA [9-11]. In this manner, SOS response genes are de-repressed and expressed at different time points and different levels in a coordinated way [10].

LexA in E. coli consists of an N-terminal DNA-binding domain and a C-terminal dimerization domain [8,12]. The N-terminal contains three α-helices (I, II, III) and an anti-parallel β-sheet [12]. Helices II and III form a helix-turn-helix DNA-binding motif, and all the DNA-contacting residues Ser39, Asn41, Ala42, Glu44 and Glu45 are located in helix III [13] as revealed by both NMR [12] and X-Ray crystallography analyses [8]. The LexA-binding sites in E. coli were found to be a 16-bp palindromic motif with the consensus sequence CTG(TA)5CAG [14]. It has been shown that two reactive residues Ser119 and Lys156 in E. coli LexA are critical for the auto-hydrolysis of the peptide bond Ala84-Gly85 [1,9]. The core set of the SOS response system consists of lexA, recA, uvrABCD, umuCD and ruvB [10]. Upon the auto-hydrolysis of LexA, the uvrABCD operon is expressed first, whose products are responsible for the nucleotide excision repair (NER). Then recA and several other genes for homologous recombination are expressed, retrieving the excised DNA double strands. Next, the cell division inhibitor SfiA is induced to guarantee a sufficient time for the DNA repairing to be completed. In the end, if the DNA is not completely repaired, the operon umuCD encoding the mutagenic DNA repair polymerase Pol V will be induced to perform translesion DNA synthesis [9,14]. Since the lexA gene itself is also under the control of LexA, after the damaged DNA is repaired, the activity of RecA declines, the production of LexA surpasses its auto-cleavage. Consequently, the increased concentration of LexA restores the inhibition of the expression of the SOS response genes.

More recently, LexA homologs were also experimentally studied in a few cyanobacteria [15-21]. These studies suggest that LexA in Nostoc sp. PCC 7120 [16] binds to the promoter regions of lexA and recA; however, LexA in Synechocystis sp. PCC 6803 may regulate different genes/systems other than the SOS system. Domain et al. concluded from microarray gene profiling analysis [21] that LexA in this species might be involved in carbon metabolism. Later, LexA in Synechocystis sp. PCC 6803 was found to regulate the crhR gene encoding a RNA helicase [19]. Moreover, it has been shown that the transcription of the bidirectional hydrogenase genes hoxEFUYH was regulated by LexA in Synechocystis sp. PCC6803[17]. In Nostoc sp. PCC 7120, hoxEFUYH genes are split into two separate operons, and LexA was found to bind to the upstream regions for both operons [15]. Mazon et al. [16] showed that the LexA-binding sites in Nostoc sp. PCC 7120 have a 14-bp pseudo-palindromic structure in the form of RGTACNNNDGTWCB, which are similar to those in B. subtilis. Additionally, Sjöholm et al. [15] found two putative palindromic LexA-binding sites: one in the promoter region of alr0750-hoxE-hoxF that resembles Mazon's LexA boxes[16], and another, TTACACTTTAA in the upstream region of hoxU in Nostoc sp. PCC 7120. Meanwhile, multiple putative LexA boxes were identified in Synechocystis sp. PCC6803: a 13-bp pseudo-palindromic segment AGTAACTAGTTCG in the upstream region of hoxE, which is similar to Mazon's site but with one base deletion [17]; another direct repeat pattern, CTA-N9-CTA proposed to be recognized by LexA in vitro [20]; and two putative LexA boxes that resemble none of the putative LexA boxes listed above [18]. Despite this progress, a more extensive study of LexA proteins and their binding sites and regulons in cyanobacterial genomes is still needed. In this study, we have predicted LexA-binding sites and regulons in all the sequenced cyanobacterial genomes that harbor a lexA gene, and analyzed the evolutionary changes in the LexA regulons in cyanobacteria, as well as their relationship with those in proteobacteria and gram-positive bacteria.

Results and Discussion

Conservation of the DNA-binding domain of LexA in cyanobacteria

We identified orthologs of the LexA protein in Nostoc sp. PCC7120 (alr4908) in 26 of the 33 sequenced cyanobacterial genomes using the bi-directional best hit (BDBH) method based on BLASTP search with an E-value cutoff 10-10 (see Methods). Seven genomes appear not to harbor a lexA gene under this criterion, namely, Gloeobacter violaceus PCC7421, Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a(2-13) B-Prime, Synechococcus elongatus PCC6301, Synechococcus elongatus PCC7942, Trichodesmium erythraeum IMS101 and Thermosynechococcus elongatus BP-1. We removed the Synechococcus elongatus PCC7942 genome from our study since Synechococcus elongatus PCC6301 is virtually identical to it [22]. However, an ortholog of the lexA gene (Gll0709) does exist in Gloeobacter violaceus PCC7421. The reason we failed to identify this ortholog is that it does not meet our BDBH criterion due to its largely divergent sequence. The phylogenetic tree of these 27 LexA amino acid sequences indicates that they can be clustered into three groups (Figure 1A), corresponding to the previously described Clade A (containing Gloeobacter violaceus PCC7421), Clade C (containing small marine Prochlorococcus and Synechococcus), and Clade B (containing most remaining cyanobacteria) [23,24]. However, aside from Gloeobacter violaceus PCC7421, the DNA-binding domains (DBD) of LexA from these cyanobacteria are highly conserved (Figure 1B), especially the helix III, where DNA-contacting residues are located [13]. This result is in agreement with earlier observations [16,21]. This provides the rationale of our analysis, including the phylogenetic footprinting analysis (next section) and genome-wide scanning for LexA-binding site predictions. On the other hand, since the DBD in Gloeobacter violaceus PCC7421 is quite different from those in other cyanobacteria, especially where the DNA-contacting residues locate, thus, we excluded it from our study, leaving 31 species/strains for the putative LexA regulon prediction.

thumbnailFigure 1. Phylogenetic relationships of 27 cyanobacterial LexA proteins and their DNA-binding domains. (A) Phylogenetic relationships of the 27 cyanobacterial LexA proteins. The tree is rooted with the LexA in E. coli K12. Bootstrap values are shown on the nodes. (B) Alignment of the DNA-binding domain (DBD) of the 27 cyanobacterial LexA proteins. The DBD of LexA contains a helix-turn-helix motif, and DNA-contacting residues are located in helix III, and are labelled by vertical arrows.

LexA-binding sites predicted by phylogenetic footprinting

We considered both an operon and a singleton gene as a transcription unit (TU). As Mazon et al. [16] have demonstrated the binding of LexA to the upstream regions of two genes, lexA and recA, and predicted LexA-binding sites for other four genes (uvrA, ssb, alr4905, and all4790) in Nostoc sp PCC7120, we used phylogenetic footprinting to identify possible LexA-binding sites in the pooled 118 inter-TU sequences associated with these six genes in Nostoc sp. PCC7120 [16] and their orthologs in the other 25 cyanobacterial genomes (excluding Gloeobacter violaceus PCC7421) that harbor a lexA gene (see Methods). We identified 49 high-scoring 14-bp palindromic sequences (Table 1) out of the 118 input sequences by applying the motif finding tools MEME [25] and BioProspector [26] and incorporating the best motifs found by these two programs (See Methods and Additional file 1). However, the putative LexA box AGTCCTAGAGTCCT (Additional file 1) identified in Synechocystis sp. PCC6803 was not identified by Patterson-Fortin et al. [20] using DNaseI footprinting assays or by Gutekunst et al. [17]. Therefore, we removed this site, leaving 48 putative LexA-binding sites (Table 1) for profile construction. The two LexA boxes that have been characterized in Nostoc sp PCC 7120 [16] were accurately recovered by the phylogenetic footprinting procedure (Table 1), suggesting that most of these high-scoring motifs are likely to be genuine LexA boxes. These putative LexA-binding sites show either a strong palindromic structure similar to the experimentally characterized LexA boxes in Nostoc sp. PCC7120 [16], or a tandem repeat structure with the consensus sequence AGTACWNWTGTACT. As demonstrated in Figure S1 in Additional file 2, this pattern is rather similar to the consensus sequence of the LexA-binding sites previously identified in B. subitlis (CGAACN4GTTCG) [3], and to a less extent, to that of LexA-binding sites found in α-proteobacteria (GTTCN7GTTC and GAACN7GAAC) [27], but differs remarkably from that in E. coli CTG(TA)5CAG [14]. These results are consistent with our phylogenetic analysis of the 183 LexA proteins detected in 598 genomes, showing that LexA proteins in cyanobacteria are more closely related to those in gram-positive and α-proteobacteria bacteria than to those in γ-proteobacteria (Figure S2 in Additional file 3). Accordingly, since the LexA-binding sites in B. subtilis [3,16] have a palindromic structure, it is not surprising that the LexA-binding sites in cyanobacterial genomes might have a similar palindromic structure.

Table 1. 48 Putative LexA binding sites identified by phylogenetic footprinting analysis

Additional file 1. Supplementary figures. Additional file 1 contains one list: the putative LexA-binding sites found by MEME and BioProspector. Only the top1 motifs are included

Format: DOC Size: 42KB Download file

This file can be viewed with: Microsoft Word ViewerOpen Data

Additional file 2. Supplementary figures. Additional file 2 contains three figures. Figure S1: Phylogenetic tree of LexA-binding sites in cyanobacteria, B. subtilis, α-proteobacteria and E. coli. Figure S3: Results of genome-wide scanning for LexA-like binding sites in the five genomes that do not encode a lexA gene. Figure S4: Multiple sequence alignments of the full-length LexA in the 27 cyanobacterial genomes and E. coli.

Format: PPT Size: 1.6MB Download file

This file can be viewed with: Microsoft PowerPoint ViewerOpen Data

Additional file 3. Supplementary figures. Additional file 3 contains one figure. Figure S2: Phylogenetic tree of 183 LexA sequences from different bacteria domains.

Format: PDF Size: 1.2MB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Genome-wide prediction of LexA-binding sites and regulons in cyanobacterial genomes

Both consensus sequence and position weight matrix (PWM) have been widely used to represent the pattern of similar sequences. The advantage of PWM (or profile) methods over the consensus sequence methods is that the former can capture more quantitative information about the patterns by using a probabilistic model to represent the sequences. In this way, it can differentiate subtly conserved positions from the non-conserved ones [28]. In our study, We used the profile of these 48 LexA boxes (Table 1) to scan the 31 sequenced cyanobacterial genomes to predict additional putative LexA-binding sites and members of LexA regulons, using a scanning algorithm [29-31] that incorporates orthologous information and computes a log-odds ratio (LOR) score for evaluating the confidence of predictions in each genome (see Methods for details). The predicted results with a p-value < 0.01 for the 26 genomes harboring a lexA gene are listed in Table S1-26 (Additional file 4), while those for the five genomes without a lexA gene are listed in Table S27-31 (Additional file 5). The predicted results with a p-value < 0.05 for the 31 cyanobacteria are summarized in Table S32-62 (Additional file 6).

Additional file 4. Supplementary tables. Additional file 4 contains 26 tables (Table S1-26), the predicted LexA-binding sites and regulons in the 26 cyanobacterial genomes at p < 0.01.

Format: DOC Size: 2.2MB Download file

This file can be viewed with: Microsoft Word ViewerOpen Data

Additional file 5. Supplementary tables. Additional file 5 contains 5 tables: Table S27-31 containing the predicted LexA-binding sites in the five cyanobacterial genomes without a lexA gene at p < 0.01.

Format: DOC Size: 238KB Download file

This file can be viewed with: Microsoft Word ViewerOpen Data

Additional file 6. Supplementary tables. Additional file 6 contains 33 tables. Tables S32-S57 contains the predicted LexA-binding sites and regulons in the 26 cyanobacterial genomes harboring a lexA gene at p < 0.05. Table S58-62: predicted LexA-binding sites and regulons in the five cyanobacterial genomes without a lexA gene at p < 0.05. Table S63: list of the orthologs of the E. coli SOS response genes in 32 sequenced cyanobacteria. Table S64: list of the most conserved LexA regulon members (number of occurrence > = 6) in the 26 cyanobacterial genomes at p < 0.01.

Format: ZIP Size: 408KB Download fileOpen Data

The score of a detected putative LexA binding site for a TU is the sum of two terms: one evaluates the extent to which the putative LexA binding site resembles the scanning profile; the other evaluates the similarity of this binding site to those identified for the orthologs of genes within the TU in the other genomes. To evaluate the confidence of each motif score s, we used randomly selected coding sequences as the null model to test the statistical significance. A false positive rate was used to evaluate this statistical significance, which was defined as the fraction of the randomly selected coding sequences containing binding sites with a score higher than the cutoff s in the genome. We chose randomly selected coding regions as the null model based on the assumption that a coding sequence is less likely to contain cis-regulatory binding sites than an intergenic sequence. Although it might be possible for genuine LexA boxes to occur in coding regions [15,20], such kind of binding sites should be rare. The LOR function for a genome evaluates the ratio of the fraction of the inter-TU sequences containing a binding site with a score higher than s to the fraction of the randomly selected coding sequences containing a binding site with a score higher than the same s in the genome. Accordingly, positive LOR values that increase monotonically with the increase in binding site sores would suggest that an inter-TU sequence is more likely to contain a high-scoring LexA-binding site than does a randomly selected coding sequence in the genome.

As shown in Figure 2, when the motif score s increases beyond some value, the LOR is generally high for most of the 26 cyanobacteria that harbor a lexA gene, therefore those genomes with high LOR values are likely to contain some true binding sites. Exceptions exist in five genomes, namely, Cyanothece sp. PCC 8801, Synechocystis sp. PCC6803, Synechococcus RCC307, Synechococcus sp. PCC 7002, and Microcystis aeruginosa NIES-843, in which the LOR curves oscillate around zero when binding site score s increases. These poor LOR values might suggest that there are not more high-scoring LexA-binding sites in the inter-TU regions than in the coding regions in the five genomes. The reason for this could be that our scanning algorithm rewards a binding site that is shared by orthologs in the other genomes. If a true binding site is unique to a genome, then it will not score high. In this sense, LexA is probably no longer a major SOS response regulator in these genomes. Instead, it might have become a specific local regulator during the course of evolution to adapt to their unique living environments (we will return to this subject later). In the case of Synechocystis sp. PCC6803, it is noted that the LexA-binding sites identified by Patterson-Fortin et al. [20] are totally different from those identified by Mazon et al. [16], and that the LexA sequence in this genome is largely divergent from those in the other genomes (Figure 1A). Accordingly, the LexA binding sites in this genome might differ in some way from those in the other genomes, which can be another reason for its low LOR values.

thumbnailFigure 2. Evaluation of the predictions of LexA-binding sites in the 26 cyanobacterial genomes. The green curves represent the probability <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M1">View MathML</a> and the blue curves <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M2">View MathML</a>. The cyan curves are the number of iner-TU regions containing a putative binding site with a score > s, <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M3">View MathML</a>. The red curves are the log-odds ratio (LOR), defined as <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M4">View MathML</a>, (see Methods). Refer to Abbreviation for the full name of each genome.

In contrast, as shown in Figure S3 in Additional file 2, the LOR values in the five genomes that do not harbor a lexA gene (Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a (2-13) B-Prime, Synechococcus elongatus PCC 6301, Thermosynechococcus elongates BP-1 and Trichodesmium erythraeum IMS101) oscillate around or decrease below zero when the motif score s increases beyond a certain value, implying that the chance to find a relative high-scoring putative LexA-binding site in an inter-TU region is not higher than in a randomly chosen coding sequence, suggesting that these genomes are unlikely to contain functional LexA-binding sites. On the other hand, the LOR values in the three genomes Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a (2-13) B-Prime and Trichodesmium erythraeum IMS101 are relatively higher than those in the other two genomes (Additional file 2, Figure S3), or even could be comparable to those of the five poor-LOR-valued cyanobacteria that harbor a lexA gene (Figure 2). In fact, the numbers of predicted binding sites in the three genomes are not too small (Table S27, S28, S31 in Additional file 5), which suggests that a few putative LexA-like binding sites exist in these genomes. A possible explanation for this phenomenon could be that these LexA-like binding sites are recognized by other transcription factors that have similar DNA-binding domains to that of LexA. The predictions of LexA regulons in the 26 cyanobacterial genomes that harbor a lexA gene are summarized in Table 2.

Table 2. Summary of genome-wide LexA-binding site predictions in the 26 cyanobacterial genomes

Conservation and diversity of the putative LexA regulons in cyanobacteria

To investigate how well the predicted LexA regulons in the 26 cyanobacterial genomes are conserved, we constructed a LexA regulon conservation tree based on the pairwise comparison of the predicted LexA regulons in these genomes (see Methods). As shown in Figure 3, these genomes are divided into two groups. Interestingly, one group is exclusively comprised of marine strains, and the other group contains the remaining genomes isolated from different non-marine habitats. In the former group, high light (HL) adapted and low light (LL) adapted ecotypes are largely grouped into two sub-groups. The results suggest that the composition of LexA regulons is dependent on the habitat of the organisms to a large extent. The general topology of the tree (Figure 3) is basically consistent with both the 16S rRNA gene tree (Figure 4) and the LexA protein tree of these genomes (Figure 1A). Furthermore, both the HL and LL adapted marine sub-groups are very compact, indicating that the predicted LexA regulons in both sub-groups are relatively conserved. In contrast, the species in the non-marine habitats are not so close to one another (Figure 3), suggesting that the putative LexA regulons in these genomes share few genes with one another except for the closely related Anabaena variabilis ATCC 29413 and Nostoc sp. PCC7120. The tree also indicates that Microcystis aeruginosa NIES 843 and Synechocystis sp. PCC6803 have the most distinct LexA regulons from other cyanobacterial genomes (Figure 3).

thumbnailFigure 3. Conservation relationships among the predicted LexA regulons in the 26 cyanobacterial genomes. The tree is based on the pairwise conservation of the predicted LexA regulons in the 26 cyanobacterial genomes (see Methods).

thumbnailFigure 4. Phylogenetic relationships of 32 cyanobacterial genomes based on the 16S rRNA genes. The tree is rooted with the 16S rRNA gene of E. coli K12. Bootstrap values are shown on the nodes. Cyanobacterial genomes that do not encode a lexA gene are shown in red.

Functional classification of putative LexA regulons in cyanobacteria

Predicted members of LexA regulons in the 26 cyanobacteria that harbor a lexA gene are listed in Tables S1-S26 in Additional file 4, their functions can be summarized as follows.

1. SOS response system

As shown in Table S63 in Additional file 6, all the 33 cyanobacterial genomes included in this study encode a few SOS response genes found in E. coli. Several of the SOS genes in some of the 26 genomes that harbor a lexA gene bear a high-scoring putative LexA-binding site in their regulatory regions (Table S1-26 in Additional file 4). In particular, two of the core SOS response genes [5,6], namely, recA and lexA, are among the most conserved putative LexA targets in the 26 cyanobacterial species/strains (Table S64 in Additional file 6). In addition, the umuC and umuD genes encoded in 13 genomes are also predicted to bear a putative LexA-binding site in their promoter regions (Table S64 in Additional file 6, Table 3 and Additional file 4). These results suggest that as in E. coli, the SOS response in most cyanobacteria might still be regulated by LexA. However, the other SOS response genes were found to bear a putative LexA-binding site only in a few genomes (Table 3). For instance, a high-scoring LexA-binding site for the ruvB gene encoding Holliday junction DNA helicase B was found only in HL adapted Prochlorococcus ecotypes MIT9312, MIT9515, MIT9215, MED4 and AS9601. Moreover, in the case of the nucleotide excision repair (NER) genes uvrA, B, C and D, which are under the regulation of LexA in E. coli [10], we were able to identify putative LexA-binding sites only in the promoter regions of the uvrA and uvrB in Nostoc sp. PCC7120 and the promoter region of uvrD in Prochlorococcus marinus MIT 9312 (Tables 3). Thus, it is likely that the NER process in the remaining genomes is regulated by some transcription factor other than LexA, given that uvr genes are present in all the 32 cyanobacterial genomes analyzed in this study, including those that do not encode a lexA gene (Table S63 in Additional file 6). These results are consistent with the earlier observation that LexA target genes in bacteria are highly diversified in order for them to adapt to different ecological niches [5,6].

Table 3. Putative LexA regulon members involved in various biological processes

On the other hand, in Synechococcus PCC7002, Synechococcus RCC307 and Synechococcus WH7803, LexA boxes were only detected for one of the core SOS response genes, i.e., SYNPCC7002_A0426 (recA) in Synechococcus PCC7002, SynRCC307_1756 (lexA) in Synechococcus RCC307 and SynWH7803_0439 (recA) in Synechococcus WH7803, although these genomes all encode a lexA gene and other core SOS response genes, such as recA (SYNPCC7002_A0426, SynRCC307_2111 and SynWH7803_0439,) and ruvB (SYNPCC7002_A1390, SynRCC307_1756 and SynWH7803_0185), umuC (SynRCC307_0043 and SynWH7803_1080,) and umuD (SynRCC307_0042 and SynWH7803_1081). Since only one single SOS response gene bears a putative LexA box in these genomes, it is likely that the role of LexA in the regulation of the SOS response in these genomes might have been attenuated. The case of Synechocystis sp. PCC6803 seems to go even further in this direction as detailed below.

As indicated previously [16,18], the LexA protein of Synechocystis sp. PCC6803 is unusual in two aspects compared to those in the other genomes analyzed in this study. First, as shown in Figure S4 in Additional file 2, the Ala-Gly dyad in the N-terminus of LexA responsible for auto-cleavage of the protein in all other cyanobacteria as well as in E. coli and B. subtilis [16] are replaced by Gly-Gly in Synechocystis sp. PCC6803. Second, the reactive residue Ser (Ser119 of LexA in E. coli) that attacks the Ala-Gly peptide bond is replaced by Asp of LexA in Synechocystis sp. PCC6803 [16,18]. It has been shown that SOS induction cannot be initiated by a non-cleavable LexA repressor [10,16]. Therefore, it is highly likely that LexA in Synechocystis sp. PCC6803 cannot undergo the auto-cleavage reaction in response to DNA damage, and it might have adopted a different function other than the canonical SOS response regulator seen in E. coli and B. subtilis [32]. This argument is consistent with the observation that Synechocystis sp. PCC6803 has a notably larger branch length in the 27 LexA protein tree (Figure 1A), but this is not seen in the 16S rRNA gene tree (Figure 4).

Although the Synechocystis sp. PCC6803 genome harbors some core SOS response genes including lexA and recA (Table S63 in Additional file 6), none of them belongs to our predicted LexA regulon at a p-value < 0.01 (Table S26 in Additional file 4 and Table 3). The mutS (sll1772) gene is the only gene that is likely to be in involved in DNA mismatch repair, while bearing a putative LexA binding site in the genome. However, the orthologs of mutS is not under the regulation of LexA in E. coli [2,33] or within the putative LexA regulon of any other cyanobacteria (Table S1-26 in Additional file 4). These results suggest that at least most of SOS response genes are not under the regulation of LexA in Synechocystis sp. PCC6803. Indeed, using microarray gene expression profiling in response to lexA depletion, Domain et al. [21] concluded that LexA in Synechocystis sp. PCC6803 might be involved in carbon metabolism or controlled by carbon availability rather than the regulation of SOS response. However, our predicted LexA regulon in Synechocystis sp. PCC6803 (Table S26 in Additional file 4) has no intersection with the LexA-responsive genes identified by Domain et al. [21]. Since the LexA-binding sites that were experimentally characterized [17,20] in Synechocystis sp. PCC6803 are different from the sequences in our scanning profile, and considering the distinct nature of the LexA protein in Synechocystis sp. PCC6803 indicated above, it would be particular interesting to determine by experiment the function of the predicted sites in this genome.

Thus, although Synechocystis sp. PCC6803 clearly harbors the components of a basic SOS response system (Table S63 in Additional file 6), it is probably no longer under the regulation of LexA. Accordingly, LexA in this genome might have adopted a different function. Thus, the loss of the original function of LexA in Synechocystis sp. PCC6803 is coupled with the loss of the sequence constraint, thereby accelerating its divergence from other cyanobacterial LexA proteins, at both the sequence and functional levels. On the other hand, given the importance of the SOS response in cell survival, it is highly likely that the transcriptional regulator of the SOS response system in Synechocystissp. PCC6803 has been replaced by another protein.

2. Other cellular processes

Interestingly, we also found putative LexA-binding sites in the regulatory regions of genes that participate in various cellular processes in these 26 cyanobacterial genomes (Table 3). The major cellular processes that are likely under the regulation of LexA are summarized below.

2.1 Photosynthesis

Putative LexA-binding sites were predicted for the following photosynthetic genes in the 26 cyanobacteria that harbor a lexA gene with p < 0.01(Table 3, Table S1-26 in Additional file 4): Ava_3553, A9601_12231, PMM1117, P9215_12531, P9301_12241, PMT9312_1128, and P9515_12081, coding for a photosystem II reaction center protein PsbY; slr0906, coding for the photosystem II CP47 protein; and MAE_44810, PMT9312_1615, PMN2A_1682a and SYNPCC7002_A1008, coding for a protein involved in photosystem I. These results suggest that the SOS response system might have cross-talk with photosynthesis in those genomes.

2.2. Transporters

Around 20 genes encoding transporters were predicted to bear a putative LexA box (Table 3). Most of them belong to the ABC transporter proteins, including AM1_4624 in Acaryochloris marina MBIC11017, Ava_4995 in Anabaena variabilis ATCC29413, MAE_18340 in Microcystis aeruginosa NIES843, Npun_F3763 in Nostoc punctiforme PCC73102, alr5147 in Nostoc sp PCC7120, P9215_08441 in Prochlorococcus marinus MIT9215, P9303_15661 in Prochlorococcus marinus MIT9303, P9515_06251 in Prochlorococcus marinus MIT9515, SYNW2111 in Synechococcus sp WH8102 and slr0467 in Synechocystis sp. PCC6803. In addition, several toxin and antibiotics exporters were identified to have a putative LexA-binding site in their regulatory regions, including cadmium resistance transporter Ava_4997 in Anabaena variabilis ATCC29413; MFS (major facilitator superfamily) multidrug efflux transporter P9301_11521 in Prochlorococcus marinus MIT9301 and A9601_11511 in Prochlorococcus marinus AS9601; multidrug efflux ABC transporter P9515_06251 in Prochlorococcus marinus MIT9515 and SYNW0959 in Synechococcus sp WH8102; putative ABC transporter/multidrug efflux family protein SYNW2111 in Synechococcus sp WH8102; drug exporter-1 ABC transporter ATPase subunit AM1_4624 in Acaryochloris marina MBIC11017. These findings are interesting since it has been shown that the SOS response system is related to drug resistance in E. coli [34,35] and Staphylococcus aureus [35-37] by mechanisms that are not fully understood. It was reported that the vP2449 gene encoding a toxin exporter responsible for xenobiotic resistance in Vibrionales parahaemolyticus was under the direct control of LexA [38]. Therefore, it is likely that these drug resistance genes are regulated by LexA, thereby coupling the SOS response to drug resistance in these cyanobacteria.

The origin of the lexA gene in cyanobacteria

As indicated earlier, 27 of the 32 cyanobacterial genomes analyzed evidently harbor a lexA ortholog, while the remaining five genomes do not, even when being scrutinized by more sensitive sequence search methods such as PSI-BLAST (data not shown). The five cyanobacteria lacking a LexA are Synechococcus sp. JA-3-3Ab A-Prime, Synechococcus sp. JA-2-3B'a(2-13) B-Prime, Synechococcus elongatus PCC6301, Trichodesmium erythraeum IMS101 and Thermosynechococcus elongatus BP-1. However, the core SOS response genes remain in these five genomes (Table S63 in Additional file 6). In the tree of 183 detected LexA proteins in 598 sequenced genomes (Figure S2 in Additional file 3), the 26 cyanobacterial LexA proteins that are detected by BDBH (see Methods) form a monophyletic group while LexA in Gloeobacter violaceus PCC7421 is clustered with the group of α-proteobacteria. Furthermore, the topology of the 16S rRNA gene tree (Figure 4) and the LexA tree/subtree of 27 cyanobacterial genomes (Figure 1A and Figure S2 in Additional file 3) are quite similar. This result suggests that lexA in the 26 cyanobacterial genomes (excluding Gloeobacter violaceus PCC7421) is likely to be vertically inherited from the last common ancestor of cyanobacteria. However, Gloeobacter violaceus PCC7421 might have lost its LexA protein during evolution and obtained an ortholog later through horizontal transfer from an α-proteobacterium. The five genomes that lack a lexA gene do not form a monophyletic group in the 16S rRNA gene-based phylogenetic tree of these 32 cyanobacteria (Figure 4). In particular, Synechococcus elongatus PCC6301, and Trichodesmium erythraeum IMS101 are spread in a clade whose members except these two genomes harbor a lexA gene. The most parsimonious explanation of this distribution would be that these two genomes Synechococcus elongatus PCC6301 and Trichodesmium erythraeum IMS101 lost their lexA genes through two independent events (one for each genome) to adapt to their corresponding environments during the course of evolution. Furthermore, the remaining three genomes, Thermosynechococcus elongatus BP-1, Synechococcus sp. JA-3-3Ab A-prime and Synechococcus sp. JA-2-3B'a (2-13) B-prime, which do not possess a lexA gene, branch earlier from the others (Figure 4). A plausible explanation of this distribution would be that these genomes lost their lexA genes inherited from the last common ancestor of cyanobacteria during the course of evolution. Interestingly, all these three genomes are thermophilic, their extreme ecological niches might facilitate the loss of the lexA gene. Since the core SOS response genes remain in these five genomes (Table S63 in Additional file 6) after lexA was lost, they might have been hijacked by another transcription factor given the importance of the regulation of the SOS response genes for cell survival. The genomes that lost their lexA gene appear to have lost LexA-binding sites (Figure S3 in Additional file 2). Alternatively, these five cyanobacteria might still harbor a lexA gene that has largely diverged from the others' during evolution to such a level that our method could not detect them.

In addition, it has been suggested that the lexA gene was derived from gram-positive bacteria, which then spread into cyanobacteria and fibrobacteres. Then α-proteobacteria acquired lexA from cyanobacteria [5,6,16]. Our phylogenetic analysis of the LexA proteins and their binding sites supports such an argument. As mentioned before, cyanobacterial LexA proteins are more closely-related to those in gram-positive bacteria and α-proteobacteria than those in the other groups (Figure S2 in Additional file 3), and the predicted LexA-binding sites in cyanobacteria are clustered together with those in the gram-positive bacterium B. subtilis and in α-proteobacteria, but are far away from those in E. coli (Figure S1 in Additional file 2).

Moreover, Erill et al. [27] have suggested that there is a common set of genes in the LexA regulon of proteobacteria and gram-positive bacteria: recA, uvrA, ssb, and ruvC. However, our predicted LexA regulons in cyanobacteria do not always include this set of genes. Thus, the concept of a common set of SOS response gene in its more general form warrants further scrutinization.

Conclusions

In this study we have predicted LexA-binding sites and analyzed the putative LexA regulons in 26 cyanobacterial genomes that harbor a lexA gene using a highly efficient motif scanning and regulon prediction algorithm. In most lexA-containing cyanobacterial genomes, some SOS response genes bear a putative LexA box. Some genes involved in various cellular processes such as photosynthesis, drug resistance, etc. are also predicted to bear a putative LexA box in their promoter regions. However, in Synechocystis sp. PCC6803, LexA might have adopted a new function and no longer be in charge of the SOS response genes. In some genomes, lexA was likely lost during the course of evolution accompanied by the loss of its binding sites. The SOS response genes in these genomes that appear to lack a lexA gene might be regulated by another or multiple transcription factors. Moreover, we conclude that cyanobacteria inherited the lexA gene from their last common ancestor; however, substantial genome-wide turnover seems to have led to the high degree of variation of the LexA regulons in some species during evolution.

Methods

Materials

The sequences and annotation files of 33 sequenced cyanobacterial and the other genomes were downloaded from NCBI at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ webcite. The cyanbacterial genomes used in this study include: Acaryochloris marina MBIC11017 (MBIC11017), Anabaena variabilis ATCC 29413 (ATCC29413), Synechococcus sp. JA-3-3Ab (A-Prime), Synechococcus sp. JA-2-3B'a (2-13) (B-Prime), Cyanothece sp. PCC 8801(PCC8801), Gloeobacter violaceus PCC7421 (PCC7421), Microcystis aeruginosa NIES 843 (NIES843), Nostoc punctiforme PCC 73102 (PCC73102), Nostoc sp. (PCC7120), Prochlorococcus marinus AS9601 (AS9601), Prochlorococcus marinus CCMP1375 (CCMP1375), Prochlorococcus marinus MED4 (MED4), Prochlorococcus marinus MIT9313 (MIT9313), Prochlorococcus marinus MIT 9211 (MIT9211), Prochlorococcus marinus MIT 9215 (MIT9215), Prochlorococcus marinus MIT 9301 (MIT9301), Prochlorococcus marinus MIT 9303 (MIT9303), Prochlorococcus marinus MIT 9312 (MIT9312), Prochlorococcus marinus MIT 9515 (MIT9515), Prochlorococcus marinus NATL1A (NATL1A), Prochlorococcus marinus NATL2A (NATL2A), Synechococcus sp. CC9311 (CC9311), Synechococcus sp. CC9605 (CC9605), Synechococcus sp. CC9902 (CC9902), Synechococcus sp. PCC 7002 (PCC7002), Synechococcus sp. RCC307 (RCC307), Synechococcus WH 7803 (WH7803), Synechococcus elongatus PCC 6301 (PCC6301), Synechococcus sp. WH8102 (WH8102), Synechocystis sp. PCC6803 (PCC6803), Synechocystis sp. PCC7942 (PCC7942), Thermosynechococcus elongates BP-1 (BP-1) and Trichodesmium erythraeum IMS101 (IMS101).

Prediction of transcription units

We predicted the operon structures in cyanobacterial genomes using the operon prediction algorithm developed by Dam et al. [39]. The algorithm is based on the integration of both genome-specific and comparative genomic information. In this work, both the multi-gene operon and singleton operon (containing one gene) are considered as a transcription unit (TU), and the upstream intergenic sequence of the first open reading frame is not considered as a part of the operon.

Prediction of orthologs

We used the bi-directional best hit (BDBH) method based on BLASTP searches with an E-value cut-off of 10-10 for both directions to predict orthologous protein pairs between any two proteomes. The BDBH method assumes that a cross-species protein pair are orthologous if each protein returns the other as the best hit in the whole proteome comparison [40].

Phylogenetic analysis

To construct the phylogenetic tree of LexA in cyanobacteria, multiple sequence alignment of the LexA amino acid sequences from 27 cyanobacterial genomes and the E. coli K12 genome were made using ClustalW implemented in MEGA [41] with default settings. The phylogenetic tree was then constructed using the neighbor-joining method with Poisson correction model in MEGA. E. coli LexA was placed as the outgroup of the tree. To construct the species tree, the DNA sequences of 16S ribosomal RNA genes from the 32 cyanobacteria and E. coli were aligned using ClustalW with manual adjustment by removing the unalignable regions. A neighbor-joining tree was then constructed with E. coli K12 being the outgroup using the Kimura 2-parameter model. Statistical significance at each node in the trees was evaluated using 500 bootstrap resamplings.

To construct the LexA protein tree across cyanobacteria, gram-positive bacteria, α-proteobacteria, δ-proteobacteria and γ-proteobacteria and some other bacteria strains/species (Figure S2 in Additional file 3), we first downloaded 598 sequenced microbial genome sequences from NCBI, and then identified LexA orthologs in them by the BDBH method described above. Multiple sequence alignments of these LexA sequences were made using ClustalW implemented in MEGA[41] with default settings. The phylogenetic tree was then constructed in the same way as the 27 LexA protein tree.

The phylogenetic tree (Figure S1 in Additional file 2) of LexA-binding sites in cyanobacteria, B. subtilis, α-proteobacteria and E. coli K12 was generated by the STAMP web tool [42] with the default alignment parameters: Pearson correlation coefficient for column comparison metric; ungapped Smith-Waterman for pair-wise alignment. The phylogenetic tree was constructed using the UPGMA method implemented in STAMP [42].

Phylogenetic footprinting and construction of LexA-binding sites in cyanobacteria

The previous study by Mazon et al. [16] characterized the LexA boxes associated with two genes: alr4908 (lexA) and all3272 (recA). Four putative LexA boxes were also identified in the promoter regions of alr3716 (uvrA), alr0088 (ssb), alr4905, and all4790 in Nostoc sp PCC7120 in that study. The orthologs (if they exist) of these six genes in PCC7120 were identified in the other 25 cyanobacterial genomes which harbor a lexA gene. We pooled the entire upstream inter-TU regions of these six genes in the target genome Nostoc sp. PCC7120 as well as those of the TUs containing at least one of the orthologs of these six genes in other cyanobacteria. If the length of the inter-TU region is longer than 800 bases, then only the immediate upstream 800 bases were extracted. Two motif finding programs, MEME [25,43] and BioProspector [26], were then applied to these pooled inter-TU regions to identify palindromic 14-mers as putative LexA-binding sites in these sequences according to previous studies [16]. MEME applies an expectation maximization method to fit a two-component finite mixture model and returns the identified motifs with an E-value[43], while BioProspector employs a Gibbs sampling strategy and estimates the significance of the identified motif by a Monte Carlo method [44]. These two programs were selected as they are widely used and often have complementary predictions [45,46]. MEME identified 45 putative LexA-binding sites with an overall E-value of 1.4e-026 for its most significant predicted motif, while BioProspector detected 39 putative LexA boxes in its most significant predicted motif (see Additional file 1 for details). High-scoring putative LexA-binding sites from either program were selected to build the LexA-binding sites profile (Table 1) in cyanobacteria. Sequence logos of binding sites were created using the Weblogo server [47].

Genome wide prediction of LexA-binding sites

We used the profile constructed above to scan the inter-TU regions of the genomes to predict all putative LexA-binding sites using a scanning algorithm that we previously developed [29-31]. This algorithm is briefly described as follows.

For each predicted TU U(g1,g2,...,gn) composed of genes g1,g2,...,gn in genome G, we extracted its upstream inter-TU regions and the first 40 bases of coding region(if its length is longer than 800 bases, then only the immediate upstream 800 bases were extracted), denoted as IU(g1,g2,...,gn). The set of all the IU(g1,g2,...,gn) in this genome is denoted as IU. To find the best matching substring in a sequence t in IU (t IU) when scanned by profile M, we use the following scoring function:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M5">View MathML</a>

(1)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M6">View MathML</a>

(2)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M7">View MathML</a>

(3)

where l is the length of the binding sites of profile M, h any substring of sequence t with length l (i.e. each l-mer of the sequence t), h(i) the base at the i-th position of h, p(i,b) the frequency of base b at position i in M, q(b) is the frequency of base b in the aggregated inter-TU regions for the organism, Ii is basically the information content or the relative entropy of the column [28,48] divided by a normalization factor a, a is the upper limit of the information content Ii for this column to keep Ii ∈ [0, 1], n the number of binding sites for constructing the profile M. To avoid zero value of the numerator p(i,b), a pseudo count 1 is added to the counts of the each base {A, C, G, T} in column i.

To show the derivation of the normalization factor a, we considered the extreme case: for a column i of profile M containing n binding sites, the more conserved the column is, the higher its information content Ii will be, and the maximum information content for column i occurs when this column is completely homogeneous. That is, all sequences have the same nucleotide, say, A at that position, and this nucleotide has the smallest background frequency, q(A), noted as q0. Thus, after adding one pseudocount to the counts of each of the four nucleotides to column i, the frequency of base A of column i in the profile will therefore be (n+1)/(n+4), and 1/(n+4) for the other three nucleotides. Then the upper limit a of the prenormalized Ii as shown by formula (3) can be derived as follows.

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M8">View MathML</a>

(4)

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M9">View MathML</a>.

Intuitively, we slide a window of length l across sequence t with the profile M, and return the substring h with the highest score defined by (1).

Since true regulatory binding sites are likely to be more conserved than other inter-TU sequences and thus tend to be shared by closely related orthologous genes. For each genome (considered as a target genome), we reward its putative binding sites appeared to be conserved in regions upstream from orthologous genes in other genomes. To do this, we assume a transcription unit U(g1,g2,...,gn) in the target genome G is composed of n genes. Gene gi (i = 1...n) has orthologs in mi genomes G1, G2, ..., <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M10">View MathML</a>, and ok (gi) is the upstream inter-TU sequence associated with the orthologous gene gi in genome Gk (for a graphic explanation, see Figure S5 in Additional file 2). Then the sM(t) score for the inter-TU sequence t upstream from U(g1,g2,...,gn) in genome G can be increased by a term Amax(gi):

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M11">View MathML</a>

(5)

where Amax(gi) is the value calculated for gene gi whose orthologs across other genomes have the maximum average of the product of two terms:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M12">View MathML</a>

(6)

where di,k is the Hamming distance between the sequence h detected by the profile M in sequence t and the corresponding sequence in, ok (gi) and l is the length of the binding sites in profile M.

Since the orthologs of genes of a transcription unit in one organism may not comprise a single transcription unit in another. For U(g1,g2,...,gn) in target genome G, orthologs of g1,g2,...,gn may be separated into different TUs in other genomes, therefore we evaluated the orthologous inter-TU sequences for each gene in (g1,g2,...,gn), and chose the gene gi whose orthologs across other genomes have the maximum average of the product of two terms indicated above. Then by combining formula (5) and (6), the refined score of the best putative binding site in sequence t can be defined as:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M13">View MathML</a>

(7)

Statistical significance of predicted binding sites

To evaluate the extent to which a putative binding site with a score s or higher can be found purely by chance, we randomly extracted coding sequence with the same length as IU(g1,g2,...,gn), denoted as CU(g1,g2,...,gn). All the CU(g1,g2,...,gn) extracted in genome G form the set CU. The score of an extracted sequence t (t ∈ CU) scanned by a profile M is also defined by formula (1). Note that each randomly chosen CU(g1,g2,...,gn) has nothing to do with U(g1,g2,...,gn). Therefore, when incorporating the additional score from reference genomes (formula (7)), the coding sequence ok (gi) is unlikely the coding sequence associated with the orthologous genes of g1,g2,...,gn in a reference genome Gk as it is a randomly chosen one. To avoid possible biased sampling of coding sequences for each IU(g1,g2,...,gn) in IU, we randomly extracted 300 coding sequences CU(g1,g2,...,gn) sharing the same length as IU(g1,g2,...,gn). These randomly chosen coding regions for all the U(g1,g2,...,gn) in genome G form a sequence set CU, then each sequence in the set CU was scanned using formula (7). Let S(IU) and S(CU) be the set of scores of binding sites found in IU and CU, respectively, and P(S(t) > s) be the cumulative probability of finding a binding site in a sequence t (t IU (or) t CU) with a score S(t) > s as defined by equation (7). Next, the false positive rate, <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M2">View MathML</a> can be used to evaluate the statistical significance of the motif score s of a inter-TU sequence. <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M2">View MathML</a> is actually the fraction of coding sequences bearing a putative binding site with a score higher than s in the coding sequences set CU in genome G. In other words, it describes the extent to which one can find a motif with a score higher than s by chance. Thus, it can be considered as an empirical p-value for a binding site score s. A cut-off score s corresponding to a p-value < 0.01 is used for the LexA-binding site and regulon prediction in each genome in this study.

To evaluate the confidence of our overall predictions in inter-TU regions in one genome, we used a log odds ratio (LOR) to compare the probability of finding a putative binding site in an inter-TU region with the probability of finding a putative binding site in a randomly extracted coding region by considering all the extracted IUs and CUs in a genome. We estimated the statistical significance of the predictions using the LOR function defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M14">View MathML</a>

(8)

The LOR function for a genome is the log-odds ratio of the fraction of the inter-TU sequences containing a binding site with a score higher than s to the fraction of the randomly selected coding sequences containing a binding site with a score higher than the same s in the genome. Accordingly, a monotonic increase in positive LOR with the increase in the motif score in a genome would suggest that this genome is likely to contain some high-scoring LexA-binding sites.

Analysis of the conservation of LexA regulons in cyanobacteria

We defined the conservation (cij) between two regulons Ri and Rj from genome i and j, respectively, as,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M15">View MathML</a>

(9)

Where |Ri Rj| is the number of orthologous genes shared by both regulons Ri and Rj. We took the reciprocal of <a onClick="popup('http://www.biomedcentral.com/1471-2164/11/527/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/11/527/mathml/M16">View MathML</a> as the distance dij between the two regulons. A neighbor joining tree (Figure 3) based on a distance matrix such defined was constructed using PHYLIP [49] and displayed by MEGA [41].

Abbreviations

TU: transcription unit; BDBH: bidirectional best hit; DBD: DNA-binding domain; HTH: helix-turn-helix; LOR: log-odds ratio; MBIC11017: Acaryochloris marina MBIC11017; ATCC29413: Anabaena variabilis ATCC 29413; A-prime: Synechococcus sp. JA-3-3Ab; B-prime: Synechococcus sp. JA-2-3B'a(2-13); PCC8801: Cyanothece sp. PCC 8801; PCC7421:Gloeobacter violaceus PCC7421; NIES843: Microcystis aeruginosa NIES-843; PCC73102: Nostoc punctiforme PCC 73102; PCC7120: Nostoc sp. PCC 7120; CCMP1375: Prochlorococcus marinus CCMP1375; MED4: Prochlorococcus marinus MED4; AS9601: Prochlorococcus marinus AS9601; MIT9211: Prochlorococcus marinus MIT 9211; MIT9215: Prochlorococcus marinus MIT 9215; MIT9301: Prochlorococcus marinus MIT 9301; MIT9312: Prochlorococcus marinus MIT 9312; MIT9303: Prochlorococcus marinus MIT9303; MIT9313: Prochlorococcus marinus MIT9313; MIT9515: Prochlorococcus marinus MIT 9515; NATL1A: Prochlorococcus marinus NATL1A; NATL2A: Prochlorococcus marinus NATL2A; PCC7942: Synechococcus elongatus PCC 7942; PCC6301: Synechococcus elongatus PCC 6301; RCC307: Synechococcus sp. RCC307; WH7803: Synechococcus sp. WH 7803; WH8102: Synechococcus sp. WH8102; CC9605: Synechococcus sp. CC9605; CC9902: Synechococcus sp. CC9902; CC9311: Synechococcus sp. CC9311; PCC7002: Synechococcus sp. PCC 7002; PCC6803: Synechocystis sp. PCC 6803; BP-1: Thermosynechococcus elongates BP-1; and IMS101: Trichodesmium erythraeum IMS101.

Authors' contributions

SL designed and conducted the experiment. MX helped conduct some analysis. ZS conceived the project. SL and ZS wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was partially supported by a grant from the University of North Carolina at Charlotte, a grant from CMC-UNCC Collaborative Research Fund and a grant (EF0849615) from NSF to ZS. We would like to thank Dr. Shaoqiang Zhang for his help with some data analysis.

References

  1. Little JW, Mount DW: The SOS regulatory system of Escherichia coli.

    Cell 1982, 29(1):11-22. PubMed Abstract | Publisher Full Text OpenURL

  2. Fernandez De Henestrosa AR, Ogi T, Aoyagi S, Chafin D, Hayes JJ, Ohmori H, Woodgate R: Identification of additional genes belonging to the LexA regulon in Escherichia coli.

    Mol Microbiol 2000, 35(6):1560-1572. PubMed Abstract | Publisher Full Text OpenURL

  3. Groban ES, Johnson MB, Banky P, Burnett PG, Calderon GL, Dwyer EC, Fuller SN, Gebre B, King LM, Sheren IN, et al.: Binding of the Bacillus subtilis LexA protein to the SOS operator.

    Nucleic Acids Res 2005, 33(19):6287-6295. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Wojciechowski MF, Peterson KR, Love PE: Regulation of the SOS response in Bacillus subtilis: evidence for a LexA repressor homolog.

    J Bacteriol 1991, 173(20):6489-6498. PubMed Abstract | PubMed Central Full Text OpenURL

  5. Mazon G, Erill I, Campoy S, Cortes P, Forano E, Barbe J: Reconstruction of the evolutionary history of the LexA-binding sequence.

    Microbiology 2004, 150(Pt 11):3783-3795. PubMed Abstract | Publisher Full Text OpenURL

  6. Erill I, Campoy S, Barbe J: Aeons of distress: an evolutionary perspective on the bacterial SOS response.

    FEMS Microbiol Rev 2007, 31(6):637-656. PubMed Abstract | Publisher Full Text OpenURL

  7. Horii T, Ogawa T, Nakatani T, Hase T, Matsubara H, Ogawa H: Regulation of SOS functions: purification of E. coli LexA protein and determination of its specific site cleaved by the RecA protein.

    Cell 1981, 27(3 Pt 2):515-522. PubMed Abstract | Publisher Full Text OpenURL

  8. Luo Y, Pfuetzner RA, Mosimann S, Paetzel M, Frey EA, Cherney M, Kim B, Little JW, Strynadka NC: Crystal structure of LexA: a conformational switch for regulation of self-cleavage.

    Cell 2001, 106(5):585-594. PubMed Abstract | Publisher Full Text OpenURL

  9. Slilaty SN, Little JW: Lysine-156 and serine-119 are required for LexA repressor cleavage: a possible mechanism.

    Proc Natl Acad Sci USA 1987, 84(12):3987-3991. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Michel B: After 30 years of study, the bacterial SOS response still surprises us.

    PLoS Biol 2005, 3(7):e255. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Walker GC: Mutagenesis and inducible responses to deoxyribonucleic acid damage in Escherichia coli.

    Microbiol Rev 1984, 48(1):60-93. PubMed Abstract | PubMed Central Full Text OpenURL

  12. Fogh RH, Ottleben G, Ruterjans H, Schnarr M, Boelens R, Kaptein R: Solution structure of the LexA repressor DNA binding domain determined by 1H NMR spectroscopy.

    Embo J 1994, 13(17):3936-3944. PubMed Abstract | PubMed Central Full Text OpenURL

  13. Knegtel RM, Fogh RH, Ottleben G, Ruterjans H, Dumoulin P, Schnarr M, Boelens R, Kaptein R: A model for the LexA repressor DNA complex.

    Proteins 1995, 21(3):226-236. PubMed Abstract | Publisher Full Text OpenURL

  14. Wertman KF, Mount DW: Nucleotide sequence binding specificity of the LexA repressor of Escherichia coli K-12.

    J Bacteriol 1985, 163(1):376-384. PubMed Abstract | PubMed Central Full Text OpenURL

  15. Sjoholm J, Oliveira P, Lindblad P: Transcription and regulation of the bidirectional hydrogenase in the cyanobacterium Nostoc sp. strain PCC 7120.

    Appl Environ Microbiol 2007, 73(17):5435-5446. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Mazon G, Lucena JM, Campoy S, Fernandez de Henestrosa AR, Candau P, Barbe J: LexA-binding sequences in Gram-positive and cyanobacteria are closely related.

    Mol Genet Genomics 2004, 271(1):40-49. PubMed Abstract | Publisher Full Text OpenURL

  17. Gutekunst K, Phunpruch S, Schwarz C, Schuchardt S, Schulz-Friedrich R, Appel J: LexA regulates the bidirectional hydrogenase in the cyanobacterium Synechocystis sp. PCC 6803 as a transcription activator.

    Mol Microbiol 2005, 58(3):810-823. PubMed Abstract | Publisher Full Text OpenURL

  18. Oliveira P, Lindblad P: LexA, a transcription regulator binding in the promoter region of the bidirectional hydrogenase in the cyanobacterium Synechocystis sp. PCC 6803.

    FEMS Microbiol Lett 2005, 251(1):59-66. PubMed Abstract | Publisher Full Text OpenURL

  19. Patterson-Fortin LM, Colvin KR, Owttrim GW: A LexA-related protein regulates redox-sensitive expression of the cyanobacterial RNA helicase, crhR.

    Nucleic Acids Res 2006, 34(12):3446-3454. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Patterson-Fortin LM, Owttrim GW: A Synechocystis LexA-orthologue binds direct repeats in target genes.

    FEBS Lett 2008, 582(16):2424-2430. PubMed Abstract | Publisher Full Text OpenURL

  21. Domain F, Houot L, Chauvat F, Cassier-Chauvat C: Function and regulation of the cyanobacterial genes lexA, recA and ruvB: LexA is critical to the survival of cells facing inorganic carbon starvation.

    Mol Microbiol 2004, 53(1):65-80. PubMed Abstract | Publisher Full Text OpenURL

  22. Sugita C, Ogata K, Shikata M, Jikuya H, Takano J, Furumichi M, Kanehisa M, Omata T, Sugiura M, Sugita M: Complete nucleotide sequence of the freshwater unicellular cyanobacterium Synechococcus elongatus PCC 6301 chromosome: gene content and organization.

    Photosynth Res 2007, 93(1-3):55-67. PubMed Abstract | Publisher Full Text OpenURL

  23. Gupta RS, Mathews DW: Signature proteins for the major clades of Cyanobacteria.

    BMC Evol Biol 10:24. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  24. Gupta RS: Protein signatures (molecular synapomorphies) that are distinctive characteristics of the major cyanobacterial clades.

    Int J Syst Evol Microbiol 2009, 59(Pt 10):2510-2526. PubMed Abstract | Publisher Full Text OpenURL

  25. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching.

    Nucleic Acids Res 2009, (37 Web Server):W202-208. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes.

    Pac Symp Biocomput 2001:127-138. OpenURL

  27. Erill I, Jara M, Salvador N, Escribano M, Campoy S, Barbe J: Differences in LexA regulon structure among Proteobacteria through in vivo assisted comparative genomics.

    Nucleic Acids Res 2004, 32(22):6617-6626. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Stormo GD: DNA binding sites: representation and discovery.

    Bioinformatics 2000, 16(1):16-23. PubMed Abstract | Publisher Full Text OpenURL

  29. Su Z, Olman V, Mao F, Xu Y: Comparative genomics analysis of NtcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis.

    Nucleic Acids Res 2005, 33(16):5156-5171. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. Xu M, Su Z: Computational prediction of cAMP receptor protein (CRP) binding sites in cyanobacterial genomes.

    BMC Genomics 2009, 10:23. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  31. Su Z, Olman V, Xu Y: Computational prediction of Pho regulons in cyanobacteria.

    BMC Genomics 2007, 8:156. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  32. Kelley WL: Lex marks the spot: the virulent side of SOS and a closer look at the LexA regulon.

    Mol Microbiol 2006, 62(5):1228-1238. PubMed Abstract | Publisher Full Text OpenURL

  33. Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, et al.: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation.

    Nucleic Acids Res 2008, (36 Database):D120-124. PubMed Abstract | PubMed Central Full Text OpenURL

  34. Cirz RT, Chin JK, Andes DR, de Crecy-Lagard V, Craig WA, Romesberg FE: Inhibition of mutation and combating the evolution of antibiotic resistance.

    PLoS Biol 2005, 3(6):e176. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. Butala M, Zgur-Bertok D, Busby SJ: The bacterial LexA transcriptional repressor.

    Cell Mol Life Sci 2009, 66(1):82-93. PubMed Abstract | Publisher Full Text OpenURL

  36. Bisognano C, Kelley WL, Estoppey T, Francois P, Schrenzel J, Li D, Lew DP, Hooper DC, Cheung AL, Vaudaux P: A recA-LexA-dependent pathway mediates ciprofloxacin-induced fibronectin binding in Staphylococcus aureus.

    J Biol Chem 2004, 279(10):9064-9071. PubMed Abstract | Publisher Full Text OpenURL

  37. Cirz RT, Jones MB, Gingles NA, Minogue TD, Jarrahi B, Peterson SN, Romesberg FE: Complete and SOS-mediated response of Staphylococcus aureus to the antibiotic ciprofloxacin.

    J Bacteriol 2007, 189(2):531-539. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  38. Sycheva LV, Permina EA, Gel'fand MS: Taxon-specific regulation of SOS-response in gamma-proteobacteria.

    Mol Biol (Mosk) 2007, 41(5):908-917. PubMed Abstract | Publisher Full Text OpenURL

  39. Dam P, Olman V, Harris K, Su Z, Xu Y: Operon prediction using both genome-specific and general genomic information.

    Nucleic Acids Res 2007, 35(1):288-298. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  40. Hulsen T, Huynen MA, de Vlieg J, Groenen PM: Benchmarking ortholog identification methods using functional genomics data.

    Genome Biol 2006, 7(4):R31. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  41. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0.

    Mol Biol Evol 2007, 24(8):1596-1599. PubMed Abstract | Publisher Full Text OpenURL

  42. Mahony S, Benos PV: STAMP: a web tool for exploring DNA-binding motif similarities.

    Nucleic Acids Res 2007, (35 Web Server):W253-258. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  43. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers.

    Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36. PubMed Abstract OpenURL

  44. Metropolis N, Ulam S: The Monte Carlo method.

    J Am Stat Assoc 1949, 44(247):335-341. PubMed Abstract | Publisher Full Text OpenURL

  45. Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms.

    Nucleic Acids Res 2005, 33(15):4899-4913. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  46. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites.

    Nat Biotechnol 2005, 23(1):137-144. PubMed Abstract | Publisher Full Text OpenURL

  47. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator.

    Genome Res 2004, 14(6):1188-1190. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  48. Stormo GD: Consensus patterns in DNA.

    Methods Enzymol 1990, 183:211-221. PubMed Abstract | Publisher Full Text OpenURL

  49. Felsenstein J: Phylogeny Inference Package (Version 3.2).

    Cladistics 1989, 5:164-166. OpenURL