Open Access Research article

Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana

Haining Lin1,2, Gaurav Moghe1, Shu Ouyang3,5, Amy Iezzoni4, Shin-Han Shiu1, Xun Gu2* and C Robin Buell1*

Author Affiliations

1 Department of Plant Biology, Michigan State University, 166 Plant Biology Building, East Lansing, MI 48824, USA

2 Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA 50011, USA

3 J. Craig Venter Institute, 9712 Medical Center Drive, Rockville, MD 20850, USA

4 Department of Horticulture, Michigan State University, A342 Plant and Soil Science Building, East Lansing, MI 48824, USA

5 Current address: Suite 205, 1003 7th Street, Frederick, MD 21701, USA

BMC Evolutionary Biology 2010, 10:41 doi:10.1186/1471-2148-10-41

Published: 12 February 2010

Abstract

Background

The availability of genome and transcriptome sequences for a number of species permits the identification and characterization of conserved as well as divergent genes, such as lineage-specific genes, which have no detectable sequence similarity to genes from other lineages. While genes conserved among taxa provide insight into the core processes shared among species, lineage-specific genes provide insights into evolutionary processes and biological functions that are likely clade- or species-specific.

Results

Comparative analyses using the Arabidopsis thaliana genome and sequences from 178 other species within the Plant Kingdom enabled the identification of 24,624 A. thaliana genes (91.7%) that were termed Evolutionary Conserved (EC), as defined by sequence similarity to a database entry, as well as two sets of lineage-specific genes within A. thaliana. One set of A. thaliana lineage-specific genes shares sequence similarity only with sequences from species within the Brassicaceae family; these are termed Conserved Brassicaceae-Specific Genes (CBSGs; 914, 3.4%). The other set, the Arabidopsis Lineage-Specific Genes (ALSGs; 1,324, 4.9%), lacks sequence similarity to any sequence outside A. thaliana. While many CBSGs (76.7%) and ALSGs (52.9%) are transcribed, the majority of the CBSGs (76.1%) and ALSGs (94.4%) have no annotated function. Co-expression analysis indicated significant enrichment of the CBSGs and ALSGs in multiple functional categories, suggesting their involvement in a wide range of biological functions. Subcellular localization prediction revealed that the CBSGs were significantly enriched in proteins targeted to the secretory pathway (412, 45.1%). Among the 107 putatively secreted CBSGs with known functions, 67 encode a putative pollen coat protein or cysteine-rich protein with sequence similarity to the S-locus cysteine-rich protein, the pollen determinant that controls allele-specific pollen rejection in self-incompatible Brassicaceae species. Overall, the ALSGs and CBSGs were more highly methylated in floral tissue than the ECs. Single Nucleotide Polymorphism (SNP) analysis showed an elevated ratio of non-synonymous to synonymous SNPs within the ALSGs (1.99) and CBSGs (1.65) relative to the EC set (0.92), caused mainly by an elevated number of non-synonymous SNPs, indicating that these genes are fast-evolving at the protein sequence level.
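The three-way partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function `classify_gene`, the `GeneClass` enum, and the idea of summarizing each gene's similarity hits as a set of plant family names are hypothetical, and the similarity search and its thresholds are not specified in the abstract. The counts are the ones reported above, and the helper simply recomputes the quoted percentages from them.

```python
from enum import Enum


class GeneClass(Enum):
    EC = "Evolutionary Conserved"
    CBSG = "Conserved Brassicaceae-Specific Gene"
    ALSG = "Arabidopsis Lineage-Specific Gene"


def classify_gene(hit_families):
    """Assign a gene to one of the three sets.

    hit_families is a hypothetical summary of a similarity search: the set
    of plant families (excluding A. thaliana itself) in which the gene has
    a detectable match. How matches are detected is an assumption left
    outside this sketch.
    """
    if not hit_families:
        # No similarity to any sequence outside A. thaliana.
        return GeneClass.ALSG
    if hit_families <= {"Brassicaceae"}:
        # Similarity only to sequences from Brassicaceae species.
        return GeneClass.CBSG
    # Similarity to at least one sequence outside the Brassicaceae.
    return GeneClass.EC


# Gene counts as reported in the abstract.
REPORTED_COUNTS = {
    GeneClass.EC: 24624,
    GeneClass.CBSG: 914,
    GeneClass.ALSG: 1324,
}


def percentages(counts):
    """Return each set's share of all classified genes, to one decimal."""
    total = sum(counts.values())
    return {cls: round(100 * n / total, 1) for cls, n in counts.items()}


if __name__ == "__main__":
    for cls, pct in percentages(REPORTED_COUNTS).items():
        print(f"{cls.name}: {REPORTED_COUNTS[cls]} genes ({pct}%)")
```

The three reported counts sum to 26,862 genes, and the shares recomputed from that total match the percentages quoted in the abstract.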

Conclusions

Our analyses suggest that while a significant fraction of the A. thaliana proteome is conserved within the Plant Kingdom, evolutionarily distinct sets of genes that may function in defining biological processes unique to these lineages have arisen within the Brassicaceae and A. thaliana.