The Cotton Microsatellite Database (CMD) http://www.cottonssr.org webcite is a curated and integrated web-based relational database providing centralized access to publicly available cotton microsatellites, an invaluable resource for basic and applied research in cotton breeding.
At present CMD contains publication, sequence, primer, mapping and homology data for nine major cotton microsatellite projects, collectively representing 5,484 microsatellites. In addition, CMD displays data for three of the microsatellite projects that have been screened against a panel of core germplasm. The standardized panel consists of 12 diverse genotypes including genetic standards, mapping parents, BAC donors, subgenome representatives, unique breeding lines, exotic introgression sources, and contemporary Upland cottons with significant acreage. A suite of online microsatellite data mining tools are accessible at CMD. These include an SSR server which identifies microsatellites, primers, open reading frames, and GC-content of uploaded sequences; BLAST and FASTA servers providing sequence similarity searches against the existing cotton SSR sequences and primers, a CAP3 server to assemble EST sequences into longer transcripts prior to mining for SSRs, and CMap, a viewer for comparing cotton SSR maps.
The collection of publicly available cotton SSR markers in a centralized, readily accessible and curated web-enabled database provides a more efficient utilization of microsatellite resources and will help accelerate basic and applied research in molecular breeding and genetic mapping in Gossypium spp.
Comprehensive structural, functional and comparative studies of any genome are increasingly dependent upon the availability of an anchored physical map which shows the order of all genetic components in correspondence to their chromosomal localization. Anchoring of phenotypic information (such as trait or QTL) onto the physical map requires its integration with the genetic map of a genome which represents the relative positions of the genes and/or markers on chromosomes.
The International Cotton Genome Initiative (ICGI) was launched to facilitate the development of a saturated and fully integrated genetic and physical map of cotton . A consensus linkage map is being developed by consolidating data generated by the cotton community using a common set of framework markers, such as microsatellites, or simple sequence repeats (SSRs) . The generation of a transportable framework of SSR markers capable of being mapped in any segregating population was one of the major objectives of the ICGI. In keeping with the proposed goals of the ICGI, the Cotton Microsatellite Database (CMD)  has been initiated and funded by Cotton Incorporated. An Advisory Committee comprising both academic and industry representatives was formed to guide the development of CMD and to coordinate it with CottonDB , the genome database serving the international cotton research community.
Microsatellites consist of 1–6 repeating base pairs that are tandemly arranged in genomes . While the number of repeats is highly polymorphic, the sequences flanking the repeats are highly conserved between individuals. The predominant mutation mechanism in microsatellite tracts is 'slipped strand mispairing', which generates the polymorphism with regard to the gain or loss of repeat motifs . Microsatellites are abundant and widely distributed throughout the genomes of many higher plants and animals [7,8] and have been used extensively as molecular markers in the development of saturated linkage and physical maps [9-12].
Microsatellite markers are PCR-based, bi-parentally inherited, co-dominant markers. Polymerase chain reaction (PCR) products of different lengths can be amplified using unique primer pairs flanking the variable repeat microsatellite region after cloning and sequencing one allele. To develop microsatellite markers, primer sequences conserved between individuals and complementary to the microsatellite flanking sequences are identified by computer programs and synthesized.
SSR loci tend to be both multiallelic and highly polymorphic for repeat number, which is easily scored and used for genotyping. SSRs are amenable to analysis on automated DNA sequencers, and can thus be adapted to high-throughput genotyping. SSRs are often markers of choice due to their abundance, co-dominance, reproducibility, and ease of use [7,8]. As microsatellites are generally not as amenable to inter-generic studies as some other types of markers they are generally synthesized for each genus or species.
The applications of microsatellites for plant breeders are numerous. They can be used for gene tagging and genome mapping, for selecting progeny before a desired phenotypic trait is expressed, for localizing qualitatively as well as quantitatively inherited traits, improving the efficacy of selective breeding (particularly for traits with low heritability or that can only be measured in one sex), genetic diversity studies, variety protection, gene and QTL analysis, pedigree analysis, and for introgressing novel genes into breeding germplasm from exotic germplasm .
Microsatellites included in the CMD have been generated from several research groups within the international cotton community who are actively involved with generating, screening and mapping cotton markers. To make significant and timely advances in the genetic improvement of cotton, thousands of portable microsatellite markers are needed for the tetraploid genome of cultivated cottons. As these markers need to be characterized systematically prior to application, a standardized panel of 12 diverse genotypes was selected for screening from cultivated and exotic cottons . This panel represents a balanced diversity of the core Gossypium germplasm (Table 1).
Table 1. A standardized panel of the cotton microsatellite marker database (CMD)
The major goals of the CMD are:
(1) to collect and integrate all the publicly available cotton microsatellite data in a centralized, curated, non-redundant online oracle database,
(2) to provide access to the CMD standardized panel screened data,
(3) to provide a set of comprehensive interface tools for rapid data retrieval,
(4) to provide a suite of stand-alone microsatellite data mining tools,
(5) to provide a communication portal for collaboration within the cotton research community.
Construction and content
Database and web interface development
Currently, the database is composed of 14 tables which store all the data for the microsatellite projects including information on project collaborators, SSR-containing clones, sequences, primers flanking the SSRs, repeat motif, open reading frame position, genetic markers and maps, standardized panel varieties, and data homology, and publications. In a separate but linked database within CMD, the CMap schema consists of 16 tables including information about genetically mapped cotton SSRs. Data for cotton SSR markers and genetic maps, as well as panel screened cotton microsatellites, are submitted by researchers and then curated for any potential errors prior to uploading to the database using scripts written in Perl version 5.8.2. Web interfaces for database query and the query result pages are also developed in Perl.
CMD microsatellite data projects
In cotton, the first SSR markers were developed at the Brookhaven National Laboratory (prefix "BNL"). The 379 BNL microsatellites presented through CMD were derived from G. hirsutum small insert genomic library enriched for (GA/CT)n and (CA/GT)n inserts . Later, the 309 JESPR  and 392 CIR  microsatellites were developed by streptavidin capture of 5'-biotinylated microsatellite-enriched libraries. The 53 CM microsatellites were developed using randomly sheared (nebulized) genomic DNA for adapter-ligation, rigorous removal of biotinylated oligos, and high-density colony blots for constructing enriched libraries . The 84 MGHES , 1169 MUSS/MUCS  and 1032 NAU [20,21] microsatellites were developed by screening public databases for EST-derived SSRs. The 750 TMB  (Yu et al., 2002) and 1316 MUSB  cotton microsatellites are BAC-derived.
The individual project pages contain access to all public data currently available for each microsatellite project, all of which have been approved by the project principal investigator. The standardized project information includes: a project summary abstract, investigator contact information, related publications, microsatellite information, including GenBank accession numbers, clone sequences, primer sequences, repeat motif, standardized panel screened data (if available), mapping data, and any homology with known proteins. Marker data, primers, microsatellite sequences and standardized panel screened data are available for download directly from each project page as well as an overall downloads page. Currently, CMD contains information on 5,484 annotated cotton microsatellites which can be viewed and downloaded. Annotation of the sequences is periodically updated so that our data reflects changes in protein records in the NCBI GenBank non-redundant protein database.
A microsatellite information page displays the sequence along with the repeat sequence and primers. The longest putative open reading frame (ORF) is also marked in color in the sequence along with the microsatellites. SSRs in the non-coding region tend to be more polymorphic and those in the coding region tend to be more transferable among species so the information of SSR position in a gene structure will be useful for marker development .
Currently, 3,452 of the cotton microsatellites available through CMD have been checked for internal redundancy. Any of the following criteria were considered as redundant: 1) identical GenBank accession number; 2) completely identical primer pairs; 3) identical forward primers; 4) identical reverse primers; 5) forward primer identical to reverse and vice versa. From this analysis, 3,135 (90.8%) of the microsatellites checked were considered to be unique and were noted accordingly in the database.
A standardized panel of Gossypium genotypes for systematic characterization of cotton microsatellite markers
Upon extensive discussion and consultation, a standardized panel of 12 Gossypium genotypes for cotton microsatellite database (CMD) was established [13,25]. This genotype panel represents a balanced diversity of the core Gossypium germplasm including cultivated and exotic cottons as shown in Table 1. Among the CMD standardized panel representative genotypes, TM-1 and 3–79 are the genetic standards for AD1 (G. hirsutum) and AD2 (G. barbadense) species, respectively. Because TM-1 and 3–79 are also parents of a permanent RIL mapping population, they are essential for the integrated genome mapping and selection of the core reference markers. Acala Maxxa is California Upland cotton from which a BAC library is also constructed. DPL 458BR, Paymaster 1218BR, Fibermax 832, and Stoneville 4892BR are Upland cotton representatives with a significant acreage across the Cotton Belt and beyond. These Upland selections represent the contemporary Upland cotton variability and extend the Upland cotton diversity. They are often used as the National Variety Test (NVT) standards that provide a database of agronomic performance for any agronomic comparisons. Pima S-6 is the source of G. barbadense Pima germplasm breeding programs. G. arboreum (A2) and G. raimondii (D5) are representatives of A and D subgenomes, respectively. G. tomentosum (AD3) and G. mustelinum (AD4) are possible sources of introgression breeding programs (Table 1).
For each of 12 cotton genotypes three to five individual plants are maintained in a USDA-ARS greenhouse in College Station, Texas . For each genotype only one single plant is flagged for tissue harvest and DNA extraction. The standardization of panel DNA stocks provides the best uniformity for cotton researchers with ongoing SSR marker development. Polymorphisms arising from easily assayed variation in SSR numbers show great utility in crop genetic mapping and other applications. With this standardized genotype panel, cotton SSR markers derived from different sources or groups can be evaluated in a systematic way to minimize the potential redundancy and to determine the markers' Polymorphic Information Content (PIC) values for ready applications. In addition to the information on the clones, sequences, primers, amplification conditions and fluorescent primer labels used, the amplified fragments sizes are currently available in CMD for the 375 BNL, 204 CIR, and 127 JESPR microsatellites screened against the standardized panel. The timetable for the inclusion of further panel screened data is also available through the CMD.
Genetically anchored mapping data
A genetically anchored physical map for cotton is being developed using cotton BAC libraries . Through various genetic markers, including SSRs, the cotton physical map will be anchored on the future consensus cotton genetic map . CMD stores and presents currently available data for major cotton genetic maps with mapped SSRs that were constructed for different crosses. Currently, CMD contains data for four genetic maps: 1 - BC1: ((Guazuncho2 (G. hirsutum) × VH8-4602 (G. barbadense)) × Guazuncho2) [16,26]; 2 - F2: G. hirsutum race 'Palmeri" × G. barbadense Acc. "K101" ; 3 - BC1: (TM-1(G. hirsutum) × Hai7124 (G. barbadense)) × TM-1) [20,21]; 4 - RIL: TM-1 (G. hirsutum) × 3–79 (G. barbadense) . The anchored genetic markers can be viewed in several formats, including an excel spreadsheet, a database search interface, and a graphical interface for comparative visualization of SSR maps. Additional data on the newly mapped SSRs will be available soon.
Utility and discussion
The CMD website is composed of general information pages (Figure 1A), including CMD tutorials, project pages (Figure 1B), database query/browse interfaces and other tools such as a comparative map viewer CMap, sequence similarity server, SSR server, and CAP3 server. The CMD web pages are organized such that users can easily access the data of interest regardless of the navigation starting point. For example, the microsatellite project pages (Figure 1B) have links to the CMD standardized panel pages, marker detail pages, sequence files in FASTA format, a downloads page, sequence similarity server, or abstracts for the related publications. Similarly, the CMD standardized panel (Table 1) details page has links to the SSR project detail page. A general CMD tool bar is also included in each page to aid the ease of navigation through the site.
Figure 1. CMD home page and the representative search pages. This figure illustrates the CMD homepage (A) and the representative search pages; microsatellite projects (B), marker search and view, homology search (C-F).
Database search interface
The initial SSR search result page displays SSR identifiers. The individual SSR entry links to a page where details of the SSR are displayed (Figure 1C) with links to the corresponding project page, the top protein homolog identified through a sequence similarity search (Figure 1D), microsatellite sequence in GenBank, and related publications. Markers can be searched by marker name, chromosome, or cross (Figure 1E,F). Other pages include a mailing group list form, so users can exchange information and be kept up to date on new developments in CMD. The message boards automatically list all the information exchanged by the mailing list. The links page contains appropriate cotton links.
Graphical interface to maps
CMD also provides a graphical tool CMap in which the cotton genetic maps (Figure 2A) are displayed with the number of anchored SSR markers (Figure 2B), and the location of mapped SSRs is compared between different crosses of cotton. CMap is part of the Generic Model Organism Database . CMap allows the user to select the map of interest and the maps for comparisons. The feature search looks for a certain feature by name or accession ID, species, and feature type. The CMap correspondence matrix allows users to view the number of correspondences among all selected maps.
Figure 2. Cotton CMap Viewer. A. The opening page of CMap Viewer with 4 major genetic maps of cotton. B. An example of individual linkage group pages with the associated markers.
The CMD tools page provides access to an SSR server, a CAP3 Assembly server, and a sequence similarity server that includes BLAST and FASTA search tools.
SSR analysis is performed using a modified version (SSR) of a Perl script SSRIT  with parameters set to detect mono- to hexanucleotides of user specified length. To examine the location of SSRs in the sequences in relation to the putative coding region, the SSR server uses the FLIP  program which is available through the Organelle Genome Megasequencing Project . FLIP is a UNIX C program that finds/translates ORFs (open reading frames) in sequences. Using the FLIP output, the longest ORF is identified and the relative SSR location is reported. Potential primers are identified using Primer3 .
Using the SSR server, users can upload a batch of sequences in FASTA format and select the motif type and repeat length to search. After job completion, users are redirected by email to a web page providing 1) a summary report of the SSR analysis, 2) a library file of the uploaded sequences, 3) a library file of the SSR containing sequences, and 4) an excel file of the individual properties of the SSR-containing clones. The individual properties include sequence name, length of the SSR-containing sequence, repeat(s) motif and number, SSR start/stop position, ORF start/stop position, primer pairs, SSR location relative to the ORF, and GC content of the sequence.
To reduce the inherent redundancy and increase transcript length ESTs are routinely assembled into longer consensus sequences, also known as contigs. We have implemented the contig assembly program CAP3  as an online server to allow users to assemble ESTs prior to mining the consensus sequences for microsatellites using the CMD SSR server. Users can upload quality files for their sequences and specify the percentage identity in the overlap region (p value). While the quality and quantity of EST data varies greatly for each species, we have found that using a high level of stringency (p = 90 or 95) tends to prevent over assembly and helps distinguish between gene family members. Assembling ESTs that come from the same transcript is a common method of creating a putative unigene for an organism. As more ESTs are sequenced and added to the public domain, the cotton unigene can be continually refined using the CAP3 server and mined for SSRs using the SSR server.
Sequence similarity servers
The online BLAST and FASTA sequence similarity search servers allow users to perform homology searches between their sequences of interest and the annotated SSR sequences and primers in CMD. From the web interface, researchers can upload a file of sequences, select the search algorithm (e.g. BLAST, FASTA), the database (SSR sequences, SSR primers) and submit their job for processing. Once the job has completed an email is sent with a URL providing secure access to the results of the search. From the URL, users retrieve a summary of the search with the number of sequences that had matches with the database selected, an excel file containing the best match, any known function, match organism, match length, percent identity, expectation value, alignment length, and start and stop alignment positions. Our sequence similarity server, specifically designed for CMD researchers, will help users compare new sequences and primers against existing microsatellites and help decrease redundancy of effort in developing new markers. As we migrate the sequence similarity servers to a computational cluster, we plan to add the following databases: NCBI cotton ESTs, TIGR cotton gene indices, NCBI cotton genomic sequences and NCBI cotton protein sequences.
Future development will focus on the establishment of a standard nomenclature of cotton SSRs, adding new microsatellite data, improving the tools and functionality of the web interface, such as an advanced search site with options for search/display categories, full sequence processing facilities for cotton researchers, and a quarterly newsletter for the cotton community. The annotation of the SSRs with known homology will include further classification using the gene ontology terms associated with the matching sequences in the Swissprot database. When the physical map is available, users also will be able to retrieve the anchored BAC clones containing the SSRs of interest through the anchored BACs page in the map viewer. Data that are currently scheduled to be added in the near future include 800 BAC-derived cotton genomic SSRs and 500 cotton EST-SSRs from the public domain, and 200 SSRs from private companies.
The CMD has been initiated to provide researchers, engaged worldwide in cotton research, with centralized access to microsatellite markers, an invaluable resource for basic and applied research in cotton breeding. As such, the CMD serves the cotton community as a major repository of the publicly available cotton microsatellite data and a unique repository for the CMD standardized panel screened data, a key tool for systematic characterization of the SSR markers developed for cotton. Access to this data is provided through integrated web tools which allow users to directly access individual or combined project data via search interfaces which provide download and visualization of microsatellites, their flanking primers, open reading frames (ORFs), and SSR genetic maps. CMD also provides a suite of online tools for data analysis of new and existing microsatellites through its SSR, CAP3, and FASTA/BLAST servers. Overall, the CMD serves as a major resource for the international cotton community, and can be viewed as an important vehicle toward increased collaboration among cotton scientists.
Availability and requirements
BAC – Bacterial Artificial Chromosome
BC1 – Backcross 1st generation
BLAST – Basic Local Alignment Search Tool
CAP3 – Contig Assembly Program
CIRAD – Centre International en Recherche Agronomique pour le Développement
CIR – CIRad
CM – Cotton Microsatellites
CMap – Comparative Map
EST – Expressed Sequence Tag
EXP – Expectation Value
FASTA – Fast All alignment search tool
FLIP – FLexible In-system Programmer
JESPR – Jenkins, El-Zik, Saha, Pepper, Reddy microsatellite repeats
MGHES – Mississippi Gossypium hirsutum EST-SSR
MUCS – Microsatellite Ulloa Complex Sequence repeats
MUSB – Microsatellite Ulloa Simple BAC repeats
MUSS – Microsatellite Ulloa Simple Sequence repeats
NAU – Nanjing Agricultural University
QTL – Quantitative Trait Loci
RIL – Recombinant Inbred Line
TMB – TM-1 genetic standard BAC/BIBAC libraries microsatellite repeats
AB participated in the database and interface design, performed general CMD data collection, organization and curation. CJ, SM, PY participated in the database and interface design and construction, and developed scripts for database upload and sequence processing. SJ implemented CMap, uploaded mapping data and was involved with the database schema design and implementation, SF managed the oracle system and was involved in database schema design and tool development, MS performed all homology searches and was involved with database design and construction, RE helped design and implement the SSR, FASTA, BLAST and CAP3 servers, JS and BS screened the cotton SSRs against the CMD standardized panel. MP participated in the development of the MUSB microsatellite project and participated in the database and interface design. JML developed and provided data for the CIR project and participated in the CMD general data collection and organization. JY established, maintained, and distributed the 12-genotype CMD panel DNA stocks as well as developed and provided data for the TMB project and participated in the general coordination of the project. MU developed and provided data for the MUSS/MUCS and MUSB projects, participated in collection and organization of mapped SSR data presented in the CMD. SS was the lead scientist in developing MGHES and provided data for the MGHES and JESPR projects and was involved in the general coordination of the project. BB and SL developed and provided data for the BNL project, critically reviewed and contributed to database and manuscript development. TZ developed and provided data for the NAU project, critically reviewed and contributed to database and manuscript development. AP participated in the development of the JESPR project, critically reviewed and contributed to database and manuscript development. DF performed redundancy analysis of the CMD microsatellites and was involved in the general coordination of the project. JT participated in the development of the MUSB microsatellite project. SK and JJ served on the CMD Advisory Board, critically reviewed and contributed to database and manuscript development. RC conceived and performed general coordination of the project. DM supervised the project and was involved with all aspects of database design, construction and implementation. All authors read and approved the final manuscript.
We acknowledge with thanks, Cotton Incorporated for funding this database, Clemson University and Washington State University for providing salary support and hardware infrastructure, and the cotton research community for being so willing to share data and provide feedback for this database. We acknowledge the help of Dr. R. Kantety, Alabama A&M University, in MGHES work, and Dr. J. Frelichowski at USDA-ARS, Shafter, CA, for his contribution in the MUSB project.
J Hered 1997, 88:335-342. PubMed Abstract
Kumpatla SP: Simple sequence repeats: abundance, marker development and applications in plant genetics. In Recent Research Developments in Plant Molecular Biology. Volume 2. Trviandrum, India: Research Signpost; 2005::83-105.
Aranzana MJ, Pineda A, Cosson P, Dirlewanger E, Ascasibar J, Cipriani G, Ryder CD, Testolin R, Abbott A, King GJ, Iezzoni AF, Arus P: A set of simple-sequence repeat (SSR) markers covering the Prunus genome.
Cone KC, McMullen MD, Bi IV, Davis GL, Yim YS, Gardiner JM, Polacco ML, Sanchez-Villeda H, Fang Z, Schroeder SG, Havermann SA, Bowers JE, Paterson AH, Soderlund CA, Engler FW, Wing RA, Coe EH Jr: Genetic, physical, and informatics resources for maize. On the road to an integrated map.
Rong J, Abbey C, Bowers JE, Brubaker CL, Chang C, Chee PW, Delmonte TA, Ding X, Garza JJ, Marler BS, Park CH, Pierce GJ, Rainey KM, Rastogi VK, Schulze SR, Trolinder NL, Wendel JF, Wilkins TA, Williams-Coplin TD, Wing RA, Wright RJ, Zhao X, Zhu L, Paterson AH: A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (Gossypium).
Plant Mol Biol Rep 1998, 16:341-349. Publisher Full Text
Park YH, Alabady MS, Ulloa M, Sickler B, Wilkins TA, Yu JZ, Stelly DM, Kohel RJ, El-Shihy OM, Cantrell RG: Genetic mapping of new cotton fiber loci using EST-derived microsatellites in an interspecific recombinant inbred line cotton population.
Mol Genet Genom 2005, 274:428-441. Publisher Full Text
Mol Genet Genom 2004, 272:308-327. Publisher Full Text
Mol Genet Genom 2006, 275:479-491. Publisher Full Text
Lacape JM, Nguyen TB, Courtois B, Belot JL, Giband M, Gourlot JP, Gawryziak G, Roques S, Hau B: QTL analysis of cotton fiber quality using multiple Gossypium hirsutum × Gossypium barbadense backcross generations.
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S: Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential.
Brossard N: FLIP: a Unix Program used to find/translate orfs. [http://www.bch.umontreal.ca/ogmp/manlinks/flip.txt] webcite
Organelle Genome Megasequencing Project (OGMP), Biochemistry Department, University of Montreal [http://www.bch.umontreal.ca/ogmp/manlinks/flip.txt] webcite
Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. In Bioinformatics Methods and Protocols: Methods in Molecular Biology. Edited by Krawetz S, Misener S. Totowa, NJ: Humana Press; 2000::365-386.