The increasing use of DNA microarrays for genetical genomics studies generates a need for platforms with complete coverage of the genome. We have compared the effective gene coverage in the mouse genome of different commercial and noncommercial oligonucleotide microarray platforms by performing an in-house gene annotation of probes. We only used information about probes that is available from vendors and followed a process that any researcher may take to find the gene targeted by a given probe. In order to make consistent comparisons between platforms, probes in each microarray were annotated with an Entrez Gene id and the chromosomal position for each gene was obtained from the UCSC Genome Browser Database. Gene coverage was estimated as the percentage of Entrez Genes with a unique position in the UCSC Genome database that is tested by a given microarray platform.
A MySQL relational database was created to store the mapping information for 25,416 mouse genes and for the probes in five microarray platforms (gene coverage level in parentheses): Affymetrix430 2.0 (75.6%), ABI Genome Survey (81.24%), Agilent (79.33%), Codelink (78.09%), Sentrix (90.47%); and four array-ready oligosets: Sigma (47.95%), Operon v.3 (69.89%), Operon v.4 (84.03%), and MEEBO (84.03%). The differences in coverage between platforms were highly conserved across chromosomes. Differences in the number of redundant and unspecific probes were also found among arrays. The database can be queried to compare specific genomic regions using a web interface. The software used to create, update and query the database is freely available as a toolbox named ArrayGene.
The software developed here allows researchers to create updated custom databases by using public or proprietary information on genes for any organism. ArrayGene allows easy comparisons of gene coverage between microarray platforms for any region of the genome. The comparison presented here reveals that the commercial microarray Sentrix, which is based on the MEEBO public oligoset, showed the best mouse genome coverage currently available. We also suggest the creation of guidelines to standardize the minimum set of information that vendors should provide to allow researchers to accurately evaluate the advantages and disadvantages of using a given platform.
The wide use of DNA microarrays to query expression of genes has created the need for updated, consistent and meaningful annotations on the probes included in the microarrays. We refer to gene annotation as a recognizable label or gene id identifying the gene that is targeted by a given probe. Gene ids should be stable, widely used and allow reliable associations among genomic databases. Several microarray annotation systems are available for investigators, aiming to address specific user demands. For instance, the KARMA web server provides periodically updated gene annotations of Keck arrays and Affymetrix® GeneChips®, and can also annotate user-provided lists of accession numbers for pair-wise comparisons, even for different species. However, providing a gene list is not always a straightforward process, given the large heterogeneity in the formats in which vendors provide sequence identifiers for probes. For instance, one platform can include identifiers in a GenBank header format such as GB|AY073000.1|AAL60663.1, and others may include different types of ids separated by commas or some other character within a single column. In addition, vendors may provide different sequence identifiers in several columns, and choosing only one of them may not be the best solution. The Resourcerer database tackles this problem by pre-computing gene annotations on a more exhaustive list of microarrays and oligosets for a number of species. This database is centered on 'tentative consensus' (TC) sequences, which are used as gene definitions. TCs group EST sequences that can be aligned and clustered in distinct groups, and these are periodically updated as new ESTs from GenBank become available. Functional annotations on these TCs generate the Gene Indices resource available from the TGI website. TCs allow for cross-species comparisons through the Tentative Orthologue Groups (TOGs) database.
However, Gene Indices are not stable and cross-referencing to other genomic databases is not easy. A different approach has been taken by Mattes, who created a set of Perl scripts that use UniGene and LocusLink as gene identifiers, providing a more universal gene definition that can be cross-referenced with other databases. Unfortunately, the recent shift of NCBI from LocusLink to the Entrez Gene database format has limited the functionality of these scripts and rendered them obsolete. The DRAGON database [10,11] and the DAVID software provide web-based services of gene annotation with similar objectives. None of them, however, allows for chromosome or genomic-region specific comparisons of gene coverage.
The objective of this study was to compare gene coverage from currently available whole mouse genome microarrays for any region of the genome. We only used the information about probes provided to researchers by vendors before the purchase of a microarray for the purpose of choosing the platform that best fits their needs. We have developed a platform for microarray annotation that not only provides gene annotations for probes but also genomic positions for tested genes in the mouse genome. Coverage comparisons can be obtained for any genomic region of the last available mouse assembly build. The level of coverage of five whole mouse genome microarrays and four oligosets was compared in the present study. Microarrays and oligosets will be referenced here by the short name provided in Table 3. The results were stored in a relational database that can readily be queried for coverage comparisons based on genome position. Figure 1 shows a flowchart diagram for the databases and methods used for the annotation system and for querying gene coverage comparisons.
Figure 1. Flowchart diagram for the construction and query of the ArrayGene and Aligndb databases. Perl scripts are used to parse input files from other databases and upload the processed information to the ArrayGene and Aligndb databases. The ma_compare CGI (Common Gateway Interface) script written in Perl is used to process queries through the web and produce online reports for gene coverage.
Number of genes in the genome
The total number of genes in the genome was defined as the number of Entrez Genes with a unique genomic position in the UCSC Genome Browser Database (see methods). A total of 198,155 mapped sequences from the Known Gene, RefGene, and mRNA tracks were associated with Entrez Genes. A total of 521 genes could not be used because they are located in unordered scaffolds in Build 35.1 of the mouse genome assembly. Alignments to multiple positions in the genome were found for a total of 1,766 sequences. For example, the M10062 cDNA aligns with chromosomes 1, 2, 3, 4, 6, 10, 11, 13, 15, 17, 19, X, and Un_random (contigs not assigned to a chromosome). This cDNA is identified as the Iap gene in the Entrez Gene database. It is a retrotransposon that can be found in several chromosomes and does not have a unique position. Genes like this, and other less extreme cases, cannot be considered in region-specific coverage comparisons and were therefore discarded. Table 2 shows the number of genes that could be assigned to a specific position in the genome in each of the source files. The file Mm.gb_cid_lid in Table 1 was used to incorporate associations between sequences and Unigene ids in the genexref table. Although we do not use Unigene annotations of probes to identify the targeted genes (see discussion section for explanation), this allowed us to associate Entrez Gene ids with EST accession numbers, which are commonly used in microarray gene lists. A total of 25,416 genes were found to have a unique position in the August 2005 mouse genome assembly (Build 35.1). The distribution of these genes in the genome is shown in Figure 2.
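The filtering rule described above can be illustrated with a minimal Python sketch. It is not the ArrayGene implementation (which the paper describes as Perl scripts feeding a MySQL database); the record layout, the gene ids, and the function name `unique_position_genes` are hypothetical. As a simplifying assumption, a locus is identified by its exact coordinates, whereas a real pipeline would likely need to merge overlapping alignments of the same gene coming from different tracks.

```python
from collections import defaultdict

# Hypothetical alignment records: (entrez_gene_id, chromosome, start, end).
alignments = [
    (101, "chr1", 1000, 5000),
    (101, "chr1", 1000, 5000),        # same gene, same locus, seen in a second track
    (202, "chr2", 200, 900),
    (202, "chr6", 4000, 4700),        # second locus: gene maps to multiple positions
    (303, "chrUn_random", 10, 500),   # unordered scaffold
]

def unique_position_genes(records):
    """Return {gene_id: (chrom, start, end)} for genes with exactly one ordered locus."""
    loci = defaultdict(set)
    for gene_id, chrom, start, end in records:
        loci[gene_id].add((chrom, start, end))

    kept = {}
    for gene_id, positions in loci.items():
        if len(positions) != 1:
            continue                  # discard genes mapping to multiple positions
        (locus,) = positions
        if locus[0].endswith("_random"):
            continue                  # discard genes on unordered scaffolds
        kept[gene_id] = locus
    return kept

genemap = unique_position_genes(alignments)
print(genemap)
```

Only gene 101 survives in this toy input: gene 202 is dropped for mapping to two chromosomes, and gene 303 for lying on an unordered scaffold.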
Table 1. Files used to create the local database and URLs of sources of data.
Table 2. Gene tracks from the UCSC Genome Browser Database used for finding genomic locations of Entrez Genes and to create the genemap table to store mapping information. Tracks were used hierarchically in the order shown here from top to bottom. Genes in track are the number of distinct Entrez genes in the Sequence Track. Genes in Multiple positions cannot be mapped to a unique position. Genes in Random scaffolds map to unordered scaffolds. Used Genes refers to the number of non-redundant genes with a unique position in the genome that were imported to the genemap table.
Figure 2. Distribution of Entrez Genes per chromosome from the UCSC Genome Browser Database. The fraction of genes mapping to unordered scaffolds is shown in black. Mapped genes are in gray. The last bar represents genes mapping to scaffolds that could not be mapped to any chromosome in the genome.
Microarray probe annotations
The Resourcerer database provides gene annotations for most of the available microarrays and oligosets for mouse and other organisms. However, database updates are done at four-month intervals and, in our experience, gene annotations change periodically and many Entrez Gene ids in the Resourcerer annotations are obsolete. Microarray vendors also provide gene annotations on their probes, but they vary greatly in the number and quality of the annotations. For instance, ABI is the platform with the largest number of probes annotated with a gene id by the vendor (Table 4). However, these are Celera Genomics® gene identifications and cannot be directly matched to public domain genes. Therefore, we opted to perform our own gene annotation of probes using the most up-to-date information at hand (see methods). Our genexref table, in the ArrayGene database, stored cross-reference information between 765,289 sequence identifiers and 63,175 Entrez Genes. The different kinds of sequence identifiers included in the database are shown in Figure 3. Performing in-house annotations allowed us to discard any probe that could be associated with more than one gene. Probe annotation files, referred to as Genelists, were obtained directly from vendors' websites; they identify the sequences from which oligonucleotide probes were designed. The format and amount of information provided in these lists varied greatly, Affymetrix being the most comprehensive in the number of different annotations. The efficiency of the gene annotation process varied between platforms depending on the number of sequence identifiers provided by vendors and the level of specificity of the information provided (Figure 4). Specificity is defined here as the number of associations that can be inferred between a probe and Entrez Genes from all the probe annotations provided by the vendor.
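As an illustration of the annotation step, the following Python sketch splits a vendor annotation field into sequence identifiers and classifies the probe by how many Entrez Genes those identifiers point to. It is a sketch under stated assumptions, not the ArrayGene code: the contents of the `genexref` dictionary, the gene ids, the list of header prefixes, and the function names are all hypothetical, and the category for genes of unknown genomic position is omitted for brevity.

```python
import re

# Hypothetical cross-reference table: sequence identifier -> set of Entrez Gene ids.
# The gene ids below are made up for illustration.
genexref = {
    "AY073000": {"11111"},
    "AAL60663": {"11111"},
    "NM_000001": {"22222"},
    "BC000002": {"22222", "33333"},
}

# Database prefixes that may appear in GenBank-style headers (assumed list).
HEADER_TAGS = {"GB", "REF", "EMB", "DBJ"}

def split_identifiers(field):
    """Split a vendor field on '|', ',', ';' or whitespace and drop version suffixes."""
    ids = []
    for token in re.split(r"[|,;\s]+", field.strip()):
        if not token or token.upper() in HEADER_TAGS:
            continue
        ids.append(re.sub(r"\.\d+$", "", token))  # AY073000.1 -> AY073000
    return ids

def classify_probe(field):
    """Return (category, entrez_gene_id or None) for one probe annotation field."""
    ids = split_identifiers(field)
    if not ids:
        return ("no_sequence_id", None)
    genes = set()
    for seq_id in ids:
        genes |= genexref.get(seq_id, set())
    if not genes:
        return ("unknown_sequence_id", None)
    if len(genes) > 1:
        return ("unspecific", None)               # more than one gene: probe discarded
    return ("annotated", next(iter(genes)))

print(classify_probe("GB|AY073000.1|AAL60663.1"))
print(classify_probe("NM_000001, BC000002"))
```

Pooling every identifier supplied for a probe, rather than picking a single column, mirrors the definition of specificity given above: a probe is kept only when all of its identifiers agree on one gene.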
The gene annotations from the ArrayGene system include neither probes that could be associated with more than one gene nor genes with an uncertain position in the genome. This approach created a more conservative set of gene annotations than those included in the Resourcerer database. In most of the platforms we could not match the number of probes annotated with gene ids by the vendor (Entrez Gene ids, gene symbols, UniGene ids, etc.), given our stringent criteria to select a unique gene identifier ("Unknown seq id" in Figure 4). The level of redundancy, i.e. the number of probes hybridizing to the same gene, also varied between platforms (Figure 5). The least redundant was the Sigma platform (1.08 probes per gene on average), though it also has the fewest probes. However, the newer ABI platform has a comparable level of redundancy (1.17 probes per gene). The most redundant platform is the Affy array, with an average of 1.98 probe sets per gene.
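The redundancy measure used here, the average number of probes per tested gene, can be computed as in the short Python sketch below. The probe and gene ids are invented for illustration and the function name `probes_per_gene` is hypothetical.

```python
from collections import Counter

# Hypothetical gene-specific annotations: probe id -> Entrez Gene id.
probe_to_gene = {
    "p1": "g1", "p2": "g1",
    "p3": "g2",
    "p4": "g3", "p5": "g3", "p6": "g3",
}

def probes_per_gene(annotations):
    """Average number of annotated probes per distinct gene tested."""
    counts = Counter(annotations.values())
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)

print(probes_per_gene(probe_to_gene))  # 6 probes over 3 genes
```

Only probes that received a single-gene annotation enter the calculation, so the value reflects redundancy among usable probes rather than among all probes on the array.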
Table 3. Mouse oligonucleotide microarray and oligoset platforms included in this study.
Table 4. Summary of gene probe annotations for whole genome mouse microarrays and oligosets. The vendor annotated percent represents the fraction of probes that have a gene id provided by the vendor. ArrayGene annotations are automatically performed by the ArrayGene software by associating an Entrez Gene id to the sequence ids in gene lists provided by vendors. Gene coverage represents the percentage of uniquely mapped genes in the genome that are tested by gene-specific probes in the microarrays and oligosets.
Figure 3. Sequence Identifiers Associated with Entrez Gene ids in the ArrayGene database. The mRNA, genomic, and protein classes represent GenBank accession numbers that have been associated with an Entrez Gene by NCBI. RefSeq groups reference mRNAs, genomic and protein sequences. Ensembl transcripts are ids from Ensembl. Unigene, symbol, and synonym are gene identifications that have been cross-referenced with an Entrez Gene id by NCBI. Unigene links are sequence accession numbers that have been clustered into Unigene ids that are associated with Entrez Genes. NIA transcripts represent ids of transcripts from the National Institute on Aging mouse Gene Index (V. 4.0).
Figure 4. Efficiency of gene annotation of probes by ArrayGene. The height of the bars shows the total number of probes in the array. No sequence id represents probes for which there was no sequence annotation available in the gene list. Unknown position probes are those for which the gene is known but maps to multiple positions in the genome or to unordered scaffolds. Unspecific probes are those that can be associated with more than one Entrez Gene. Unknown sequence id refers to probes annotated with identifiers that could not be associated with an Entrez Gene. ArrayGene annotated probes are those that could be associated with a single Entrez Gene id that maps to a known position in the genome. For the ABI platform, the No sequence id fraction corresponds to about 4,000 probes targeting genes that have not been annotated by the public effort and are only available from the Celera gene discovery system.
Figure 5. Average number of probes or probe sets that test the same gene in different microarray platforms.
Gene coverage from mouse whole genome microarrays and oligonucleotide sets
Gene coverage was estimated as the proportion of Entrez Genes with a unique position in the UCSC Genome database that is tested by a given microarray platform. The genome-wide coverage varied among platforms, ranging from 47.95% to 90.47% (Table 4 and Figure 6). The lowest coverage was observed, as expected, for the oldest platform, the Sigma oligoset, with a total of 16,377 probes testing 12,188 genes. Agilent and Codelink showed very similar coverage levels (79.33% and 78.09%, respectively). Sentrix is the ready-to-use mouse microarray with the highest gene coverage, with 90.47% of the publicly available genes tested. This was followed by the public oligonucleotide set on which Sentrix was based, MEEBO (88.05%). The Operon AROS Arrays Oligosets showed clear improvement in gene coverage as new releases of their oligo data set became available. The latest release (Operon4) shows 84.03% coverage, higher than Operon3, which only covered 69.89%, and even higher than Agilent and Affy (79.33% and 75.6%, respectively).
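The coverage figure can be checked by hand from the counts reported in the text. The Python sketch below applies the ratio to the Sigma oligoset, using the 12,188 genes tested and the 25,416 uniquely mapped genes given above; the function name `coverage_percent` is ours and is not part of ArrayGene.

```python
def coverage_percent(genes_tested, genes_in_genome):
    """Coverage = (number of genes tested) / (number of genes in the genome) x 100."""
    if genes_in_genome <= 0:
        raise ValueError("genes_in_genome must be positive")
    return 100.0 * genes_tested / genes_in_genome

# Sigma oligoset: 12,188 genes tested out of 25,416 uniquely mapped Entrez Genes.
sigma = coverage_percent(12_188, 25_416)
print(f"{sigma:.2f}%")  # 47.95%
```

The result agrees with the 47.95% reported for Sigma. Note that the denominator is the set of uniquely mapped genes, so the percentage is relative to the 25,416 genes retained in the genemap table and not to every Entrez Gene.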
Figure 6. Gene coverage comparison between mouse microarray platforms and oligonucleotide sets. Coverage percent is calculated as (number of genes tested)/(number of genes in the genome) × 100. The total number of genes in the genome was calculated as the number of Entrez Genes that could be mapped to a unique position in the UCSC Genome Browser Database, mouse genome Build 35.1. Microarrays are ordered from left to right by date of release. Although this was not known for every platform, we estimated the date of release using the available information both from vendor web sites and from personal communications with their customer support (details in ).
Additional File 1. Microarray platforms compared in this study. Table provides details of the Genelists used, including filename, URL, date of release, date updated, and date obtained. The column called annot update lists the date that the Genelist available for the platform was last updated by the vendor.
Gene coverage by chromosome
The differences between platforms in terms of gene coverage are well conserved across chromosomes (Figure 7). Some exceptions, though, can be observed in specific cases. For instance, the Affymetrix platform has a particularly low coverage for mouse chromosomes 2 and 7 (69.4% and 70.5%, respectively), being outperformed even by the older Operon3 oligoset (71.6% and 71.5%, respectively). For the complete list of gene coverage by chromosome per platform, see the supplementary material. The database can be easily queried for gene coverage comparison on any region of any chromosome. Figure 8 shows an example output report for a 7.9 Mb region of mouse chromosome 2.
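A region-specific comparison of this kind amounts to a join between gene positions and probe annotations. The Python sketch below uses an in-memory SQLite database as a stand-in for the MySQL database described in the paper. The `genemap` table name comes from the text, but its columns, the `probe_gene` table, the platform names "A" and "B", and all rows are hypothetical. The sketch also assumes that a gene counts as inside the region only when it lies entirely within it, which the source does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genemap (gene_id INTEGER PRIMARY KEY, chrom TEXT,
                      gene_start INTEGER, gene_end INTEGER);
CREATE TABLE probe_gene (platform TEXT, probe_id TEXT, gene_id INTEGER);
""")
# Toy data: four genes on chr2 (one beyond 8 Mb) and one on chr7.
conn.executemany("INSERT INTO genemap VALUES (?, ?, ?, ?)", [
    (1, "chr2", 500, 2_000),
    (2, "chr2", 3_000_000, 3_010_000),
    (3, "chr2", 7_500_000, 7_520_000),
    (4, "chr2", 9_000_000, 9_050_000),
    (5, "chr7", 1_000, 4_000),
])
conn.executemany("INSERT INTO probe_gene VALUES (?, ?, ?)", [
    ("A", "a1", 1), ("A", "a2", 1), ("A", "a3", 3),
    ("B", "b1", 1), ("B", "b2", 2), ("B", "b3", 3), ("B", "b4", 4),
])

def region_coverage(conn, chrom, start, end):
    """Return {platform: coverage %} for genes lying entirely within the region."""
    region = "chrom = ? AND gene_start >= ? AND gene_end <= ?"
    args = (chrom, start, end)
    total = conn.execute(
        f"SELECT COUNT(*) FROM genemap WHERE {region}", args).fetchone()[0]
    if total == 0:
        return {}
    rows = conn.execute(
        f"""SELECT p.platform, COUNT(DISTINCT p.gene_id)
            FROM probe_gene AS p JOIN genemap AS g USING (gene_id)
            WHERE {region}
            GROUP BY p.platform ORDER BY p.platform""", args).fetchall()
    return {platform: 100.0 * tested / total for platform, tested in rows}

result = region_coverage(conn, "chr2", 100, 8_000_000)
print(result)
```

Counting distinct gene ids keeps redundant probes from inflating coverage: platform "A" has two probes for gene 1, yet tests only two of the three genes in the region.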
Additional File 2. Comparative gene coverage from whole mouse genome microarrays and oligo set. Table shows the number of Entrez Genes with a single genomic position in the genome by the UCSC Genome Browser Database, and the number of genes that are tested by each platform as absolute counts and as percentage from the number of genes in the chromosome.
Figure 7. Comparative view of gene coverage (%) between microarray platforms for each mouse chromosome.
Figure 8. Example of online summary report of gene annotations of microarrays platforms produced by ArrayGene. The researcher can query the database for a comparison of gene coverage for the whole genome or for any specific region of a given chromosome. The example shows a comparison for genes in mouse chromosome 2, between 100 and 8,000,000 bp (Genome Build 35.1). The results are shown in form of tables and color bar graphs.