SeqWare Query Engine schema. The HBase database is a generic key-value, column oriented database that pairs well with the inherent sparse matrix nature of variant annotations. (a) The primary table stores multiple genomes worth of generic features, variants, coverages, and variant consequences using genomic location within a particular reference genome as the key. Each genome is represented by a particular column family label (such as “variant:genome7”). For locations with more than one called variant the HBase timestamp is used to distinguish each. (b) Secondary indexing is accomplished using a secondary table per genome indexed. The key is the tag being indexed plus the ID of the object of interest, the value is the row key for the original table. This makes lookup by secondary indexes, “tags” for example, possible without having to iterate over all contents of the primary table.
O’Connor et al. BMC Bioinformatics 2010 11(Suppl 12):S2 doi:10.1186/1471-2105-11-S12-S2