General strategy for informatic and functional analysis of centromere satellite domains in complex genomes. The diagram and underlying flow chart highlight three phases of sequence processing and centromeric database construction. Phase I defines the sequences that are not assigned to a specific chromosome in the current genome reference assembly (all unassembled reads, as well as the reads that constitute the assembled but unmapped contigs, or canFam2.0 chrUn). Of the tandemly repeated satellite sequence families within this database, seven were enriched in centromeric regions, resulting in an inventory of all satellites and any adjacent non-satellite sequences. Phase II reformats the read database from Phase I into a list of unique k-mers shown to be specific to the pericentromere, each classified as single-copy or multi-copy based on its observed sequence frequency in the genome. These k-mers form a library describing all inherent sequence variation in centromeric regions and are useful in Phase III for investigating enrichment trends in next-generation sequencing datasets, such as CENP-A ChIP sequence reads. Comparative analyses yield a list of functional k-mers that define the genomic context of the centromere. K-mers are mapped back to the read and paired-read dataset to study regional sequence organization.
Hayden and Willard BMC Genomics 2012 13:324 doi:10.1186/1471-2164-13-324
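
The Phase II step described in the caption (turning a read database into a list of unique k-mers, each labeled single-copy or multi-copy by observed frequency) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the choice of k, and the `multi_copy_threshold` cutoff are all hypothetical, and the sketch counts k-mers within the supplied reads only, without collapsing reverse complements or checking pericentromere specificity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq, skipping any that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_library(reads, k):
    """Count how often each unique k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def classify(counts, multi_copy_threshold=2):
    """Label each k-mer by copy number.

    The threshold is an assumed illustrative cutoff, not a value
    taken from the paper.
    """
    return {
        kmer: "multi-copy" if n >= multi_copy_threshold else "single-copy"
        for kmer, n in counts.items()
    }


# Toy input standing in for the Phase I read database (hypothetical data).
reads = ["ACGTACGT", "TTACGN"]
library = build_library(reads, k=4)
labels = classify(library)
```

In practice the frequency would be measured against the whole genome rather than a handful of reads, and the cutoff would need to account for sequencing depth.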