A Computational Workflow to Rapidly Identify and Update SFams. This workflow illustrates the general steps (boxes) used to initialize (left) and update (right) the database of SFams (center). Where appropriate, the algorithm used at each step is listed in parentheses, as are the e-value (e.g., 1e-10) and coverage (e.g., 80%) thresholds used to infer homology between a pair of sequences or between a sequence and a hidden Markov model (HMM). The number of sequences or HMMs considered at various steps is also listed (e.g., N=720). The SFam database was initialized by identifying 720 de novo clustered families that are found in 50% of the 100 phylogenetically diverse representative bacterial and archaeal genomes that we selected. The similarity between all pairs of protein sequences from these genomes was calculated and used to cluster proteins into families. Each SFam's sequences were then aligned and used to train an HMM. These HMMs were then used to screen for homologs among the ~7 million protein sequences found in the 1,894 genomes we originally downloaded, which include the 100 representative genomes. Detected homologs were added to the database of previously identified SFams. All protein sequences in the database that were not SFam members were then subjected to independent de novo clustering and HMM construction. A similar iterative approach is used to annotate new genome sequences.
Sharpton et al. BMC Bioinformatics 2012 13:264 doi:10.1186/1471-2105-13-264
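The recruit-then-cluster step described in the caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Hit` record, the `recruit` function, the best-hit tie-breaking rule, and the inclusive comparisons are all assumptions made for the example, and the default thresholds simply reuse the example values quoted in the caption. Sequences that pass both thresholds against an existing family HMM are assigned to that family; everything else is returned as the pool that would go on to de novo clustering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One sequence-vs-HMM search result (hypothetical record layout)."""
    seq_id: str
    family_id: str
    evalue: float
    coverage: float  # fraction of the sequence covered by the match, 0..1


def passes_thresholds(hit, max_evalue=1e-10, min_coverage=0.8):
    # Assumption: both thresholds are inclusive; the caption does not say.
    return hit.evalue <= max_evalue and hit.coverage >= min_coverage


def recruit(hits, all_seq_ids, max_evalue=1e-10, min_coverage=0.8):
    """Split sequences into family members and leftovers.

    Returns (assignments, unassigned), where assignments maps seq_id to
    family_id and unassigned is the sorted list of sequence IDs with no
    passing hit, i.e. the candidates for de novo clustering.
    """
    best = {}
    for hit in hits:
        if not passes_thresholds(hit, max_evalue, min_coverage):
            continue
        current = best.get(hit.seq_id)
        # Assumption: a sequence joins the family with the lowest e-value.
        if current is None or hit.evalue < current.evalue:
            best[hit.seq_id] = hit
    assignments = {seq_id: hit.family_id for seq_id, hit in best.items()}
    unassigned = sorted(set(all_seq_ids) - set(assignments))
    return assignments, unassigned
```

In a full update cycle, the `unassigned` sequences would be clustered into new families, new HMMs would be trained on them, and the same function could be applied again when further genomes arrive.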