|
Figure 1.
A Computation Workflow to Rapidly Identify and Update SFams. This workflow illustrates the general steps (boxes) used to initialize (left) and
update (right) the database of SFams (center). Where appropriate, the algorithms used
at each step are listed in parentheses, as are the e-value (e.g., 10^-10) and coverage thresholds (e.g., 80%) used to infer homology between a pair of sequences or between a sequence and an HMM.
The number of sequences or HMMs considered at various steps is also listed (e.g., N=720). The SFam database was initialized by identifying 720 de novo clustered families that are found in 50% of the 100 phylogenetically diverse representative
Bacterial and Archaeal genomes that we selected. The similarity between all pairs
of protein sequences from these genomes was calculated and used to cluster proteins
into families. Each SFam’s sequences were then aligned and used to train Hidden Markov
Models (HMMs). These HMMs were then used to screen for homologs among the ~7 million
protein sequences found in the 1,894 genomes we originally downloaded, which include
the 100 representative genomes. Detected homologs were added to the database of previously
identified SFams. All protein sequences in the database that are not SFam members
were then subjected to independent de novo clustering and HMM construction. A similar iterative approach is used to annotate
new genome sequences.
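The homology test and family clustering described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The `Hit` record, the function names, and the default thresholds (which echo the caption's example values of 10^-10 and 80%) are assumptions made for this sketch. Single-linkage grouping via union-find stands in for whichever clustering algorithm the figure lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise comparison result (hypothetical record for this sketch)."""
    query: str
    target: str
    evalue: float
    coverage: float  # fraction of the sequence covered by the alignment, 0..1


def is_homolog(hit, max_evalue=1e-10, min_coverage=0.80):
    """Apply both thresholds: a hit must be significant AND cover enough sequence."""
    return hit.evalue <= max_evalue and hit.coverage >= min_coverage


def cluster_families(sequence_ids, hits, max_evalue=1e-10, min_coverage=0.80):
    """Group sequences into families by linking every pair that passes the filter.

    Assumption: single-linkage clustering (connected components), used here
    only as a simple stand-in for the clustering step in the workflow.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Register unseen ids, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hit in hits:
        if is_homolog(hit, max_evalue, min_coverage):
            root_a, root_b = find(hit.query), find(hit.target)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for s in list(parent):
        families.setdefault(find(s), set()).add(s)
    # Sort members and families so the result is deterministic.
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])


example_hits = [
    Hit("a", "b", 1e-30, 0.95),  # passes both thresholds
    Hit("b", "c", 1e-12, 0.85),  # passes both thresholds
    Hit("c", "d", 1e-5, 0.90),   # fails the e-value threshold
    Hit("d", "e", 1e-40, 0.50),  # fails the coverage threshold
]
example_families = cluster_families(["a", "b", "c", "d", "e"], example_hits)
```

With these example hits, sequences a, b and c fall into one family, while d and e remain singletons, because each of their hits fails one of the two thresholds.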
Sharpton et al. BMC Bioinformatics 2012 13:264 doi:10.1186/1471-2105-13-264 |