Pipeline for deriving the negative and positive datasets of GIs from a single query genome. The inputs are the query genome and a genome distance matrix pre-computed with CVTree (A). If enough suitable reference genomes are selected for comparison with the query genome, the query and reference genomes are aligned in a Mauve multiple genome alignment, and all conserved regions are extracted into the negative dataset of GIs (B). The positive dataset is constructed by aligning the query genome pair-wise with each reference genome. All overlapping unaligned regions found in the query genome across these pair-wise alignments are then filtered using NCBI BLAST to ensure that they are truly unique to the query genome (C).
Langille et al. BMC Bioinformatics 2008 9:329 doi:10.1186/1471-2105-9-329
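
The overlap step in panel C can be illustrated with a minimal sketch. It assumes that "overlapping" means the parts of the query genome left unaligned in every pair-wise alignment, and that each alignment has already been reduced to a list of unaligned intervals in query coordinates. The function names, the half-open interval representation, and the `min_length` parameter are hypothetical and are not taken from the published pipeline.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on the query genome (an assumption).
Interval = Tuple[int, int]


def intersect_two(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def candidate_regions(
    unaligned_per_reference: List[List[Interval]],
    min_length: int = 0,  # hypothetical size cut-off, not from the source
) -> List[Interval]:
    """Return query regions that are unaligned against every reference genome."""
    if not unaligned_per_reference:
        return []
    common = sorted(unaligned_per_reference[0])
    for regions in unaligned_per_reference[1:]:
        common = intersect_two(common, sorted(regions))
    return [(s, e) for s, e in common if e - s >= min_length]


# Toy example: unaligned query intervals from three pair-wise alignments.
unaligned = [
    [(100, 500), (900, 1200)],
    [(300, 600), (1000, 1100)],
    [(250, 450), (950, 1300)],
]
regions = candidate_regions(unaligned)
```

The regions returned here are only candidates. The subsequent BLAST filter, which checks that each candidate is truly unique to the query genome, is not part of this sketch.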