Open Access | Software

PILER-CR: Fast and accurate identification of CRISPR repeats

Robert C Edgar

45 Monterey Dr., Tiburon, CA, USA

BMC Bioinformatics 2007, 8:18 doi:10.1186/1471-2105-8-18

Published: 20 January 2007

Additional files

Additional File 2:

Sample output generated by PILER-CR. The report has three sections: Detailed, Summary by Similarity and Summary by Position. The detailed section shows each repeat in each putative CRISPR array. The summary sections give one line for each array.

Columns in the detailed section are: Pos, sequence position; Repeat, length of the repeat; %id, identity with the consensus; Spacer, length of the spacer to the right of this repeat; Left flank, 10 bases to the left of this repeat; Repeat, sequence of this repeat (dots indicate positions where this repeat agrees with the consensus sequence below); Spacer, sequence of the spacer to the right of this repeat, or 10 bases if this is the last repeat. The left flank sequence duplicates the end of the spacer for the preceding repeat; it is provided to make it easier to spot cases where the algorithm does not correctly identify repeat endpoints. At the end of each array there is a sub-heading that gives the average repeat length, average spacer length and consensus sequence.

Columns in the summary sections are: Array, number 1, 2 ... referring back to the detailed report; Sequence, FASTA label of the sequence; From, start position of the array; To, end position of the array; # copies, number of repeats in the array; Repeat, average repeat length; Spacer, average spacer length; +, +/-, indicating orientation relative to the first array in the group; Distance, distance from the previous array; Consensus, consensus sequence.

In the Summary by Similarity section, arrays are grouped by similarity of their consensus sequences. If consensus sequences are sufficiently similar, they are aligned to each other to indicate probable relationships between arrays. In this example, Arrays 1–5 are very similar and are thus aligned; Array 6 appears to be unrelated and stands alone. In the Summary by Position section, arrays are sorted by position within the input sequence file.
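The dot notation in the Repeat column can be undone mechanically. The following is a minimal sketch, not PILER-CR code: the function names and the example sequences are hypothetical, it assumes the dotted repeat and the consensus have the same length, and it assumes that identity is simply the fraction of dot positions, which may differ from how PILER-CR computes %id.

```python
def expand_repeat(dotted: str, consensus: str) -> str:
    """Replace each '.' in a dotted repeat with the consensus base at that position.

    Assumption: both strings are aligned and have equal length.
    """
    if len(dotted) != len(consensus):
        raise ValueError("dotted repeat and consensus must have the same length")
    return "".join(c if d == "." else d for d, c in zip(dotted, consensus))


def percent_identity(dotted: str) -> float:
    """Share of positions (in percent) that agree with the consensus.

    Assumption: a '.' marks agreement and any other character marks a difference.
    """
    if not dotted:
        raise ValueError("empty repeat")
    return 100.0 * dotted.count(".") / len(dotted)


# Hypothetical six-base example, not taken from the sample report.
full_sequence = expand_repeat("...A..", "GTTTCC")
identity = percent_identity("...A..")
```

Here `full_sequence` holds the repeat with the consensus bases filled in, which is convenient when repeats need to be compared or exported as plain sequences.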
The Distance column helps to identify cases where a single array has been reported as two adjacent arrays. In such a case, (a) the consensus sequences will be similar or identical, and (b) the distance will be approximately a small multiple of the repeat length + spacer length.

This example also shows how the flanking sequences provide immediate visual feedback. Array 4 has only three repeats, and the last column in the repeat alignment is (ACA), i.e. it is not conserved. This column should probably be deleted from the repeat and moved to the spacer. Array 1 has a similar issue, but here it is less clear that the last column should be deleted: the first three repeats in the array are perfectly conserved, and it is common to find degraded copies of the repeat at the beginning and end of an array. This illustrates the difficulty of developing heuristics that match human judgment in the more difficult cases.
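The two conditions for suspecting that one array was reported as two adjacent arrays can be written as a simple check. This is a rough sketch under stated assumptions rather than anything PILER-CR itself does: the function name, the similarity measure (Python's `difflib.SequenceMatcher`), and the thresholds `min_similarity`, `max_multiple` and `tolerance` are all illustrative choices that do not come from the paper.

```python
from difflib import SequenceMatcher


def looks_like_split_array(consensus_a: str, consensus_b: str, distance: int,
                           repeat_len: float, spacer_len: float,
                           min_similarity: float = 0.9,
                           max_multiple: int = 3,
                           tolerance: float = 0.15) -> bool:
    """Heuristic: were two adjacent arrays probably one array?

    (a) The consensus sequences must be similar or identical.
    (b) The distance must be close to k * (repeat length + spacer length)
        for some small k.
    All three thresholds are assumed values for illustration only.
    """
    similarity = SequenceMatcher(None, consensus_a, consensus_b,
                                 autojunk=False).ratio()
    if similarity < min_similarity:
        return False

    period = repeat_len + spacer_len
    if period <= 0:
        return False

    # Accept the distance if it lies within tolerance * period of k * period.
    for k in range(1, max_multiple + 1):
        if abs(distance - k * period) <= tolerance * period:
            return True
    return False


# Hypothetical values: identical consensus, distance of two repeat-plus-spacer units.
suspect = looks_like_split_array("GTTTTAGAGCTA", "GTTTTAGAGCTA",
                                 distance=136, repeat_len=36, spacer_len=32)
```

A check like this only flags candidates for inspection; as the description notes, borderline cases still call for human judgment.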

Format: TXT Size: 7KB Download file

Additional File 1:

Additional file Tables 1 and 2

Format: DOC Size: 438KB Download file

This file can be viewed with: Microsoft Word Viewer

Additional File 3:

Gzip-compressed tar archive containing the source code and an i86 Linux binary.

Format: GZ Size: 1.5MB Download file

