Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change

1 Department of Biochemistry & Biophysics, University of Rochester Medical Center, 601 Elmwood Avenue, Box 712, Rochester, New York 14642, USA
2 Department of Biostatistics & Computational Biology, University of Rochester Medical Center, 601 Elmwood Avenue, Box 712, Rochester, New York 14642, USA
3 Center for Pediatric Biomedical Research, University of Rochester Medical Center, 601 Elmwood Avenue, Box 712, Rochester, New York 14642, USA
BMC Bioinformatics 2006, 7:173. doi:10.1186/1471-2105-7-173
Additional files

Additional File 1: Complete ROC curves for classification of sequence pairs by the Dynalign z score method. Adobe Acrobat PDF (version 4.0 or above) file showing complete ROC curves comparing the effectiveness of Dynalign z score classification of sequence pairs using three control generation methods and two M parameter values (M = 6 and M = 8). This is the same sequence test set that Figures 3, 4 and 5 are based on. In all cases, increasing the value of the M parameter improves prediction quality. Dark and light green: controls generated by first-order Markov chain sampling, tests run using M = 6 and M = 8, respectively. Brown and orange: controls generated by the Altschul-Erikson dinucleotide shuffle, tests run using M = 6 and M = 8, respectively. Dark and light blue: controls generated by the columnwise shuffle, tests run using M = 6 and M = 8, respectively. Format: PDF. Size: 51KB.

Additional File 2: Side-by-side comparison of Dynalign, RNAz, and QRNA classifications for each window in the MUMmer whole genome screen. Plain text, whitespace-delimited tabular data file. Each row is a window in the MUMmer whole genome alignment (15,214 windows total) of E. coli and S. typhi. Columns 1, 2, and 3: E. coli start and end nucleotide indices and strand (plus or minus) for that window. Columns 4, 5, and 6: S. typhi start and end nucleotide indices and strand (plus or minus) for that window. Column 7: Dynalign/LIBSVM probability that the window is ncRNA. Column 8: RNAz probability that the window is ncRNA. Column 9: QRNA classification of the window (ncRNA, ORF, or other). Format: TXT. Size: 848KB.

Additional File 3: Side-by-side comparison of Dynalign, RNAz, and QRNA classifications for each window in the WuBLASTn whole genome screen. Plain text, whitespace-delimited tabular data file. Each row is a window in the WuBLASTn whole genome alignment (90,404 windows total) of E. coli and S. typhi. Columns 1, 2, and 3: E. coli start and end nucleotide indices and strand (plus or minus) for that window. Columns 4, 5, and 6: S. typhi start and end nucleotide indices and strand (plus or minus) for that window. Column 7: Dynalign/LIBSVM probability that the window is ncRNA. Column 8: RNAz probability that the window is ncRNA. Column 9: QRNA classification of the window (ncRNA, ORF, or other). Format: TXT. Size: 4.8MB.

Additional File 4: MUMmer whole genome screen input data to the Dynalign/LIBSVM classifier. Plain text data file formatted for input to LIBSVM (not scaled). This is the MUMmer whole genome screen dataset input to the Dynalign/LIBSVM classifier (before scaling). Rows of this file correspond one-to-one with rows of Additional File 2; that is, row N in this file corresponds to the window described in row N of Additional File 2. Column 1 is the data label (all windows are initially assumed to be negatives and are labelled "-1"; the value is irrelevant here, because the column is only a placeholder required by LIBSVM). Column 2 is the Dynalign-computed ΔG°total; column 3 is the length of the shorter sequence; columns 4, 5, and 6 are the A, U, and C frequencies of sequence 1 (E. coli); columns 7, 8, and 9 are the A, U, and C frequencies of sequence 2 (S. typhi). Format: TXT. Size: 1.1MB.

Additional File 5: WuBLASTn whole genome screen input data to the Dynalign/LIBSVM classifier. Plain text data file formatted for input to LIBSVM (not scaled). This is the WuBLASTn whole genome screen dataset input to the Dynalign/LIBSVM classifier (before scaling). Rows of this file correspond one-to-one with rows of Additional File 3; that is, row N in this file corresponds to the window described in row N of Additional File 3. Column 1 is the data label (all windows are initially assumed to be negatives and are labelled "-1"; the value is irrelevant here, because the column is only a placeholder required by LIBSVM). Column 2 is the Dynalign-computed ΔG°total; column 3 is the length of the shorter sequence; columns 4, 5, and 6 are the A, U, and C frequencies of sequence 1 (E. coli); columns 7, 8, and 9 are the A, U, and C frequencies of sequence 2 (S. typhi). Format: TXT. Size: 6.8MB.

Additional File 6: LIBSVM datasets for every possible sequence pair of 5S rRNA, tRNA, and negative sequences. Nine plain text data files formatted for input to LIBSVM (not scaled) and three plain text files containing sequence codes for the LIBSVM files, all archived with GNU 'tar' and compressed with GNU 'gzip'. Our training and testing sets for the Dynalign/LIBSVM classifier were prepared from this dataset as described in "Methods." The file 'LIBSVM-set.5s-real' contains every possible pairing of the 309 known 5S rRNA sequences in our database, not counting sequences paired with themselves. The file 'LIBSVM-set.trna-real' contains every possible pairing of the 479 known tRNA sequences in our database, not counting sequences paired with themselves. The file 'LIBSVM-set.100ident-real' contains the 309 5S rRNA and 479 tRNA sequences paired with themselves (i.e. real sequence pairs of 100% identity). The files denoted 'neg-column' are columnwise-shuffled negatives generated from the corresponding real sequences; the files denoted 'neg-AE' are negatives generated from the corresponding real sequences by the Altschul-Erikson shuffle (see "Methods" for a description of both shuffles). The files denoted 'seqlist' contain the codes for sequence pairs (or for sequences aligned with themselves), with lines in one-to-one correspondence with the appropriate LIBSVM files. For example, line 42 of file 'seqlist.5s-pairs' contains the codes of the two 5S rRNA sequences that were used to generate the data on line 42 of files 'LIBSVM-set.5s-real', 'LIBSVM-set.5s-neg-column', and 'LIBSVM-set.5s-neg-AE'. For LIBSVM files, column 1 is the data label (1 for real, -1 for negative); column 2 is the Dynalign-computed ΔG°total; column 3 is the length of the shorter sequence; columns 4, 5, and 6 are the A, U, and C frequencies of sequence 1; columns 7, 8, and 9 are the A, U, and C frequencies of sequence 2. Format: GZ. Size: 3.7MB.

Additional File 7: LIBSVM model file for the Dynalign/LIBSVM classifier. The model file for a LIBSVM classifier, trained as described in "Methods." Classification with this model file outputs a probability for each prediction (P value) in addition to the prediction itself. Use this model with LIBSVM only on datasets that have been scaled as described in "Methods"; datasets scaled differently will be classified incorrectly. The input dataset should be a plain text, whitespace-delimited tabular file formatted as described for Additional File 6 and in the LIBSVM documentation [69]. Format: MODEL. Size: 2.1MB.
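The column layouts described for the additional files can be read with a few lines of Python. The following is a minimal sketch, not the authors' code. It assumes that the window tables (Additional Files 2 and 3) contain exactly nine whitespace-separated fields per row, and that the LIBSVM files use LIBSVM's standard sparse layout (`label index:value ...`) with feature index k holding the quantity listed above as column k + 1. The names `Window`, `parse_window_row`, `parse_libsvm_row`, and `FEATURE_NAMES` are hypothetical, and the sample rows contain made-up values for illustration only.

```python
from typing import Dict, NamedTuple, Tuple


class Window(NamedTuple):
    """One row of the side-by-side comparison tables (hypothetical type)."""
    ecoli_start: int
    ecoli_end: int
    ecoli_strand: str    # kept as text; the exact strand encoding is not assumed
    styphi_start: int
    styphi_end: int
    styphi_strand: str
    p_dynalign: float    # Dynalign/LIBSVM probability that the window is ncRNA
    p_rnaz: float        # RNAz probability that the window is ncRNA
    qrna_class: str      # ncRNA, ORF, or other


def parse_window_row(line: str) -> Window:
    """Parse one whitespace-delimited row with the nine columns described above."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return Window(int(f[0]), int(f[1]), f[2],
                  int(f[3]), int(f[4]), f[5],
                  float(f[6]), float(f[7]), f[8])


# Assumed mapping from LIBSVM feature index (1-based) to the described columns 2-9.
FEATURE_NAMES = ("dG_total", "shorter_length",
                 "A_seq1", "U_seq1", "C_seq1",
                 "A_seq2", "U_seq2", "C_seq2")


def parse_libsvm_row(line: str) -> Tuple[int, Dict[str, float]]:
    """Parse one row in LIBSVM sparse format: '<label> <index>:<value> ...'."""
    fields = line.split()
    label = int(float(fields[0]))  # 1 for real, -1 for negative or placeholder
    features: Dict[str, float] = {}
    for item in fields[1:]:
        index, value = item.split(":")
        features[FEATURE_NAMES[int(index) - 1]] = float(value)
    return label, features


# Made-up example rows (illustrative values only, not taken from the data files).
window = parse_window_row("1021 1140 + 998 1117 + 0.97 0.88 ncRNA")
label, feats = parse_libsvm_row(
    "-1 1:-41.3 2:120 3:0.25 4:0.22 5:0.27 6:0.24 7:0.23 8:0.26")
```

Because row N of Additional File 4 (or 5) describes the same window as row N of Additional File 2 (or 3), the two parsers can be applied to the paired files line by line to join each window's coordinates with its classifier input.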



