<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-5-132</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Research article</dochead>
		<bibl>
			<title>
				<p>Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Cummings</snm>
					<mi>P</mi>
					<fnm>Michael</fnm>
					<insr iid="I1"/>
					<email>mike@umiacs.umd.edu</email>
				</au>
				<au id="A2">
					<snm>Myers</snm>
					<mi>S</mi>
					<fnm>Daniel</fnm>
					<insr iid="I1"/>
					<email>dmyers@umiacs.umd.edu</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742-3360, USA</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<issn>1471-2105</issn>
			<pubdate>2004</pubdate>
			<volume>5</volume>
			<issue>1</issue>
			<fpage>132</fpage>
			<url>http://www.biomedcentral.com/1471-2105/5/132</url>
			<xrefbib>
				<pubidlist>
					<pubid idtype="pmpid">15373947</pubid>
					<pubid idtype="doi">10.1186/1471-2105-5-132</pubid>
				</pubidlist>
			</xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>07</day>
					<month>4</month>
					<year>2004</year>
				</date>
			</rec>
			<acc>
				<date>
					<day>16</day>
					<month>9</month>
					<year>2004</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>16</day>
					<month>9</month>
					<year>2004</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2004</year>
			<collab>Cummings and Myers; licensee BioMed Central Ltd.</collab>
			<note>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>RNA editing is the process whereby an RNA sequence is modified from the sequence of the corresponding DNA template. In the mitochondria of land plants, some cytidines are converted to uridines before translation. Despite substantial study, the molecular biological mechanism by which C-to-U RNA editing proceeds remains relatively obscure, although several experimental studies have implicated a role for <it>cis</it>-recognition. A highly non-random distribution of nucleotides is observed in the immediate vicinity of edited sites (within 20 nucleotides 5' and 3'), but no precise consensus motif has been identified.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>Data for analysis were derived from the complete mitochondrial genomes of <it>Arabidopsis thaliana</it>, <it>Brassica napus</it>, and <it>Oryza sativa</it>; additionally, a combined data set of observations across all three genomes was generated. We selected datasets based on the 20 nucleotides 5' and the 20 nucleotides 3' of edited sites and an equivalently sized and appropriately constructed null-set of non-edited sites. We used tree-based statistical methods and random forests to generate models of C-to-U RNA editing based on the nucleotides surrounding the edited/non-edited sites and on the estimated folding energies of those regions. Tree-based statistical methods based on primary sequence data surrounding edited/non-edited sites and estimates of free energy of folding yield models with optimistic re-substitution-based estimates of ~0.71 accuracy, ~0.64 sensitivity, and ~0.88 specificity. Random forest analysis yielded better models and more exact performance estimates with ~0.74 accuracy, ~0.72 sensitivity, and ~0.81 specificity for the individual genomes, and ~0.85 accuracy, ~0.82 sensitivity, and ~0.88 specificity for the combined observations.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>Simple models do moderately well in predicting which cytidines will be edited to uridines, and provide the first quantitative predictive models for RNA edited sites in plant mitochondria. Our analysis shows that the identity of the nucleotide -1 to the edited C and the estimated free energy of folding for a 41 nt region surrounding the edited C are the most important variables that distinguish most edited from non-edited sites. However, the results suggest that primary sequence data and simple free energy of folding calculations alone are insufficient to make highly accurate predictions.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>RNA editing is the process whereby an RNA sequence is modified from the sequence corresponding to the DNA template. A particular form of RNA editing in plant mitochondria, by which some cytidines are converted to uridines before translation, occurs in many land plant lineages. Although cytidine to uridine conversion is most common, the reverse conversion is sometimes observed <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. In plants, the phenomenon is best studied, albeit still poorly understood, in the mitochondria and plastids of angiosperms <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>.</p>
			<p>The majority of plant mitochondrial RNA editing occurs in coding sequences, and editing frequently changes codons, resulting in changes of amino acids, or, in some cases, creation of entirely new open reading frames <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. These changes often result in an increase in similarity with respect to homologous protein sequences among different organisms (such as in wheat <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>), and Gray has postulated that the RNA editing process functions as a repair mechanism to correct otherwise-deleterious genomic mutations <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. RNA editing has also been detected in introns, where it is conjectured to improve splicing efficiency <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
			<p>The precise biochemical basis for C-to-U editing in plant mitochondria is unknown, although experimental evidence suggests a deamination reaction <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. Despite substantial study, the molecular biological mechanism by which C-to-U RNA editing proceeds remains relatively obscure, although several experimental studies have implicated a role for <it>cis</it>-recognition <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. The mechanism by which edited sites are recognized is also still poorly understood, but the importance of surrounding nucleotides has been noted <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. A highly non-random distribution of nucleotides in the immediate vicinity of edited sites (within 10&#8211;20 nucleotides 5' and 3') is observed, but no precise consensus motif has been identified <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B16">16</abbr></abbrgrp>. Additionally, previous studies suggest that inferred secondary structure is not important in site recognition for C-to-U conversion <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B19">19</abbr></abbrgrp>.</p>
			<p>Identifying edited sites thus remains an open problem, one to which we have applied tree-based statistical models and an extension of such models. When applied to a similar problem (predicting peptide binding to major histocompatibility complex (MHC) class I molecules <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>), tree-based statistical methods generated very accurate models, identifying specific important residues when no precise sequence motif had previously been identified. Therefore, we were motivated to apply tree-based statistical models and an extension, random forests, to the problem of C-to-U RNA editing in angiosperm mitochondria using complete mitochondrial genome data for three species: <it>Arabidopsis thaliana</it>, <it>Brassica napus </it>and <it>Oryza sativa</it>. The objective for the current research was to identify sequence features that may provide insights into C-to-U editing of plant mitochondrial RNA. We address the following specific questions. Is there evidence that sufficient information exists within sequence regions flanking edited sites to accurately predict editing? Is there an association between estimated free energy of folding for short sequence regions containing edited sites and C-to-U editing? We report tree-based statistical analysis of three complete mitochondrial genomes and show that relatively simple models provide moderately accurate prediction of C-to-U edited sites.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<sec>
				<st>
					<p>Tree-based statistical models</p>
				</st>
				<p>Analysis of each of the three species-specific mitochondrial genome data sets yielded substantially similar results (Table <tblr tid="T1">1</tblr>). Using flanking nucleotides and estimates of folding energy as predictor variables, the optimistic re-substitution-based estimates for cross-validated pruned models had a mean correct classification rate of 0.705 (sensitivity [the proportion of observations correctly identified as edited] <graphic file="1471-2105-5-132-i1.gif"/> = 0.640, and specificity [the proportion of observations correctly identified as non-edited] <graphic file="1471-2105-5-132-i1.gif"/> = 0.883) across the three species.</p>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Summary statistics for tree-based statistical models.</p>
					</caption>
					<tblbdy cols="4">
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>Accuracy</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Arabidopsis thaliana</it>
								</p>
							</c>
							<c ca="center">
								<p>0.711</p>
							</c>
							<c ca="center">
								<p>0.645</p>
							</c>
							<c ca="center">
								<p>0.888</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Brassica napus</it>
								</p>
							</c>
							<c ca="center">
								<p>0.693</p>
							</c>
							<c ca="center">
								<p>0.630</p>
							</c>
							<c ca="center">
								<p>0.887</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Oryza sativa</it>
								</p>
							</c>
							<c ca="center">
								<p>0.709</p>
							</c>
							<c ca="center">
								<p>0.645</p>
							</c>
							<c ca="center">
								<p>0.874</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>combined</p>
							</c>
							<c ca="center">
								<p>0.705</p>
							</c>
							<c ca="center">
								<p>0.640</p>
							</c>
							<c ca="center">
								<p>0.882</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>As an additional classification tree analysis, we examined a dataset generated by combining the data from the three species. These results were generally similar to those described above for the mean of the individual genome datasets. The classification tree model is shown in Figure <figr fid="F1">1</figr>; the partition is defined based on the nucleotide immediately 5' (-1 position) of the edited/non-edited site. Of the 1972 observations with pyrimidine at the -1 position, 1262 (0.64) are edited and 710 (0.36) are non-edited sites. Of the 722 observations with purine at the -1 position, 85 (0.12) are edited and 637 (0.88) are non-edited sites.</p>
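				<p>The arithmetic behind this single split can be checked directly from the node counts reported above. The following Python sketch is illustrative only; the function and variable names are ours and are not part of the original analysis, which was carried out in R.</p>

```python
# Node counts for the combined dataset, taken from the text above.
# Keys are the nucleotide class at position -1; values are (edited, non_edited).
split_counts = {
    "pyrimidine": (1262, 710),
    "purine": (85, 637),
}


def branch_fractions(edited, non_edited):
    """Return the fractions of edited and non-edited sites in one branch."""
    total = edited + non_edited
    return edited / total, non_edited / total


def resubstitution_accuracy(counts):
    """Accuracy when each branch predicts its own majority class."""
    correct = sum(max(pair) for pair in counts.values())
    total = sum(sum(pair) for pair in counts.values())
    return correct / total


for name, (edited, non_edited) in split_counts.items():
    frac_edited, frac_non_edited = branch_fractions(edited, non_edited)
    print(f"{name}: edited {frac_edited:.2f}, non-edited {frac_non_edited:.2f}")

print(f"re-substitution accuracy: {resubstitution_accuracy(split_counts):.3f}")
```

				<p>Run as written, the sketch reproduces the branch fractions quoted in the text (0.64 and 0.36; 0.12 and 0.88) and a re-substitution accuracy of 0.705 for the combined dataset.</p>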
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Cross-validated pruned classification tree for the combined dataset</p>
					</caption>
					<text>
						<p><b>Cross-validated pruned classification tree for the combined dataset. </b>The numbers of edited and non-edited sites are given at each node. The single split is based on the nucleotide at position -1 relative to the edited site.</p>
					</text>
					<graphic file="1471-2105-5-132-1"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Random forests</p>
				</st>
				<p>Results from random forests (Table <tblr tid="T2">2</tblr>) were very similar to those obtained with classification trees and were somewhat more accurate. In single-species analyses, the mean accuracy rate was 0.744 (sensitivity <graphic file="1471-2105-5-132-i1.gif"/> = 0.717, specificity <graphic file="1471-2105-5-132-i1.gif"/> = 0.809). Analysis of the larger, combined data set yielded a model better than any of the single genome models with an accuracy of 0.848 (Table <tblr tid="T2">2</tblr>). Analysis of variable importance showed that the -1 position is overwhelmingly the most important factor in determining editing status. Other variables of lesser predictive value include estimated free energy of folding, and the -2 and +1 positions relative to the edited/non-edited site (Figure <figr fid="F2">2</figr>).</p>
				<tbl id="T2">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Summary statistics for random forest models.</p>
					</caption>
					<tblbdy cols="4">
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>Accuracy</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Arabidopsis thaliana</it>
								</p>
							</c>
							<c ca="center">
								<p>0.744</p>
							</c>
							<c ca="center">
								<p>0.701</p>
							</c>
							<c ca="center">
								<p>0.811</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Brassica napus</it>
								</p>
							</c>
							<c ca="center">
								<p>0.765</p>
							</c>
							<c ca="center">
								<p>0.733</p>
							</c>
							<c ca="center">
								<p>0.808</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Oryza sativa</it>
								</p>
							</c>
							<c ca="center">
								<p>0.722</p>
							</c>
							<c ca="center">
								<p>0.716</p>
							</c>
							<c ca="center">
								<p>0.808</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>combined</p>
							</c>
							<c ca="center">
								<p>0.848</p>
							</c>
							<c ca="center">
								<p>0.823</p>
							</c>
							<c ca="center">
								<p>0.877</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Variable importance measures for the combined dataset</p>
					</caption>
					<text>
						<p><b>Variable importance measures for the combined dataset. </b> Numbered positions represent nucleotide state variables (with position zero representing the edited/non-edited site). The importance of each position is the decrease in the Gini index (a measure of impurity) induced by splitting the data on that position averaged over all trees (higher values are more important). The three remaining variables are the codon position of the edited site (cp), estimated free energy of folding for the entire 41-nucleotide sequence centered on the edited/non-edited site (fe), and the difference in estimated free energy of folding between the edited and non-edited versions of the 41-nucleotide sequence (dfe).</p>
					</text>
					<graphic file="1471-2105-5-132-2"/>
				</fig>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>Despite their simplicity, the tree-based statistical models derived here performed moderately well, with mean accuracies across species generally ~0.71. Single trees were improved upon by constructing models based on ensembles of tree-based models (random forests) each of which was built using random subsamples of the data. This sub-sampling has the effect of reducing the variance through averaging and also reducing the correlation among models.</p>
			<p>One of the advantages that random forests have over single classification trees is that they provide quantitative measures of variable importance, whereas with a simple classification tree, one is primarily limited to inferring variable importance from the frequency and location of the occurrence of variables in the model. One measure of variable importance is the decrease in the Gini index (a measure of impurity of observations at a particular node) induced by splitting on the variable, averaged over all trees <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>.</p>
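			<p>To make the Gini-based measure concrete, the sketch below computes the impurity decrease for one split, using the node counts of the -1 split shown in Figure 1. It is a minimal illustration under our own naming, and it covers a single split only; the random forest measure averages such decreases over all trees, which is not reproduced here.</p>

```python
def gini(counts):
    """Gini impurity of a node given its per-class observation counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent, children):
    """Parent impurity minus the size-weighted mean impurity of the children."""
    n_parent = sum(parent)
    weighted = sum(sum(child) / n_parent * gini(child) for child in children)
    return gini(parent) - weighted


# (edited, non-edited) counts for the split on position -1 (Figure 1).
pyrimidine = (1262, 710)
purine = (85, 637)
root = (pyrimidine[0] + purine[0], pyrimidine[1] + purine[1])

print(f"root impurity: {gini(root):.3f}")
print(f"decrease from splitting on -1: {gini_decrease(root, [pyrimidine, purine]):.3f}")
```

			<p>Because the combined dataset holds equal numbers of edited and non-edited sites, the root impurity is 0.5, the maximum for two classes; any informative split lowers the weighted impurity of the child nodes.</p>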
			<p>In order to infer the relative importance of the predictor variables, we considered the measure of variable importance produced during the random forest run on the combined dataset, which is the most broadly representative dataset considered here. A plot of the variable importance measure for this dataset is shown in Figure <figr fid="F2">2</figr>; more important variables are shown as higher bars. The measure strongly indicates that the residue immediately 5' of the edited site (-1 position) is very important. These variable importance results agree with previous work on C-to-U editing in mitochondria of <it>Arabidopsis thaliana</it>, which noted that the -1 and -2 positions had highly non-random nucleotide distributions <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. However, the results here differ from the past study of <it>Arabidopsis </it>in that we find no indication that the -17 position has much importance in edited site recognition. The earlier study also noted that the -1 position contained a pyrimidine 93.1% of the time <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>; pyrimidine versus purine at the -1 position is the data partition found by the classification trees.</p>
			<p>The free energy results contrast with previous studies indicating that secondary structure was not important in edited site recognition <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B19">19</abbr></abbrgrp>. Our results show free energy is a relatively important variable in the random forest analyses. These results therefore indicate that secondary structure, as measured by free energy of folding for the 41 nt region centered on an edited/non-edited site, does help in distinguishing edited from non-edited sites. Previous studies determined putative secondary structures for mRNA regions containing edited sites and looked for conserved structural motifs. In contrast, we used estimates of free energy of folding, which are much easier to compare quantitatively. It may be that secondary or tertiary structure is even more important in determining edited sites than shown here; however, secondary structure may not be effectively represented by the calculated estimates of free energy of folding analyzed.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>Simple models based on nucleotides surrounding edited/non-edited sites and on estimated folding energies of those regions provide moderately accurate prediction of C-to-U RNA edited sites. More nuanced representation of secondary or higher-order structure in combination with variables based on the nucleotide positions found important here might improve models. Overall, the results suggest that the C-to-U editing mechanism in plant mitochondria does not depend exclusively on the primary sequence in the immediate vicinity of the edited site.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Data sources</p>
				</st>
				<p>We obtained complete mitochondrial genome sequences and information regarding edited sites from GenBank <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> for three species: <it>Arabidopsis thaliana </it>(L.) Heynh. (mouse-ear cress), 455 edited sites, GenBank accession number NC_001284 <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>; <it>Brassica napus </it>L. (rapeseed), 425 edited sites, GenBank accession number AP006444 <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>; and <it>Oryza sativa </it>L. (rice), 486 edited sites, GenBank accession numbers AB076665 and AB076666 <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. None of the GenBank entries noted U-to-C RNA edited sites.</p>
			</sec>
			<sec>
				<st>
					<p>Variable selection</p>
				</st>
				<p>Incomplete annotations in the GenBank sequences required us to algorithmically determine on which strand an edited site fell (the GenBank files sometimes supplied only a position number, with no strand information). The algorithm, implemented in a Perl script, scanned the entire GenBank file and built an in-memory representation of the layout of all genes and coding sequence regions in the genome. The strand with which an edited site was associated could then be determined by consulting the resultant genome map and checking which strand at the edited site contained a gene region. In no case were genes on both strands at an edited site, so strand localization was always unambiguous. In a few cases, however, a gene containing an edited site could not be located, or a site marked as a C-to-U edit did not contain a C in either strand. In these cases, the supposed edited site was eliminated from further consideration. Final numbers of included sites were as follows: <it>Arabidopsis</it>, 444; <it>Brassica</it>, 422; <it>Oryza</it>, 481. In total, 19 edited sites in the GenBank files were not included across all three species.</p>
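				<p>The strand-assignment logic described above can be sketched as follows. This is not the Perl script used in the study; it is a small Python illustration in which the <it>Feature </it>record, the function names, and the toy coordinates are all hypothetical. It assumes 1-based inclusive coordinates and that the gene map has already been parsed from the GenBank file.</p>

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    """A gene or CDS interval; coordinates are 1-based and inclusive."""
    start: int
    end: int
    strand: str  # "+" or "-"


def strand_at(position: int, features: List[Feature]) -> Optional[str]:
    """Return the strand of the feature(s) covering position.

    Returns None when no feature covers the site, or when features on
    both strands cover it, so that the caller can exclude the site.
    """
    strands = {f.strand for f in features if position in range(f.start, f.end + 1)}
    if len(strands) == 1:
        return strands.pop()
    return None


COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_cytidine(genome: str, position: int, strand: str) -> bool:
    """Check that the base at a 1-based position is C on the given strand."""
    base = genome[position - 1].upper()
    if strand == "-":
        base = COMPLEMENT.get(base, "N")
    return base == "C"


# Toy example with made-up coordinates and sequence.
features = [Feature(1, 6, "+"), Feature(8, 12, "-")]
genome = "ATGCCATTGGAA"
print(strand_at(4, features), is_cytidine(genome, 4, "+"))
print(strand_at(9, features), is_cytidine(genome, 9, "-"))
print(strand_at(7, features))
```

				<p>A site for which <it>strand_at </it>returns nothing, or for which <it>is_cytidine </it>fails, corresponds to the cases that were excluded from further consideration.</p>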
				<p>We also constructed a set of null observations of cytidines that are not edited to uridines. In constructing a null-set, it is important to ensure that the observations are as alike as possible to the edited observations (differing only in the trait to be measured), or the resulting model may be fictive. Here, our null-set observations were non-edited cytidines chosen at random from within gene regions of the genome. Additionally, we chose cytidines such that the null set had exactly the same distribution of codon positions as did the edited set, because the distribution of edited sites within the three possible positions of a codon is highly non-random with a bias to the first two positions <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> (Table <tblr tid="T3">3</tblr>).</p>
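				<p>One way to draw such a codon-position-matched null set is sketched below. The function name, the input layout, and the fixed random seed are assumptions made for illustration; the original sampling procedure may have differed in detail.</p>

```python
import random
from collections import Counter, defaultdict


def sample_matched_null(edited_codon_positions, candidates, seed=0):
    """Draw non-edited cytidines matching the edited set's codon positions.

    edited_codon_positions: list of codon positions (for example 1, 2, 3,
        or None for sites outside a codon), one per edited site.
    candidates: list of (site_id, codon_position) for non-edited cytidines
        located within gene regions.
    Returns a list of site_ids with exactly the same codon-position counts.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for site_id, codon_position in candidates:
        pool[codon_position].append(site_id)

    chosen = []
    for codon_position, needed in Counter(edited_codon_positions).items():
        available = pool[codon_position]
        if needed > len(available):
            raise ValueError(f"not enough candidates at codon position {codon_position}")
        # Sampling without replacement keeps every null observation distinct.
        chosen.extend(rng.sample(available, needed))
    return chosen
```

				<p>Sampling within each codon-position stratum, rather than from the pooled candidates, is what guarantees that the null set and the edited set have identical codon-position distributions.</p>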
				<tbl id="T3">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Counts of C-to-U edited sites for each codon position.</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p/>
							</c>
							<c cspan="3" ca="center">
								<p>Codon Position</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c cspan="3">
								<hr/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Species</p>
							</c>
							<c ca="center">
								<p>1</p>
							</c>
							<c ca="center">
								<p>2</p>
							</c>
							<c ca="center">
								<p>3</p>
							</c>
							<c ca="center">
								<p>Not in Codon</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Arabidopsis thaliana</it>
								</p>
							</c>
							<c ca="center">
								<p>149</p>
							</c>
							<c ca="center">
								<p>231</p>
							</c>
							<c ca="center">
								<p>51</p>
							</c>
							<c ca="center">
								<p>13</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Brassica napus</it>
								</p>
							</c>
							<c ca="center">
								<p>142</p>
							</c>
							<c ca="center">
								<p>243</p>
							</c>
							<c ca="center">
								<p>33</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>Oryza sativa</it>
								</p>
							</c>
							<c ca="center">
								<p>174</p>
							</c>
							<c ca="center">
								<p>230</p>
							</c>
							<c ca="center">
								<p>77</p>
							</c>
							<c ca="center">
								<p>0</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>For each observation, we recorded 40 nucleotide state variables: one variable for each of the 20 nucleotide sites 5' and 3' of the edited C (on the same strand). We chose a value of 20 for the number of nucleotides 5' and 3' so as to encompass the entire range of semi-conserved positions previously suggested, the most extreme of which occurs 17 bases 5' of the edited site <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. In some cases other edited sites occurred within the 20 nucleotides 5' and 3' of the edited site used as a response variable. In these cases, the neighboring edited sites were recorded as C when used as predictor variables. The low frequency of these sites at a particular position with respect to other edited sites results in non-significant effects, independent of how these sites are handled. In those cases where a full 20 nucleotides were not included within an annotated mRNA, the missing nucleotides were treated as unknown. Additionally, we included two variables based on free energy expressed in units of kcal/mole at 20&#176;C: the estimated free energy of folding for each 41-nucleotide sequence (20 bases 5', the edited/non-edited base, 20 bases 3') and the change in free energy of folding between the non-edited and edited versions of the 41-nucleotide sequence. Free energies of folding were calculated using mfold <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp> version 3.1 with program parameters except temperature at default values. Finally, we included codon position as a variable, even though the null set had been chosen so non-edited sites had the same distribution of codon position as the edited sites, as shown in Table <tblr tid="T3">3</tblr>. Including codon position as a predictor variable allows for possible interactions with other variables.</p>
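				<p>The extraction of the 40 flanking nucleotide states, with padding where the annotated mRNA ends, might look like the following sketch. The use of "N" for an unknown nucleotide and the function names are our assumptions; the free energy variables are not computed here, since they require mfold.</p>

```python
FLANK = 20  # nucleotides recorded on each side of the focal cytidine


def flanking_states(transcript, index, flank=FLANK, unknown="N"):
    """Return a dict mapping relative position to nucleotide state.

    transcript: sequence of the annotated mRNA, 5' to 3'.
    index: 0-based index of the focal C within transcript.
    Positions that fall outside the transcript are recorded as unknown.
    """
    if transcript[index].upper() != "C":
        raise ValueError("focal site is not a cytidine")
    states = {}
    for offset in range(-flank, flank + 1):
        if offset == 0:
            continue  # position 0 is the response site, not a predictor
        i = index + offset
        inside = i >= 0 and len(transcript) > i
        states[offset] = transcript[i].upper() if inside else unknown
    return states


def window(transcript, index, flank=FLANK, unknown="N"):
    """Return the 41-nt string (for flank=20) centred on the focal site."""
    states = flanking_states(transcript, index, flank, unknown)
    states[0] = transcript[index].upper()
    return "".join(states[k] for k in sorted(states))
```

				<p>With the default flank of 20, <it>flanking_states </it>yields the 40 predictor positions (-20 to -1 and +1 to +20), and <it>window </it>yields the 41-nucleotide sequence that would be passed to a folding program.</p>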
				<p>Finally, we created a combined data set to use alongside the species-specific datasets. The combined dataset is the result of combining all edited sites from all three species (there were no observations identical in all predictor variables), and then randomly selecting negative examples from the set of those already chosen for the three individual datasets. Negative examples were chosen to exactly match the positive examples in distribution over both species and codon position. The combined dataset comprises 2,694 observations.</p>
			</sec>
			<sec>
				<st>
					<p>Data analysis</p>
				</st>
				<sec>
					<st>
						<p>Tree-based statistical models</p>
					</st>
					<p>We used the R language for statistical computing <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, version 1.7.1 to conduct our analyses. Analyses included tree-based statistical models using rpart <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and random forests using the FORTRAN implementation of random forest version 3.1 <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B32">32</abbr></abbrgrp>.</p>
					<p>Tree-based statistical models <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, also known as classification and regression trees (CART) <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, are generated by recursively creating binary partitions of a dataset. Each partition is based on the value of a single predictor variable chosen to best produce homogeneous collections of a nominal or ordinal response variable (classification) or to best separate low and high values of a continuous response variable (regression). More precisely, the partitions may be considered as questions of the following form: Is the observation x<sub><it>i </it></sub>&#8712; <it>A</it>, where <it>A </it>is a region of the variable space defined by some criterion of a single predictor variable? Answering such a question for all observations produces two groups: those observations for which the answer is <it>yes </it>(those in region <it>A</it>) and those for which the answer is <it>no </it>(x<sub><it>i </it></sub>&#8713; <it>A</it>, those in <graphic file="1471-2105-5-132-i2.gif"/>). Subsequent binary partitioning continues until stopping criteria (variously defined) are met <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. The result is a classification or a regression tree: a hierarchical series of data bifurcations that depicts the partition definitions and describes the resulting data subsets defined by each partition. To address concerns about possible over-fitting of models to the data, we used 10-fold cross-validation and pruned trees to the shortest within 1-<it>SE </it>of the best tree.</p>
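					<p>The 1-<it>SE </it>pruning rule can be illustrated with a short sketch. We interpret "shortest" as the subtree with the fewest leaves; the tuple layout and the cross-validation values below are made up for illustration and are not results from this study.</p>

```python
def one_se_choice(candidates):
    """Pick the smallest tree whose CV error is within 1 SE of the best.

    candidates: list of (n_leaves, cv_error, cv_standard_error) tuples,
        one per pruned subtree.
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if threshold >= c[1]]
    return min(eligible, key=lambda c: c[0])


# Made-up cross-validation results, for illustration only.
pruning_sequence = [
    (1, 0.500, 0.010),
    (2, 0.300, 0.012),
    (5, 0.295, 0.012),
    (9, 0.290, 0.013),
]
print(one_se_choice(pruning_sequence))
```

					<p>In this toy sequence the nine-leaf tree has the lowest cross-validated error, but the two-leaf tree lies within one standard error of it and is therefore selected, which is how a single-split tree can emerge from pruning.</p>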
					<p>We assessed the significance of our tree-based statistical models through permutation where the predictor variables are randomized with respect to the response variable <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. The frequency of observing a result value equal to or better than the observed value in 1 &#215; 10<sup>4 </sup>permutations is the estimate of the probability associated with the observed result.</p>
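					<p>A generic form of this permutation procedure is sketched below. The statistic shown is a deliberately simple stand-in (majority-class accuracy within each predictor value) rather than a refitted classification tree, and all names are hypothetical. Shuffling the response while holding the predictors fixed is equivalent to randomizing the predictors with respect to the response.</p>

```python
import random


def permutation_p_value(predictors, response, statistic, n_permutations=10000, seed=0):
    """Estimate P(statistic >= observed) when the response is shuffled.

    statistic: callable taking (predictors, response) and returning a
        number for which larger values mean a better model.
    """
    rng = random.Random(seed)
    observed = statistic(predictors, response)
    shuffled = list(response)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(predictors, shuffled) >= observed:
            at_least_as_good += 1
    return at_least_as_good / n_permutations


def majority_rule_accuracy(predictors, response):
    """Toy statistic: accuracy of predicting the majority class per predictor value."""
    by_value = {}
    for x, y in zip(predictors, response):
        by_value.setdefault(x, []).append(y)
    correct = sum(max(ys.count(label) for label in set(ys)) for ys in by_value.values())
    return correct / len(response)
```

					<p>With 1 &#215; 10<sup>4 </sup>permutations, the smallest non-zero probability that can be estimated this way is 1 &#215; 10<sup>-4</sup>.</p>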
				</sec>
				<sec>
					<st>
						<p>Random forests</p>
					</st>
					<p>If one tree-based statistical model is good, then an ensemble (forest) of appropriately constructed tree models should be even better, which is the principal idea of random forests. A random forest attempts to improve upon a simple tree-based statistical model by generating a collection of such models and using them in aggregate <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B32">32</abbr></abbrgrp>. Each model in a random forest is generated from a bootstrap sample of the original dataset, and at each node in each model the search for the best possible split is restricted to a subset of the predictor variables selected at random. These randomization steps decrease prediction error through variance reduction resulting from averaging and by decreasing the correlation between individual models in the ensemble <abbrgrp><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp>. Each of our random forest analyses comprised 1 &#215; 10<sup>4 </sup>individual models constructed by sub-sampling seven predictor variables at each node.</p>
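					<p>The two randomization steps, and the aggregation of tree predictions, can be sketched as below. This is not the FORTRAN implementation used in the study and it does not grow any trees; it only shows the sampling and voting. The variable names follow the column headings of the additional data files, and treating position 0 as excluded from the predictors is our assumption.</p>

```python
import random

N_TREES = 10000   # number of trees per forest reported in the text
MTRY = 7          # predictor variables sampled at each node


def bootstrap_indices(n_observations, rng):
    """Sample n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_observations) for _ in range(n_observations)]


def candidate_variables(variable_names, mtry, rng):
    """Choose, without replacement, the variables searched at one node."""
    return rng.sample(variable_names, mtry)


def majority_vote(votes):
    """Aggregate per-tree class predictions into the forest prediction."""
    return max(set(votes), key=votes.count)


# 40 flanking positions plus cp, fe and dfe.
variables = [str(p) for p in range(-20, 21) if p != 0] + ["cp", "fe", "dfe"]

rng = random.Random(0)
rows = bootstrap_indices(2694, rng)
node_vars = candidate_variables(variables, MTRY, rng)
print(len(rows), len(set(rows)), node_vars)
print(majority_vote(["+", "-", "+"]))
```

					<p>Because a bootstrap sample is drawn with replacement, each tree sees only a subset of the distinct observations; the observations left out of a given tree are what allow a random forest to estimate its error without a separate test set.</p>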
					<p>Several model summary statistics were calculated, including sensitivity, which is the proportion of observations correctly identified as edited, specificity, which is the proportion of observations correctly identified as non-edited, and accuracy, which is the total proportion of observations correctly identified. More formally, these definitions are:</p>
					<p><it>sensitivity = true positives/</it>(<it>true positives </it>+ <it>false negatives</it>);</p>
					<p><it>specificity = true negatives/</it>(<it>true negatives </it>+ <it>false positives</it>); and</p>
					<p><it>accuracy </it>= (<it>true positives </it>+ <it>true negatives</it>)/<it>total</it>.</p>
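					<p>These three definitions translate directly into code. The confusion-matrix counts in the sketch below are hypothetical and are not taken from the results of this study.</p>

```python
def sensitivity(tp, fn):
    """true positives / (true positives + false negatives)"""
    return tp / (tp + fn)


def specificity(tn, fp):
    """true negatives / (true negatives + false positives)"""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """(true positives + true negatives) / total"""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts, not taken from the paper.
tp, fn, tn, fp = 80, 20, 90, 10
print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))
```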
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>MPC conceived, designed and coordinated the study. DSM carried out the programming and statistical analyses. Both authors wrote and approved the final manuscript.</p>
			<suppl id="S1">
				<title>
					<p>Additional File 1</p>
				</title>
				<text>
					<p><it>Arabidopsis thaliana </it><b>data file </b>File is plain text, space delimited. First row is column headings with variable names: edit (+, site is edited; -, site is not edited); -20 through 20, nucleotide position relative to edited site; cp, codon position; fe, estimated folding energy; dfe, difference in estimated folding energy between pre-edited and edited sequences; and loc, location of focus site (position 0) in GenBank file. Each subsequent line represents an observation.</p>
				</text>
				<file name="1471-2105-5-132-S1.txt">
					<p>Click here for file</p>
				</file>
			</suppl>
			<suppl id="S2">
				<title>
					<p>Additional File 2</p>
				</title>
				<text>
					<p><it>Brassica napus </it><b>data file </b>File is plain text, space delimited. The first row contains column headings with the variable names: edit (+, site is edited; -, site is not edited); -20 through 20, nucleotide positions relative to the edited site; cp, codon position; fe, estimated folding energy; dfe, difference in estimated folding energy between pre-edited and edited sequences; and loc, location of the focus site (position 0) in the GenBank file. Each subsequent line represents an observation.</p>
				</text>
				<file name="1471-2105-5-132-S2.txt">
					<p>Click here for file</p>
				</file>
			</suppl>
			<suppl id="S3">
				<title>
					<p>Additional File 3</p>
				</title>
				<text>
					<p><it>Oryza sativa </it><b>data file </b>File is plain text, space delimited. The first row contains column headings with the variable names: edit (+, site is edited; -, site is not edited); -20 through 20, nucleotide positions relative to the edited site; cp, codon position; fe, estimated folding energy; dfe, difference in estimated folding energy between pre-edited and edited sequences; and loc, location of the focus site (position 0) in the GenBank file. Each subsequent line represents an observation.</p>
				</text>
				<file name="1471-2105-5-132-S3.txt">
					<p>Click here for file</p>
				</file>
			</suppl>
			<suppl id="S4">
				<title>
					<p>Additional File 4</p>
				</title>
				<text>
					<p><b>Combined data file </b>File is plain text, space delimited. The first row contains column headings with the variable names: edit (+, site is edited; -, site is not edited); -20 through 20, nucleotide positions relative to the edited site; cp, codon position; fe, estimated folding energy; dfe, difference in estimated folding energy between pre-edited and edited sequences; and loc, location of the focus site (position 0) in the GenBank file. Each subsequent line represents an observation.</p>
				</text>
				<file name="1471-2105-5-132-S4.txt">
					<p>Click here for file</p>
				</file>
			</suppl>
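			<p>The layout shared by the additional files can be read with a short Python sketch such as the one below. It relies only on the column names given in the file descriptions. The function names are hypothetical, and the sketch assumes that fields are unquoted, that every row has as many fields as the header, and that fe and dfe hold plain decimal numbers.</p>

```python
def is_position(name):
    """True for column names that parse as integers, such as '-20' or '20'."""
    try:
        int(name)
    except ValueError:
        return False
    return True

def read_sites(lines):
    """Parse a whitespace-delimited table whose first row holds variable names.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    rows = (line.split() for line in lines if line.strip())  # skip blank lines
    header = next(rows)
    positions = [name for name in header if is_position(name)]
    sites = []
    for fields in rows:
        if len(fields) != len(header):
            raise ValueError("row and header differ in number of fields")
        record = dict(zip(header, fields))
        sites.append({
            "edited": record["edit"] == "+",          # + edited, - not edited
            "flank": {int(p): record[p] for p in positions},
            "cp": record["cp"],                       # codon position, as text
            "fe": float(record["fe"]),                # assumed to be numeric
            "dfe": float(record["dfe"]),              # assumed to be numeric
            "loc": record["loc"],                     # kept as text
        })
    return sites
```

			<p>With a file opened in text mode, for instance <it>open("1471-2105-5-132-S1.txt")</it>, the handle can be passed directly to <it>read_sites</it>, which returns one dictionary per observation.</p>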
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We thank AL Bazinet and MC Neel for comments on the manuscript.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>RNA editing in plant organelles: A fertile field</p>
				</title>
				<aug>
					<au>
						<snm>Gray</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1996</pubdate>
				<volume>93</volume>
				<fpage>8157</fpage>
				<lpage>8159</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">38639</pubid>
						<pubid idtype="pmpid" link="fulltext">8710840</pubid>
						<pubid idtype="doi">10.1073/pnas.93.16.8157</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>RNA editing in plant mitochondria and chloroplasts</p>
				</title>
				<aug>
					<au>
						<snm>Maier</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Zeltz</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>K&#246;ssel</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Bonnard</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Gualberto</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Grienenberger</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Plant Mol Biol</source>
				<pubdate>1996</pubdate>
				<volume>32</volume>
				<issue>1&#8211;2</issue>
				<fpage>343</fpage>
				<lpage>365</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8980487</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>A guide to RNA editing</p>
				</title>
				<aug>
					<au>
						<snm>Smith</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Gott</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hanson</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>RNA</source>
				<pubdate>1997</pubdate>
				<volume>3</volume>
				<issue>10</issue>
				<fpage>1105</fpage>
				<lpage>1123</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9326486</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Diversity and evolution of mitochondrial RNA editing systems</p>
				</title>
				<aug>
					<au>
						<snm>Gray</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>IUBMB Life</source>
				<pubdate>2003</pubdate>
				<volume>55</volume>
				<issue>4&#8211;5</issue>
				<fpage>227</fpage>
				<lpage>233</lpage>
				<xrefbib>
					<pubid idtype="pmpid">12880203</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>RNA editing in plant mitochondria</p>
				</title>
				<aug>
					<au>
						<snm>Hiesel</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Wissinger</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Schuster</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Brennicke</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1989</pubdate>
				<volume>246</volume>
				<fpage>1632</fpage>
				<lpage>1634</lpage>
				<xrefbib>
					<pubid idtype="pmpid">2480644</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Evidence for RNA editing in mitochondria of all major groups of land plants except the Bryophyta</p>
				</title>
				<aug>
					<au>
						<snm>Hiesel</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Combettes</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Brennicke</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1994</pubdate>
				<volume>91</volume>
				<issue>2</issue>
				<fpage>629</fpage>
				<lpage>633</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">43002</pubid>
						<pubid idtype="pmpid" link="fulltext">8290575</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>RNA editing in bryophytes and a molecular phylogeny of land plants</p>
				</title>
				<aug>
					<au>
						<snm>Malek</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Lattig</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Hiesel</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Brennicke</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Knoop</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<source>EMBO J</source>
				<pubdate>1996</pubdate>
				<volume>15</volume>
				<fpage>1403</fpage>
				<lpage>1411</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8635473</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Occurrence of plastid RNA editing in all major lineages of land plants</p>
				</title>
				<aug>
					<au>
						<snm>Freyer</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Kiefer-Meyer</snm>
						<fnm>MC</fnm>
					</au>
					<au>
						<snm>K&#246;ssel</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1997</pubdate>
				<volume>94</volume>
				<fpage>6285</fpage>
				<lpage>6290</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">21041</pubid>
						<pubid idtype="pmpid" link="fulltext">9177209</pubid>
						<pubid idtype="doi">10.1073/pnas.94.12.6285</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>RNA editing in <it>Arabidopsis </it>effects 441 C to U changes in ORFs</p>
				</title>
				<aug>
					<au>
						<snm>Gieg&#233;</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Brennicke</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1999</pubdate>
				<volume>96</volume>
				<issue>26</issue>
				<fpage>15324</fpage>
				<lpage>15329</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">24818</pubid>
						<pubid idtype="pmpid" link="fulltext">10611383</pubid>
						<pubid idtype="doi">10.1073/pnas.96.26.15324</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Creation of a novel protein-coding region at the RNA level in black pine chloroplasts: The pattern of RNA editing in the gymnosperm chloroplast is different from that in angiosperms</p>
				</title>
				<aug>
					<au>
						<snm>Wakasugi</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hirose</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Tsudzuki</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>K&#246;ssel</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Sugiura</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1996</pubdate>
				<volume>93</volume>
				<fpage>8766</fpage>
				<lpage>8770</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">38748</pubid>
						<pubid idtype="pmpid" link="fulltext">8710946</pubid>
						<pubid idtype="doi">10.1073/pnas.93.16.8766</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>RNA editing in wheat mitochondria results in the conservation of protein sequences</p>
				</title>
				<aug>
					<au>
						<snm>Gualberto</snm>
						<fnm>JM</fnm>
					</au>
					<au>
						<snm>Lamattina</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Bonnard</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Weil</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Grienenberger</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>1989</pubdate>
				<volume>341</volume>
				<fpage>660</fpage>
				<lpage>666</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/341660a0</pubid>
						<pubid idtype="pmpid">2552325</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>RNA editing in plant mitochondria and chloroplasts</p>
				</title>
				<aug>
					<au>
						<snm>Gray</snm>
						<fnm>MW</fnm>
					</au>
					<au>
						<snm>Covello</snm>
						<fnm>PS</fnm>
					</au>
				</aug>
				<source>FASEB J</source>
				<pubdate>1993</pubdate>
				<volume>7</volume>
				<fpage>64</fpage>
				<lpage>71</lpage>
			</bibl>
			<bibl id="B13">
				<title>
					<p>RNA editing status of <it>nad7 </it>intron domains in wheat mitochondria</p>
				</title>
				<aug>
					<au>
						<snm>Carrillo</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bonen</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>1997</pubdate>
				<volume>25</volume>
				<issue>2</issue>
				<fpage>403</fpage>
				<lpage>409</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">146442</pubid>
						<pubid idtype="pmpid" link="fulltext">9016571</pubid>
						<pubid idtype="doi">10.1093/nar/25.2.403</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>RNA editing in plant mitochondria: &#945;-phosphate is retained during C-to-U conversion in mRNAs</p>
				</title>
				<aug>
					<au>
						<snm>Rajasekhar</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Mulligan</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Plant Cell</source>
				<pubdate>1993</pubdate>
				<volume>5</volume>
				<fpage>1843</fpage>
				<lpage>1852</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">160409</pubid>
						<pubid idtype="pmpid">12271058</pubid>
						<pubid idtype="doi">10.1105/tpc.5.12.1843</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>RNA editing in wheat mitochondria proceeds by a deamination mechanism</p>
				</title>
				<aug>
					<au>
						<snm>Blanc</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Litvak</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Araya</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>FEBS Letters</source>
				<pubdate>1995</pubdate>
				<volume>373</volume>
				<fpage>56</fpage>
				<lpage>60</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0014-5793(95)00991-H</pubid>
						<pubid idtype="pmpid" link="fulltext">7589434</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>RNA editing in higher plant mitochondria: analysis of biochemistry and specificity</p>
				</title>
				<aug>
					<au>
						<snm>Yu</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Fester</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Block</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Schuster</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Biochimie</source>
				<pubdate>1995</pubdate>
				<volume>77</volume>
				<fpage>79</fpage>
				<lpage>86</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1016/0300-9084(96)88108-9</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>RNA editing in wheat mitochondria</p>
				</title>
				<aug>
					<au>
						<snm>Araya</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Blanc</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Begu</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Crabier</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Mouras</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Litvak</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Biochimie</source>
				<pubdate>1995</pubdate>
				<volume>77</volume>
				<fpage>87</fpage>
				<lpage>91</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1016/0300-9084(96)88109-0</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Evidence for a site-specific cytidine deamination reaction involved in C to U RNA editing of plant mitochondria</p>
				</title>
				<aug>
					<au>
						<snm>Yu</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Schuster</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>J Biol Chem</source>
				<pubdate>1995</pubdate>
				<volume>270</volume>
				<issue>31</issue>
				<fpage>18227</fpage>
				<lpage>18233</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1074/jbc.270.31.18227</pubid>
						<pubid idtype="pmpid" link="fulltext">7629140</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>RNA editing site recognition in higher plant mitochondria</p>
				</title>
				<aug>
					<au>
						<snm>Mulligan</snm>
						<fnm>RM</fnm>
					</au>
					<au>
						<snm>Williams</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Shanahan</snm>
						<fnm>MT</fnm>
					</au>
				</aug>
				<source>J Heredity</source>
				<pubdate>1999</pubdate>
				<volume>90</volume>
				<issue>3</issue>
				<fpage>338</fpage>
				<lpage>344</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1093/jhered/90.3.338</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Cross-competition in transgenic chloroplasts expressing single editing sites reveals shared <it>cis </it>elements</p>
				</title>
				<aug>
					<au>
						<snm>Chateigner-Boutin</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Hanson</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Mol Cell Biol</source>
				<pubdate>2002</pubdate>
				<volume>22</volume>
				<issue>24</issue>
				<fpage>8448</fpage>
				<lpage>8456</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">139884</pubid>
						<pubid idtype="pmpid" link="fulltext">12446765</pubid>
						<pubid idtype="doi">10.1128/MCB.22.24.8448-8456.2002</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p><it>cis </it>recognition elements in plant mitochondrion RNA editing</p>
				</title>
				<aug>
					<au>
						<snm>Farr&#233;</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Leon</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Jordana</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Araya</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Mol Cell Biol</source>
				<pubdate>2001</pubdate>
				<volume>21</volume>
				<issue>20</issue>
				<fpage>6731</fpage>
				<lpage>6737</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">99851</pubid>
						<pubid idtype="pmpid" link="fulltext">11564858</pubid>
						<pubid idtype="doi">10.1128/MCB.21.20.6731-6737.2001</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Editing site recognition in plant mitochondria: the importance of 5'-flanking sequences</p>
				</title>
				<aug>
					<au>
						<snm>Williams</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Kutcher</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Mulligan</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Plant Mol Biol</source>
				<pubdate>1998</pubdate>
				<volume>36</volume>
				<issue>2</issue>
				<fpage>229</fpage>
				<lpage>237</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1023/A:1005961718612</pubid>
						<pubid idtype="pmpid" link="fulltext">9484435</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Relating genotype to phenotype: analysis of peptide binding data</p>
				</title>
				<aug>
					<au>
						<snm>Segal</snm>
						<fnm>MR</fnm>
					</au>
					<au>
						<snm>Cummings</snm>
						<fnm>MP</fnm>
					</au>
					<au>
						<snm>Hubbard</snm>
						<fnm>AE</fnm>
					</au>
				</aug>
				<source>Biometrics</source>
				<pubdate>2001</pubdate>
				<volume>57</volume>
				<fpage>632</fpage>
				<lpage>643</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1111/j.0006-341X.2001.00632.x</pubid>
						<pubid idtype="pmpid">11414594</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Random forests &#8211; random features</p>
				</title>
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Tech Rep 567, Department of Statistics, University of California</source>
				<pubdate>2001</pubdate>
			</bibl>
			<bibl id="B25">
				<title>
					<p>GenBank</p>
				</title>
				<aug>
					<au>
						<snm>Benson</snm>
						<fnm>DA</fnm>
					</au>
					<au>
						<snm>Karsch-Mizrachi</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Lipman</snm>
						<fnm>DJ</fnm>
					</au>
					<au>
						<snm>Ostell</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Wheeler</snm>
						<fnm>DL</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<volume>32</volume>
				<fpage>D23</fpage>
				<lpage>D26</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1093/nar/gkh045</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>The complete nucleotide sequence and RNA editing content of the mitochondrial genome of rapeseed (<it>Brassica napus </it>L.): comparative analysis of the mitochondrial genomes of rapeseed and <it>Arabidopsis thaliana</it></p>
				</title>
				<aug>
					<au>
						<snm>Handa</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<issue>20</issue>
				<fpage>5907</fpage>
				<lpage>5916</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">219474</pubid>
						<pubid idtype="pmpid" link="fulltext">14530439</pubid>
						<pubid idtype="doi">10.1093/nar/gkg795</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>The complete sequence of the rice (<it>Oryza sativa </it>L.) mitochondrial genome: frequent DNA sequence acquisition and loss during the evolution of flowering plants</p>
				</title>
				<aug>
					<au>
						<snm>Notsu</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Masood</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Nishikawa</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Kubo</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Akiduki</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Nakazono</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hirai</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Kadowaki</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<source>Mol Genet Genomics</source>
				<pubdate>2002</pubdate>
				<volume>268</volume>
				<issue>4</issue>
				<fpage>434</fpage>
				<lpage>445</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1007/s00438-002-0767-1</pubid>
						<pubid idtype="pmpid" link="fulltext">12471441</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide</p>
				</title>
				<aug>
					<au>
						<snm>Zuker</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mathews</snm>
						<fnm>DH</fnm>
					</au>
					<au>
						<snm>Turner</snm>
						<fnm>DH</fnm>
					</au>
				</aug>
				<source>In RNA Biochemistry and Biotechnology, no. 70 in NATO Science Partnership Sub-Series 3: High Technology, Dordrecht</source>
				<publisher>The Netherlands: Kluwer Academic Publishers</publisher>
				<pubdate>1999</pubdate>
				<fpage>11</fpage>
				<lpage>43</lpage>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Expanded sequence dependence of thermodynamic parameters provides robust prediction of RNA secondary structure</p>
				</title>
				<aug>
					<au>
						<snm>Mathews</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Sabina</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Zuker</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Turner</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1999</pubdate>
				<volume>288</volume>
				<fpage>910</fpage>
				<lpage>940</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1006/jmbi.1999.2700</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>R: a language for data analysis and graphics</p>
				</title>
				<aug>
					<au>
						<snm>Ihaka</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Gentleman</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>J Comput Graph Stat</source>
				<pubdate>1996</pubdate>
				<volume>5</volume>
				<fpage>299</fpage>
				<lpage>314</lpage>
			</bibl>
			<bibl id="B31">
				<title>
					<p>An introduction to recursive partitioning using the RPART routines</p>
				</title>
				<aug>
					<au>
						<snm>Therneau</snm>
						<fnm>TM</fnm>
					</au>
					<au>
						<snm>Atkinson</snm>
						<fnm>EJ</fnm>
					</au>
				</aug>
				<source>Tech Rep Mayo Foundation</source>
				<pubdate>1997</pubdate>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Random Forests</p>
				</title>
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>2001</pubdate>
				<volume>45</volume>
				<fpage>5</fpage>
				<lpage>32</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1023/A:1010933404324</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>LA</fnm>
					</au>
					<au>
						<snm>Pregibon</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Statistical Models in S</source>
				<publisher>London: Chapman and Hall</publisher>
				<pubdate>1993</pubdate>
			</bibl>
			<bibl id="B34">
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Friedman</snm>
						<fnm>JH</fnm>
					</au>
					<au>
						<snm>Olshen</snm>
						<fnm>RA</fnm>
					</au>
					<au>
						<snm>Stone</snm>
						<fnm>CJ</fnm>
					</au>
				</aug>
				<source>Classification and Regression Trees</source>
				<publisher>Pacific Grove, CA: Wadsworth and Brooks</publisher>
				<pubdate>1984</pubdate>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Applying permutation tests to tree-based statistical models: extending the R package rpart</p>
				</title>
				<aug>
					<au>
						<snm>Cummings</snm>
						<fnm>MP</fnm>
					</au>
					<au>
						<snm>Myers</snm>
						<fnm>DS</fnm>
					</au>
					<au>
						<snm>Mangelson</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Tech Rep CS-TR-4581, UMIACS-TR-2004-24, Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland</source>
				<pubdate>2004</pubdate>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Bagging predictors</p>
				</title>
				<aug>
					<au>
						<snm>Breiman</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Mach Learn</source>
				<pubdate>1996</pubdate>
				<volume>24</volume>
				<fpage>123</fpage>
				<lpage>140</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1023/A:1018054314350</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<aug>
					<au>
						<snm>Hastie</snm>
						<fnm>TJ</fnm>
					</au>
					<au>
						<snm>Tibshirani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Friedman</snm>
						<fnm>JH</fnm>
					</au>
				</aug>
				<source>The Elements of Statistical Learning</source>
				<publisher>New York: Springer</publisher>
				<pubdate>2001</pubdate>
			</bibl>
		</refgrp>
	</bm>
</art>
