Abstract
Background
Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of C_{β }atoms in other residues within a sphere around the C_{β }atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence.
Results
We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSIBLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either "contacted" or "noncontacted", the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds.
Conclusion
The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and Bfactor profile, suggests that a support vector regression approach may be very useful for determining the structurefunction relation between primary protein sequence and higher order consecutive protein structural and functional properties.
Background
Prediction of protein threedimensional structure from primary sequence is the central problem in structural bioinformatics. One approach is to use known structur∈homolog proteins as templates to determine the tertiary structures of novel proteins of unknown structure. Approaches include comparative modelling, threading and fold recognition methods. One protein structural feature is of particular interest here, namely, residue contact number (CN) which can be used to enhance protein fold recognition [1]. This measure has also been regarded as the conserved solvent exposure descriptor of similar folds without a common evolutionary origin [2]. Furthermore, contact number may be used to determine the energy function allowing molecular dynamics simulations of protein structures [3]. Here, we seek to use protein contact number to assist with the tertiary fold prediction of novel proteins for which an accurate functional relationship between a protein's primary sequence and its residues' contact numbers must be determined. To fulfil the task, we use a new method, the socalled support vector regression, to approximate the sequencecontact number relationship. We demonstrate that, as a result, we achieve more accurate predicted contact numbers than have been achieved to date.
The contact number, or coordination number, of a given residue of a folded protein is defined as the number of C_{β }(or C_{α}) atoms in other residues within a sphere around the C_{β }(or C_{α}) atom of that given residue. Previous approaches to the prediction of protein contact number fall into two categories: classification and regression. In the classification approach, residue contact numbers were first classified into two populations allowing a subsequent use of machine learning methods such as recurrent neural networks to perform predictions [4,5]. Unfortunately, decomposing contact numbers into two states via an arbitrary threshold oversimplifies the problem and much original information is lost. In contrast, the regression approach provides a direct and more accurate way to determine a functional relationship matching contact numbers and protein sequence and thus to provide more accurate contact number predictions. A recent study of Kinjo et al. [3], followed this approach but used a simple linear regression scheme to determine the functional relationship. They reported that the predicted and observe contact numbers had a correlation coefficient (CC) of 0.627. However, most functions in nature are nonlinear and cannot be accurately approximated by linear formulas. Under the reasonable expectation that the sequencecontact number is indeed nonlinear, we use a more complicated machine learning method to determine the relationship and expect thereby to obtain more accurate predictions. In particular, we adopt a support vector regression (SVR) algorithm fully capable of determining a nonlinear sequencecontact number relationship.
In our former work, we studied the dependence of protein accessible surface area (ASA) [6,7] and Bfactor [8] on primary sequence. These works established that ASAs can be predicted and match observed values with a correlation coefficient of 0.69, while Bfactors can be predicted and match observed values with a correlation coefficient of 0.53. These approaches established that multiple sequence inputs outperform single sequence inputs significantly. The importance of using multiple sequences was also observed in prior predictions of contact numbers [3]. Thus, in this present work, we focus on multiple sequence inputs. For completeness, we examine a range of different definitions of contact number ("consecutive" and "discrete"), and also examine whether including further information such as protein molecular weight and amino acid composition allows improved predictions. As a result, we are able to make predictions which match observed values with a correlation coefficient of 0.73, a significant improvement on earlier studies.
Results
Contact numbers according to different r_{d }values
We give 8 definitions of contact number and show their CN distributions in Fig. 1. For each definition, we compute the mean and standard deviation (Table 1). For the same radius cutoff r_{d}, the discrete and consecutive definitions have very similar distributions with nearly the same mean and standard deviation. Their correlations are greater than 0.99 for all values of r_{d }(8 Å, 10 Å, 12 Å and 14 Å). The contact numbers defined by different radius cutoffs have CCs greater than 0.83. Distributions with larger radius cutoffs more closely approximate normal distributions as their lefthand tails are almost all present. Since absolute contact numbers are normalized by a linear transformation (Equation 3), the general characteristics of their distributions will still be kept even after the normalization.
Figure 1. Contact number distributions according to different definitions. The radius cutoffs are selected as 8 Å, 10 Å, 12 Å and 14 Å, represented by dotted, slashed, solid and dotandslashed lines, respectively. A is for discrete contact number while B is for consecutive contact number.
Table 1. The mean (
) and standard deviation (SD) of contact numbers according to different radius (r_{d}) cutoffs. All results are expressed as ( , SD).To study the relationship between CN and ASA of a residue, we obtained the ASA for each residue in the 945 proteins using the DSSP program [9]. Using discrete definition of CN with a radius cutoff of 12 Å, we calculate the mean and standard deviation of ASA for each contact number and show the results in Fig. 2. A strong negative correlation between the two solvent exposure descriptors can be observed as indicated by a correlation coefficient of 0.734.
Figure 2. The accessible surface area as a function of contact number. Discrete contact numbers are used with a radius cutoff of 12 Å. Error bars represent the standard deviations.
Estimating the sequencecontact number relationship
When we train the SVR algorithm, normalized CNs are used instead of absolute CNs because the normalized values are always located between 3.0 and 3.0 for all radius cutoffs. Therefore, in all cases, the same set of SVR learning control parameters can be applied. Among the three groups of proteins (each has 315 chains), we estimate the sequencecontact number function in turn using one group and examine the estimated function using the remaining two groups. The correlation coefficient and root mean square error (RMSE) are computed for each test group, and their averages are shown in Table 2.
Table 2. Correlation coefficient (CC) and root mean square error (RMSE) for different contact number predictions. All results are expressed as mean ± standard deviation.
For all radius cutoffs, predictions using consecutive contact numbers are slightly better than predictions using their discrete counterparts. However, when the larger thresholds (e.g. 12 Å and 14 Å) are used, the accuracy difference decreases to insignificance. Previous work has shown that CNs with a radius cutoff of 12 Å or 14 Å are more useful for protein fold recognition [1]. Likewise, in this work, we also find that these cutoffs give better predictions. But, compared with the discrete contact numbers, the consecutive contact numbers give only a very slight improvement in predictions. The best accuracies are for consecutive contact numbers with thresholds of 12 Å and 14 Å, in which case the correlation coefficient between predicted and observed values can reach 0.70. If we convert the normalized contact numbers to their original absolute ones, the RMSE of 0.72 is equal to an actual error of 7.5 for a threshold of 12 Å, and an actual error of 11.1 for a threshold of 14 Å.
We also calculate the CC and RMSE for each individual protein using discrete contact numbers with a radius cutoff of 12 Å. The average CC and RMSE are then 0.67 and 7.31, respectively. More than half of the proteins are predicted with CCs greater than 0.70, and more than half are predicted with RMSEs less than 6.93. To illustrate the meaning of the CC and RMSE measures in this study, two comparisons of predicted and observed values are given in Figure 3. This figure shows the better agreement between the predicted and observed values in GP130 (PDB: 1bj8) as the CC is 0.75 and the RMSE is 6.07. In contrast, the prediction for human chorionic gonadotropin (PDB: 1dz7, chain A) yields a correlation coefficient of 0.58 and a RMSE of 9.73 with the region between position 40 and 50 being worst predicted.
Figure 3. The predicted and observed contact numbers for proteins GP130 (PDB: 1bj8) and Human chorionic gonadotropin (PDB: 1dz7, chain A). Discrete contact numbers are used with a radius cutoff of 12 Å. Observed and predicted contact numbers are represented by solid and dashed lines, respectively. A) GP 130 is predicted with a correlation coefficient of 0.75 and a rootmeansquar∈error of 6.07; B) Human chorionic gonadotropin is predicted with a correlation coefficient of 0.58 and a root mean square error of 9.73.
Sequence weight, as a feature, can improve prediction accuracy significantly
Prediction accuracy can be improved by taking account of protein size as measured by its weight. Given a sequence, the weight is the sum of individual weights of consisting amino acids. We use discrete contact numbers (radius cutoff = 12 Å) and divide all proteins into three groups with equal number of proteins, according to their weights. For the three groups, the weights (mean ± standard deviation) are 10611 ± 1637, 17421 ± 2666 and 40073 ± 5952 Daltons. Their average correlation coefficient between predicted and observed contact numbers are 0.61, 0.67 and 0.72, while their average RMSEs are 7.49, 7.09 and 7.35, respectively. These results suggest that smaller molecules are worst predicted.
To consider this effect, we calculate weight for each protein sequence and include this data as an additional input to the machine learning algorithm and rerun the training and testing procedures. Additional information, that of protein amino acid composition, was also included as an input, either individually or together with sequence weight data. For the separate groups of small chains, median chains, large chains and all chains, we calculate the mean and median values of the correlation coefficient between predicted and observed values of contact number, and their RMSEs, according to each set of different input information: local sequence ("LS"), local sequence plus sequence weight ("LS+W"), local sequence plus amino acid composition ("LS+AA") and local sequence plus sequence weight and amino acid composition ("LS+W+AA"). All results are shown in Table 3.
Table 3. Correlation coefficients (CCs) and root mean square errors (RMSEs) for individual proteins in different weight groups. The results are given as (mean, median).
For all cases, it was determined that amino acid composition information can improve prediction performance. However, sequence weight can give yet more significant improvements. For example, in the group of small molecules, data about amino acid composition can increase the CC mean to 0.62 and decrease the RMSE mean to 7.24, while sequence weight data can increase the CC mean to 0.64 and decrease the RMSE mean to 6.68. When information about both sequence weight and amino acid composition is used together, we find still further improvement compared with using each data individually, although this may not be reflected by all measures. Particularly, the difference between "LS+W" and "LS+W+AA" is very minor. However, all the results clearly show that sequence weight is more important than amino acid composition in the prediction of contact numbers.
Fig. 4 gives the overall distributions of CCs and RMSEs for 945 proteins. Compared with the CC, the RMSE is more sensitive in that its distribution more clearly reflects the improvement, while the distributions of "LS+W" and "LS+W+AA" are nearly identical. For all cases, the peak values of CC and RMSE are around 0.70 and 6.0, respectively.
Figure 4. Distributions of correlation coefficients and root mean square errors given different input information. Discrete definition is used with a radius cutoff of 12 Å. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, slashed, dotandslashed and solid lines, respectively. A is for correlation coefficients while B is for root mean square errors.
In addition to the above analyses based on individual proteins, we also measure the accuracies on protein residues in the whole data set. We calculate CCs and normalized RMSEs for six testing groups, and express them as mean ± standard deviation, as given in Table 4. The CCs for "LS", "LS+W", "LS+AA" and "LS+W+AA" are 0.69, 0.733, 0.708 and 0.734, respectively. The normalized RMSEs are 0.72, 0.68, 0.71 and 0.68, respectively. Therefore, an obvious improvement can be found even if we measure it at the residue level.
Table 4. Correlation coefficients (CCs) and root mean square errors (RMSEs) are calculated for all residues when performing support vector regression algorithm using different input information.
To measure the prediction performance for residues with different contact numbers, we compute the absolute errors for the residue with contact numbers from 0 to 60. The mean absolute errors for certain contact numbers are shown in Fig. 5, partitioned according to the four information inputs. Clearly, "LS+W+AA" gives the least absolute errors and therefore perform the best. Residues with about 20 contacting C_{β }atoms are predicted with the least mean absolute error (4.1) and are the best predicted. This is because they have the largest number of samples in the dataset. Greater errors are found at each tailend of the distribution corresponding to the residues with smaller or greater contact numbers. This is due to the small number of data points in each tail fed into support vector machines, and their representation is not adequate. Furthermore, this can be used to explain why the region from residue 40 to residue 50 of protein 1dz7 in Fig 3B is the worst predicted. This region mostly contains exposed residues with smaller contact numbers and thus, the residues cannot be well predicted by our method. An improvement on this part may be achieved by using other predicted features such as accessible surface area.
Figure 5. The mean absolute errors for residues of different contact numbers. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, slashed, dotandslashed and solid lines, respectively.
Examining performance by formulating regression as a twoclass problem
CN prediction has previously been examined as a twoclass classification problem through use of a threshold with the accuracy being defined as the percentage of the correctly predicted residues on the overall residues [35]. However, this is not a good measure because the accuracy is susceptible to changes in the selected threshold that splits the data set. If the data set is heavily unbalanced, accuracy is always very high [10]. In this study, we use a different measure, and adopt the least (worst) prediction accuracy for each case to reflect its performance. Table 5 gives the least twoclass prediction accuracies for eight definitions of contact number when using only local sequence information. Note that all the accuracies are the average of six tests. All accuracies are found to be greater than 74%, and in particular, when r_{d }= 12 Å and a consecutive contact number definition is adopted, the least prediction accuracy is around 76.1%, which is comparable with the accuracy 76.3% recently reported based on choosing a particular threshold in linear models also using the same consecutive contact number definition [3].
Table 5. The least prediction accuracy (%) for twoclass problems according to different contact number definitions. Only local sequence information is used.
We use discrete definitions of contact number and let r_{d }= 12 Å. Using all the SVM outputs from six tests, we choose a number of thresholds to classify the data points as being either "contacted" or "noncontacted" and calculate their accuracies. All accuracies are plotted in Fig. 6, according to different information input. The least accuracies for "LS", "LS+W", "LS+AA" and "LS+W+AA" are 75.8%, 77.2%, 76.1% and 77.1%, respectively. Using sequence weight is much better than using amino acid composition.
Figure 6. Prediction accuracies when predictions are formulated as twoclass problems using different contact number thresholds. The four inputs "LS", "LS+W", "LS+AA" and "LS+W+AA" are represented by dotted, slashed, dotandslashed and solid lines, respectively.
Discussion
Protein structural properties such as secondary structure, solvent accessibility and contact number provide valuable information for prediction of protein tertiary structures. How to improve the prediction accuracy of these parameters is still a challenging problem. Following Rost and Sander's pioneering work [11] on how to find a conserved and useful prediction index, Hamelryck [2] examined the conservation of nine solventexposure measures and found that contact number is the most conserved (correlation coefficient 0.72). His study suggested that CN is more suitable for fold recognition than other descriptors such as ASA. However, difficulties in accurately expressing the prediction problem (for example, it was previously framed as a two class problem using an arbitrary threshold) limited its further application. Recent work on contact number [3] formulating the problem for a regression analysis has enhanced studies in this area. From our work here, we confirm the utility of a regression analysis, and more specifically, establish that allowing for nonlinearity via support vector regression allows a more accurate determination of the sequencecontact number relation which further illuminates relationships between protein structural and functional properties and their primary sequence and other features.
Conclusion
We provide a new method for the prediction of protein contact number. Using protein local sequence information generated by multiple sequence alignments, the correlation coefficient between predicted and observed contact numbers can reach 0.70, with normalized root mean square error less than 0.72. The addition of information about sequence weight and amino acid composition as input features can increase the correlation coefficient to 0.734 and decrease the root mean square error to 0.68. This improvement is mainly attributed to the information about sequence weight while the information about amino acid composition only contributes slightly. Moreover, more than half of the proteins are predicted with correlation coefficients greater than 0.71. The prediction accuracies in the twoclass problems, regardless of the cutoff thresholds, are greater than 77.0%. The successful application of SVR approach in this study suggests that it can more accurately describe the relationship between protein contact numbers and primary sequence.
Methods
Residue contact number
We take two definitions of contact number in this study, namely, that of "discrete" and "consecutive" contact number. The "discrete" contact number, N_{d}, is defined by the number of C_{β }atoms on other residues located within a sphere of radius r_{d }centred on the C_{β }atom of the residue of interest. The discrete contact number for ith residue in a sequence with M residues is given by
where r_{i,j }is the distance between the C_{β }atoms of the ith and jth residues which are understood to be separated in sequence by at least two amino acids. Note that is a discrete integer. By replacing the step function σ(r_{i,j}) with a sigmoid function, becomes a real number. This procedure was previously adopted by Kinjo et al. [3] to smooth the discrete contact numbers. A particular sigmoid function is given by
σ(r_{i,j}) = 1/{1 + exp [3(r_{i,j } r_{d})]}. (2)
We have tried four values of r_{d }(8 Å, 10 Å, 12 Å and 14 Å) with discrete and consecutive definitions and thus have 8 combinations all of which will be used in our SVR approach.
Normalization of contact number
The distributions of contact numbers can be approximated by normal distributions, as shown in Fig. 1. With respect to a certain r_{d}, we calculate the mean () and standard deviation (SD). So, the normalized contact number N_{norm }is determined by the following formula:
At the first step, we predict the normalized contact number because 1) it is easy to handle the data, and 2) it is easy to compare the results for different r_{d }thresholds. At the second step, we recover the absolute contact numbers from their predicted normalized values using this equation.
Sequence coding
We predict contact number from protein local sequence. For a given residue, the local sequence contains its Nterminal and Cterminal seven nearestneighbour residues. Thus, the local sequence makes a window of fifteen amino acids. We code each residue in the window using the PSIBLAST positionspecific scoring matrix [12]. The matrices are obtained by querying the input sequence using PSIBLAST against the NCBI nonredundant protein sequence database with three rounds, masking coilcoiled and lowcomplexity regions [13]. The elements in the row of the matrix reflect the probabilities for 20 amino acids occurring at this position. All the elements are divided by 10 for normalization and thus each residue is represented by a 20dimesional vector. Since the residues in coilcoil and lowcomplexity regions do not have meaningful scores, we encode the residue with an orthogonal scheme. In the 20dimensional vector coding a given residue, only the entry representing this type of amino acid is assigned as 0.5 with all other entries set as zeros. To consider the terminal residues, we expend the 20dimensional vector to being 21dimensional for all residues. When the last entry is set as 0.5 and other entries have zeros, it represents a blank residue added to the Nterminal or the Cterminal to make a local sequence of 15residue length. For all other residues, the 21st entries are set to zero. In summery, a residue is coded by a 315dimensional vector.
Support vector regression
To find the function between protein local sequence and normalized contact number, we use ∈insensitive support vector regression (∈SVR) [14,15]. The expected function can be formulated as
f(X_{i}) = 〈W, Φ(X_{i})〉 + b, (4)
where W is the weight and b is the bias. Φ(X_{i}) is a nonlinear function mapping a data point from the input space to the feature space, so consequently, SVR is able to perform nonlinear regression. The goal of the regression is to find the optimal W and b using some optimisation criteria. In εSVR, errors greater than ε are penalized, where two positive variables ξ and ξ* are used to measure the deviation of samples outside the εinsensitive tube. The optimisation problem can be expressed as
where C is the regularization constant that determines the trad∈off between the norm and the error penalty.
The solution of the above problem was given by the authors of ∈SVR [14,15] as follows,
where α_{i }and are Lagrange multipliers. We can replace 〈Φ(X_{i}), Φ(X)〉, the inner product of Φ(X_{i}) and Φ(X), by a kernel function K(X_{i}, X), if K(X_{i}, X) = 〈Φ(X_{i}), Φ(X)〉. The radial basis function are used in our study, as given by
K(X_{i}, X) = exp(γX_{i } X^{2}), (7)
where γ is a parameter to be tuned by the user.
We constantly set ε as 0.01, γ as 0.01 and C as 5.0, because this set of parameters yielded the best performance in our previous work [6,8]. A number of software packages can be used to find the solution such as SVMlight [16].
Dataset preparation and prediction evaluation
To test our approach, we selected 945 unique protein chains, which were previously used for prediction of protein ASA, and were prepared by PDBREPRDB [17]. The structures solved by Xray crystallography were with resolution less than 2.0 Å and with an Rfactor less than 0.2. All chains are at least 60 amino acids or longer, and the pairwise identity is less than 25%. The protein names can be found in the 1 (supplementary material).
Additional File 1. The names of 945 protein chains. The first four characters are their PDB names. The fifth is the chain name and "_" means single chain.
Format: DOC Size: 92KB Download file
This file can be viewed with: Microsoft Word Viewer
The proteins are randomly divided into three groups with each group having 315 chains. Each group is in turn used for training with the remaining two groups used for testing. Therefore, each group is tested twice by the two functions derived from the other groups, and as a result we have six groups of examination results.
Pearson's correlation coefficients and root mean square errors are calculated with respects to all residues and individual proteins. In addition, the absolute errors are calculated for the residues with different contact numbers. In order to compare with previous classification methods, we use different thresholds to classify contact numbers as "contacted" or "noncontacted" and compute the overall accuracy. The accuracy is defined as the ratio between the number of correctly predicted residues and the total number.
Acknowledgements
The author thanks Dr. Rohan Teasdale for kind assistance and Dr. Michael Gagen for critical reading of this article and some helpful suggestions. The work was supported by funds from the Australian Research Council (ARC) and The University of Queensland Early Research Grant. The computer simulations were performed at the High Performance Computing Facility at The University of Queensland.
References

Karchin R, Cline M, Karplus K: Evaluation of local structure alphabets based on residue burial.
Proteins 2004, 55:508518. PubMed Abstract  Publisher Full Text

Hamelryck T: An amino acid has two sides: A new 2D measure provides a different view of solvent exposure.
Proteins 2005, 59:3848. PubMed Abstract  Publisher Full Text

Kinjo AR, Horimoto K, Nishikawa K: Predicting absolute contact numbers of native protein structure from amino acid sequence.
Proteins 2005, 58:158165. PubMed Abstract  Publisher Full Text

Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins.
Proteins 2002, 47:142153. PubMed Abstract  Publisher Full Text

Pollastri G, Baldi P, Fariselli P, Casadio R: Improved prediction of the number of residue contacts in proteins by recurrent neural networks.
Bioinformatics 2001, 17 Suppl 1:S23442. PubMed Abstract  Publisher Full Text

Yuan Z, Huang B: Prediction of protein accessible surface areas by support vector regression.
Proteins 2004, 57:558564. PubMed Abstract  Publisher Full Text

Yuan Z, Bailey TL: Prediction of protein solvent profile using SVR.
Proceedings of the 26th Annual International Conference of the IEEE EMBS 2004, 28892892.

Yuan Z, Bailey TL, Teasdale RD: Prediction of protein Bfactor profiles.
Proteins 2005, 58:905912. PubMed Abstract  Publisher Full Text

Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features.
Biopolymers 1983, 22:25772637. PubMed Abstract  Publisher Full Text

Thompson MJ, Goldstein RA: Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes.
Proteins 1996, 25:3847. PubMed Abstract  Publisher Full Text

Rost B, Sander C: Conservation and prediction of solvent accessibility in protein families.
Proteins 1994, 20:216226. PubMed Abstract  Publisher Full Text

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25:33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jones DT: Protein secondary structure prediction based on positionspecific scoring matrices.
J Mol Biol 1999, 292:195202. PubMed Abstract  Publisher Full Text

Vapnik V: The Nature of Statistical Learning Theory. Springer, New York, ; 2000.

Smola A, Scholkopf B: A tutorial on support vector regression. [http://www.neurocolt.com] webcite

Joachims T: Making largeScale SVM Learning Practical. In Advances in Kernel Methods  Support Vector Learning. Edited by Schölkopf B, Burges C and Smola A. , MIT Press; 1999.

Noguchi T, Akiyama Y: PDBREPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003.
Nucleic Acids Res 2003, 31:492493. PubMed Abstract  Publisher Full Text  PubMed Central Full Text