Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure

Yu, Xiaxia; Weber, Irene T; Harrison, Robert W

doi:10.1186/1471-2164-15-S5-S1

Volume 15 Supplement 5

Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Genomics

Research
Open access
Published: 14 July 2014

BMC Genomics volume 15, Article number: S1 (2014)

Abstract

Background

Drug resistance has become a severe challenge for treatment of HIV infections. Mutations accumulate in the HIV genome and make certain drugs ineffective. Prediction of resistance from genotype data is a valuable guide in choice of drugs for effective therapy.

Results

In order to improve the computational prediction of resistance from genotype data we have developed a unified encoding of the protein sequence and three-dimensional protein structure of the drug target for classification and regression analysis. The method was tested on genotype-resistance data for mutants of HIV protease and reverse transcriptase. Our graph based sequence-structure approach gives high accuracy with a new sparse dictionary classification method, as well as support vector machine and artificial neural networks classifiers. Cross-validated regression analysis with the sparse dictionary gave excellent correlation between predicted and observed resistance.

Conclusion

The approach of encoding the protein structure and sequence as a 210-dimensional vector, based on Delaunay triangulation, has promise as an accurate method for predicting resistance from sequence for drugs inhibiting HIV protease and reverse transcriptase.

Background

HIV/AIDS is a pandemic disease and more than 30 million people are infected worldwide [1]. There is no effective vaccine or medicine to completely cure AIDS; however, the long-term survival of many patients has been enabled by drug therapy. Highly Active Antiretroviral Therapy (HAART) using three or four different drugs with different viral targets is very effective in stabilizing the infection [2]. These antiviral drugs target different stages in the viral life-cycle. Two important drug targets are the HIV protease (PR) and reverse transcriptase (RT), which have essential roles in viral replication. HIV RT converts the viral RNA genome into DNA, which is transcribed and translated by the host cell machinery into the viral precursor proteins. HIV PR cleaves the large viral precursor proteins into individual enzymes and structural proteins, which produces infectious viral particles. Among the 23 approved drugs in current clinical use, there are seven nucleoside RT inhibitors (NRTIs), four non-nucleoside RT inhibitors (NNRTIs), and eight PR inhibitors (PIs) [3]. The approved PIs were designed to bind in the active site of HIV PR and prevent the processing of viral precursor proteins (Figure 1). NRTIs are chemical analogs of the natural nucleoside substrates of HIV RT that bind to the protein active site and block its activity in synthesizing DNA from viral RNA. The inhibitors in the NNRTI class also decrease the enzymatic activities of RT; however, they bind in an allosteric site in the palm domain of the p66 subunit instead of the active site of RT (Figure 1).

Despite the success of HAART, current therapy is limited by the rapid emergence of drug resistance [3]. The virus can mutate to acquire resistance during therapy due to the lack of proofreading by RT [4] and its high replication rate [5]. These resistance mutations alter drug targets such as PR and RT [6]. Some of the 35 mutations associated with resistance to PIs alter amino acids located in the active site of PR, while the majority alter residues in distal regions of the enzyme structure [7]. Similarly for RT, several of the mutations associated with resistance to NRTIs alter amino acids in the active site of the enzyme while others are located in more distal regions. The amino acid mutations occurring in association with resistance to the NNRTIs tend to cluster around the inhibitor binding site [8, 9]. The molecular mechanisms for these antiviral drugs are described in the review [10].

These resistance mutations lower the effectiveness of specific drugs and may cause failure of the treatment. Infections with resistant HIV are prevalent; surveys in North America and Europe show that 8-20% of HIV infections in untreated people contain primary drug resistance mutations [10]. Over time, multiple mutations can accumulate giving a huge number of possible combinations of mutations in each protein. This persistent problem led to the recommendation for resistance testing to guide the choice of drugs in AIDS therapy [11–13]. Fast sequencing of the genome of the infecting virus can be combined with computational predictions of resistance to guide the choice of effective antiviral drugs [13]. Accurate and fast computational predictions are desirable to avoid the expense, limited availability and time involved for performing an experimental cell-based assay for resistance where results can take four weeks.

Accurate predictions can be valuable for prescribing the most effective drugs for infections with resistant HIV. Most genotype interpretation algorithms in clinical use are knowledge based [14]. These interpretation algorithms apply a set of rules or scores for each mutation and drug. The performance of several commonly used interpretation algorithms (Stanford HIVdb [15], HIV-grade [16], REGA and ANRS (http://www.hivfrenchresistance.org/)) has been compared [16]. In addition, many computational classification techniques have been evaluated for predicting drug resistance from the genotype data. The standard classification techniques of artificial neural networks (ANN) [17–21], decision trees [19–22], random forests [21], support vector machines (SVM) [21–23] and regression analysis [19] have been applied in HIV drug resistance predictions. Statistical methods can also be applied to analyze the relationship between genotype and phenotype. The association of mutations with resistance to the PIs saquinavir (SQV) and indinavir (IDV) was determined using cluster analysis, recursive partitioning, and linear discriminant analysis [24]. These methods are limited by the high dimensionality of the genotype data; hence, non-parametric methods have been proposed and tested on resistance data for the PI amprenavir [25, 26]. Protein structural information has been used to generate statistical potentials of mutants for training with SVM or random forest learning algorithms and tested in predicting resistance to the RT inhibitor nevirapine (NVP) [27].

We have evaluated an efficient encoding of information from the three-dimensional protein structure for the prediction of resistance from genotype. The structural encoding via Delaunay triangulation improves the quality of the predictions by representing interactions between amino acid neighbours in the three-dimensional structure, unlike the linear sequence representation of other methods. This unified sequence-structure representation was used in supervised training with SVM, ANN, and a new sparse dictionary classification method. The compressive sensing/sparse dictionary representation [28, 29] has been applied successfully in image analysis to enhance learning capacity and efficiency. Sparse representation has been employed for image restoration [30, 31], denoising [32], deblurring [33], signal processing [34], and face detection [35]. Initial tests of this procedure for classifying resistance to four PIs were presented in [36]. Here, the structural encoding has been extended to regression analysis and classification of genotype-phenotype data for seven PIs, six NRTIs and three NNRTIs.

Results

We combined structural information with genotype for regression analysis and supervised learning on resistance data. The new graph based sequence-structure encoding was tested with the Genotype-Phenotype Data from the Stanford HIV drug resistance database [37] (http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi). Data were available for two different protein targets: HIV-1 PR and HIV-1 RT. For HIV-1 PR, seven PR inhibitors were tested: atazanavir (ATV), IDV, nelfinavir (NFV), ritonavir (RTV), lopinavir (LPV), tipranavir (TPV) and SQV. For HIV RT inhibitor resistance, the NNRTIs nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV), and the NRTIs lamivudine (3TC), abacavir (ABC), zidovudine (AZT), stavudine (D4T), didanosine (DDI) and tenofovir (TDF) were tested. The data include the protein sequence (genotype) and resistance value (phenotype) from the PhenoSense (ViroLogic™) assay for each virus isolate. Genotype-phenotype data were available for 744 to 1674 isolates for different inhibitors of HIV PR, while RT was represented by 353 to 746 records for the 9 different NRTIs and NNRTIs. The preprocessing of the sequence and resistance data is detailed in Methods. Genotypes were expanded to unique protein sequences due to the presence of more than one amino acid at some positions. This expansion resulted in a total of 10,228 to 17,545 unique sequences of HIV PR mutants and 2,004 to 11,367 RT mutants for the various inhibitor resistance values.
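The expansion of genotypes with mixed positions into unique sequences can be illustrated with a minimal sketch. It assumes that every combination of the amino acids observed at ambiguous positions is enumerated; the function name `expand_genotype`, the input layout and the five-residue example are hypothetical and are not taken from the Methods.

```python
from itertools import product


def expand_genotype(positions):
    """Expand a genotype with ambiguous positions into unique sequences.

    `positions` holds one string per residue position, listing the amino
    acids observed there (for example "L", or "LM" for a mixture).
    """
    # dict.fromkeys drops repeated letters at a position and keeps their order.
    choices = [tuple(dict.fromkeys(p)) for p in positions]
    # The Cartesian product enumerates every combination of the choices.
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical five-residue genotype with mixtures at positions 3 and 5.
genotype = ["P", "Q", "IV", "T", "LM"]
sequences = expand_genotype(genotype)
print(sequences)
```

Because the number of expanded sequences is the product of the number of alternatives at each position, a few mixed positions per isolate are enough to produce many more sequences than isolates, which is consistent with the counts reported above.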

Graph based protein sequence/structure representation using Delaunay triangulation

The sequences were combined with information from the three-dimensional protein structures by employing a graph generated by Delaunay triangulation as described in [38]. Two structure templates were used: 3OXC [4] for HIV-1 PR, and 2WOM [39] for HIV-1 RT (from http://www.pdb.org). Only one structure vector is needed for each protein. In other words, all PR mutant sequences are combined with a single 210-dimensional vector derived from one PR structure, and similarly, a single structure vector is used for the RT mutants in subsequent regression and classification of resistance data. As a result, all mutants are represented as vectors of constant dimensionality, which is a desirable property for most pattern recognition algorithms. This structure vector was combined with sequences in regression analysis and classification for resistance.
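One way to obtain a vector of constant length 210 from a sequence and a fixed structure graph is sketched below. It rests on an assumption that is not stated in this section: that each dimension counts one unordered pair of the 20 standard amino acids (20 x 21 / 2 = 210) over the edges of the triangulation. The edge list here is a hypothetical toy graph; in practice it would come from a Delaunay triangulation of the template structure coordinates.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical

# Map every unordered amino acid pair to one vector index (210 in total).
PAIR_INDEX = {
    pair: i
    for i, pair in enumerate(combinations_with_replacement(AMINO_ACIDS, 2))
}


def encode(sequence, edges):
    """Count amino acid pair types over the edges of a structure graph.

    `edges` lists (i, j) residue index pairs that are neighbours in the
    template structure. The same edge list is reused for every mutant.
    """
    vector = [0] * len(PAIR_INDEX)
    for i, j in edges:
        a, b = sorted((sequence[i], sequence[j]))  # order-independent key
        vector[PAIR_INDEX[(a, b)]] += 1
    return vector


# Hypothetical toy graph over a five-residue fragment (assumption).
edges = [(0, 1), (1, 2), (2, 4), (0, 4)]
wild_type = encode("PQITL", edges)
mutant = encode("PQVTL", edges)
print(len(wild_type), sum(wild_type), wild_type != mutant)
```

In this sketch the graph is fixed by the template, so a mutation changes only which pair counters are incremented, and sequences of any length map to the same number of dimensions.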

Multiple regression on HIV protease inhibitor resistance

After each of the mutated sequences was represented by a 210-dimensional vector, we performed k-fold (k = 5) cross-validated regression analysis on the sequence and resistance data. The predicted values for relative resistance are plotted against the experimental values in Figure 2 for the PR inhibitors ATV, NFV, RTV, IDV, LPV, TPV and SQV.

The multiple regression gave high R² values of 0.579-0.783 and very low standard deviations as listed in Table 1. The values are the average of all the R² values from k-fold regression. The high variance seen for high values of resistance is likely due to limitations of the experimental assay such that the measured resistance value has a cutoff at the upper limit, while the viral strains may have an effective resistance above this cutoff. The excellent correlations demonstrate that relative resistance to PIs can be predicted successfully from genotype by the new sequence/structure encoding method. In order to avoid training to an "optimal" k-fold set for cross validation, cross validation sets are chosen independently for each training run. Therefore, there is always a small variation in the results.
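The bookkeeping behind the averaged R² values can be sketched as follows. This is an illustration only: a one-feature least-squares line stands in for the multiple regression over the 210-dimensional vectors, the data are synthetic, and all function names are hypothetical. Drawing the folds from a fresh shuffle on each run mirrors the independent choice of cross validation sets described above.

```python
import random
from statistics import mean, stdev


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(observed, predicted):
    """Coefficient of determination on held-out data."""
    m = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - m) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=5, seed=None):
    """Return mean and standard deviation of R² over k held-out folds."""
    rng = random.Random(seed)  # seed=None gives new folds on every run
    scores = []
    for held_out in kfold_indices(len(xs), k, rng):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        predicted = [slope * xs[i] + intercept for i in held_out]
        scores.append(r_squared([ys[i] for i in held_out], predicted))
    return mean(scores), stdev(scores)


# Synthetic data (assumption): a noisy linear relationship.
noise = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]
mean_r2, sd_r2 = cross_validated_r2(xs, ys, k=5, seed=1)
print(mean_r2, sd_r2)
```

Leaving `seed` unset draws different folds on each call, so repeated runs give slightly different averages, which matches the small run-to-run variation noted in the text.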

Table 1 Multiple regression on predicted relative resistance to HIV-1 PR inhibitors.
