Los Alamos National Laboratory, Los Alamos, NM 87545, USA

University of Massachusetts, Amherst, MA 01002, USA

University of Arizona, Tucson, AZ 85721, USA

The Santa Fe Institute, Santa Fe, NM 87501, USA

Abstract

Background

The occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmission has been well documented. This results in a majority of new infections being homogeneous,

Results

We have devised a web tool that analyzes genetic diversity in acutely infected HIV-1 patients by comparing it to a model of neutral growth. More specifically, we consider a homogeneous infection (

Conclusions

When the underlying assumptions of our model (homogeneous infection prior to selection and fast exponential growth) are met, the scenario is specific enough that we can use a forward approach rather than the backward-in-time approach of coalescent methods. This allows for more computationally efficient methods to derive the time since the most recent common ancestor. Furthermore, the tool performs statistical tests on the Hamming distance frequency distribution and outputs summary statistics (mean of the best fitting Poisson distribution, goodness of fit P-value, etc.). The tool runs within minutes and can readily accommodate the tens of thousands of sequences generated through new ultradeep pyrosequencing technologies. The tool is available on the LANL website.

Background

The occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmissions has been well documented

**Correlation between time since MRCA estimated by BEAST and the Poisson method**. Estimated time since the MRCA calculated by the Poisson method for samples from 53 homogeneous HIV-1 infections

Implementation

Our basic framework is that of an exponentially growing population following a narrow bottleneck, with lineage-independent mutation rates at all sites and no differential selection among the attested forms. When the resulting diversity is small, almost every change is at a distinct locus, and the pairwise differences between genetic strains,

Input data

The Poisson Fitter tool

Controlling for APOBEC enrichment

APOBEC is a host enzyme. During replication it causes mutations that can introduce stop codons and inactivate the virus. These substitutions are recognizable by a

Star phylogeny

When the samples are relatively small, under our assumed model of neutral evolution and rapid exponential growth, all sequences are likely to coalesce to the same founder or MRCA. The pairwise HD distribution can then be related to the HD_{0} frequency distribution, where HD_{0} is the Hamming distance from the consensus sequence. When the sequences coalesce at the founder we can use the following mathematical formulation to compute the HD frequencies

where δ_{x,y} is the Kronecker delta, and _{i }_{j }_{j }
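The star-phylogeny idea can be illustrated with a minimal Python sketch. It assumes that every mutation falls on a distinct site, so that the distance between two sequences is the sum of their distances from the consensus, HD(a, b) = HD_{0}(a) + HD_{0}(b). The function names `hd0_counts` and `star_pairwise_counts` are hypothetical and are not taken from the tool, and the counting rule below is a derivation under that assumption, not necessarily identical to the published equation.

```python
from collections import Counter


def hd0_counts(sequences, consensus):
    """Count how many sequences lie at each Hamming distance (HD_0) from the consensus."""
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(consensus):
            raise ValueError("sequences must be aligned to the consensus")
        counts[sum(a != b for a, b in zip(seq, consensus))] += 1
    return counts


def star_pairwise_counts(y):
    """Expected number of sequence pairs at each pairwise HD under a star phylogeny.

    Assumption: each mutation hits a distinct site, so HD(a, b) = HD_0(a) + HD_0(b).
    `y` maps an HD_0 value to the number of sequences at that distance.
    """
    pairs = Counter()
    for i, yi in y.items():
        for j, yj in y.items():
            if i < j:
                # Pairs drawn from two different HD_0 classes.
                pairs[i + j] += yi * yj
            elif i == j:
                # Pairs drawn from the same class: choose 2 out of yi.
                pairs[i + j] += yi * (yi - 1) // 2
    return dict(sorted(pairs.items()))


if __name__ == "__main__":
    consensus = "AAAA"
    sample = ["AAAA", "AAAA", "CAAA", "ACAA"]
    print(star_pairwise_counts(hd0_counts(sample, consensus)))
```

Comparing these expected counts with the observed pairwise HD counts gives a quick visual check of how star-like a sample is; the counts always sum to the total number of sequence pairs.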

Fitting the Poisson distribution

A Poisson distribution is fitted to the observed pairwise HD distribution using a maximum likelihood method. A χ^{2} goodness of fit (GOF) test is performed to test whether the HD distribution significantly diverges from a Poisson (small P-values indicate a bad fit). The test takes into account the non-independence of pairwise HDs by comparing the observed frequencies to the expected ones if the sample were to follow a star phylogeny. Prior to the onset of positive selection, the population is assumed to undergo a rapid expansion during which the basic reproductive number R_{0} > 1. Therefore, when the sample yields a GOF P-value above 0.05 (indicating a non-significant divergence from a Poisson distribution), we can estimate the time since the MRCA using the parameters characterizing intrahost HIV evolution. Following Stafford et al., we use R_{0} = 6 for acute HIV-1 infection samples. We assume a constant mutation rate across lineages, which we fix at an average value of ϵ = 2.16 × 10^{-5} per site and per replication cycle. This value has been adjusted from what was originally derived for HIV-1 by Mansky
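As a rough illustration of the fitting step, the following Python sketch computes pairwise HDs, takes the maximum likelihood Poisson mean (which for a Poisson is the sample mean), and converts it to an approximate time. The conversion is a deliberately simplified first-order assumption: two lineages diverging from the founder each accumulate about ϵ × L mutations per generation, so the mean pairwise HD after n generations is roughly 2ϵLn. It ignores the R_{0}-dependent terms used by the tool, and the generation time of 2 days is an assumed value. All function names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_hd(sequences):
    """Hamming distances for all unordered pairs of aligned sequences."""
    return [sum(a != b for a, b in zip(s, t)) for s, t in combinations(sequences, 2)]


def poisson_mle(distances):
    """Maximum likelihood estimate of the Poisson mean (the sample mean)."""
    return sum(distances) / len(distances)


def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k differences."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def naive_days_since_mrca(lam, n_sites, eps=2.16e-5, days_per_generation=2.0):
    """First-order time estimate; NOT the tool's formula.

    Assumes mean pairwise HD ~ 2 * eps * n_sites * generations.
    eps is the mutation rate quoted in the text; days_per_generation is an
    assumed generation time.
    """
    generations = lam / (2.0 * eps * n_sites)
    return generations * days_per_generation


if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT"]
    lam = poisson_mle(pairwise_hd(sample))
    print(lam, [round(poisson_pmf(k, lam), 3) for k in range(4)])
    print(naive_days_since_mrca(1.0, n_sites=2500))
```

Because pairwise distances share sequences and are therefore not independent, a standard χ^{2} test on them would overstate significance; the sketch omits the GOF step for that reason.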

Results and Discussion

All the parameters explained in the previous section are computed and included in the output table called "Log Likelihood - Estimated Parameters." This comprises, for each sample: the number of sequences in the sample, the mean and maximum pairwise HD, the mean of the best fitting Poisson distribution, the corresponding time since the MRCA, and the goodness of fit P-value. Note that when the sample meets our model's assumptions, the mean of the best fitting Poisson distribution is in fact the mean pairwise HD of the sample. A second table, called "Convolution Estimates," provides the observed HD frequencies and the estimated ones calculated using equation (1). A more detailed explanation of the parameters is provided in the

**Example of output graphics for a 454 sample that conformed to the model**. Pairwise HD frequency plots on a logarithmic scale (black, left panel), together with the best fitting Poisson (blue) and the theoretical counts expected if the sample were to follow a star-like phylogeny. The right panel shows the pairwise HD histogram and the best fitting Poisson distribution (red). In the legend we report the GOF P-value (

As a second example, Figure ^{-6 }for APOBEC enrichment. Therefore, only when all positions with a

**Example of output graphics for an SGA sample that was enriched for APOBEC mediated substitutions**. HD frequency plots with best fitting Poisson (red line), on the left (panels A and C), and with theoretical star-phylogeny frequencies (red line), on the right (panels B and D). The top panels represent an alignment not corrected for enrichment for APOBEC motifs, whereas the bottom panels represent the same alignment after the G positions that in the consensus are in the APOBEC context have been removed from the alignment. Prior to the correction, the Poisson does not fit the HD frequency distribution (GOF

Unlike the example in Figure

Both of the examples above meet our model's assumptions of exponential growth with no selection and a negligible recombination rate. When one or more assumptions are not met, the goodness of fit P-value drops considerably and the estimated time since the MRCA is therefore unreliable. Several factors can cause this: for instance, the infection may be non-homogeneous, the sample may not be "early" enough, or one may have sampled an unlikely early random mutation that distorts the Poisson distribution. When analyzing HIV-1 data, we recommend using samples taken within the first 2-5 weeks of infection, or characterized as Fiebig stage I or II

Finally, we note that our tool can be applied to subsets of sequences sampled at later time points when there is evidence of a narrow bottleneck. For example, in Fischer et al.

Conclusions

Our tool enables quantitative characterization of acute infection samples and can be usefully applied in large scale vaccine and prophylaxis studies where estimates of the time since the MRCA and/or the timing of the onset of host selection can be extremely informative. The tool can rapidly detect whether the mutational distribution in a set of HIV sequences is consistent with a star phylogeny and/or a Poisson model, indicative of a population evolving from a single ancestor, with lineage-independent mutations under no differential selection of surviving forms. If the model is violated, the tool automatically evaluates whether this is a consequence of APOBEC-mediated substitutions. When the model is satisfied, it can be used to estimate times to the most recent common ancestor of the lineage, rapidly providing timing estimates that are in good accord with coalescent methods. The speed and simplicity of the algorithm enable it to be applied to massive data sets obtained through ultra-deep sequencing methods.

Authors' contributions

BTK was the project PI, provided the APOBEC modeling concept, and helped draft the manuscript; BF built the web interface; GA provided code to be included in the tool; both BF and GA helped with the web tool development and programming; ASP, BTK, and TB provided the theoretical framework for the study and helped draft the manuscript; EEG provided code for the tool, contributed to the analysis, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the Los Alamos National Laboratory Directed Research and Development program, the Center for HIV/AIDS Vaccine Immunology, NIH grants U19-AI067854-05, AI28433-19, RR06555-18, and by NIAID via NIH-DOE interagency agreement (Y1-AI-1500-01). We wish to thank Brian Gaschen for technical support.