Considerable inaccuracy in estimates of human immunodeficiency virus (HIV) incidence has been a serious obstacle to the development of efficient HIV/AIDS prevention and intervention strategies. Accurately distinguishing recent or incident infections from chronic infections enables one to monitor epidemics and evaluate the impact of HIV prevention/intervention trials. However, serological testing has not been able to realize these promises due to a number of critical limitations. The aim of our study is to design a novel scheme for identifying incident infections in a highly accurate manner, based on the characteristics of HIV gene diversification within an infected individual.
We perform a comprehensive meta-analysis on 5596 full envelope HIV genes generated by single genome amplification-direct sequencing from 182 incident and 43 chronic cases. We devise a binary classification test based on the tail characteristics of the Hamming distance distribution of sequences.
We identify a clear signature of incident infections: the presence of closely related strains in the sampled HIV envelope gene sequences in each HIV infected patient, in both single-variant and multi-variant transmissions. The sequence similarity used as a biomarker is found to have high specificity and sensitivity, greater than 95%, and is robust to viral and host specific factors such as the clade of the viral strain, viral load, and the length and location of sequences in the HIV envelope gene.
Because of rapid and continuing improvements in sequencing technology and cost, sequence based incidence assays hold great promise as a means of quantifying HIV incidence from a single blood test.
Assessing how many people have been recently infected in a given area is an important task in HIV/AIDS prevention. Accurately distinguishing recent or incident infections from chronic infections enables one to monitor epidemics, evaluate the impact of antiretroviral treatment, and assess the efficacy of HIV prevention trials including vaccination, microbicides, and other types of interventions. Precise estimates of HIV incidence are essential to allocate HIV-related health care resources properly. In particular, incidence estimates based on single blood draws from cross-sectional sampling of populations are greatly needed in resource-limited settings.
The approximate window period of HIV incident infections is the first year post transmission, which covers the eclipse phase and the stages of the Fiebig classification based on the orderly appearance of viral RNA, viral antigens such as p24 and p31, and HIV-specific antibodies. This period is characterized by rapid expansion and decline of viral RNA and a gradual increase of HIV-specific antibody titers (Figure 1A). Current HIV incidence assays are based on the idea that antibody level or avidity rises in a predictable pattern during the first 4 to 6 months post transmission, eventually reaching a plateau that stays roughly constant for many years (Figure 1A). This approach includes the Serologic Testing Algorithm for Recent HIV Seroconversion (STARHS) [6, 7], the BED capture enzyme immunoassay (BED), and the guanidine-based antibody avidity assay [9, 10]. However, the serologic assays are found to have a number of critical limitations, including difficulties in standardization and reproducibility and a strong dependence on the infecting virus clade [9, 11]. These limitations result in notable inaccuracy. The sensitivity, the proportion of incident infections correctly identified as incident, ranges from 42% to 100% with a median of 89% across 13 serologic assays. The specificity, the proportion of chronic infections correctly identified as chronic, ranges from 49.5% to 100% with a median of 86.8%. The tendency to misclassify long-standing infections as recent is pronounced among patients on anti-retroviral treatment. The substantial false recent rate is reported to be one of the limitations of the serologic assays, as it results in the overestimation of incidence.
Here, we draw on recent advances in understanding early HIV infections [13, 14] and demonstrate that information derived from a set of HIV envelope genes obtained from a single blood sample can accurately distinguish incident infections from chronic ones. Single genome amplification and sequencing of HIV envelope genes have shown that the majority of HIV infections originate from a single strain for both subtype B and subtype C infections [13-15]. Not all infections originate from a single founder strain. Two risk groups, men who have sex with men (MSM) and injection drug users (IDUs), show a high chance of being productively infected by more than one type of virus [16, 17]; around 36% of MSM and around 60% of IDUs were found to have acquired infection through multi-variant transmission. Figure 1B illustrates the difference between single-variant and multi-variant transmissions. As the infection propagates within an individual, the HIV population diversifies as a consequence of the dynamic interplay between viral mutation and immune selection (Figure 1B). This study suggests a novel assay with high levels of sensitivity and specificity based on the characteristics of HIV gene diversification originating from either single-variant or multi-variant founder strains.
The HIV env sequences of 182 incident and 43 chronic patients were collected from previously published data sets. Geographic locations of the cohorts were the US, Trinidad, South Africa, Malawi, and Canada. All of the 5596 sequences we analyzed were obtained by single genome amplification and sequencing. The incident subjects were sub-staged according to the Fiebig classification: 1 subject was in stage I, 74 in stage II, 24 in stage III, 23 in stage IV, 44 in stage V, and 16 in stage VI. The routes of exposure include 92 heterosexual transmissions, 16 MSM subjects, 12 IDU subjects, and others of unknown route.
Following a previously described approach, the proportion of incident infections correctly identified as incident (sensitivity) is plotted against the proportion of chronic infections misclassified as incident (1 - specificity) as we incrementally change the putative Q10 cut-off value. The optimal cut-off value is determined by the isocost line, which maximizes the sum of sensitivity and specificity with equal weight given to each.
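The cut-off sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `roc_points` and `best_cutoff` are hypothetical, the Q10 values in the usage line are invented toy numbers, and ties are broken in favor of the smallest cut-off, which is an assumption.

```python
def roc_points(incident_q10, chronic_q10):
    """Return (cutoff, sensitivity, specificity) for each candidate cutoff."""
    # Assumption: only observed Q10 values need to be tried as cutoffs.
    candidates = sorted(set(incident_q10) | set(chronic_q10))
    points = []
    for cutoff in candidates:
        # A sample is called incident when its Q10 does not exceed the cutoff.
        sens = sum(q <= cutoff for q in incident_q10) / len(incident_q10)
        # A chronic sample is correctly classified when its Q10 exceeds it.
        spec = sum(q > cutoff for q in chronic_q10) / len(chronic_q10)
        points.append((cutoff, sens, spec))
    return points


def best_cutoff(incident_q10, chronic_q10):
    """Pick the cutoff maximizing sensitivity + specificity (equal weights)."""
    # max() keeps the first maximum, i.e. the smallest cutoff on ties.
    return max(roc_points(incident_q10, chronic_q10),
               key=lambda p: p[1] + p[2])


# Toy Q10 values, invented for illustration only.
print(best_cutoff([0, 0, 1, 2, 9], [8, 12, 15, 20]))
```

Plotting sensitivity against 1 - specificity for the tuples returned by `roc_points` gives the ROC curve itself; maximizing the unweighted sum corresponds to the equal-consideration isocost criterion.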
We performed a meta-analysis by collecting 5596 previously published sequences generated by single genome amplification-direct sequencing [20, 21] from 182 incident and 43 chronic cases [13, 14, 17]. The incident subjects were classified as recent HIV infections either by symptoms of acute infection or serologic evidence and the chronic subjects were reported to have an infection period of longer than 1 year (see Methods for more). Incident infections were categorized into either single-variant or multi-variant transmission [13, 14, 17]. The diversification can be quantified using the number of base differences between a pair of sequences, i.e., their Hamming distance (HD): HIV env diversity is the average number of base differences among all possible pairs of sequences sampled from a patient, divided by the sequence length. The env variance is the variance of the number of base differences among the sequences divided by the sequence length.
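As a minimal sketch of these definitions, the code below computes pairwise Hamming distances, env diversity, and env variance for one patient's aligned sequences. The function names are hypothetical, and two choices are assumptions not stated in the text: any differing character (including a gap) counts as a base difference, and the population variance is used.

```python
from itertools import combinations
from statistics import mean, pvariance


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hd(seqs):
    """Hamming distances over all unordered pairs of sequences."""
    return [hamming(a, b) for a, b in combinations(seqs, 2)]


def env_diversity(seqs):
    """Mean pairwise HD divided by the sequence length."""
    return mean(pairwise_hd(seqs)) / len(seqs[0])


def env_variance(seqs):
    """Variance of pairwise HD divided by the sequence length.

    Assumption: population variance; the text does not specify which.
    """
    return pvariance(pairwise_hd(seqs)) / len(seqs[0])


# Toy aligned sequences, invented for illustration only.
sample = ["AAAA", "AAAT", "AATT"]
print(pairwise_hd(sample), env_diversity(sample), env_variance(sample))
```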
The high level of viral sequence diversity associated with multi-variant transmissions suggests that a simple measure of the diversity or variance might misclassify early stages of individuals whose infection started with multiple founder viruses as being chronically infected. Indeed, as shown in Figure 2A and B, the level of env diversity of around one third of the incident multi-variant cases overlaps with those of chronic subjects. Furthermore, the third quartile of the env variance of the incident subjects with multiple founder strains is greater than the median env variance of the chronic subjects. Neither envelope gene diversity nor variance provides clear discrimination between incident infections originating from multiple founder strains and chronic infections.
We sought an alternative signature in the HD distribution that discriminates chronic and incident infections. At an early phase, there should exist a fair number of identical or nearly identical sequences in each lineage of transmitted strain. Indeed, Figure 3A shows that the first peak of the HD distribution of incident cases including both single founder (Figure 3A top left) and multiple founder infections (Figure 3A top right), is located in the region of very low Hamming distances, implying the presence of closely related sequences. As infection progresses and the HIV population diversifies, the proportion of similar sequences should decline (Figure 1B); in fact, the proportion of identical sequences has been found to decrease exponentially as a function of time post infection [13, 15]. Figure 3A confirms that chronic subjects have a negligible frequency of sequence pairs in the region of low HD values, suggesting the absence of closely related sequences. We clarify this signature by quantifying the tail characteristics of the HD distribution; we measure the 10% quantile for HD, Q10, i.e., the HD value dividing the HD distribution into 10% below it and 90% above it. Figure 3B highlights the difference between the distribution of the Q10 statistics for the 182 incident infection samples and that for the 43 chronic samples. Here the 182 incident patients include both 102 single founder and 80 multiple founder cases. The incident Q10 distribution (red in Figure 3B), which includes both single and multiple founder infections, is visibly disparate from the chronic distribution (blue in Figure 3B).
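The Q10 statistic can be sketched as below for a single patient's aligned sequences. The text does not state which quantile convention was used, so the nearest-rank rule here is an assumption, and the names `quantile_nearest_rank` and `q10` are hypothetical.

```python
from itertools import combinations


def pairwise_hd(seqs):
    """Hamming distances over all unordered pairs of aligned sequences."""
    return [sum(x != y for x, y in zip(a, b))
            for a, b in combinations(seqs, 2)]


def quantile_nearest_rank(values, percent=10):
    """Smallest value with at least `percent`% of the data at or below it.

    Assumption: nearest-rank convention, computed with integer arithmetic
    to avoid floating-point rounding in the rank.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, (len(ordered) * percent + 99) // 100)  # ceil(n * p / 100)
    return ordered[rank - 1]


def q10(seqs):
    """10% quantile of the pairwise Hamming distance distribution."""
    return quantile_nearest_rank(pairwise_hd(seqs), 10)


# Toy incident-like sample: four identical sequences plus one close variant.
incident_like = ["ACGTACGT"] * 4 + ["ACGTACGA"]
print(q10(incident_like))
```

Because the statistic looks only at the lower tail, a sample with several identical or near-identical sequences yields a low Q10 even if other pairs, for example across distinct founder lineages, are far apart.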
The clear difference between the incident and chronic Q10 distributions led us to devise a binary classification test to identify samples from incident infections as being significantly different from the population of chronic infections. If Q10 is greater than the cut-off value, the sample is judged as a chronic infection; otherwise, the sample is scored as an incident infection. As with most binary classifications, there is a trade-off between specificity and sensitivity that is controlled by the choice of the threshold. The cut-off value is objectively determined from an analysis of the receiver operating characteristic (ROC) curve. The isocost line designates 7 as the optimal value. Whereas simple measures of viral diversity and variance fail to discriminate chronic samples from incident ones, the binary classification test statistically differentiates the two groups. All of the 43 chronic subjects showed Q10 values greater than the threshold of 7, indicating a specificity of 100%. Only 5 out of 182 incident subjects had Q10 values greater than the threshold; the measured sensitivity is 97.3%, and the majority of the 5 misclassified subjects were infected through intravenous drug use. These high levels of sensitivity and specificity suggest that the tail characteristics of the HD distribution can serve as a biomarker for identification of incident infections. Measuring the HD distribution is advantageous because it can be determined from a single blood draw from each individual in cross-sectional blood surveys.
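The decision rule and the reported percentages can be checked with a short sketch. The threshold of 7 and the counts (5 of 182 incident subjects misclassified, 0 of 43 chronic subjects misclassified) come from the text; the names `classify` and `percent` are hypothetical.

```python
CUTOFF = 7  # threshold reported in the text for full-length env


def classify(q10_value, cutoff=CUTOFF):
    """Chronic if Q10 exceeds the cutoff, otherwise incident."""
    return "chronic" if q10_value > cutoff else "incident"


def percent(correct, total):
    """Proportion correct as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


sensitivity = percent(182 - 5, 182)  # incident subjects called incident
specificity = percent(43, 43)        # chronic subjects called chronic
print(sensitivity, specificity)      # 97.3 100.0
```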
Our biomarker is robust under changes of viral specific and host specific factors such as the viral subtype, the viral load of subjects, and the length and location of the sampled envelope gene sequences. Figure 4A shows the ROC curve of the Q10 distributions when we exclude the dataset of subtype C infections. The area under the ROC curve with subtype B infections only remained the same as that with both subtype B and C infections, 0.998, implying that the biomarker is not sensitive to the clade of the viral strains. The ROC curve with only subtype B infections provides a sensitivity of 95.6% and a specificity of 100% at the cut-off determined from that curve. This is presumably because the dynamics of early HIV diversification are not greatly affected by viral subtype. In contrast, the existing serologic assays have significantly different window periods of incident infections between subtype B and other subtypes. Little association is observed between the biomarker and the viral load. Figure 4B shows the scatter plot of Q10 values and viral loads measured from both incident and chronic subjects. The correlation coefficients were -0.04 for incident subjects and 0.13 for chronic subjects, suggesting that the biomarker is not sensitive to each patient's viral load.
We also find that the sensitivity and specificity of the assay remain very high under changes of either the length or the region of the envelope gene. While the changes in the incident Q10 distribution with varying env length are minor, the mean of the chronic Q10 distribution decreases substantially as the length of env decreases (Figure 4C). Despite this dependence, we stress that the sensitivity and specificity are markedly high, 95.1% or greater, regardless of whether we consider 500, 1000, or 2000 base long env segments or full env, because we set the cut-off values objectively based on the ROC curve analysis (see Figure 5A). Our analysis suggests that read lengths of HIV env as short as 500 bases do not affect the accuracy of the assay.
Chronic Q10 distributions show a considerable amount of variation with the choice of the location within env. As Figure 4D displays, the 500 base long segment of env encompassing the major portion of the V3 loop, HXB2 7125-7624, shows the greatest mean of Q10, and the segment HXB2 7625-8124 shows the smallest mean. We postulate that the differing distributions imply that, in chronic infections, purifying selection keeps certain sections of env quite conserved despite a long period of infection. The presence of purifying selection in chronic infection has been reported. However, the impact of purifying selection does not appear to be strong enough to weaken the signature of chronic infection. The power of discrimination even in the least sensitive region (HXB2 7625-8124) is comparable to the power of the entire env: the sensitivity is 98.4% and the specificity is 97.7% at the optimal cut-off; the 10% quantile of the HD distributions of 179 out of the 182 incident subjects was 0, but only a single chronic subject had a 10% quantile value of 0. We conclude that our HIV incidence assay is robust to changes of the length and location of HIV env, as summarized in Figure 5.
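The length and location analysis amounts to recomputing Q10 on a sub-segment of the alignment. The sketch below illustrates this under stated assumptions: `start` is a 0-based column of the patient alignment rather than an HXB2 coordinate (mapping alignment columns to HXB2 positions is not shown), the quantile uses a nearest-rank convention, and the function name `q10_window` and the toy sequences are hypothetical.

```python
from itertools import combinations


def q10_window(seqs, start, length):
    """Q10 of pairwise Hamming distances within one alignment window."""
    # Cut the same column range out of every aligned sequence.
    segments = [s[start:start + length] for s in seqs]
    hds = sorted(sum(x != y for x, y in zip(a, b))
                 for a, b in combinations(segments, 2))
    # Assumption: nearest-rank 10% quantile, rank = ceil(0.1 * n).
    rank = max(1, (len(hds) * 10 + 99) // 100)
    return hds[rank - 1]


# Toy alignment: a conserved first half and a variable second half.
toy = ["AAAAACGT", "AAAACAGT", "AAAAGTAC"]
print(q10_window(toy, 0, 4), q10_window(toy, 4, 4), q10_window(toy, 0, 8))
```

In this toy case the conserved window gives a Q10 of 0 even though the sequences differ elsewhere, which mirrors why the cut-off has to be re-derived from the ROC curve for each segment length and location rather than reused from full-length env.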
We find that sequence similarity used as a biomarker has high specificity and sensitivity and is robust to viral and host specific factors such as the clade of the viral strain, viral load, and the length and location of sequences in the HIV envelope gene. We found that a simple measure of the viral diversity or variance failed to distinguish chronically infected individuals from those infected with multiple founder viruses but who are at an early stage. This is due to the fact that distinct founder strains in multivariant transmissions caused increased HD diversity and variance (see Figure 2 and Figure 3A). On the other hand, there still exists a tangible number of very closely related sequences within each lineage of the founder virus at incident stage, which yields lower Q10 values than do individuals in chronic stage. Consequently, the 10% quantile of the HD distribution, instead of the mean or variance of the HD distribution, was found to be a robust measure for distinguishing incident infections, including multi-variant transmissions, from chronic infections.
One foreseeable issue for the development of a genome-based HIV incidence assay is the decline in viral sequence diversity that occurs during the later stages of infection [24, 25]. This common phenomenon of diversity decline as the end point disease is approached implies that we cannot exclude the possibility that a sequencing based assay might identify some subjects with late infection as having an incident infection. Sequences collected from patients with end point disease would be essential to examine this possibility at the population level. However, such late stage patients should be identifiable based on clinical criteria by introducing additional measures such as CD4+ T cell count.
The datasets used in the present study were obtained by single genome amplification and sequencing, which conventionally samples fewer than 100 sequences. On the other hand, deep sequencing is capable of producing more than 10,000 reads from a single blood sample. The estimation of tail characteristics of a distribution such as Q10 requires a substantially greater sample size than the estimation of central characteristics such as the mean or median. One of the limitations of the current deep sequencing platforms is that they produce relatively short reads (400-600 bases long) in comparison to single genome amplification (SGA) and Sanger sequencing. Our analysis suggests that short read lengths do not affect the accuracy of the assay, implying that data from deep sequencing methods could also be used to develop the assay. In future implementations, caution would be needed regarding the sequencing errors and re-sampling issues of deep sequencing methods.
Our study has demonstrated that a sequencing based HIV incidence assay could be a very powerful tool for identifying incident infections in a highly accurate manner. The rapid decrease in the cost of DNA sequencing over decades suggests the potential of our assay to be cost-effective and widely adopted in clinical practice. More empirical data will be essential to further validate this new paradigm by examining the sensitivity of the assay to logistical factors. Our assay is advantageous as it is expected to designate incident infection or chronic infection from a single blood draw measure, not necessarily combining different measures and adopting multi-assay algorithms.
We thank Dr. George Shaw, Dr. Stuart Z. Shapiro, and Dr. Steven Wolinsky for critical review of the manuscript.
Financial support: This work was supported by NIH grants R01 AI083115 and AI095066 (to HYL), and R01 RR06555 and R37 AI28433 (to ASP) as well as NSF grant PHY05-51164.
S.Y.P. and H.Y.L. designed the project and formulated the assay. T.M.T.L. and S.W.T. provided statistical input and J.N. performed sequence analysis. A.S.P. contributed to the initiation and formulation of the project. All authors participated in the writing of the manuscript.