We performed a meta-analysis of 5596 previously published sequences generated by single genome amplification-direct sequencing [20] from 182 incident and 43 chronic cases [13]. The incident subjects were classified as recent HIV infections by either symptoms of acute infection or serologic evidence, and the chronic subjects were reported to have been infected for longer than 1 year (see Methods for details). Incident infections were categorized as either single-variant or multi-variant transmissions [13]. Diversification can be quantified by the number of base differences between a pair of sequences, i.e., their Hamming distance (HD). HIV env diversity is the average number of base differences among all possible pairs of sequences sampled from a patient, divided by the sequence length. The env variance is the variance of the number of base differences among the sequences, divided by the sequence length.
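The definitions above can be illustrated with a minimal sketch. It assumes the sequences from one patient are already aligned to a common length and that every alignment column, including gaps, is compared character by character. The function names are illustrative and not taken from the original analysis, and the population variance is used here only because the text does not specify the estimator.

```python
from itertools import combinations
from statistics import mean, pvariance


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hd(seqs: list[str]) -> list[int]:
    """Hamming distances for all unordered pairs of sequences."""
    return [hamming(a, b) for a, b in combinations(seqs, 2)]


def diversity(seqs: list[str]) -> float:
    """Mean pairwise HD divided by the sequence length."""
    return mean(pairwise_hd(seqs)) / len(seqs[0])


def hd_variance(seqs: list[str]) -> float:
    """Variance of pairwise HD divided by the sequence length.

    Assumption: population variance; the source does not name the estimator.
    """
    return pvariance(pairwise_hd(seqs)) / len(seqs[0])
```

For three toy sequences `["AAAA", "AAAT", "AATT"]`, the pairwise distances are 1, 2, and 1, so the diversity is (4/3)/4 = 1/3.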
The high level of viral sequence diversity associated with multi-variant transmissions suggests that a simple measure of diversity or variance might misclassify individuals in the early stages of an infection that started with multiple founder viruses as being chronically infected. Indeed, as shown in Figure 2, the env diversity of around one third of the incident multi-variant cases overlaps with that of the chronic subjects. Furthermore, the third quartile of the env variance of the incident subjects with multiple founder strains is greater than the median env variance of the chronic subjects. Neither envelope gene diversity nor variance provides clear discrimination between incident infections originating from multiple founder strains and chronic infections.
Figure 2 The env diversity (A) and variance (B) of 102 acutely infected subjects with a single strain infection (red), 80 acutely infected subjects with multiple strain transmission (pink), and 43 chronically infected subjects (blue). Each horizontal black line ...
We sought an alternative signature in the HD distribution that discriminates chronic from incident infections. In the early phase of infection, each lineage of a transmitted strain should contain a fair number of identical or nearly identical sequences. Indeed, Figure 3A shows that the first peak of the HD distribution of incident cases, for both single founder (Figure 3A, top left) and multiple founder infections (Figure 3A, top right), lies in the region of very low Hamming distances, implying the presence of closely related sequences. As infection progresses and the HIV population diversifies, the proportion of similar sequences should decline; in fact, the proportion of identical sequences has been found to decrease exponentially as a function of time post infection [13]. Figure 3A confirms that chronic subjects have a negligible frequency of sequence pairs in the region of low HD values, suggesting the absence of closely related sequences. We quantify this signature through the tail characteristics of the HD distribution: we measure the 10% quantile of HD, Q10, i.e., the HD value below which 10% of the HD distribution lies and above which 90% lies. Figure 3 highlights the difference between the distribution of the Q10 statistic for the 182 incident infection samples and that for the 43 chronic samples. The 182 incident patients comprise 102 single founder and 80 multiple founder cases. The incident Q10 distribution (red in Figure 3), which includes both single and multiple founder infections, is visibly distinct from the chronic distribution (blue in Figure 3).
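The Q10 statistic described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the text does not say how the quantile is interpolated, so the nearest-rank definition is assumed, and the helper names are hypothetical.

```python
from itertools import combinations


def pairwise_hd(seqs):
    """Hamming distances for all unordered pairs of equal-length aligned sequences."""
    return [sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)]


def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of values at or below it.

    `percent` is an integer in 1..100; integer arithmetic avoids floating-point
    rounding when computing the rank.
    """
    if not values:
        raise ValueError("need at least one value")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceil(percent * n / 100), 1-based
    return ordered[rank - 1]


def q10(seqs):
    """10% quantile of the pairwise Hamming distance distribution (assumed nearest-rank)."""
    return percentile_nearest_rank(pairwise_hd(seqs), 10)
```

In a toy sample with four identical sequences and one divergent sequence, six of the ten pairs have HD 0, so Q10 is 0, which mirrors the low-HD peak expected early in infection.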
Figure 3 A. The HD distribution of the sampled sequences from two incident patients, ACT54869022 (top left) and 703010228 (top right), and two chronic subjects, SMRE4166 (bottom left) and SHKE4761 (bottom right). The red dashed ...
The clear difference between the incident and chronic Q10 distributions led us to devise a binary classification test that identifies samples from incident infections as significantly different from the population of chronic infections. If Q10 is greater than the cut-off value, the sample is judged to be a chronic infection; otherwise it is scored as an incident infection. As with most binary classifications, there is a trade-off between specificity and sensitivity that is controlled by the choice of threshold. The cut-off value is determined objectively from an analysis of the receiver operating characteristic (ROC) curve [22]; the isocost line designates a cut-off of 7 as the optimal value. Whereas simple measures of viral diversity and variance fail to discriminate chronic samples from incident ones, the binary classification test statistically differentiates the two groups. All 43 chronic subjects showed Q10 values greater than the threshold of 7, giving a specificity of 100%. Only 5 of the 182 incident subjects had Q10 values greater than the threshold, so the measured sensitivity is 97.3% (177 of 182); the majority of the 5 misclassified subjects were infected through intravenous drug use. These high levels of sensitivity and specificity suggest that the tail characteristics of the HD distribution could serve as a biomarker for identifying incident infections. Measuring the HD distribution is advantageous because it can be determined from a single blood draw from each individual in cross-sectional blood surveys.
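The classification rule and the reported rates can be sketched in a few lines. The function names are illustrative. The Q10 lists in the usage note are synthetic placeholders chosen only to reproduce the reported counts (5 of 182 incident samples above the threshold, all 43 chronic samples above it); they are not the study data.

```python
def classify(q10_value, cutoff=7):
    """Score a sample as 'chronic' if its Q10 exceeds the cut-off, else 'incident'."""
    return "chronic" if q10_value > cutoff else "incident"


def sensitivity_specificity(incident_q10, chronic_q10, cutoff=7):
    """Sensitivity: fraction of incident samples scored incident.
    Specificity: fraction of chronic samples scored chronic."""
    true_incident = sum(classify(q, cutoff) == "incident" for q in incident_q10)
    true_chronic = sum(classify(q, cutoff) == "chronic" for q in chronic_q10)
    return true_incident / len(incident_q10), true_chronic / len(chronic_q10)
```

With the synthetic placeholders `[0] * 177 + [8] * 5` for incident samples and `[10] * 43` for chronic samples, the function returns 177/182 (about 97.3%) and 43/43 (100%), matching the arithmetic behind the figures quoted above.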
Our biomarker is robust to changes in virus-specific and host-specific factors such as the viral subtype, the viral load of subjects, and the length and location of the sampled envelope gene sequences. Figure 4A shows the ROC curve of the Q10 distributions when we exclude the dataset of subtype C infections. The area under the ROC curve with subtype B infections only remained the same as that with both subtype B and C infections, 0.998, implying that the biomarker is not sensitive to the clade of the viral strains. The ROC curve with only subtype B infections provides a sensitivity of 95.6% and a specificity of 100% at its cut-off value. This is presumably because the dynamics of early HIV diversification are not greatly affected by viral subtype. In contrast, the existing serologic assays have significantly different window periods of incident infection between subtype B and other subtypes [11]. Little association is observed between the biomarker and the viral load. Figure 4 shows the scatter plot of Q10 values and viral loads measured from both incident and chronic subjects. The correlation coefficients were -0.04 for incident subjects and 0.13 for chronic subjects, suggesting that the biomarker is not sensitive to each patient's viral load.
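As a sketch of the two summary statistics used in this paragraph, the area under the ROC curve can be computed as the probability that a randomly chosen incident sample has a lower Q10 than a randomly chosen chronic sample, with ties counted as one half. This rank formulation is a standard equivalent of the area under the curve, not necessarily how the original analysis computed it. The correlation shown is the Pearson coefficient, which is an assumption, as is the choice of whether viral loads are log-transformed beforehand. Function names are illustrative.

```python
import math


def roc_auc(incident_q10, chronic_q10):
    """AUC as P(incident Q10 < chronic Q10) + 0.5 * P(tie), over all sample pairs."""
    score = 0.0
    for i in incident_q10:
        for c in chronic_q10:
            if i < c:
                score += 1.0
            elif i == c:
                score += 0.5
    return score / (len(incident_q10) * len(chronic_q10))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

An AUC near 1 means nearly every incident sample has a lower Q10 than every chronic sample, and a correlation near 0 between Q10 and viral load is what the reported values of -0.04 and 0.13 indicate.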
Figure 4 A. Dependence of the ROC curve on the subtype of infection. The blue line represents the original ROC curve with the samples from both subtype B and C infections. The red line represents the ROC curve when we exclude 69 incident samples with subtype C ...
We also find that the sensitivity and specificity of the assay remain very high when either the length or the region of the envelope gene is changed. While varying the length of env changes the incident Q10 distribution only slightly, the mean of the chronic Q10 distribution decreases substantially as the length of env decreases (Figure 5). Despite this dependence, we stress that the sensitivity and specificity remain high, 95.1% or greater, regardless of whether we consider 500, 1000, or 2000 base long env segments or full env, because we set the cut-off values objectively based on the ROC curve analysis (see Figure 5A). Our analysis suggests that read lengths of HIV env as short as 500 bases do not affect the accuracy of the assay.
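The dependence on segment length and placement can be explored by recomputing Q10 on sub-segments of the alignment. The sketch below assumes that segments are given as 0-based column offsets into an existing alignment; mapping those columns to HXB2 coordinates is outside its scope. It also assumes a nearest-rank quantile, and all function names are hypothetical.

```python
from itertools import combinations


def q10(seqs):
    """10% quantile (nearest-rank, an assumption) of pairwise Hamming distances."""
    hds = sorted(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    rank = (10 * len(hds) + 99) // 100  # ceil(0.10 * n), 1-based
    return hds[rank - 1]


def segment(seqs, start, length):
    """Cut the same column range [start, start + length) out of every sequence."""
    return [s[start:start + length] for s in seqs]


def q10_by_segment(seqs, length, step):
    """Q10 for each window of `length` columns, advancing by `step` columns."""
    full_length = len(seqs[0])
    return {
        start: q10(segment(seqs, start, length))
        for start in range(0, full_length - length + 1, step)
    }
```

Calling `q10_by_segment(alignment, 500, 500)` on one patient's alignment would give one Q10 value per non-overlapping 500-column window, which could then be compared against a cut-off chosen separately for that segment length.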
Figure 5 A. The optimal cut-off value for the 10% quantile of the binary classification test for each length and placement of the viral segments. The starting position of each segment is referenced to the HXB2 strain. As the length of envelope gene shortens ...
The Q10 distributions show considerable variation with the choice of location within env. The 500 base long segment of env encompassing the major portion of the V3 loop, HXB2 7125-7624, shows the greatest mean Q10, and the segment HXB2 7625-8124 shows the smallest mean. We postulate that these differing distributions arise because, in chronic infections, purifying selection keeps certain sections of env quite conserved despite a long period of infection. The presence of purifying selection in chronic infection has been reported [23]. However, the impact of purifying selection does not appear to be strong enough to weaken the signature of chronic infection. The power of discrimination even in the least sensitive region (HXB2 7625-8124) is comparable to that of the entire env: with the optimal cut-off value, the sensitivity is 98.4% and the specificity is 97.7%; the 10% quantile of the HD distributions of 179 of the 182 incident subjects was 0, whereas only a single chronic subject had a 10% quantile of 0. We conclude that our HIV incidence assay is robust to changes in the length and location of HIV env.