We performed a meta-analysis by collecting 5596 previously published sequences generated by single genome amplification-direct sequencing [20, 21] from 182 incident and 43 chronic cases [13, 14, 17]. The incident subjects were classified as recent HIV infections either by symptoms of acute infection or serologic evidence and the chronic subjects were reported to have an infection period of longer than 1 year (see Methods for more). Incident infections were categorized into either single-variant or multi-variant transmission [13, 14, 17]. The diversification can be quantified using the number of base differences between a pair of sequences, i.e., their Hamming distance (HD): HIV *env* diversity is the average number of base differences among all possible pairs of sequences sampled from a patient, divided by the sequence length. The *env* variance is the variance of the number of base differences among the sequences divided by the sequence length.
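
The two summary statistics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the sequences are assumed to be already aligned to a common length, every mismatching column (including gap characters) is assumed to count as one difference, and the population variance is assumed because the text does not say which variance estimator was used.

```python
from itertools import combinations
from statistics import mean, pvariance


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hd(seqs):
    """Hamming distances for all unordered pairs of sequences."""
    return [hamming(a, b) for a, b in combinations(seqs, 2)]


def env_diversity(seqs):
    """Mean pairwise HD divided by the sequence length."""
    return mean(pairwise_hd(seqs)) / len(seqs[0])


def env_variance(seqs):
    """Variance of the pairwise HDs divided by the sequence length.

    Assumption: population variance (pvariance); the source does not
    specify the estimator.
    """
    return pvariance(pairwise_hd(seqs)) / len(seqs[0])


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT"]  # toy sequences, not real env data
    print(pairwise_hd(toy), env_diversity(toy), env_variance(toy))
```

Both quantities summarize the whole HD distribution with a single moment, which is why a sample containing two or more divergent founder lineages can look as diverse as a chronic sample.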

The high level of viral sequence diversity associated with multi-variant transmission suggests that a simple measure of diversity or variance might misclassify individuals in the early stages of an infection founded by multiple viruses as chronically infected. Indeed, as shown in , the *env* diversity of around one third of the incident multi-variant cases overlaps with that of chronic subjects. Furthermore, the third quartile of the *env* variance of the incident subjects with multiple founder strains is greater than the median *env* variance of the chronic subjects. Thus, neither envelope gene diversity nor variance clearly discriminates between incident infections originating from multiple founder strains and chronic infections.

We sought an alternative signature in the HD distribution that discriminates between chronic and incident infections. At an early phase, there should exist a fair number of identical or nearly identical sequences in each lineage of a transmitted strain. Indeed, shows that the first peak of the HD distribution of incident cases, including both single founder ( top left) and multiple founder infections ( top right), is located in the region of very low Hamming distances, implying the presence of closely related sequences. As infection progresses and the HIV population diversifies, the proportion of similar sequences should decline (); in fact, the proportion of identical sequences has been found to decrease exponentially as a function of time post infection [13, 15]. confirms that chronic subjects have a negligible frequency of sequence pairs in the region of low HD values, suggesting the absence of closely related sequences. We clarify this signature by quantifying the tail characteristics of the HD distribution; we measure the 10% quantile for HD, *Q*_{10}, i.e., the HD value dividing the HD distribution into 10% below it and 90% above it. highlights the difference between the distribution of the *Q*_{10} statistics for the 182 incident infection samples and that for the 43 chronic samples. Here the 182 incident patients comprise 102 single founder and 80 multiple founder cases. The incident *Q*_{10} distribution (red in ), which includes both single and multiple founder infections, is visibly distinct from the chronic distribution (blue in ).
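
To make the *Q*_{10} statistic concrete, the following sketch computes it from a set of aligned sequences. It is an illustration under stated assumptions rather than the study's implementation: the function name `q10` is hypothetical, and the quantile is taken with a simple nearest-rank rule (the smallest HD value such that at least 10% of the pairwise distances are less than or equal to it), since the text does not specify an interpolation scheme.

```python
from itertools import combinations


def q10(seqs, percent=10):
    """Nearest-rank `percent`% quantile of the pairwise Hamming distances.

    Assumptions: sequences are aligned to equal length; `percent` is an
    integer, so the rank is computed with exact integer arithmetic.
    """
    hds = sorted(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    if not hds:
        raise ValueError("need at least two sequences")
    # Ceiling of percent * n / 100, with a minimum rank of 1.
    rank = max(1, (percent * len(hds) + 99) // 100)
    return hds[rank - 1]


if __name__ == "__main__":
    # Toy data only: four identical sequences plus one single-base variant
    # (incident-like), and four mutually divergent sequences (chronic-like).
    incident_like = ["ACGTACGT"] * 4 + ["ACGTACGA"]
    chronic_like = ["AAAAAAAA", "CCCCAAAA", "AAAACCCC", "CCCCCCCC"]
    print(q10(incident_like), q10(chronic_like))
```

Because the statistic looks only at the low tail of the HD distribution, a handful of identical or nearly identical pairs is enough to pull it toward zero, even when other pairs in the same sample are highly divergent.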

The clear difference between the incident and chronic *Q*_{10} distributions led us to devise a binary classification test to identify samples from incident infections as being significantly different from the population of chronic infections. If *Q*_{10} is greater than the cut-off value

, the sample is judged as a chronic infection and otherwise the sample is scored as an incident infection. As with most binary classifications, there is a trade-off between specificity and sensitivity that is controlled by the choice of the threshold. The cut-off value of

is objectively determined from an analysis of the receiver operating characteristic (ROC) curve [

22]. The isocost line designates

as the optimal value. Whereas simple measures of viral diversity and variance fail to discriminate chronic samples from incident ones, the binary classification test statistically differentiates the two groups. All of the 43 chronic subjects showed *Q*_{10} values greater than the threshold of 7, indicating a specificity of 100%. Only 5 out of 182 incident subjects had *Q*_{10} values greater than the threshold; the measured sensitivity is 97.3%, and the majority of the 5 misclassified subjects were infected through intravenous drug use. These high levels of sensitivity and specificity suggest that the tail characteristics of the HD distribution could serve as a biomarker for identifying incident infections. Measuring the HD distribution is advantageous because it can be determined from a single blood draw from each individual in cross-sectional blood surveys.
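
The decision rule and its two error rates can be written down directly. The sketch below is illustrative only: the function names are hypothetical and the example values are toy numbers. The cut-off search maximizes sensitivity plus specificity, which corresponds to an isocost line of slope 1 on the ROC curve; this is an assumption, because the cost ratio behind the isocost line is not stated here.

```python
def classify(q10_value, cutoff):
    """Chronic if Q10 exceeds the cut-off, incident otherwise."""
    return "chronic" if q10_value > cutoff else "incident"


def sensitivity_specificity(incident_q10, chronic_q10, cutoff):
    """Fraction of incident samples called incident, and of chronic
    samples called chronic, at the given cut-off."""
    sens = sum(classify(q, cutoff) == "incident" for q in incident_q10)
    spec = sum(classify(q, cutoff) == "chronic" for q in chronic_q10)
    return sens / len(incident_q10), spec / len(chronic_q10)


def best_cutoff(incident_q10, chronic_q10):
    """Observed Q10 value that maximizes sensitivity + specificity.

    Assumption: equal weighting of the two error types (slope-1 isocost
    line). Ties are resolved in favour of the smallest cut-off.
    """
    candidates = sorted(set(incident_q10) | set(chronic_q10))
    return max(
        candidates,
        key=lambda c: sum(sensitivity_specificity(incident_q10, chronic_q10, c)),
    )


if __name__ == "__main__":
    incident = [0, 0, 0, 1, 2, 9]   # toy Q10 values
    chronic = [8, 10, 12, 15]       # toy Q10 values
    c = best_cutoff(incident, chronic)
    print(c, sensitivity_specificity(incident, chronic, c))
```

Applied to the counts reported above (177 of 182 incident samples at or below the cut-off and 43 of 43 chronic samples above it), the same arithmetic gives 177/182, which rounds to 97.3%, and 43/43 = 100%.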

Our biomarker is robust to changes in virus-specific and host-specific factors such as the viral subtype, the viral load of subjects, and the length and location of the sampled envelope gene sequences. shows the ROC curve of the *Q*_{10} distributions when we exclude the dataset of subtype C infections. The area under the ROC curve with subtype B infections only remained the same as that with both subtype B and C infections, 0.998, implying that the biomarker is not sensitive to the clade of the viral strains. The ROC curve with only subtype B infections provides a sensitivity of 95.6% and a specificity of 100% with

. This is presumably because the dynamics of early HIV diversification are not greatly affected by viral subtype. In contrast, the existing serologic assays have significantly different window periods of incident infections between subtype B and other subtypes [11]. Little association is observed between the biomarker and the viral load. shows the scatter plot of *Q*_{10} values and viral loads measured from both incident and chronic subjects. The correlation coefficients were -0.04 for incident subjects and 0.13 for chronic subjects, suggesting that the biomarker is not sensitive to each patient's viral load.

We also find that the sensitivity and specificity of the assay remain very high under changes in either the length or the region of the envelope gene. While the changes in the incident *Q*_{10} distribution caused by varying the length of *env* are minor, the mean of the chronic *Q*_{10} distribution decreases substantially as the length of *env* decreases (). Despite this dependence, we stress that the sensitivity and specificity are markedly high, 95.1% or greater, regardless of whether we consider 500, 1000, or 2000 base long *env* segments or full *env*, as we control

values objectively based on the ROC curve analysis (see ). Our analysis suggests that read lengths of HIV *env* as short as 500 bases do not affect the accuracy of the assay.
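
The length dependence described above follows from a simple property: restricting every sequence to a sub-segment can only lower (or leave unchanged) each pairwise HD, so *Q*_{10} computed on a segment never exceeds *Q*_{10} computed on the full alignment. The sketch below illustrates this with toy data. The function names are hypothetical, the window is given in alignment columns (0-based start) rather than HXB2 coordinates, and the nearest-rank quantile is an assumption.

```python
from itertools import combinations


def q10(seqs, percent=10):
    """Nearest-rank `percent`% quantile of pairwise Hamming distances
    (assumed quantile rule; sequences must be aligned)."""
    hds = sorted(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    if not hds:
        raise ValueError("need at least two sequences")
    rank = max(1, (percent * len(hds) + 99) // 100)
    return hds[rank - 1]


def q10_for_window(seqs, start, length):
    """Q10 restricted to alignment columns [start, start + length)."""
    return q10([s[start:start + length] for s in seqs])


if __name__ == "__main__":
    # Toy chronic-like sample: four mutually divergent 10-base sequences.
    sample = ["AAAAAAAAAA", "CCAAACCAAA", "AAGGAAAGGA", "TAAATTAAAT"]
    for length in (10, 5):
        print(length, q10_for_window(sample, 0, length))
```

Since the chronic *Q*_{10} values shrink with segment length while incident values are already near zero, a cut-off tuned on full-length sequences would not be expected to transfer to short reads, which is consistent with re-deriving the cut-off from the ROC curve for each segment length.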

Chronic *Q*_{10} distributions show a considerable amount of variation with the choice of the location within *env*. As displays, the 500 base long segment of *env* encompassing the major portion of the V3 loop, HXB2 7125-7624, shows the greatest mean of *Q*_{10} and the segment of HXB2 7625-8124 shows the smallest mean. We postulate that the differing distributions imply that, in chronic infections, purifying selection keeps certain sections of *env* quite conserved despite a long period of infection. The presence of purifying selection in chronic infection has been reported [23]. However, the impact of purifying selection does not appear to be strong enough to weaken the signature of chronic infection. The power of discrimination even in the least sensitive region (HXB2 7625-8124) is comparable to the power of the entire *env*; the sensitivity is 98.4% and the specificity is 97.7% with the optimal

; the 10% quantile of the HD distributions of 179 out of the 182 incident subjects was 0 but only a single chronic subject had a 10% quantile value of 0. We conclude that our HIV incidence assay is robust to changes of the length and location of HIV *env* as summarized in .