Synthetic coding sequence reconstruction.
HCV subtype 1a (n = 390) and 1b (n = 296) sequences that included at least the entire open reading frame of the polyprotein, were obtained from human specimens, and were not epidemiologically redundant were downloaded from GenBank (accession numbers AB016785, AB049087-101, AB154177, AB154179, AB154181, AB154183, AB154185, AB154187, AB154189, AB154191, AB154193, AB154195, AB154197, AB154199, AB154201, AB154203, AB154205, AB191333, AB249644, AB429050, AF009606, AF139594, AF165045, AF165047, AF165049, AF165051, AF165053, AF165055, AF165057, AF165059, AF165061, AF165063, AF176573, AF207752-74, AF208024, AF313916, AF356827, AF483269, AF511948-50, AJ000009, AJ132996-97, AJ238799-800, AJ278830, AY045702, AY460204, AY587844, AY615798, AY695437, AY956463-8, D10749, D10934, D11168, D14484, D50480-82, D63857, D85516, D89815, D89872, D90208, DQ071885, DQ838739, EF032883, EF032886, EF032892, EF032900, EF407411-57, EF407458-504, EF621489, EF638081, EU155213-16, EU155217-35, EU155233, EU155236-381, EU234061, EU234063-65, EU239713, EU239714, EU239715-17, EU255927-99, EU255960-2, EU256000-1, EU256002-97, EU256045, EU256054, EU256059, EU256061-2, EU256064-6, EU256075-103, EU256104, EU256106-7, EU260395-6, EU362882, EU362888-901, EU362911, EU482831-2, EU482833, EU482834-89, EU482839, EU482849, EU482859, EU482860, EU482874, EU482875, EU482877, EU482879-81, EU482883, EU482885-6, EU482888, EU529676-81, EU529682, EU569722-23, EU595697-99, EU660383-85, EU660386, EU660387, EU660388, EU677248, EU677253, EU687193-95, EU857431, EU862823-24, EU862826-27, EU862835, FJ024086, FJ024087, FJ024274-76, FJ024277, FJ024278, FJ024279, FJ024280-82, FJ181999-201, FJ205867-69, FJ390394-95, FJ390396-8, FJ390399, FJ410172, L02836, M58335, M84754, U01214, U16362, U45476, U89019, and X61596).
We refer to the data set of the 390 subtype 1a sequences as the “original data set” for the rest of the paper. The sequences were aligned using MUSCLE v3.0 (9) and modified using BioEdit v126.96.36.199 (13). To avoid idiosyncrasies of any individual phylogeny, we constructed 2 independent phylogenetic trees by applying MrBayes v3.2 (31) to nucleotide positions 869 to 1292 (Core/E1) and 8276 to 8615 (NS5B) of the full-genome alignment (position numbers are based on the reference genome H77 [GenBank accession number AF009606]). These segments were chosen because they were shown to be the most phylogenetically informative (33). We refer to them as “Simmonds' regions” in this paper. We ran 30 million iterations of MrBayes v3.2 and confirmed convergence of parameters for phylogenetic trees inferred from both of Simmonds' regions using Tracer v1.5 (Rambaut A, available from the author [http://beast.bio.ed.ac.uk/Tracer]). Simmonds' regions yielded different trees, a result which is expected due to the large number of possible trees (11); nonetheless, analysis of these two data sets converged with similar model parameters. In addition, recombination in HCV is rare (40). Hence, we can assume the same phylogenetic tree or same evolutionary history for the entire length of the genome (17). Using both phylogenetic trees reconstructed with Simmonds' regions, we inferred ancestral sequences for each of the HCV-1a coding regions (31). The ancestral sequence is obtained as a probability distribution for each position such that there is a probability of observing each base. Bole1a is derived in the following manner.
(i) For each nucleotide position i in the genome, if both trees agreed on the maximum posterior probability (MPP) residue, the probability for that position, p_i, was set to the greater of the two MPPs. We define these positions as concordant.
(ii) For discordant positions (where the MPP residues did not agree), the joint probabilities of the codon k containing the discordant positions based on both trees were designated p_ck(Core/E1) and p_ck(NS5B). For concordant residues within such codons, the p_i calculated in the previous step was used in calculating the joint probability.
(iii) The codon with the higher joint MPP from the two trees was selected to represent that codon position. This codon-based analysis resolves cases in which more than one position in the codon is discordant and accommodates 6-fold degenerate codons.
(iv) To determine a stringent threshold for codon MPP, we located the inflection in the distribution of codon MPPs, taken as the point at which the variance in the second derivative fell below 10^-6. This threshold was found to be 0.9837, corresponding to individual residue MPPs of >0.99.
(v) Each codon with an MPP greater than or equal to 0.9837 based on either tree was accepted as ancestral, and its constituent positions were defined as resolved.
(vi) Covariance analysis was used to examine still-unresolved positions. The basic assumption of phylogenetic reconstruction that each site evolves independently ignores covarying and interacting sites. In order to take such sites into consideration, the observed (o) and expected (e) frequencies of pairs of bases were determined and the chi-square metric was calculated as shown in equation 1

\[ \chi^2 = \sum \frac{(o - e)^2}{e} \qquad (1) \]

and adjusted for multiple comparisons using the Holm-Bonferroni method at an α value of 0.05 (14). Using the adjusted chi-square metric, all resolved positions j that significantly covaried with unresolved positions i were identified. In the case of a positive interaction (o_ij > e_ij), the MPP codon containing the positively interacting residue was selected. For negative interactions (o_ij < e_ij), all codons with the negatively interacting base were eliminated and the MPP codon was selected from the remaining codons.
(vii) At still-unresolved sites, the MPP codon was selected even if the MPP was less than 0.9837 (as noted in Results, this was rarely necessary).
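The selection logic in steps (i) to (iii) and (v), and the statistics behind step (vi), can be illustrated with a minimal Python sketch. This is not the authors' code. The function names (`select_codon`, `chi_square`, `holm_bonferroni`) and the input layout (one dict of base probabilities per codon position) are hypothetical. The sketch also assumes that the joint codon probability is the product of the per-position probabilities, and that p-values for the chi-square statistics are supplied from elsewhere.

```python
from collections import Counter

THRESHOLD = 0.9837  # codon MPP cutoff reported in step (iv)


def mpp(dist):
    """Return (base, probability) for the most probable base at one position."""
    base = max(dist, key=dist.get)
    return base, dist[base]


def select_codon(tree_a, tree_b):
    """Pick one codon from two trees; returns (codon, joint_mpp, resolved).

    tree_a and tree_b are lists of three dicts (base -> posterior probability).
    Assumption: joint codon probability = product of per-position values.
    """
    calls_a = [mpp(d) for d in tree_a]
    calls_b = [mpp(d) for d in tree_b]
    joint_a = joint_b = 1.0
    for (base_a, p_a), (base_b, p_b) in zip(calls_a, calls_b):
        if base_a == base_b:
            # Step (i): concordant position, use the greater of the two MPPs.
            joint_a *= max(p_a, p_b)
            joint_b *= max(p_a, p_b)
        else:
            # Step (ii): discordant position, each tree keeps its own MPP.
            joint_a *= p_a
            joint_b *= p_b
    # Step (iii): keep the codon from the tree with the higher joint MPP.
    if joint_a >= joint_b:
        codon, joint = "".join(b for b, _ in calls_a), joint_a
    else:
        codon, joint = "".join(b for b, _ in calls_b), joint_b
    # Step (v): the codon counts as resolved at or above the threshold.
    return codon, joint, joint >= THRESHOLD


def chi_square(col_i, col_j):
    """Chi-square statistic for base pairs at two alignment columns."""
    n = len(col_i)
    observed = Counter(zip(col_i, col_j))
    count_i, count_j = Counter(col_i), Counter(col_j)
    total = 0.0
    for a in count_i:
        for b in count_j:
            e = count_i[a] * count_j[b] / n  # expected count under independence
            o = observed[(a, b)]             # observed count (0 if pair absent)
            total += (o - e) ** 2 / e
    return total


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure; True marks a rejected null hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break  # stop at the first p-value that fails its cutoff
        reject[idx] = True
    return reject
```

For a codon whose third position is discordant between the two trees, `select_codon` returns the codon from whichever tree gives the higher joint probability. Step (vii) corresponds to accepting that codon even when the returned `resolved` flag is false.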