Impact of Base Composition on Phylogenetic Reconstruction
In reciprocal Blast analyses, we mapped the 454-derived transcriptome contig sequences from nine bird species onto the zebra finch genome, which allowed orthologous assessment of genes sequenced in different genomic regions. Due to the random nature of shotgun transcriptome sequencing, different overlapping sets of cDNA sequences were present in different avian samples. To handle this combinatorial gene mixture, we built a matrix for making phylogenomic inference. The matrix contained sequence data from 1,995 genes present in the zebra finch genome, representing more than 10% of the estimated total number of genes in avian genomes (ICGSC 2004
; Warren et al. 2010
). These genes have a total length after concatenation of 774,279 bp of exonic sequence. The amount of missing data was unevenly distributed across species and high for some, from 3.5% for the zebra finch to 87.9% for the ruby-throated hummingbird (); the missing data for the zebra finch was the result of removing partial finch sequences during the supermatrix building process (see Materials and Methods). However, we still had mostly over 100 kb of exonic sequence per species, which is unprecedented in studies of avian molecular evolution in terms amount of sequence data.
Basic Sequence Statistics for Each Species.
We found that the GC content in the third codon position (GC3) varied significantly across species (black boxes in ). Passerines had the most GC3-rich transcriptome (GC3 = 52.1%, bootstrap value range 50.9–52.6%). The budgerigar and the emu had the least GC3-rich (46.1% and 47.0%, respectively). In contrast, the GC1 and GC2 contents (1st and 2nd codon positions) were almost identical across species (gray and dark-gray boxes in , respectively).
FIG. 1. GC content at the 1st, 2nd, and 3rd codon position (GC1, GC2, and GC3) in light-gray, dark-gray, and black boxes, respectively. Dashed red lines show the observed value (number of Gs + Cs divided by the sequence length). Boxes give the quartiles of the (more ...)
There was also significant variation in GC3 among genes within a species. To evaluate this variation, we distributed genes in 5 Mb windows according to their position in the zebra finch genome. The between-window variation in GC3 was very large in all species, with an average standard deviation of 0.10 (), as compared with an average standard deviation of 0.002 of the bootstrap values (black boxes, ). This variation in GC3 among genes showed consistency across species since GC3 of particular windows varied from 33% to 77% when averaged across species. Finally, the between-species standard deviation of GC3 per window varied from 0.02 to 0.18 and was positively correlated to the mean GC3 per window (Spearman's Rho = 0.42, P = 1.3 × 10−9, n = 194, 5Mb windows: ). These results demonstrate that the highest variation in GC3 among lineages was concentrated to the most GC3-rich regions of the genome. GC1 and GC2 showed much less variation between species than GC3 (standard deviation of 0.042 and 0.046, respectively).
FIG. 2. (A) Distribution of GC3 in 5 Mb windows according to their position in the zebra finch genome. Boxes give the quartiles of the distribution; whiskers extend to the most extreme values obtained by bootstrapping. (B) Relationship between mean GC3 in 5 Mb (more ...)
To make phylogenetic inference, we started with a classical ML analysis using nucleotide data (GTR + Gamma4 model) without any attempt to control for GC3 variation. This analysis confirmed without ambiguity the Neoavian (beyond chicken) and the passerine monophylies and the basal suboscine (manakin) and oscine songbird (other passerines) dichotomy of passerines (). Within the passerines and in support of previous findings (Barker et al. 2004
) we found strong support for the split between Corvoidea (i.e., crows) and Passerida (other oscines). A similar topology, except concerning the zebra finch/pied flycatcher relationship, was obtained using a three codon position partitioned GTR + Gamma4 model (data not shown).
ML tree obtained with a GTR + Gamma4 model and the complete nucleotide data set. Scale indicates substitution/site.
To investigate the influence of GC3 heterogeneity on phylogenetic inference, we split our data set according to codon position, separating the 3rd position from the 1st + 2nd positions. Interestingly, ML analyses of the two data sets differed with respect to the position of the budgerigar (a parrot) and on the zebra finch/blue tit/pied flycatcher relationship. In agreement with data from DNA hybridization (Sibley and Ahlquist 1990
) and whole mitochondrial genome sequence analyses (Pratt et al. 2009
), the 3rd codon position data set did not support budgerigar as a sister group to passerines () (Ericson et al. 2006
; Hackett et al. 2008
). This topology was most similar to the one obtained with the full data set (). In contrast, the 1st + 2nd codon position data set supported a closer relationship between budgerigar and passerines (), in partial agreement with recent results obtained with nuclear markers (Ericson et al. 2006
; Hackett et al. 2008
), but this topology grouped budgerigar and hummingbird but with moderate bootstrap support (55%; ). The 3rd codon position favored the zebra finch/flycatcher clade reported in Barker et al. (2004)
, whereas the 1st + 2nd codon position data set supported a blue tit/pied flycatcher cade reported in Johansson et al. (2008)
, each with comparable strong bootstrap support (>97; ). The blue tit/pied flycatcher /zebra finch relationship within the basal Passerida radiation is known to be notoriously difficult to resolve (for a review, see Johansson et al. 2008
). The different topologies obtained with the two data sets were not due a difference in the number of informative sites; jackknifing the 3rd codon position data set to a similar number of sites as for the 1st + 2nd codon position data set strongly supported the same topology as obtained using the full 3rd codon position data set.
FIG. 4. Phylogenetic trees based on transcriptome data. (A) ML tree obtained with a GTR + Gamma4 model and the third codon position data set. (B) ML tree obtained with a GTR + Gamma4 model and the 1st and 2nd codon position data set. (C) ML tree obtained with (more ...)
The above analyses suggest that GC3 content in the full nucleotide data set is significantly influencing the overall topology of the avian tree with important contradictions to GC1 + 2. To this hypothesis, we estimated the phylogeny using the full nucleotide data set with GC3 content normalized. To do so, we used the RY sequence recoding method, with purines coded as R and pyrimidines as Y. RY recoding of the transcriptome placed the budgerigar as a sister group to passerines, more consistent with the result obtained with the 1st + 2nd codon position data set, again with moderate bootstrap support (59%; ). This data set also supported the blue tit/flycatcher relationship (bootstrap support of 58%) as did the 1st + 2nd codon position data set. This finding indicates that the discrepancy between the phylogenetic results obtained with different codon positions is a result of the large heterogeneity in GC3.
To make an independent test of overall GC influence, we applied the nonhomogeneous model of nucleotide sequence evolution developed by Galtier and Gouy (1998)
. This model allows variation in GC content between branches (regardless of codon position) by relaxing the assumption of homogeneous base composition. We estimated the ML of several alternative topologies using a branch-specific nonhomogeneous model with the T92 + Gamma4 substitution model (supplementary fig. S1
, Supplementary Material
online). The highest likelihood was obtained for a topology (supplementary fig. S1C
, Supplementary Material
online) identical to the one obtained with the homogeneous model (). This result indicates that relaxing the stationary assumption is not sufficient to disrupt the budgerigar position and that additional features such as, for example, long-branch attraction (Felsenstein 1978
; Bergsten 2005
), among-site rate variation (Yang 1996
), and heterotachy (among branch/site rate variation) (Lopez et al. 2002
) contribute to this result.
Finally, we applied a CAT mixture model (Lartillot and Philippe 2004
) using Bayesian analysis on proteome (amino acid) data, which has been shown to outperform empirical substitution matrices in the case of long-branch attraction problems (e.g., Lartillot et al. 2007
; Delsuc et al. 2008
; Philippe et al. 2009
). Almost all relationships had high bootstrap support (), except for the zebra finch/blue tit/pied flycatcher node. Notably, the topology inferred with the CAT mixture model placed budgerigar as the sister group to passerines (). But unlike the nucleotide trees, it also brought doves closer to passerines.
FIG. 5. Unrooted proteome tree. Majority-consensus tree of Bayesian phylogenetic inference conducted under the CAT + Gamma4 mixture model using the software PHYLOBAYES. Values behind the nodes are site bootstrap support (100 times). Values in front of the nodes (more ...)
In summary, the position of budgerigar is strongly influenced by phylogenetic method and the data set used. Because the basal position is disrupted upon the use of 1) only the 1st + 2nd codon positions, 2) RY sequence recoding, and 3) a complex CAT mixture Bayesian model applied on proteome sequences, we suggest that the basal position of budgerigar indicated by the full data set was most likely an artifact of phylogenetic reconstruction at least in part caused by the similarity in base composition between the budgerigar and emu. Of course, we cannot exclude the possibility that there may have been other factors that contributed to this result.
Evolution of Base Composition in Birds
To further explore the variation in GC3 content among species and to understand the evolution of base composition in birds, we assessed the impact of missing data. We compared GC3 variation across species when estimated on the entire data set and when estimated on data sets based on sites only available in one species relative to all other species (replicated 11 times = the number of species). The rank order of all GC3 estimates using these reduced species-defined data sets correlated extremely well with the ranking obtained from the complete matrix (Spearman's Rho between 0.90 and 0.99, mean = 0.95; supplementary fig. S2
, Supplementary Material
online), showing that missing data does not explain the relative differences in GC3 among species. However, for some reduced data sets, GC3 of all species were systematically higher or lower than that obtained using the complete data set (supplementary fig. S2
, Supplementary Material
online). To quantify this deviation, we computed for each reduced data set the mean of the differences between GC3 of the complete data set and of the reduced data set. This value (deltaGC) provides an approximation of the GC3 deviation introduced by gene sampling in each species; a positive value, for example, indicates that the selected genes have, on average, a lower GC3 than the complete set of genes. The most extreme deviation was observed for the manakin and the budgerigar data sets with—1.7% and +1.9% deltaGC, respectively. Using these values, we calculated a corrected GC3 for each species by subtracting the deviation to the current GC3. After correcting for the missing data, there is still substantial variation among species in GC3 (, grey dots). For example, the extreme GC3 content of the budgerigar transcriptome cannot be explained by biased sampling and even seems to have been underestimated in the original analyses (original estimate = 46.1%, corrected estimate = 44.4%).
The observed variation in GC3 among species must translate into differences in the substitution pattern of third codon position among lineages. To estimate the ancestral GC3 at each node and the equilibrium GC3 (GC3*) for each branch, we applied a nonhomogeneous model of DNA sequence evolution (Galtier and Gouy 1998
) on third codon positions, using the proteome topology (). This analysis showed an increase in GC3 in most branches of the Neognath phylogeny as manifested by a higher GC3* () than ancestral GC3 (). The branch leading to the budgerigar has the smallest GC3* (52.4%) of all Neoaves branches, although this GC3* is still higher than the ancestral GC3 in Neoaves (43.0%). It suggests that the low-current GC3 estimated for the budgerigar is unlikely to come from a decrease of GC3 in the parrot lineage but rather represents a moderate increase of the low ancestral GC3. Interestingly, an increase in GC3 appears to have occurred independently in the Neoaves lineage and along the chicken branch, as GC3* at the ancestor node of all Neoaves (43.0 %, CI = 37.2–49.1; ) was smaller than GC3 estimated at the root of the Neognathes (47.3%, CI = 47.0–47.6; ). The ancestral Neoaves branch and the branch leading to ratites (emu) were the only lineages showing a GC3* below the ancestral root value (). These findings suggest that there has been convergence to a higher GC content in Neoaves and chicken lineages. The most marked increase in GC3* has taken place in the branch leading to passerines, with an extremely high GC3* (74.9%, CI = 72.2–77.3; ). This implies a strong fixation bias toward Gs and Cs in the third codon position of passerines. The fact that GC3* is higher in all terminal passerine branches than current GC3 of passerines demonstrates an on-going process of GC enrichment in this order.
(A) Estimation of ancestral GC3 content at each node and current GC3 for each species and (B) Equilibrium GC3* estimated for each branch of the avian phylogeny. Values in brackets show the 95% bootstrap (100 times) CIs.
We sought a functional genomic explanation for the large-scale variation and convergence of GC3 in the avian transcriptome and considered recombination. Recombination may drive the evolution of GC content through GC-biased gene conversion (gBGC) (Galtier et al. 2001
; Meunier and Duret 2004
; Spencer et al. 2006
). Of relevance here, recombination rates in avian genomes are highly heterogeneous and correlate negatively with both distance to chromosome ends (i.e., telomere) and chromosome size (ICGSC 2004
; Groenen et al. 2009
), with the telomere effect being particularly strong in zebra finch (Backström et al. 2010
). To test if the increase in GC3 in passerines is linked to an increased recombination rate, we performed two additional analyses.
First, we analyzed two gene sets defined as subtelomeric and subcentromeric genes, respectively, according to their position in the zebra finch genome (see Materials and Methods). For these two gene sets, the ancestral GC3, current GC3, and GC3* were almost consistently higher in subtelomeric compared with subcentromeric regions (). Moreover, we found that zebra finch genes located in subtelomeric regions were significantly more GC3-rich than orthologous chicken genes (55.4%, CI = 55.2–55.6 vs. 50.6%, CI = 50.3–50.8), a difference not found in subcentromeric regions (46.5% CI = 46.3–46.7 vs. 45.8% CI = 45.6–46.0). Moreover, the GC3* estimated for the ancestral passerine branch was strikingly higher for genes in subtelomeric compared with subcentromeric regions (82.6% CI = 80.0–85.0 vs. 57.7% CI = 54.0–60.1). These findings suggest that the strong increase in the GC3 in passerines is associated with a high-recombination rate environment.
Ancestral GC3, Current (observed) GC3, and Equilibrium GC3 (GC3*) in Subtelomeric (<15 Mb from chromosome ends) and Subcentromeric Regions (>15 Mb from chromosome ends) of Different Nodes and Branches of the Avian Phylogenetic Tree.
Second, we took advantage of the regional recombination rates that have recently been estimated for the zebra finch genome (Backström et al. 2010
), to seek a more direct relationship between evolution of GC3 and recombination rate. Using the 5 Mb windows defined above, we found a strong correlation between GC3 and the mean recombination rate per window (Spearman's Rho = 0.65, P
< 2 × 10−16
= 134, ). Interestingly, GC3* was also strongly correlated with recombination rate (Spearman's Rho = 0.54, P
= 11 × 10−11
= 134, ), suggesting that recombination might have a strong effect on the substitution pattern in the branch leading to the zebra finch. In agreement with a previous study of chicken, based on retrotransposon elements (Webster et al. 2006
), we found a strong correlation between GC3 and GC3* (Spearman's Rho = 0.67, P
< 2 × 10−16
= 134) with GC3* almost always higher than current GC3 (in 102 of 134 windows). These finding indicates that the increase in GC3 is still an on-going process in zebra finch protein-coding genes.
FIG. 7. Relationship between recombination rate (in cM/Mb; sex-averaged; extracted from Backström et al. 2010) and (A) the current GC3 as well as (B) the equilibrium GC3 (GC3*). Sizes of the circles are proportional to the length of the alignments used (more ...)
Substitution Rate Variation among Lineages
We also used the transcriptome data set to address nucleotide substitution rate variation among avian lineages. A simple molecular clock model (i.e., one rate for all branches) was strongly rejected by a likelihood-ratio test (χ2 = 2508, df = 10, P < 2 × 10−16). To further investigate substitution rate variation among lineages, we applied a Bayesian relaxed molecular clock method using the MCMCTREE software (version 4.4c). This method estimates simultaneously substitution rates and divergence dates across the phylogeny, although we here focus on substitution rates only. We used the topology found in the proteome Bayesian analysis as the reference topology ().
In order to be able to use the well-known bird/crocodile fossil calibration point at 235 and 254 My (Benton and Donoghue 2007
), we incorporated brain transcriptome sequences of an American alligator. A reciprocal Blast analysis found 612 crocodile orthologs in the set of 1,995 avian genes used in the phylogenetic matrix. After removal of genes using the same selection process as for the bird data, we obtained 488 alligator genes spanning 161,251 bp in exonic length, with 79.1% of missing data for the alligator. In addition to the bird/crocodile split at 235–254 My, we use as calibration points several suggested divergence times within Aves (Paleognathae–Neognathae split, oscine–suboscine split, and first split within Passerida), as described in Materials and Methods.
Excluding the branch leading to the American alligator, the mean substitution rate per branch was 0.683 (±0.300) substitutions/site/billion years. Some general conclusions about substitution rate evolution across the avian phylogeny can be drawn (; excluding short branches with large CIs). First, passerines evolve 45% more rapidly than other birds (0.768 vs. 0.530). This difference is even more pronounced when considering the flycatcher–zebra finch clade, having the highest rate of the whole tree with a mean of 0.895 (). Second, the long ratite branch leading to emu had the lowest rate of substitution, with 0.335. There is thus 2.6-fold variation in substitution rate between the most extreme lineages. Excluding the bird/crocodile calibration point does not affect these results (data now shown). Moreover, we obtain similar substitution rate variation using the local molecular clock approach implemented in BASEML (clock = 2, model GTR + Gamma4; Yoder and Yang 2000
). For example, using this approach we estimate that passerines evolve 48% faster than other birds.
Median Substitution Rate (substitution per site per billion of years) Estimated for Different Branches of the Avian Phylogeny.
To test if substitution rate estimation is related to gene sampling, we used the subset of sites only present in the emu (212,751 sites) and compared the results with that obtained using all data. We found that the rate estimates were extremely well correlated (Spearman's Rho = 0.96, P = 3.7 × 10−6, n = 22) and that the emu-defined subset of sites quantitatively reproduce the amount of variation estimated using the complete data set (for instance, for the emu-defined subset emu substitution rate = 0.343 and pied flycatcher = 0.840). This result shows that the low-substitution rate observed in the ratite branch was not the result of gene sampling bias. We also replicated rate estimates using only Anna's hummingbird- or Dove-specific sites and found highly comparable substitution rate estimation (data not shown).
Finally, because we estimated substitution rates using protein-coding genes, the observed rate variation among lineages could in theory be linked to mutation rate variation or differences in the regimes of natural selection acting on protein evolution (or both). In case of the latter, a selection model may imply that the overall rate of adaptive evolution or the overall rate of fixation of slightly deleterious mutations differ among avian lineages. Under both of these scenarios, the nonsynonymous substitution rate will vary among lineages, thereby contributing to the overall rate variation in protein-coding sequence. Lineage-specific variation in the role of natural selection has been suggested to occur due to long-term differences in the effective population size among lineages (Ellegren 2010
We assessed possible variation in the intensity of natural selection across the bird phylogeny by estimating the nonsynonymous/synonymous substitution rate ratio (ω) for specific branches: all passerines, all nonpasserine Neoaves, the chicken, and the emu lineages. Passerines showed an intermediate ω of 0.0784, slightly lower than that for nonpasserine Neoaves (0.0847) but higher than that for emu (0.0586) and chicken (0.0626). This result indicates that the increase in passerine substitution rate is not linked to selective processes affecting nonsynonymous substitution only but more likely to an increase in mutation rate affecting both synonymous and nonsynonymous substitution rates.