Ribosomal loci represent a major tool for investigating environmental diversity and community structure via high-throughput marker gene studies of eukaryotes (e.g. 18S rRNA). Since estimating species abundance by counting numbers of sequences is a major goal of environmental studies, understanding the patterns of rRNA copy number across species is critical for informing such high-throughput approaches, particularly because ribosomal RNA genes exist within multi-copy repeated arrays in a genome. Here we measured the repeat copy number for six nematode species by mapping the sequences from whole genome shotgun libraries against reference sequences for their rRNA repeat. This revealed a 6-fold variation in repeat copy number amongst taxa investigated, with levels of intragenomic variation ranging from 56 to 323 copies of the rRNA array. By applying the same approach to four C. elegans mutation accumulation lines propagated by repeated bottlenecking for an average of ~400 generations, we find on average a 2-fold increase in repeat copy number (rate of increase in rRNA estimated at 0.0285-0.3414 copies per generation), suggesting that rRNA repeat copy number is subject to selection. Within each Caenorhabditis species, the majority of intragenomic variation found across the rRNA repeat was observed within gene regions (18S, 28S, 5.8S), suggesting that such intragenomic variation is not a product of selection for rRNA coding function. We find that the dramatic variation in repeat copy number among these six nematode genomes would limit the use of rRNA in estimates of organismal abundance. In addition, the unique pattern of variation within a single genome was uncorrelated with patterns of divergence between species, reflecting a strong signature of natural selection for rRNA function.
A better understanding of the factors that control or affect copy number in these arrays, as well as their rates and patterns of evolution, will be critical for informing estimates of global biodiversity.
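The read-depth logic behind such a copy number estimate can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes that copy number is approximated by the ratio of mean read depth over the rRNA repeat to mean read depth over single-copy regions, and the function names and all depth values are hypothetical.

```python
from statistics import mean


def estimate_copy_number(rrna_depths, single_copy_depths):
    """Approximate repeat copy number as a ratio of mean read depths.

    Assumption (not stated in the text): per-base depth over single-copy
    regions serves as the one-copy baseline.
    """
    if not rrna_depths or not single_copy_depths:
        raise ValueError("both depth lists must be non-empty")
    baseline = mean(single_copy_depths)
    if baseline <= 0:
        raise ValueError("single-copy baseline depth must be positive")
    return mean(rrna_depths) / baseline


def copies_per_generation(ancestor_copies, line_copies, generations):
    """Average change in copy number per generation along one lineage."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return (line_copies - ancestor_copies) / generations


if __name__ == "__main__":
    # Hypothetical depths: about 1000x over the repeat, 10x over single-copy loci.
    copies = estimate_copy_number([1000, 1200, 800], [10, 10, 10])
    print(copies)
    # Hypothetical lineage that doubles from 100 to 200 copies in 400 generations.
    print(copies_per_generation(100, 200, 400))
```

Under this assumption, a two-fold difference in copy number between two taxa would translate directly into a two-fold difference in rRNA read counts per individual, which is why copy number variation complicates abundance estimates.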
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the node connectivity and the hierarchical structure.
The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred.
The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but decreasing slightly its complexity, and MF complexity remained stable.
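The kinds of metrics named above can be illustrated on a toy hierarchy. The sketch below is not the study's implementation, and its definitions are assumptions made for illustration: connectivity is taken as relations per class, and the depth of a class as the length of its longest path to a root. The class names are hypothetical.

```python
from functools import lru_cache


def complexity_metrics(parents):
    """Compute simple size, connectivity and hierarchy metrics for a DAG.

    `parents` maps each class to the list of its direct parents
    (an empty list for a root).
    """
    classes = set(parents)
    for direct_parents in parents.values():
        classes.update(direct_parents)
    n_relations = sum(len(ps) for ps in parents.values())
    # A leaf is a class that is never listed as a parent.
    has_child = {p for ps in parents.values() for p in ps}
    leaves = classes - has_child

    @lru_cache(maxsize=None)
    def depth(node):
        # Assumed definition: longest path from the node up to a root.
        ps = parents.get(node, [])
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    n = len(classes)
    return {
        "classes": n,
        "relations": n_relations,
        "connectivity": n_relations / n,
        "leaf_proportion": len(leaves) / n,
        "average_depth": sum(depth(c) for c in classes) / n,
    }


if __name__ == "__main__":
    toy = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"], "d": ["c"]}
    print(complexity_metrics(toy))
```

Adding a new leaf under an existing class raises the class count and the proportion of leaves while leaving the intermediate hierarchy unchanged, which is one way size and complexity can evolve differently.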
The genome content of extant species is derived from that of ancestral genomes, distorted by evolutionary events such as gene duplications, transfers and losses. Reconciliation methods aim at recovering such events and at localizing them in the species history, by comparing gene family trees to species trees. These methods play an important role in studying genome evolution as well as in inferring orthology relationships. A major issue with reconciliation methods is that the reliability of predicted evolutionary events may be questioned for various reasons. First, there may be multiple equally optimal reconciliations for a given species tree–gene tree pair. Second, reconciliation methods can be misled by inaccurate gene or species trees. Third, predicted events may fluctuate with method parameters such as the cost or rate of elementary events. For all of these reasons, confidence values for predicted evolutionary events are sorely needed. It was recently suggested that the frequency of each event in the set of all optimal reconciliations could be used as a support measure. We put this proposition to the test here and also consider a variant where the support measure is obtained by additionally accounting for suboptimal reconciliations. Experiments on simulated data show the relevance of event supports computed by both methods, while resorting to suboptimal sampling was shown to be more effective. Unfortunately, we also show that, unlike the majority-rule consensus tree for phylogenies, there is no guarantee that a single reconciliation can contain all events having above 50% support. In this paper, we detail how to rely on the reconciliation graph to efficiently identify the median reconciliation. Such a median reconciliation can be found in polynomial time within the potentially exponential set of most parsimonious reconciliations.
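A small sketch may help make the support measure and its caveat concrete. It assumes each reconciliation is represented simply as a set of event labels (a hypothetical encoding), computes each event's frequency across the set, and picks a median by brute force using symmetric-difference distance. The brute-force search is a stand-in for illustration only and is not the polynomial-time reconciliation-graph method described above.

```python
from collections import Counter


def event_support(reconciliations):
    """Fraction of reconciliations in which each event appears."""
    counts = Counter(e for rec in reconciliations for e in set(rec))
    n = len(reconciliations)
    return {event: count / n for event, count in counts.items()}


def majority_events(reconciliations, threshold=0.5):
    """Events whose support is strictly above the threshold."""
    support = event_support(reconciliations)
    return {event for event, s in support.items() if s > threshold}


def brute_force_median(reconciliations):
    """Reconciliation minimising total symmetric-difference distance.

    Illustrative only: enumerates the given reconciliations explicitly.
    """
    sets = [frozenset(r) for r in reconciliations]
    return min(sets, key=lambda r: sum(len(r ^ other) for other in sets))


if __name__ == "__main__":
    # Hypothetical events A, B, C; each occurs in two of three reconciliations.
    recs = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
    majority = majority_events(recs)
    print(sorted(majority))
    # No single reconciliation contains every majority event.
    print(any(majority <= r for r in recs))
```

In this toy case every event has support 2/3, yet no reconciliation holds all three, which mirrors the contrast with majority-rule consensus trees.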
Glarea lozoyensis is a filamentous fungus used for the industrial production of the non-ribosomal peptide pneumocandin B0. As part of a whole genome sequencing project, the complete mitochondrial genome of the fungus has been assembled and annotated. It is the first mitochondrial genome from the large polyphyletic Helotiaceae family. A phylogenetic analysis was performed based on conserved proteins of the oxidative phosphorylation system in mitochondrial genomes.
The total size of the mitochondrial genome is 45,038 bp. It contains the expected 14 genes coding for proteins related to oxidative phosphorylation, two rRNA genes, six hypothetical proteins, three intronic genes of which two are homing endonucleases and a ribosomal protein rps3. Additionally there is a set of 33 tRNA genes. All genes are located on the same strand. Phylogenetic analyses based on concatenated mitochondrial protein sequences confirmed that G. lozoyensis belongs to the order Helotiales and that it is most closely related to Phialocephala subalpina. However, a comparison with the three other mitochondrial genomes known from Helotialean species revealed remarkable differences in size, gene content and sequence. Moreover, it was found that the gene order found in P. subalpina and Sclerotinia sclerotiorum is not conserved in G. lozoyensis.
The arrangement of genes and other differences found between the mitochondrial genome of G. lozoyensis and those of other Helotiales indicate a broad genetic diversity within this large order. Further mitochondrial genomes are required in order to determine whether there is a continuous transition between the different forms of mitochondrial genomes or whether G. lozoyensis belongs to a distinct subgroup within Helotiales.
Transcriptome profiles provide a practical and inexpensive alternative to explore genomic data in non-model organisms, particularly in amphibians where the genomes are very large and complex. The odorous frog Odorrana margaretae (Anura: Ranidae) is a dominant species in the mountain stream ecosystem of western China. Limited knowledge of its genetic background has hindered research on this species, despite its importance in the ecosystem and as a biological resource. Here we report the transcriptome of O. margaretae in order to establish the foundation for genetic research. Using an Illumina sequencing platform, 62,321,166 raw reads were acquired. After a de novo assembly, 37,906 transcripts were obtained, and 18,933 transcripts were annotated to 14,628 genes. We functionally classified these transcripts by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). A total of 11,457 unique transcripts were assigned to 52 GO terms, and 1,438 transcripts were assigned to 128 KEGG pathways. Furthermore, we identified 27 potential antimicrobial peptides (AMPs), 50,351 single nucleotide polymorphism (SNP) sites, and 2,574 microsatellite DNA loci. The transcriptome profile of this species will shed more light on its genetic background and provide useful tools for future studies of this species, as well as other species in the genus Odorrana. It will also contribute to the accumulation of amphibian genomic data.
Paired box (PAX) genes are transcription factors that play important roles in embryonic development. Although the PAX gene family occurs in animals only, it is widely distributed. Among the vertebrates, its 9 genes appear to be the product of complete duplication of an original set of 4 genes, followed by an additional partial duplication. Although some studies of PAX genes have been conducted, no comprehensive survey of these genes across the entire taxonomic unit has yet been attempted. In this study, we conducted a detailed comparison of PAX sequences from 188 chordates, which revealed restricted variation. The absence of PAX4 and PAX8 among some species of reptiles and birds was notable; however, all 9 genes were present in all 74 mammalian genomes investigated. A search for signatures of selection indicated that all genes are subject to purifying selection, with a possible constraint relaxation in PAX4, PAX7, and PAX8. This result indicates asymmetric evolution of PAX family genes, which can be associated with the emergence of adaptive novelties in the chordate evolutionary trajectory.
The prominent attributes of foxtail millet (Setaria italica L.), including its small genome size, short life cycle, inbreeding nature, and phylogenetic proximity to various biofuel crops, have made this crop an excellent model system to investigate various aspects of architectural, evolutionary and physiological significance in Panicoid bioenergy grasses. After the release of its whole genome sequence, large-scale genomic resources in terms of molecular markers were generated for the improvement of both foxtail millet and its related species. Hence it is now essential to congregate, curate and make available these genomic resources for the benefit of researchers and breeders working towards crop improvement. In view of this, we have constructed the Foxtail millet Marker Database (FmMDb; http://www.nipgr.res.in/foxtail.html), a comprehensive online database for information retrieval, visualization and management of large-scale marker datasets with unrestricted public access. FmMDb is the first database which provides complete marker information to the plant science community attempting to produce elite cultivars of millet and bioenergy grass species, thus addressing global food insecurity.
An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. genes showing considerably higher expression in the given tissue(s) than in the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree and it is not evident how to retrieve these genes and how to distinguish true biological findings from those that are due to choice-of-method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim of eliminating method/study-specific biases and improving the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the genes of highest scores with the public databases PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap with PaGenBase, 71% with TiGER and only 28% with HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases on identifying drug targets and biomarkers with known tissue-specificity.
MicroRNAs (miRNAs) are important regulators of gene expression at the post-transcriptional level in a wide range of species. Highly conserved miRNAs regulate ancestral transcription factors common to all plants, and control important basic processes such as cell division and meristem function. We selected 21 conserved miRNA families to analyze the distribution and maintenance of miRNAs. Recently, the first genome sequence in Palmaceae was released: date palm (Phoenix dactylifera). We conducted a systematic miRNA analysis in date palm, computationally identifying and characterizing the distribution and duplication of conserved miRNAs in this species compared to other published plant genomes. A total of 81 miRNAs belonging to 18 miRNA families were identified in date palm. The majority of miRNAs in date palm and seven other well-studied plant species were located in intergenic regions, 4 to 5 kb away from the nearest protein-coding genes. Sequence comparison showed that 67% of date palm miRNA members were present in duplicated segments, and that 135 pairs of miRNA-containing segments were duplicated in Arabidopsis, tomato, orange, rice, apple, poplar and soybean with a high similarity of non-coding sequences between duplicated segments, indicating genomic duplication was a major force for expansion of conserved miRNAs. Duplicated miRNA pairs in date palm showed divergence in pre-miRNA sequence and in number of promoters, implying that these duplicated pairs may have undergone divergent evolution. Comparisons between date palm and the seven other plant species for the gain/loss of miR167 loci in an ancient segment shared between monocots and dicots suggested that these conserved miRNAs were highly influenced by and diverged as a result of genomic duplication events.
The Dlx gene family encodes transcription factors involved in the development of a wide variety of morphological innovations that first evolved at the origins of vertebrates or of the jawed vertebrates. This gene family expanded with the two rounds of genome duplications that occurred before jawed vertebrates diversified. It includes at least three bigene pairs sharing conserved regulatory sequences in tetrapods and teleost fish, but has been only partially characterized in chondrichthyans, the third major group of jawed vertebrates. Here we take advantage of developmental and molecular tools applied to the shark Scyliorhinus canicula to fill in the gap and provide an overview of the evolution of the Dlx family in the jawed vertebrates. These results are analyzed in the theoretical framework of the DDC (Duplication-Degeneration-Complementation) model.
The genomic organisation of the catshark Dlx genes is similar to that previously described for tetrapods. Conserved non-coding elements identified in bony fish were also identified in catshark Dlx clusters and showed regulatory activity in transgenic zebrafish. Gene expression patterns in the catshark showed that there are some expression sites with high conservation of the expressed paralog(s) and other expression sites with events of paralog sub-functionalization during jawed vertebrate diversification, resulting in a wide variety of evolutionary scenarios within this gene family.
Dlx gene expression patterns in the catshark show that there has been little neo-functionalization in Dlx genes over gnathostome evolution. In most cases, one tandem duplication and two rounds of vertebrate genome duplication have led to at least six Dlx coding sequences with redundant expression patterns followed by some instances of paralog sub-functionalization. Regulatory constraints such as shared enhancers, and functional constraints including gene pleiotropy, may have contributed to the evolutionary inertia leading to high redundancy between gene expression patterns.
As the major enamel matrix protein contributing to tooth development, amelogenin has been demonstrated to play a crucial role in tooth enamel formation. Previous studies have revealed amelogenin alternative splicing as a mechanism for heterogeneous amelogenin expression in mammals. While amelogenin and its splicing forms in mammals have been characterized, splicing variants of the amelogenin gene remain largely unknown in non-mammalian species. Here, using PCR and sequence analysis, we discovered two novel amelogenin transcript variants in tooth organ extracts from a caudate amphibian, the salamander Plethodon cinereus. One was shorter (S; 416 nucleotides including untranslated regions, 5 exons) and the other longer (L; 851 nucleotides, 7 exons) than the previously published “normal” gene in this species (M; 812 nucleotides, 6 exons). This is the first report demonstrating amelogenin alternative splicing in an amphibian, revealing a unique exon 2b and two novel amelogenin gene transcripts in Plethodon cinereus.
Systematic determination of gene function is an essential step in fully understanding the precise contribution of each gene to the proper execution of molecular functions in the cell. Gene functional linkage describes the relationship among a group of genes with similar functions. With thousands of genomes sequenced, there arises a great opportunity to utilize gene evolutionary information to identify gene functional linkages. To this end, we established a computational method (called TRACE) to trace gene footprints through a gene functional network constructed from 341 prokaryotic genomes. TRACE was validated and successfully used to predict enzyme functions as well as pathway components. A previously undescribed chromosome partitioning-like protein, ro03654, of the oleaginous bacterium Rhodococcus sp. RHA1 (RHA1) was predicted and verified experimentally, with its deletion mutant showing growth inhibition compared to the RHA1 wild type. In addition, four proteins were predicted to act as prokaryotic SNARE-like proteins, and two of them were shown to be localized at the plasma membrane. Thus, we believe that TRACE is an effective new method to infer prokaryotic gene functional linkages by tracing evolutionary events.
Monocots are one of the most diverse, successful and economically important clades of angiosperms. We analysed the complete plastid genome sequences of two lilies, which were 152,793 bp long in Lilium longiflorum (Liliaceae) and 155,510 bp long in Alstroemeria aurea (Alstroemeriaceae). Phylogenetic analyses were performed for 28 taxa including major lineages of monocots using the sequences of 79 plastid genes for clarifying the phylogenetic relationship of the order Liliales. The sister relationship of Liliales and Asparagales-commelinids was improved with high resolution. Comparative analyses of inter-familial and inter-specific sequence variation were also carried out among the three families Liliaceae, Smilacaceae, and Alstroemeriaceae, and between the two Lilium species L. longiflorum and L. superbum. Gene content and order were conserved in the order Liliales except for the loss of infA in Smilax and Alstroemeria. IR boundaries were similar in IRa; however, IRb showed different extension patterns as JLB of Smilax and JSB in Alstroemeria. The Ka/Ks ratio was high in matK among the pair-wise comparisons of the three families, and the most variable genes were psaJ, ycf1, rpl32, rpl22, matK, and ccsA among the three families and rps15, rpoA, matK, and ndhF between the two Lilium species.
Plasmids have long been recognized as an important driver of DNA exchange and genetic innovation in prokaryotes. The success of plasmids has been attributed to their independent replication from the host's chromosome and their frequent self-transfer. It is thought that plasmids accumulate, rearrange and distribute nonessential genes, which may provide an advantage for host proliferation under selective conditions. In order to test this hypothesis independently of biases from culture selection, we study the plasmid metagenome from microbial communities in two activated sludge systems, one of which receives mostly household and the other chemical industry wastewater. We find that plasmids from activated sludge microbial communities carry among the largest proportion of unknown gene pools so far detected in metagenomic DNA, confirming their presumed role as DNA innovators. At a system level both plasmid metagenomes were dominated by functions associated with replication and transposition, and contained a wide variety of antibiotic and heavy metal resistances. Plasmid families were very different in the two metagenomes and grouped in deep-branching new families compared with known plasmid replicons. A number of abundant plasmid replicons could be completely assembled directly from the metagenome, providing insight into plasmid composition without culturing bias. Functionally, the two metagenomes strongly differed in several ways, including a greater abundance of genes for carbohydrate metabolism in the industrial and of general defense factors in the household activated sludge plasmid metagenome. This suggests that plasmids contribute to the adaptation not only of individual prokaryotic species, but also of the prokaryotic community as a whole under local selective conditions.
metagenomic studies; mobilome
Although a large set of full-length transcripts was recently assembled in catfish, annotation of large gene families, especially those with duplications, is still a great challenge. Most often, complexities in annotation cause mis-identification and thereby much confusion in the scientific literature. As such, detailed phylogenetic analysis and/or orthology analysis are required for annotation of genes involved in gene families. The ATP-binding cassette (ABC) transporter gene superfamily is a large gene family that encodes membrane proteins that transport a diverse set of substrates across membranes, playing important roles in protecting organisms from diverse environments.
In this work, we identified a set of 50 ABC transporters in the catfish genome. Phylogenetic analysis allowed their identification and annotation into seven subfamilies, including 9 ABCA genes, 12 ABCB genes, 12 ABCC genes, 5 ABCD genes, 2 ABCE genes, 4 ABCF genes and 6 ABCG genes. Most ABC transporters are conserved among vertebrates, though cases of recent gene duplications and gene losses do exist. Gene duplications in catfish were found for ABCA1, ABCB3, ABCB6, ABCC5, ABCD3, ABCE1, ABCF2 and ABCG2.
The whole set of catfish ABC transporters provides the essential genomic resources for future biochemical, toxicological and physiological studies of ABC drug efflux transporters. The establishment of orthologies should allow functional inferences with the information from model species, though the function of lineage-specific genes can be distinct because of specific living environments with different selection pressures.
An extant genome can be the descendant of an ancient polyploid genome. The genome aliquoting problem is to reconstruct the latter from the former such that the rearrangement distance (i.e., the number of genome rearrangements necessary to transform the former into the latter) is minimal. Though several heuristic algorithms have been published, here we sought improved algorithms for the problem with respect to the double cut and join (DCJ) distance. The new algorithm makes use of partial and contracted partial graphs, and locally minimizes the distance. Our test results with simulation data indicate that it reliably recovers the gene order of the ancestral polyploid genome even when the ancestor is ancient. We also compared the performance of our method with an earlier method using simulation data sets and found that our algorithm has higher accuracy. It is known that vertebrates underwent two rounds of whole-genome duplication (2R-WGD) during early vertebrate evolution. We used the new algorithm to calculate the DCJ distance between three modern vertebrate genomes and their 2R-WGD ancestor and found that the rearrangement rate might have slowed down significantly since the 2R-WGD. The software AliquotG implementing the algorithm is available as an open-source package from our website (http://mosas.sysu.edu.cn/genome/download_softwares.php).
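For readers unfamiliar with the DCJ distance itself, the sketch below computes it for the simplest setting only: two genomes with identical, unduplicated gene content and circular chromosomes, where the distance equals the number of genes minus the number of cycles in the adjacency graph. It does not implement the aliquoting algorithm or AliquotG, it does not handle duplicated genes or linear chromosomes, and the signed-integer genome encoding is an assumption made for illustration.

```python
def _adjacencies(genome):
    """Map each gene extremity to the extremity it is adjacent to.

    A genome is a list of circular chromosomes; each chromosome is a list of
    signed integers (sign = orientation). Each gene must occur exactly once.
    """
    partner = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]
            # Reading left to right, +g runs tail -> head and -g runs head -> tail.
            right = (abs(gene), "h" if gene > 0 else "t")
            left = (abs(nxt), "t" if nxt > 0 else "h")
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance for circular genomes with equal, unduplicated gene content."""
    adj_a, adj_b = _adjacencies(genome_a), _adjacencies(genome_b)
    if adj_a.keys() != adj_b.keys():
        raise ValueError("genomes must contain the same genes")
    seen, cycles = set(), 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        node = start
        # Walk the cycle by alternating adjacencies of genome A and genome B.
        while node not in seen:
            seen.add(node)
            other = adj_a[node]
            seen.add(other)
            node = adj_b[other]
    n_genes = len(adj_a) // 2
    return n_genes - cycles


if __name__ == "__main__":
    print(dcj_distance([[1, 2, 3]], [[1, 2, 3]]))        # identical genomes
    print(dcj_distance([[1, 2, 3]], [[1, -2, 3]]))       # one inversion
    print(dcj_distance([[1, 2, 3, 4]], [[1, 2], [3, 4]]))  # one fission
```

Handling an ancestral polyploid requires resolving which copy of each duplicated gene corresponds to which, which is the hard part of the aliquoting problem and is deliberately left out here.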
Developmental constraints have been postulated to limit the space of feasible phenotypes and thus shape animal evolution. These constraints have been suggested to be the strongest during either early or mid-embryogenesis, which corresponds to the early conservation model or the hourglass model, respectively. Conflicting results have been reported, but in recent studies of animal transcriptomes the hourglass model has been favored. Studies usually report descriptive statistics calculated for all genes over all developmental time points. This introduces dependencies between the sets of compared genes and may lead to biased results. Here we overcome this problem using an alternative modular analysis. We used the Iterative Signature Algorithm to identify distinct modules of genes co-expressed specifically in consecutive stages of zebrafish development. We then performed a detailed comparison of several gene properties between modules, allowing for a less biased and more powerful analysis. Notably, our analysis corroborated the hourglass pattern at the regulatory level, with sequences of regulatory regions being most conserved for genes expressed in mid-development but not at the level of gene sequence, age, or expression, in contrast to some previous studies. The early conservation model was supported with gene duplication and birth that were the most rare for genes expressed in early development. Finally, for all gene properties, we observed the least conservation for genes expressed in late development or adult, consistent with both models. Overall, with the modular approach, we showed that different levels of molecular evolution follow different patterns of developmental constraints. Thus both models are valid, but with respect to different genomic features.
During development, vertebrate embryos pass through a “phylotypic” stage, during which their morphology is most similar between different species. This gave rise to the hourglass model, which predicts the highest developmental constraints during mid-embryogenesis. In the last decade, a large effort has been made to uncover the relation between developmental constraints and the evolution of the genome. Several studies reported gene characteristics that change according to the hourglass model, e.g. sequence conservation, age, or expression. Here, we first show that some of the previous conclusions do not hold up under detailed analysis of the data. Then, we discuss the disadvantages of the standard evo-devo approach, i.e. comparing descriptive statistics of all genes across development. Results of such analysis are biased by genes expressed constantly during development (housekeeping genes). To overcome this limitation, we use a modularization approach, which reduces the complexity of the data and ensures independence between the sets of genes which are compared. We identified distinct sets of genes (modules) with time-specific expression in zebrafish development and analyzed their conservation of sequence, gene expression, and regulatory elements, as well as their age and orthology relationships. Interestingly, we found different patterns of developmental constraints for different gene properties. Only conserved regulatory regions follow an hourglass pattern.
Phytoplasmas are a group of bacteria that are associated with hundreds of plant diseases. Due to their economic importance and the difficulties involved in the experimental study of these obligate pathogens, genome sequencing and comparative analysis have been utilized as powerful tools to understand phytoplasma biology. To date four complete phytoplasma genome sequences have been published. However, these four strains represent limited phylogenetic diversity. In this study, we report the shotgun sequencing and evolutionary analysis of a peanut witches'-broom (PnWB) phytoplasma genome. The availability of this genome provides the first representative of the 16SrII group and substantially improves the taxon sampling to investigate genome evolution. The draft genome assembly contains 13 chromosomal contigs with a total size of 562,473 bp, covering ∼90% of the chromosome. Additionally, a complete plasmid sequence is included. Comparisons among the five available phytoplasma genomes reveal differences in gene content and metabolic capacity. Notably, phylogenetic inferences of the potential mobile units (PMUs) in these genomes indicate that horizontal transfer may have occurred between divergent phytoplasma lineages. Because many effectors are associated with PMUs, the horizontal transfer of these transposon-like elements can contribute to the adaptation and diversification of these pathogens. In summary, the findings from this study highlight the importance of improving taxon sampling when investigating genome evolution. Moreover, the currently available sequences are inadequate to fully characterize the pan-genome of phytoplasmas. Future genome sequencing efforts to expand phylogenetic diversity are essential in improving our understanding of phytoplasma evolution.
The crucian carp is an important aquaculture species and a potential model to study genome evolution and physiological adaptation. However, so far the genomics and transcriptomics data available for this species are still scarce. We performed de novo transcriptome sequencing of four cDNA libraries representing brain, muscle, liver and kidney tissues respectively, each with six specimens. The removal of low quality reads resulted in 2.62 million raw reads, which were assembled as 127,711 unigenes, including 84,867 isotigs and 42,844 singletons. A total of 22,273 unigenes were found with significant matches to 14,449 unique proteins. Around 14,398 unigenes were assigned with at least one Gene Ontology (GO) category in 84,876 total assignments, and 6,382 unigenes were found in 237 predicted KEGG pathways. The gene expression analysis revealed more genes expressed in brain, more up-regulated genes in muscle and more down-regulated genes in liver as compared with gene expression profiles of other tissues. In addition, 23 enzymes in the glycolysis/gluconeogenesis pathway were recovered. Importantly, we identified 5,784 high-quality putative SNPs and 11,295 microsatellite markers, which include 5,364 microsatellites with flanking sequences ≥50 bp. This study produced the most comprehensive genomic resources that have been derived from crucian carp, including thousands of genetic markers, which will not only lay a foundation for further studies on polyploidy origin and anoxic survival but will also facilitate selective breeding of this important aquaculture species.
Background and Objectives
Analysis of positively-selected genes can help us understand how humans evolved, especially the evolution of highly developed cognitive functions. However, previous works have reached conflicting conclusions regarding whether human neuronal genes are over-represented among genes under positive selection.
Methods and Results
We divided positively-selected genes into four groups according to the identification approaches, compiling a comprehensive list from 27 previous studies. We showed that genes that are highly expressed in the central nervous system are enriched in recent positive selection events in human history identified by intra-species genomic scan, especially in brain regions related to cognitive functions. This pattern holds when different datasets, parameters and analysis pipelines were used. Functional category enrichment analysis supported these findings, showing that synapse-related functions are enriched in genes under recent positive selection. In contrast, immune-related functions, for instance, are enriched in genes under ancient positive selection revealed by inter-species coding region comparison. We further demonstrated that most of these patterns still hold even after controlling for genomic characteristics that might bias genome-wide identification of positively-selected genes including gene length, gene density, GC composition, and intensity of negative selection.
Our rigorous analysis resolved previous conflicting conclusions and revealed recent adaptation of human brain functions.
Positive selection is widely estimated from protein coding sequence alignments by the nonsynonymous-to-synonymous ratio ω. Increasingly elaborate codon models are used in a likelihood framework for this estimation. Although there is widespread concern about the robustness of the estimation of the ω ratio, more effort is needed to assess this robustness, especially in the context of complex models. Here, we focused on the branch-site codon model. We investigated its robustness on a large set of simulated data. First, we investigated the impact of sequence divergence. We found evidence of underestimation of the synonymous substitution rate (dS) for values as small as 0.5, with a slight increase in false positives for the branch-site test. When dS increases further, underestimation of dS is worse, but false positives decrease. Interestingly, the detection of true positives follows a similar distribution, with a maximum for intermediate values of dS. Thus, high dS is more of a concern for a loss of power (false negatives) than for false positives of the test. Second, we investigated the impact of GC content. We showed that there is no significant difference in false positives between high GC (up to ∼80%) and low GC (∼30%) genes. Moreover, neither shifts of GC content on a specific branch nor major shifts in GC along the gene sequence generate many false positives. Our results confirm that the branch-site test is very conservative.
adaptive evolution; codon model; base composition
Many studies have reported horizontal gene transfer (HGT) events from eukaryotes, especially fungi. However, only a few investigations summarized multiple interkingdom HGTs involving important phytopathogenic species of Pyrenophora and few have investigated the genetic contributions of HGTs to fungi. We investigated HGT events in P. teres and P. tritici-repentis and discovered that both species harbored 14 HGT genes derived from bacteria and plants, including 12 HGT genes that occurred in both species. One gene coding a leucine-rich repeat protein was present in both species of Pyrenophora and it may have been transferred from a host plant. The transfer of genes from a host plant to pathogenic fungi has been reported rarely and we discovered the first evidence for this transfer in phytopathogenic Pyrenophora. Two HGTs in Pyrenophora underwent subsequent duplications. Some HGT genes had homologs in a few other fungi, indicating relatively ancient transfer events. Functional analyses indicated that half of the HGT genes encoded extracellular proteins and these may have facilitated the infection of plants by Pyrenophora via interference with plant defense-response and the degradation of plant cell walls. Some other HGT genes appeared to participate in carbohydrate metabolism. Together, these functions implied that HGTs may have led to highly efficient mechanisms of infection as well as the utilization of host carbohydrates. Evolutionary analyses indicated that HGT genes experienced amelioration, purifying selection, and accelerated evolution. These appeared to constitute adaptations to the background genome of the recipient. The discovery of multiple interkingdom HGTs in Pyrenophora, their significance to infection, and their adaptive evolution, provided valuable insights into the evolutionary significance of interkingdom HGTs from multiple donors.
ZBED genes originate from domesticated hAT DNA transposons and encode regulatory proteins of diverse function in vertebrates. Here we reveal the evolutionary relationships among ZBED genes and demonstrate that they are derived from at least two independent domestication events in jawed vertebrate ancestors. We show that ZBEDs form two monophyletic clades, one of which has expanded through several independent duplications in host lineages. Subsequent diversification of ZBED genes has facilitated regulation of multiple diverse fundamental functions. In contrast to known examples of transposable element exaptation, our results demonstrate an unprecedented capacity for the repeated utilization of a family of transposable element-derived protein domains sequestered as regulators during the evolution of diverse host gene functions in vertebrates. Specifically, ZBEDs have contributed to vertebrate regulatory innovation through the donation of modular DNA and protein interacting domains. We find that C7ORF29, ZBED2, 3, 4, and ZBEDX form a monophyletic group together with ZBED6, which is distinct from ZBED1 genes. Furthermore, we show that ZBED5 is related to Buster DNA transposons and is phylogenetically separate from other ZBEDs. Our results offer new insights into the evolution of regulatory pathways, and suggest that DNA transposons have contributed to regulatory complexity during genome evolution in vertebrates.
The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly used distance-based methods, though not as accurate as maximum likelihood methods applied to good-quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to a previously published analysis of the same dataset using conventional methods. Taken together, these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.
As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we have identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: fully or partially duplicated experiments from independent data submissions, Affymetrix chips reused in several experiments, or reused within an experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify future potential duplicates in RNA-Seq data.
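The abstract does not describe how duplicates are detected. One simple way to flag chips with identical content, offered here only as a sketch under that caveat, is to fingerprint each chip's probe intensities and group chips that share a fingerprint. The function names, the input layout (chip ID mapped to a probe-to-value dictionary), and the choice of SHA-256 are all assumptions made for illustration and are not taken from Bgee.

```python
import hashlib
from collections import defaultdict


def chip_fingerprint(probe_values):
    """Hash a chip's probe -> intensity mapping, independent of probe order."""
    digest = hashlib.sha256()
    for probe_id in sorted(probe_values):
        line = f"{probe_id}\t{probe_values[probe_id]!r}\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def find_duplicate_chips(chips):
    """Return groups of chip IDs whose probe intensities are identical.

    chips: dict mapping chip ID -> dict of probe ID -> intensity value.
    """
    groups = defaultdict(list)
    for chip_id, values in chips.items():
        groups[chip_fingerprint(values)].append(chip_id)
    # Only fingerprints shared by two or more chips indicate duplication.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Exact matching of this kind only catches fully identical chips, whether they are reused across experiments or within one. Partially duplicated experiments would need a looser comparison.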