The miiuy croaker (Miichthys miiuy) is an important species of marine fish that supports capture fisheries and aquaculture. At present commercial scale aquaculture of this species is limited due to diseases caused by pathogens and parasites which restrict production and limit commercial value. The lack of transcriptomic and genomic information for the miiuy croaker limits the ability of researchers to study the pathogenesis and immune system of this species. In this study we constructed a cDNA library from liver, spleen and kidney which was sequenced using Illumina paired-end sequencing to enable gene discovery and molecular marker development.
In our study, a total of 69,071 unigenes with an average length of 572 bp were obtained. Of these, 45,676 (66.13%) were successfully annotated in public databases. The unigenes were also annotated with Gene Ontology, Clusters of Orthologous Groups and KEGG pathways. Additionally, 498 immune-relevant genes were identified and classified. Furthermore, 14,885 putative simple sequence repeats (cSSRs) and 8,510 putative single nucleotide polymorphisms (SNPs) were identified from the 69,071 unigenes.
The miiuy croaker (Miichthys miiuy) transcriptome data provides a large resource to identify new genes involved in many processes including those involved in the response to pathogens and diseases. Furthermore, the thousands of potential cSSR and SNP markers found in this study are important resources with respect to future development of molecular marker assisted breeding programs for the miiuy croaker.
Previous RNA-seq studies in the horse mapped sequence reads to the reference genome during transcriptome analysis. In this study, however, we focused on two main ideas. First, differentially expressed genes (DEGs) were identified by de novo-based analysis (DBA) of RNA-seq data from six Thoroughbreds before and after exercise, hereafter referred to as “de novo unique differentially expressed genes” (DUDEGs). Second, by integrating conventional DEGs with genes identified from whole genome re-sequencing (WGS) data as having been selected for during domestication of the Thoroughbred and Jeju pony, we propose a new concept for the definition of DEG. We identified 1,034 and 567 DUDEGs in skeletal muscle and blood, respectively. DUDEGs in skeletal muscle were significantly related to exercise-induced stress biological process gene ontology (BP-GO) terms (‘immune system process’, ‘response to stimulus’ and ‘death’) and to KEGG pathways (‘JAK-STAT signaling pathway’, ‘MAPK signaling pathway’, ‘regulation of actin cytoskeleton’ and ‘p53 signaling pathway’). In addition, we found TIMELESS, EIF4A3 and ZNF592 in blood, and CHMP4C and FOXO3 in skeletal muscle, to be common to DUDEGs and selected genes identified by evolutionary statistics such as FST and Cross Population Extended Haplotype Homozygosity (XP-EHH). Moreover, in Thoroughbreds, three out of five genes (CHMP4C, EIF4A3 and FOXO3) related to exercise response showed relatively low nucleotide diversity compared to the Jeju pony. DUDEGs are not only conceptually new DEGs that cannot be attained from reference-based analysis (RBA) but also support previous RBA results related to exercise in Thoroughbreds. In summary, three exercise-related genes that were selected for during domestication in the evolutionary history of the Thoroughbred were identified as conceptually new DEGs in this study.
The availability of many complete, annotated proteomes enables the systematic study of the relationships between protein conservation and functionality. We explore this question based solely on the presence or absence of protein homologues (a.k.a. conservation profiles). We study 18 metazoans, from two distinct points of view: the human's and the fly's. Using the GOrilla gene ontology (GO) analysis tool, we explore functional enrichment of the “universal proteins”, those with homologues in all 17 other species, and of the “non-universal proteins”. A large number of GO terms are strongly enriched in both human and fly universal proteins. Most of these functions are known to be essential. A smaller number of GO terms, exhibiting markedly different properties, are enriched in both human and fly non-universal proteins. We further explore the non-universal proteins, whose conservation profiles are consistent with the “tree of life” (TOL consistent), as well as the TOL inconsistent proteins. Finally, we applied Quantum Clustering to the conservation profiles of the TOL consistent proteins. Each cluster is strongly associated with one or a small number of specific monophyletic clades in the tree of life. The proteins in many of these clusters exhibit strong functional enrichment associated with the “life style” of the related clades. Most previous approaches for studying function and conservation are “bottom up”, studying protein families one by one, and separately assessing the conservation of each. By way of contrast, our approach is “top down”. We globally partition the set of all proteins hierarchically, as described above, and then identify protein families enriched within different subdivisions. While supporting previous findings, our approach also provides a tool for discovering novel relations between protein conservation profiles, functionality, and evolutionary history as represented by the tree of life.
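The notion of a conservation profile used above can be illustrated with a minimal sketch. The species list, protein names and homologue sets below are hypothetical toy data (the study itself uses 17 other species), and the function names are illustrative, not taken from the original work.

```python
# Hypothetical stand-in for the 17 "other" species of the study.
OTHER_SPECIES = ["fly", "worm", "mouse", "zebrafish"]


def conservation_profile(homologue_species, species=OTHER_SPECIES):
    """Return a presence/absence vector: 1 if a homologue exists in that species."""
    return tuple(int(s in homologue_species) for s in species)


def split_universal(profiles):
    """Partition proteins into universal (homologue in every species) and the rest."""
    universal = sorted(p for p, prof in profiles.items() if all(prof))
    non_universal = sorted(p for p, prof in profiles.items() if not all(prof))
    return universal, non_universal


# Toy input: reference protein -> set of species in which a homologue was found.
homologues = {
    "protA": {"fly", "worm", "mouse", "zebrafish"},
    "protB": {"mouse", "zebrafish"},
    "protC": set(),
}
profiles = {p: conservation_profile(s) for p, s in homologues.items()}
universal, non_universal = split_universal(profiles)
```

In this toy input only `protA` is universal; the profile of `protB`, `(0, 0, 1, 1)`, is the kind of vector that would then be clustered or checked for consistency with the tree of life.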
Scarab beetles exhibit an astonishing variety of rigid exo-skeletal outgrowths, known as “horns”. These traits are often sexually dimorphic and vary dramatically across species in size, shape, location, and allometry with body size. In many species, the horn exhibits disproportionate growth resulting in an exaggerated allometric relationship with body size, as compared to other traits, such as wings, that grow proportionately with body size. Depending on the species, the smallest males either do not produce a horn at all, or they produce a disproportionately small horn for their body size. While the diversity of horn shapes and their behavioural ecology have been reasonably well studied, we know far less about the proximate mechanisms that regulate horn growth. Thus, using 454 pyrosequencing, we generated transcriptome profiles, during horn growth and development, in two different scarab beetle species: the Asian rhinoceros beetle, Trypoxylus dichotomus, and the dung beetle, Onthophagus nigriventris. We obtained over half a million reads for each species that were assembled into over 6,000 and 16,000 contigs respectively. We combined these data with previously published studies to look for signatures of molecular evolution. We found a small subset of genes with horn-biased expression showing evidence for recent positive selection, as is expected with sexual selection on horn size. We also found evidence of relaxed selection present in genes that demonstrated biased expression between horned and horn-less morphs, consistent with the theory of developmental decoupling of phenotypically plastic traits.
The human steroid 21-hydroxylase gene (CYP21A2) participates in cortisol and aldosterone biosynthesis, and resides together with its paralogous (duplicated) pseudogene in a multiallelic copy number variation (CNV), called RCCX CNV. Concerted evolution caused by non-allelic gene conversion has been described in great ape CYP21 genes, and the same conversion activity is responsible for a serious genetic disorder of CYP21A2, congenital adrenal hyperplasia (CAH). In the current study, 33 CYP21A2 haplotype variants encoding 6 protein variants were determined from a European population. CYP21A2 was shown to be one of the most diverse human genes (HHe=0.949), but the diversity of intron 2 was greater still. Contrary to previous findings, the evolution of intron 2 did not follow concerted evolution, although the remaining part of the gene did. Fixed sites (different fixed alleles of sites in human CYP21 paralogues) significantly accumulated in intron 2, indicating that the excess of fixed sites was connected to the lack of effective non-allelic conversion and concerted evolution. Furthermore, positive selection was presumably focused on intron 2, and possibly associated with the previous genetic features. However, the positive selection detected by several neutrality tests was discerned along the whole gene. In addition, the clear signature of negative selection was observed in the coding sequence. The maintenance of the CYP21 enzyme function is critical, and could lead to negative selection, whereas the presumed gene regulation altering steroid hormone levels via intron 2 might help fast adaptation, which broadly characterizes the genes of human CNVs responding to the environment.
Selectome (http://selectome.unil.ch/) is a database of positive selection, based on a branch-site likelihood test. This model estimates the number of nonsynonymous substitutions (dN) and synonymous substitutions (dS) to evaluate the variation in selective pressure (dN/dS ratio) over branches and over sites. Since the original release of Selectome, we have benchmarked and implemented a thorough quality control procedure on multiple sequence alignments, aiming to minimize false-positive results. We have also improved the computational efficiency of the branch-site test implementation, allowing larger data sets and more frequent updates. Release 6 of Selectome includes all gene trees from Ensembl for Primates and Glires, as well as a large set of vertebrate gene trees. A total of 6810 gene trees have some evidence of positive selection. Finally, the web interface has been improved to be more responsive and to facilitate searches and browsing.
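As an illustration of how a likelihood ratio test of this kind is typically evaluated, the sketch below compares the log-likelihoods of a null model (no positive selection on the tested branch) and an alternative model. The log-likelihood values are hypothetical, and the use of a chi-square reference distribution with one degree of freedom is an assumption of this sketch, not a description of the Selectome pipeline.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null), floored at zero."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_1df(x):
    """Survival function of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def branch_site_pvalue(lnl_null, lnl_alt):
    """P-value of the test under the assumed chi-square (1 df) reference."""
    return chi2_sf_1df(lrt_statistic(lnl_null, lnl_alt))


# Hypothetical log-likelihoods for one branch of one gene tree.
p_value = branch_site_pvalue(lnl_null=-1000.0, lnl_alt=-997.0)
```

With many branches and gene trees tested, p-values obtained this way would normally be corrected for multiple testing before any branch is reported as positively selected.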
Fatty acid-binding proteins (FABPs) are a family of small fatty acid-binding proteins essential for lipid trafficking, energy storage and gene regulation. Although they have only 20 to 70% amino acid sequence identity, these proteins share a conserved tertiary structure comprising ten beta sheets and two alpha helices. Availability of the complete genomes of 34 invertebrates, together with transcriptomes and ESTs, allowed us to systematically investigate the gene structure and alternative splicing of FABP genes over a wide range of phyla. Only in the genomes of two cnidarian species could FABP genes not be identified. The genomic loci for FABP genes were diverse and their genomic structure varied. In particular, intronless FABP genes, in most of which the key residues involved in fatty acid binding varied, were common in five phyla. Interestingly, several species, including one trematode, one nematode and four arthropods, generated FABP mRNA variants via alternative splicing. These results demonstrate that both gene duplication and post-transcriptional modifications are used to generate diverse FABPs in the species studied.
Ribosomal loci represent a major tool for investigating environmental diversity and community structure via high-throughput marker gene studies of eukaryotes (e.g. 18S rRNA). Since the estimation of species’ abundance is a major goal of environmental studies (by counting numbers of sequences), understanding the patterns of rRNA copy number across species will be critical for informing such high-throughput approaches. Such knowledge is critical, given that ribosomal RNA genes exist within multi-copy repeated arrays in a genome. Here we measured the repeat copy number for six nematode species by mapping the sequences from whole genome shotgun libraries against reference sequences for their rRNA repeat. This revealed a 6-fold variation in repeat copy number amongst taxa investigated, with levels of intragenomic variation ranging from 56 to 323 copies of the rRNA array. By applying the same approach to four C. elegans mutation accumulation lines propagated by repeated bottlenecking for an average of ~400 generations, we find on average a 2-fold increase in repeat copy number (rate of increase in rRNA estimated at 0.0285-0.3414 copies per generation), suggesting that rRNA repeat copy number is subject to selection. Within each Caenorhabditis species, the majority of intragenomic variation found across the rRNA repeat was observed within gene regions (18S, 28S, 5.8S), suggesting that such intragenomic variation is not a product of selection for rRNA coding function. We find that the dramatic variation in repeat copy number among these six nematode genomes would limit the use of rRNA in estimates of organismal abundance. In addition, the unique pattern of variation within a single genome was uncorrelated with patterns of divergence between species, reflecting a strong signature of natural selection for rRNA function. 
A better understanding of the factors that control or affect copy number in these arrays, as well as their rates and patterns of evolution, will be critical for informing estimates of global biodiversity.
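The read-mapping approach described above can be sketched as a simple depth ratio. Normalising the mean depth over the rRNA repeat by the genome-wide mean depth is an assumption of this sketch (the abstract does not state the exact normalisation), and all numbers below are hypothetical.

```python
def estimate_copy_number(rrna_mean_depth, genome_mean_depth):
    """Estimate repeat copy number as rRNA read depth over single-copy genome depth."""
    if genome_mean_depth <= 0:
        raise ValueError("genome_mean_depth must be positive")
    return rrna_mean_depth / genome_mean_depth


def rate_per_generation(copies_start, copies_end, generations):
    """Average change in copy number per generation along one lineage."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return (copies_end - copies_start) / generations


# Hypothetical depths: ancestor and one mutation accumulation line.
ancestor_copies = estimate_copy_number(rrna_mean_depth=2800.0, genome_mean_depth=25.0)
line_copies = estimate_copy_number(rrna_mean_depth=5600.0, genome_mean_depth=25.0)
rate = rate_per_generation(ancestor_copies, line_copies, generations=400)
```

Because the estimate is a ratio of depths, any bias that affects the rRNA repeat differently from the rest of the genome (for example GC-dependent coverage) would carry directly into the copy number.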
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the node connectivity and the hierarchical structure.
The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred.
The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves, providing a finer level of annotation but slightly decreasing its complexity, and MF complexity remained stable.
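The size, connectivity and hierarchy metrics discussed in this abstract can be illustrated on a toy ontology. The graph below is hypothetical, and the exact metric definitions (relations per class as the connectivity measure, depth as the longest path to a root) are assumptions of this sketch rather than the definitions used in the study.

```python
from functools import lru_cache

# Hypothetical toy ontology: each class maps to the list of its parent classes.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}


def ontology_metrics(parents):
    """Compute simple size, connectivity and hierarchy metrics for a DAG."""
    classes = set(parents)
    n_relations = sum(len(ps) for ps in parents.values())
    # A leaf is a class that never appears as the parent of another class.
    has_child = {p for ps in parents.values() for p in ps}
    leaves = classes - has_child

    @lru_cache(maxsize=None)
    def depth(node):
        # Depth is taken here as the longest path from the node up to a root.
        ps = parents[node]
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    return {
        "classes": len(classes),
        "relations": n_relations,
        "relations_per_class": n_relations / len(classes),
        "leaf_proportion": len(leaves) / len(classes),
        "average_depth": sum(depth(c) for c in classes) / len(classes),
    }


metrics = ontology_metrics(PARENTS)
```

Computing the same dictionary for each monthly release would give one time series per metric, which is the kind of comparison the abstract describes across BP, CC and MF.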
The genome content of extant species is derived from that of ancestral genomes, distorted by evolutionary events such as gene duplications, transfers and losses. Reconciliation methods aim at recovering such events and at localizing them in the species history, by comparing gene family trees to species trees. These methods play an important role in studying genome evolution as well as in inferring orthology relationships. A major issue with reconciliation methods is that the reliability of predicted evolutionary events may be questioned for various reasons: Firstly, there may be multiple equally optimal reconciliations for a given species tree–gene tree pair. Secondly, reconciliation methods can be misled by inaccurate gene or species trees. Thirdly, predicted events may fluctuate with method parameters such as the cost or rate of elementary events. For all of these reasons, confidence values for predicted evolutionary events are sorely needed. It was recently suggested that the frequency of each event in the set of all optimal reconciliations could be used as a support measure. We put this proposition to the test here and also consider a variant where the support measure is obtained by additionally accounting for suboptimal reconciliations. Experiments on simulated data show the relevance of event supports computed by both methods, while resorting to suboptimal sampling was shown to be more effective. Unfortunately, we also show that, unlike the majority-rule consensus tree for phylogenies, there is no guarantee that a single reconciliation can contain all events having above 50% support. In this paper, we detail how to rely on the reconciliation graph to efficiently identify the median reconciliation. Such median reconciliation can be found in polynomial time within the potentially exponential set of most parsimonious reconciliations.
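The frequency-based support measure can be sketched as follows. Each reconciliation is represented as a set of hashable events; the event tuples are hypothetical toy data. The median is found here by brute force over an explicit list, which is only an illustration: the polynomial-time method described above relies on the reconciliation graph and is not reproduced in this sketch.

```python
from collections import Counter


def event_supports(reconciliations):
    """Support of an event = fraction of reconciliations that contain it."""
    counts = Counter(ev for rec in reconciliations for ev in set(rec))
    n = len(reconciliations)
    return {ev: c / n for ev, c in counts.items()}


def median_index(reconciliations):
    """Index of the reconciliation minimising total symmetric distance to all others.

    Minimising that distance is equivalent to maximising the sum of
    (support - 0.5) over the events the reconciliation contains.
    """
    supports = event_supports(reconciliations)
    scores = [sum(supports[ev] - 0.5 for ev in set(rec)) for rec in reconciliations]
    return max(range(len(reconciliations)), key=scores.__getitem__)


# Hypothetical events: (type, gene node, species node).
dup = ("D", "g1", "A")
loss = ("L", "g2", "B")
transfer = ("T", "g3", "C")
recs = [
    frozenset({dup, loss}),
    frozenset({dup, transfer}),
    frozenset({dup, loss, transfer}),
]
supports = event_supports(recs)
best = median_index(recs)
```

Here the duplication is present in every reconciliation (support 1.0), the other two events have support 2/3, and the third reconciliation is the median because it contains all three well-supported events.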
Glarea lozoyensis is a filamentous fungus used for the industrial production of the non-ribosomal peptide pneumocandin B0. As part of a whole genome sequencing project, the complete mitochondrial genome of the fungus has been assembled and annotated. It is the first mitochondrial genome from the large polyphyletic family Helotiaceae. A phylogenetic analysis was performed based on conserved proteins of the oxidative phosphorylation system in mitochondrial genomes.
The total size of the mitochondrial genome is 45,038 bp. It contains the expected 14 genes coding for proteins related to oxidative phosphorylation, two rRNA genes, six hypothetical proteins, three intronic genes of which two are homing endonucleases and a ribosomal protein rps3. Additionally, there is a set of 33 tRNA genes. All genes are located on the same strand. Phylogenetic analyses based on concatenated mitochondrial protein sequences confirmed that G. lozoyensis belongs to the order Helotiales and that it is most closely related to Phialocephala subalpina. However, a comparison with the three other mitochondrial genomes known from Helotialean species revealed remarkable differences in size, gene content and sequence. Moreover, it was found that the gene order found in P. subalpina and Sclerotinia sclerotiorum is not conserved in G. lozoyensis.
The arrangement of genes and other differences found between the mitochondrial genome of G. lozoyensis and those of other Helotiales indicate a broad genetic diversity within this large order. Further mitochondrial genomes are required in order to determine whether there is a continuous transition between the different forms of mitochondrial genomes or whether G. lozoyensis belongs to a distinct subgroup within Helotiales.
Transcriptome profiles provide a practical and inexpensive alternative to explore genomic data in non-model organisms, particularly in amphibians where the genomes are very large and complex. The odorous frog Odorrana margaretae (Anura: Ranidae) is a dominant species in the mountain stream ecosystem of western China. Limited knowledge of its genetic background has hindered research on this species, despite its importance in the ecosystem and as a biological resource. Here we report the transcriptome of O. margaretae in order to establish the foundation for genetic research. Using an Illumina sequencing platform, 62,321,166 raw reads were acquired. After a de novo assembly, 37,906 transcripts were obtained, and 18,933 transcripts were annotated to 14,628 genes. We functionally classified these transcripts by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). A total of 11,457 unique transcripts were assigned to 52 GO terms, and 1,438 transcripts were assigned to 128 KEGG pathways. Furthermore, we identified 27 potential antimicrobial peptides (AMPs), 50,351 single nucleotide polymorphism (SNP) sites, and 2,574 microsatellite DNA loci. The transcriptome profile of this species will shed more light on its genetic background and provide useful tools for future studies of this species, as well as other species in the genus Odorrana. It will also contribute to the accumulation of amphibian genomic data.
Paired box (PAX) genes are transcription factors that play important roles in embryonic development. Although the PAX gene family occurs in animals only, it is widely distributed. Among the vertebrates, its 9 genes appear to be the product of complete duplication of an original set of 4 genes, followed by an additional partial duplication. Although some studies of PAX genes have been conducted, no comprehensive survey of these genes across the entire taxonomic unit has yet been attempted. In this study, we conducted a detailed comparison of PAX sequences from 188 chordates, which revealed restricted variation. The absence of PAX4 and PAX8 among some species of reptiles and birds was notable; however, all 9 genes were present in all 74 mammalian genomes investigated. A search for signatures of selection indicated that all genes are subject to purifying selection, with a possible constraint relaxation in PAX4, PAX7, and PAX8. This result indicates asymmetric evolution of PAX family genes, which can be associated with the emergence of adaptive novelties in the chordate evolutionary trajectory.
The prominent attributes of foxtail millet (Setaria italica L.) including its small genome size, short life cycle, inbreeding nature, and phylogenetic proximity to various biofuel crops have made this crop an excellent model system to investigate various aspects of architectural, evolutionary and physiological significances in Panicoid bioenergy grasses. After release of its whole genome sequence, large-scale genomic resources in terms of molecular markers were generated for the improvement of both foxtail millet and its related species. Hence it is now essential to congregate, curate and make available these genomic resources for the benefit of researchers and breeders working towards crop improvement. In view of this, we have constructed the Foxtail millet Marker Database (FmMDb; http://www.nipgr.res.in/foxtail.html), a comprehensive online database for information retrieval, visualization and management of large-scale marker datasets with unrestricted public access. FmMDb is the first database which provides complete marker information to the plant science community attempting to produce elite cultivars of millet and bioenergy grass species, thus addressing global food insecurity.
An important challenge in drug discovery and disease prognosis is to predict genes that are preferentially expressed in one or a few tissues, i.e. showing considerably higher expression in those tissues than in the others. Although several data sources and methods have been published explicitly for this purpose, they often disagree, and it is not evident how to retrieve these genes or how to distinguish true biological findings from those that are due to choice of method and/or experimental settings. In this work we have developed a computational approach that combines results from multiple methods and datasets with the aim of eliminating method/study-specific biases and improving the predictability of preferentially expressed human genes. A rule-based score is used to merge and assign support to the results. Five sets of genes with known tissue specificity were used for parameter pruning and cross-validation. In total we identify 3434 tissue-specific genes. We compare the highest-scoring genes with the public databases PaGenBase (microarray), TiGER (EST) and HPA (protein expression data). The results have 85% overlap with PaGenBase, 71% with TiGER and only 28% with HPA. 99% of our predictions have support from at least one of these databases. Our approach also performs better than any of the databases at identifying drug targets and biomarkers with known tissue specificity.
MicroRNAs (miRNAs) are important regulators of gene expression at the post-transcriptional level in a wide range of species. Highly conserved miRNAs regulate ancestral transcription factors common to all plants, and control important basic processes such as cell division and meristem function. We selected 21 conserved miRNA families to analyze the distribution and maintenance of miRNAs. Recently, the first genome sequence in Palmaceae was released: date palm (Phoenix dactylifera). We conducted a systematic miRNA analysis in date palm, computationally identifying and characterizing the distribution and duplication of conserved miRNAs in this species compared to other published plant genomes. A total of 81 miRNAs belonging to 18 miRNA families were identified in date palm. The majority of miRNAs in date palm and seven other well-studied plant species were located in intergenic regions, 4 to 5 kb away from the nearest protein-coding genes. Sequence comparison showed that 67% of date palm miRNA members were present in duplicated segments, and that 135 pairs of miRNA-containing segments were duplicated in Arabidopsis, tomato, orange, rice, apple, poplar and soybean, with high similarity of non-coding sequences between duplicated segments, indicating that genomic duplication was a major force for expansion of conserved miRNAs. Duplicated miRNA pairs in date palm showed divergence in pre-miRNA sequence and in number of promoters, implying that these duplicated pairs may have undergone divergent evolution. Comparisons between date palm and the seven other plant species for the gain/loss of miR167 loci in an ancient segment shared between monocots and dicots suggested that these conserved miRNAs were highly influenced by, and diverged as a result of, genomic duplication events.
The Dlx gene family encodes transcription factors involved in the development of a wide variety of morphological innovations that first evolved at the origins of vertebrates or of the jawed vertebrates. This gene family expanded with the two rounds of genome duplications that occurred before jawed vertebrates diversified. It includes at least three bigene pairs sharing conserved regulatory sequences in tetrapods and teleost fish, but has been only partially characterized in chondrichthyans, the third major group of jawed vertebrates. Here we take advantage of developmental and molecular tools applied to the shark Scyliorhinus canicula to fill in the gap and provide an overview of the evolution of the Dlx family in the jawed vertebrates. These results are analyzed in the theoretical framework of the DDC (Duplication-Degeneration-Complementation) model.
The genomic organisation of the catshark Dlx genes is similar to that previously described for tetrapods. Conserved non-coding elements identified in bony fish were also identified in catshark Dlx clusters and showed regulatory activity in transgenic zebrafish. Gene expression patterns in the catshark showed that there are some expression sites with high conservation of the expressed paralog(s) and other expression sites with events of paralog sub-functionalization during jawed vertebrate diversification, resulting in a wide variety of evolutionary scenarios within this gene family.
Dlx gene expression patterns in the catshark show that there has been little neo-functionalization in Dlx genes over gnathostome evolution. In most cases, one tandem duplication and two rounds of vertebrate genome duplication have led to at least six Dlx coding sequences with redundant expression patterns followed by some instances of paralog sub-functionalization. Regulatory constraints such as shared enhancers, and functional constraints including gene pleiotropy, may have contributed to the evolutionary inertia leading to high redundancy between gene expression patterns.
As the major enamel matrix protein contributing to tooth development, amelogenin has been demonstrated to play a crucial role in tooth enamel formation. Previous studies have revealed amelogenin alternative splicing as a mechanism for amelogenin heterogeneous expression in mammals. While amelogenin and its splicing forms in mammals have been characterized, splicing variants of the amelogenin gene remain largely unknown in non-mammalian species. Here, using PCR and sequence analysis, we discovered two novel amelogenin transcript variants in tooth organ extracts from a caudate amphibian, the salamander Plethodon cinereus. One variant (S) was shorter (416 nucleotides including untranslated regions, 5 exons) and the other (L) was longer (851 nucleotides, 7 exons) than the previously published “normal” gene (M) in this species (812 nucleotides, 6 exons). This is the first report demonstrating amelogenin alternative splicing in an amphibian, revealing a unique exon 2b and two novel amelogenin gene transcripts in Plethodon cinereus.
Systematic determination of gene function is an essential step in fully understanding the precise contribution of each gene to the proper execution of molecular functions in the cell. Gene functional linkage describes the relationship among a group of genes with similar functions. With thousands of genomes sequenced, there arises a great opportunity to utilize gene evolutionary information to identify gene functional linkages. To this end, we established a computational method (called TRACE) to trace gene footprints through a gene functional network constructed from 341 prokaryotic genomes. The performance of TRACE was validated, and the method was successfully used to predict enzyme functions as well as pathway components. A so far undescribed chromosome partitioning-like protein, ro03654, of the oleaginous bacterium Rhodococcus sp. RHA1 (RHA1) was predicted and verified experimentally, with its deletion mutant showing growth inhibition compared to the RHA1 wild type. In addition, four proteins were predicted to act as prokaryotic SNARE-like proteins, and two of them were shown to be localized at the plasma membrane. Thus, we believe that TRACE is an effective new method to infer prokaryotic gene functional linkages by tracing evolutionary events.
Monocots are one of the most diverse, successful and economically important clades of angiosperms. We analysed the complete plastid genome sequences of two lilies, which were 152,793 bp long in Lilium longiflorum (Liliaceae) and 155,510 bp long in Alstroemeria aurea (Alstroemeriaceae). To clarify the phylogenetic relationships of the order Liliales, phylogenetic analyses were performed for 28 taxa, including major lineages of monocots, using the sequences of 79 plastid genes. The sister relationship of Liliales and Asparagales-commelinids was recovered with improved resolution. Comparative analyses of inter-familial and inter-specific sequence variation were also carried out among the three families Liliaceae, Smilacaceae and Alstroemeriaceae, and between the two Lilium species L. longiflorum and L. superbum. Gene content and order were conserved in the order Liliales except for the loss of infA in Smilax and Alstroemeria. IR boundaries were similar in IRa; however, IRb showed different extension patterns, at JLB in Smilax and at JSB in Alstroemeria. The Ka/Ks ratio was high in matK in the pair-wise comparisons of the three families. The most variable genes were psaJ, ycf1, rpl32, rpl22, matK and ccsA among the three families, and rps15, rpoA, matK and ndhF between the two Lilium species.
Plasmids have long been recognized as an important driver of DNA exchange and genetic innovation in prokaryotes. The success of plasmids has been attributed to their independent replication from the host's chromosome and their frequent self-transfer. It is thought that plasmids accumulate, rearrange and distribute nonessential genes, which may provide an advantage for host proliferation under selective conditions. In order to test this hypothesis independently of biases from culture selection, we study the plasmid metagenome from microbial communities in two activated sludge systems, one of which receives mostly household and the other chemical industry wastewater. We find that plasmids from activated sludge microbial communities carry among the largest proportion of unknown gene pools so far detected in metagenomic DNA, confirming their presumed role as DNA innovators. At a system level, both plasmid metagenomes were dominated by functions associated with replication and transposition, and contained a wide variety of antibiotic and heavy metal resistances. Plasmid families were very different in the two metagenomes and grouped in deep-branching new families compared with known plasmid replicons. A number of abundant plasmid replicons could be completely assembled directly from the metagenome, providing insight into plasmid composition without culturing bias. Functionally, the two metagenomes strongly differed in several ways, including a greater abundance of genes for carbohydrate metabolism in the industrial and of general defense factors in the household activated sludge plasmid metagenome. This suggests that plasmids contribute not only to the adaptation of individual prokaryotic species, but also to that of the prokaryotic community as a whole under local selective conditions.
metagenomic studies; mobilome
Although a large set of full-length transcripts was recently assembled in catfish, annotation of large gene families, especially those with duplications, is still a great challenge. Most often, complexities in annotation cause mis-identification and thereby much confusion in the scientific literature. As such, detailed phylogenetic analysis and/or orthology analysis are required for the annotation of genes that belong to gene families. The ATP-binding cassette (ABC) transporter gene superfamily is a large gene family that encodes membrane proteins that transport a diverse set of substrates across membranes, playing important roles in protecting organisms from diverse environments.
In this work, we identified a set of 50 ABC transporters in the catfish genome. Phylogenetic analysis allowed their identification and annotation into seven subfamilies, including 9 ABCA genes, 12 ABCB genes, 12 ABCC genes, 5 ABCD genes, 2 ABCE genes, 4 ABCF genes and 6 ABCG genes. Most ABC transporters are conserved among vertebrates, though cases of recent gene duplications and gene losses do exist. Gene duplications in catfish were found for ABCA1, ABCB3, ABCB6, ABCC5, ABCD3, ABCE1, ABCF2 and ABCG2.
The whole set of catfish ABC transporters provides essential genomic resources for future biochemical, toxicological and physiological studies of ABC drug efflux transporters. The establishment of orthologies should allow functional inferences from information on model species, though the functions of lineage-specific genes can be distinct because specific living environments impose different selection pressures.
An extant genome can be the descendant of an ancient polyploid genome. The genome aliquoting problem is to reconstruct the latter from the former such that the rearrangement distance (i.e., the number of genome rearrangements necessary to transform the former into the latter) is minimal. Though several heuristic algorithms have been published, here we sought improved algorithms for the problem with respect to the double cut and join (DCJ) distance. The new algorithm makes use of partial and contracted partial graphs, and locally minimizes the distance. Our test results with simulated data indicate that it reliably recovers the gene order of the ancestral polyploid genome even when the ancestor is ancient. We also compared the performance of our method with an earlier method using simulated data sets and found that our algorithm has higher accuracy. Vertebrates are known to have undergone two rounds of whole-genome duplication (2R-WGD) during early vertebrate evolution. We used the new algorithm to calculate the DCJ distance between three modern vertebrate genomes and their 2R-WGD ancestor and found that the rearrangement rate might have slowed down significantly since the 2R-WGD. The software AliquotG implementing the algorithm is available as an open-source package from our website (http://mosas.sysu.edu.cn/genome/download_softwares.php).
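To make the DCJ distance concrete, the following minimal Python sketch computes it for the simplest textbook case only: two genomes made of circular chromosomes with identical, single-copy gene content, where the distance equals the number of genes minus the number of cycles in the adjacency graph. This is not the aliquoting algorithm described above and is not taken from AliquotG; the function names and the signed-integer genome encoding are assumptions made for illustration.

```python
def adjacency_partners(genome):
    """Map each gene extremity to the extremity adjacent to it.

    genome: list of circular chromosomes, each a list of nonzero signed
    integers (hypothetical encoding; the sign gives gene orientation).
    An extremity is (gene, 't') for the tail or (gene, 'h') for the head.
    """
    partner = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # circular successor
            right = (abs(gene), 'h' if gene > 0 else 't')
            left = (abs(nxt), 't' if nxt > 0 else 'h')
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance_circular(a, b):
    """DCJ distance between circular genomes with equal gene content.

    Uses d = n - c, with n the number of genes and c the number of
    cycles in the adjacency graph of the two genomes.
    """
    pa, pb = adjacency_partners(a), adjacency_partners(b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must have the same gene content")
    n_genes = len(pa) // 2
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Walk the cycle, alternating adjacencies of a and of b.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return n_genes - cycles


if __name__ == "__main__":
    print(dcj_distance_circular([[1, 2, 3]], [[1, -2, 3]]))       # one inversion
    print(dcj_distance_circular([[1, 2, 3, 4]], [[1, 2], [3, 4]]))  # one fission
```

Linear chromosomes and duplicated genes, which the aliquoting problem must handle, need additional terms and are deliberately left out of this sketch.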
Developmental constraints have been postulated to limit the space of feasible phenotypes and thus shape animal evolution. These constraints have been suggested to be the strongest during either early or mid-embryogenesis, which corresponds to the early conservation model or the hourglass model, respectively. Conflicting results have been reported, but in recent studies of animal transcriptomes the hourglass model has been favored. Studies usually report descriptive statistics calculated for all genes over all developmental time points. This introduces dependencies between the sets of compared genes and may lead to biased results. Here we overcome this problem using an alternative modular analysis. We used the Iterative Signature Algorithm to identify distinct modules of genes co-expressed specifically in consecutive stages of zebrafish development. We then performed a detailed comparison of several gene properties between modules, allowing for a less biased and more powerful analysis. Notably, our analysis corroborated the hourglass pattern at the regulatory level, with sequences of regulatory regions being most conserved for genes expressed in mid-development, but not at the level of gene sequence, age, or expression, in contrast to some previous studies. The early conservation model was supported by gene duplication and gene birth, which were rarest for genes expressed in early development. Finally, for all gene properties, we observed the least conservation for genes expressed in late development or in the adult, consistent with both models. Overall, with the modular approach, we showed that different levels of molecular evolution follow different patterns of developmental constraints. Thus both models are valid, but with respect to different genomic features.
During development, vertebrate embryos pass through a “phylotypic” stage, during which their morphology is most similar between different species. This gave rise to the hourglass model, which predicts the highest developmental constraints during mid-embryogenesis. In the last decade, a large effort has been made to uncover the relation between developmental constraints and the evolution of the genome. Several studies reported gene characteristics that change according to the hourglass model, e.g. sequence conservation, age, or expression. Here, we first show that some of the previous conclusions do not hold up under detailed analysis of the data. Then, we discuss the disadvantages of the standard evo-devo approach, i.e. comparing descriptive statistics of all genes across development. Results of such analyses are biased by genes expressed constantly during development (housekeeping genes). To overcome this limitation, we use a modularization approach, which reduces the complexity of the data and ensures independence between the sets of genes that are compared. We identified distinct sets of genes (modules) with time-specific expression in zebrafish development and analyzed their conservation of sequence, gene expression, and regulatory elements, as well as their age and orthology relationships. Interestingly, we found different patterns of developmental constraints for different gene properties. Only conserved regulatory regions follow an hourglass pattern.