The methylotrophic yeast Pichia pastoris is widely used as a bioengineering platform for producing industrial and biopharmaceutical proteins, studying protein expression and secretion mechanisms, and analyzing metabolite synthesis and peroxisome biogenesis. With the development of DNA microarray and mRNA sequencing technology, the P. pastoris transcriptome has become a research hotspot because transcriptome analysis can identify transcript structures and give insight into the transcriptional regulation model of cells under protein production conditions. The study of the P. pastoris transcriptome helps to annotate the P. pastoris transcript structures and provides useful information for further improving the production of recombinant proteins.
We used a massively parallel mRNA sequencing platform (RNA-Seq), based on next-generation sequencing technology, to map and quantify the dynamic transcriptome of P. pastoris at the genome scale under growth conditions with glycerol and methanol as substrates. The results describe the transcription landscape at the whole-genome level and provide annotated transcript structures, including untranslated regions (UTRs), alternative splicing (AS) events, novel transcripts, new exons, alternative upstream initiation codons (uATGs), and upstream open reading frames (uORFs). Internal ribosome entry sites (IRESes) were first identified within the UTRs of genes from P. pastoris, encoding kinases and the proteins involved in the control of growth. We also provide a transcriptional regulation model for P. pastoris grown on different carbon sources.
We suggest that the IRES-dependent translation initiation mechanism also exists in P. pastoris. Retained introns (RIs) were identified as the main AS event and are produced predominantly by an intron definition (ID) mechanism. Our results describe the metabolic characteristics of P. pastoris with heterologous protein production under methanol induction and provide rich information for further in-depth studies of P. pastoris protein expression and secretion mechanisms.
RNA-Seq; Transcriptome; Pichia pastoris; Methanol induction; Internal ribosome entry site (IRES); Translation initiation mechanism
High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is estimating a priori the number of reads per sample needed to detect and quantify thousands of individual transcripts with a large dynamic range of abundance.
We have conducted a systematic examination of how changes in the number of RNA-Seq reads per sample influence both profiling of a single bacterial transcriptome and the comparison of gene expression among samples. Our findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and numerous genes differentially expressed between samples can be identified.
Our analysis provides a guide for the many researchers seeking to determine the appropriate sequencing depth for RNA-Seq-based studies of diverse bacterial species.
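The depth question raised in this abstract can be explored with a simple subsampling experiment. The following is a minimal sketch, not the authors' pipeline: it simulates reads from a hypothetical transcriptome and counts how many genes are detected in nested subsamples of increasing depth. The gene count, the 1/rank abundance profile, the chosen depths, and the function name `detection_curve` are all illustrative assumptions.

```python
import random


def detection_curve(abundances, depths, seed=0):
    """Count genes detected (>= 1 read) at each depth, using nested subsamples.

    abundances: relative expression level per gene (hypothetical values).
    depths: read counts at which to evaluate detection.
    """
    rng = random.Random(seed)
    genes = list(range(len(abundances)))
    # Draw the deepest "sequencing run" once; shallower depths are prefixes of it,
    # so each shallower sample is a subsample of the deeper one.
    reads = rng.choices(genes, weights=abundances, k=max(depths))
    return {depth: len(set(reads[:depth])) for depth in sorted(depths)}


# Hypothetical transcriptome: 2,000 genes with a wide dynamic range of abundance.
abundances = [1.0 / (rank + 1) for rank in range(2000)]
curve = detection_curve(abundances, [1_000, 10_000, 100_000])
for depth, detected in curve.items():
    print(f"{depth:>7} reads -> {detected} of {len(abundances)} genes detected")
```

With real data, the simulated draw would be replaced by random subsamples of the aligned reads; the curve flattening as depth grows is the saturation behaviour the abstract describes.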
Copy number variation (CNV) is a major source of structural variants and has been commonly identified in mammalian genomes. It is associated with gene expression and may represent a major genetic component of phenotypic diversity. Unlike many other mammalian genomes where CNVs have been well annotated, studies of porcine CNV in diverse breeds are still limited.
Here we used the Porcine SNP60 BeadChip and the PennCNV algorithm to identify 1,315 putative CNVs belonging to 565 CNV regions (CNVRs) in 1,693 pigs from 18 diverse populations. In total, 538 of the 683 CNVs identified in a White Duroc × Erhualian F2 population fit Mendelian transmission, and 6 of 7 randomly selected CNVRs were confirmed by quantitative real-time PCR. CNVRs were non-randomly distributed in the pig genome. Several CNV hotspots were found on pig chromosomes 6, 11, 13, 14 and 17. CNV numbers differ greatly among pig populations. Duroc pigs had the largest number of CNVs per individual. Among 1,765 transcripts located within the CNVRs, 634 genes have been reported to be copy number variable genes in the human genome. By integrating QTL mapping, CNVRs and the description of phenotypes in knockout mice, we identified 7 copy number variable genes as candidate genes for phenotypes related to carcass length, backfat thickness, abdominal fat weight, scapula length, intermuscular fat content of the longissimus muscle, body weight at 240 days, glycolytic potential of the longissimus muscle, mean corpuscular hemoglobin, mean corpuscular volume and humerus diameter.
We revealed the distribution of an unprecedented 565 CNVRs in the pig genome and investigated copy number variable genes as possible candidate genes for phenotypic traits. These findings give novel insights into porcine CNVs and provide resources to facilitate the identification of trait-related CNVs.
Copy number variation; Copy number variable gene; Complex trait; QTL; Pig
Trypanosoma cruzi, the causal agent of Chagas Disease, affects more than 16 million people in Latin America. The clinical outcome of the disease results from a complex interplay between environmental factors and the genetic background of both the human host and the parasite. However, knowledge of the genetic diversity of the parasite is currently limited to a number of highly studied loci. The availability of a number of genomes from different evolutionary lineages of T. cruzi provides an unprecedented opportunity to look at the genetic diversity of the parasite at a genomic scale.
Using a bioinformatic strategy, we have clustered T. cruzi sequence data available in the public domain and obtained multiple sequence alignments in which one or two alleles from the reference CL-Brener were included. These data cover 4 major evolutionary lineages (DTUs): TcI, TcII, TcIII, and the hybrid TcVI. Using this set of alignments, we have identified 288,957 high-quality single nucleotide polymorphisms and 1,480 indels. In a reduced re-sequencing study we were able to validate ~97% of high-quality SNPs identified in 47 loci. Analysis of how these changes affect encoded protein products showed a 0.77 ratio of synonymous to non-synonymous changes in the T. cruzi genome. We observed 113 changes that introduce or remove a stop codon, some causing significant functional changes, and a number of tri-allelic and tetra-allelic SNPs that could be exploited in strain typing assays. Based on an analysis of the observed nucleotide diversity, we show that the T. cruzi genome contains a core set of genes that are under apparent purifying selection. Interestingly, orthologs of known druggable targets show statistically significantly lower nucleotide diversity values.
This study provides the first look at the genetic diversity of T. cruzi at a genomic scale. The analysis covers an estimated ~60% of the genetic diversity present in the population, providing an essential resource for future studies on the development of new drugs and diagnostics for Chagas Disease. These data are available through the TcSNP database (http://snps.tcruzi.org).
MicroRNAs (miRNAs) are a class of small non-coding RNAs that regulate gene expression by targeting mRNAs for translation repression or mRNA degradation. Although many miRNAs have been discovered and studied in human and mouse, few studies have focused on porcine miRNAs, especially at the genome-wide scale.
Here, we adopted computational approaches, including a support vector machine (SVM) and homology searching, to scan the pig genome globally for pre-miRNAs. We built an SVM-based porcine pre-miRNA classifier with a sensitivity of 100%, a specificity of 91.2% and a total prediction accuracy of 95.6%. Using this classifier, we found 2204 novel porcine pre-miRNA candidates. In addition, 116 porcine pre-miRNA candidates were detected by homology searching.
We identified porcine pre-miRNAs genome-wide through computational approaches using pig data sets and established a porcine pre-miRNA library, which may provide a genome-level survey of pig pre-miRNAs and would benefit subsequent experimental research on the function and expression of porcine miRNAs.
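The three performance figures quoted for the classifier are related through the confusion matrix. The sketch below shows the standard definitions; the test-set counts are hypothetical (the abstract does not state how many real and pseudo pre-miRNAs were used) and were chosen only because a balanced set reproduces the reported percentages. The function name `classifier_metrics` is likewise illustrative.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of real pre-miRNAs recovered
    specificity = tn / (tn + fp)   # fraction of pseudo hairpins rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical balanced test set: 1,000 real pre-miRNAs and 1,000 pseudo hairpins.
sens, spec, acc = classifier_metrics(tp=1000, fn=0, tn=912, fp=88)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

On a balanced test set, accuracy is the mean of sensitivity and specificity, which is consistent with the reported 100%, 91.2% and 95.6%; an unbalanced set would give a different accuracy for the same two rates.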
Porcine; Pre-miRNA; SVM; Homology searching
Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus, can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied.
We studied several dissimilarity measures for the comparison of metagenomic samples using sequence signatures: d2, d2* and d2S, recently developed by our group; a measure (hereinafter denoted Hao) used in CVTree, developed by Hao’s group (Qi et al., 2004); measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009); and standard lp measures between the frequency vectors. We compared their performance using a series of extensive simulations and three real next-generation sequencing (NGS) metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure d2S can achieve superior performance when comparing metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature.
Sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The d2S dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
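To make the idea of an alignment-free sequence signature concrete, the sketch below counts k-tuples in two sets of reads and computes the uncentred d2 dissimilarity, taken here as half of one minus the cosine of the angle between the two count vectors. This form of d2 is stated as an assumption about the definition; d2* and d2S additionally centre the counts by their expectation under a background sequence model and are not shown. The toy reads and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count k-tuples over all reads, skipping tuples with ambiguous bases."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def d2_dissimilarity(x, y):
    """0.5 * (1 - D2 / (|X| |Y|)), where D2 is the inner product of the counts."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = sqrt(sum(v * v for v in x.values()))
    norm_y = sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("each sample needs at least one valid k-tuple")
    return 0.5 * (1.0 - dot / (norm_x * norm_y))


# Toy "samples" (hypothetical reads, far shorter than real NGS data).
sample_a = ["ACGTACGTAC", "GTACGTACGT"]
sample_b = ["ACGTTTTTAC", "GGGTACGTAA"]
k = 3
print(d2_dissimilarity(kmer_counts(sample_a, k), kmer_counts(sample_b, k)))
```

Because only tuple counts are compared, no assembly or reference database is involved, which is the property the abstract emphasises.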
MicroRNAs (miRNAs) have been implicated in the regulation of milk protein synthesis and development of the mammary gland (MG). However, the specific functions of miRNAs in these regulations are not clear. Therefore, the elucidation of miRNA expression profiles in the MG is an important step towards understanding the mechanisms of lactogenesis.
Two miRNA libraries were constructed from MG tissues taken from a lactating and a non-lactating Holstein dairy cow, respectively, and the short RNA sequences (18–30 nt) in these libraries were sequenced by the Solexa sequencing method. The libraries included 885 pre-miRNAs encoding 921 miRNAs, of which 884 miRNAs were unique sequences and 544 (61.5%) were expressed in both periods. A custom-designed microarray assay was then performed to compare miRNA expression patterns in the MG of lactating and non-lactating dairy cows. A total of 56 miRNAs in the lactating MG showed significant differences in expression compared to non-lactating MG (P<0.05). Integrative miRNA target prediction and network analysis approaches were employed to construct an interaction network of lactation-related miRNAs and their putative targets. Using a cell-based model, six miRNAs (miR-125b, miR-141, miR-181a, miR-199b, miR-484 and miR-500) were studied to reveal their possible biological significance.
Our study provides a broad view of the bovine MG miRNA expression profile characteristics. Eight hundred and eighty-four miRNAs were identified in bovine MG. Differences in types and expression levels of miRNAs were observed between lactating and non-lactating bovine MG. Systematic predictions aided in the identification of lactation-related miRNAs, providing insight into the types of miRNAs and their possible mechanisms in regulating lactation.
Rhizobium tropici CIAT 899 and Rhizobium sp. PRF 81 are α-Proteobacteria that establish nitrogen-fixing symbioses with a range of legume hosts. These strains are broadly used in commercial inoculants for application to common bean (Phaseolus vulgaris) in South America and Africa. Both strains display intrinsic resistance to several abiotic stressful conditions such as low soil pH and high temperatures, which are common in tropical environments, and to several antimicrobials, including pesticides. The genetic determinants of these interesting characteristics remain largely unknown.
Genome sequencing revealed that CIAT 899 and PRF 81 share a highly-conserved symbiotic plasmid (pSym) that is present also in Rhizobium leucaenae CFN 299, a rhizobium displaying a similar host range. This pSym seems to have arisen by a co-integration event between two replicons. Remarkably, three distinct nodA genes were found in the pSym, a characteristic that may contribute to the broad host range of these rhizobia. Genes for biosynthesis and modulation of plant-hormone levels were also identified in the pSym. Analysis of genes involved in stress response showed that CIAT 899 and PRF 81 are well equipped to cope with low pH, high temperatures and also with oxidative and osmotic stresses. Interestingly, the genomes of CIAT 899 and PRF 81 had large numbers of genes encoding drug-efflux systems, which may explain their high resistance to antimicrobials. Genome analysis also revealed a wide array of traits that may allow these strains to be successful rhizosphere colonizers, including surface polysaccharides, uptake transporters and catabolic enzymes for nutrients, diverse iron-acquisition systems, cell wall-degrading enzymes, type I and IV pili, and novel T1SS and T5SS secreted adhesins.
Availability of the complete genome sequences of CIAT 899 and PRF 81 may be exploited in further efforts to understand the interaction of tropical rhizobia with common bean and other legume hosts.
Nodulation; Nitrogen fixation; Plant-microbe interactions; Antimicrobial resistance
DNA microarrays are used both for research and for diagnostics. In research, Affymetrix arrays are commonly used for genome wide association studies, resequencing, and for gene expression analysis. These arrays provide large amounts of data. This data is analyzed using statistical methods that quite often discard a large portion of the information. Most of the information that is lost comes from probes that systematically fail across chips and from batch effects. The aim of this study was to develop a comprehensive model for hybridization that predicts probe intensities for Affymetrix arrays and that could provide a basis for improved microarray analysis and probe development. The first part of the model calculates probe binding affinities to all the possible targets in the hybridization solution using the Langmuir isotherm. In the second part of the model we integrate details that are specific to each experiment and contribute to the differences between hybridization in solution and on the microarray. These details include fragmentation, wash stringency, temperature, salt concentration, and scanner settings. Furthermore, the model fits probe synthesis efficiency and target concentration parameters directly to the data. All the parameters used in the model have a well-established physical origin.
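The Langmuir isotherm named in the first part of the model relates the fraction of probe sites occupied to target concentration and binding affinity. The sketch below shows only that textbook relation, under the assumption of a single target species at equilibrium; it omits everything else the abstract lists (competing targets, fragmentation, washing, temperature, salt, scanner settings, synthesis efficiency). The function name and the numeric values are illustrative.

```python
def langmuir_fraction_bound(concentration, affinity):
    """Fraction of probe sites occupied at equilibrium: K*c / (1 + K*c).

    concentration: target concentration c (arbitrary units).
    affinity: binding constant K (reciprocal of the same units).
    """
    if concentration < 0 or affinity < 0:
        raise ValueError("concentration and affinity must be non-negative")
    kc = affinity * concentration
    return kc / (1.0 + kc)


# Occupancy rises almost linearly at low concentration and saturates towards 1.
for c in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"c={c:>6}: fraction bound = {langmuir_fraction_bound(c, affinity=1.0):.3f}")
```

The saturation at high concentration is why probe intensity stops tracking target abundance linearly for strongly expressed targets.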
For the 302 chips that were analyzed, the mean correlation between expected and observed probe intensities was 0.701, with a range of 0.55 to 0.88. All available chips were included in the analysis regardless of the data quality. Our results show that batch effects arise from differences in probe synthesis, scanner settings, wash strength, and target fragmentation. We also show that probe synthesis efficiencies for different nucleotides are not uniform.
To date this is the most complete model for binding on microarrays. This is the first model that includes both probe synthesis efficiency and hybridization kinetics/cross-hybridization. These two factors are sequence dependent and have a large impact on probe intensity. The results presented here provide novel insight into the effect of probe synthesis errors on Affymetrix microarrays; furthermore, the algorithms developed in this work provide useful tools for the analysis of cross-hybridization, probe synthesis efficiency, fragmentation, wash stringency, temperature, and salt concentration on microarray intensities.
A major goal of the field of systems biology is to translate genome-wide profiling data (e.g., mRNAs, miRNAs) into interpretable functional networks. However, employing a systems biology approach to better understand the complexities underlying drug resistance phenotypes in cancer continues to represent a significant challenge to the field. Previously, we derived two drug-resistant breast cancer sublines (tamoxifen- and fulvestrant-resistant cell lines) from the MCF7 breast cancer cell line and performed genome-wide mRNA and microRNA profiling to identify differential molecular pathways underlying acquired resistance to these important antiestrogens. In the current study, to further define molecular characteristics of acquired antiestrogen resistance we constructed an “integrative network”. We combined joint miRNA-mRNA expression profiles, cancer contexts, miRNA-target mRNA relationships, and miRNA upstream regulators. In particular, to reduce the probability of false positive connections in the network, experimentally validated, rather than prediction-oriented, databases were utilized to obtain connectivity. Also, to improve biological interpretation, cancer contexts were incorporated into the network connectivity.
Based on the integrative network, we extracted “substructures” (network clusters) representing the drug-resistant states (tamoxifen- or fulvestrant-resistant cells) compared to the drug-sensitive state (parental MCF7 cells). We identified previously undescribed network clusters that contribute to antiestrogen resistance, consisting of miR-146a, -27a, -145, -21, -155, -15a, -125b, and let-7s, in addition to the previously described miR-221/222.
By integrating miRNA-related network, gene/miRNA expression and text-mining, the current study provides a computational-based systems biology approach for further investigating the molecular mechanism underlying antiestrogen resistance in breast cancer cells. In addition, new miRNA clusters that contribute to antiestrogen resistance were identified, and they warrant further investigation.
Bioinformatics; miRNA; Network; Breast cancer; Antiestrogen resistance
Protein arginine methylation is a post-translational modification involved in important biological processes such as transcription and RNA processing. This modification is catalyzed by both type I and II protein arginine methyltransferases (PRMTs). One of the most conserved type I PRMTs is PRMT1, the homolog of which is Hmt1 in Saccharomyces cerevisiae. Hmt1 has been shown to play a role in various gene expression steps, such as promoting the dynamics of messenger ribonucleoprotein particle (mRNP) biogenesis, pre-mRNA splicing, and silencing of chromatin. To determine the full extent of Hmt1’s involvement during gene expression, we carried out a genome-wide location analysis for Hmt1.
A comprehensive genome-wide binding profile for Hmt1 was obtained by ChIP-chip using NimbleGen high-resolution tiling microarrays. Of the approximately 1000 Hmt1-binding sites found, the majority fall within or proximal to an ORF. Different occupancy patterns of Hmt1 across genes with different transcriptional rates were found. Interestingly, Hmt1 occupancy is found at a number of other genomic features such as tRNA and snoRNA genes, thereby implicating a regulatory role in the biogenesis of these non-coding RNAs. RNA hybridization analysis shows that Hmt1 loss-of-function mutants display higher steady-state tRNA abundance relative to the wild-type. Co-immunoprecipitation studies demonstrate that Hmt1 interacts with the TFIIIB component Bdp1, suggesting a mechanism for Hmt1 in modulating RNA Pol III transcription to regulate tRNA production.
The genome-wide binding profile of Hmt1 reveals multiple potential new roles for Hmt1 in the control of eukaryotic gene expression, especially in the realm of non-coding RNAs. The data obtained here will provide an important blueprint for future mechanistic studies on the described occupancy relationship for genomic features bound by Hmt1.
Protein arginine methylation; Hmt1; RNA Pol III transcription; tRNA biogenesis; ChIP-chip
MicroRNAs (miRNAs) are small noncoding RNAs that regulate gene expression post-transcriptionally in a wide range of biological processes. The zebra finch (Taeniopygia guttata), an oscine songbird with characteristic learned vocal behavior, provides biologists a unique model system for studying vocal behavior, sexually dimorphic brain development and functions, and comparative genomics.
We deep sequenced small RNA libraries made from the brain, heart, liver, and muscle tissues of adult male and female zebra finches. By mapping the sequence reads to the zebra finch genome and to known miRNAs in miRBase, we annotated a total of 193 miRNAs. Among them, 29 (15%) are avian specific, including three novel zebra finch specific miRNAs. Many of the miRNAs exhibit sequence heterogeneity including length variations, untemplated terminal nucleotide additions, and internal substitution events occurring at the uridine nucleotide within a GGU motif. We also identified seven Z chromosome-encoded miRNAs. Among them, miR-2954, an avian specific miRNA, is expressed at significantly higher levels in males than in females in all tissues examined. Target prediction analysis reveals that miR-2954, but not other Z-linked miRNAs, preferentially targets Z chromosome-encoded genes, including several genes known to be expressed in a sexually dimorphic manner in the zebra finch brain.
Our genome-wide systematic analysis of mature sequences, genomic locations, evolutionary sequence conservation, and tissue expression profiles of the zebra finch miRNA repertoire provides a valuable resource to the research community. Our analysis also reveals a miRNA-mediated mechanism that potentially regulates sex-biased gene expression in avian species.
Zebra finch; miRNAs; Sequence variations; Tissue-enriched miRNA expression; Z chromosome; Sex-biased miRNA expression
Next Generation Sequencing has provided comprehensive, affordable and high-throughput DNA sequences for Single Nucleotide Polymorphism (SNP) discovery in Acacia auriculiformis and Acacia mangium. As in other non-model species, SNP detection and genotyping in Acacia are challenging owing to the lack of genome sequences. The main objective of this study is to develop the first high-throughput SNP genotyping assay for linkage map construction of A. auriculiformis x A. mangium hybrids.
We identified a total of 37,786 putative SNPs by aligning short read transcriptome data from four parents of two Acacia hybrid mapping populations using Bowtie against 7,839 de novo transcriptome contigs. Given a set of 10 validated SNPs from two lignin genes, our in silico SNP detection approach is highly accurate (100%) compared to the traditional in vitro approach (44%). Further validation of 96 SNPs using the Illumina GoldenGate Assay gave an overall assay success rate of 89.6% and a conversion rate of 37.5%. We explored possible factors lowering the assay success rate by predicting exon-intron boundaries and paralogous genes of Acacia contigs using the Medicago truncatula genome as reference. This assessment revealed that the presence of an exon-intron boundary is the main cause (50%) of assay failure. Subsequent SNP filtering and improved assay design resulted in assay success and conversion rates of 92.4% and 57.4%, respectively, based on genotyping of 768 SNPs. Analysis of clustering patterns revealed that 27.6% of the assays were not reproducible and that flanking sequence might play a role in determining cluster compression. In addition, we identified a total of 258 and 319 polymorphic SNPs in A. auriculiformis and A. mangium natural germplasms, respectively.
We have successfully discovered a large number of SNP markers in A. auriculiformis x A. mangium hybrids using next generation transcriptome sequencing. By using a reference genome from the most closely related species, we converted most SNPs to successful assays. We also demonstrated that Illumina GoldenGate genotyping together with manual clustering can provide high quality genotypes for a non-model species like Acacia. These SNP markers are not only important for linkage map construction, but will also be very useful for hybrid discrimination and genetic diversity assessment of natural germplasms in the future.
Recent studies have shown that copy number variation (CNV) in mammalian genomes contributes to phenotypic diversity, including health and disease status. In domestic pigs, CNV has been catalogued by several reports, but the extent of CNV and the phenotypic effects are far from clear. The goal of this study was to identify CNV regions (CNVRs) in pigs based on array comparative genome hybridization (aCGH).
Here a custom-made tiling oligo-nucleotide array was used with a median probe spacing of 2506 bp for screening 12 pigs including 3 Chinese native pigs (one Chinese Erhualian, one Tongcheng and one Yangxin pig), 5 European pigs (one Large White, one Pietrain, one White Duroc and two Landrace pigs), 2 synthetic pigs (Chinese new line DIV pigs) and 2 crossbred pigs (Landrace × DIV pigs) with a Duroc pig as the reference. Two hundred and fifty-nine CNVRs across chromosomes 1–18 and X were identified, with an average size of 65.07 kb and a median size of 98.74 kb, covering 16.85 Mb or 0.74% of the whole genome. Concerning copy number status, 93 (35.91%) CNVRs were called as gains, 140 (54.05%) were called as losses and the remaining 26 (10.04%) were called as both gains and losses. Of all detected CNVRs, 171 (66.02%) and 34 (13.13%) CNVRs directly overlapped with Sus scrofa duplicated sequences and pig QTLs, respectively. The CNVRs encompassed 372 full length Ensembl transcripts. Two CNVRs identified by aCGH were validated using real-time quantitative PCR (qPCR).
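The reported coverage figures can be cross-checked with simple arithmetic: the number of CNVRs multiplied by their average size should give the total length covered. The short sketch below performs that check using only values quoted in the paragraph above; the implied genome size in the last step is derived from the quoted 0.74% and is not a figure stated in the abstract.

```python
n_cnvrs = 259              # CNVRs identified
mean_size_kb = 65.07       # reported average CNVR size
coverage_fraction = 0.0074 # reported 0.74% of the whole genome

# Total length = count x mean size, converted from kb to Mb.
total_mb = n_cnvrs * mean_size_kb / 1000
# Genome size implied by the reported percentage (a derived value, not from the source).
implied_genome_mb = total_mb / coverage_fraction

print(f"total CNVR length: {total_mb:.2f} Mb")
print(f"implied genome size: {implied_genome_mb:.0f} Mb")
```

The product agrees with the stated 16.85 Mb, so the count, mean size and total are mutually consistent.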
Using 720 K array CGH (aCGH) we described a map of porcine CNVs which facilitated the identification of structural variations for important phenotypes and the assessment of the genetic diversity of pigs.
Salmonids are popular sport fishes, and as such have been subjected to widespread stocking throughout western North America. Historically, stocking was done with little regard for genetic variation among populations and has resulted in genetic mixing among species and subspecies in many areas, thus putting the genetic integrity of native salmonid populations at risk and creating a need to assess the genetic constitution of native salmonid populations. Cutthroat trout is a salmonid species with pronounced geographic structure (there are 10 extant subspecies) and a recent history of hybridization with introduced rainbow trout in many populations. Genetic admixture has also occurred among cutthroat trout subspecies in areas where introductions have brought two or more subspecies into contact. Consequently, management agencies have increased their efforts to evaluate the genetic composition of cutthroat trout populations to identify populations that remain uncompromised and manage them accordingly, but additional genetic markers are needed to do so effectively. Here we used genome reduction, MID-barcoding, and 454-pyrosequencing to discover single nucleotide polymorphisms that differentiate cutthroat trout subspecies and can be used as a rapid, cost-effective method to characterize the genetic composition of cutthroat trout populations.
Thirty cutthroat and six rainbow trout individuals were subjected to genome reduction and next-generation sequencing. A total of 1,499,670 reads averaging 379 base pairs in length were generated by 454-pyrosequencing, resulting in 569,060,077 total base pairs sequenced. A total of 43,558 putative SNPs were identified, and of those, 125 SNP primers were developed that successfully amplified 96 cutthroat trout and rainbow trout individuals. These SNP loci were able to differentiate most cutthroat trout subspecies using distance methods and Structure analyses.
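As a quick consistency check on the sequencing yield, the total number of base pairs divided by the number of reads should reproduce the quoted average read length. The sketch below uses only the two totals given above; the variable names are illustrative.

```python
total_reads = 1_499_670
total_bases = 569_060_077

# Mean read length in base pairs, from the two reported totals.
mean_read_length = total_bases / total_reads
print(f"mean read length: {mean_read_length:.1f} bp")
```

The result rounds to the 379 bp average reported for the 454-pyrosequencing run.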
Genomic and bioinformatic protocols were successfully implemented to identify 125 nuclear SNPs that are capable of differentiating most subspecies of cutthroat trout from one another. The ability to use this suite of SNPs to identify individuals of unknown genetic background to subspecies can be a valuable tool for management agencies in their efforts to evaluate the genetic structure of cutthroat trout populations prior to constructing and implementing conservation plans.
Conservation genetics; Genetic admixture; Hybridization; KASPar; Oncorhynchus clarkii; Oncorhynchus mykiss; Population genomics; SNP; Rainbow trout
Thermacetogenium phaeum is a thermophilic strictly anaerobic bacterium oxidizing acetate to CO2 in syntrophic association with a methanogenic partner. It can also grow in pure culture, e.g., by fermentation of methanol to acetate. The key enzymes of homoacetate fermentation (Wood-Ljungdahl pathway) are used both in acetate oxidation and acetate formation. The obvious reversibility of this pathway in this organism is of specific interest since syntrophic acetate oxidation operates close to the energetic limitations of microbial life.
The genome of Th. phaeum is organized on a single circular chromosome and has a total size of 2,939,057 bp. It comprises 3,215 open reading frames, of which 75% could be assigned to a gene function. The G+C content is 53.88 mol%. Many CRISPR sequences were found, indicating heavy phage attack in the past. A complete gene set for a phage was found in the genome, and indications of phage action could also be observed in culture. The genome contained all genes required for CO2 reduction through the Wood-Ljungdahl pathway, including two formyl tetrahydrofolate ligases, three carbon monoxide dehydrogenases, one formate hydrogenlyase complex, three further formate dehydrogenases, and three further hydrogenases. The bacterium contains a menaquinone MQ-7. No indications of cytochromes or Rnf complexes could be found in the genome.
The information obtained from the genome sequence indicates that Th. phaeum differs fundamentally from the three homoacetogenic bacteria sequenced so far, i.e., the sodium ion-dependent Acetobacterium woodii, the ethanol-producing Clostridium ljungdahlii, and the cytochrome-containing Moorella thermoacetica. The specific enzyme complement of Th. phaeum apparently allows ATP formation both in acetate formation and acetate oxidation.
Tandemly arranged nuclear ribosomal DNA (rDNA) units, encoding 18S, 5.8S and 26S ribosomal RNA (rRNA), exhibit concerted evolution, a pattern thought to result from the homogenisation of rDNA arrays. However, rDNA homogeneity at the single nucleotide polymorphism (SNP) level has not been detailed in organisms with more than a few hundred copies of the rDNA unit. Here we study rDNA complexity in species with arrays consisting of thousands of units.
We examined homogeneity of genic (18S) and non-coding internal transcribed spacer (ITS1) regions of rDNA using Roche 454 and/or Illumina platforms in four angiosperm species, Nicotiana sylvestris, N. tomentosiformis, N. otophora and N. kawakamii. We compared the data with Southern blot hybridisation revealing the structure of intergenic spacer (IGS) sequences and with the number and distribution of rDNA loci.
Results and Conclusions
In all four species the intragenomic homogeneity of the 18S gene was high; a single ribotype makes up over 90% of the genes. However, greater variation was observed in the ITS1 region, particularly in species with two or more rDNA loci, where >55% of rDNA units were a single ribotype, with the second most abundant variant accounting for >18% of units. IGS heterogeneity was high in all species. The increased number of ribotypes in ITS1 compared with 18S sequences may reflect rounds of incomplete homogenisation with strong selection for functional genic regions and relaxed selection on ITS1 variants. The relationship between the number of ITS1 ribotypes and the number of rDNA loci leads us to propose that rDNA evolution and complexity are influenced by locus number and/or amplification of orphaned rDNA units at new chromosomal locations.
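The ribotype proportions quoted above (for example, a dominant ribotype at >90% or >55% of units) are relative abundances of distinct sequence variants among the reads covering a region. A minimal Python sketch of that tally follows; the function name `ribotype_frequencies` and the toy reads are hypothetical and are not data or code from the study, and real analyses would first need read alignment and error filtering.

```python
from collections import Counter


def ribotype_frequencies(reads):
    """Return (sequence, fraction) pairs, most abundant ribotype first."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(seq, n / total) for seq, n in counts.most_common()]


# Toy reads covering one region (hypothetical values, for illustration only).
reads = ["ACGT"] * 6 + ["ACGA"] * 3 + ["TCGT"] * 1
freqs = ribotype_frequencies(reads)

for seq, frac in freqs:
    print(f"{seq}\t{frac:.0%}")
```

Sorting by abundance makes it straightforward to read off the share of the dominant ribotype and of the second most abundant variant.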
rRNA genes; Concerted evolution; Plants; Chromosomes; DNA repeats homogenisation
The emergence of vertebrates is characterized by a strong increase in miRNA families. MicroRNAs interact broadly with many transcripts, and the evolution of such a system is intriguing. However, evolutionary questions concerning the origin of miRNA genes and their subsequent evolution remain unexplained.
To systematically understand the evolutionary relationship between miRNA genes and their function, we classified known human miRNAs into eight groups based on their evolutionary ages, estimated by the maximum parsimony method. New miRNA genes with new functional sequences accumulated more dynamically in vertebrates than in Drosophila. Different levels of evolutionary selection were observed over miRNA gene sequences with different times of origin. Most genic miRNAs differ from their host genes in time of origin: there is no particular relationship between the age of a miRNA and the age of its host gene, and genic miRNAs are mostly younger than the corresponding host genes. MicroRNAs that originated over different time-scales are often predicted or verified to target the same or overlapping sets of genes, opening the possibility of substantial functional redundancy among miRNAs of different ages. A higher degree of tissue specificity and lower expression levels were found in young miRNAs.
Our data showed that, compared with protein-coding genes, miRNA genes are more dynamic in terms of emergence and decay. Evolutionary patterns differ considerably between miRNAs of different ages. MicroRNA activity is under tight control, with well-regulated expression increasing and targeting decreasing over time. Our work calls attention to the need to consider time of origin when studying miRNA activity.
Cis-natural antisense transcripts (cis-NATs) are RNAs transcribed from the antisense strand of a gene locus, and are complementary to the RNA transcribed from the sense strand. Common techniques, including the microarray approach and analysis of transcriptome databases, are the major ways to globally identify cis-NATs in various eukaryotic organisms. Genome-wide in silico analysis has identified a large number of cis-NATs that may generate endogenous short interfering RNAs (nat-siRNAs), which participate in important biogenesis mechanisms for transcriptional and post-transcriptional regulation in rice. However, the transcriptomes are yet to be deeply sequenced to comprehensively investigate cis-NATs.
We applied high-throughput strand-specific complementary DNA sequencing technology (ssRNA-seq) to deeply sequence mRNA for assessing sense and antisense transcripts that were derived under salt, drought and cold stresses, and normal conditions, in the model plant rice (Oryza sativa). Combined with RAP-DB genome annotation (the Rice Annotation Project Database build-5 data set), 76,013 transcripts corresponding to 45,844 unique gene loci were assembled, in which 4873 gene loci were newly identified. Of 3819 putative rice cis-NATs, 2292 were detected as expressed and gave rise to small RNAs from their overlapping regions through integrated analysis of ssRNA-seq data and small RNA data. Among them, 503 cis-NATs seemed to be associated with specific conditions. The deep sequence data from isolated epidermal cells of rice seedlings further showed that 54.0% of cis-NATs were expressed simultaneously in a population of homogeneous cells. Nearly 9.7% of rice transcripts were involved in one-to-one or many-to-many cis-NATs formation. Furthermore, only 17.4-34.7% of 223 many-to-many cis-NAT groups were all expressed and generated nat-siRNAs, indicating that only some cis-NAT groups may be involved in complex regulatory networks.
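Because cis-NATs are defined by overlap between transcripts on opposite strands of the same locus, candidate pairs can be enumerated from annotated coordinates alone. The Python sketch below illustrates that idea under stated assumptions: the `Transcript` record, the `find_cis_nat_pairs` function, the 1-bp minimum overlap and the toy coordinates are all hypothetical, and this is not the pipeline used in the study, which also integrated expression and small RNA data.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def find_cis_nat_pairs(transcripts, min_overlap=1):
    """Return (name_a, name_b, overlap_bp) for overlapping opposite-strand pairs."""
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.start, t.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Sorted order guarantees no later transcript can overlap `a`.
            if b.chrom != a.chrom or b.start > a.end:
                break
            overlap = min(a.end, b.end) - b.start + 1
            if a.strand != b.strand and overlap >= min_overlap:
                pairs.append((a.name, b.name, overlap))
    return pairs


# Toy annotation (hypothetical coordinates, for illustration only).
annotation = [
    Transcript("geneA", "chr1", 100, 500, "+"),
    Transcript("geneB", "chr1", 450, 900, "-"),
    Transcript("geneC", "chr1", 480, 600, "+"),
    Transcript("geneD", "chr2", 100, 500, "-"),
]
pairs = find_cis_nat_pairs(annotation)
print(pairs)
```

In this toy input geneB overlaps two sense-strand transcripts, which is the kind of configuration that would fall into a many-to-many group rather than a one-to-one pair.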
Our study profiles an abundance of cis-NATs and nat-siRNAs in rice. These data are valuable for gaining insight into the complex function of the rice transcriptome.
Oryza sativa; Cis-NATs; Nat-siRNAs; SsRNA-seq; Transcriptome
It has recently emerged that common epithelial cancers such as breast cancers have fusion genes like those in leukaemias. In a representative breast cancer cell line, ZR-75-30, we searched for fusion genes, by analysing genome rearrangements.
We first analysed rearrangements of the ZR-75-30 genome, to around 10 kb resolution, by molecular cytogenetic approaches, combining array painting and array CGH. We then compared this map with genomic junctions determined by paired-end sequencing. Most of the breakpoints found by array painting and array CGH were identified in the paired-end sequencing: 55% of the unamplified breakpoints and 97% of the amplified breakpoints (as these are represented by more sequence reads). From this analysis we identified 9 expressed fusion genes: APPBP2-PHF20L1, BCAS3-HOXB9, COL14A1-SKAP1, TAOK1-PCGF2, TIAM1-NRIP1, TIMM23-ARHGAP32, TRPS1-LASP1, USP32-CCDC49 and ZMYM4-OPRD1. We also determined the genomic junctions of a further three expressed fusion genes that had been described by others, BCAS3-ERBB2, DDX5-DEPDC6/DEPTOR and PLEC1-ENPP2. Of this total of 12 expressed fusion genes, 9 were in the coamplification. Due to the sensitivity of the technologies used, we estimate these 12 fusion genes to be around two-thirds of the true total. Many of the fusions seem likely to be driver mutations. For example, PHF20L1, BCAS3, TAOK1, PCGF2, and TRPS1 are fused in other breast cancers. HOXB9 and PHF20L1 are members of gene families that are fused in other neoplasms. Several of the other genes are relevant to cancer: in addition to ERBB2, SKAP1 is an adaptor for Src, DEPTOR regulates the mTOR pathway and NRIP1 is an estrogen-receptor coregulator.
This is the first structural analysis of a breast cancer genome that combines classical molecular cytogenetic approaches with sequencing. Paired-end sequencing was able to detect almost all breakpoints, where there was adequate read depth. It supports the view that gene breakage and gene fusion are important classes of mutation in breast cancer, with a typical breast cancer expressing many fusion genes.
Breast cancer; Chromosome aberrations; Genomics; Fusion genes
Mycosphaerella fijiensis is an ascomycete that causes Black Sigatoka in bananas. Recently, the M. fijiensis genome was sequenced. Repetitive sequences are ubiquitous components of fungal genomes. In most genomic analyses, repetitive sequences are associated with transposable elements (TEs). TEs are dispersed repetitive DNA sequences found in a host genome. These elements have the ability to move from one location to another within the genome, and their insertion can cause a wide spectrum of mutations in their hosts. Some of the deleterious effects of TEs may be due to ectopic recombination among TEs of the same family. In addition, some transposons are physically linked to genes and can control their expression. To prevent possible damage caused by the presence of TEs in the genome, some fungi possess TE-silencing mechanisms, such as RIP (Repeat Induced Point mutation). In this study, the abundance, distribution and potential impact of TEs in the genome of M. fijiensis were investigated.
A total of 613 LTR-Gypsy and 27 LTR-Copia complete class I elements were detected. Among the class II elements, a total of 28 Mariner, five Mutator and one Harbinger complete elements were identified. The results of this study indicate that transposons have been, and remain, important ectopic recombination sites. A distribution analysis of one transposable element from each class across M. fijiensis isolates revealed variable hybridization profiles, indicating the activity of these elements. Several genes encoding proteins involved in important metabolic pathways and with potential correlation to pathogenicity systems were identified upstream and downstream of transposable elements. A comparison of the sequences from different transposon groups suggested the action of the RIP silencing mechanism in the genome of this microorganism.
The analysis of TEs in M. fijiensis suggests that TEs play an important role in the evolution of this organism because the activity of these elements, as well as the rearrangements caused by ectopic recombination, can result in deletion, duplication, inversion and translocation. Some of these changes can potentially modify gene structure or expression and, thus, facilitate the emergence of new strains of this pathogen.
Mycosphaerella fijiensis; Transposable elements; RIP; Genome
New genes that originate from non-coding DNA rather than being duplicated from parent genes are called de novo genes. Their short evolution time and lack of parent genes provide a chance to study the evolution of cis-regulatory elements in the initial stage of gene emergence. Although a few reports have discussed cis-regulatory elements in new genes, knowledge of the characteristics of these elements in de novo genes is lacking. Here, we conducted a comprehensive investigation to depict the emergence and establishment of cis-regulatory elements in de novo yeast genes.
In a genome-wide investigation, we found that the number of transcription factor binding sites (TFBSs) in de novo genes of S. cerevisiae increased rapidly and quickly became comparable to the number of TFBSs in established genes. This phenomenon might have resulted from certain characteristics of de novo genes; namely, a relatively frequent gain of TFBSs, an unexpectedly high number of preexisting TFBSs, or lower selection pressure in the promoter regions of the de novo genes. Furthermore, we identified differences in the promoter architecture between de novo genes and duplicated new genes, suggesting that distinct regulatory strategies might be employed by genes of different origin. Finally, our functional analyses of the yeast de novo genes revealed that they might be related to reproduction.
Our observations showed that de novo genes and duplicated new genes possess mutually distinct regulatory characteristics, implying that these two types of genes might have different roles in evolution.
De novo gene; Regulatory evolution; TFBS turnover; Promoter architecture
The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, and powerful computational tools for annotating these genome sequences are urgently needed.
We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes are identified using tRNAscan and ARAGORN, and inverted repeats (IR) using vmatch. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tool. The edited annotations can then be uploaded to CPGAVAS for repeated updating and re-analysis. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while having several superior functionalities.
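Since the annotation results are exchanged in GFF3 format, per-feature summary statistics can be derived directly from such a file. The Python sketch below is only an illustration of that idea, not CPGAVAS code: the function name `summarize_gff3`, the chosen statistics (feature count and total length per feature type) and the toy records are assumptions. It relies only on the standard GFF3 layout of nine tab-separated columns, with the feature type in column 3 and 1-based inclusive start and end coordinates in columns 4 and 5.

```python
from collections import defaultdict


def summarize_gff3(lines):
    """Count features and sum their lengths (bp) for each GFF3 feature type."""
    stats = defaultdict(lambda: {"count": 0, "total_bp": 0})
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        stats[feature_type]["count"] += 1
        stats[feature_type]["total_bp"] += end - start + 1
    return dict(stats)


# Toy GFF3 records (hypothetical identifiers and coordinates).
gff3_lines = [
    "##gff-version 3",
    "cp1\texample\tgene\t100\t1161\t.\t-\t.\tID=gene1",
    "cp1\texample\ttRNA\t1500\t1572\t.\t+\t.\tID=trna1",
    "cp1\texample\tgene\t2000\t2299\t.\t+\t.\tID=gene2",
]
summary = summarize_gff3(gff3_lines)
for feature_type, s in sorted(summary.items()):
    print(feature_type, s["count"], s["total_bp"])
```

The same function accepts an open file handle in place of the list, because it only iterates over lines.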
CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from
Chloroplast genome; Annotation; Web server; CPGAVAS
MicroRNAs play a vital role in the regulation of gene expression and have been identified in every animal with a sequenced genome examined thus far, except for the placozoan Trichoplax. The genomic repertoires of metazoan microRNAs have become increasingly endorsed as phylogenetic characters and drivers of biological complexity.
In this study, we report the first investigation of microRNAs in a species from the phylum Ctenophora. We use short RNA sequencing and the assembled genome of the lobate ctenophore Mnemiopsis leidyi to show that this species appears to lack any recognizable microRNAs, as well as the nuclear proteins Drosha and Pasha, which are critical to canonical microRNA biogenesis. This finding represents the first reported case of a metazoan lacking a Drosha protein.
Recent phylogenomic analyses suggest that Mnemiopsis may be the earliest branching metazoan lineage. If this is true, then the origins of canonical microRNA biogenesis and microRNA-mediated gene regulation may postdate the last common metazoan ancestor. Alternatively, canonical microRNA functionality may have been lost independently in the lineages leading to both Mnemiopsis and the placozoan Trichoplax, suggesting that microRNA functionality was not critical until much later in metazoan evolution.
Mnemiopsis leidyi; Ctenophore; Metazoa; microRNA; miRNA; Drosha; Pasha; Microprocessor complex; Ribonuclease III; RNase III
Sacha Inchi (Plukenetia volubilis L., Euphorbiaceae) is a potential oilseed crop because the seeds of this plant are rich in unsaturated fatty acids (FAs). In particular, the fatty acid composition of its seed oil is distinctive in containing large quantities of α-linolenic acid (C18:3, an ω-3 FA). However, little is known about the molecular mechanisms responsible for biosynthesis of unsaturated fatty acids in the developing seeds of this species. Transcriptome data are needed to better understand these mechanisms.
In this study, de novo transcriptome assembly and gene expression analysis were performed using Illumina sequencing technology. A total of 52.6 million 90-bp paired-end reads were generated from two libraries constructed at the initial stage and fast oil accumulation stage of seed development. These reads were assembled into 70,392 unigenes; 22,179 unigenes showed a 2-fold or greater expression difference between the two libraries. Using these data, we identified unigenes that may be involved in de novo FA and triacylglycerol biosynthesis. In particular, we identified a number of unigenes encoding desaturases for the formation of unsaturated fatty acids that were expressed at higher levels in the fast oil accumulation stage than in the initial stage of seed development.
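The 2-fold criterion mentioned above amounts to comparing, for each unigene, the ratio of its expression in the two libraries against a threshold. The Python sketch below shows one way to apply such a filter; the function name `two_fold_changed`, the handling of zero values and the toy expression values are assumptions, and the abstract does not state which normalized expression measure or statistical test was used.

```python
def two_fold_changed(expr_initial, expr_fast, fold=2.0):
    """Flag unigenes whose expression differs at least `fold`-fold between libraries.

    Returns a dict mapping unigene ID to "up" (higher in the fast oil
    accumulation stage) or "down" (higher in the initial stage).
    """
    changed = {}
    for unigene in sorted(expr_initial.keys() & expr_fast.keys()):
        a, b = expr_initial[unigene], expr_fast[unigene]
        lo, hi = min(a, b), max(a, b)
        if hi == 0:
            continue  # not expressed in either library
        # Assumption: expression in only one library counts as a change.
        if lo == 0 or hi / lo >= fold:
            changed[unigene] = "up" if b > a else "down"
    return changed


# Toy normalized expression values (hypothetical, for illustration only).
initial_stage = {"u1": 10.0, "u2": 5.0, "u3": 8.0, "u4": 0.0, "u5": 10.0}
fast_stage = {"u1": 40.0, "u2": 6.0, "u3": 2.0, "u4": 0.0, "u5": 20.0}
changed = two_fold_changed(initial_stage, fast_stage)
print(changed)
```

Comparing the raw ratio rather than adding a pseudocount keeps unigenes that change by exactly 2-fold inside the "2-fold or greater" set.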
This study provides the first comprehensive dataset characterizing Sacha Inchi gene expression at the transcriptional level. These data provide the foundation for further studies on molecular mechanisms underlying oil accumulation and PUFA biosynthesis in Sacha Inchi seeds. Our analyses facilitate understanding of the molecular mechanisms responsible for the high accumulation of unsaturated fatty acids (especially α-linolenic acid) in Sacha Inchi seeds.
Transcriptome; Unsaturated fatty acids; Omega-3 fatty acids; Triacylglycerols; Gene expression