Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources.
We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations.
Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The algorithm is implemented in an open-source tool named pfff, available both as a command-line tool and as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
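The sampling idea can be illustrated with a minimal sketch. It is not the pfff implementation: the function name, the default sample count, the fixed seed and the use of SHA-256 are all assumptions made for illustration. The sketch hashes the file size together with bytes read at pseudo-randomly chosen offsets, so the amount of data read depends on the number of samples rather than on the file size.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, n_samples=1024, block_size=1, seed=0):
    """Fingerprint a file from its size plus bytes at sampled offsets.

    Hypothetical helper for illustration only; parameters are assumptions.
    """
    size = os.path.getsize(path)
    # The same seed and the same file size always give the same offsets,
    # so fingerprints of equally sized files are directly comparable.
    rng = random.Random(seed)
    digest = hashlib.sha256()
    digest.update(str(size).encode("ascii"))
    if size == 0:
        return digest.hexdigest()
    # Sorting the offsets keeps the reads moving forward through the file.
    offsets = sorted(rng.randrange(size) for _ in range(n_samples))
    with open(path, "rb") as fh:
        for offset in offsets:
            fh.seek(offset)
            digest.update(fh.read(block_size))
    return digest.hexdigest()
```

Two files with identical content always receive the same fingerprint. Two different files of equal size are told apart only if at least one sampled offset hits a differing byte, which is why the reliability of such a scheme depends on the variation present in the data.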
Haplotype information is useful for various genetic analyses, including
genome-wide association studies. Determining haplotypes experimentally is
difficult and there are several computational approaches that infer haplotypes
from genomic data. Among such approaches, single individual haplotyping or
haplotype assembly, which infers two haplotypes of an individual from aligned
sequence fragments, has been attracting considerable attention. To avoid incorrect
results in downstream analyses, it is important not only to assemble haplotypes as
long as possible but also to provide means to extract highly reliable haplotype
regions. Although there are several efficient algorithms for solving haplotype
assembly, there is no efficient method that allows for extracting the regions
assembled with high confidence.
We develop a probabilistic model, called MixSIH, for solving the haplotype
assembly problem. The model has two mixture components representing two
haplotypes. Based on the optimized model, a quality score is defined, which we
call the 'minimum connectivity' (MC) score, for each segment in the haplotype
assembly. Because existing accuracy measures for haplotype assembly are designed
to compare the efficiency between the algorithms and are not suitable for
evaluating the quality of the set of partially assembled haplotype segments, we
develop an accuracy measure based on the pairwise consistency and evaluate the
accuracy on the simulation and real data. By using the MC scores, our algorithm
can extract highly accurate haplotype segments. We also show evidence that an
existing experimental dataset contains chimeric read fragments derived from
different haplotypes, which significantly degrade the quality of assembled haplotypes.
We develop a novel method for solving the haplotype assembly problem. We also
define a quality score, based on our model, that indicates the accuracy of
the haplotype segments. In our evaluation, MixSIH successfully extracted
reliable haplotype segments. The C++ source code of MixSIH is available at
Investigation of conformational changes in a protein is a prerequisite to understanding its biological function. To explore these conformational changes in proteins, we developed a strategy that combines molecular dynamics (MD) simulations with electron paramagnetic resonance (EPR) spectroscopy. The major goal of this work is to investigate how closely computer simulations can match the experiments.
Vinculin tail protein was chosen as a model system because conformational changes within the vinculin protein are believed to be important for its biological function at the sites of cell adhesion. MD simulations were performed on vinculin tail protein both in water and in vacuo. Experimental EPR data were compared with the simulated data for the corresponding spin label positions.
The EPR spectra calculated from MD simulation trajectories of selected spin-labelled positions are comparable to the experimental EPR spectra. The results show that the information contained in the spin label mobility provides a powerful means of mapping protein folds and their conformational changes.
The results suggest the localization of dynamic and flexible regions of the vinculin tail protein. This study shows MD simulations can be used as a complementary tool to interpret experimental EPR data.
Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and potent gene regulators. High-throughput RNA-sequencing combined with de novo assembly promises large-scale discovery of novel transcripts. However, the identification of lincRNAs among thousands of assembled transcripts is still challenging because they are difficult to separate from protein coding transcripts (PCTs).
We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of lincRNAs. iSeeRNA shows better performance than other software. A publicly available web server for iSeeRNA is also provided for small datasets.
iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus offering a valuable tool for lincRNA study.
Detection of significantly differentially expressed genes (DEGs) from DNA microarray datasets is a routine task in biomedical research, and numerous methods have been proposed for it. Conventional methods generally detect DEGs from a single dataset consisting of a control group and a treatment group. However, some DEGs are readily detected under any experimental condition. To detect DEGs that are specific to an experimental condition, each measured gene expression level should be compared in two dimensions simultaneously: against other genes and against other datasets. For this purpose, we retrieve as much gene expression data as possible from public databases and construct a "meta-dataset" that summarizes the expression changes of all genes under various experimental conditions. Herein, we propose the "two-way AIC" (Akaike Information Criterion) method for the simultaneous detection of significant genes and experiments in the meta-dataset.
In a case study of Pseudomonas aeruginosa, we evaluate whether the two-way AIC method can detect DEGs that are specific to an experimental condition, using operon genes as test data. Compared with other commonly used statistical methods (t-rank/F-test, RankProducts and SAM), two-way AIC shows the highest specificity in detecting operon genes.
The two-way AIC method achieves high specificity for operon gene detection on the microarray meta-dataset. This method can also be applied to the estimation of mutual gene interactions.
Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in human DNA. The sequence of SNPs in each of the two copies of a given chromosome in a diploid organism is referred to as a haplotype. Haplotype information has many applications, such as gene disease diagnosis and drug design. The haplotype assembly problem is defined as follows: given a set of fragments sequenced from the two copies of a chromosome of a single individual, together with their locations in the chromosome (which can be pre-determined by aligning the fragments to a reference DNA sequence), reconstruct the two haplotypes (h1, h2) from the input fragments. Existing algorithms do not work well when the error rate of fragments is high. Here we design an algorithm that can give accurate solutions even if the error rate of fragments is high.
We first give a dynamic programming algorithm that can give exact solutions to the haplotype assembly problem. The time complexity of the algorithm is O(n × 2^t × t), where n is the number of SNPs, and t is the maximum coverage of a SNP site. The algorithm is slow when t is large. To solve the problem when t is large, we further propose a heuristic algorithm on the basis of the dynamic programming algorithm. Experiments show that our heuristic algorithm can give very accurate solutions.
We have tested our algorithm on a set of benchmark datasets. Experiments show that our algorithm can give very accurate solutions. It outperforms most of the existing programs when the error rate of the input fragments is high.
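To make the problem statement concrete, the sketch below solves a tiny instance by brute force. It is neither the dynamic programming algorithm nor the heuristic described above. It assumes the widely used minimum error correction (MEC) objective, which the text does not spell out, and it assumes that the two haplotypes are complementary at every SNP site. The fragment encoding ('0', '1', and '-' for an uncovered site) and all function names are hypothetical.

```python
from itertools import product


def mismatches(fragment, haplotype):
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def complement(haplotype):
    """Return the haplotype with every allele flipped (assumed h2 = not h1)."""
    return "".join("1" if allele == "0" else "0" for allele in haplotype)


def mec_cost(fragments, h1):
    """Assumed MEC objective: each fragment is charged to its closer haplotype."""
    h2 = complement(h1)
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def brute_force_haplotypes(fragments):
    """Try every candidate h1; exponential in the number of SNP sites."""
    n = len(fragments[0])
    best_cost, best_h1 = None, None
    # Fixing the first allele of h1 to '0' removes the (h1, h2) / (h2, h1) symmetry.
    for rest in product("01", repeat=n - 1):
        h1 = "0" + "".join(rest)
        cost = mec_cost(fragments, h1)
        if best_cost is None or cost < best_cost:
            best_cost, best_h1 = cost, h1
    return best_h1, complement(best_h1), best_cost


if __name__ == "__main__":
    # Toy input: four error-free fragments and one fragment with a single error.
    reads = ["0101", "010-", "1010", "-010", "0111"]
    print(brute_force_haplotypes(reads))
```

Because the search visits 2^(n-1) candidates, this is only usable as a reference for very small n, for example to check the output of a faster method on toy inputs.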
Protein structure comparison and classification is an effective method for exploring protein structure-function relations. This problem is computationally challenging. Many different computational approaches for protein structure comparison apply the secondary structure elements (SSEs) representation of protein structures.
We study the complexity of the protein structure comparison problem based on a mixed-graph model with respect to different computational frameworks. We develop an effective approach for protein structure comparison based on a novel independent set enumeration algorithm. Our approach, named ePC (efficient enumeration-based Protein structure Comparison), is tested for general purpose protein structure comparison as well as for specific protein examples. Compared with other graph-based approaches for protein structure comparison, the theoretical running time O(1.47^(rn) n^2) of ePC is significantly better, where n is the smaller number of SSEs of the two proteins and r is a parameter of small value.
Through the enumeration algorithm, our approach can identify different substructures from a list of high-scoring solutions of biological interest. Our approach can also flexibly compare protein structures with the SSEs in either sequential or non-sequential order. Supplementary data of additional testing and the source of ePC will be available at http://bioinformatics.astate.edu/.
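As a rough illustration of what enumerating independent sets involves, the sketch below lists every maximal independent set of a small undirected graph using a simple branching scheme (the Bron-Kerbosch recursion applied to non-neighbours). It is not the ePC algorithm and does not use the mixed-graph model; the function name and the adjacency-dictionary input format are assumptions.

```python
def maximal_independent_sets(adj):
    """Yield every maximal independent set of a graph.

    `adj` maps each vertex to the set of its neighbours (hypothetical format).
    """

    def extend(chosen, candidates, excluded):
        # No vertex can be added and none was skipped: the set is maximal.
        if not candidates and not excluded:
            yield frozenset(chosen)
            return
        for v in list(candidates):
            blocked = adj[v] | {v}
            # Branch 1: take v, which rules out v and all of its neighbours.
            yield from extend(chosen | {v}, candidates - blocked, excluded - blocked)
            # Branch 2: skip v for the remaining iterations of this call.
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from extend(set(), set(adj), set())


if __name__ == "__main__":
    # Path graph a - b - c - d.
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
    for independent_set in maximal_independent_sets(path):
        print(sorted(independent_set))
```

Enumerating all solutions, rather than returning a single best one, is what makes it possible to inspect a list of alternative high-scoring substructures.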
microRNAs (miRNAs) are tiny endogenous RNAs that have been discovered in animals and plants, and direct the post-transcriptional regulation of target mRNAs for degradation or translational repression via binding to the 3'UTRs and the coding exons. To gain insight into the biological role of miRNAs, it is essential to identify the full repertoire of mRNA targets (target genes). A number of computer programs have been developed for miRNA-target prediction. These programs essentially focus on potential binding sites in 3'UTRs, which are recognized by miRNAs according to specific base-pairing rules.
Here, we introduce a novel method for miRNA-target prediction that is entirely independent of existing approaches. The method is based on the hypothesis that transcription of a miRNA and its target genes tends to be co-regulated by common transcription factors. This hypothesis predicts the frequent occurrence of common cis-elements between promoters of a miRNA and its target genes. Accordingly, our proposed method first identifies putative cis-elements in a promoter of a given miRNA, and then identifies genes that contain common putative cis-elements in their promoters. In this paper, we show that a significant number of common cis-elements occur in ~28% of experimentally supported human miRNA-target data. Moreover, we show that the prediction of human miRNA-targets based on our method is statistically significant. Further, we discuss the random incidence of common cis-elements, their consensus sequences, and the advantages and disadvantages of our method.
This is the first report indicating prevalence of transcriptional regulation of a miRNA and its target genes by common transcription factors and the predictive ability of miRNA-targets based on this property.
Tandem repeats (TRs) in the mitochondrial (mt) genome control region have been documented in a wide variety of vertebrate species. The mechanism by which repeated tracts originate and undergo duplication and deletion, however, remains unclear.
We analyzed DNA sequences of mt genome TRs (mtTRs) in the ridged-eye flounder (Pleuronichthys cornutus), and characterized DNA sequences of mtTRs from other vertebrates using the data available in GenBank. Tandem repeats are concentrated in the control regions; however, we found approximately 16.6% of the TRs elsewhere in the mt genome. The flounder mtTRs possess three motif types with hypervariable characteristics at the 3′ end of the control region (CR).
Based on our analysis of this larger dataset of mtTR sequences, we propose a novel model of Pause Melting Misalignment (PMM) to describe the birth and motif indel of tandem repeats. PMM is activated during a pause event in mitochondrial replication in which a dynamic competition between the nascent (N) heavy strand and the displaced (D) heavy strand may lead to the melting of the N-strand from the template (T) light strand. When mispairing occurs during rebinding of the N-strand, one or several motifs can be inserted or deleted in both strands during the next round of mt-replication or repair. This model can explain the characteristics of TRs in available vertebrate mt genomes.
Due to the rapid progress of next-generation sequencing (NGS) facilities, an explosion of human whole genome data will become available in the coming years. These data can be used to optimize and to increase the resolution of the phylogenetic Y chromosomal tree. Moreover, the exponential growth of known Y chromosomal lineages will require an automatic determination of the phylogenetic position of an individual based on whole genome SNP calling data and an up to date Y chromosomal tree.
We present an automated approach, ‘AMY-tree’, which is able to determine the phylogenetic position of a Y chromosome using a whole genome SNP profile, independently of the NGS platform and SNP calling program, whereby mistakes in the SNP calling or phylogenetic Y chromosomal tree are taken into account. Moreover, AMY-tree indicates ambiguities within the present phylogenetic tree and points out new Y-SNPs which may be phylogenetically relevant. The AMY-tree software package was validated successfully on 118 whole genome SNP profiles of 109 males with different origins. Moreover, support was found for an unknown recurrent mutation, wrongly reported mutation conversions and a large number of new, interesting Y-SNPs.
Therefore, AMY-tree is a useful tool to determine the Y lineage of a sample based on SNP calling, to identify Y-SNPs with yet unknown phylogenetic position and to optimize the Y chromosomal phylogenetic tree in the future. AMY-tree will not add lineages to the existing phylogenetic tree of the Y-chromosome but it is the first step to analyse whole genome SNP profiles in a phylogenetic framework.
Haploid marker; Phylogeny; Next-generation sequencing; SNP calling; Y-SNP mutations; Y chromosome haplogroups
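The general idea of placing a sample in a SNP-defined tree while tolerating some calling errors can be sketched as follows. This is not the AMY-tree algorithm: the toy tree, the SNP names, the support threshold and the greedy descent are all hypothetical choices made for illustration.

```python
# Hypothetical toy tree: haplogroup -> (parent, names of its defining SNPs).
TREE = {
    "root": (None, set()),
    "A": ("root", {"s1", "s2"}),
    "B": ("root", {"s3"}),
    "A1": ("A", {"s4", "s5"}),
    "A2": ("A", {"s6"}),
}


def support(node, calls):
    """Fraction of the node's called defining SNPs that are in the derived state.

    `calls` maps a SNP name to True (derived) or False (ancestral).
    Returns None when none of the node's SNPs was called.
    """
    called = [snp for snp in TREE[node][1] if snp in calls]
    if not called:
        return None
    return sum(calls[snp] for snp in called) / len(called)


def place(calls, min_support=0.5):
    """Descend from the root into the best supported child until none qualifies."""
    node = "root"
    while True:
        children = [name for name, (parent, _) in TREE.items() if parent == node]
        scored = [(support(child, calls), child) for child in children]
        scored = [(s, child) for s, child in scored if s is not None and s >= min_support]
        if not scored:
            return node
        # A threshold below 1 lets a single miscalled SNP be outvoted.
        node = max(scored)[1]


if __name__ == "__main__":
    # "s2" is called ancestral here, mimicking a calling error inside lineage A.
    profile = {"s1": True, "s2": False, "s3": False, "s4": True, "s5": True, "s6": False}
    print(place(profile))
```

Requiring only a fraction of a node's defining SNPs to be derived is one simple way to keep a single erroneous call from stopping the descent too early.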
Many studies have revealed correlations between breast tumour phenotypes, variations in gene expression, and patient survival outcomes. The molecular heterogeneity between breast tumours revealed by these studies has allowed prediction of prognosis and has underpinned stratified therapy, where groups of patients with particular tumour types receive specific treatments. The molecular tests used to predict prognosis and stratify treatment usually utilise fixed sets of genomic biomarkers, with the same biomarker sets being used to test all patients. In this paper we suggest that instead of fixed sets of genomic biomarkers, it may be more effective to use a stratified biomarker approach, where optimal biomarker sets are automatically chosen for particular patient groups, analogous to the choice of optimal treatments for groups of similar patients in stratified therapy. We illustrate the effectiveness of a biclustering approach to select optimal gene sets for determining the prognosis of specific strata of patients, based on potentially overlapping, non-discrete molecular characteristics of tumours.
Biclustering identified tightly co-expressed gene sets in the tumours of restricted subgroups of breast cancer patients. The co-expressed genes in these biclusters were significantly enriched for particular biological annotations and gene regulatory modules associated with breast cancer biology. Tumours identified within the same bicluster were more likely to present with similar clinical features. Bicluster membership combined with clinical information could predict patient prognosis in conditional inference tree and ridge regression class prediction models.
The increasing clinical use of genomic profiling demands identification of more effective methods to segregate patients into prognostic and treatment groups. We have shown that biclustering can be used to select optimal gene sets for determining the prognosis of specific strata of patients.
Biclustering; Gene expression profiles; Tumour classification; Survival prediction; Breast cancer
A classical example of repeated speciation coupled with ecological diversification is the evolution of 14 closely related species of Darwin’s (Galápagos) finches (Thraupidae, Passeriformes). Their adaptive radiation in the Galápagos archipelago took place in the last 2–3 million years and some of the molecular mechanisms that led to their diversification are now being elucidated. Here we report evolutionary analyses of the genome of the large ground finch, Geospiza magnirostris.
13,291 protein-coding genes were predicted from a 991.0 Mb G. magnirostris genome assembly. We then defined gene orthology relationships and constructed whole genome alignments between the G. magnirostris and other vertebrate genomes. We estimate that 15% of genomic sequence is functionally constrained between G. magnirostris and zebra finch. Genic evolutionary rate comparisons indicate that similar selective pressures acted along the G. magnirostris and zebra finch lineages suggesting that historical effective population size values have been similar in both lineages. 21 otherwise highly conserved genes were identified that each show evidence for positive selection on amino acid changes in the Darwin's finch lineage. Two of these genes (Igf2r and Pou1f1) have been implicated in beak morphology changes in Darwin’s finches. Five of 47 genes showing evidence of positive selection in early passerine evolution have cilia related functions, and may be examples of adaptively evolving reproductive proteins.
These results provide insights into past evolutionary processes that have shaped G. magnirostris genes and its genome, and provide the necessary foundation upon which to build population genomics resources that will shed light on more contemporaneous adaptive and non-adaptive processes that have contributed to the evolution of the Darwin’s finches.
Genomics; Evolution; Darwin’s finches; Large ground finch; Geospiza magnirostris
Cotton (Gossypium hirsutum) anther development involves a diverse range of gene interactions between sporophytic and gametophytic tissues. However, only a small number of genes are known to be specifically involved in this developmental process and the molecular mechanism of genetic male sterility (GMS) is still poorly understood. To fully explore global gene expression during cotton anther development and identify genes related to male sterility, a digital gene expression (DGE) analysis was adopted.
Six DGE libraries were constructed from the cotton anthers of the wild type (WT) and GMS mutant (in the WT background) in three stages of anther development, resulting in 21,503 to 37,352 genes detected in WT and GMS mutant anthers. Compared with the fertile isogenic WT, 9,595 (30% of the expressed genes), 10,407 (25%), and 3,139 (10%) genes were differentially expressed at the meiosis, tetrad, and uninucleate microspore stages of GMS mutant anthers, respectively. Both DGE experiments and real-time quantitative RT-PCR showed that the expression of many key genes required for anther development was suppressed at the meiosis and uninucleate microspore stages in anthers of the mutant, whereas these genes were activated at the tetrad stage. These genes were associated predominantly with hormone synthesis, sucrose and starch metabolism, the pentose phosphate pathway, glycolysis, flavonoid metabolism, and histone protein synthesis. In addition, several genes that participate in DNA methylation, cell wall loosening, programmed cell death, and reactive oxygen species generation/scavenging were activated during the three anther developmental stages in the mutant.
Compared to the same anther developmental stage of the WT, many key genes involved in various aspects of anther development show a reversed expression pattern in the GMS mutant, which indicates that diverse gene regulation pathways are involved in GMS mutant anther development. These findings provide the first insights into the mechanism that leads to genetic male sterility in cotton and contribute to a better understanding of the regulatory network involved in anther development in cotton.
Most studies on the origin and evolution of microRNA in the human genome have been focused on its relationship with repetitive elements and segmental duplications. However, duplication events at a smaller scale (<1 kb) could also contribute to microRNA expansion, as demonstrated in this study.
Using comparative genome analysis and bioinformatics methods, we found nine novel expanded microRNA families enriched in short duplicated sequences in the human genome. Furthermore, novel genomic regions were found to contain microRNA paralogs for microRNA families previously found to be related to segmental duplications. Of the microRNA families expanded in the human genome, 14 are specific to the primate lineage and nine are not. Two microRNA families (hsa-mir-1233 and hsa-mir-622) appear to be further expanded in the human genome, and were confirmed by fluorescence in situ hybridization. These novel microRNA families expanded in the human genome were mostly embedded in or close to proteins with conserved functions. Furthermore, besides the Alu element, L1 elements could also contribute to the origination of microRNA paralog families.
Together, our results show that small duplication events can also contribute to microRNA expansion, which provides novel insights into the evolution of human genome structure and function.
Human; Genome; microRNA; Duplication; Evolution
The throughput of next-generation sequencing machines has increased dramatically over the last few years; yet the cost and time for library preparation have not changed proportionally, thus representing the main bottleneck for sequencing large numbers of samples. Here we present an economical, high-throughput library preparation method for the Illumina platform, comprising a 96-well based method for DNA isolation for yeast cells, a low-cost DNA shearing alternative, and adapter ligation using heat inactivation of enzymes instead of bead cleanups.
Up to 384 whole-genome libraries can be prepared from yeast cells in one week using this method, for less than 15 euros per sample. We demonstrate the robustness of this protocol by sequencing over 1000 yeast genomes at ~30x coverage. The sequence information from 768 yeast segregants derived from two divergent S. cerevisiae strains was used to generate a meiotic recombination map at unprecedented resolution. Comparisons to other datasets indicate a high conservation of recombination at a chromosome-wide scale, but differences at the local scale. Additionally, we detected a high degree of aneuploidy (3.6%) by examining the sequencing coverage in these segregants. Differences in allele frequency allowed us to attribute instances of aneuploidy to gains of chromosomes during meiosis or mitosis, both of which showed a strong tendency to missegregate specific chromosomes.
Here we present a high-throughput workflow to sequence the genomes of a large number of yeast strains at a low price. We have used this workflow to obtain recombination and aneuploidy data from hundreds of segregants, which can serve as a foundation for future studies of linkage, recombination, and chromosomal aberrations in yeast and higher eukaryotes.
Next-generation sequencing; High throughput; DNA isolation; Yeast; DNA fragmentation; Heat inactivation; Recombination; Aneuploidy
Tissues and their component cells have unique DNA methylation profiles comprising DNA methylation patterns of tissue-dependent and differentially methylated regions (T-DMRs). Previous studies reported that DNA methylation plays crucial roles in cell differentiation and development. Here, we investigated the genome-wide DNA methylation profiles of mouse neural progenitors derived from different developmental stages using HpyCH4IV, a methylation-sensitive restriction enzyme that recognizes ACGT residues, which are uniformly distributed across the genome.
Using a microarray-based genome-wide DNA methylation analysis system focusing on 8.5-kb regions around transcription start sites (TSSs), we analyzed the DNA methylation profiles of mouse neurospheres derived from telencephalons at embryonic days 11.5 (E11.5NSph) and 14.5 (E14.5NSph) and the adult brain (AdBr). We identified T-DMRs with different DNA methylation statuses between E11.5NSph and E14.5NSph at genes involved in neural development and/or associated with neurological disorders in humans, such as Dclk1, Nrcam, Nfia, and Ntng1. These T-DMRs were located not only within 2 kb but also distal (several kbs) from the TSSs, and those hypomethylated in E11.5NSph tended to be in CpG island (CGI-) associated genes. Most T-DMRs that were hypomethylated in neurospheres were also hypomethylated in the AdBr. Interestingly, among the T-DMRs hypomethylated in the progenitors, there were T-DMRs that were hypermethylated in the AdBr. Although certain genes, including Ntng1, had hypermethylated T-DMRs 5′ upstream, we identified hypomethylated T-DMRs in the AdBr, 3′ downstream from their TSSs. This observation could explain why Ntng1 was highly expressed in the AdBr despite upstream hypermethylation.
Mouse adult brain DNA methylation and gene expression profiles could be attributed to developmental dynamics of T-DMRs in neural-related genes.
DNA methylation; Tissue-dependent and differentially methylated region; Neural progenitor cells
Phosphorus (P) is an essential macronutrient for plant growth and development. To modulate their P homeostasis, plants must balance P uptake, mobilisation, and partitioning to various organs. Despite the worldwide importance of wheat as a cultivated food crop, molecular mechanisms associated with phosphate (Pi) starvation in wheat remain unclear. To elucidate these mechanisms, we used RNA-Seq methods to generate transcriptome profiles of the wheat variety ‘Chinese Spring’ responding to 10 days of Pi starvation.
We carried out de novo assembly on 73.8 million high-quality reads generated from RNA-Seq libraries. We then constructed a transcript dataset containing 29,617 non-redundant wheat transcripts, comprising 15,047 contigs and 14,570 non-redundant full-length cDNAs from the TriFLDB database. When compared with barley full-length cDNAs, 10,656 of the 15,047 contigs were unalignable, suggesting that many might be distinct from barley transcripts. The average expression level of the contigs was lower than that of the known cDNAs, implying that these contigs included transcripts that were rarely represented in the full-length cDNA library. Within the non-redundant transcript set, we identified 892–2,833 responsive transcripts in roots and shoots, corresponding on average to 23.4% of the contigs not covered by cDNAs in TriFLDB under Pi starvation. The relative expression level of the wheat IPS1 (Induced by Phosphate Starvation 1) homologue, TaIPS1, was 341-fold higher in roots and 13-fold higher in shoots; this finding was further confirmed by qRT-PCR analysis. A comparative analysis of the wheat- and rice-responsive transcripts for orthologous genes under Pi-starvation revealed commonly upregulated transcripts, most of which appeared to be involved in a general response to Pi starvation, namely, an IPS1-mediated signalling cascade and its downstream functions such as Pi remobilisation, Pi uptake, and changes in Pi metabolism.
Our transcriptome profiles demonstrated the impact of Pi starvation on global gene expression in wheat. This study revealed that enhancement of the Pi-mediated signalling cascade using IPS1 is a potent adaptation mechanism to Pi starvation that is conserved in both wheat and rice and validated the effectiveness of using short-read next-generation sequencing data for wheat transcriptome analysis in the absence of reference genome information.
De novo assembly; RNA-Seq; Transcriptome; Wheat; Phosphorus; Phosphate starvation
Hevea brasiliensis, a member of the Euphorbiaceae family, is the major commercial source of natural rubber (NR). NR is a latex polymer with high elasticity, flexibility, and resilience that has played a critical role in the world economy since 1876.
Here, we report the draft genome sequence of H. brasiliensis. The assembly spans ~1.1 Gb of the estimated 2.15 Gb haploid genome. Overall, ~78% of the genome was identified as repetitive DNA. Gene prediction shows 68,955 gene models, of which 12.7% are unique to Hevea. Most of the key genes associated with rubber biosynthesis, rubberwood formation, disease resistance, and allergenicity have been identified.
The knowledge gained from this genome sequence will aid in the future development of high-yielding clones to keep up with the ever increasing need for natural rubber.
Hevea brasiliensis; Euphorbiaceae; Natural rubber; Genome
Buchnera aphidicola is an obligate symbiotic bacterium, associated with most of the Aphididae, whose genome has drastically shrunk during intracellular evolution. Gene regulation in Buchnera has been a matter of controversy in recent years, as combining genomic information with experimental results has produced contradictory conclusions, both refuting and supporting a functional and responsive transcription regulation in Buchnera.
The goal of this study was to describe the gene transcription regulation capabilities of Buchnera based on the inventory of cis- and trans-regulators encoded in the genomes of five strains from different aphids (Acyrthosiphon pisum, Schizaphis graminum, Baizongia pistacea, Cinara cedri and Cinara tujafilina), as well as on the characterisation of some intrinsic structural properties of the DNA molecule in these bacteria.
Interaction graph analysis shows that gene neighbourhoods are conserved between E. coli and Buchnera in structures called transcriptons, interactons and metabolons, indicating that selective pressures have acted on the evolution of transcriptional, protein-protein interaction and metabolic networks in Buchnera. The transcriptional regulatory network in Buchnera is composed of a few general DNA-topological regulators (Nucleoid Associated Proteins and topoisomerases), with the quasi-absence of any specific ones (except for multifunctional enzymes with a known gene expression regulatory role in Escherichia coli, such as AlaS, PepA and BolA, and the uncharacterized hypothetical regulators YchA and YrbA). The relative positioning of regulatory genes along the chromosome of Buchnera seems to have conserved its ancestral state, despite the genome erosion. Sigma-70 promoters with canonical thermodynamic sequence profiles were detected upstream of about 94% of the CDS of Buchnera in the different aphids. Based on Stress-Induced Duplex Destabilization (SIDD) measurements, unstable σ70 promoters were found specifically associated with the regulator and transporter genes.
This genomic analysis provides supporting evidence of a selection of functional regulatory structures and it has enabled us to propose hypotheses concerning possible links between these regulatory elements and the DNA-topology (i.e., supercoiling, curvature, flexibility and base-pair stability) in the regulation of gene expression in the shrunken genome of Buchnera.
Buchnera aphidicola; Genome reduction; Transcription regulation; DNA-topology; Nucleoid associated proteins (NAPs)
Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far.
We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity).
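The two figures quoted for the decision tree (78% of TB cases correctly classified, 100% specificity) are the sensitivity and specificity of a binary classifier. The following minimal sketch shows how these quantities are computed from true and predicted labels. The function name and the sample labels are hypothetical and chosen only for illustration; they are not the study's data.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="TB"):
    """Return (sensitivity, specificity) treating `positive` as the disease class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1  # TB case called TB
            else:
                fn += 1  # TB case missed
        else:
            if pred == positive:
                fp += 1  # non-TB subject wrongly called TB
            else:
                tn += 1  # non-TB subject correctly called non-TB
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example (NOT the study's cohort): 9 TB cases, 10 LTBI subjects.
truth = ["TB"] * 9 + ["LTBI"] * 10
pred = ["TB"] * 7 + ["LTBI"] * 2 + ["LTBI"] * 10
sens, spec = sensitivity_specificity(truth, pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this made-up example, seven of nine TB cases are detected and no LTBI subject is called TB, which mirrors the pattern reported above: a specificity of 100% means no false positives, while sensitivity can still fall short of 100%.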
Our data justify the further exploration of our signature set as biomarkers for potential childhood TB diagnosis. Because different biomarkers are identified in ethnically distinct cohorts, it is important to cross-validate newly identified markers in all available cohorts.
Biomarker; Children; Mycobacterium tuberculosis; Transcriptomics
Class 2 transposable elements (TEs) are the predominant elements in and around plant genes where they generate significant allelic diversity. Using the complete sequences of four grasses, we have performed a novel comparative analysis of class 2 TEs. To ensure consistent comparative analyses, we re-annotated class 2 TEs in Brachypodium distachyon, Oryza sativa (rice), Sorghum bicolor and Zea mays and assigned them to one of the five cut-and-paste superfamilies found in plant genomes (Tc1/mariner, PIF/Harbinger, hAT, Mutator, CACTA). We have focused on noncoding elements because of their abundance, and compared superfamily copy number, size and genomic distribution as well as correlation with the level of nearby gene expression.
Our comparison revealed both unique and conserved features. First, the average length or size distribution of elements in each superfamily is largely conserved, with the shortest always being Tc1/mariner elements, followed by PIF/Harbinger, hAT, Mutator and CACTA. This order also holds for the ratio of the copy numbers of noncoding to coding elements. Second, with the exception of CACTAs, noncoding TEs are enriched within and flanking genes, where they display conserved distribution patterns, having the highest peak in the promoter region. Finally, our analysis of microarray data revealed that genes associated with Tc1/mariner and PIF/Harbinger noncoding elements have significantly higher expression levels than genes without class 2 TEs. In contrast, genes with CACTA elements have significantly lower expression than genes without class 2 TEs.
We have achieved the most comprehensive annotation of class 2 TEs to date in these four grass genomes. Comparative analysis of this robust dataset led to the identification of several previously unknown features of each superfamily related to copy number, element size, genomic distribution and correlation with the expression levels of nearby genes. These results highlight the importance of distinguishing TE superfamilies when assessing their impact on gene and genome evolution.
Genome comparison; Plant genomes; Genome evolution; Class 2 transposable elements; Features; Grass genomes
Alzheimer’s disease (AD) is intimately tied to amyloid-β (Aβ) peptide. Extraneuronal brain plaques consisting primarily of Aβ aggregates are a hallmark of AD. Intraneuronal Aβ subunits are strongly implicated in disease progression. Protein sequence mutations of the Aβ precursor protein (APP) account for a small proportion of AD cases, suggesting that regulation of the associated gene (APP) may play a more important role in AD etiology. The APP promoter possesses a novel 30 nucleotide sequence, or “proximal regulatory element” (PRE), located at −76/−47 relative to the +1 transcription start site, that confers cell type specificity. This PRE contains sequences that make it vulnerable to epigenetic modification and may present a viable target for drug studies. We examined PRE-nuclear protein interaction by gel electrophoretic mobility shift assay (EMSA) and PRE mutant EMSA. This was followed by functional studies of PRE mutant/reporter gene fusion clones.
EMSA probed with the PRE showed DNA-protein interaction in multiple nuclear extracts and in human brain tissue nuclear extract in a tissue-type specific manner. We identified transcription factors that are likely to bind the PRE, using competition gel shift and gel supershift: Activator protein 2 (AP2), nm23 nucleoside diphosphate kinase/metastatic inhibitory protein (PuF), and specificity protein 1 (SP1). These sites crossed a known single nucleotide polymorphism (SNP). EMSA with PRE mutants and promoter/reporter clone transfection analysis further implicated PuF in cells and extracts. Functional assays of mutant/reporter clone transfections were evaluated by ELISA of reporter protein levels. EMSA and ELISA results correlated by meta-analysis.
We propose that PuF may regulate the APP gene promoter and that AD risk may be increased by interference with PuF regulation at the PRE. PuF is targeted by calcium/calmodulin-dependent protein kinase II inhibitor 1, which also interacts with the integrins. These proteins are connected to vital cellular and neurological functions. In addition, the transcription factor PuF is a known inhibitor of metastasis and regulates cell growth during development. Given that APP is a known cell adhesion protein and ferroxidase, this suggests biochemical links among cell signaling, the cell cycle, iron metabolism in cancer, and AD in the context of overall aging.
Amyloid precursor protein; Alzheimer’s disease; Cancer; Gene regulation; Gene transcription; Iron; Latency; nm23 nucleoside diphosphate kinase; Oncogenesis; PuF; SP1; Specificity protein 1; Transcription factor
Comparative genomics is a formidable tool to identify functional elements throughout a genome. In the past ten years, studies in the budding yeast Saccharomyces cerevisiae and a set of closely related species have been instrumental in showing the benefit of analyzing patterns of sequence conservation. Increasing the number of closely related genome sequences makes the comparative genomics approach more powerful and accurate.
Here, we report the genome sequence and analysis of Saccharomyces arboricolus, a yeast species recently isolated in China, that is closely related to S. cerevisiae. We obtained high quality de novo sequence and assemblies using a combination of next generation sequencing technologies, established the phylogenetic position of this species and considered its phenotypic profile under multiple environmental conditions in the light of its gene content and phylogeny.
We suggest that the genome of S. arboricolus will be useful in future comparative genomics analysis of the Saccharomyces sensu stricto yeasts.
The genomic basis of teleost phenotypic complexity remains obscure, despite increasing availability of genome and transcriptome sequence data. Fish-specific genome duplication cannot provide sufficient explanation for the morphological complexity of teleosts, considering the relatively large number of extinct basal ray-finned fishes.
In this study, we performed comparative genomic analysis to discover the Conserved Teleost-Specific Genes (CTSGs) and orphan genes within zebrafish and found that these two sets of lineage-specific genes may have played important roles during zebrafish embryogenesis. Lineage-specific genes within zebrafish share many of the characteristics of their counterparts in other species: shorter length, fewer exons, higher GC content, and less frequent transcript support. Chromosomal location analysis indicated that neither the CTSGs nor the orphan genes were distributed evenly in the chromosomes of zebrafish. The significant enrichment of immunity proteins in CTSGs, whether annotated by gene ontology (GO) or predicted ab initio, suggests that defense against pathogens may be an important reason for the diversification of teleosts. We determined the evolutionary origin of the lineage-specific genes and found that a very high percentage were generated via gene duplications. The temporal and spatial expression profile of lineage-specific genes obtained from expressed sequence tags (EST) and RNA-seq data revealed two novel properties: in addition to showing highly tissue-preferred expression, lineage-specific genes are also highly temporally restricted, that is, they are expressed in narrower time windows than evolutionarily conserved genes and are specifically enriched in later-stage embryos and early larval stages.
Our study provides the first systematic identification of two different sets of lineage-specific genes within zebrafish and provides valuable information for future studies of the genomic basis of teleost phenotypic complexity.
Teleost; Lineage-specific gene; Transcriptome; Zebrafish embryogenesis
The emerging Hi-C protocol can probe three-dimensional (3D) architecture and capture chromatin interactions on a genome-wide scale. It provides informative results to address how changes in chromatin organization contribute to disease/tumor occurrence and progression in response to stimulation by environmental chemicals or hormones.
In this study, using MCF7 cells as a model system, we found that estrogen (E2) stimulation significantly impacts chromatin interactions, leading to alteration of gene regulation and the associated histone modification states. Many chromosomal interaction regions at different levels of interaction frequency were identified. In particular, the top 10 hot regions with the highest interaction frequency are enriched with breast cancer specific genes. Furthermore, we classified four types of strong E2-mediated differential chromosomal interactions (gain or loss, intra- or inter-chromosomal); gain interactions were fewer than loss interactions upon E2 stimulation. Finally, by integrating eight histone modification marks, DNA methylation, regulatory element regions, and ERα and Pol-II binding activities, associations between epigenetic patterns and high chromosomal interaction frequency were revealed in E2-mediated gene regulation.
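The four interaction types named above combine two binary properties: whether an interaction is gained or lost upon E2 stimulation, and whether its two loci lie on the same chromosome (intra) or on different chromosomes (inter). The sketch below illustrates one way such a classification could be expressed. The function name, the log2 fold-change threshold and the pseudocount are assumptions made for illustration; the abstract does not state the criteria actually used to call a "strong" differential interaction.

```python
import math


def classify_interaction(chrom_a, chrom_b, count_control, count_e2,
                         min_log2_fc=1.0, pseudocount=1.0):
    """Label an interaction as gain/loss and intra/inter, or None if not strong.

    min_log2_fc and pseudocount are illustrative assumptions, not values
    taken from the study.
    """
    log2_fc = math.log2((count_e2 + pseudocount) / (count_control + pseudocount))
    if abs(log2_fc) < min_log2_fc:
        return None  # change too small to count as a strong differential interaction
    direction = "gain" if log2_fc > 0 else "loss"
    scope = "intra" if chrom_a == chrom_b else "inter"
    return f"{direction}-{scope}"


# Hypothetical read counts for three locus pairs (control vs. E2-treated).
print(classify_interaction("chr1", "chr1", 3, 15))   # interaction strengthened, same chromosome
print(classify_interaction("chr1", "chr2", 15, 3))   # interaction weakened, different chromosomes
print(classify_interaction("chr1", "chr1", 10, 11))  # negligible change
```

Counting the labels returned over all locus pairs would give the per-type totals that the comparison between gain and loss interactions relies on.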
The work provides insight into the effect of chromatin interaction on E2/ERα regulated downstream genes in breast cancer cells.