Gene regulation through cis-regulatory elements plays a crucial role in development and disease. A major aim of the post-genomic era is to be able to read the function of cis-regulatory elements through scrutiny of their DNA sequence. Whilst comparative genomics approaches have identified thousands of putative regulatory elements, our knowledge of their mechanism of action is poor and very little progress has been made in systematically de-coding them.
Here, we identify ancient functional signatures within vertebrate conserved non-coding elements (CNEs) through a combination of phylogenetic footprinting and functional assay, using genomic sequence from the sea lamprey as a reference. We uncover a striking enrichment within vertebrate CNEs for conserved binding-site motifs of the Pbx-Hox hetero-dimer. We further show that these predict reporter gene expression in a segment specific manner in the hindbrain and pharyngeal arches during zebrafish development.
These findings evoke an evolutionary scenario in which many CNEs evolved early in the vertebrate lineage to co-ordinate Hox-dependent gene-regulatory interactions that pattern the vertebrate head. In a broader context, our evolutionary analyses reveal that CNEs are composed of tightly linked transcription-factor binding-sites (TFBSs), which can be systematically identified through phylogenetic footprinting approaches. By placing a large number of ancient vertebrate CNEs into a developmental context, our findings promise to have a significant impact on efforts toward de-coding gene-regulatory elements that underlie vertebrate development, and will facilitate building general models of regulatory element evolution.
Gene regulation; enhancer code; sea lamprey; Hox genes; embryogenesis
The sequencing of the cow genome was recently published (Btau_4.0 assembly). A second, alternate cow genome assembly (UMD2), based on the same raw sequence data, was also published. The two assemblies have been subsequently updated to Btau_4.2 and UMD3.1, respectively.
We compared the Btau_4.2 and UMD3.1 alternate assemblies. Inconsistencies were grouped into three main categories: (i) DNA segments showing almost coincidental chromosomal mapping but discordant orientation (inversions); (ii) DNA segments showing a discordant map position along the same chromosome; and (iii) sequences present in one chromosomal assembly but absent in the corresponding chromosome of the other assembly. The latter category mainly consisted of a large number of scaffolds that were unassigned in Btau_4.2 but successfully mapped in UMD3.1. We sampled 70 inconsistencies and identified appropriate cow BACs for each of them. These clones were then used in FISH experiments on cow metaphase or interphase nuclei in order to resolve the discrepancies. In almost all instances the FISH results agreed with the UMD3.1 assembly. Occasionally, however, the mapping data of both assemblies were discordant with the FISH results.
Our work demonstrates how FISH, which is assembly independent, can be efficiently used to solve assembly problems frequently encountered using the shotgun approach.
Cow genome; alternate assemblies of cow genomes; genomic comparison; unassigned scaffolds; BAC-FISH mapping
GC-skews have previously been linked to transcription in some eukaryotes. They have been associated with transcription start sites, with the coding strand G-biased in mammals and C-biased in fungi and invertebrates.
We show a consistent and highly significant pattern of GC-skew within genes of almost all unicellular fungi. The pattern of GC-skew is asymmetrical: the coding strand of genes is typically C-biased at the 5' ends but G-biased at the 3' ends, with intermediate skews in the middle of genes. Thus, the initiation, elongation, and termination phases of transcription are associated with different skews. This pattern influences the encoded proteins by generating differential usage of amino acids at the 5' and 3' ends of genes. These biases also affect fourfold-degenerate positions and extend into promoters and 3' UTRs, indicating that the skews cannot be accounted for by selection for protein function or translation.
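As an illustration of how such skews can be measured, the following minimal Python sketch computes the GC-skew of a coding-strand sequence and a coarse 5'-to-3' profile. It assumes the commonly used definition (G - C)/(G + C); the function names `gc_skew` and `skew_profile` and the choice of three equal bins are hypothetical and are not taken from the study.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one strand; 0.0 when the sequence has no G or C."""
    s = seq.upper()
    g = s.count("G")
    c = s.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def skew_profile(seq: str, n_bins: int = 3) -> list:
    """Split a sequence into n_bins roughly equal segments (5' to 3')
    and return the GC-skew of each segment.

    The number of bins is an illustrative assumption.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    size = len(seq) / n_bins
    bounds = [round(i * size) for i in range(n_bins + 1)]
    return [gc_skew(seq[bounds[i]:bounds[i + 1]]) for i in range(n_bins)]


if __name__ == "__main__":
    # Toy sequence: C-rich 5' end, neutral middle, G-rich 3' end.
    print(skew_profile("CCCCATATGGGG", 3))
```

A negative value indicates a C-biased segment and a positive value a G-biased segment, so a gene matching the reported pattern would give a profile that rises from the 5' end to the 3' end.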
We propose two explanations, the mutational pressure hypothesis, and the adaptive hypothesis. The mutational pressure hypothesis is that different co-factors bind to RNA pol II at different phases of transcription, producing different mutational regimes. The adaptive hypothesis is that cytidine triphosphate deficiency may lead to C-avoidance at the 3' ends of transcripts to control the flow of RNA pol II molecules and reduce their frequency of collisions.
miRNAs are ~21 nucleotide long small noncoding RNA molecules, formed endogenously in most eukaryotes, which mainly control their target genes post-transcriptionally by interacting with and silencing them. While many tools have been developed for animal miRNA target prediction, plant miRNA target identification has seen limited development. Most existing plant tools are centered on exact complementarity matching. Very few consider other factors such as multiple target sites and the role of flanking regions.
In the present work, a Support Vector Regression (SVR) approach has been implemented for plant miRNA target identification, utilizing position-specific dinucleotide density variation information around the target sites to yield highly reliable results. It has been named p-TAREF (plant-Target Refiner). The performance of p-TAREF was rigorously compared with that of other prediction tools for plants, and p-TAREF was found to perform better in several aspects. Further, p-TAREF was run over experimentally validated miRNA targets from species such as Arabidopsis, Medicago, rice and tomato, and detected them accurately, suggesting general usability of p-TAREF for plant species. Using p-TAREF, target identification was done for the complete rice transcriptome, supported by expression and degradome-based data. miR156 was found to be an important component of the rice regulatory system, where control of genes associated with growth and transcription appeared predominant. The entire methodology has been implemented in a multi-threaded parallel architecture in Java, to enable fast processing for the web-server version as well as the standalone version. This also allows it to run in concurrent mode even on a simple desktop computer. In its web-server version, it also provides a facility to gather experimental support for the predictions made, through on-the-spot expression data analysis.
A machine learning multivariate feature tool has been implemented in parallel and locally installable form for plant miRNA target identification. Its performance was assessed and compared through comprehensive testing and benchmarking, suggesting reliable performance and general usability for transcriptome-wide plant miRNA target identification.
The evolution of gene expression is a challenging problem in evolutionary biology, for which accurate, well-calibrated measurements and methods are crucial.
We quantified gene expression with whole-transcriptome sequencing in four diploid, prototrophic strains of Saccharomyces species grown under the same condition to investigate the evolution of gene expression. We found that variation in expression is gene-dependent, with large variations in each gene's expression between replicates of the same species. This confounds the identification of genes differentially expressed across species. To address this, we developed a statistical approach to establish significance bounds for inter-species differential expression in RNA-Seq data based on the variance measured across biological replicates. This metric estimates the combined effects of technical and environmental variance, as well as Poisson sampling noise, by isolating each component. Despite a paucity of large expression changes, we found a strong correlation between the variance of gene expression change and species divergence (R² = 0.90).
We provide an improved methodology for measuring gene expression changes in evolutionarily diverged species using RNA-Seq, where experimental artifacts can mimic evolutionary effects.
GEO Accession Number: GSE32679
RNA-Seq; Comparative transcriptomics; S. cerevisiae; S. paradoxus; S. mikatae; S. bayanus
Disruption of thyroid hormone signalling can alter growth, development and energy metabolism. Thyroid hormones exert their effects through interactions with thyroid receptors that directly bind thyroid response elements and can alter transcriptional activity of target genes. The effects of short-term thyroid hormone perturbation on hepatic mRNA transcription in juvenile mice were evaluated, with the goal of identifying genes containing active thyroid response elements. Thyroid hormone disruption was induced from postnatal day 12 to 15 by adding goitrogens to dams' drinking water (hypothyroid). A subgroup of thyroid hormone-disrupted pups received intraperitoneal injections of replacement thyroid hormones four hours prior to sacrifice (replacement). An additional group received only thyroid hormones four hours prior to sacrifice (hyperthyroid). Hepatic mRNA was extracted and hybridized to Agilent mouse microarrays.
Transcriptional profiling enabled the identification of 28 genes that appeared to be under direct thyroid hormone-regulation. The regulatory regions of the genome adjacent to these genes were examined for half-site sequences that resemble known thyroid response elements. A bioinformatics search identified 33 thyroid response elements in the promoter regions of 13 different genes thought to be directly regulated by thyroid hormones. Thyroid response elements found in the promoter regions of Tor1a, 2310003H01Rik, Hect3d and Slc25a45 were further validated by confirming that the thyroid receptor is associated with these sequences in vivo and that it can bind directly to these sequences in vitro. Three different arrangements of thyroid response elements were identified. Some of these thyroid response elements were located far up-stream (> 7 kb) of the transcription start site of the regulated gene.
Transcriptional profiling of thyroid hormone disrupted animals coupled with a novel bioinformatics search revealed new thyroid response elements associated with genes previously unknown to be responsive to thyroid hormone. The work provides insight into thyroid response element sequence motif characteristics.
The characterization of DNA replication origins in yeast has shed much light on the mechanisms of initiation of DNA replication. However, very little is known about the evolution of origins or the evolution of mechanisms through which origins are recognized by the initiation machinery. This lack of understanding is largely due to the vast evolutionary distances between model organisms in which origins have been examined.
In this study we have isolated and characterized autonomously replicating sequences (ARSs) in Lachancea kluyveri, a pre-whole genome duplication (WGD) budding yeast. Through a combination of experimental work and rigorous computational analysis, we show that L. kluyveri ARSs require a sequence that is similar to but much longer than the ARS Consensus Sequence well defined in Saccharomyces cerevisiae. Moreover, compared with S. cerevisiae and Kluyveromyces lactis, the replication licensing machinery in L. kluyveri seems more tolerant to variations in the ARS sequence composition. It is able to initiate replication from almost all S. cerevisiae ARSs tested and most K. lactis ARSs. In contrast, only about half of the L. kluyveri ARSs function in S. cerevisiae and less than 10% function in K. lactis.
Our findings demonstrate a replication initiation system with novel features and underscore the functional diversity within the budding yeasts. Furthermore, we have developed new approaches for analyzing biologically functional DNA sequences with ill-defined motifs.
Cluster thinning is an agronomic practice in which a proportion of berry clusters are removed from the vine to increase the source/sink ratio and improve the quality of the remaining berries. Until now no transcriptomic data have been reported describing the mechanisms that underlie the agronomic and biochemical effects of thinning.
We profiled the transcriptome of Vitis vinifera cv. Sangiovese berries before and after thinning at veraison using a genome-wide microarray representing all grapevine genes listed in the latest V1 gene prediction. Thinning increased the source/sink ratio from 0.6 to 1.2 m² leaf area per kg of berries and boosted the sugar and anthocyanin content at harvest. Extensive transcriptome remodeling was observed in thinned vines 2 weeks after thinning and at ripening. This included the enhanced modulation of genes that are normally regulated during berry development and the induction of a large set of genes that are not usually expressed.
Cluster thinning has a profound effect on several important cellular processes and metabolic pathways including carbohydrate metabolism and the synthesis and transport of secondary products. The integrated agronomic, biochemical and transcriptomic data revealed that the positive impact of cluster thinning on final berry composition reflects a much more complex outcome than simply enhancing the normal ripening process.
The presence of tandem amino acid repeats (AARs) is one of the signatures of eukaryotic proteins. AARs were thought to be frequently involved in bio-molecular interactions. Comprehensive studies that primarily focused on metazoan AARs have suggested that AARs are evolving rapidly and are highly variable among species. However, there is still controversy over causal factors of this inter-species variation. In this work, we attempted to investigate this topic mainly by comparing AARs in orthologous proteins from ten angiosperm genomes.
Angiosperm AAR content is positively correlated with the GC content of the protein coding sequence. However, based on observations from fungal AARs and insect AARs, we argue that the applicability of this kind of correlation is limited by AAR residue composition and species' life history traits. Angiosperm AARs also tend to be fast evolving and structurally disordered, supporting the results of comprehensive analyses of metazoans. The functions of conserved long AARs are summarized. Finally, we propose that the rapid mRNA decay rate, alternative splicing and tissue specificity are regulatory processes that are associated with angiosperm proteins harboring AARs.
Our investigation suggests that GC content is a predictor of AAR content in the protein coding sequence under certain conditions. Although angiosperm AARs lack conservation and 3D structure, a fraction of the proteins that contain AARs may be functionally important and are under extensive regulation in plant cells.
This is an editorial report of the supplement to BMC Genomics that includes 15 papers selected from the BIOCOMP'10 - The 2010 International Conference on Bioinformatics & Computational Biology as well as other sources with a focus on genomics studies.
BIOCOMP'10 was held on July 12-15 in Las Vegas, Nevada. The congress covered a large variety of research areas, and genomics was one of the major focuses because of the fast development in this field. We set out to launch a supplement to BMC Genomics with manuscripts selected from this congress and invited submissions. With a rigorous peer review process, we selected 15 manuscripts that showed work in cutting-edge genomics fields and proposed innovative methodology. We hope this supplement presents the current computational and statistical challenges faced in genomics studies, and shows the enormous promises and opportunities in the genomic future.
Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money.
To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others.
On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.
gene selection; microarray; classification; supervised-learning; similarity
The recent advancement in array CGH (aCGH) research has significantly improved tumor identification using DNA copy number data. A number of unsupervised learning methods have been proposed for clustering aCGH samples. Two of the major challenges for developing aCGH sample clustering are the high spatial correlation between aCGH markers and the low computing efficiency. A mixture hidden Markov model based algorithm was developed to address these two challenges.
The hidden Markov model (HMM) was used to model the spatial correlation between aCGH markers. A fast clustering algorithm was implemented and real data analysis on glioma aCGH data has shown that it converges to the optimal cluster rapidly and the computation time is proportional to the sample size. Simulation results showed that this HMM based clustering (HMMC) method has a substantially lower error rate than NMF clustering. The HMMC results for glioma data were significantly associated with clinical outcomes.
We have developed a fast clustering algorithm to identify tumor subtypes based on DNA copy number aberrations. The performance of the proposed HMMC method has been evaluated using both simulated and real aCGH data. The software for HMMC in both R and C++ is available on the ND INBRE website at http://ndinbre.org/programs/bioinformatics.php.
Studies of toxicity and unintended side effects can lead to improved drug safety and efficacy. One promising form of study comes from molecular systems biology in the form of "systems pharmacology". Systems pharmacology combines data from clinical observation and molecular biology. This approach is new, however, and there are few examples of how it can practically predict adverse reactions (ADRs) from an experimental drug with acceptable accuracy.
We have developed a new and practical computational framework to accurately predict ADRs of trial drugs. We combine clinical observation data with drug target data, protein-protein interaction (PPI) networks, and gene ontology (GO) annotations. We use cardiotoxicity, one of the major causes for drug withdrawals, as a case study to demonstrate the power of the framework. Our results show that an in silico model built on this framework can achieve a satisfactory cardiotoxicity ADR prediction performance (median AUC = 0.771, Accuracy = 0.675, Sensitivity = 0.632, and Specificity = 0.789). Our results also demonstrate the significance of incorporating prior knowledge, including gene networks and gene annotations, to improve future ADR assessments.
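The performance measures quoted above (accuracy, sensitivity, specificity and AUC) can be computed from binary predictions as in the following minimal Python sketch. It is not the framework's own code: the function names, the 0/1 label encoding and the pairwise (rank-based) AUC estimate are illustrative assumptions.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels.

    Label 1 is assumed to mean "ADR present" and 0 "ADR absent".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def auc(y_true, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0]
    print(confusion_metrics(labels, [1, 0, 1, 0, 1]))
    print(auc(labels, [0.9, 0.4, 0.8, 0.3, 0.5]))
```

Sensitivity is the fraction of true ADR cases that are recovered, while specificity is the fraction of non-ADR cases that are correctly rejected, which is why the two can move in opposite directions when different prior knowledge sources are added.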
Biomolecular network and gene annotation information can significantly improve the predictive accuracy of ADR of drugs under development. The use of PPI networks can increase prediction specificity and the use of GO annotations can increase prediction sensitivity. Using cardiotoxicity as an example, we are able to further identify cardiotoxicity-related proteins among drug target expanding PPI networks. The systems pharmacology approach that we developed in this study can be generally applicable to all future developmental drug ADR assessments and predictions.
Along with obesity, physical inactivity, and family history of metabolic disorders, African American ethnicity is a risk factor for type 2 diabetes (T2D) in the United States. However, little is known about the differences in gene expression and transcriptomic profiles of blood in T2D between African Americans (AA) and Caucasians (CAU), and microarray analysis of peripheral white blood cells (WBCs) from these two ethnic groups will facilitate our understanding of the underlying molecular mechanism in T2D and identify genetic biomarkers responsible for the disparities.
A whole human genome oligomicroarray of peripheral WBCs was performed on 144 samples obtained from 84 patients with T2D (44 AA and 40 CAU) and 60 healthy controls (28 AA and 32 CAU). The results showed that 30 genes differed significantly in expression between patients and controls (a fold change of <-1.4 or >1.4 with a P value <0.05). These known genes were mainly clustered in three functional categories: immune responses, lipid metabolism, and organismal injury/abnormality. Transcriptomic analysis also showed that 574 genes were differentially expressed in AA diseased versus AA control, compared to 200 genes in CAU subjects. Pathway study revealed that "Communication between innate and adaptive immune cells"/"Primary immunodeficiency signaling" are significantly down-regulated in AA patients and "Interferon signaling"/"Complement System" are significantly down-regulated in CAU patients.
These newly identified genetic markers in WBCs provide valuable information about the pathophysiology of T2D and can be used for diagnosis and pharmaceutical drug design. Our results also show that AA and CAU patients with T2D express genes and pathways differently.
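The selection rule quoted above (fold change below -1.4 or above 1.4 with P < 0.05) can be sketched in Python as follows. The signed fold-change convention, in which ratios below 1 are reported as negative reciprocals, is an assumption that matches the thresholds given but is not stated in the text; the function names are hypothetical, and the inputs are assumed to be positive, linear-scale mean intensities.

```python
def signed_fold_change(case_mean: float, control_mean: float) -> float:
    """Signed fold change: ratio if up-regulated, -1/ratio if down-regulated.

    Assumes positive, linear-scale (not log-transformed) mean intensities.
    """
    if case_mean <= 0 or control_mean <= 0:
        raise ValueError("means must be positive")
    ratio = case_mean / control_mean
    return ratio if ratio >= 1.0 else -1.0 / ratio


def is_differential(fold_change: float, p_value: float,
                    fc_cutoff: float = 1.4, p_cutoff: float = 0.05) -> bool:
    """Apply the two thresholds: |fold change| > cutoff and P < cutoff."""
    return abs(fold_change) > fc_cutoff and p_value < p_cutoff


if __name__ == "__main__":
    fc = signed_fold_change(1.0, 2.0)   # expression halved in patients
    print(fc, is_differential(fc, 0.01))
```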
Dynamic Bayesian Network (DBN) is an approach widely used for reconstruction of gene regulatory networks from time-series microarray data. Its performance in network reconstruction depends on a structure learning algorithm. REVEAL (REVerse Engineering ALgorithm) is one of the algorithms implemented for learning DBN structure and used to reconstruct gene regulatory networks (GRN). However, the two-stage temporal Bayes network (2TBN) structure of DBN that specifies correlation between time slices cannot be obtained by score metrics used in REVEAL.
In this paper, we study a more sophisticated score function for DBNs, first proposed by Nir Friedman for structure learning of both the initial and transition networks of stationary DBNs, which has not yet been used for reconstruction of GRNs. We implemented Friedman's Bayesian Information Criterion (BIC) score function, modified the K2 algorithm to learn DBN structure with this score function, and tested the performance of the algorithm for GRN reconstruction with synthetic time-series gene expression data generated by GeneNetWeaver and real yeast benchmark experiment data.
We implemented an algorithm for DBN structure learning with Friedman's score function, tested it on reconstruction of both synthetic networks and real yeast networks, and compared it with REVEAL in the absence or presence of a preprocessed network generated by Zou and Conzen's algorithm. By introducing a stationary correlation between two consecutive time slices, Friedman's score function showed higher precision and recall than the naive REVEAL algorithm.
Friedman's score metric for DBNs can be used to reconstruct transition networks and has great potential to improve the accuracy of gene regulatory network structure prediction with time-series gene expression datasets.
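To make the idea of a BIC score for a transition network concrete, the following Python sketch scores one candidate parent set for one gene from a discretized time series. It uses the generic BIC form (maximized log-likelihood minus a penalty of half the number of free parameters times the log of the sample size) and may differ in detail from the formulation implemented in the paper; the function name, the data layout and the binary discretization are assumptions.

```python
import math
from collections import Counter


def bic_transition_score(series, child, parents, n_states=2):
    """BIC score of `child` at time t given `parents` at time t-1.

    series   : list of time points, each a sequence of discrete states per gene
    child    : index of the target gene
    parents  : list of indices of candidate regulator genes
    n_states : number of discrete expression levels (assumed, default binary)
    """
    if len(series) < 2:
        raise ValueError("need at least two time points")
    # Each transition contributes one (parent configuration, child state) sample.
    samples = [(tuple(series[t - 1][p] for p in parents), series[t][child])
               for t in range(1, len(series))]
    n = len(samples)
    joint = Counter(samples)
    parent_counts = Counter(cfg for cfg, _ in samples)
    # Maximized multinomial log-likelihood of the child given its parents.
    loglik = sum(c * math.log(c / parent_counts[cfg])
                 for (cfg, _), c in joint.items())
    # Free parameters: (states - 1) per parent configuration.
    n_params = (n_states - 1) * n_states ** len(parents)
    return loglik - 0.5 * math.log(n) * n_params


if __name__ == "__main__":
    # Toy data: gene 1 copies the previous state of gene 0.
    data = [(0, 0), (1, 0), (0, 1), (1, 0), (0, 1)]
    print(bic_transition_score(data, child=1, parents=[0]))
    print(bic_transition_score(data, child=1, parents=[]))
```

A greedy search such as K2 would add the parent that most increases this score and stop when no addition improves it; in the toy data the true regulator scores higher than the empty parent set.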
Speckles in ultrasound imaging affect image quality and can make post-processing difficult. Speckle reduction technologies have been employed for removing speckles for some time. One of the effective speckle reduction technologies is anisotropic diffusion. Anisotropic diffusion can remove speckles effectively while preserving the edges of the image and thus has drawn great attention from image processing scientists. However, previously proposed methods have various disadvantages, such as sensitivity to the number of iterations or a limited ability to preserve the details of ultrasound images. Thus, a detail-preserving anisotropic diffusion method for speckle reduction that is less sensitive to the number of iterations is needed. This paper aims to develop such a method.
In this paper, we propose a robust detail preserving anisotropic diffusion filter (RDPAD) for speckle reduction. In order to get robust diffusion, the proposed method integrates the Tukey error norm function into the recently developed detail preserving anisotropic diffusion filter (DPAD). The proposed method can prevent over-diffusion and is thus less sensitive to the number of iterations.
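The role of the Tukey error norm can be illustrated with a small Python sketch of one diffusion step. This is a generic Perona-Malik-style update with a Tukey biweight edge-stopping weight, not the DPAD or RDPAD filter itself; the function names, the scale parameter `sigma`, the step size `lam` and the omission of constant scaling factors are all assumptions made for illustration.

```python
def tukey_stop(grad: float, sigma: float) -> float:
    """Tukey biweight edge-stopping weight.

    Returns (1 - (grad/sigma)^2)^2 for |grad| <= sigma and exactly 0 beyond it,
    so differences larger than sigma are treated as edges and never smoothed.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if abs(grad) > sigma:
        return 0.0
    r = grad / sigma
    return (1.0 - r * r) ** 2


def diffuse_step(img, sigma: float, lam: float = 0.2):
    """One explicit diffusion step over a 2D image given as a list of rows.

    Uses the four nearest neighbours; lam <= 0.25 keeps this scheme stable.
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    d = img[ni][nj] - img[i][j]
                    total += tukey_stop(d, sigma) * d
            out[i][j] = img[i][j] + lam * total
    return out


if __name__ == "__main__":
    noisy = [[0.0, 1.0, 0.0]]           # small fluctuation: smoothed
    edge = [[0.0, 0.0, 10.0, 10.0]]     # large step: preserved
    print(diffuse_step(noisy, sigma=5.0))
    print(diffuse_step(edge, sigma=5.0))
```

Because the weight is exactly zero across differences larger than `sigma`, repeated iterations cannot blur those edges, which is the intuition behind reduced sensitivity to the number of iterations.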
The proposed anisotropic diffusion can preserve the important structure information of the original image while reducing speckles. It is also less sensitive to the number of iterations. Experimental results on real ultrasound images show the effectiveness of the proposed anisotropic diffusion filter.
DNA methylation in the 5' promoter regions of genes and microRNA (miRNA) regulation at the 3' untranslated regions (UTRs) are two major epigenetic regulation mechanisms in most eukaryotes. Both DNA methylation and miRNA regulation can suppress gene expression and their corresponding protein product; thus, they play critical roles in cellular processes. Although there have been numerous investigations of gene regulation by methylation changes and miRNAs, there is no systematic genome-wide examination of their coordinated effects in any organism.
In this study, we investigated the relationship between promoter methylation at the transcription level and miRNA regulation at the post-transcription level by taking advantage of recently released human methylome data and high quality miRNA and other gene annotation data. We found that methylation level in the promoter regions and expression level were negatively correlated. Then, we showed that miRNAs tended to target the genes with a low DNA methylation level in their promoter regions. We further demonstrated that this observed pattern was not attributable to the gene expression level, expression broadness, or the number of transcription factor binding sites. Interestingly, we found that miRNA target sites were significantly enriched in the genes located in differentially methylated regions or partially methylated domains. Finally, we explored the features of DNA methylation and miRNA regulation in cancer genes and found that cancer genes tended to have a low methylation level and more miRNA target sites.
This is the first genome-wide investigation of the combined regulation of gene expression. Our results supported a complementary regulation between DNA methylation (transcriptional level) and miRNA function (post-transcriptional level) in the human genome. The results were helpful for our understanding of the evolutionary forces towards organisms' complexity beyond traditional sequence level investigation.
Recent studies suggest that many proteins or regions of proteins lack 3D structure. Defined as intrinsically disordered proteins, these proteins/peptides are functionally important. Recent advances in next generation sequencing technologies enable genome-wide identification of novel nucleotide variations in a specific population or cohort.
Using the exonic single nucleotide variations (SNVs) identified in the 1000 Genomes Project and distributed by the Genetic Analysis Workshop 17 (GAW17), we systematically analysed the genetic and predicted disorder potential features of the non-synonymous variations. The results suggest that a significant change in the tendency of a protein region to be structured or disordered caused by SNVs may lead to malfunction of such a protein and contribute to disease risk.
After validation with functional SNVs on the traits distributed by GAW17, we conclude that it is valuable to consider structure/disorder tendencies while prioritizing and predicting mechanistic effects arising from novel genetic variations.
Microarray data have been used for gene signature selection to predict clinical outcomes. Many studies have attempted to identify factors that affect models' performance, with little success. Fine-tuning of model parameters and optimizing each step of the modeling process often result in over-fitting problems without improving performance.
We propose a quantitative measurement, termed consistency degree, to detect the correlation between disease endpoint and gene expression profile. Different endpoints were shown to have different consistency degrees to gene expression profiles. The validity of this measurement to estimate the consistency was tested with significance at a p-value less than 2.2e-16 for all of the studied endpoints. According to the consistency degree score, overall survival milestone outcome of multiple myeloma was proposed to extend from 730 days to 1561 days, which is more consistent with gene expression profile.
For various clinical endpoints, the maximum predictive powers of different microarray-based models are limited by the correlation between endpoint and gene expression profile of disease samples, as indicated by the consistency degree score. In addition, previously defined clinical outcomes can also be reassessed and refined to be more coherent with the related disease gene expression profile. Our findings point to an entirely new direction for assessing microarray-based predictive models and provide important information for gene signature based clinical applications.
One of the most fundamental and challenging tasks in bioinformatics is to identify related sequences and their hidden biological significance. The most popular and proven best-practice method to accomplish this task is aligning multiple sequences together. However, multiple sequence alignment is a computationally intensive task. In addition, advances in DNA/RNA and protein sequencing techniques have created a vast number of sequences to be analyzed, exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in great demand.
We design O(1) run-time solutions for both local and global dynamic programming pair-wise alignment algorithms on the reconfigurable mesh computing model. To align m sequences with maximum length n, we combine the parallel pair-wise dynamic programming solutions with newly designed parallel components. We successfully reduce the progressive multiple sequence alignment algorithm's run-time complexity from O(m × n⁴) to O(m) using O(m × n³) processing units for scoring schemes that use three distinct values for match/mismatch/gap-extension. The general solution to the multiple sequence alignment algorithm takes O(m × n⁴) processing units and completes in O(m) time.
To our knowledge, this is the first time the progressive multiple sequence alignment algorithm has been completely parallelized with O(m) run-time. We also provide a new parallel algorithm for the Longest Common Subsequence (LCS) with O(1) run-time using O(n³) processing units. This is a big improvement over the current best constant-time algorithm, which uses O(n⁴) processing units.
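For reference, the sequential dynamic programming baseline for LCS, which the parallel algorithm is designed to outperform, can be sketched in Python as below. This is the textbook O(n²)-time formulation on a single processor, not the reconfigurable mesh algorithm described above; the function name is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic row-by-row dynamic programming: cell (i, j) depends on the cells
    to its left, above, and diagonally above-left, which is the dependency
    pattern that parallel formulations have to break up.
    """
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))
```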
Panax notoginseng (Burk) F.H. Chen is important medicinal plant of the Araliacease family. Triterpene saponins are the bioactive constituents in P. notoginseng. However, available genomic information regarding this plant is limited. Moreover, details of triterpene saponin biosynthesis in the Panax species are largely unknown.
Using 454 pyrosequencing technology, a one-quarter GS FLX Titanium run yielded 188,185 reads with an average length of 410 bases for P. notoginseng root. These reads were processed and assembled by the 454 GS De Novo Assembler software into 30,852 unique sequences. A total of 70.2% of the unique sequences were annotated by Basic Local Alignment Search Tool (BLAST) similarity searches against public sequence databases. The Kyoto Encyclopedia of Genes and Genomes (KEGG) assignment identified 41 unique sequences representing 11 genes involved in triterpene saponin backbone biosynthesis in the 454-EST dataset. In particular, the transcript encoding dammarenediol synthase (DS), the first committed enzyme in the biosynthetic pathway of the major triterpene saponins, is highly expressed in the root of four-year-old P. notoginseng. Notably, phylogenetic analysis of 174 cytochrome P450s and 242 glycosyltransferases identified the candidate cytochrome P450 genes (Pn02132 and Pn00158) and the UDP-glycosyltransferase gene (Pn00082) most likely to be involved in the hydroxylation or glycosylation of aglycones during triterpene saponin biosynthesis. Putative transcription factors were detected in 906 unique sequences, including Myb, homeobox, WRKY, basic helix-loop-helix (bHLH), and other family proteins. Additionally, a total of 2,772 simple sequence repeats (SSRs) were identified in 2,361 unique sequences, among which di-nucleotide motifs were the most abundant.
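As an illustration of how simple sequence repeats can be located in assembled sequences, the following Python sketch scans a sequence with a regular expression. The motif lengths (2 to 6 bases) and the minimum of five tandem copies are assumed thresholds chosen for the example; the text above does not state the criteria or the software that was used, and `find_ssrs` is a hypothetical helper name.

```python
import re

# Assumed thresholds: a motif of 2-6 bases repeated at least 5 times in tandem.
# The lazy quantifier makes the regex prefer the shortest repeating unit.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")

def find_ssrs(seq: str):
    """Return (start, motif, repeat_count) for each tandem repeat found."""
    hits = []
    for m in SSR_PATTERN.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits
```

Grouping the returned motifs by length gives the di-, tri-, and longer-motif counts of the kind reported above.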
This study is the first to present a large-scale EST dataset for P. notoginseng root acquired by next-generation sequencing (NGS) technology. The candidate genes involved in triterpene saponin biosynthesis, including the putative CYP450s and UGTs, were obtained in this study. Additionally, the identified SSRs provide plenty of genetic markers for molecular breeding and genetics applications in this species. These data will provide information on gene discovery, transcriptional regulation and marker-assisted selection for P. notoginseng. The dataset establishes an important foundation for studies aimed at ensuring adequate drug resources from this species.
The use of gene signatures can potentially be of considerable value in clinical diagnosis. However, gene signatures defined with different methods can differ considerably even when applied to the same disease and the same endpoint. Previous studies have shown that the correct selection of subsets of genes from microarray data is key to the accurate classification of disease phenotypes, and a number of methods have been proposed for this purpose. However, these methods refine the subsets by considering each feature individually, and they do not confirm the association between the genes identified in each gene signature and the disease phenotype. We propose a new method, termed Minimize Feature's Size (MFS), based on multiple-level similarity analyses and on the association between genes and disease for breast cancer endpoints. By comparing classifier models generated in the second phase of MicroArray Quality Control (MAQC-II), we aim to develop effective meta-analysis strategies that transform the MAQC-II signatures into a robust and reliable set of biomarkers for clinical applications.
We analyzed the similarity of the multiple gene signatures within an endpoint and between the two breast cancer endpoints at the probe and gene levels. The results indicate that disease-related genes tend to be preferentially selected as components of gene signatures, and that the gene signatures for the two endpoints could be interchangeable. Minimized signatures were built at the probe level using MFS for each endpoint. With this approach, we generated much smaller gene signatures with predictive power similar to that of the MAQC-II gene signatures.
Our results indicate that gene signatures of both large and small sizes could perform equally well in clinical applications. In addition, consistency and biological significance can be detected among different gene signatures, reflecting the endpoints under study. New classifiers built with MFS exhibit improved performance in both internal and external validation, suggesting that the MFS method generally reduces redundancy among features within gene signatures and improves model performance. Consequently, our strategy will be beneficial for microarray-based clinical applications.
In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
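The discreteness can be made concrete with a short calculation. The Python sketch below counts the distinct two-group label assignments and derives the smallest attainable one-sided permutation p-value; the function name and the group sizes in the usage note are hypothetical examples, not values from the study.

```python
from math import comb

def min_permutation_pvalue(n1: int, n2: int) -> float:
    """Smallest attainable p-value when all distinct relabelings are enumerated."""
    # Number of ways to choose which n1 of the n1 + n2 samples form group 1.
    n_perm = comb(n1 + n2, n1)
    # The observed labeling is itself one of the relabelings, so p >= 1 / n_perm.
    return 1.0 / n_perm
```

With three samples per group there are only 20 distinct relabelings, so a one-sided p-value can never fall below 0.05 and every attainable p-value is a multiple of 0.05.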
We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all available data. Based on this model, we calculate p-values that are uniformly distributed for true nulls. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate.
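The general idea of pooling information across genes to fit a normal null model can be sketched as follows. This is not the MBIS implementation: the data transformation is omitted, the robust estimators (median and scaled median absolute deviation) are a stand-in chosen for this example, the FDR estimate conservatively treats every gene as a potential true null, and both function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import median

def null_model_pvalues(mean_diffs):
    """Two-sided p-values under a normal null fitted to all mean differences."""
    # Assumption: most genes are true nulls, so the median and the scaled
    # median absolute deviation approximate the null mean and standard deviation.
    mu = median(mean_diffs)
    sigma = 1.4826 * median(abs(d - mu) for d in mean_diffs)
    if sigma == 0:
        raise ValueError("null standard deviation estimate is zero")
    # Two-sided normal tail probability: P(|Z| >= z) = erfc(z / sqrt(2)).
    return [erfc(abs(d - mu) / (sigma * sqrt(2))) for d in mean_diffs]

def select_and_estimate_fdr(pvalues, cutoff):
    """Select genes with p <= cutoff and estimate the FDR of that selection."""
    selected = [i for i, p in enumerate(pvalues) if p <= cutoff]
    if not selected:
        return selected, 0.0
    # Expected number of false positives if every gene were a true null.
    expected_false = cutoff * len(pvalues)
    return selected, min(1.0, expected_false / len(selected))
```

The order of operations mirrors the description above: a cutoff p-value is fixed first, and the false discovery rate of the resulting gene list is estimated afterwards rather than controlled in advance.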
Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
RNA-binding proteins (RBPs) play diverse roles in eukaryotic RNA processing. Despite their pervasive functions in coding and noncoding RNA biogenesis and regulation, elucidating the sequence specificities that define protein-RNA interactions remains a major challenge. Recently, CLIP-seq (cross-linking immunoprecipitation followed by high-throughput sequencing) has been successfully implemented to study the transcriptome-wide binding patterns of SRSF1, PTBP1, NOVA and fox2 proteins. These studies either adopted traditional methods such as Multiple EM for Motif Elicitation (MEME) to discover the sequence consensus of RBP binding sites or used Z-score statistics to search for overrepresented nucleotide words of a certain size. We argue that most of these methods are not well suited to RNA motif identification, as they are unable to incorporate the RNA structural context of protein-RNA interactions, which may affect binding specificity. Here, we describe a novel model-based approach, RNAMotifModeler, that identifies the consensus of protein-RNA binding regions by integrating sequence features and RNA secondary structures.
As an example, we applied RNAMotifModeler to SRSF1 (SF2/ASF) CLIP-seq data. The sequence-structural consensus we identified is a purine-rich octamer 'AGAAGAAG' in a highly single-stranded RNA context. The unpaired probabilities, that is, the probabilities of not forming base pairs, are significantly higher than those of negative controls and of the flanking sequence surrounding the binding site, indicating that SRSF1 proteins tend to bind to single-stranded RNA. Further statistical evaluations revealed that the second and fifth bases of the SRSF1 octamer motif have much stronger sequence specificities but weaker single-strandedness, while the third, fourth, sixth and seventh bases are far more likely to be single-stranded but have more degenerate sequence specificities. Therefore, we hypothesize that nucleotide specificity and secondary structure play complementary roles during binding site recognition by SRSF1.
In this study, we presented a computational model to predict the sequence consensus and optimal RNA secondary structure for protein-RNA binding regions. The successful application to SRSF1 CLIP-seq data demonstrates great potential to improve our understanding of the binding specificity of RNA-binding proteins.
Malaria continues to be one of the most severe global infectious diseases, responsible for 1-2 million deaths yearly. The rapid evolution and spread of drug resistance in parasites has led to an urgent need for the development of novel antimalarial targets. Proteases are a group of enzymes that play essential roles in parasite growth and invasion. The possibility of designing specific inhibitors for proteases makes them promising drug targets. Previously, combining a comparative genomics approach and a machine learning approach, we identified the complement of proteases (degradome) in the malaria parasite Plasmodium falciparum and its sibling species [1-3], providing a catalog of targets for functional characterization and rational inhibitor design. Network analysis represents another route to revealing the role of proteins in the biology of parasites and we use this approach here to expand our understanding of the systems involving the proteases of P. falciparum.
We investigated the roles of proteases in the parasite life cycle by constructing a network using protein-protein association data from the STRING database, and analyzing these data in conjunction with data from protein-protein interaction assays using the yeast 2-hybrid (Y2H) system, blood stage microarray experiments [6-8], proteomics [9-12], literature text mining, and sequence homology analysis. Seventy-seven of the 124 predicted proteases were associated with at least one other protein, constituting 2,431 protein-protein interactions (PPIs). These proteases appear to play diverse roles in metabolism, cell cycle regulation, invasion and infection. Their degrees of connectivity (i.e., the number of connections to other proteins) range from one to 143. The largest protease-associated sub-network is the ubiquitin-proteasome system, which is crucial for protein recycling and stress response. Proteases are also implicated in heat shock response, signal peptide processing, cell cycle progression, transcriptional regulation, and signal transduction networks.
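The degree of connectivity mentioned above is simple to compute from a list of pairwise associations. The following Python sketch assumes the associations are available as pairs of protein identifiers; the function name and the identifiers in the usage note are hypothetical placeholders, not entries from STRING.

```python
from collections import defaultdict

def degrees(edges):
    """Map each protein to its number of distinct association partners."""
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-associations
        # Associations are undirected, so record the link in both directions.
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(linked) for protein, linked in partners.items()}
```

For example, `degrees([("protease_1", "protein_X"), ("protease_1", "protein_Y")])` assigns `protease_1` a degree of 2. Using sets means that an association listed twice, or in both directions, is counted only once.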
Our network analysis of proteases from P. falciparum uses a so-called guilt-by-association approach to extract sets of proteins from the proteome that are candidates for further study. Novel protease targets and previously unrecognized members of the protease-associated sub-systems provide new insights into the mechanisms underlying parasitism, pathogenesis and virulence.