Sufficient codominant genetic markers are needed for various genetic investigations in alfalfa since the species is an outcrossing autotetraploid. With newly developed next-generation sequencing technology, a large number of transcribed alfalfa sequences have been generated and are available for identifying SSR markers by data mining. A total of 54,278 non-redundant alfalfa unigenes were assembled using Illumina HiSeq™ 2000 sequencing technology. In 3,903 unigene sequences, 4,493 SSRs were identified. Tri-nucleotide repeats (56.71%) were the most abundant motif class, while AG/CT (21.7%), AGG/CCT (19.8%), AAC/GTT (10.3%), ATC/ATG (8.8%), and ACC/GGT (6.3%) were the five most frequent repeat motifs. Eight hundred and thirty-seven EST-SSR primer pairs were successfully designed. Of these, 527 (63%) primer pairs yielded clear, scorable PCR products and 372 (70.6%) exhibited polymorphisms. Transferability was high: 99.2% (523) in ssp. falcata and 71.7% (378) in M. truncatula. In addition, 313 of the 527 SSR marker sequences were mapped in silico onto the eight M. truncatula chromosomes. Thirty-six polymorphic SSR primer pairs were used in a genetic relatedness analysis of 30 Chinese cultivated alfalfa accessions, generating a total of 199 scored alleles. The mean observed heterozygosity and polymorphic information content were 0.767 and 0.635, respectively. These codominant markers not only enrich the current molecular marker resources for alfalfa, but should also facilitate targeted investigations in marker-trait association, QTL mapping, and genetic diversity analysis in alfalfa.
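As an illustration of the SSR data-mining step described above, the following is a minimal Python sketch of a perfect-repeat scanner. The function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions made for this example; the abstract does not state which tool or thresholds were used.

```python
import re

# Minimum number of repeat units per motif length. These thresholds are
# illustrative assumptions, not the values used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "AGAG").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "AG" * 7 + "TTG" + "AAC" * 5 + "G"
    for start, motif, count in find_ssrs(example):
        print(start, motif, count)
```

Start positions are 0-based. A real pipeline would also merge motifs with their reverse complements (for example AG with CT) before tallying motif classes.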
Zantedeschia aethiopica is an evergreen perennial plant cultivated worldwide and commonly used for ornamental and medicinal purposes including the treatment of bacterial infections. However, the current understanding of molecular and physiological mechanisms in this plant is limited, in comparison to other non-model plants. In order to improve understanding of the biology of this botanical species, RNA-Seq technology was used for transcriptome assembly and characterization. Following Z. aethiopica spathe tissue RNA extraction, high-throughput RNA sequencing was performed with the aim of obtaining both abundant and rare transcript data. Functional profiling based on KEGG Orthology (KO) analysis highlighted contigs that were involved predominantly in genetic information (37%) and metabolism (34%) processes. Predicted proteins involved in the plant circadian system, hormone signal transduction, secondary metabolism and basal immunity are described here. In silico screening of the transcriptome data set for antimicrobial peptide (AMP)-encoding sequences was also carried out and three lipid transfer proteins (LTP) were identified as potential AMPs involved in plant defense. Spathe predicted protein maps were drawn, and suggested that major plant efforts are expended in guaranteeing the maintenance of cell homeostasis, characterized by high investment in carbohydrate, amino acid and energy metabolism as well as in genetic information.
Different subtypes of Mycobacterium tuberculosis (MTB) may induce diverse severe human infections, and some of their symptoms are similar to those caused by other pathogens, e.g. nontuberculous mycobacteria (NTM). Determining mycobacterium subtypes therefore facilitates the effective control of MTB infection and proliferation. This study exploits a novel DNA barcoding visualization method for molecular typing of 17 mycobacteria genomes published in the NCBI prokaryotic genome database. Three mycobacterium genes (Rv0279c, Rv3508 and Rv3514) from the PE/PPE family of MTB were found to best represent the inter-strain pathogenetic variations. An accurate and fast MTB substrain typing method was proposed based on the combination of these three biomarker genes and the 16S rRNA gene. The protocol used in this study for establishing a bacterial substrain typing system may also be applied to other pathogens.
Mycobacterium; Molecular typing; Typing biomarker; Bioinformatics; Differential diagnosis of mycobacteria
Genomic copy number alteration and allelic imbalance are distinct features of cancer cells, and recent advances in genotyping technology have greatly boosted research on the cancer genome. However, the complicated nature of tumors usually hampers the dissection of SNP arrays. In this study, we describe a bioinformatic tool, named GIANT, for genome-wide identification of somatic aberrations from paired normal-tumor samples measured with SNP arrays. By efficiently incorporating genotype information from the matched normal sample, it accurately detects different types of aberrations in the cancer genome, even for aneuploid tumor samples with severe normal cell contamination. Furthermore, it allows for discovery of recurrent aberrations with critical biological properties in tumorigenesis by using a statistical significance test. We demonstrate the superior performance of the proposed method on various datasets including tumor replicate pairs, simulated SNP arrays and dilution series of normal-cancer cell lines. Results show that GIANT has the potential to detect genomic aberrations even when the cancer cell proportion is as low as 5–10%. Application to a large number of paired tumor samples delivers a genome-wide profile of the statistical significance of the various aberrations, including amplification, deletion and LOH. We believe that GIANT represents a powerful bioinformatic tool for interpreting complex genomic aberrations, thus assisting both academic study and the clinical treatment of cancer.
Pregnant women carry a mixture of cell-free DNA fragments from self and fetus (non-self) in their circulation. In recent years multiple independent studies have demonstrated the ability to detect fetal trisomies such as trisomy 21, the cause of Down syndrome, by Next-Generation Sequencing of maternal plasma. The current clinical tests based on this approach show very high sensitivity and specificity, although as yet they have not become the standard diagnostic test. Here we describe improvements to the analysis of the sequencing data by reducing GC bias and better handling of the genomic repeats. We show substantial improvements in the sensitivity of the standard trisomy 21 statistical tests, which we measure by artificially reducing read coverage. We also explore the bias stemming from the natural cleavage of plasma DNA by examining DNA motifs and position specific base distributions. We propose a model to correct this fragmentation bias and observe that incorporating this bias does not lead to any further improvements in the detection of fetal trisomy. The improved bias corrections that we demonstrate in this work can be readily adopted into existing fetal trisomy detection protocols and should also lead to improvements in sub-chromosomal copy number variation detection.
We applied a metagenomics approach to screen for transcriptional regulators that sense aromatic compounds. The library was constructed by cloning environmental DNA fragments into a promoter-less vector containing the green fluorescent protein gene. Fluorescence-based screening was then performed in the presence of various aromatic compounds. A total of 12 clones were isolated that fluoresced in response to salicylate, 3-methyl catechol, 4-chlorocatechol and chlorohydroquinone. Sequence analysis revealed at least 1 putative transcriptional regulator in every clone except 1 (CHLO8F). Deletion analysis identified compound-specific transcriptional regulators; namely, 8 LysR-types, 2 two-component-types and 1 AraC-type. Of these, 9 representative clones were selected and their reaction specificities to 18 aromatic compounds were investigated. Overall, our transcriptional regulators were functionally diverse in terms of both specificity and induction rates. LysR- and AraC-type regulators had relatively narrow specificities with high induction rates (5-50 fold), whereas two-component-types had wide specificities with low induction rates (3 fold). Numerous transcriptional regulators have been deposited in sequence databases, but their functions remain largely unknown. Thus, our results add valuable information regarding the sequence–function relationship of transcriptional regulators.
Although electrocardiogram (ECG) fluctuates over time and physical activity, some of its intrinsic measurements serve well as biometric features. Considering its constant availability and difficulty in being faked, the ECG signal is becoming a promising factor for biometric authentication. The majority of the currently available algorithms only work well on healthy participants. A novel normalization and interpolation algorithm is proposed to convert an ECG signal into multiple template cycles, which are comparable between any two ECGs, no matter the sampling rates or health status. The overall accuracies reach 100% and 90.11% for healthy participants and cardiovascular disease (CVD) patients, respectively.
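The idea of making ECG cycles comparable regardless of sampling rate can be illustrated with a generic resampling sketch in Python. This is not the authors' algorithm, which the abstract does not specify; the function names `resample_cycle` and `normalize_amplitude`, the linear interpolation, and the min-max amplitude scaling are all assumptions chosen for illustration.

```python
def resample_cycle(samples, n_points=100):
    """Linearly interpolate one ECG cycle onto n_points evenly spaced positions."""
    if len(samples) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    step = (len(samples) - 1) / (n_points - 1)
    out = []
    for i in range(n_points):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # distance from left neighbour
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


def normalize_amplitude(samples):
    """Scale a cycle to the range [0, 1]; a flat cycle maps to all zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0] * len(samples)
    return [(s - lo) / (hi - lo) for s in samples]


if __name__ == "__main__":
    # The same ramp recorded at two hypothetical sampling rates.
    slow = [0, 5, 10]
    fast = [0, 2.5, 5, 7.5, 10]
    print(normalize_amplitude(resample_cycle(slow, 3)))
    print(normalize_amplitude(resample_cycle(fast, 3)))
```

After both steps, cycles recorded at different sampling rates have the same length and amplitude range, so they can be compared point by point.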
Over the last 100 years, intensive efforts have been made to find systematic approaches to curing chronic heart failure; however, the problem remains unresolved because of its complicated pathogenesis and the lack of effective early diagnosis. The present investigation aimed to evaluate the potential effects of the traditional Chinese medicine Xinmailong on chronic heart failure (CHF) patients compared with the standard Western medical treatment available so far. We selected two groups of volunteer CHF patients at the Xiangya Hospital, who were administered Xinmailong or the standard treatment, respectively. Another group of healthy volunteers was recruited as the control group. Treatment effectiveness was measured by five symptomatic factors, i.e. angiotensin II (Ang_II), high sensitivity C-reactive protein (hs_CRP), left ventricular end systolic volume index (LVESVI), left ventricular ejection fraction (LVEF) and pro-B-type natriuretic peptide (NT_proBNP), compared between the control group and the CHF patients at different stages of drug administration and in different treatment groups. The full-dose course was set to 15 days, and the five measurements were taken on days 0, 7 and 15 of drug administration. Similar symptomatic measurements were observed on day 0 in both treatment groups, and slight improvements were observed on day 7. After the full 15-day course, both treatment groups achieved statistically significant improvements in all five measures, but the improvements with Xinmailong were more (almost double) statistically significant than those with the available drug treatments for chronic heart failure.
Chronic heart failure; Traditional Chinese Medicine; Xinmailong.
Haplotype phasing represents an essential step in studying the association of genomic polymorphisms with complex genetic diseases, and in determining targets for drug design. In recent years, huge amounts of genotype data have been produced by rapidly evolving high-throughput sequencing technologies, and the data volume challenges the community to develop more efficient haplotype phasing algorithms, in terms of both running time and overall accuracy. 2SNP is one of the fastest haplotype phasing algorithms, with error rates comparably low to those of other algorithms. The most time-consuming step of 2SNP is the construction of a maximum spanning tree (MST) among all the heterozygous SNP pairs. We simplified this step by replacing the MST with the initial haplotypes of adjacent heterozygous SNP pairs. The multi-SNP haplotypes were estimated within a sliding window along the chromosomes. Comparative studies on four genotype datasets of different scales suggest that our algorithm, WinHAP, outperforms 2SNP and most of the other haplotype phasing algorithms in terms of both running speed and overall accuracy. To facilitate the application of WinHAP to more practical biological datasets, we released the software for free at: http://staff.ustc.edu.cn/~xuyun/winhap/index.htm.
Biclustering is a powerful technique for identification of co-expressed gene groups under any (unspecified) substantial subset of given experimental conditions, which can be used for elucidation of transcriptionally co-regulated genes.
We have previously developed a biclustering algorithm, QUBIC, which can solve more general biclustering problems than previous biclustering algorithms. To fully utilize the analysis power the algorithm provides, we have developed a web server, QServer, for prediction, computational validation and analyses of co-expressed gene clusters. Specifically, the QServer has the following capabilities in addition to biclustering by QUBIC: (i) prediction and assessment of conserved cis regulatory motifs in promoter sequences of the predicted co-expressed genes; (ii) functional enrichment analyses of the predicted co-expressed gene clusters using Gene Ontology (GO) terms, and (iii) visualization capabilities in support of interactive biclustering analyses. QServer supports the biclustering and functional analysis for a wide range of organisms, including human, mouse, Arabidopsis, bacteria and archaea, whose underlying genome database will be continuously updated.
We believe that QServer provides an easy-to-use and highly effective platform useful for hypothesis formulation and testing related to transcription co-regulation.
Caldicellulosiruptor bescii DSM 6725 utilizes various polysaccharides and grows efficiently on untreated high-lignin grasses and hardwood at an optimum temperature of ∼80°C. It is a promising anaerobic bacterium for studying high-temperature biomass conversion. Its genome contains 2666 protein-coding sequences organized into 1209 operons. Expression of 2196 genes (83%) was confirmed experimentally. At least 322 genes appear to have been obtained by lateral gene transfer (LGT). Putative functions were assigned to 364 conserved/hypothetical protein (C/HP) genes. The genome contains 171 and 88 genes related to carbohydrate transport and utilization, respectively. Growth on cellulose led to the up-regulation of 32 carbohydrate-active (CAZy), 61 sugar transport, 25 transcription factor and 234 C/HP genes. Some C/HPs were overproduced on cellulose or xylan, suggesting their involvement in polysaccharide conversion. A unique feature of the genome is enrichment with genes encoding multi-modular, multi-functional CAZy proteins organized into one large cluster, the products of which are proposed to act synergistically on different components of plant cell walls and to aid the ability of C. bescii to convert plant biomass. The high duplication of CAZy domains coupled with the ability to acquire foreign genes by LGT may have allowed the bacterium to rapidly adapt to changing plant biomass-rich environments.
Summary: Huge amounts of metagenomic sequence data have been produced as a result of the rapidly increasing efforts worldwide in studying microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for prediction of genomic elements such as genes are significantly different for chromosomes and plasmids, hence raising the need for separation of chromosomal from plasmid sequences in a metagenome. We present a program for classification of a metagenome set into chromosomal and plasmid sequences, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves ∼92% classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, the program has classification accuracy from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
Availability: The program has been implemented as a standalone prediction program, cBar, which is available at http://csbl.bmb.uga.edu/~ffzhou/cBar
Supplementary information: Supplementary data are available at Bioinformatics online.
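The pentamer-frequency features mentioned in the summary above can be sketched in Python as follows. This only shows how such a feature vector might be computed; the function name `kmer_frequencies` is hypothetical, and the classifier that cBar trains on these features is not described in the abstract and is not reproduced here.

```python
from itertools import product


def kmer_frequencies(seq, k=5):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skips windows containing N or other symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


if __name__ == "__main__":
    vec = kmer_frequencies("ACGTTGCAACGTNNACGTACGT")
    print(len(vec), round(sum(vec), 6))
```

Each sequence fragment becomes a 1024-dimensional vector, which could then be passed to any standard supervised classifier trained on labelled chromosomal and plasmid sequences.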
The genomes of numerous cellulolytic organisms have recently been sequenced or are in the pipeline for sequencing. Systematic analyses of these genomes, as well as of the recently sequenced metagenomes, could lead to discoveries of novel biomass-degradation systems in nature.
We have identified 4,679 and 49,099 free acting glycosyl hydrolases with or without carbohydrate binding domains, respectively, by scanning through all the proteins in the UniProt Knowledgebase and the JGI Metagenome database. Cellulosome components were observed only in bacterial genomes, and 166 cellulosome-dependent glycosyl hydrolases were identified. From our analysis data, we observed unexpectedly wide distributions of two less well-studied bacterial glycosyl hydrolysis systems, in which glycosyl hydrolases may bind to the cell surface directly rather than through linking to surface anchoring proteins, or cellulosome complexes may bind to the cell surface by novel mechanisms other than the commonly used SLH domains. In addition, we found that animal-gut metagenomes are substantially enriched with novel glycosyl hydrolases.
The identified biomass degradation systems through our large-scale search are organized into an easy-to-use database GASdb at http://csbl.bmb.uga.edu/~ffzhou/GASdb/, which should be useful to both experimental and computational biofuel researchers.
“Anaerocellum thermophilum” DSM 6725 is a strictly anaerobic bacterium that grows optimally at 75°C. It uses a variety of polysaccharides, including crystalline cellulose and untreated plant biomass, and has potential utility in biomass conversion. Here we report its complete genome sequence of 2.97 Mb, which is contained within one chromosome and two plasmids (of 8.3 and 3.6 kb). The genome encodes a broad set of cellulolytic enzymes, transporters, and pathways for sugar utilization and, compared to those of other saccharolytic, anaerobic thermophiles, is most similar to that of Caldicellulosiruptor saccharolyticus DSM 8903.
Motivation: The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues.
Results: We present a novel de novo prediction algorithm for ncRNA genes using features derived from the sequences and structures of known ncRNA genes in comparison to decoys. Using these features, we have trained a neural network-based classifier and have applied it to Escherichia coli and Sulfolobus solfataricus for genome-wide prediction of ncRNAs. Our method has an average prediction sensitivity and specificity of 68% and 70%, respectively, for identifying windows with potential for ncRNA genes in E. coli. By combining windows of different sizes and using positional filtering strategies, we predicted 601 candidate ncRNAs and recovered 41% of known ncRNAs in E. coli. We experimentally investigated six novel candidates using Northern blot analysis and found expression of three candidates: one represents a potential new ncRNA, one is associated with stable mRNA decay intermediates and one is a case of either a potential riboswitch or transcription attenuator involved in the regulation of cell division. In general, our approach enables the identification of both cis- and trans-acting ncRNAs in partially or completely sequenced microbial genomes without requiring homology or structural conservation.
Availability: The source code and results are available at http://csbl.bmb.uga.edu/publications/materials/tran/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Populus trichocarpa is the first tree genome to be completed, and its whole genome is currently being assembled. No functional annotation about the repetitive elements in the Populus trichocarpa genome is currently available.
We predicted 9,623 repetitive elements in the Populus trichocarpa genome, and assigned functions to 3,075 of them (31.95%). The 9,623 repetitive elements cover ~40% of the current (partially) assembled genome. Among the 9,623 repetitive elements, 668 have copies only in the contigs that have not been assigned to one of the 19 chromosomes, while the rest all have copies in the partially assembled chromosomes.
All the predicted data are organized into an easy-to-use web-browsable database, RepPop. Various search capabilities are provided against the RepPop database. A Wiki system has been set up to facilitate functional annotation and curation of the repetitive elements by a community rather than just the database developer. The database RepPop will facilitate the assembling and functional characterization of the Populus trichocarpa genome.
Each genome has a stable distribution of the combined frequency for each k-mer and its reverse complement measured in sequence fragments as short as 1000 bps across the whole genome, for 1
We found that for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves the state of the art in terms of both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.
These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
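The barcode described above (the combined frequency of each k-mer and its reverse complement, measured per sequence fragment) can be sketched in Python as follows. The function names, the default `k=4`, and the use of non-overlapping 1000 bp fragments are assumptions for illustration; the abstract's own range of k is cut off in this copy and is not filled in here.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word):
    """Reverse complement of a DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def canonical_kmers(k):
    """One representative (the lexicographically smaller) per k-mer/revcomp pair."""
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({min(w, revcomp(w)) for w in words})


def barcode(genome, k=4, fragment=1000):
    """Return one frequency row per non-overlapping fragment of the genome."""
    genome = genome.upper()
    keys = canonical_kmers(k)
    index = {w: i for i, w in enumerate(keys)}
    rows = []
    for start in range(0, len(genome) - fragment + 1, fragment):
        piece = genome[start:start + fragment]
        row = [0] * len(keys)
        n = 0
        for i in range(len(piece) - k + 1):
            w = piece[i:i + k]
            c = min(w, revcomp(w))  # combine a k-mer with its reverse complement
            if c in index:          # ignores windows with ambiguous bases
                row[index[c]] += 1
                n += 1
        rows.append([x / n if n else 0.0 for x in row])
    return rows


if __name__ == "__main__":
    rows = barcode("ACGT" * 500)
    print(len(rows), len(rows[0]))
```

Each row is the barcode of one fragment. Comparing rows within a genome, for example by a distance between frequency vectors, is one way to flag fragments whose composition departs from the genome-wide pattern.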
Mobile genetic elements (MGEs) play an essential role in genome rearrangement and evolution, and are widely used as an important genetic tool.
In this article, we present genetic maps of recently active Insertion Sequence (IS) elements, the simplest form of MGEs, for all sequenced cyanobacteria and archaea, predicted based on the previously identified ~1,500 IS elements. Our predicted IS maps are consistent with the NCBI annotations of the IS elements. By linking the predicted IS elements to various characteristics of the organisms under study and the organism's living conditions, we found that (a) the activities of IS elements heavily depend on the environments where the host organisms live; (b) the number of recently active IS elements in a genome tends to increase with the genome size; (c) the flanking regions of the recently active IS elements are significantly enriched with genes encoding DNA binding factors, transporters and enzymes; and (d) IS movements show no tendency to disrupt operonic structures.
These are the first genome-scale maps of IS elements with detailed structural information at the sequence level. These genetic maps of recently active IS elements, together with the observations above, should help to improve our understanding of how IS elements proliferate and how they are involved in the evolution of their host genomes.
Systematic dissection of the sumoylation proteome is emerging as an appealing but challenging research topic because of the significant roles sumoylation plays in cellular dynamics and plasticity. Although several proteome-scale analyses have been performed to delineate potential sumoylatable proteins, the bona fide sumoylation sites still remain to be identified. Previously, we carried out a genome-wide analysis of the SUMO substrates in the human nucleus using the putative motif ψ-K-X-E and evolutionary conservation. However, a highly specific predictor for in silico prediction of sumoylation sites in any individual organism is still urgently needed to guide experimental design. In this work, we present a computational system, SUMOsp (SUMOylation Sites Prediction), based on a manually curated dataset, integrating the results of two methods, GPS and MotifX, which were originally designed for phosphorylation site prediction. SUMOsp offers at least as good prediction performance as the only available method, SUMOplot, on a very large test set. We expect that the prediction results of SUMOsp combined with experimental verifications will propel our understanding of sumoylation mechanisms to a new level. SUMOsp has been implemented on a freely accessible web server at: .
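The ψ-K-X-E consensus motif mentioned above can be scanned for with a short regular expression. The Python sketch below is a plain motif scan, not the SUMOsp predictor, which combines GPS and MotifX. The residue set chosen for ψ (`HYDROPHOBIC`) and the function name `find_psi_kxe` are assumptions, since the abstract does not define ψ beyond the motif itself.

```python
import re

# Residues accepted at the psi position. This set is an assumption made for
# illustration; definitions of "psi" vary between publications.
HYDROPHOBIC = "AILMFVP"

# Lookahead so that overlapping motif occurrences are all reported.
_MOTIF = re.compile(r"(?=([%s]K[A-Z]E))" % HYDROPHOBIC)


def find_psi_kxe(protein):
    """Return (lysine_position, motif) pairs; positions are 1-based."""
    protein = protein.upper()
    # m.start() is the 0-based psi index, so the lysine is at m.start() + 2 (1-based).
    return [(m.start() + 2, m.group(1)) for m in _MOTIF.finditer(protein)]


if __name__ == "__main__":
    for pos, motif in find_psi_kxe("MSLKDEAIKQEGG"):
        print(pos, motif)
```

A consensus scan like this would be expected to report many candidate sites that are not modified in vivo, which is one reason a trained predictor is useful.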
Protein phosphorylation plays a fundamental role in most cellular regulatory pathways. Experimental identification of the substrates of protein kinases (PKs) and their phosphorylation sites is labor-intensive and often limited by the availability and optimization of enzymatic reactions. Recently, large-scale analysis of the phosphoproteome by mass spectrometry (MS) has become a popular approach, but experimentally it is still difficult to distinguish the kinase-specific sites on the substrates. In this regard, the in silico prediction of phosphorylation sites with their specific kinases from protein primary sequences may provide guidelines for further experimental consideration and interpretation of MS phosphoproteomic data. A variety of such tools exist on the Internet and provide predictions for at most 30 PK subfamilies. We downloaded the verified phosphorylation sites from the public databases and curated the literature extensively for recently found phosphorylation sites. With the hypothesis that PKs in the same subfamily share similar consensus sequences/motifs/functional patterns on substrates, we clustered the 216 unique PKs into 71 PK groups, according to the BLAST results and protein annotations. Then, we applied the group-based phosphorylation scoring (GPS) method to the data set; here, we present a comprehensive PK-specific prediction server, GPS, which can predict kinase-specific phosphorylation sites from protein primary sequences for 71 different PK groups. GPS has been implemented in PHP and is available on a www server at .