Twelve cDNA libraries from two species of catfish have been sequenced, resulting in the generation of nearly 500,000 ESTs.
Through the Community Sequencing Program, a catfish EST sequencing project was carried out as a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited catfish EST resource was available for SNP identification.
A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35% of the unique sequences had significant similarities to known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and with the minor allele present in at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis.
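The high-quality SNP criterion stated above (at least four sequences in the contig, minor allele present in at least two) can be sketched as a simple filter. This is a minimal illustration, not the project's actual pipeline: the function name, its parameters, and the choice to treat the second most frequent base as the minor allele are all assumptions made here.

```python
def is_high_quality_snp(n_contig_seqs, allele_counts, min_seqs=4, min_minor=2):
    """Apply the two thresholds described in the text to one putative SNP.

    n_contig_seqs: number of ESTs assembled into the contig.
    allele_counts: mapping of base -> number of ESTs carrying that base
                   at the SNP position (hypothetical input format).
    """
    if n_contig_seqs < min_seqs:
        return False
    counts = sorted(allele_counts.values(), reverse=True)
    if len(counts) < 2:
        return False  # only one allele observed, so not a SNP
    # Assumption: the minor allele is the second most frequent base.
    return counts[1] >= min_minor


# Example: a 6-EST contig with 4 reads of A and 2 reads of G passes;
# a 10-EST contig with a single G read does not.
print(is_high_quality_snp(6, {"A": 4, "G": 2}))
print(is_high_quality_snp(10, {"A": 9, "G": 1}))
```

Requiring the minor allele in at least two sequences is what separates likely polymorphisms from isolated sequencing errors, which tend to appear in a single read.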
This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from the assembly of all catfish ESTs will greatly benefit the catfish introgression breeding program and whole genome association studies.
The ChIP-chip and ChIP-seq techniques enable genome-wide mapping of in vivo protein-DNA interactions and chromatin states. The cross-platform and between-laboratory variation poses a challenge to the comparison and integration of results from different ChIP experiments. We describe a novel method, MM-ChIP, which integrates information from cross-platform and between-laboratory ChIP-chip or ChIP-seq datasets. It improves both the sensitivity and the specificity of detecting ChIP-enriched regions, and is a useful meta-analysis tool for driving discoveries from multiple data sources.
Sorghum (Sorghum bicolor) is globally produced as a source of food, feed, fiber and fuel. Grain and sweet sorghums differ in a number of important traits, including stem sugar and juice accumulation, plant height as well as grain and biomass production. The first whole genome sequence of a grain sorghum is available, but additional genome sequences are required to study genome-wide and intraspecific variation for dissecting the genetic basis of these important traits and for tailor-designed breeding of this important C4 crop.
We resequenced two sweet and one grain sorghum inbred lines, and identified a set of nearly 1,500 genes differentiating sweet and grain sorghum. These genes fall into ten major metabolic pathways involved in sugar and starch metabolism, lignin and coumarin biosynthesis, nucleic acid metabolism, stress responses and DNA damage repair. In addition, we uncovered 1,057,018 SNPs, 99,948 indels of 1 to 10 bp in length and 16,487 presence/absence variations as well as 17,111 copy number variations. The majority of the large-effect SNPs, indels and presence/absence variations resided in genes containing leucine rich repeats, PPR repeats and disease resistance R genes possessing diverse biological functions or under diversifying selection, but were absent in genes that are essential for life.
This is the first report of the identification of genome-wide patterns of genetic variation in sorghum. The high-density SNP and indel markers reported here will be a valuable resource for future gene-phenotype studies and the molecular breeding of this important crop and related species.
The increasing volume of ChIP-chip and ChIP-seq data being generated creates a challenge for standard, integrative and reproducible bioinformatics data analysis platforms. We developed a web-based application called Cistrome, based on the Galaxy open source framework. In addition to the standard Galaxy functions, Cistrome has 29 ChIP-chip- and ChIP-seq-specific tools in three major categories, from preliminary peak calling and correlation analyses to downstream genome feature association, gene expression analyses, and motif discovery. Cistrome is available at http://cistrome.org/ap/.
MACS performs model-based analysis of ChIP-Seq data generated by short read sequencers.
We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.
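The idea of a dynamic Poisson background can be sketched as follows. This is an illustrative reading, not MACS's actual code: it assumes the local rate is taken as the largest of a genome-wide rate and rates estimated from windows around the candidate region, and the function names and inputs are hypothetical. The tail probability is computed with the standard library only, which is adequate for illustration but loses precision for very small p-values.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(genome_rate, window_rates):
    """Assumed rule: use the largest background estimate available."""
    return max([genome_rate, *window_rates])


def peak_p_value(tag_count, genome_rate, window_rates):
    """P-value of seeing at least tag_count tags under the local background."""
    return poisson_sf(tag_count, local_lambda(genome_rate, window_rates))


# The same 10 tags are less significant where the local background is high.
print(peak_p_value(10, genome_rate=2.0, window_rates=[1.0]))
print(peak_p_value(10, genome_rate=2.0, window_rates=[5.0]))
```

Taking the maximum makes the test conservative in regions where the local background is elevated, which is how a locally adapted rate can reduce false positives caused by local biases.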
The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%.
Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers.
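The evaluation measures above can be sketched in a few lines. The F-measure here is the balanced harmonic mean of precision and recall, as the text describes; the voting function is only one possible "simple voting scheme" and is an assumption, as are the function names and the placeholder identifiers.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def vote(system_outputs, min_votes):
    """Keep identifiers returned by at least min_votes systems.

    system_outputs: list of sets of gene identifiers, one set per system.
    """
    counts = Counter(gid for output in system_outputs for gid in set(output))
    return {gid for gid, n in counts.items() if n >= min_votes}


# Recall of the pooled 'maximum recall' system reported in the text.
print(round(763 / 785, 2))             # 0.97
# Its balanced F-measure, given precision 0.23 and recall 0.97.
print(round(f_measure(0.23, 0.97), 2))  # 0.37

# Hypothetical outputs from three systems for one abstract.
outputs = [{"g1", "g2"}, {"g2", "g3"}, {"g2"}]
print(vote(outputs, min_votes=2))       # {'g2'}
```

The pooled system illustrates the trade-off: accepting every identifier proposed by any participant maximizes recall, but the low precision pulls the balanced F-measure well below that of the best individual systems.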
Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.
A comprehensive regulatory module network of 15 bHLH transcription factors and more than 150 target genes in mouse brain has been constructed.
The basic/helix-loop-helix (bHLH) proteins are important components of the transcriptional regulatory network, controlling a variety of biological processes, especially the development of the central nervous system. Until now, reports describing the regulatory network of the bHLH transcription factor (TF) family have been scarce. In order to understand the regulatory mechanisms of bHLH TFs in mouse brain, we inferred their regulatory network from genome-wide gene expression profiles with the module networks method.
A regulatory network comprising 15 important bHLH TFs and 153 target genes was constructed. The network was divided into 28 modules based on expression profiles. A regulatory-motif search shows the complexity and diversity of the network. In addition, 26 cooperative bHLH TF pairs were also detected in the network. This cooperation suggests possible physical interactions or genetic regulation between TFs. Interestingly, some TFs in the network regulate more than one module. A novel cross-repression between Neurod6 and Hey2 was identified, which may control various functions in different brain regions. The presence of TF binding sites (TFBSs) in the promoter regions of their target genes validates more than 70% of TF-target gene pairs of the network. Literature mining provides additional support for five modules. More importantly, the regulatory relationships among selected key components are all validated in mutant mice.
Our network is reliable and very informative for understanding the role of bHLH TFs in mouse brain development and function. It provides a framework for future experimental analyses.
A normalization method based on probe GC content for two-color tiling arrays and an algorithm for detecting peak regions are presented. They are available in a stand-alone Java program.
A novel normalization method based on the GC content of probes is developed for two-color tiling arrays. The proposed method, together with robust estimates of the model parameters, is shown to perform superbly on published data sets. A robust algorithm for detecting peak regions is also formulated and shown to perform well compared to other approaches. The tools have been implemented as a stand-alone Java program called MA2C, which can display various plots of statistical analysis for quality control.
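The general idea of normalizing by probe GC content can be illustrated with a deliberately simplified sketch. This is not MA2C's model: it only groups probes by their GC count and subtracts each group's median log ratio, and the function names and input format are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def gc_count(seq):
    """Number of G and C bases in a probe sequence."""
    return sum(1 for base in seq.upper() if base in "GC")


def normalize_by_gc(probes):
    """Center log ratios within each GC-count group.

    probes: list of (probe_sequence, log_ratio) pairs.
    Returns the centered log ratios in the input order.
    """
    groups = defaultdict(list)
    for seq, value in probes:
        groups[gc_count(seq)].append(value)
    # The median is used as a robust estimate of each group's baseline.
    centers = {gc: median(values) for gc, values in groups.items()}
    return [value - centers[gc_count(seq)] for seq, value in probes]


# Toy probes: the GC-rich pair has a higher baseline than the AT-rich pair.
toy = [("GGCC", 2.0), ("GCGC", 4.0), ("ATAT", -1.0), ("TTAA", 1.0)]
print(normalize_by_gc(toy))  # [-1.0, 1.0, -1.0, 1.0]
```

After centering, the two groups share a common baseline, so remaining differences are less likely to reflect GC-dependent hybridization behavior alone.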
As we come to the end of 2011, Genome Biology has asked some members of our Editorial Board for their views on the state of play in genomics. What was their favorite paper of 2011? What are the challenges in their particular research area? Who has had the biggest influence on their careers? What advice would they give to young researchers embarking on a career in research?
Bisulfite treatment of DNA followed by high-throughput sequencing (Bisulfite-seq) is an important method for studying DNA methylation and epigenetic gene regulation, yet current software tools do not adequately address single nucleotide polymorphisms (SNPs). Identifying SNPs is important for accurate quantification of methylation levels and for identification of allele-specific epigenetic events such as imprinting. We have developed a model-based bisulfite SNP caller, Bis-SNP, that results in substantially better SNP calls than existing methods, thereby improving methylation estimates. At an average 30× genomic coverage, Bis-SNP correctly identified 96% of SNPs using the default high-stringency settings. The open-source package is available at http://epigenome.usc.edu/publicationdata/bissnp2011.
MicroRNAs (miRNAs) and their regulatory functions have been extensively characterized in model species but whether apple has evolved similar or unique regulatory features remains unknown.
We performed deep small RNA-seq and identified 23 conserved, 10 less-conserved and 42 apple-specific miRNAs or families with distinct expression patterns. The identified miRNAs target 118 genes representing a wide range of enzymatic and regulatory activities. Apple also conserves two TAS gene families with similar but unique trans-acting small interfering RNA (tasiRNA) biogenesis profiles and target specificities. Importantly, we found that miR159, miR828 and miR858 can collectively target up to 81 MYB genes potentially involved in diverse aspects of plant growth and development. These miRNA target sites are differentially conserved among MYBs, which is largely influenced by the location and conservation of the encoded amino acid residues in MYB factors. Finally, we found that 10 of the 19 miR828-targeted MYBs undergo small interfering RNA (siRNA) biogenesis at the 3' cleaved, highly divergent transcript regions, generating over 100 sequence-distinct siRNAs that potentially target over 70 diverse genes as confirmed by degradome analysis.
Our work identified and characterized apple miRNAs, their expression patterns, targets and regulatory functions. We also discovered that three miRNAs and the ensuing siRNAs exploit both conserved and divergent sequence features of MYB genes to initiate distinct regulatory networks targeting a multitude of genes inside and outside the MYB family.
Evolutionary divergence is common within bacterial species and populations, even during a single bacterial infection. We use large-scale genomic and phenotypic analysis to identify the extent of diversification in controlled experimental populations and apply these data to differentiate between several potential mechanisms of evolutionary divergence.
We defined testable differences between five proposed mechanisms and used experimental evolution studies to follow eight glucose-limited Escherichia coli chemostat populations at two growth rates. Simple phenotypic tests identified 11 phenotype combinations evolving under glucose limitation. Each evolved population exhibited 3 to 5 different combinations of the 11 phenotypic clusters. Genome sequencing of a representative of each phenotypic cluster from each population identified 193 mutations in 48 isolates. Only two of the 48 strains had evolved identically. Convergent paths to the same phenotype occurred, but two pleiotropic mutations were unique to slow-growing bacteria, permitting them greater phenotypic variance. Indeed, greater diversity arose in slower-growing, more stressed cultures. Mutation accumulation, hypermutator presence and fitness mechanisms varied between and within populations, with evolved fitness considerably more uniform in fast-growing cultures. Negative frequency-dependent fitness was shown by a subset of isolates.
Evolutionary diversity is unlikely to be explained by any one of the available mechanisms. For a large population as used in this study, our results suggest that multiple mechanisms contribute to the mix of phenotypes and evolved fitness types in a diversifying population. Another major conclusion is that the capacity of a population to diversify is a function of growth rate.
A response to "Dynamic cumulative activity of transcription factors as a mechanism of quantitative gene regulation" by F He, J Buer, AP Zeng and R Balling. Genome Biol 2007, 8:R181.
Comment on He et al.: http://genomebiology.com/2007/8/9/R181
Many eukaryotic genomes encode cis-natural antisense transcripts (cis-NATs). Sense and antisense transcripts may form double-stranded RNAs that are processed by the RNA interference machinery into small interfering RNAs (siRNAs). A few so-called nat-siRNAs have been reported in plants, mammals, Drosophila, and yeasts. However, many questions remain regarding the features and biogenesis of nat-siRNAs.
Through deep sequencing, we identified more than 17,000 unique siRNAs corresponding to cis-NATs from biotic and abiotic stress-challenged Arabidopsis thaliana and 56,000 from abiotic stress-treated rice. These siRNAs were enriched in the overlapping regions of NATs and exhibited either site-specific or distributed patterns, often with strand bias. Out of 1,439 and 767 cis-NAT pairs identified in Arabidopsis and rice, respectively, 84 and 119 could generate at least 10 siRNAs per million reads from the overlapping regions. Among them, 16 cis-NAT pairs from Arabidopsis and 34 from rice gave rise to nat-siRNAs exclusively in the overlap regions. Genetic analysis showed that the overlapping double-stranded RNAs could be processed by Dicer-like 1 (DCL1) and/or DCL3. The DCL3-dependent nat-siRNAs were also dependent on RNA-dependent RNA polymerase 2 (RDR2) and plant-specific RNA polymerase IV (PolIV), whereas only a fraction of DCL1-dependent nat-siRNAs was RDR- and PolIV-dependent. Furthermore, the levels of some nat-siRNAs were regulated by specific biotic or abiotic stress conditions in Arabidopsis and rice.
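The threshold of at least 10 siRNAs per million reads can be made concrete with a short sketch. The function names, the input mapping, and the pair labels are hypothetical; the sketch only shows the usual scaling of a raw count to reads per million and the cutoff applied to it.

```python
def reads_per_million(count, total_reads):
    """Scale a raw read count to reads per million sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return count * 1_000_000 / total_reads


def pairs_above_threshold(overlap_counts, total_reads, min_rpm=10.0):
    """Return cis-NAT pairs whose overlap-region siRNA abundance meets min_rpm.

    overlap_counts: mapping of pair label -> siRNA reads in the overlap region.
    """
    return [
        pair
        for pair, count in overlap_counts.items()
        if reads_per_million(count, total_reads) >= min_rpm
    ]


# In a library of 5 million reads, 50 reads correspond to exactly 10 per million.
counts = {"pairA": 50, "pairB": 49}
print(pairs_above_threshold(counts, total_reads=5_000_000))  # ['pairA']
```

Normalizing to library size lets the same cutoff be applied to the Arabidopsis and rice libraries even when their sequencing depths differ.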
Our results suggest that nat-siRNAs display distinct distribution patterns and are generated by DCL1 and/or DCL3. Our analysis further supported the existence of nat-siRNAs in plants and advanced our understanding of their characteristics.
Species in the ascomycete fungal genus Cordyceps have been proposed to be the teleomorphs of Metarhizium species. The latter have been widely used as insect biocontrol agents. Cordyceps species are highly prized for use in traditional Chinese medicines, but the genes responsible for biosynthesis of bioactive components, insect pathogenicity and the control of sexuality and fruiting have not been determined.
Here, we report the genome sequence of the type species Cordyceps militaris. Phylogenomic analysis suggests that different species in the Cordyceps/Metarhizium genera have evolved into insect pathogens independently of each other, and that their similar large secretomes and gene family expansions are due to convergent evolution. However, relative to other fungi, including Metarhizium spp., many protein families are reduced in C. militaris, which suggests a more restricted ecology. Consistent with its long track record of safe usage as a medicine, the Cordyceps genome does not contain genes for known human mycotoxins. We establish that C. militaris is sexually heterothallic but, very unusually, fruiting can occur without an opposite mating-type partner. Transcriptional profiling indicates that fruiting involves induction of the Zn2Cys6-type transcription factors and MAPK pathway; unlike other fungi, however, the PKA pathway is not activated.
The data offer a better understanding of Cordyceps biology and will facilitate the exploitation of medicinal compounds produced by the fungus.
Clonorchis sinensis is a carcinogenic human liver fluke that is widespread in Asian countries. Increasing infection rates of this neglected tropical disease are leading to negative economic and public health consequences in affected regions. Experimental and epidemiological studies have shown a strong association between the incidence of cholangiocarcinoma and the infection rate of C. sinensis. To aid research into this organism, we have sequenced its genome.
We combined de novo sequencing with computational techniques to provide new information about the biology of this liver fluke. The assembled genome has a total size of 516 Mb with a scaffold N50 length of 42 kb. Approximately 16,000 reliable protein-coding gene models were predicted. Genes for the complete pathways for glycolysis, the Krebs cycle and fatty acid metabolism were found, but key genes involved in fatty acid biosynthesis are missing from the genome, reflecting the parasitic lifestyle of a liver fluke that receives lipids from the bile of its host. We also identified pathogenic molecules that may contribute to liver fluke-induced hepatobiliary diseases. Large proteins such as multifunctional secreted proteases and tegumental proteins were identified as potential targets for the development of drugs and vaccines.
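For readers unfamiliar with the scaffold N50 statistic quoted above, the following sketch shows the standard definition: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly size. The function name and the toy lengths are illustrative and do not come from the C. sinensis assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum to half the total.
        if 2 * running >= total:
            return length


# Toy assembly of 30 units: 10 + 8 = 18 is the first running sum >= 15.
print(n50([10, 8, 6, 4, 2]))  # 8
```

An N50 of 42 kb therefore means that half of the assembled bases lie in scaffolds of 42 kb or longer.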
This study provides valuable genomic information about the human liver fluke C. sinensis and adds to our knowledge on the biology of the parasite. The draft genome will serve as a platform to develop new strategies for parasite control.
Exome sequencing, which allows the global analysis of protein coding sequences in the human genome, has become an effective and affordable approach to detecting causative genetic mutations in diseases. Currently, there are several commercial human exome capture platforms; however, the relative performances of these have not been characterized sufficiently to know which is best for a particular study.
We comprehensively compared three platforms: NimbleGen's Sequence Capture Array and SeqCap EZ, and Agilent's SureSelect. We assessed their performance in a variety of ways, including number of genes covered and capture efficacy. Differences that may impact on the choice of platform were that Agilent SureSelect covered approximately 1,100 more genes, while NimbleGen provided better flanking sequence capture. Although all three platforms achieved similar capture specificity of targeted regions, the NimbleGen platforms showed better uniformity of coverage and greater genotype sensitivity at 30- to 100-fold sequencing depth. All three platforms showed similar power in exome SNP calling, including medically relevant SNPs. Compared with genotyping and whole-genome sequencing data, the three platforms achieved a similar accuracy of genotype assignment and SNP detection. Importantly, all three platforms showed similar levels of reproducibility, GC bias and reference allele bias.
We demonstrate key differences between the three platforms, particularly the advantages of solution-based capture over array-based capture and the importance of a large gene target set.