The revolution in DNA sequencing technology continues unabated, and is affecting all aspects of the biological and medical sciences. The training and recruitment of the next generation of researchers who are able to use and exploit the new technology are severely lacking, which potentially hampers research and development efforts to advance genome biology. Here we present a cross-disciplinary course that provides undergraduate students with practical experience ranging from running a next generation sequencing instrument through to the analysis and annotation of the generated DNA sequences.
Many labs across the world are installing next generation sequencing technology, and we show that undergraduate students produced quality sequence data and were excited to participate in cutting-edge research. The students conducted the workflow from DNA extraction and library preparation, through running the sequencing instrument, to the extraction and analysis of the data. They sequenced microbes, metagenomes, and a marine mammal, the Californian sea lion, Zalophus californianus. The students met sequencing quality controls, had no detectable contamination in the targeted DNA sequences, provided publication quality data, and became part of an international collaboration to investigate carcinomas in carnivores.
Students learned important skills for their future education and career opportunities, and a perceived increase in students’ ability to conduct independent scientific research was measured. DNA sequencing is rapidly expanding in the life sciences. Teaching undergraduates to use the latest technology to sequence genomic DNA ensures they are ready to meet the challenges of the genomic era and allows them to participate in annotating the tree of life.
Undergraduate education; DNA sequencing; Sea lion; Metagenome
ChIP-chip and ChIP-seq are widely used methods to map protein-DNA interactions on a genomic scale in vivo. Waldminghaus and Skarstad recently reported, in this journal, a modified method for ChIP-chip. Based on a comparison of our previously-published ChIP-chip data for Escherichia coli σ32 with their own data, Waldminghaus and Skarstad concluded that many of the σ32 targets identified in our earlier work are false positives. In particular, we identified many non-canonical σ32 targets that are located inside genes or are associated with genes that show no detectable regulation by σ32. Waldminghaus and Skarstad propose that such non-canonical sites are artifacts, identified due to flaws in the standard ChIP methodology. Waldminghaus and Skarstad suggest specific changes to the standard ChIP procedure that reportedly eliminate the claimed artifacts.
We reanalyzed our published ChIP-chip datasets for σ32 and the datasets generated by Waldminghaus and Skarstad to assess data quality and reproducibility. We also performed targeted ChIP/qPCR for σ32 and an unrelated transcription factor, AraC, using the standard ChIP method and the modified ChIP method proposed by Waldminghaus and Skarstad. Furthermore, we determined the association of core RNA polymerase with disputed σ32 promoters, with and without overexpression of σ32. We show that (i) our published σ32 ChIP-chip datasets have a consistently higher dynamic range than those of Waldminghaus and Skarstad, (ii) our published σ32 ChIP-chip datasets are highly reproducible, whereas those of Waldminghaus and Skarstad are not, (iii) non-canonical σ32 target regions are enriched in a σ32 ChIP in a heat shock-dependent manner, regardless of the ChIP method used, (iv) association of core RNA polymerase with some disputed σ32 target genes is induced by overexpression of σ32, (v) σ32 targets disputed by Waldminghaus and Skarstad are predominantly those that are most weakly bound, and (vi) the modifications to the ChIP method proposed by Waldminghaus and Skarstad reduce enrichment of all protein-bound genomic regions.
The modifications to the ChIP-chip method suggested by Waldminghaus and Skarstad reduce rather than increase the quality of ChIP data. Hence, the non-canonical σ32 targets identified in our previous study are likely to be genuine. We propose that the failure of Waldminghaus and Skarstad to identify many of these σ32 targets is due predominantly to the lower data quality in their study. We conclude that surprising ChIP-chip results are not artifacts to be ignored, but rather indications that our understanding of DNA-binding proteins is incomplete.
ChIP-chip; ChIP-seq; σ32
High throughput gene expression technologies are a popular choice for researchers seeking molecular or systems-level explanations of biological phenomena. Nevertheless, there has been a groundswell of opinion that these approaches have not lived up to the hype because the interpretation of the data has lagged behind its generation. In our view a major problem has been an over-reliance on isolated lists of differentially expressed (DE) genes which, by simply comparing genes to themselves, have the pitfall of taking molecular information out of context. Numerous scientists have emphasised the need for better context. This can be achieved through holistic measurements of differential connectivity in addition to, or in place of, DE. However, many scientists continue to use isolated lists of DE genes as the major source of input data for common, readily available analytical tools. Focussing this opinion article on our own research in skeletal muscle, we outline our resolutions to these problems, particularly a universally powerful way of quantifying differential connectivity. With a well designed experiment, it is now possible to use gene expression to identify causal mutations and the other major effector molecules with which they cooperate, irrespective of whether they themselves are DE. We explain why, for various reasons, no other currently available experimental techniques or quantitative analyses are capable of reaching these conclusions.
Differential connectivity; Differential networking; Gene expression; Causal mutation algorithm
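The idea of differential connectivity described in the abstract above can be illustrated with a small sketch. This is not the authors' metric, which the abstract does not specify; it is a generic illustration that assumes connectivity is measured as the Pearson correlation between gene pairs, and that a gene's score is the summed absolute change in its correlations between two conditions. The gene names and expression values are hypothetical toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def differential_connectivity(expr_a, expr_b):
    """Sum, per gene, of |r_A - r_B| over all partner genes.

    expr_a, expr_b: dict mapping gene name -> expression values
    across the samples of condition A and condition B.
    """
    genes = sorted(set(expr_a) & set(expr_b))
    score = {gene: 0.0 for gene in genes}
    for g1, g2 in combinations(genes, 2):
        delta = abs(pearson(expr_a[g1], expr_a[g2])
                    - pearson(expr_b[g1], expr_b[g2]))
        score[g1] += delta
        score[g2] += delta
    return score


# Hypothetical toy data: REG has the same mean in both conditions
# (so it is not DE), but its relationship to T1 and T2 is inverted.
condition_a = {"REG": [1, 2, 3, 4, 5], "T1": [2, 4, 6, 8, 10], "T2": [1, 3, 2, 5, 4]}
condition_b = {"REG": [5, 4, 3, 2, 1], "T1": [2, 4, 6, 8, 10], "T2": [1, 3, 2, 5, 4]}

scores = differential_connectivity(condition_a, condition_b)
top_gene = max(scores, key=scores.get)
print(top_gene, scores)
```

In this toy case the gene with the largest score is the one whose mean expression does not change, which is the kind of signal a DE gene list alone would miss.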
Finished genome sequences and assemblies are available for only a few vertebrates. Thus, investigators studying many species must rely on draft genomes. Using the rhesus macaque as an example, we document the effects of sequencing errors, gaps in sequence and misassemblies on one automated gene model pipeline, Gnomon. The combination of draft genome with automated gene finding software can result in spurious sequences. We estimate that approximately 50% of the rhesus gene models are missing, incomplete or incorrect. The problems identified in this work likely apply to all draft vertebrate genomes annotated with any automated gene model pipeline and thus represent a pervasive challenge to the analysis of draft genomes.
Comparative studies of amniotes have been hindered by a dearth of reptilian molecular sequences. With the genomic assembly of the green anole, Anolis carolinensis, available, non-avian reptilian genes can now be compared to mammalian, avian, and amphibian homologs. Furthermore, with more than 350 extant species in the genus Anolis, anoles are an unparalleled example of tetrapod genetic diversity and divergence. Because Anolis is an important ecological, genetic and now genomic reference, it is imperative to develop a standardized Anolis gene nomenclature alongside associated vocabularies and other useful metrics.
Here we report the formation of the Anolis Gene Nomenclature Committee (AGNC) and propose a standardized evolutionary characterization code that will help researchers to define gene orthology and paralogy with tetrapod homologs, provide a system for naming novel genes in Anolis and other reptiles, furnish abbreviations to facilitate comparative studies among the Anolis species and related iguanid squamates, and classify the geographical origins of Anolis subpopulations.
This report has been generated in close consultation with members of the Anolis and genomic research communities, and using public database resources including NCBI and Ensembl. Updates will continue to be regularly posted to new research community websites such as lizardbase. We anticipate that this standardized gene nomenclature will facilitate the accessibility of reptilian sequences for comparative studies among tetrapods and will further serve as a template for other communities in their sequencing and annotation initiatives.
Small nucleolar RNAs (snoRNAs) are a large group of non-coding RNAs (ncRNAs) that mainly guide 2'-O-methylation (C/D RNAs) and pseudouridylation (H/ACA RNAs) of ribosomal RNAs. The pattern of rRNA modifications and the set of snoRNAs that guide these modifications are conserved in vertebrates. Nearly all snoRNA genes in vertebrates are localized in introns of other genes and are processed from pre-mRNAs. Thus, the same promoter is used for the transcription of snoRNAs and host genes.
The series of studies by Dahai Zhu and coworkers on snoRNAs and their genes were critically considered. We present evidence that dozens of species-specific snoRNAs that they described in vertebrates are experimental artifacts resulting from the improper use of Northern hybridization. The snoRNA genes with putative intrinsic promoters that were supposed to be transcribed independently proved to contain numerous substitutions and are, most likely, pseudogenes. In some cases, they are localized within introns of overlooked host genes. Finally, an increased number of snoRNA genes in mammalian genomes described by Zhu and coworkers is also an artifact resulting from two mistakes. First, numerous mammalian snoRNA pseudogenes were considered as genes, whereas most of them are localized outside of host genes and contain substitutions that question their functionality. Second, Zhu and coworkers failed to identify many snoRNA genes in non-mammalian species. As an illustration, we present 1352 C/D snoRNA genes that we have identified and annotated in vertebrates.
Our results demonstrate that conclusions based only on databases with automatically annotated ncRNAs can be erroneous. Special investigations aimed at distinguishing true RNA genes from their pseudogenes should be done. Zhu and coworkers, as well as most other groups studying vertebrate snoRNAs, give new names to newly described homologs of human snoRNAs, which significantly complicates comparison between different species. It seems necessary to develop a uniform nomenclature for homologs of human snoRNAs in other vertebrates, e.g., human gene names prefixed with a several-letter code denoting the vertebrate species.
Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific characteristics and limitations that must either be considered during project design or be addressed during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers (including the Genome Analyzers I/II/IIe/IIx and the new HiScan and HiSeq) represent a widely used platform providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible-terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the end of short-insert molecules, as well as increased error rates and short read lengths, complicates many computational analyses. We discuss here some of the factors that influence the frequency and severity of these problems and provide solutions for circumventing them. Further, we present a set of general principles for good analysis practice that enable problems with sequencing runs to be identified and dealt with.
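As an illustration of the adapter problem mentioned above, the following minimal sketch removes adapter sequence from the 3' end of a read when the insert is shorter than the read length. It is a simplified stand-in, not the procedure described by the authors: it assumes exact matching only (no tolerance for sequencing errors), and the function name, the `min_overlap` parameter and the adapter string used in the example are illustrative assumptions.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching.

    The adapter may be fully contained in the read, or only its
    first bases may be present at the very end of the read.
    At least `min_overlap` matching bases are required, to avoid
    trimming on short chance matches.
    """
    for start in range(len(read) - min_overlap + 1):
        # Slice is shorter than the adapter when it reaches the read end.
        tail = read[start:start + len(adapter)]
        if adapter.startswith(tail):
            return read[:start]
    return read


# Illustrative adapter sequence (an assumption for this example).
ADAPTER = "AGATCGGAAGAGC"

full_hit = trim_adapter("ACGTACGT" + ADAPTER + "TTTT", ADAPTER)
partial_hit = trim_adapter("ACGTACGT" + ADAPTER[:5], ADAPTER)
no_hit = trim_adapter("ACGTACGTTTTT", ADAPTER)
print(full_hit, partial_hit, no_hit)
```

Production tools additionally allow mismatches and use base qualities. Exact matching is chosen here only to keep the logic easy to follow.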
In recent years numerous studies have undertaken to measure the impact of patents, material transfer agreements, data-withholding and commercialization pressures on biomedical researchers. Of particular concern is the theory that such pressures may have negative effects on academic and other upstream researchers. In response to these concerns, commentators in some research communities have called for an increased level of access to, and sharing of, data and research materials. We have been studying how data and materials are shared in the community of researchers who use the nematode Caenorhabditis elegans (C. elegans) as a model organism for biological research. Specifically, we conducted a textual analysis of academic articles referencing C. elegans, reviewed C. elegans repository request lists, scanned patents that reference C. elegans and conducted a broad survey of C. elegans researchers. Of particular importance in our research was the role of the C. elegans Gene Knockout Consortium in the facilitation of sharing in this community.
Our research suggests that a culture of sharing exists within the C. elegans research community. Furthermore, our research provides insight into how this sharing operates and the role of the culture that underpins it.
The greater scientific community is likely to benefit from understanding the factors that motivate C. elegans researchers to share. In this sense, our research is a 'response' to calls for a greater amount of sharing in other research communities, such as the mouse community. Specifically, these are calls for increased investment in and support of centralized resource sharing infrastructure, grant-based funding of data-sharing, clarity of third party recommendations regarding sharing, third party insistence on post-publication data sharing, a decrease in patenting and restrictive material transfer agreements, and increased attribution and reward.
The pig genome is being sequenced and characterised under the auspices of the Swine Genome Sequencing Consortium. The sequencing strategy followed a hybrid approach combining hierarchical shotgun sequencing of BAC clones and whole genome shotgun sequencing.
Assemblies of the BAC clone derived genome sequence have been annotated using the Pre-Ensembl and Ensembl automated pipelines and made accessible through the Pre-Ensembl/Ensembl browsers. The current annotated genome assembly (Sscrofa9) was released with Ensembl 56 in September 2009. A revised assembly (Sscrofa10) is under construction and will incorporate whole genome shotgun sequence (WGS) data providing > 30× genome coverage. The WGS sequences, most of which comprise short Illumina/Solexa reads, were generated from DNA from the same single Duroc sow as the source of the BAC library from which clones were preferentially selected for sequencing. In accordance with the Bermuda and Fort Lauderdale agreements and the more recent Toronto Statement, the data have been released into public sequence repositories (Genbank/EMBL, NCBI/Ensembl trace repositories) in a timely manner and in advance of publication.
In this marker paper, the Swine Genome Sequencing Consortium (SGSC) sets out its plans for analysis of the pig genome sequence, and for the application and publication of the results.
A response to Toplak et al: Does replication groups scoring reduce false positive rate in SNP interaction discovery? BMC Genomics 2010, 11:58.
The genomewide evaluation of genetic epistasis is a computationally demanding task, and a current challenge in Genetics. HFCC (Hypothesis-Free Clinical Cloning) is one of the methods that have been suggested for genomewide epistasis analysis. In order to perform an exhaustive search of epistasis, HFCC has implemented several tools and data filters, such as the use of multiple replication groups, and direction of effect and control filters. A recent article has claimed that the use of multiple replication groups (as implemented in HFCC) does not reduce the false positive rate, and we hereby try to clarify these issues.
HFCC uses, as an analysis strategy, the possibility of replicating findings in multiple replication groups, in order to select a liberal subset of preliminary results that are above a statistical criterion and consistent in direction of effect. We show that the use of replication groups and the direction filter reduces the false positive rate of a study, although at the expense of lowering the overall power of the study. A post-hoc analysis of these selected signals in the combined sample could then be performed to select the most promising results.
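The effect of requiring replication with a consistent direction of effect can be sketched numerically. The code below is not the HFCC implementation; it is a minimal illustration that assumes independent replication groups, a two-sided z-test at level `alpha` in each group, and no true effects (a global null). Under these assumptions a null signal passes all groups in the same direction with probability 2·(alpha/2)^k for k groups, and a seeded Monte Carlo run is included as a check. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def expected_null_pass_rate(alpha, n_groups):
    """Probability that a null signal is significant in every group
    AND has the same sign in every group (two-sided tests)."""
    return 2 * (alpha / 2) ** n_groups


def simulate_null_pass_rate(alpha, n_groups, n_tests=100_000, seed=0):
    """Monte Carlo estimate of the same probability using null z-scores."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    passed = 0
    for _ in range(n_tests):
        z_scores = [rng.gauss(0, 1) for _ in range(n_groups)]
        significant = all(abs(z) > z_crit for z in z_scores)
        same_direction = (all(z > 0 for z in z_scores)
                          or all(z < 0 for z in z_scores))
        if significant and same_direction:
            passed += 1
    return passed / n_tests


single_group = simulate_null_pass_rate(0.05, 1)
two_groups = simulate_null_pass_rate(0.05, 2)
print(single_group, expected_null_pass_rate(0.05, 1))
print(two_groups, expected_null_pass_rate(0.05, 2))
```

The same filter also discards true signals that miss the threshold in any one group, which is the loss of power noted in the text.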
Replication of results in independent samples is generally used in scientific studies to establish credibility in a finding. Nonetheless, the combined analysis of several datasets is known to be a preferable and more powerful strategy for the selection of top signals. HFCC is a flexible and complete analysis tool, and one of its analysis options combines these two strategies: A preliminary multiple replication group analysis to eliminate inconsistent false positive results, and a post-hoc combined-group analysis to select the top signals.
The number of databases in molecular biological fields has rapidly increased to provide a large-scale resource. Though valuable information is available, data can be difficult to access, compare and integrate due to different formats and presentations of web interfaces. This paper offers a practical guide to the integration of gene, comparative genomic, and functional genomics data using the Ensembl website at http://www.ensembl.org.
The Ensembl genome browser and underlying databases focus on chordate organisms. More species such as plants and microorganisms can be investigated using our sister browser at http://www.ensemblgenomes.org.
In this study, four examples are used that sample many pages and features of the Ensembl browser. We focus on comparative studies across over 50 mostly chordate organisms, variations linked to disease, functional genomics, and access of external information housed in databases outside the Ensembl project. Researchers will learn how to go beyond simply exporting one gene sequence, and explore how a genome browser can integrate data from various sources and databases to build a full and comprehensive biological picture.
While new sequencing technologies have ushered in an era where microbial genomes can be easily sequenced, the goal of routinely producing high-quality draft and finished genomes in a cost-effective fashion has remained elusive. Due to shorter read lengths and limitations in library construction protocols, shotgun sequencing and assembly based on these technologies often result in fragmented assemblies. Correspondingly, while draft assemblies can be obtained in days, finishing can take many months and hence the time and effort can only be justified for high-priority genomes and in large sequencing centers. In this work, we revisit this issue in light of our own experience in producing finished and nearly-finished genomes for a range of microbial species in a small-lab setting. These genomes were finished with surprisingly little investment in terms of time, computational effort and lab work, suggesting that the increased access to sequencing might also eventually lead to a greater proportion of finished genomes from small labs and genomics cores.
Anamika et al recently published in this journal a sequence alignment analysis of protein kinases encoded by the chimpanzee genome in comparison to those in the human genome. From this analysis they concluded that several chimpanzee kinases have unusual domain arrangements.
Re-examination of these kinases reveals that the claimed novel arrangements cannot withstand scrutiny; each is either not novel or represents over-analysis of low-confidence, computer-generated gene models. Additional sequence evidence available at the time of the paper's submission either directly contradicts the gene models or suggests alternate gene models. These alternate models would minimize or eliminate the observed differences between human and chimp kinases.
None of the proposed novel chimpanzee kinase architectures is supported by experimental evidence. Guidelines to prevent such erroneous conclusions in similar papers are proposed.
Duplications and rearrangements of coding genes are major themes in the evolution of mitochondrial genomes, bearing important consequences for the function of mitochondria and the fitness of organisms. Yu et al. (BMC Genomics 2008, 9:477) reported the complete mt genome sequence of the oyster Crassostrea hongkongensis (16,475 bp) and found that a DNA segment containing four tRNA genes (trnK1, trnC, trnQ1 and trnN), a duplicated (rrnS) and a split rRNA gene (rrnL5') was absent compared with that of two other Crassostrea species. It was suggested that the absence was a novel case of "tandem duplication-random loss" with evolutionary significance. We independently sequenced the complete mt genome of three C. hongkongensis individuals, all of which were 18,622 bp and contained the segment that was missing in Yu et al.'s sequence. Further, we designed primers, verified sequences and demonstrated that the sequence loss in Yu et al.'s study was an artifact caused by placing primers in a duplicated region. The duplication and split of ribosomal RNA genes are unique to Crassostrea oysters and not lost in C. hongkongensis. Our study highlights the need for caution when amplifying and sequencing through duplicated regions of the genome.
In the work of Chari et al. entitled "Effect of active smoking on the human bronchial epithelium transcriptome" the authors use SAGE to identify candidate gene expression changes in bronchial brushings from never, former, and current smokers. These gene expression changes are categorized into those that are reversible or irreversible upon smoking cessation. A subset of these identified genes is validated on an independent cohort using RT-PCR. The authors conclude that their results support the notion of gene expression changes in the lungs of smokers which persist even after an individual has quit.
This correspondence raises questions about the validity of the approach used by the authors to analyze their data. The majority of the reported results suffer deficiencies due to the methods used. The most fundamental of these are explained in detail: biases introduced during data processing, lack of correction for multiple testing, and an incorrect use of clustering for gene discovery. A randomly generated "null" dataset is used to show the consequences of these shortcomings.
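The consequence of omitting a multiple-testing correction can be illustrated with a randomly generated null dataset, in the spirit of the approach described above. The abstract does not state which correction the correspondence applies, so the Benjamini-Hochberg step-up procedure is used here purely as an assumed example. The number of tests, the seed and the thresholds are likewise illustrative.

```python
import random


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at the given FDR level."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if pvalues[index] <= rank * fdr / m:
            largest_rank = rank
    return sorted(order[:largest_rank])


# Null dataset: with no true effects, p-values are uniform on [0, 1].
rng = random.Random(1)
null_pvalues = [rng.random() for _ in range(10_000)]

uncorrected_hits = sum(p <= 0.05 for p in null_pvalues)
corrected_hits = len(benjamini_hochberg(null_pvalues, fdr=0.05))
print(uncorrected_hits, corrected_hits)
```

With 10,000 null tests, roughly 5% are expected to fall below 0.05 by chance alone, so an uncorrected analysis yields hundreds of apparent "discoveries" even though no real signal exists.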
Most of Chari et al.'s findings are consistent with what would be expected by chance alone. Although there is clear evidence of reversible changes in gene expression, the majority of those identified appear to be false positives. However, contrary to the authors' claims, no irreversible changes were identified. There is a broad consensus that genetic change due to smoking persists once an individual has quit smoking; unfortunately, this study lacks sufficient scientific rigour to support or refute this hypothesis or identify any specific candidate genes. The pitfalls of large-scale analysis, as exemplified here, may not be unique to Chari et al.
Over the past two decades, genomics has evolved as a scientific research discipline. Genomics research was fueled initially by government and nonprofit funding sources, later augmented by private research and development (R&D) funding. Citizens and taxpayers of many countries have funded much of the research, and have expectations about access to the resulting information and knowledge. While access to knowledge gained from all publicly funded research is desired, access is especially important for fields that have broad social impact and stimulate public dialogue. Genomics is one such field, where public concerns are raised for reasons such as health care and insurance implications, as well as personal and ancestral identification. Thus, genomics has grown rapidly as a field, and attracts considerable interest.
One way to study the growth of a field of research is to examine its funding. This study focuses on public funding of genomics research, identifying and collecting data from major government and nonprofit organizations around the world, and updating previous estimates of world genomics research funding, including information about geographical origins. We initially identified 89 publicly funded organizations; we requested information about each organization's funding of genomics research. Of these organizations, 48 responded and 34 reported genomics research expenditures (of those that responded but did not supply information, some did not fund such research, others could not quantify it). The figures reported here include all the largest funders and we estimate that we have accounted for most of the genomics research funding from government and nonprofit sources.
Aggregate spending on genomics research from 34 funding sources averaged around $2.9 billion in 2003–2006. The United States spent more than any other country on genomics research, corresponding to 35% of the overall worldwide public funding (compared to 49% US share of public health research funding for all purposes). When adjusted for genomics funding intensity, however, the United States dropped below Ireland, the United Kingdom, and Canada, as measured both by genomics research expenditure per capita and per unit of Gross Domestic Product.
A reanalysis of the sequences reported by Hoegg et al has highlighted the presence of a putative HoxC1a gene in Astatotilapia burtoni. We discuss the evolutionary history of the HoxC1a gene in the teleost fish lineages and suggest that the HoxC1a gene was lost twice independently in the Neoteleosts. This comment points out that combining several gene-finding methods and a Hox-dedicated program can improve the identification of Hox genes.
A response to Snyder LA, Saunders NJ: The majority of genes in the pathogenic Neisseria species are present in non-pathogenic Neisseria lactamica, including those designated as virulence genes. BMC Genomics 2006, 7:128.