The sciences have seen a large increase in demand for students in bioinformatics
and multidisciplinary fields in general. Many new educational programs have been
created to satisfy this demand, but navigating these programs requires a
non-traditional outlook and emphasizes working in teams of individuals with
distinct yet complementary skill sets. Written from the perspective of a current
bioinformatics student, this article seeks to offer advice to prospective and
current students in bioinformatics regarding what to expect in their educational
program, how multidisciplinary fields differ from more traditional paths, and
decisions that they will face on the road to becoming successful, productive
bioinformatics; education; multidisciplinary education; multidisciplinary research; bioinformatics education; computational biology
We present TaqMan-minor groove binding (MGB) assays for an SNP that
separates the Yersinia pestis strain CO92 from all other strains and for another
SNP that separates North American strains from all other global strains.
Motivation: Biological analysis has shifted from identifying genes and transcripts to mapping these genes and transcripts to biological functions. The ENCODE Project has generated hundreds of ChIP-Seq experiments spanning multiple transcription factors and cell lines for public use, but tools for a biomedical scientist to analyze these data are either non-existent or tailored to narrow biological questions. We present the ENCODE ChIP-Seq Significance Tool, a flexible web application leveraging public ENCODE data to identify enriched transcription factors in a gene or transcript list for comparative analyses.
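As an illustration of what "enriched transcription factors in a gene list" can mean in practice, the sketch below scores enrichment with a one-sided hypergeometric test. This is an assumption made for illustration: the abstract above does not state which statistic the tool uses, and the function name, argument names, and gene counts are all hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, bound_in_population, sample, bound_in_sample):
    """One-sided hypergeometric p-value, P(X >= bound_in_sample).

    population          -- number of genes in the background set
    bound_in_population -- background genes bound by the factor
    sample              -- number of genes in the user's list
    bound_in_sample     -- genes in the user's list bound by the factor
    """
    upper = min(sample, bound_in_population)
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as the one observed.
    tail = sum(
        comb(bound_in_population, k) * comb(population - bound_in_population, sample - k)
        for k in range(bound_in_sample, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 1,500 bound by the factor,
# and a 100-gene list in which 20 genes are bound.
p_value = hypergeom_enrichment_p(20000, 1500, 100, 20)
print(f"enrichment p-value: {p_value:.3g}")
```

When many factors are tested against the same list, the resulting p-values would normally need a multiple-testing correction, which this sketch omits.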
Supplementary material is available at Bioinformatics online.
Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intra-genic, extra-genic and inter-genic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically-identified disease-associated non-coding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into the transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.
Accurate chromosome segregation requires centromeres (CENs), the DNA sequences where kinetochores form, to attach chromosomes to microtubules. In contrast to most eukaryotes, which have broad centromeres, Saccharomyces cerevisiae possesses sequence-defined point CENs. Chromatin immunoprecipitation followed by sequencing (ChIP–Seq) reveals colocalization of four kinetochore proteins at novel, discrete, non-centromeric regions, especially when levels of the centromeric histone H3 variant, Cse4 (a.k.a. CENP-A or CenH3), are elevated. These regions of overlapping protein binding enhance the segregation of plasmids and chromosomes and have thus been termed Centromere-Like Regions (CLRs). CLRs form in close proximity to S. cerevisiae CENs and share characteristics typical of both point and regional CENs. CLR sequences are conserved among related budding yeasts. Many genomic features characteristic of CLRs are also associated with these conserved homologous sequences from closely related budding yeasts. These studies provide general and important insights into the origin and evolution of centromeres.
Centromeres (CENs) are chromosomal regions essential for proper chromosome segregation through their ability to establish evolutionarily conserved protein complexes called kinetochores. During mitosis, kinetochores attach to microtubules emanating from spindle poles, thus providing the mechanism for chromosome segregation. Eukaryotes have different types of CENs. Most eukaryotes have large multimeric centromeres lacking DNA sequence specificity. In contrast, the budding yeast, S. cerevisiae, has short punctate centromeres, comprised of specific DNA sequences. Combining chromatin immunoprecipitation and deep sequencing, we identified regions of the yeast genome that are bound by key kinetochore components; we refer to these regions as Centromere-Like Regions (CLRs). We found that CLRs can promote segregation on episomal plasmids and native chromosomes. Most CLRs are found in intergenic regions, close to native CENs. CLRs resemble point CENs by their short size and regional centromeres by their lack of determining DNA sequences. CLR sequences are conserved among related budding yeasts. Our findings indicate that, similar to other fungi and eukaryotes, S. cerevisiae possesses the ability to form sequence-independent centromeric structures. Establishment of centromeric elements outside regular CENs, or neocentromerization, can lead to chromosome missegregation and is a hallmark of cancer cells. CLR formation in budding yeast provides a simple model of neocentromerization.
Chromatin-remodeling enzymes play essential roles in many biological processes, including gene expression, DNA replication and repair, and cell division. Although one such complex, SWI/SNF, has been extensively studied, new discoveries are still being made. Here, we review SWI/SNF biochemistry; highlight recent genomic and proteomic advances; and address the role of SWI/SNF in human diseases, including cancer and viral infections. These studies have greatly increased our understanding of complex nuclear processes.
Cancer; Chromatin; Chromatin Immunoprecipitation (ChIP); Chromatin Remodeling; DNA Sequencing; HIV-1; Mass Spectrometry (MS); Transcriptional Regulation; Viral Transcription; SWI/SNF
Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation'. But is this sufficient to ensure cost-effective and efficient 'knowledge generation'?
Bioinformatics; costs of sequencing; data analysis; experimental design; next-generation sequencing; sample collection
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
A systems understanding of nuclear organization and events is critical for determining how cells divide, differentiate, and respond to stimuli and for identifying the causes of diseases. Chromatin remodeling complexes such as SWI/SNF have been implicated in a wide variety of cellular processes including gene expression, nuclear organization, centromere function, and chromosomal stability, and mutations in SWI/SNF components have been linked to several types of cancer. To better understand the biological processes in which chromatin remodeling proteins participate, we globally mapped binding regions for several components of the SWI/SNF complex throughout the human genome using ChIP-Seq. SWI/SNF components were found to lie near regulatory elements integral to transcription (e.g. 5′ ends, RNA Polymerases II and III, and enhancers) as well as regions critical for chromosome organization (e.g. CTCF, lamins, and DNA replication origins). Interestingly, we also find that certain configurations of SWI/SNF subunits are associated with transcripts that have higher levels of expression, whereas other configurations of SWI/SNF factors are associated with transcripts that have lower levels of expression. To further elucidate the association of SWI/SNF subunits with each other as well as with other nuclear proteins, we also analyzed SWI/SNF immunoprecipitated complexes by mass spectrometry. Individual SWI/SNF factors are associated with their own family members, as well as with cellular constituents such as nuclear matrix proteins, key transcription factors, and centromere components, implying a ubiquitous role in gene regulation and nuclear function. We find an overrepresentation of both SWI/SNF-associated regions and proteins in cell cycle and chromosome organization.
Taken together, the results from our ChIP and immunoprecipitation experiments suggest that SWI/SNF facilitates gene regulation and genome function more broadly and through a greater diversity of interactions than previously appreciated.
Genetic information and programming are not entirely contained in DNA sequence but are also governed by chromatin structure. Gaining a greater understanding of chromatin remodeling complexes can bridge gaps between processes in the genome and the epigenome and can offer insights into diseases such as cancer. We identified targets of the chromatin remodeling complex, SWI/SNF, on a genome-wide scale using ChIP-Seq. We also identify proteins that co-purify with its various components via immunoprecipitation combined with mass spectrometry. By integrating these newly-identified regions with a combination of novel and published data sources, we identify pathways and cellular compartments in which SWI/SNF plays a major role as well as discern general characteristics of SWI/SNF target sites. Our parallel evaluations of multiple SWI/SNF factors indicate that these subunits are found in highly dynamic and combinatorial assemblies. Our study presents the first genome-wide and unified view of multiple SWI/SNF components and also provides a valuable resource to the scientific community as an important data source to be integrated with future genomic and epigenomic studies.
In parallel to the growth in bioscience databases, biomedical publications have increased exponentially in the past decade. However, the extraction of high-quality information from the corpus of scientific literature has been hampered by the lack of machine-interpretable content, despite text-mining advances. To address this, we propose creating a structured digital table as part of an overall effort in developing machine-readable, structured digital literature. In particular, we envision transforming publication tables into standardized triples using Semantic Web approaches. We identify three canonical types of tables (conveying information about properties, networks, and concept hierarchies) and show how more complex tables can be built from these basic types. We envision that authors would create tables initially using the structured triples for canonical types and then have them visually rendered for publication, and we present examples for converting representative tables into triples. Finally, we discuss how ‘stub’ versions of structured digital tables could be a useful bridge for connecting the literature with databases, allowing the former to more precisely document the latter.
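To make the idea of converting a table into triples concrete, here is a minimal sketch for the simplest of the three canonical types, a property table. It assumes the first column names the subject, each remaining column header acts as a predicate, and each cell is the object. The function name and the example table are hypothetical, and plain tuples stand in for a real Semantic Web serialization.

```python
def property_table_to_triples(header, rows):
    """Turn a property table into (subject, predicate, object) triples.

    header -- column names; the first column identifies the subject
    rows   -- table rows, each aligned with `header`
    """
    _subject_column, *predicates = header
    triples = []
    for row in rows:
        subject, *values = row
        for predicate, value in zip(predicates, values):
            # Empty cells carry no statement, so they produce no triple.
            if value is not None:
                triples.append((subject, predicate, value))
    return triples


# Hypothetical table with one empty cell.
header = ["gene", "chromosome", "strand"]
rows = [
    ["geneA", "chr1", "+"],
    ["geneB", "chr2", None],
]
for triple in property_table_to_triples(header, rows):
    print(triple)
```

Network and concept-hierarchy tables would need their own mappings; only the property case is sketched here.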
bioinformatics; data integration; semantic publishing; Semantic Web; triplification
Chromatin immunoprecipitation followed by tag sequencing (ChIP-Seq) using high-throughput next-generation instrumentation is replacing ChIP-chip for mapping of sites of transcription-factor binding and chromatin modification. To develop a scoring approach for this new technique, we produce two deeply sequenced datasets for human RNA polymerase II and STAT1 with matching input-DNA controls. In these, we observe that signal peaks corresponding to sites of potential binding are strongly correlated with peaks in the control, likely revealing features of open chromatin. Based on these observations, we develop a two-pass approach for scoring ChIP-Seq relative to controls. The first pass identifies putative binding sites and compensates for genomic variation in the mappability of sequences. The second pass filters sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Using our scoring we investigate optimal experimental design – i.e. depth of sequencing and value of replicas (showing marginal information gain beyond two).
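The two-pass idea described above can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: it assumes reads are already counted per candidate region, that mappability is summarized as a single fraction per region, that the control is normalized by one global scale factor, and that significance is assessed with a one-sided binomial test at p = 0.5. All names, thresholds, and counts are hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def score_regions(regions, scale, min_tags, alpha):
    """Two-pass scoring of candidate regions against a control.

    regions  -- list of (name, chip_tags, control_tags, mappable_fraction)
    scale    -- factor that normalizes control counts to the ChIP sample
    min_tags -- first-pass tag threshold for a fully mappable region
    alpha    -- second-pass significance cutoff
    """
    kept = []
    for name, chip, control, mappable in regions:
        # Pass 1: keep putative sites, lowering the threshold where fewer
        # positions are uniquely mappable.
        if chip < min_tags * mappable:
            continue
        # Pass 2: compare against the normalized control.
        scaled_control = round(control * scale)
        p_value = binomial_tail(chip, chip + scaled_control)
        if p_value <= alpha:
            enrichment = chip / max(scaled_control, 1)
            kept.append((name, enrichment, p_value))
    return kept


# Hypothetical counts for three candidate regions.
regions = [
    ("region1", 40, 10, 1.0),
    ("region2", 12, 10, 1.0),
    ("region3", 3, 0, 1.0),
]
for name, enrichment, p_value in score_regions(regions, scale=1.0, min_tags=10, alpha=0.01):
    print(name, enrichment, f"{p_value:.3g}")
```

A genome-wide analysis would also have to correct the second-pass p-values for multiple testing, which is left out of this sketch.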
Francisella tularensis is the etiologic agent of tularemia and is classified as a select agent by the Centers for Disease Control and Prevention. Currently, four subspecies of F. tularensis that differ in virulence and geographical distribution are recognized: tularensis (type A), holarctica (type B), mediasiatica, and novicida. Because of the select agent status and differences in virulence and geographical location, the molecular analysis of any clinical case of tularemia is of particular interest. We analyzed an unusual Francisella clinical isolate from a human infection in Arizona using multiple DNA-based approaches.
We report that the isolate is F. tularensis subsp. novicida, a subspecies that is rarely isolated.
The rarity of this novicida subspecies in clinical settings makes each case study important for our understanding of its role in disease and its genetic relationship with other F. tularensis subspecies.
Francisella tularensis contains several highly pathogenic subspecies, including Francisella tularensis subsp. holarctica, whose distribution is circumpolar in the northern hemisphere. The phylogeography of these subspecies and their subclades was examined using whole-genome single nucleotide polymorphism (SNP) analysis, high-density microarray SNP genotyping, and real-time-PCR-based canonical SNP (canSNP) assays. Almost 30,000 SNPs were identified among 13 whole genomes for phylogenetic analysis. We selected 1,655 SNPs to genotype 95 isolates on a high-density microarray platform. Finally, 23 clade- and subclade-specific canSNPs were identified and used to genotype 496 isolates to establish global geographic genetic patterns. We confirm previous findings concerning the four subspecies and two Francisella tularensis subsp. tularensis subpopulations and identify additional structure within these groups. We identify 11 subclades within F. tularensis subsp. holarctica, including a new, genetically distinct subclade that appears intermediate between Japanese F. tularensis subsp. holarctica isolates and the common F. tularensis subsp. holarctica isolates associated with the radiation event (the B radiation) wherein this subspecies spread throughout the northern hemisphere. Phylogenetic analyses suggest a North American origin for this B-radiation clade and multiple dispersal events between North America and Eurasia. These findings indicate a complex transmission history for F. tularensis subsp. holarctica.
Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs.
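On the analysis side, multiplexed sequencing requires assigning each read back to its sample. The sketch below shows one simple way this can be done, assuming the barcode sits at the start of each read, all barcodes have the same length, and only exact matches are accepted. The barcode sequences and the function name are hypothetical and are not the ones used in the study.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample using an exact-match barcode prefix.

    reads    -- iterable of read sequences (strings)
    barcodes -- dict mapping barcode sequence to sample name
    Returns a dict of sample name -> reads with the barcode trimmed off;
    reads without a known barcode are kept whole under "unassigned".
    """
    lengths = {len(barcode) for barcode in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    width = lengths.pop()

    by_sample = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:width])
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(read[width:])
    return dict(by_sample)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "Ste12", "TGCA": "input"}
reads = ["ACGTTTTT", "TGCAGGGG", "NNNNAAAA"]
print(demultiplex(reads, barcodes))
```

Exact matching discards reads with a sequencing error in the barcode; tolerating mismatches is possible but requires barcodes that differ from one another at enough positions.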
We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high quality data that was indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric targets correspond to highly expressed genes in rich media. The presence of Cse4 non-centromeric binding sites was not reported previously.
We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to further decrease costs per sample and to accelerate the completion of large consortium projects such as modENCODE.
Burkholderia pseudomallei is the etiologic agent of melioidosis, a significant cause of morbidity and mortality where this infection is endemic. Genomic differences among strains of B. pseudomallei are predicted to be one of the major causes of the diverse clinical manifestations observed among patients with melioidosis. The purpose of this study was to examine the role of genomic islands (GIs) as sources of genomic diversity in this species.
We found that genomic islands (GIs) vary greatly among B. pseudomallei strains. We identified 71 distinct GIs from the genome sequences of five reference strains of B. pseudomallei: K96243, 1710b, 1106a, MSHR668, and MSHR305. The genomic positions of these GIs are not random, as many of them are associated with tRNA gene loci. In particular, the 3' end sequences of tRNA genes are predicted to be involved in the integration of GIs. We propose the term "tRNA-mediated site-specific recombination" (tRNA-SSR) for this mechanism. In addition, we provide a GI nomenclature that is based upon integration hotspots identified here or previously described.
Our data suggest that acquisition of GIs is one of the major sources of genomic diversity within B. pseudomallei and the molecular mechanisms that facilitate horizontally-acquired GIs are common across multiple strains of B. pseudomallei. The differential presence of the 71 GIs across multiple strains demonstrates the importance of these mobile elements for shaping the genetic composition of individual strains and populations within this bacterial species.
Burkholderia pseudomallei is the etiologic agent of melioidosis. Many disease manifestations are associated with melioidosis, and the mechanisms causing this variation are unknown; genomic differences among strains offer one explanation. We compared the genome sequences of two strains of B. pseudomallei: the original reference strain K96243 from Thailand and strain MSHR305 from Australia. We identified a variable homologous region between the two strains. This region was previously identified in comparisons of the genome of B. pseudomallei strain K96243 with the genome of strain E264 from the closely related B. thailandensis. In that comparison, K96243 was shown to possess a horizontally acquired Yersinia-like fimbrial (YLF) gene cluster. Here, we show that the homologous genomic region in B. pseudomallei strain 305 is similar to that previously identified in B. thailandensis strain E264. We have named this region in B. pseudomallei strain 305 the B. thailandensis-like flagellum and chemotaxis (BTFC) gene cluster. We screened for these different genomic components across additional genome sequences and 571 B. pseudomallei DNA extracts obtained from regions of endemicity. These alternate genomic states define two distinct groups within B. pseudomallei: all strains contained either the BTFC gene cluster (group BTFC) or the YLF gene cluster (group YLF). These two groups have distinct geographic distributions: group BTFC is dominant in Australia, and group YLF is dominant in Thailand and elsewhere. In addition, clinical isolates are more likely to belong to group YLF, whereas environmental isolates are more likely to belong to group BTFC. These groups should be further characterized in an animal model.
Francisella tularensis is the causative agent of tularemia, which is a highly lethal disease from nature and potentially from a biological weapon. This species contains four recognized subspecies including the North American endemic F. tularensis subsp. tularensis (type A), whose genetic diversity is correlated with its geographic distribution including a major population subdivision referred to as A.I and A.II. The biological significance of the A.I – A.II genetic differentiation is unknown, though there are suggestive ecological and epidemiological correlations. In order to understand the differentiation at the genomic level, we have determined the complete sequence of an A.II strain (WY96-3418) and compared it to the genome of Schu S4 from the A.I population. We find that this A.II genome is 1,898,476 bp in size with 1,820 genes, 1,303 of which code for proteins. While extensive genomic variation exists between “WY96” and Schu S4, there is only one whole gene difference. This one gene difference is a hypothetical protein of unknown function. In contrast, there are numerous SNPs (3,367), small indels (1,015), IS element differences (7) and large chromosomal rearrangements (31), including both inversions and translocations. The rearrangement borders are frequently associated with IS elements, which would facilitate intragenomic recombination events. The pathogenicity island duplicated regions (DR1 and DR2) are essentially identical in WY96 but vary relative to Schu S4 at 60 nucleotide positions. Other potential virulence-associated genes (231) varied at 559 nucleotide positions, including 357 non-synonymous changes. Molecular clock estimates for the divergence time between A.I and A.II genomes for different chromosomal regions ranged from 866 to 2131 years before present. This paper is the first complete genomic characterization of a member of the A.II clade of Francisella tularensis subsp. tularensis.
Yersinia pestis, the etiologic agent of plague, was responsible for several devastating epidemics throughout history and remains of global importance to public health and biodefense efforts. Y. pestis is widespread in the Western United States. Because Y. pestis was first introduced to this region just over 100 years ago, there has been little time for genetic diversity to accumulate. Recent studies based upon single nucleotide polymorphisms have begun to quantify the genetic diversity of Y. pestis in North America.
To examine the evolution of Y. pestis in North America, a gapped genome sequence of CA88-4125 was generated. Sequence comparison with another North American Y. pestis strain, CO92, identified seven regions of difference (six inversions, one rearrangement), differing IS element copy numbers, and several SNPs.
The relatively large number of inverted/rearranged segments suggests that North American Y. pestis strains may be undergoing inversion fixation at high rates over a short time span, contributing to higher-than-expected diversity in this region. These findings will hopefully encourage the scientific community to sequence additional Y. pestis strains from North America and abroad, leading to a greater understanding of the evolutionary history of this pathogen.
Yersinia pestis, the causative agent of plague, is responsible for some of the greatest epidemic scourges of mankind. It is widespread in the western United States, although it has only been present there for just over 100 years. As a result, there has been very little time for diversity to accumulate in this region. Much of the diversity that has been detected among North American isolates is at loci that mutate too quickly to accurately reconstruct large-scale phylogenetic patterns. Slowly-evolving but stable markers such as SNPs could be useful for this purpose, but are difficult to identify due to the monomorphic nature of North American isolates.
To identify SNPs that are polymorphic among North American populations of Y. pestis, a gapped genome sequence of Y. pestis strain FV-1 was generated. Sequence comparison of FV-1 with another North American strain, CO92, identified 19 new SNP loci that differ among North American isolates.
The 19 SNP loci identified in this study should facilitate additional studies of the genetic population structure of Y. pestis across North America.