Although the capability of DNA to form a variety of non-canonical (non-B) structures has long been recognized, the overall significance of these alternate conformations in biology has only recently become accepted en masse. In order to provide access to genome-wide locations of these classes of predicted structures, we have developed non-B DB, a database integrating annotations and analysis of non-B DNA-forming sequence motifs. The database provides the most complete list of alternative DNA structure predictions available, including Z-DNA motifs, quadruplex-forming motifs, inverted repeats, mirror repeats and direct repeats and their associated subsets of cruciforms, triplex and slipped structures, respectively. The database also contains motifs predicted to form static DNA bends, short tandem repeats and homo(purine•pyrimidine) tracts that have been associated with disease. The database has been built using the latest releases of the human, chimp, dog, macaque and mouse genomes, so that the results can be compared directly with other data sources. In order to make the data interpretable in a genomic context, features such as genes, single-nucleotide polymorphisms and repetitive elements (SINE, LINE, etc.) have also been incorporated. The database is accessed through query pages that produce results with links to the UCSC browser and a GBrowse-based genomic viewer. It is freely accessible at http://nonb.abcc.ncifcrf.gov.
The ability of certain DNA sequences to adopt alternative conformations, in addition to the canonical Watson–Crick right-handed double helix, has long been recognized (1). Indeed, a large number of studies have documented the formation of alternative (non-B) DNA structures by biophysical methods, including X-ray crystallography (2–4), nuclear magnetic resonance (NMR) spectroscopy (5) and circular dichroism (6). Other methods, such as the detection of single-stranded bases upon non-B DNA structure formation by chemical and enzymatic probes and the relaxation of negative supercoiling by two-dimensional gel electrophoresis have played a major role in revealing the formation of non-B DNA conformations in biological systems (7–9).
Repetitive DNA motifs may fold into non-B DNA structures. Specifically, inverted repeats can adopt cruciform structures; runs of alternating purine–pyrimidine bases can switch from the right-handed B-DNA helix to the left-handed Z-DNA helix; homo(purine•pyrimidine) tracts with mirror repeat symmetry may fold into several types of intramolecular triplexes; four runs of three, four or five guanines, separated from one another by ~1–7 bases, can form highly stable, polymorphic quadruplex structures; and direct repeats can give rise to loops or hairpins through the misalignment of complementary strands, yielding what are known as slipped structures (10).
A number of bioinformatic searches have been conducted with the aim of identifying the biological relevance of putative non-B DNA structures in mammalian and other genomes (1). These studies support the notion that the secondary structure conformational domain, rather than the underlying sequence symmetry, often contributes to the control of diverse biological functions, including replication, transcription, immune response (11), recombination and antigenic variation in human pathogens (1,12). Concomitant to this notion, a number of studies have provided circumstantial evidence for the involvement of DNA secondary structures in inducing genetic instability, both in model systems (13–15) and in association with human genetic disease (16–20), including genomic regions that do not contain known genes, suggesting that deeper functional annotation across these regions is warranted. Therefore, the need has arisen to provide the scientific community with a tool that offers a systematic cataloguing of all predicted sequences currently known to potentially form alternative DNA conformations. The non-B DB database bridges this gap by providing a resource for searching, mapping and comparing non-B DNA-forming motifs among various mammalian species.
To date, several reports have detailed methods aimed at enumerating and evaluating predicted non-B DNA-forming elements from genomic sequences, including QuadBase (21), TTS (22), TRF (23) and others (documented at http://nonb.abcc.ncifcrf.gov/Resources/). These reports use various consensus-based scanning methods for identifying one specific class of predicted non-B DNA structure. In some cases, the identified motifs are screened for the presence of other overlapping functional motifs, such as Sp1 binding sites and CpG islands (24). In other cases, the resulting motifs can be searched by genomic position and scanned for the presence of other nearby non-B DNA predicted features [e.g. triplex sequences near quadruplexes (22)]. More recently, analyses that incorporate thermodynamic values into the overall scoring method (25–27) have been reported. Together, these resources provide an important, yet partial, view into the complexities of locating and characterizing the many different sequence motifs that have the potential of forming non-B DNA structures. Our database expands on these functionalities by including all classes of predicted non-B DNA-forming sequences and by using the latest genome assemblies of human, mouse and other mammalian species. The non-B DNA data are available with current genomic annotation data and polymorphism information. Importantly, non-B DB provides the capacity to visualize the data in a genomic context that is fully integrated with other genomic features, such as genes and single-nucleotide polymorphisms (SNPs). The same interface allows for the users to upload their own annotation data, which are displayed alongside the in-house data through the PolyBrowse and UCSC interfaces.
One of the main difficulties in developing and evaluating algorithms that predict the likely candidates for each class of non-B structures is the lack of large collections of experimental data that have validated their formation in vivo. Although most non-B DNA structures can be formed under in vitro conditions, the identification of such conformations in vivo and the elucidation of parameters that govern their B to non-B equilibria have presented formidable challenges. In addition, these equilibria are influenced by local superhelical density, the presence of nearby DNA unwinding element complexes (DUEs) (28), the transcriptional status, nucleosome assembly and other tissue/temporally regulated biological processes. In light of these considerations, we have taken the approach of using rather broad and general identification methods based exclusively on sequence features. Consequently, our current criteria are expected to produce both false-positive and false-negative hits, although the flexibility provided by the database makes subsequent filtering of the returned data straightforward.
We have previously reported the construction of a database containing information on mouse indel polymorphisms (30). Herein, we have extended that system to include motifs with the potential to form non-B DNA structures. A number of studies in vitro (31–34) and in vivo (29,35–38) have indicated that the structural transition from B to non-B DNA is assisted by unrestrained negative supercoiling. In mammalian cells, the global steady-state levels of negative supercoiling vary depending on chromosomal location (39), but are expected to increase transiently by processes, such as transcription, replication and repair, that entail separation of the complementary strands and thus affect nucleosome occupancy (29,38,40–42). However, because the kinetics of these processes may vary among cell types and various developmental stages, an assessment of the probability that a defined chromosomal sequence might exist in the non-B form is currently not available. Indeed, only limited overlap has been reported between the predicted Z-DNA formation based on in silico thermodynamic predictions and genomic loci bound to the Zα domain of ADAR1, which displays high specificity for Z-DNA (43). Thus, a combination of factors, including nucleosome occupancy, negative supercoiling, matrix attachment sites, replication, transcription and repair may underlie B to non-B equilibria in vivo. In the absence of such information, our search algorithms were based solely on sequence relationships derived from in vitro data.
The general approach involves running a scanning application for each specific predicted non-B DNA class against each chromosome (Table 1), including G-quadruplex motifs, alternating purine–pyrimidine sequences, mirror repeats, inverted repeats and direct repeats. Although the ‘Mirror Repeat’ class as a whole has not been reported to form specific non-B DNA structures, it is included in the database as it is used as a first step in the identification of triplex-forming motifs, i.e. the subset of mirror repeats with purine/pyrimidine content.
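The symmetry relationships behind these classes can be made concrete with a short sketch. It is illustrative only and is not the scanning application used to build the database: the function names are hypothetical, and the `min_purity` threshold is an assumption standing in for the actual criteria listed in Table 1.

```python
# Illustrative sketch: classifies two equal-length repeat arms by symmetry.
# The thresholds used by non-B DB (arm length, spacer length, purine or
# pyrimidine content) are defined in its Table 1 and are NOT reproduced here.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_repeat(left, right):
    """Return the set of symmetry classes satisfied by the two arms."""
    classes = set()
    if left == right:
        classes.add("direct")      # same sequence, same orientation
    if left == right[::-1]:
        classes.add("mirror")      # same strand read backwards
    if left == reverse_complement(right):
        classes.add("inverted")    # complementary strand read backwards
    return classes


def strand_purity(seq):
    """Fraction of the majority base type (purine or pyrimidine)."""
    if not seq:
        return 0.0
    purines = sum(base in "AG" for base in seq)
    return max(purines, len(seq) - purines) / len(seq)


def is_triplex_candidate(left, spacer, right, min_purity=1.0):
    """Mirror repeat whose whole span meets an assumed purity threshold."""
    is_mirror = "mirror" in classify_repeat(left, right)
    return is_mirror and strand_purity(left + spacer + right) >= min_purity
```

For example, `classify_repeat("AAGGAG", "GAGGAA")` reports a mirror repeat, whereas `classify_repeat("AAGGAG", "CTCCTT")` reports an inverted repeat. Only the former, being homopurine, passes `is_triplex_candidate` with an empty spacer.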
The output file in GFF format (http://nonb.abcc.ncifcrf.gov/FAQs/) is then loaded into a MySQL database. The data from all such scans are merged and can be queried and displayed using our local instance of GBrowse called PolyBrowse (44) at http://pbrowse3.abcc.ncifcrf.gov/cgi-bin/gb2/gbrowse/human_37 and several GFF-based query tools at http://nonb.abcc.ncifcrf.gov (Figure 1). Importantly, the result pages produced from the queries contain links that allow the user to switch to the genome browser view of that feature, as well as a view that provides the sequence and other annotations for each feature.
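The load-and-query step can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, the table layout and column names are hypothetical, and the two example records (source, feature types, coordinates and IDs) are invented for demonstration. The only part taken from the GFF format itself is the nine tab-separated columns.

```python
import sqlite3

# Column names are hypothetical; GFF itself only fixes the order of nine fields.
GFF_COLUMNS = ("seqid", "source", "feature_type", "start_pos", "end_pos",
               "score", "strand", "phase", "attributes")


def parse_gff_line(line):
    """Split one GFF record into a dict, converting coordinates to int."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start_pos"] = int(record["start_pos"])
    record["end_pos"] = int(record["end_pos"])
    return record


def load_features(lines):
    """Load GFF records into an in-memory SQLite table (stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature (seqid TEXT, source TEXT, feature_type TEXT, "
        "start_pos INTEGER, end_pos INTEGER, score TEXT, strand TEXT, "
        "phase TEXT, attributes TEXT)"
    )
    records = [parse_gff_line(line) for line in lines
               if line.strip() and not line.startswith("#")]
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [tuple(rec[col] for col in GFF_COLUMNS) for rec in records],
    )
    return conn


def features_in_region(conn, seqid, start, end):
    """Return features overlapping the 1-based inclusive interval."""
    cursor = conn.execute(
        "SELECT feature_type, start_pos, end_pos FROM feature "
        "WHERE seqid = ? AND end_pos >= ? AND start_pos <= ? "
        "ORDER BY start_pos",
        (seqid, start, end),
    )
    return cursor.fetchall()


# Invented example records, for demonstration only.
EXAMPLE_GFF = [
    "# hypothetical example records",
    "chr8\texample\tG_Quadruplex_Motif\t120\t145\t.\t+\t.\tID=example_g4_1",
    "chr8\texample\tZ_DNA_Motif\t300\t318\t.\t+\t.\tID=example_z_1",
]
```

Calling `features_in_region(load_features(EXAMPLE_GFF), "chr8", 100, 200)` returns only the first example record, since the second lies outside the requested window.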
These data represent the basis for the non-B DNA annotation information for each species. The scanning criteria do not allow for mismatches within the repeat segments; however, this feature may be added once information becomes available on the structural tolerances acceptable for each mismatch case. Very large palindromes (>100 kb), such as those that characterize the Y chromosome and whose recombination is known to lead to spermatogenic failure (45,46), are also not currently included. Nevertheless, some aspects related to the presence of mismatches are presented in the polymorphism analysis described below.
After scanning across different mammalian genomes, the numbers of each of the predicted classes of non-B DNA structure-forming motifs appear to be quite variable (Table 2).
As the overall base composition between different mammalian genomes is rather similar (data not shown), the observed differences in the numbers of predicted non-B DNA motifs could simply result from the altered arrangement of bases from one species to another. Alternatively, variations in the population of classes of repetitive elements (SINE, LINE, etc.) among species, or other unknown features, might also contribute to the observed differences. This interspecies variability appears to be uniformly distributed along the entire chromosomes, rather than concentrated in large repetitive clusters (data not shown). Whether these differences play any role or contribute to conferring species-specific differences remains to be investigated.
A caveat concerning the simple assessment and comparison of the number of non-B DNA-forming repeats among species relates to the criteria used and the counting method. For example, in the G-quadruplex forming sequences, the pattern of a run of 3Gs followed by 1–7 bases repeated four times can be extended, as long as more runs of Gs are encountered, resulting in a single cluster that has the potential to form many substructures. This circumstance needs to be considered when comparing between different reports or methods. Although our approach identifies this finding as a single cluster in the database, separate database tables are provided, in which all possible permutations of the sequence that satisfies the consensus sequence are reported.
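The counting issue described above can be illustrated with a regular expression. This is a sketch under assumptions rather than the scanning code used for the database: the pattern encodes "four or more runs of at least three Gs, separated by loops of 1–7 bases", it examines only the G-rich strand as written, and the window count is a deliberately simple proxy for the substructures a cluster could form.

```python
import re

# Assumed consensus: a G-run, then three or more further G-runs, each preceded
# by a loop of 1-7 bases. Extra runs extend the same match, so an extended
# tract is reported as a single cluster.
G4_CLUSTER = re.compile(r"G{3,}(?:[ACGT]{1,7}?G{3,}){3,}")


def find_g4_clusters(seq):
    """Return (start, end, sequence) for each cluster; 0-based, end-exclusive."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_CLUSTER.finditer(seq)]


def count_four_run_windows(cluster):
    """Count windows of four consecutive G-runs within one cluster.

    Simplification: runs are taken as maximal stretches of three or more Gs,
    so alternative run boundaries inside long G stretches are ignored.
    """
    runs = re.findall(r"G{3,}", cluster)
    return max(0, len(runs) - 3)
```

A tract with six G-runs, such as `"GGGA" * 5 + "GGG"`, is returned by `find_g4_clusters` as one cluster, while `count_four_run_windows` reports three overlapping four-run windows within it. A count based on clusters and a count based on windows therefore differ for the same sequence, which is why the counting method must be stated when comparing reports.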
In addition to the non-B DNA predicted motifs, the database contains other features of the DNA, such as phased A-tracts that impart static bends to the double-helix and may be involved in nucleosome assembly (47), simple tandem repeats (STR) including triplet repeats whose expansions cause a number of neuromuscular disorders (20) and poly(purine•pyrimidine) tracts, which are characterized by high stacking interactions (48). In addition, NCBI-derived features, such as genes, SNPs and RepeatMasker (http://www.repeatmasker.org/) elements are also included. This integrated information is critical not only for guiding the user visually, but also for enabling queries that combine ‘classes’, such as ‘exons’ containing predicted ‘Z-DNA’ forming sequences, etc.
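A combined query of this kind reduces to an interval containment test. The sketch below is a hypothetical illustration rather than the query code behind the database; it assumes that both feature lists refer to the same chromosome and use 1-based inclusive coordinates.

```python
def motifs_within(containers, motifs):
    """Pair each container interval with the motifs lying entirely inside it.

    Both arguments are lists of (start, end) tuples on the same chromosome,
    using 1-based inclusive coordinates. The nested loop is quadratic, which
    is adequate for a demonstration but not for genome-scale data.
    """
    hits = []
    for c_start, c_end in containers:
        for m_start, m_end in motifs:
            if c_start <= m_start and m_end <= c_end:
                hits.append(((c_start, c_end), (m_start, m_end)))
    return hits
```

With exons at `[(100, 200), (500, 650)]` and predicted Z-DNA motifs at `[(120, 140), (195, 210), (600, 612)]`, the function returns the first and third motifs paired with their exons; the second motif is excluded because it extends past the exon boundary.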
One of the main features of the non-B DB is the ability to compare different mammalian genomes for the presence of non-B DNA-forming motifs. This allows for conservation of the predicted elements to be evaluated visually. Figure 2 illustrates this capability by comparing the presence of G-quadruplex forming motifs in the region upstream of the MYC locus across the human, chimp, macaque, dog and mouse reference genomes. In order to view syntenic regions in other genomes, the liftOver application from the UCSC website was used to map 1-kb fragments along each chromosome to the corresponding other genomes. These mapped features are called liftOver1k. Areas where a syntenic match failed to be identified (i.e. that region was absent in the other genome, or mapped redundantly) do not show a link to that species. Other non-B DNA tracks available in PolyBrowse will be described in more detail elsewhere (Cer et al., manuscript in preparation).
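Preparing the 1-kb fragments for such a mapping might look like the following sketch. The function names and the fragment naming scheme are hypothetical; the output uses BED conventions (0-based start, end-exclusive), a format that liftOver accepts, and the mapping itself would still require the appropriate chain file for each pair of assemblies.

```python
def tile_chromosome(chrom, length, tile_size=1000):
    """Split a chromosome into consecutive tiles of at most tile_size bases.

    Returns (chrom, start, end, name) tuples in BED coordinates
    (0-based start, end-exclusive). The final tile may be shorter.
    """
    tiles = []
    for start in range(0, length, tile_size):
        end = min(start + tile_size, length)
        tiles.append((chrom, start, end, f"{chrom}:{start}-{end}"))
    return tiles


def to_bed(tiles):
    """Format tiles as four-column, tab-separated BED text."""
    return "".join(f"{chrom}\t{start}\t{end}\t{name}\n"
                   for chrom, start, end, name in tiles)
```

Naming each fragment after its source coordinates keeps the origin of a tile recoverable after mapping, so that fragments that fail to map, or that map to more than one location, can be identified and left without a link.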
The computed non-B DNA forming elements are likely to be under-represented in our reference genome as their underlying repeats may be polymorphic among individuals. Because this type of information may be critical in the context of gene regulation or predisposition to disease (48), we used a specific parser to scan both the reference human genome and additional sequence sources, such as trace reads from the trace archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi) and contigs (http://www.ncbi.nlm.nih.gov/projects/WGS/WGSprojectlist.cgi) from personal genome projects (49), for matches to the non-B DNA motifs. Each match found in either the reference or alternate source is then scored for being polymorphic or not. Sites identified as polymorphic are then evaluated a second time to determine whether the polymorphism would affect the motif underlying the putative non-B DNA structure. The results of this scan are incorporated into the database as a series of separate tracks (Figure 3B, trace GPlex tracks). Additional information can be gathered by extending this type of analysis to sequence alignments using closely related species. Currently, only the G-quadruplex forming motif supports this type of query.
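The two-stage scoring can be sketched for the G-quadruplex case. This is a simplified illustration and not the parser described above: it assumes that the reference and alternate sequences for a site have already been extracted and aligned to the same locus, it uses an assumed regular expression for the G-quadruplex consensus, and the category labels are hypothetical.

```python
import re

# Assumed G-quadruplex consensus: four or more runs of at least three Gs,
# separated by loops of 1-7 bases.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}?G{3,}){3,}")


def has_g4_motif(seq):
    """True if the sequence contains a match to the assumed consensus."""
    return G4_PATTERN.search(seq.upper()) is not None


def classify_site(reference_seq, alternate_seq):
    """Score one site in two stages.

    Stage 1: is the site polymorphic (do the two sequences differ)?
    Stage 2: if so, does the difference create or destroy the motif?
    """
    if reference_seq.upper() == alternate_seq.upper():
        return "not_polymorphic"
    ref_has, alt_has = has_g4_motif(reference_seq), has_g4_motif(alternate_seq)
    if ref_has and alt_has:
        return "polymorphic_motif_retained"
    if ref_has and not alt_has:
        return "polymorphic_motif_lost"
    if alt_has and not ref_has:
        return "polymorphic_motif_gained"
    return "polymorphic_no_motif"
```

For instance, a G-to-A change inside the second G-run of `GGGAGGGTGGGAGGG` leaves only three intact runs, so the site is classified as `polymorphic_motif_lost`. The `polymorphic_motif_gained` category corresponds to candidate sequences that are absent from the reference genome but present in an alternate source.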
In order to provide access to the back-end database, we have leveraged two existing tools from the bioinformatics community. The first is a BioPerl (50) set of methods, which is used to query genome databases in various ways, such as by position, by class, or by attribute. This same set of utilities is used within the context of PolyBrowse (44), so that visualization of the genomic features is made available. In addition to linking the outputs from the query tools to the browser for visualization, we also provide links allowing the returned data to be displayed in the familiar UCSC (http://genome.ucsc.edu/) genome browser (Figure 3C), as well as links to our bioDBnet database warehouse (51), which contains gene-centric information derived from several sources, and additional links.
Herein, we present a database containing the locations of motifs predicted to adopt the most common non-B DNA structures. The database can be used to browse specific genomic regions for the possible contribution of non-B DNA-forming elements to biological observations derived from the region. In addition to the locations of predicted motifs, the database also contains polymorphism information about each of the test sequences, as well as additional candidate sequences not present within the reference genomes. The database is accessible using both query pages and PolyBrowse. Additional genomes are in the process of being added to the system, and genomes will continue to be added and updated as they become available. Input from the community regarding the addition of other tracks, enhanced algorithms for the detection or scoring of the identified motifs or additional query tools is welcome and will be incorporated into the system as appropriate. Further additions, such as a community-based curation capability and the addition of other validation information through literature mining approaches, are also under consideration.
We anticipate that significant improvements to our methods will be made in the future by incorporating energetic and other secondary metrics into the current predictive algorithms. Although significant biological knowledge would be required, such as localized superhelical density, nucleosome positioning, etc. (see above), the overall goal is to associate a likelihood index with each of the predicted locations for each of the non-B DNA-forming classes. Finally, reliable methods are expected to be developed that identify non-B DNA structures in vivo on a genome-wide scale, along with some of the biological parameters involved; the resulting data sets can then be used to train the prediction tools, improving predictive capability for each class of non-B DNA-forming motifs.
Center for Biomedical Informatics and Information Technology (CBIIT)/Cancer Biomedical Informatics Grid (caBIG) ISRCE yellow task #09-260 to NCI-Frederick and National Cancer Institute/National Institutes of Health contract HHSN261200800001E (to A.B.). Funding for open access charge: National Cancer Institute/National Institutes of Health contract HHSN261200800001E.
Conflict of interest statement. None declared.
We thank Dr. Robert Wells for many useful suggestions and Dr. Karen Vasquez for assistance and sharing unpublished data. We also acknowledge the many valuable contributions from the participants at the FASEB Summer Conference on ‘Biological Impact of Alternative DNA Structures’ held at Steamboat Springs, CO in July 2010. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.