For many studies, the information in HbVar needs to be combined with the wealth of information about features of the genomic DNA, such as gene structures, conservation among species, repetitive elements, recombination frequencies and many others. The latter information is stored in genome browsers such as those at UCSC (6), Ensembl (7) and MapViewer at NCBI (8). Genomic DNA features (annotations) and much data about interspecies conservation were recently organized into a relational database called GALA. This database allows a user to query across fields for different types of information from multiple locations. We have linked HbVar and GALA so that users can access information in both databases. The output from GALA can be examined in a variety of formats, including UCSC Genome Browser views, which facilitates some analyses.
One example of the new capacities is the generation of a mutation spectrum from the linked databases. A query to HbVar finds the 197 β-thalassemia mutations currently recorded (from the field ‘Type of Thalassemia’, as in Fig. 2a). After the user selects a ‘GALA query’ as the output, the system automatically opens an interface with GALA for choosing the desired output format. If the bar graph option under ‘Graphical displays’ is selected (Fig. 2a), the system generates a graph indicating how many of the query results (β-thalassemia mutations in this case) fall within each bin along a designated region; both the bin size (the number of nucleotides included in each vertical bar) and the region are specified by the user. Using a bin size of one nucleotide for optimal resolution, we see that most β-thalassemia mutations fall within the promoter, exons 1 and 2 and intron 1 (Fig. 2b). Most of the gene regions and surrounding DNA have mutation frequencies of 1 to 4; these background counts result from the large deletions that can cause β-thalassemia, each of which is counted in every bin it spans.
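The bar graph described above amounts to counting, for each bin, the mutations that overlap it. The following is a minimal Python sketch of that idea, not GALA's implementation; the function name `mutation_spectrum`, the 1-based inclusive coordinate convention and the toy coordinates are all assumptions made for illustration.

```python
def mutation_spectrum(mutations, region_start, region_end, bin_size=1):
    """Count how many mutations overlap each bin of a region.

    mutations: iterable of (start, end) pairs in 1-based inclusive
    coordinates (an assumed convention). A point mutation has
    start == end; a deletion spans several positions.
    Returns a list with one count per bin, ordered left to right.
    """
    if bin_size < 1 or region_end < region_start:
        raise ValueError("bin_size must be >= 1 and the region non-empty")

    n_bins = (region_end - region_start) // bin_size + 1
    counts = [0] * n_bins

    for start, end in mutations:
        # Clip the mutation to the region; skip it if nothing is left.
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo > hi:
            continue
        first_bin = (lo - region_start) // bin_size
        last_bin = (hi - region_start) // bin_size
        # A mutation is counted once in every bin it touches.
        for b in range(first_bin, last_bin + 1):
            counts[b] += 1
    return counts


# Toy data (hypothetical coordinates): two point mutations at the same
# position, one elsewhere, and one large deletion covering the region.
toy_mutations = [(102, 102), (102, 102), (105, 105), (98, 120)]
per_base = mutation_spectrum(toy_mutations, 100, 109, bin_size=1)
per_five = mutation_spectrum(toy_mutations, 100, 109, bin_size=5)
print(per_base)  # [1, 1, 3, 1, 1, 2, 1, 1, 1, 1]
print(per_five)  # [3, 2]
```

In the toy output, the single large deletion contributes a count of 1 to every bin, which mirrors the low background frequencies that the text attributes to large deletions.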
Figure 2. Linking HbVar and GALA databases and the UCSC Genome Browser to examine the spectrum of mutations that cause β-thalassemia. (a) The set of all β-thalassemia mutations collected from HbVar can be exported to GALA, which is used to generate ...
A more detailed view can be obtained by directing the output to the UCSC Genome Browser as a custom track, which GALA does automatically upon a user’s request. Figure 2c shows the point mutations in the HBB promoter that cause β-thalassemia, with additional tracks selected to show the human DNA sequence aligned with orthologous segments of mouse and rat. The template strand is shown, i.e. the one that is complementary to the mRNA within the exons. Nucleotides in the TATA and CAC (EKLF binding site) boxes have been mutated multiple times in different β-thalassemia mutations. These mutationally sensitive regions are highly conserved, especially the CAC box. Interestingly, other highly conserved regions, such as the CCAAT box, are not mutated in the known β-thalassemias. This intriguing observation is difficult to explain. It is unlikely that CCAAT box mutations are absent because they are too severe, since deletions of the entire gene have been found; these loss-of-function mutations are recessive, as expected. It is also unlikely that they have simply been missed, because finding multiple mutations of the same nucleotide in other parts of the promoter is consistent with the current collection of β-thalassemia mutations being quite comprehensive. An alternative hypothesis is that CCAAT box mutations have a dominant negative phenotype, which would remove them from the population soon after they occur.
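For readers who want to build a comparable custom track by hand, the sketch below writes a list of mutations as a UCSC custom track in BED format (a `track` line followed by tab-separated chrom, chromStart, chromEnd and name, with a 0-based start and an exclusive end). It is an illustration only and does not reproduce what GALA generates; the function name `to_custom_track`, the mutation names and the coordinates are hypothetical placeholders, not real HBB positions.

```python
def to_custom_track(mutations, chrom, track_name, description):
    """Format mutations as a UCSC custom track in BED format.

    mutations: iterable of (name, start, end) with 1-based inclusive
    coordinates (an assumed input convention). BED uses a 0-based start
    and an exclusive end, so only the start is shifted by one.
    """
    lines = [
        f'track name="{track_name}" description="{description}" visibility=pack'
    ]
    for name, start, end in sorted(mutations, key=lambda m: (m[1], m[2])):
        bed_name = name.replace(" ", "_")  # keep the name a single field
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{bed_name}")
    return "\n".join(lines) + "\n"


# Placeholder names and coordinates, for illustration only.
toy_mutations = [
    ("example mut 1", 1030, 1030),
    ("example mut 2", 973, 973),
]
track_text = to_custom_track(
    toy_mutations,
    chrom="chr11",
    track_name="beta_thal",
    description="Toy beta-thalassemia mutations",
)
print(track_text)
```

The resulting text can be pasted into the Genome Browser's custom track form; real use would require coordinates on the same genome assembly as the browser view.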
Examining a mutation spectrum illustrates the power of combining HbVar with the analysis and display capacity of other databases. Additional examples illustrate combinations of data from different databases. Starting with the 96 β-thalassemia substitution mutations retrieved from HbVar and collected as a simple query in GALA, we can use GALA to find those that lie in exons. We find that 51 of the β-thalassemia mutations caused by nucleotide substitutions intersect with the set of all exons (not shown). One may also want to find the nucleotide substitutions that occur in highly conserved regions. Again, using GALA to combine information from HbVar with alignment data reveals that, of the 96 nucleotide substitutions that cause β-thalassemia, 39 occur in highly conserved regions (defined as at least 70% identity in an ungapped alignment of at least 100 bp between human and mouse sequences). Users can easily access information about the mutations that fall in these categories.
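The two intersections described above can be sketched in a few lines of Python. This is a simplified reading of the stated criterion, applied to whole ungapped alignment blocks, and is not how GALA computes its conserved regions; the function names, the block representation `(human_start, human_seq, mouse_seq)` and the toy sequences are assumptions for illustration.

```python
def conserved_intervals(blocks, min_len=100, min_identity_pct=70):
    """Return human intervals of ungapped blocks that pass both thresholds.

    blocks: iterable of (human_start, human_seq, mouse_seq), where the two
    sequences are the aligned, gap-free rows of one block (assumed format).
    """
    intervals = []
    for start, human, mouse in blocks:
        if len(human) != len(mouse):
            raise ValueError("ungapped block rows must have equal length")
        matches = sum(h == m for h, m in zip(human.upper(), mouse.upper()))
        # Integer arithmetic avoids floating-point edge cases at exactly 70%.
        if len(human) >= min_len and matches * 100 >= min_identity_pct * len(human):
            intervals.append((start, start + len(human) - 1))
    return intervals


def count_in(positions, intervals):
    """Count positions that fall inside at least one (start, end) interval."""
    return sum(any(s <= p <= e for s, e in intervals) for p in positions)


# Toy alignment blocks: only the first is both long and similar enough.
toy_blocks = [
    (1, "A" * 100, "A" * 70 + "C" * 30),    # 100 bp, 70% identity: kept
    (201, "A" * 100, "A" * 69 + "C" * 31),  # 100 bp, 69% identity: dropped
    (301, "A" * 50, "A" * 50),              # 100% identity but only 50 bp
]
toy_exons = [(90, 160)]                      # hypothetical exon interval
toy_substitutions = [5, 100, 150, 210, 310]  # hypothetical positions

conserved = conserved_intervals(toy_blocks)
print(conserved)                               # [(1, 100)]
print(count_in(toy_substitutions, toy_exons))  # 2
print(count_in(toy_substitutions, conserved))  # 2
```

A sliding-window definition of conservation would give different intervals from this block-level test, so the counts reported in the text should not be expected to follow from this sketch.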