The purpose and design of the database
The HFV database aims to be a resource for scientists working on HFV genetics, evolution and variability, and on vaccine and drug design. The database is managed by biologists with extensive experience in sequence analysis, assisted by bioinformaticians and computer scientists. The backbone of the database is formed by the HFV sequences deposited in GenBank. New sequences are downloaded weekly, and the available ancillary information is extracted from the GenBank records. This information is taken not only from designated fields but also obtained by text mining, and may include country, sampling year, isolate names, host species, etc.
Many of the site's capabilities and tools are supported by reference sequences (one per species). These are mostly obtained from NCBI's RefSeq database (4), but in cases where RefSeq does not contain a good reference sequence (e.g. a new or unclassified virus), an automated Blast search selects an optimal reference sequence. Reference sequences and their annotation are used to direct the location of genes and coding regions, to standardize numbering of regions and epitopes, and to form the basis of all alignment models and frame corrections. The database also contains (synthetic) reference sequences for each genus; these are used for on-the-fly alignments, which are done on a per-genus (and segment) basis.
Searching the database
The information in the database can be accessed via two search interfaces. One is a versatile but user-friendly search interface that allows searches on some 30 different fields, and lets the user automatically exclude bad-quality sequences. The search results can be sorted and selected in various ways, and include an icon for each sequence that shows at a glance how long the sequence is and where in the genome it is located (Figure 1). This interface also offers access to a number of analysis and visualization methods.
Figure 1. Tabular results page from the regular search interface, including functions available for the search results (phylogenetic tree, geographical information, etc.), ancillary information about sample background, genome coverage and taxonomic relations of ...
A graphical overview showing which regions and which species are included in the entire set of retrieved sequences can be generated (Figure 2). An important feature is the ability to search by genomic region. It allows the user to locate all sequences in the database that span a region, and to opt to include or exclude sequences that are located in that region but do not cover it completely. Retrieved sequences and the associated annotation can be downloaded as an alignment, which will usually be codon-aligned so that it can be translated immediately. Alternatively, search results can be downloaded as translated amino acids in any reading frame. The retrieved background information can also be downloaded as a tab-delimited file.
Figure 2. Distribution of genomic sequence information over the flavivirus model genome, by viral species. For dengue, approximately 3500 complete genomes are available; the most densely sequenced region is nucleotides 800–2000, roughly corresponding to ...
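The region search described above can be illustrated with a minimal sketch. It assumes each stored sequence carries a start and end coordinate on the reference genome; the class name, field names, accessions and coordinates below are hypothetical and are not taken from the database schema.

```python
from dataclasses import dataclass


@dataclass
class StoredSequence:
    accession: str  # hypothetical identifier
    start: int      # first reference position covered (1-based, inclusive)
    end: int        # last reference position covered (1-based, inclusive)


def search_by_region(records, region_start, region_end, include_partial=False):
    """Return accessions of sequences that span the region.

    With include_partial=True, sequences that merely overlap the region
    without covering it completely are returned as well.
    """
    hits = []
    for rec in records:
        spans = rec.start <= region_start and rec.end >= region_end
        overlaps = rec.start <= region_end and rec.end >= region_start
        if spans or (include_partial and overlaps):
            hits.append(rec.accession)
    return hits


# Made-up example records
records = [
    StoredSequence("SEQ_A", 1, 10000),    # complete genome
    StoredSequence("SEQ_B", 900, 1500),   # fragment inside the region
    StoredSequence("SEQ_C", 3000, 4000),  # fragment outside the region
]

print(search_by_region(records, 800, 2000))
print(search_by_region(records, 800, 2000, include_partial=True))
```

With the default setting only the complete genome is returned for the region 800 to 2000; enabling `include_partial` adds the fragment that lies inside the region but does not cover it.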
The ‘advanced search interface’ dynamically reads the schema of the database and generates a graphical overview of the tables and fields. This overview can be used to generate a custom-made search interface containing user-selected fields. This interface offers read access to the entire database, but some of the ‘overhead’ that the regular interface performs automatically, such as including the proper foreign key fields, must be done by hand in the advanced interface.
All sequences are downloaded from GenBank as XML files, which are parsed and searched for additional information. The basis for further processing is currently the NCBI taxonomical classification. For each genus or sub-genus and each segment, an initial alignment was created based on all sequences in that category that were available in the NCBI RefSeq database. These alignments were used to create a model using HMMer2.0 (5). The resulting model was then used to align all sequences in that category.
The resulting aligned sequences are stored in the database, using a storage algorithm described previously (6) that keeps track of both the gaps inserted into the sequence relative to the model, and into the model relative to the sequence. By combining these two sets of gaps, the original aligned sequence can be reconstructed. This procedure is repeated for all sequences that the user wants to download. Finally, columns containing only gaps are removed, and the resulting alignment can be used for further analysis. A similar process is used to align a user sequence set to the provided reference alignments, for example to use in making phylogenetic trees and graphical SNP displays.
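The reconstruction and gap-column removal steps can be sketched as follows. This is a simplified illustration and not the storage algorithm of reference 6: it assumes a hypothetical encoding in which each sequence is stored ungapped together with a single list of (offset, length) gap runs, whereas the database tracks two sets of gaps.

```python
def reconstruct(raw, gaps, width):
    """Rebuild one aligned row from an ungapped sequence.

    gaps is a list of (offset, length) pairs: `length` gap characters are
    inserted before raw[offset]. The row is padded with gaps to `width`.
    This encoding is an assumption made for illustration only.
    """
    gap_before = dict(gaps)
    parts = []
    for i, residue in enumerate(raw):
        parts.append("-" * gap_before.get(i, 0))
        parts.append(residue)
    return "".join(parts).ljust(width, "-")


def drop_gap_only_columns(rows):
    """Remove alignment columns in which every row has a gap."""
    kept = [col for col in zip(*rows) if any(ch != "-" for ch in col)]
    if not kept:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


# Two made-up stored sequences expanded into a shared 8-column frame
rows = [
    reconstruct("ACGT", [(2, 2)], 8),
    reconstruct("ACTGT", [(3, 1)], 8),
]
print(rows)
print(drop_gap_only_columns(rows))
```

Storing the ungapped sequence plus gap runs keeps the stored data compact, and the all-gap columns only need to be stripped once the set of sequences requested for download is known.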
The database still contains some 5000 sequences (10% of the total) that are not classified and do not have a reference sequence. These sequences are stored and annotated as far as possible, and they can still be retrieved, but much of the additional functionality is not available for them.
Sequence orientation and alignment
The method used to internally align the sequences to a genus-level central alignment profile is based on each sequence's taxonomical classification. In practice, however, we found that many sequences are not classified, or are classified only to the family level. Those sequences are now provisionally assigned a reference sequence based on a simple Blast search that identifies the closest profile. A consequence of this method is that only sequences in, or resembling, the same genus can be aligned to one another. Nevertheless, aligning viral sequences at the genus level without a good model is not trivial, and the profile-based alignment is a substantial improvement.
Quite a few viral sequences in GenBank are in reverse-complement orientation, and in some cases the ‘correct’ orientation is not easy to determine from the sequence itself. These sequences are stored as-is, but the correct orientation of all sequences is automatically determined relative to the alignment model, and sequences are re-reversed upon download when needed, so that all downloaded sequences are in the same orientation. The search output displays the location as well as the orientation of each sequence relative to the species reference sequence. Some adjustments had to be made for the reference sequences, which can themselves be in the reverse orientation. The database deals with these sequences separately, and displays the orientation of both the sequence relative to the reference sequence and the reference sequence relative to the alignment profile.
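As an illustration of orientation handling, the sketch below decides whether a sequence should be reverse-complemented by counting k-mers shared with a reference in both orientations. This k-mer heuristic is a stand-in chosen for brevity; the database itself determines orientation relative to its alignment model. The function names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_like_reference(seq, reference, k=4):
    """Return (sequence in reference orientation, was_reversed).

    The orientation sharing more k-mers with the reference wins; ties
    keep the sequence as stored.
    """
    ref_kmers = kmers(reference.upper(), k)
    forward = len(kmers(seq.upper(), k) & ref_kmers)
    reverse = len(kmers(reverse_complement(seq).upper(), k) & ref_kmers)
    if reverse > forward:
        return reverse_complement(seq), True
    return seq, False


# Made-up reference and a fragment deposited in reverse-complement orientation
reference = "ATGGCGTACCTGAAGTTCGA"
deposited = reverse_complement(reference[3:15])

oriented, was_reversed = orient_like_reference(deposited, reference)
print(oriented, was_reversed)
```

Returning the flag alongside the sequence mirrors the behaviour described above, where the stored record is left untouched and the orientation is reported and corrected only on output.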
Quality control and annotation
Sequence annotation plays an increasingly important role in the analysis of the data. The HFV database has designed and implemented several methods to better handle the available annotation and allow it to be re-used for annotation of new sequences.
To harvest the annotation from existing reference sequences, a script has been developed that retrieves the features and values of each reference sequence for each ontological category, such as gene, coding sequence (CDS), mature peptide, etc. This annotation includes the start and stop location for each element. This information is used by many different tools, including Genome Mapper and HFVAlign. It will also be applied in a recently released tool that allows users to submit sequences to GenBank and annotate them using similarity to existing sequences and their annotation. In most cases, sequences and annotation are matched based on their taxonomical relationships, but in cases where taxonomical links are not available, a simple Blast search is used to find a matching reference sequence, if one exists.
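One way to reuse the start and stop locations of a reference feature on a new sequence is to project them through a pairwise alignment. The sketch below assumes the reference and the new sequence are available as two gapped rows of equal length; the function name and the example rows are hypothetical, and the actual annotation tool may work differently.

```python
def map_feature(ref_row, query_row, ref_start, ref_stop):
    """Project a reference feature onto an aligned query sequence.

    Coordinates are 1-based and inclusive, counted on the ungapped
    sequences. Returns (query_start, query_stop), or None if the query
    has no residues within the feature.
    """
    ref_pos = query_pos = 0
    q_start = q_stop = None
    for r, q in zip(ref_row, query_row):
        if r != "-":
            ref_pos += 1
        if q != "-":
            query_pos += 1
        if r != "-":
            inside = ref_start <= ref_pos <= ref_stop
        else:
            # Query insertion: counts only if strictly inside the feature
            inside = ref_start <= ref_pos < ref_stop
        if inside and q != "-":
            if q_start is None:
                q_start = query_pos
            q_stop = query_pos
    return None if q_start is None else (q_start, q_stop)


# Made-up aligned rows: the query lacks the first two reference bases,
# carries a three-base insertion, and has one deletion near the end.
ref_row = "ATGAAA---TTTTAA"
query_row = "--GAAACCCTTT-AA"

print(map_feature(ref_row, query_row, 4, 9))
```

In this example the reference feature at positions 4 to 9 maps to query positions 2 to 10, because the insertion inside the feature lengthens it in the query.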
Several forms of quality control and error checking are included in the database setup. The genomic location of each sequence is recorded upon storage, and sequences in reverse-complement orientation are flagged; they are re-reversed for use when appropriate. Sequences that are not assigned to a species and are too divergent to be aligned will be excluded from downloaded alignments. Patent and synthetic sequences, as well as those with >10% N’s, are also excluded by default, although users can choose to include them.
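The default exclusion rules can be expressed as a small filter. This is a minimal sketch assuming each record is a dictionary with hypothetical keys `sequence`, `is_patent` and `is_synthetic`; only the 10% threshold for N's comes from the text above.

```python
def fraction_n(seq):
    """Return the fraction of ambiguous 'N' bases in a sequence."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def passes_default_filter(record, max_n_fraction=0.10, include_excluded=False):
    """Apply the default quality filter to one record.

    Patent and synthetic sequences and sequences with more than
    max_n_fraction N's are rejected unless include_excluded is True.
    The record keys are assumptions made for this example.
    """
    if include_excluded:
        return True
    if record["is_patent"] or record["is_synthetic"]:
        return False
    return fraction_n(record["sequence"]) <= max_n_fraction


# Made-up records
clean = {"sequence": "ACGTNACGTA", "is_patent": False, "is_synthetic": False}
noisy = {"sequence": "ACGTNNCGTA", "is_patent": False, "is_synthetic": False}
patent = {"sequence": "ACGTACGTAC", "is_patent": True, "is_synthetic": False}

for rec in (clean, noisy, patent):
    print(passes_default_filter(rec), passes_default_filter(rec, include_excluded=True))
```

A sequence with exactly 10% N's is kept, since the rule excludes only sequences with more than 10%.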