HOMD contains various types of information on the human oral microorganisms including taxonomy, genomics and bibliography. The purpose of the HOMD web site is to provide interfaces to search, retrieve and navigate among these different types of information. HOMD also provides web-based software tools for data-mining and analysis. Different types of data for the same organism are linked together by unique HOT IDs, which appear in all the web interfaces and results of the analytic tools. The interfaces and tools are listed in a left-side menu on the HOMD home page as well as in a drop-down menu on top of every HOMD web page. Detailed descriptions for these interfaces and tools are provided in this section.
The Taxon Table provides a tabular view of all the human oral microbial taxa defined and curated by HOMD. The table consists of five columns of information: Oral Taxon ID (HOT), Genus, Species, Taxon Link and Genome Link. The list can be sorted by the first three columns. The taxon and genome links lead to the detailed taxon description and genome description if available. The table page contains a search box for all the taxa. An alphabetic index is located on top of the table for quick access to a specific taxon. The table displays the total number of taxa in the database or the number of search results. Users can choose to display all results on one page or 100, 50 or 20 taxa per page. Options are also available to show only named species, unnamed cultivated species or uncultured phylotypes.
The human oral microbial taxa are also arranged in the taxonomic hierarchy, i.e. from domain, phylum, class, order, family and genus to species level. The hierarchical tree is fully expanded by default and can be dynamically collapsed at any given level. The link at the species level brings users to the detailed Taxon Description page. The designation of each level is followed by two numbers enclosed in square brackets indicating the number of taxa and the number of genome sequences. For example, ‘Phylum Proteobacteria [107, 144]’ indicates that in the phylum Proteobacteria, 107 taxa were identified in the oral cavity and 144 strains have genome sequences available at HOMD. If a species has been sequenced by multiple groups, we provide each sequence when available for that particular species.
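The bracketed counts in a hierarchy label can be read programmatically. The following is a minimal sketch, assuming labels follow the ‘Rank Name [taxa, genomes]’ pattern of the example above; the function and class names are illustrative and are not part of HOMD.

```python
import re
from typing import NamedTuple


class LevelLabel(NamedTuple):
    rank: str      # taxonomic rank, e.g. "Phylum"
    name: str      # taxon name at that rank, e.g. "Proteobacteria"
    taxa: int      # number of oral taxa under this node
    genomes: int   # number of strains with genome sequences


# Assumed label layout: "<Rank> <Name> [<taxa>, <genomes>]"
_LABEL_RE = re.compile(r"^(\w+)\s+(.+?)\s*\[(\d+),\s*(\d+)\]$")


def parse_level_label(label: str) -> LevelLabel:
    """Split a hierarchy label into its rank, name and two counts."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    rank, name, taxa, genomes = match.groups()
    return LevelLabel(rank, name, int(taxa), int(genomes))


parsed = parse_level_label("Phylum Proteobacteria [107, 144]")
print(parsed.rank, parsed.name, parsed.taxa, parsed.genomes)
```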
The Taxonomic level page provides a list of taxa and the number of taxa at the next lower level for each of the 7 taxonomic levels: Domain, Phylum, Class, Order, Family, Genus and Species. For example, at the Genus level, it lists the 169 genera in HOMD and for each lists the number of species as well as an up pointer to the family for each genus.
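As an illustration of the Genus-level listing, the sketch below derives, for each genus, its parent family and its number of species from a flat list of lineage records. The record layout and function name are assumptions made for this example, and the three records are a tiny illustrative sample rather than HOMD data.

```python
from typing import Dict, Iterable, Set, Tuple

# Each record is (family, genus, species); this layout is hypothetical.
Lineage = Tuple[str, str, str]


def summarize_genera(records: Iterable[Lineage]) -> Dict[str, Tuple[str, int]]:
    """Map each genus to (parent family, number of distinct species)."""
    families: Dict[str, str] = {}
    species_seen: Dict[str, Set[str]] = {}
    for family, genus, species in records:
        families[genus] = family  # the "up pointer" to the family
        species_seen.setdefault(genus, set()).add(species)
    return {g: (families[g], len(species_seen[g])) for g in families}


sample = [
    ("Lachnospiraceae", "Eubacterium", "saburreum"),
    ("Porphyromonadaceae", "Porphyromonas", "gingivalis"),
    ("Porphyromonadaceae", "Porphyromonas", "endodontalis"),
]
for genus, (family, count) in sorted(summarize_genera(sample).items()):
    print(genus, family, count)
```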
The HOMD Taxon Description Page provides comprehensive information for a specific human oral microbial taxon. The information provided can be summarized in four categories: taxonomic hierarchy, biological characteristics, references and community comments. Throughout the page, clickable dynamic cross-links are provided for additional information. The taxon page can be edited and curated by designated curators once they are logged in. The page also allows input and comments from users in the research community. The information provided for each taxon is as follows.
The HOT ID is a unique numeric ID representing this particular taxon, so that the taxon can be referred to unambiguously from other sources in the scientific literature. The taxon can be accessed on the web with a simple uniform resource locator (URL) format: http://www.homd.org/taxon=NNN, where NNN is the HOT ID.
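A link to a taxon page can be assembled directly from the HOT ID. This is a minimal sketch of the URL pattern given above; the function name is illustrative, and whether HOMD expects the ID to be zero-padded is not stated in the text, so the number is inserted as given.

```python
def taxon_url(hot_id: int) -> str:
    """Build the taxon page URL for a HOT ID (pattern from the text above)."""
    if hot_id <= 0:
        raise ValueError("HOT ID must be a positive integer")
    # Assumption: the ID is written without zero padding.
    return f"http://www.homd.org/taxon={hot_id}"


print(taxon_url(96))
```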
A taxon can be either a validly named cultivated species, an unnamed cultivated species or an unnamed uncultured phylotype. This status is shown in this field and will be updated upon the change of actual status of the taxon.
Type strain/reference strain
If the taxon’s status is validly named cultivated species, the type strain is listed here; if the taxon is an unnamed isolate the strain information will be listed as reference strain. If no cultivated strain is available yet, the Reference Strain field will be listed as ‘None, not yet cultivated’.
The Taxon Description page lists the nomenclature of each taxonomic level from Domain to Species. The classification defined by HOMD may be different from the NCBI taxonomy. The NCBI taxonomy can be accessed using a dynamic link. The HOMD taxonomy is based on the analysis of where each taxon falls in phylogenetic trees generated using several treeing methods and including over 100 non-oral reference taxa identified by searching Greengenes. For example, in HOMD, an organism such as Eubacterium saburreum is placed in the family Lachnospiraceae (because that is where it falls phylogenetically), rather than in the family Eubacteriaceae (because its incorrect genus name ‘Eubacterium’ has not yet been revised). Synonyms of the taxon that are currently in use or were used before in the literature or publications are also provided.
16S rRNA gene sequence
Accession number and links to one or more 16S rRNA gene sequences for that taxon.
16S rRNA gene sequence alignment
HOMD provides the clone sequences preliminarily aligned to the reference sequence to which the clones belong. These alignments were generated automatically by computer search of GenBank and were not manually examined. The clone alignments are provided in Clustal format, with the reference sequence(s) used as the template for alignment on top. To view the alignment in color and for further adjustment, third-party alignment viewing software may be used, such as SeaView (http://pbil.univ-lyon1.fr/software/seaview.html) and BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Because some pairs of clone sequences may be nonoverlapping (i.e. 500-base sequences at opposite ends of the molecule), this file must be used with caution for tree construction.
A phylogenetic tree showing the position of this taxon among related HOMD taxa can be viewed or downloaded.
Prevalence by molecular cloning
The number of clones found for this taxon in an analysis of approximately 35 000 clones (F. E. Dewhirst et al., submitted for publication). Based on the number of clones found, the rank abundance of the taxon (out of 619) is given.
Lists previous names for the organism if validly named. Isolate or clone designations are given as synonyms when they have appeared in the literature as ‘names’ for the taxon, such as ‘BU063’ (21
For validly named species there is a link to the NCBI taxonomy. NCBI has no taxonomy for unnamed species, hence the reason HOMD was created.
The number of hits when the name (genus plus species) of this taxon is used in a PubMed search. HOMD automatically updates this hit number every 2 weeks. To get the most up-to-date search, simply click the ‘PubMed Link’ to pull up the search result live from the NCBI PubMed site. In general there are no results for unnamed taxa, hence the need for HOMD. When articles referencing these taxa (often through clone numbers) are found by HOMD curators or community members, they are manually added to the taxon description.
Similar search as above using NCBI Entrez ‘nucleotide’ as reference database.
Similar search as above using NCBI Entrez ‘protein’ as reference database.
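The text does not say how HOMD retrieves these hit counts. One plausible route, shown here purely as an assumption, is the NCBI E-utilities ESearch service, which returns only the hit count when called with rettype=count. The sketch builds the query URL without contacting NCBI; the function name is illustrative, and ‘nuccore’ is the E-utilities name for the nucleotide database.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(genus: str, species: str, db: str = "pubmed") -> str:
    """Build an ESearch URL that asks only for the number of hits.

    db may be "pubmed", "nuccore" (nucleotide) or "protein".
    """
    term = f"{genus} {species}"
    query = urlencode({"db": db, "term": term, "rettype": "count"})
    return f"{EUTILS_ESEARCH}?{query}"


for database in ("pubmed", "nuccore", "protein"):
    print(esearch_count_url("Porphyromonas", "gingivalis", database))
```

Fetching the URL and reading the count from the XML response is left out so that the example runs without network access.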
The number of genomes that have been sequenced is indicated here with a link to a detailed list of these genomes.
An expandable/collapsible view of a dynamically displayed taxonomy tree indicating the position of the taxon on the page.
Generic information regarding this taxon.
Conditions and media for growing strains of this taxon, if available.
Generic phenotypic description of the taxon if the taxon has cultivated member(s).
Prevalence and source
Describes the frequency and source of clones and isolates from different oral sites and states of health or disease when known.
Literature and publications referencing this taxon. These references are manually curated with up to 10 key references that may also include older references not indexed in PubMed.
Registered and logged-in users can provide feedback related to this taxon. Each comment requires the approval of the HOMD curators before it is shown to the public.
Curation and editorial information
When and by whom this page was created or last modified.
Identification of 16S rRNA gene sequence by BLAST search
One of the key features of HOMD is the comprehensive collection of 16S rRNA reference gene sequences for each of the 619 taxa defined based on version 10 of the sequences. Since a phylotype can include members with up to 1.5% sequence divergence (23 bases for a full 1500 base sequence), multiple reference sequences have been selected where we have sequences diverging by more than 10 bases within a taxon. Thus, at the time of writing, there are 747 reference sequences for the 619 taxa. This set of sequences (the HOMD 16S rRNA RefSeq version 10) is available for download and can be searched using the BLAST search tool provided. HOMD provides a customized BLAST search tool for identifying the closest match(es) to unknown 16S rRNA gene sequences. The HOMD 16S rRNA BLAST search tool allows submission of multiple sequences in a single upload. The search is piped to the back-end computer cluster servers (described above) and the results are presented back to the submitter in a tabular format. Results containing up to 20 top matches for each query sequence can be downloaded in text or Excel file formats. Original full BLAST results including the alignments can also be accessed from the result page. The match identity is presented as straight BLAST results and as an adjusted percent identity (API) calculated as:
API = 100 × M / (M + MM)
where M is the number of matched (identical) positions and MM the number of mismatched positions between the query and the reference sequence. This calculation excludes any gaps introduced during the alignment process of the BLAST search. We have found that this correction gives much better values for single primer sequence reads, where the sequence adjacent to the primer often includes indels. The top hits are ordered by their API, and sequences whose alignment covers less than 95% of the query sequence are excluded from ranking. The top four matched reference sequences are listed by this method, and the table shown on the web page contains links to the original BLAST results as well as to the taxon description pages for the reference sequences. The results for the 20 top matches can be downloaded in plain text or Microsoft Excel format.
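The ranking described above can be sketched as follows. The formula used here, API = 100 × M / (M + MM), is an assumption inferred from the definitions of M and MM and from the statement that gaps are excluded. The class, field and function names are illustrative, and the hits are made-up values rather than real BLAST output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    reference: str             # reference sequence identifier
    matches: int               # M: identical aligned positions
    mismatches: int            # MM: mismatched aligned positions
    aligned_query_length: int  # query bases covered by the alignment


def adjusted_percent_identity(matches: int, mismatches: int) -> float:
    """Assumed API: percent identity over non-gap positions only."""
    compared = matches + mismatches
    if compared == 0:
        raise ValueError("alignment has no comparable positions")
    return 100.0 * matches / compared


def rank_hits(hits: List[Hit], query_length: int,
              min_coverage: float = 0.95, top_n: int = 20) -> List[Hit]:
    """Drop hits covering < 95% of the query, then sort by API (descending)."""
    eligible = [h for h in hits
                if h.aligned_query_length >= min_coverage * query_length]
    eligible.sort(key=lambda h: adjusted_percent_identity(h.matches, h.mismatches),
                  reverse=True)
    return eligible[:top_n]


# Made-up hits for a 500-base query.
hits = [
    Hit("ref_a", matches=495, mismatches=5, aligned_query_length=500),
    Hit("ref_b", matches=480, mismatches=0, aligned_query_length=480),
    Hit("ref_c", matches=300, mismatches=0, aligned_query_length=300),
]
for hit in rank_hits(hits, query_length=500):
    print(hit.reference, adjusted_percent_identity(hit.matches, hit.mismatches))
```

In this example ref_c is dropped because its alignment covers only 60% of the query, even though its identity over the aligned region is perfect.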
Dissemination of taxonomic naming scheme
The HMP Data Analysis and Coordination Center (DACC; accessible at http://www.hmpdacc.org/) is using HOT numbers to designate the taxonomic identity of isolates from the oral cavity, with URLs cross-referenced to HOMD. These URLs will be embedded in the data provided by DACC so that users can find more comprehensive information for each individual genome. In our recent submission of approximately 35 000 16S rRNA clone sequences to GenBank, a hyperlink was provided in each sequence for cross-referencing back to the HOMD database.
Genomics tools overview
Complementary to the taxonomy information, HOMD also provides comprehensive information and tools for studying the genomes of human oral microbes. The HOMD genomics database serves as the curated repository for the molecular sequences of the human oral microbiome, including complete and partial genomic sequences, as well as the 16S rRNA sequences mentioned in the previous section. Genomic sequences available at HOMD can be fully assembled genomes, high coverage genomes or genome surveys. HOMD also keeps track of the status of ongoing genome sequencing projects for human oral microorganisms. A sequence meta information page is created to hold relevant genomics and sequence meta information if a sequencing project for a human oral microbe is announced and available in the NCBI Genome Project Database. The genome project status is updated frequently based on information collected from the NCBI Genome Projects Database with an automatic query script. Once genomic sequences are publicly released, they are dynamically annotated by HOMD (dynamic annotation). Annotation done by other data centers, if available, is termed ‘static annotation’ and is viewable in a separate panel in the Genome Viewer (described below). Relevant tools are provided for viewing and searching the annotation. These tools were first developed as part of the Bioinformatics Resource for Oral Pathogens (BROP: http://www.brop.org). The programs and the data-mining schemes used in HOMD are designed for both finished and unfinished (collections of multiple contigs) genome sequences. The tools are integrated with the HOMD web site and are conveniently accessible by users. Icons or links to available tools pertaining to a specific genome are automatically presented on the relevant page. Important genomic data and bioinformatics tools provided by HOMD are described below. Additional information is also available in the previous publication (20
HOMD organizes genomes in three viewing options: Taxa with Annotated Genomes, Taxa with Genomes in Progress and View All Genomes. The first option lists the oral taxa with annotated (static or dynamic) genomic information and provides links to all the genomes available for each taxon. The View Genome button links to the Genome Table showing all the available genomes of a specific taxon. The Genome Table shows the Oral Taxon ID (HOT), the Genus and Species names, Strain Culture Collection, HOMD Sequence ID (SEQ ID), number of contigs and singlets, combined sequence length and links to available tools and information. The second option (Taxa with Genomes in Progress) lists those oral taxa for which a genomic sequencing project is still in progress and no sequences are yet available. The third option shows all the genomes in alphabetical order and provides searching and sorting functions for easier navigation. Each genome listed has a link to the Sequence Meta Information page described next.
Sequence Meta Information
The Sequence Meta Information page provides detailed biological, molecular biological, genetic, genomic and taxonomic as well as annotation information for a particular strain that has been, is being or will be sequenced. Information on these pages is semi-automatically updated. Updated information from both Genomes Online and the NCBI Genomic Project database is retrieved frequently and compared with the existing database automatically. New or modified Genomic Project Information is then added to the Sequence Meta Information pages with confirmation by curators. The Sequence Meta Information pages contain the following human curated information related to the target organism: Oral Taxon ID, HOMD Sequence ID (SEQ ID), Organism Name (Genus, Species), Culture Collection Entry Number, Isolate Origin, Sequencing Status, NCBI Genome Project ID, NCBI Taxonomy ID, Genomes Online Goldstamp ID, NCBI Genome Survey Sequence Accession ID, JCVI (previously TIGR) CMR ID, Sequencing Center, Number of Contigs and Singlets, Combined Length (kbp), GC Percentage, DNA Molecular Summary, ORF Annotation Summary and 16S rRNA Gene Sequence. In addition, original external information from sources such as the NCBI Genome Project Database, NCBI Taxonomy Database, Genomes Online Database and rRNA in the NCBI Nucleotide Database, if available, is parsed into separate tables below the Sequence Meta Information for convenient referencing.
Full and high coverage genomes
Full genomes are the oral microbial genomes that have been fully assembled, while the high coverage genomes are not fully assembled but represent most of the genome. Both types of genomes are annotated and deposited in public databases such as GenBank. HOMD aims to provide more frequently updated genomic annotation for bacterial oral isolates (see below). In addition, HOMD provides graphical genomic viewing for static annotations done by other public data centers such as NCBI or JCVI.
One of the original major goals of the NIH funded project ‘A Foundation for the Oral Microbiome and Metagenome’ was to partially sequence up to 100 representative human oral microbial species. A total of 12 low-coverage partial genomic sequences were sequenced and deposited in NCBI and active annotation is being maintained by HOMD (8). Since the launch of the HMP, the HOMD team has been providing the genomic DNA of human oral microbes to the four HMP sequencing centers for high coverage rather than survey sequencing (8
Dynamic annotation of genomic sequences
One of the major features of the HOMD Genomic Database is its automated, frequently updated annotation pipeline for genomes of oral isolates. Although the amount of sequence data is still growing rapidly, the computational power needed for bioinformatic analysis is catching up, and the cost and energy consumption per CPU are decreasing due to the availability of multi-core CPU formats. The lower cost of computational power has made it feasible for us to set up a computational cluster dedicated to the annotation of human oral microbial genomes. As described in the Hardware section, HOMD recruited a cluster of multi-core multi-node computer servers to frequently update the annotation. Current HOMD genome annotation algorithms include: (i) BLASTP (http://www.ncbi.nih.gov/BLAST/) search against weekly updated NCBI non-redundant protein data (ftp://ftp.ncbi.nih.gov/blast/db); (ii) BLASTP search against Swiss-Prot protein data (http://us.expasy.org/sprot/) and (iii) InterProScan search (http://www.ebi.ac.uk/Tools/InterProScan/) against ScanRegExp, BlastProDom, ProfileScan, HMMPfam, superfamily, HMMTigr, Seg, Coil, HMMPIR, FPrintScan and HMMSmart databases (http://www.ebi.ac.uk/interpro/databases.html). To provide data on the functional potential of genomes, BLASTP search results against Swiss-Prot are further processed for the construction of KEGG metabolic pathways and Gene Ontology trees, because the well-annotated Swiss-Prot protein sequence descriptions contain links to ENZYME (25) and Gene Ontology (26). The dynamic genome annotation is repeated continuously based on NCBI's weekly update of the non-redundant protein database. Additional genomes are being added to the annotation pipeline as more sequences are made available by other public sequencing projects such as the HMP (http://www.hmpdacc.org). A live update status of the genome annotation is provided on the HOMD home page indicating the latest genome annotated or updated. HOMD aims to maintain frequent and dynamic computer annotation for the genomic sequence of at least one isolate from each oral taxon whenever sequences are made publicly available, as well as static annotation of all annotated releases.
Genome Explorer is the centralized web interface that inter-connects all the genomics resources in HOMD. The front end of Genome Explorer is a user-friendly interface that allows investigators to navigate among all the genomics information provided at HOMD. HOMD Genomics Tools can be accessed either by selecting the tool or the genome first. If the user chooses the desired tool first, the user is then directed to the Genome Explorer interface for selecting genomes. Once a target genome is chosen, the interface dynamically presents all the tools, including linked external databases, available for the selected genome. Currently available tools include Genome Viewer, Dynamic Annotation, BLAST, Annotator, EMBOSS, KEGG pathways (27), Gene Ontology Tree (28), Genomewide ORF Alignment and Sequence Download. The back-end of Genome Explorer is a searchable annotation database that integrates all the results generated from the dynamic annotation pipeline mentioned. The search result is presented in a paginated and sortable table that also provides web links to (i) a summary page for the individual ORF, (ii) Genome Viewer to show the exact location of the target ORF in the genome and (iii) the original BLAST or InterProScan results. The summary page provides all the information and tools available for a specific ORF, including all the data-mining results mentioned above, as well as convenient links to other web tools for performing new searches and analyses. In short, Genome Explorer is a one-stop interface for all the genomic information available for each target genome or gene.
Genome Viewer is a unique graphical genomic sequence viewer developed originally for the BROP project (20). The Genome Viewer was designed to alleviate the inconvenience encountered when comparing two different sets of annotations for the same genome. Genome Viewer provides a graphical, six-frame translational view of the same region of the genome with individual panels showing different sets of annotations. It has easy navigation features including zooming, centering and searching by gene ID. For example, the genome of Porphyromonas gingivalis W83 has been annotated by JCVI (TIGR), Los Alamos National Laboratory and NCBI separately. These different annotations can be viewed and compared side-by-side in the Genome Viewer (http://www.homd.org/index.php?name=
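The six reading frames that underlie a six-frame view can be enumerated with a few lines of code. This is a generic sketch of the concept, not the Genome Viewer implementation: it splits a DNA string into codons for the three forward frames and the three frames of the reverse complement, and leaves translation to amino acids aside. The frame labels (+1 to +3, -1 to -3) and the short test sequence are illustrative.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def six_frames(seq: str) -> Dict[str, List[str]]:
    """Split a sequence into complete codons for all six reading frames."""
    frames: Dict[str, List[str]] = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            usable = len(strand_seq) - offset
            end = offset + usable - usable % 3  # drop trailing partial codon
            frames[f"{strand}{offset + 1}"] = [
                strand_seq[i:i + 3] for i in range(offset, end, 3)
            ]
    return frames


for frame, codons in six_frames("ATGAAATAG").items():
    print(frame, " ".join(codons))
```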
The meta-database search engine searches across four major databases: Taxonomy, Genomes, User documents and Dynamic Genome Annotation. The Hit box shows the number of matches found in each database and provides links to the results. For the annotated genomes, counts of search hits are shown for each individual genome and are linked to the Genome Explorer, which shows the corresponding matches.
Batch database content download
HOMD provides batch database content download in both tab-delimited text (viewable in a web browser or downloadable as a file to the user’s computer) and Excel format for all the data in HOMD. These downloads can be accessed from the ‘Tools and Download’ top menu, leading to a page listing all the downloadable contents of HOMD. Currently the downloads include the following six primary categories: (i) Taxon tables; (ii) 16S rRNA gene sequences; (iii) Genomic sequences and dynamic annotation results; (iv) Genome meta information; (v) Database schema and (vi) Database table structures. The download page can be accessed by navigating from the top or side-panel menu, or through a direct URL: http://www.homd.org/download
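A tab-delimited download can be read with standard tools. The sketch below parses a small in-memory sample rather than a real HOMD file; the column headers follow the Taxon Table columns described earlier but are an assumption about the download layout, and the ID values are placeholders, not real HOT IDs.

```python
import csv
import io
from typing import Dict, List, TextIO

# Placeholder sample; the header names and IDs are assumptions for illustration.
SAMPLE = (
    "Oral Taxon ID\tGenus\tSpecies\n"
    "001\tPorphyromonas\tgingivalis\n"
    "002\tEubacterium\tsaburreum\n"
)


def load_taxon_table(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-delimited table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def rows_for_genus(rows: List[Dict[str, str]], genus: str) -> List[Dict[str, str]]:
    """Select the rows whose Genus column matches exactly."""
    return [row for row in rows if row["Genus"] == genus]


rows = load_taxon_table(io.StringIO(SAMPLE))
for row in rows_for_genus(rows, "Eubacterium"):
    print(row["Oral Taxon ID"], row["Genus"], row["Species"])
```

For a downloaded file, the same function can be given a file object opened with `open(path, newline="")`.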
The HOMD user’s guide (i.e. the help documentation) was designed to help users use the tools, navigate the information and interpret the results provided by HOMD. The user’s guide is accessible through the top navigation menu on every tool page and is dynamically linked to the relevant guide for each tool. For example, when users are viewing the Taxon Table page, the ‘How to use this page’ menu item shown in the top navigation menu will lead directly to the page that explains the use of the Taxon Table. Alternatively, users can browse the entire user documentation (along with the ‘general documentation’) through the ‘Table of content’ tab shown on top of each documentation page. Every document of HOMD can be searched either through the search box located at the bottom of the table of contents of the documentation page or through the Meta Database Search box located at the top-right part of the home page.
HOMD uses the AWStats system (http://awstats.sourceforge.net) to track the usage of the web site. AWStats provides comprehensive web usage statistics graphically. It summarizes hourly, daily, weekly and monthly usages and aggregates the statistics by geographic locations such as countries or individual hosts. Major search engines such as Google, Yahoo and MSN are filtered for better tracking of true user visits (i.e. non-machine visits).