Although the eight BRCs have common contract terms and goals, they have taken varied approaches to reaching these goals—with different data, analysis tools, and focuses. For example, BioHealthBase has a high level of diversity in the type of organisms under study, whereas ERIC is focused on closely related enteropathogens. Pathema and NMPDR support specific data types for numerous microbial genomes. Below, we present a brief summary of each BRC, in alphabetical order. The list of coinvestigators on each BRC is also provided to acknowledge the valuable contributions of key individuals to the project.
ApiDB is an integrated BRC for protozoan pathogens (Table ) in the phylum Apicomplexa, including the category B agents Cryptosporidium parvum and Toxoplasma gondii, and multiple (re)emerging infectious disease species within the genus Plasmodium (the causative agents of malaria). A federation of three databases (PlasmoDB, CryptoDB, and ToxoDB) in ApiDB permits users to simultaneously query across all three.
The Genomics Unified Schema relational database system (http://www.gusdb.org) has enabled the integration of many differing data sets, including genomic sequences for multiple species (and SNPs identified through resequencing), synteny maps, curated annotations and comments from the user community, automated analysis of gene/protein attributes, crystal structures/models, interactome data, ortholog/paralog information, and functional genomic data sets (from both expression profiling and proteomic studies, on several platforms). Tools available through ApiDB include an integrated genome browser (GBrowse), BLAST queries, motif identification, ortholog identification, and metabolic pathway maps. Users may also exploit powerful tools to combine results by using Boolean operators to identify gene sets that exhibit very precise characteristics, and these results may be downloaded for further analysis and archiving.
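The Boolean combination of query results described above can be illustrated with ordinary set operations. The following is a minimal sketch only, not ApiDB's implementation: the gene identifiers, the three result sets, and the `combine` helper are all hypothetical.

```python
# Hypothetical result sets from three separate queries; the gene IDs are
# placeholders and do not correspond to real database records.
expressed_in_stage = {"gene_A", "gene_B", "gene_C"}
has_signal_peptide = {"gene_B", "gene_C", "gene_D"}
has_human_ortholog = {"gene_C"}


def combine(a, op, b):
    """Combine two gene sets with a Boolean operator (AND, OR, or NOT)."""
    if op == "AND":
        return a & b   # genes present in both sets
    if op == "OR":
        return a | b   # genes present in either set
    if op == "NOT":
        return a - b   # genes in a that are absent from b
    raise ValueError(f"unknown operator: {op}")


# (expressed AND signal peptide) NOT human ortholog
step1 = combine(expressed_in_stage, "AND", has_signal_peptide)
candidates = combine(step1, "NOT", has_human_ortholog)
print(sorted(candidates))
```

Chaining the operator calls in this way mirrors how a user would narrow a gene list step by step before downloading it.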
The focus of the component databases is driven by community needs. Areas of particular interest currently include metabolic pathway modeling and reconstruction (CryptoDB), genetic diversity and identification of genes under positive or negative selection (PlasmoDB), and a new generation of photolithographic arrays suitable for both expression profiling and genotyping (ToxoDB).
The core client communities for these resources are relatively small and well integrated, and the active involvement of ApiDB staff with these communities (as independent researchers and through publications, meeting presentations, workshops, etc.) has resulted in daily use of ApiDB and its component databases by a majority of the ~1,000 research labs studying these important organisms worldwide.
David Roos of the University of Pennsylvania is the principal investigator (PI) for ApiDB; the coinvestigators are Christian J. Stoeckert of the University of Pennsylvania and Jessica C. Kissinger of the University of Georgia.
The Biodefense/Public Health Database (BioHealthBase; http://www.biohealthbase.org) is an integrated data repository and analysis portal supporting data on a broad range of pathogenic organisms, including Francisella tularensis, influenza virus, Mycobacterium tuberculosis, Giardia lamblia, and Ricinus communis.
BioHealthBase contains extensive genome sequence and protein function data for its designated pathogens. BioHealthBase aims to go beyond the traditional genomic and proteomic levels of experimental data to integrate data from a variety of different experimental technologies and to capture the interplay between pathogenic organisms and the human host. Building upon previous work of the biological and bioinformatic community has allowed BioHealthBase to quickly create a single cohesive resource for this wide variety of data to elucidate host-pathogen interactions.
One area of integration has been to combine information regarding the localization of immunological epitopes with sequence polymorphism analyses for influenza virus. Predicted major histocompatibility complex class I cytotoxic T-cell epitopes and validated immunological epitopes from the NIAID Immune Epitope Database for influenza virus are localized together with sequence conservation scores to indicate genomic regions under host selective pressure and protein motifs that can serve as prospective vaccine candidates. Future work will layer these data onto three-dimensional protein structure data to support an understanding of the relationships between protein structure, function, and host interactions.
Other collaborative efforts have been to initiate and expand annotations of metabolic and cellular signal transduction pathways for these pathogens and their human hosts. For example, the entire influenza virus life cycle framework and the host response pathways of Toll-like receptor 3 and RIG-I have been modeled and added to the Reactome database (6). Additionally, the pathways for transport reactions related to Mycobacterium tuberculosis H37Rv physiology have been added to the BioCyc pathway database (7).
Richard Scheuermann of the University of Texas Southwestern Medical Center is the PI of BioHealthBase; the coinvestigators include Adolfo Garcia-Sastre of Mount Sinai School of Medicine, Stephen Johnston of Arizona State University, Barbara Mann of the University of Virginia, Hilary Morrison of Marine Biological Laboratory, Louis Weiss of Albert Einstein College of Medicine, and Ellen Vitetta of the University of Texas Southwestern Medical Center.
ERIC focuses on four enteropathogens and the closely related organism Yersinia pestis (Table ), with meticulous, disciplined attention to the annotation and curation of genes and gene families (http://www.ericbrc.org).
ERIC's Web portal currently centers on its pathogen annotation system, named ASAP (3), originally developed in the laboratory of coinvestigator Nicole Perna at the University of Wisconsin-Madison. The system supports versioning of genomes, careful annotation of genes and other genome features (e.g., insertion elements) with evidence codes by six annotators/curators, and direct community annotation.
In addition, the system makes integrated use of the MAUVE application for comparative genomics (2), which, unlike most such applications, is not a pairwise genome aligner but can align multiple genomes simultaneously and display chromosomal rearrangements, deletions, and insertions. One can zoom into the actual sequence, which also makes MAUVE valuable for SNP location and prediction.
Other value-adding components include GBrowse, for genome viewing with a direct link to ERIC's gene annotations, and a microarray database analysis system, originally developed for the National Cancer Institute's intramural program, to handle sophisticated analysis of microarray data. Work under development includes the integration of these resources to allow queries across these data types, as well as advanced text mining of the literature using SRA International, Inc.'s NetOwl suite of tools, which promises to extract previously difficult-to-find semantic relationships from the literature for the BRC Program.
ERIC is now providing training seminars on the use of the annotation system, which will be expanded over time to include the use of other tools, such as the mAdb microarray application. John Greene of SRA International, Inc., is the PI for ERIC; the coinvestigators are Nicole Perna and Frederick Blattner, both of the University of Wisconsin—Madison.
NMPDR contains the complete genomes of nearly 50 strains of pathogenic bacteria that are the focus of its annotators, as well as more than 400 (as of February 2007) other genomes that provide a broad context for comparative analysis. The NMPDR focus organisms are food-borne or nosocomial bacterial pathogens in NIAID category B (Table ). NMPDR is both a central repository for a wide variety of scientific data on its core pathogens and a platform for software tools that support investigator-driven data analysis. Data resources include essential genes and candidate drug targets, which are provided for exploration and comparative analysis with tools such as Compare Regions and Functional Clusters.
The Signature Genes tool compares user-selected organisms to find the proteins in common to one group or those proteins that distinguish one group of organisms from another, e.g., virulent and avirulent strains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide consistent functional annotations. Subsystems are defined here as sets of functional roles related by any biologically meaningful organizing principle which are vertically curated across microbial genomes. Investigators can browse subsystems and reactions to develop accurate, detailed reconstructions of networks involved in metabolism or pathogenesis.
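The comparison performed by a tool of this kind can be sketched as set arithmetic over per-genome protein family memberships. This is an illustration under stated assumptions, not NMPDR's algorithm: the `signature_genes` function, the genome names, and the family labels are hypothetical, and real tools must first decide which proteins count as "the same" across genomes.

```python
def signature_genes(genomes, group_a, group_b):
    """Return (common, distinguishing) protein families for group_a.

    genomes maps a genome name to the set of protein families it encodes.
    common: families present in every genome of group_a.
    distinguishing: members of common that are absent from every genome of group_b.
    Both groups are assumed to be non-empty.
    """
    common = set.intersection(*(genomes[name] for name in group_a))
    present_in_b = set.union(*(genomes[name] for name in group_b))
    return common, common - present_in_b


# Hypothetical family memberships for three genomes.
genomes = {
    "virulent_1": {"famA", "famB", "famC"},
    "virulent_2": {"famA", "famB", "famD"},
    "avirulent_1": {"famA", "famD"},
}

common, distinguishing = signature_genes(
    genomes, ["virulent_1", "virulent_2"], ["avirulent_1"]
)
print(sorted(common), sorted(distinguishing))
```

Here the strict definition (present in all of one group, absent from all of the other) is an assumption; a production tool might tolerate missing or fragmentary gene calls.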
The Drug Targets project represents the first step toward providing the user community with a comprehensive selection of potential targets for therapeutic intervention, including vaccines and antitoxins. The candidate targets have been determined experimentally to be either essential factors in virulence or involved in antibiotic resistance or sensitivity. Candidates have a close bacterial ortholog with an experimentally determined structure and no close ortholog in humans. Selected candidates will be used for in silico screening against libraries of small molecules, using computational docking methods. NMPDR will provide the results of in silico screening against both broad- and narrow-spectrum targets on its website as screening progresses. All candidates, not only those selected for in silico analysis, are presented at the NMPDR website, with links to physical and kinetic data useful for designing in vitro screening protocols.
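The selection criteria stated above amount to a conjunction of three conditions, which can be written as a simple filter. The sketch below is hypothetical throughout: the `Candidate` record, its field names, and the example genes are illustrative and are not NMPDR's data model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    # Experimentally shown to be essential for virulence or involved in
    # antibiotic resistance or sensitivity.
    experimentally_supported: bool
    # Has a close bacterial ortholog with an experimentally determined structure.
    ortholog_with_structure: bool
    # Has a close ortholog in humans.
    close_human_ortholog: bool


def is_drug_target(c):
    """Apply the three criteria described in the text."""
    return (
        c.experimentally_supported
        and c.ortholog_with_structure
        and not c.close_human_ortholog
    )


# Hypothetical candidates.
candidates = [
    Candidate("gene_1", True, True, False),
    Candidate("gene_2", True, True, True),    # rejected: close human ortholog
    Candidate("gene_3", True, False, False),  # rejected: no structure available
]
selected = [c.gene for c in candidates if is_drug_target(c)]
print(selected)
```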
NMPDR's training and outreach effort includes presentations at scientific meetings and training sessions both at the University of Illinois and at other institutions, upon request. NMPDR trainers also present lessons or lectures in undergraduate courses in microbiology at the University of Illinois at Urbana, and teaching materials are available online.
Rick Stevens of the University of Chicago is the PI of NMPDR; the coinvestigators include Ross Overbeek of FIG and Leslie McNeil of the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign.
Pathema, developed at TIGR, is a website (http://pathema.tigr.org) composed of multiple databases and other computer resources that are meant to serve as an online focal point for the biodefense community (Table ).
As mentioned earlier, Pathema has developed and released resource guides for its category A organisms, namely, the botulinum neurotoxin and Bacillus anthracis (anthrax) resource guides. Each guide comprises 10 topic-oriented Web pages of curated links to resource providers and primary data, covering sequence, strain, structure, antibody, therapeutic, vaccine, inhibitor, protocol, reference, and related links. The C. botulinum guide has been reviewed by a number of botulinum researchers and will continue to be developed in close collaboration with the scientific community. In addition to updating existing pages and tables, Pathema investigators will continue to curate pathogenicity and virulence genes and proteins, identify toxigenic/nontoxigenic and encapsulated/unencapsulated strains from the literature, provide reviews of diagnostic methods, and offer links to detection systems. Future work will include the development of similar resource guides for Pathema's category B organisms.
The Genome Properties system (4) is a comparative genomic system which incorporates both computer-generated and human-curated assertions of biological processes and properties of sequenced genomes. These genome properties are defined such that assertions or calculations made across many genomes are as standardized as possible, using controlled vocabularies or numerical values with controlled units. Many genome properties represent metabolic pathways and other biological systems; others define genome metadata, including the presence and type of flagella, pili, or capsule as well as the cell shape of the organism. The Pathema interface displays each property and enables one to find correlations between genomes based on manually annotated properties (e.g., optimal growth temperature, oxygen requirement, or human pathogen), taxonomic classifications, calculated values (e.g., GC content or average protein length), and the presence or absence of complete or partial metabolic pathways.
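The two calculated values named above as examples, GC content and average protein length, are simple to compute. The following is a minimal sketch and not the Genome Properties code; the function names are hypothetical, and the choice to exclude ambiguous bases (such as N) from the GC denominator is an assumption.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a DNA sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def average_protein_length(proteins):
    """Mean length, in residues, of a list of protein sequences."""
    if not proteins:
        return 0.0
    return sum(len(p) for p in proteins) / len(proteins)


# Toy inputs for illustration only.
print(gc_content("ATGCGC"))
print(average_protein_length(["MKV", "MKVLA"]))
```

Expressing each value with a fixed definition and unit is what makes such properties comparable across many genomes.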
Pathema offers a service which will assemble the genomes of any category A to C organism and deposit a computer representation of that assembly into the GenBank Assembly Archive (10). Pathema provides this service for any genome for which sequence trace files and quality values are available; 24 genomes have been submitted to date. In all cases where SNPs or insertions-deletions were produced as a result of reassembly, these changes are faithfully reflected in the sequence found in the annotation division of GenBank.
Several SNP data sets were generated using a custom SNP clustering pipeline for the following groups of organisms: Burkholderia mallei, Burkholderia pseudomallei, Listeria monocytogenes, and B. anthracis. Users can view overall summaries, SNP positions for an individual organism, comparisons between strains, and phylogenetic trees based on SNP information.
TIGR offers a 3-day course in prokaryotic annotation and analysis to train investigators on how to evaluate the myriad microbial database resources and software tools that are available. The course familiarizes users with TIGR's prokaryotic annotation tools and the analysis of the prokaryotic data in Pathema. The Pathema website also displays extensive documentation describing manual annotation processes.
Owen White of TIGR is the PI for Pathema; the coinvestigator is Steven Salzberg of the University of Maryland, College Park.
The PathoSystems Resource Integration Center (PATRIC; http://patric.vbi.vt.edu) integrates genomic and associated data types for three genera of proteobacteria and five single-stranded RNA viruses (Table ).
PATRIC aims to provide a standardized curation for each pathosystem's genomes. Curation is based on existing annotations and the output of PATRIC's Genome Sequence Annotation Pipeline and Protein Annotation Pipeline (5). Importantly, PATRIC works with organism experts to ensure that curation efforts meet the expectations of the user base. These experts guide the selection of the reference genomes that receive thorough curation. Gene annotations are then propagated from reference genomes to orthologs in associated genomes to facilitate consistency of annotation between genomes.
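The propagation step described above can be sketched as a lookup from curated reference genes to their orthologs. This is an illustration only, not PATRIC's pipeline: the `propagate_annotations` function, the dictionary layout, and the gene identifiers are hypothetical, and ortholog assignment itself is assumed to have been done elsewhere.

```python
def propagate_annotations(reference_annotations, ortholog_map):
    """Copy each curated reference annotation to that gene's orthologs.

    reference_annotations: reference gene ID -> curated product name.
    ortholog_map: reference gene ID -> list of ortholog gene IDs in
        associated genomes.
    Returns ortholog gene ID -> {"product": ..., "source": reference gene ID},
    so the origin of every propagated annotation stays traceable.
    """
    propagated = {}
    for ref_gene, orthologs in ortholog_map.items():
        product = reference_annotations.get(ref_gene)
        if product is None:
            continue  # no curated annotation for this reference gene
        for gene in orthologs:
            propagated[gene] = {"product": product, "source": ref_gene}
    return propagated


# Hypothetical reference annotations and ortholog assignments.
reference_annotations = {"ref_0001": "DNA gyrase subunit A"}
ortholog_map = {
    "ref_0001": ["strainX_0042", "strainY_0107"],
    "ref_0002": ["strainX_0043"],  # uncurated reference gene: skipped
}
result = propagate_annotations(reference_annotations, ortholog_map)
print(result)
```

Recording the source gene alongside the propagated product is a design choice made here so that a later correction to the reference can be traced to every dependent record.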
Integration of gene expression and proteomic data types is another of PATRIC's fundamental goals that is currently in development. The data model being developed to link postsequence data generated by individual laboratories and NIAID programs, such as the PRCs, will be important for the BRC mission.
Close collaboration with organism communities helps the PATRIC project to understand how best to utilize PATRIC data. Bioinformatic research projects have been undertaken with members of the research community to identify gene sets of high importance for countermeasure development that may contain useful vaccine targets, to identify genes missed by previous annotation efforts, and to design pan-lyssavirus primer pools for virus identification and genotyping. Such gene sets are the focus of targeted curation efforts. Through this collaborative model, tools and workflows are being developed to help investigators put data to use.
Bruno Sobral of the Virginia Bioinformatics Institute (VBI) is the PI of PATRIC; the coinvestigators for PATRIC include Joao Setubal at VBI, Abdu Azad at University of Maryland, Baltimore County, and Susan Baker at the Loyola University Medical Center.
The Viral Bioinformatics Resource Center (VBRC) provides genomic data, analytical tools, and basic bioinformatic research on pathogens that are members of viral taxonomic families (listed in Table ). The VBRC (http://www.vbrc.org) is an extension of previous work to develop the Poxvirus Bioinformatics Resource Center (8).
The VBRC consists of a relational database, analytical tools, and Web interfaces that support the data storage, annotation, analysis, and information exchange goals of the BRC program. The database contains the complete genome sequences, along with comprehensive annotations of genes, for all of the human viral pathogens that are members of the VBRC taxonomic families. In addition, for comparative purposes, genomic information and annotations are included for all animal pathogens, as well as nonpathogenic viruses, that are also members of these viral families.
In addition to sequence data and computationally derived gene annotations, the VBRC provides literature-based manual curation for each of its viral genomes and gene records, resulting in a searchable, comprehensive minireview of gene function relating genotypes to biological phenotypes, with special emphasis on pathogenesis. The VBRC also includes a variety of analytical and visualization tools on its website to aid in the understanding of the available data, including tools for genome annotation, comparative analysis, whole-genome alignments, and phylogenetic analysis. Finally, an important aspect of the ongoing work is to solicit feedback from the scientific community, with the goals of enhancing and extending the VBRC, thereby making it both used and useful in support of basic and applied research on these viral pathogens.
Elliott Lefkowitz of the University of Alabama—Birmingham is the PI of VBRC; the coinvestigator for the project is Chris Upton at the University of Victoria, Canada.
VectorBase is responsible for annotating the genomes of a number of arthropod vectors of human pathogens. Its primary goal is the provision of a predicted gene set with appropriate tools for interrogating and disseminating these gene data. In terms of genome annotation, the VectorBase BRC has undertaken the reannotation of existing genomes for which it has assumed responsibility, as well as the annotation of new genomes in collaboration with the sequencing centers. Reannotation of Anopheles gambiae has improved the gene set by two approaches, namely, the manual annotation of chromosome 2L (one of the three chromosomes [about 40% of the genome]) and higher-quality automated prediction, with better discrimination of transposable elements, improved ab initio predictions, and noncoding RNA predictions.
The yellow fever vector mosquito Aedes aegypti genome annotation was recently incorporated into VectorBase. Aedes is estimated to have diverged from Anopheles approximately 150 million years ago, and comparative analysis between these genomes will improve predictions of further dipteran genomes, e.g., Culex pipiens quinquefasciatus.
The primary VectorBase Web presence consists of a genome browser powered by the Ensembl code base. This gives a rich, interlinked set of pages for genes and transcripts from which many annotations can be accessed. VectorBase provides orthologue assignments, Web links to the public sequence databases, microarray experiments, gene ontology terms, and protein domain features. The database can be interrogated by using a custom query tool or the BioMart system. Published microarray experiments are being incorporated into the database via the mapping of array probe sets onto the genomes of both mosquitoes. The array data are stored in a separate database (http://base.vectorbase.org) with Web links to the VectorBase gene pages and to features within the VectorBase genome browser.
Controlled vocabularies for mosquito anatomy have been developed (available through the Open Biomedical Ontologies website [http://obo.sourceforge.net]) and are used for annotating microarray experiments and for expanding the available gene ontology terms.
VectorBase has provided some training in the use of the Apollo annotation system to users who are actively involved in community-contributed manual annotation, and plans are under way to offer an expanded opportunity for both this training and more general training in the use of VectorBase.
Frank Collins of the University of Notre Dame is the PI of VectorBase; coinvestigators include William Gelbart of Harvard University; Ewan Birney of the European Bioinformatics Institute, United Kingdom; Kitsos Louis of the Institute of Molecular Biology and Biotechnology, Crete, Greece; and Fotis Kafatos of the Imperial College, United Kingdom.
The NIAID-funded BRC program is tasked with supporting genomic and related data for a variety of different pathogenic organisms. The BRCs import data from existing repositories, generate related data types, store the data in databases, analyze them, and provide for investigator access via user interfaces.
As these BRCs mature, a significant remaining challenge will be to develop user interfaces not only for experts but also for relative novices. This will also affect training on these systems: in addition to in-person workshops, manuals and online tutorials will need to be developed to reach the widest possible audience. For most of the BRCs, the basic infrastructure and analysis tools are in place, but significant work remains to enhance scientific usability and workflows through the systems.
Another challenge will be to develop easy-to-use, query-by-example methods to ask complex questions across genomic data types, integrating all this information. For example, one may ask the database to show all genes upregulated during infection as well as the annotations indicating which of these genes were already known to be involved in pathogenesis.
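To make the example question concrete, the sketch below shows the kind of cross-data-type join that such an interface would need to express on the user's behalf. It is not a query-by-example interface and does not reflect any BRC's schema: the table names, column names, fold-change threshold, and data are all hypothetical, and an in-memory SQLite database stands in for the real repositories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE expression (gene TEXT, sample_condition TEXT, log2_fold_change REAL);
    CREATE TABLE annotation (gene TEXT, term TEXT);
    INSERT INTO expression VALUES
        ('gene_1', 'infection', 2.5),
        ('gene_2', 'infection', -1.0),
        ('gene_3', 'infection', 1.8);
    INSERT INTO annotation VALUES
        ('gene_1', 'pathogenesis'),
        ('gene_2', 'pathogenesis');
    """
)

# Genes upregulated during infection (assumed threshold: log2 fold change >= 1),
# each flagged with whether it already carries a pathogenesis annotation.
rows = conn.execute(
    """
    SELECT e.gene,
           e.log2_fold_change,
           EXISTS (SELECT 1 FROM annotation a
                   WHERE a.gene = e.gene AND a.term = 'pathogenesis')
               AS known_pathogenesis
    FROM expression e
    WHERE e.sample_condition = 'infection' AND e.log2_fold_change >= 1.0
    ORDER BY e.gene
    """
).fetchall()

for gene, lfc, known in rows:
    print(gene, lfc, "known" if known else "novel")
```

The point of an easy-to-use interface is that an investigator could pose this question without writing the join by hand.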
Perhaps most important for the success of the BRC program is the active involvement by the various components of the research community interested in these organisms, including pathogen specialists, biodefense researchers, evolutionary biologists, experts in one model gene family or organism, and genomic researchers. Each of these groups has different interests and needs, which the BRCs will endeavor to meet.
However, for the BRC program to succeed there is an absolute need for scientists from all of these communities to actively provide feedback, request refinements and enhancements, contribute data and annotations, and most importantly, use the valuable BRC resources relevant to their research. Such input and active use are the only way for these new, powerful bioinformatic resources for pathogens to develop to better serve the research communities, to ensure improved data curation in the future, and to fulfill their mission of facilitating the identification and refinement of molecular targets to develop vaccines, therapeutics, diagnostics, and countermeasures.