The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Centers (BRCs) to provide scientists with genomics-centric resources for NIAID category A, B, and C priority microbial pathogens (a complete list of these priority pathogens is provided at the NIAID Biodefense and Related Programs website: http://www.niaid.nih.gov/topics/biodefenserelated/biodefense/research/pages/cata.aspx). Originally, NIAID funded eight BRCs to provide annotated genomic and related data on microbes causing emerging and re-emerging infectious diseases, including bacterial, viral, and eukaryotic pathogens, as well as invertebrate vectors of infectious-disease agents. The Pathosystems Resource Integration Center (PATRIC), one of the original eight BRCs, stored and integrated data on six different bacterial and viral pathogens (40). In 2009, NIAID reorganized the BRC program through a competitive renewal for four BRCs, each one with a discrete yet all-encompassing organismal focus: bacteria, viruses, eukaryotic pathogens, and invertebrate vectors (with one exception: the Influenza Research Database [IRD] specifically focuses on the influenza virus). PATRIC was awarded the bacterial BRC (http://www.patricbrc.org).
All bacteria with a focus on the NIAID priority watch list.
PATRIC integrates and annotates all genomic and associated data available from most of the major bacterial lineages, allowing comparative analysis of the NIAID priority infectious agents with closely related free-living, symbiotic, and commensal species (see “Annotation FAQs” at http://enews.patricbrc.org/faqs/, which links to all FAQ subjects). With an emphasis on consistency in comparative genomic analysis, PATRIC has standardized annotation of all available bacterial genomes using the RAST (rapid annotation using subsystems technology) system (5), a product of the Fellowship for Interpretation of Genomes (FIG) SEED team, which is a component of the PATRIC team. RAST, which predicts genes, assigns gene functions, and reconstructs metabolic pathways, is powered by a robust assembly of subsystems that have been curated based on evaluation of hundreds of prokaryotic genomes and the clustering of common protein families encoded within these genomes (FIGfams). As of 1 July 2010, PATRIC had annotated 2,865 bacterial genomes using RAST (note: the “All Bacteria” homepage at http://www.patricbrc.org/portal/portal/patric/Taxon?cType=taxon&cId=2 lists the current annotation statistics, including eight genome and protein sequence statistics and 43 genomic features). As the growing number of sequenced prokaryotic genomes is expected to keep improving the quality of SEED subsystems, PATRIC will continue to update RAST-based gene, protein, and protein family annotations and will provide historical information to track future amendments.
In addition to the RAST-based annotations, PATRIC preserves and provides the historical annotations present at GenBank (RefSeq), as well as the annotations created by the specialists at the previous BRCs, referred to as “Legacy BRC” on the PATRIC site. Importantly, the different annotation methods allow comparison of many genomes using all three approaches. However, given the breadth of coverage of bacterial genomes using RAST, the Legacy BRC annotations are generally the least complete source at PATRIC, because annotation efforts by the previous BRCs ended in 2009. As such, PATRIC houses 355 genomes with annotations from the former BRCs. From GenBank, annotations from 3,230 genomes are currently included, allowing comparison of different annotation schemes across most bacterial genomes. Finally, our cyberinfrastructure technology enables PATRIC to support additional annotations that specific communities implement for their focal organisms, such as curated MetaCyc data (12). Thus, it is anticipated that comparative approaches to genome annotation will continue in the near future.
Organisms, genomes, and comparative genomics.
The PATRIC website is primarily organism-centric, with various levels of genomic data and associated information related to each included organism. While the PATRIC homepage lists the 22 watch list genera for easy access to data associated with many pathogenic species, compilation and organization of all relevant data for “All Bacteria” are standardized according to bacterial (NCBI) taxonomy, with options for viewing sets of genomes within the hierarchical bacterial tree. Thus, specific “Overview” pages can be accessed for selected taxa within the bacterial tree (e.g., genus, family, order, or class). The “Overview” page contains genome (and associated data) information for all available genomic sequences (closed and incomplete) within a selected taxon and also lists the most recent PubMed articles pertinent to the study of the focal taxon. Each “Overview” page also contains six search tools (Genome Finder, Feature Finder, Comparative Pathway Tool, Protein Family Sorter [PFS], Gene Ontology [GO] Search, and Enzyme Commission [EC] Search) that allow quick directed searches without navigating further into the more detailed pages that house specific data for each organism. The “Genome List” page (Fig. 1, box 1) provides the compiled genomes (closed and incomplete, chromosomal and plasmid) for a given taxon, with statistics for all three different annotation methods and direct links to an interactive genome browser based on JBrowse (37). The “Taxonomy” page (Fig. 1, box 2) provides classification schemes that are listed at NCBI, with assigned NCBI taxonomic identifiers used to relate associated data for each organism across the website. The “Phylogeny” page (Fig. 1, box 3) illustrates precomputed trees generated for higher-level groups (typically at the order level), which are based on concatenated alignments of multiple conserved protein families (50). The methods used to estimate organism phylogenies are more detailed than those used for the trees generated from individual gene and protein alignments within other pages of the website (see “Phylogeny FAQs”).
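The idea of building an organism tree from concatenated alignments of several conserved protein families can be illustrated with a minimal Python sketch. It is not PATRIC's pipeline: the function name `concatenate_alignments`, the genome labels, and the toy sequences are all hypothetical, and padding missing families with gap characters is one common convention assumed here.

```python
from typing import Dict, List


def concatenate_alignments(
    family_alignments: List[Dict[str, str]],
    gap: str = "-",
) -> Dict[str, str]:
    """Join per-family alignments into one concatenated row per genome.

    Each input dict maps a genome label to its aligned sequence for one
    protein family. Genomes lacking a family are padded with gap characters
    (an assumption of this sketch) so that all rows keep the same length.
    """
    genomes = sorted({g for aln in family_alignments for g in aln})
    parts: Dict[str, List[str]] = {g: [] for g in genomes}
    for aln in family_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for g in genomes:
            parts[g].append(aln.get(g, gap * width))
    return {g: "".join(chunks) for g, chunks in parts.items()}


# Toy, hypothetical alignments for two conserved families.
fam1 = {"genomeA": "MKT-L", "genomeB": "MKTAL", "genomeC": "MRTAL"}
fam2 = {"genomeA": "GG-S", "genomeC": "GGAS"}

supermatrix = concatenate_alignments([fam1, fam2])
for name, row in supermatrix.items():
    print(name, row)
```

The concatenated rows would then be passed to a tree-building program; that step is outside the scope of this sketch.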
Fig. 1. Schema depicting major genomic and comparative genomic tools available from an organism “Overview” homepage. This example illustrates the Rickettsia genomes compiled at PATRIC.
Several pages encompass the majority of genomic data and present convenient platforms for comparative genomic analysis. The “Feature Table” (Fig. 1, box 4) tabulates information for each protein-encoding sequence (CDS), as well as noncoding RNAs, within a selected genome and can be viewed for each of the three different annotation methods. All columns contain user-defined sorting options, and selection of “Locus Tag” leads to specific pages for each CDS that list additional information, including links to NCBI (corresponding RefSeq locus tags), FASTA-formatted protein and nucleotide files, UniProt mapping data for proteins, and direct interaction with the genome browser tool. Recent implementation of a “Compare Region Viewer” allows synteny analysis across all genomes encoding a selected CDS (see Fig. S1 in the supplemental material). A video tutorial for navigating a typical “Feature Table” illustrates its functionality (see “Feature Table FAQs”). The “Protein Families” page (Fig. 1, box 5) lists the orthologous groups of proteins generated across a selected number of input genomes, with SEED-derived FIGfams used for clustering conserved families (31). A genome filter tool allows user-defined inclusion/exclusion of genomes, and the annotated FIGfams are provided with the number of included genomes (and sequences) and length range for sequences within the protein clusters. An interactive two-dimensional (2-D) heat map visualization tool is also provided to give a bird's-eye (pan-proteome) view of both protein distribution across multiple genomes and relative conservation of synteny. A demonstration of the full range of the PFS, as applied to a typical genomics-driven experimental design, is illustrated in the following section. Finally, the “Pathways” page (Fig. 1, box 6) lists the cellular function and metabolic pathways that are encoded within a selected taxon, integrating information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (33). Pathways are classified according to major biological roles (e.g., carbohydrate metabolism, translation, and biosynthesis of secondary metabolites) and are assigned identifications from a list of 137 unique cellular pathways. All pathways can be visualized for each of the three different annotation methods, and all annotation schemes can be simultaneously superimposed over pathway maps. For evaluation of pathway conservation across multiple genomes, components within KEGG maps (depicted by EC numbers) are color coded according to a spectrum depicting gene presence/absence across analyzed genomes.
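The data structure behind a 2-D protein family heat map is a family-by-genome matrix of member counts. The Python sketch below builds one from (genome, family) assignments and applies an optional genome filter. It is an illustration under stated assumptions, not the PFS implementation: the function `presence_absence`, the genome labels, and the family identifiers such as `FAM_A` are hypothetical.

```python
from collections import defaultdict


def presence_absence(assignments, genomes=None):
    """Build a family-by-genome count matrix from (genome, family) pairs.

    `assignments` is a list of (genome, family) tuples. `genomes` optionally
    restricts and orders the columns, mimicking a genome include/exclude
    filter. Returns the column order and a dict mapping each family to its
    row of member counts (0 means the family is absent from that genome).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for genome, family in assignments:
        counts[family][genome] += 1
    if genomes is None:
        genomes = sorted({g for g, _ in assignments})
    matrix = {
        fam: [counts[fam].get(g, 0) for g in genomes]
        for fam in sorted(counts)
    }
    return genomes, matrix


# Hypothetical protein-to-family assignments for three genomes.
assignments = [
    ("genome_1", "FAM_A"), ("genome_1", "FAM_B"),
    ("genome_2", "FAM_A"), ("genome_2", "FAM_A"),
    ("genome_3", "FAM_B"), ("genome_3", "FAM_C"),
]

genomes, matrix = presence_absence(assignments)
print("columns:", genomes)
for fam, row in matrix.items():
    # '#' marks presence and '.' marks absence, as a text stand-in for colors.
    print(fam, "".join("#" if n else "." for n in row))
```

Keeping counts rather than booleans preserves paralog information, which a viewer could map to color intensity.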
Application to comparative genomics: erythritol utilization in Brucella.
In conjunction with the tools mentioned above, PATRIC's compilation of all public bacterial genomes provides a powerful platform for comparative genomic analysis. Such in silico experiments often shed light on factors implicated in pathogenicity, including their evolutionary trajectories and functions across diverse bacterial lineages. We selected a previously identified virulence factor associated with brucellosis to illustrate this experimental design. Originally isolated from infected bovine fetal tissues (39), the four-carbon sugar erythritol is the preferred carbon and energy source of Brucella spp. Subsequent experiments showed that erythritol stimulated in vitro growth of B. abortus and enhanced infections caused by a second species, B. melitensis. It is thought that erythritol uptake is linked to spontaneous abortion, a complication of Brucella infection in some hosts. Animals with low placental concentrations of erythritol do not have the overwhelming infection that is seen in species with high concentrations (39). Seminal studies on the biochemical pathway for erythritol catabolism in B. abortus led to a genetic characterization of the genes involved in this metabolism (36).
Four genes in the Brucella ery operon encode proteins that have been characterized in erythritol catabolism: erythritol kinase (EryA), erythritol phosphate dehydrogenase (EryB), d-erythrulose 4-phosphate dehydrogenase (EryC), and erythritol transcriptional regulator (EryD) (36). The ery operon has also been found in closely related bacteria (including some nonpathogenic species), suggesting a broader biological utilization for this sugar source. For example, genes involved in erythritol transport were recently identified in the legume symbiont Rhizobium leguminosarum, in which the ery genes play a role in root nodule formation. Discovery of the transporter operon (eryEFG), found adjacent to the catabolic operon in R. leguminosarum, led to the identification and reannotation of genes adjacent to the ery operon in Brucella. A third adjacent operon (deoR-tpiA2-rpiB) was also identified by Yost et al. (55) as possibly being important in erythritol catabolism. As the experiments demonstrating importance of this operon in erythritol catabolism have not yet been published, this operon was excluded from the present analysis.
Among Brucella spp., Brucella ovis and a vaccine strain, Brucella abortus S19, are known for their inability to oxidize erythritol. Tsolis et al. (46) identified four genes in B. ovis (including eryG) with mutations rendering them pseudogenes. Additionally, Crasta et al. (13) identified a 703-bp deletion that interrupts the coding regions of eryC and eryD in B. abortus S19. With 41 Brucella genomes now sequenced, we wanted to examine the genes considered important in erythritol catabolism and identify similar and perhaps additional mutations that might exist in the newly available genomes. Given the presence of these genes in other bacteria, we extended our analysis to include all members of the order Rhizobiales, which, aside from Brucella, contains an interesting assortment of pathogens, symbionts, and free-living members (51).
In the examination of erythritol catabolism among Brucella spp., eight proteins were analyzed in detail, including a protein whose annotation recently changed from “hypothetical protein” to “hypothetical lipoprotein component of the erythritol ABC transporter.” Using the PFS suite of tools available at PATRIC, as well as the multiple-sequence alignment viewer tool, BLAST (1), and the Genome Browser tool (Fig. 2), we were able to identify mutations in seven of these eight proteins (see Fig. S2A in the supplemental material). Although all mutations found are listed, we stress that some mutations found in single genomes (e.g., those of B. ovis and B. abortus S19) do not have supporting experimental evidence and could be sequencing or assembly errors. However, more weight should be given to mutations shared by phylogenetically related genomes, because sequencing and assembly errors are less likely to be conserved across various genomes. With this in mind, we were able to identify some mutations that are phylogenetically shared. Brucella ceti strains M13/05/01 and M644/93/1, which are monophyletic within the B. ceti clade, share two single-base-pair deletions, resulting in premature stop codons that affect eryA and the hypothetical lipoprotein component of the erythritol ABC transporter. An additional shared single-base-pair deletion that affects all nine members of the B. ceti clade is found in eryF. Brucella ovis and Brucella sp. strain NVSL 07-2006 are members of the same clade, yet they share only one of the mutations known to occur in B. ovis, a single-base-pair deletion that results in an altered start site for eryG. As B. ovis has mutations that alter four proteins, it is difficult to say if this single shared deletion renders strain NVSL 07-2006 incapable of catabolizing erythritol.
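The reasoning that mutations shared across a clade deserve more weight than mutations seen in a single genome can be expressed as simple set operations. The following Python sketch is illustrative only: the strain labels and mutation labels are hypothetical placeholders rather than data from this study, and the helper names `shared_in_clade` and `singletons` are inventions of the sketch.

```python
from collections import Counter


def shared_in_clade(mutations_by_genome, clade):
    """Return the mutations present in every genome of the given clade."""
    sets = [set(mutations_by_genome.get(g, ())) for g in clade]
    return set.intersection(*sets) if sets else set()


def singletons(mutations_by_genome):
    """Return mutations observed in exactly one genome.

    These are the calls most likely to reflect sequencing or assembly
    errors when no experimental evidence supports them.
    """
    seen = Counter(
        m for muts in mutations_by_genome.values() for m in set(muts)
    )
    return {m for m, n in seen.items() if n == 1}


# Hypothetical mutation calls, labeled as gene:event.
calls = {
    "strain_1": {"geneA:del1", "geneF:del1"},
    "strain_2": {"geneA:del1", "geneF:del1", "geneB:snp7"},
    "strain_3": {"geneF:del1"},
}

clade_shared = shared_in_clade(calls, ["strain_1", "strain_2", "strain_3"])
pair_shared = shared_in_clade(calls, ["strain_1", "strain_2"])
unsupported = singletons(calls)
print(sorted(clade_shared), sorted(pair_shared), sorted(unsupported))
```

In practice the clade membership lists would come from the organism phylogeny, so that nested clades can be tested from the tips inward.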
Fig. 2. Experimental design for evaluating the conservation and distribution of erythritol catabolic and transport genes across 107 Rhizobiales genomes. Steps 1 to 4 illustrate the functionality of the PATRIC Protein Family Sorter (PFS) tool.
One interesting finding involves B. abortus strains S19 and NCTC 8038, for which phylogeny estimation suggests monophyly within the B. abortus clade (see Fig. S2B in the supplemental material). While these may be the same strain, these genome sequences were generated by different teams: S19 by the Virginia Bioinformatics Institute (13) and NCTC 8038 by the Broad Institute (http://www.broadinstitute.org/annotation/genome/brucella_group/MultiHome.html). Curiously, the 703-bp deletion affecting both eryC and eryD (see above) is not present in the NCTC 8038 genome, which has complete open reading frames for these genes. It is currently unknown if these sequences represent two different isolations within the B. abortus S19 strain. If so, then there appears to be some variability in the presence of this deletion among isolates of this important vaccine strain. The only mutation that S19 and NCTC 8038 share is a single-base-pair deletion that results in a truncated eryG.
Looking more broadly across the order Rhizobiales, all proteins putatively involved in erythritol catabolism and transport were identified and compiled using several PATRIC tools. With the PFS, a visual representation of the presence or absence of these proteins in a heat map view was created, with the bacterial families within the order and the operons of interest annotated (Fig. 3A). Analysis of these proteins showed that the ery catabolism operon is present across all members of several families, including Brucellaceae and Aurantimonadaceae, but it is only sporadically found in Rhizobiaceae genomes. This operon, and any associated transport genes, is completely missing from other families, including Bartonellaceae and Beijerinckiaceae. Using the 2-D heat map view, it is evident that some genomes within the Rhizobiaceae have all proteins within this operon annotated, while some are missing components. This genomic distribution has been described previously, as it has been suggested that the erythritol operon is used for root nodule formation by non-Brucella species. Our bioinformatics analysis presents a platform for testing the hypothesis that a complete ery operon and associated transporter genes are essential for root nodule formation.
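Sorting genomes into those with a complete operon, a partial operon, or none at all is the kind of tabulation that would precede a test of this hypothesis. A minimal Python sketch follows. The four protein names come from the text, whereas the genome labels, their contents, and the function `classify_operon` are hypothetical.

```python
# Catabolic proteins named in the text.
ERY_CATABOLIC = ("EryA", "EryB", "EryC", "EryD")


def classify_operon(present, required=ERY_CATABOLIC):
    """Classify one genome as 'complete', 'partial', or 'absent'.

    `present` is the set of annotated protein names found in the genome.
    """
    found = [p for p in required if p in present]
    if len(found) == len(required):
        return "complete"
    if found:
        return "partial"
    return "absent"


# Hypothetical annotation summaries for three genomes.
genome_proteins = {
    "genome_full": {"EryA", "EryB", "EryC", "EryD", "EryE"},
    "genome_partial": {"EryA", "EryD"},
    "genome_none": {"RpoB"},
}

status = {g: classify_operon(p) for g, p in genome_proteins.items()}
for genome, label in status.items():
    print(genome, label)
```

Passing a different `required` tuple would allow the same check for the transporter genes.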
Fig. 3. Phylogenomic analysis of erythritol catabolic and transport genes across 107 Rhizobiales genomes. These results summarize the comparative genomics experimental design, which primarily utilizes the PATRIC Protein Family Sorter (PFS) tool (Fig. 2).
An unexpected result of our analysis was the identification of a second set of genes putatively involved in erythritol transport. While the Brucellaceae and some of the genomes in other families have a type 2 erythritol ABC transporter, a genetically distinct system is encoded within the genomes of other families (Fig. 3A). In order to examine the evolutionary origin of the genes encoding these two divergent transporters, protein sequences from similarly named components (e.g., the permease component of either transporter 1 or 2) were assembled using the above-mentioned tools (see Fig. S2C in the supplemental material). Trees for all three components of the similarly named transporter proteins were generated (Fig. 3B). From this analysis, it is clear that in all three cases the Brucella proteins appear to be part of a broadly conserved ancestral family (type 2) and that a less conserved erythritol transport system (type 1) evolved from within this group. Because the transporter gene trees do not corroborate the Rhizobiales species tree (see Fig. S2B in the supplemental material), it is likely that horizontal transfer events have facilitated the dissemination of the type 1 erythritol transport system genes throughout Rhizobiales evolution. The biological relevance of diverse transport systems for erythritol and their possible correlations with pathogenicity (type 2) and symbiosis (type 1) remain to be elucidated.
Recurrent integration of community-derived associated data.
In addition to the acquisition, annotation, integration, and bioinformatics processing of genome-scale data sets, PATRIC provides “awareness” of community-derived research and information associated with each bacterial organism. Principally, these genome-associated data are organized into three categories: disease, experimental data, and literature (Fig. 4). All of this information is made available to the researcher in a recurring and contextualized manner, such that it is continually updated (contingent on PATRIC and corresponding website updates) and provided at useful locations throughout the website. Thus, this feature provides the infectious-disease research community with an invaluable integration of research data and metadata from a multitude of sources, enabling sophisticated and comprehensive analyses across any bacterial taxon of interest at a single website with consistent tools and interfaces.
Fig. 4. Schema depicting the integrated community-derived associated data available from an organism “Overview” homepage. Navigation from the Helicobacter “Genome List” (outlined in black) is illustrated.
For disease-related information (Fig. 4, box 1), a catalog of PubMed literature relevant to associated diseases is provided. Additionally, medical subject headings (MeSH) disease terms are listed, allowing direct access to the National Library of Medicine MeSH Descriptor Database (32). Candidate virulence factors can be evaluated based on a strategy that integrates data from the Virulence Factor Database (VFDB) (54). Briefly, virulence factors listed at the VFDB are compiled at PATRIC and used to identify all putative homologs present within other bacterial genomes. Information is also provided on human genes associated with each disease, including genetic and chemical evidence. Integrating data from the Genetic Association Database (8), the “Genetic Association Source” table lists human genes that have been shown to have some genetic association with a bacterial disease. Similarly, data from the Comparative Toxicogenomics Database (14) are integrated in the “Comparative Toxicogenomics Source,” which lists human genes associated with a bacterial disease that have been characterized via chemical treatment or exposure. Both the “Genetic Association Source” and the “Comparative Toxicogenomics Source” provide additional information about the human genes from NCBI as well as GeneCards, a comprehensive and authoritative compendium of annotative information pertaining to human genes (35). Finally, two additional tools round out the integrated information pertinent to bacterial diseases. The “Disease-Pathogen Visualization” page provides an interactive, graphical image of the relationships between pathogens, diseases, virulence genes, and disease-associated host genes. The “Disease Map” page provides a real-time global view of recent reports and outbreaks of bacterial diseases, with geolocation superimposed on an interactive global health map (11). An example of a PATRIC disease map shows the high activity index in Europe of reported Escherichia coli infections during the recent outbreak of the German enterohemorrhagic/verocytotoxin-producing E. coli (EHEC/VTEC) strain (see Fig. S3 in the supplemental material).
A major undertaking for PATRIC is to provide a summary of the wide range of experimental data found in a variety of databases for all bacteria (Fig. 4, box 2). This information, collectively referred to as postgenomic data, encompasses transcriptomic data primarily from microarrays (in addition to serial analysis of gene expression [SAGE] and RNA-Seq), proteomics data from mass spectrometry, protein-protein interaction data, and protein 3-D structure data (X-ray and nuclear magnetic resonance [NMR]). At the species and strain levels, these data are sometimes difficult to find at the associated databases. PATRIC recurrently searches select external databases using several keywords (e.g., organism name and NCBI taxonomic identifier) specific to each source and provides links to data that are continually updated at these repositories. Thus, PATRIC provides a summary of the number and types of data available at NCBI's GEO (Gene Expression Omnibus) (6), EBI's ArrayExpress (26), and the legacy NIAID-funded Proteomics Resource Centers (PRCs) (56). Mass spectrometry data are accessed from Peptidome (25), PRIDE (48), and the PRCs. Current knowledge on protein-protein interactions is also retrieved from the PRCs, as well as IntAct (4). Finally, PATRIC links to protein 3-D structure data from the NCBI and the Protein Data Bank (PDB) (10).
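A taxon-keyed search of an external repository can be sketched with the public NCBI E-utilities `esearch` endpoint. This Python example only builds the query URL and performs no network request. It is an assumption-laden illustration, not a description of how PATRIC harvests data: the helper `build_esearch_url` is hypothetical, and the choice of the `gds` (GEO DataSets) database and of the `txid…[Organism:exp]` query form are assumptions of the sketch.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, taxon_id, retmax=20):
    """Build an esearch URL that queries one database by NCBI taxon ID.

    The [Organism:exp] qualifier asks Entrez to include descendant taxa,
    so a genus-level identifier also matches its species and strains.
    """
    term = f"txid{taxon_id}[Organism:exp]"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# NCBI taxonomic identifier 234 is the genus Brucella.
url = build_esearch_url("gds", 234)
print(url)
```

Running such a query on a schedule and storing only the returned record counts and links would keep a summary current without mirroring the underlying data.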
A continual challenge for PATRIC is to provide the user with a robust and real-time list of literature and web text resources pertaining to each organism (Fig. 4, box 3). Relevant articles (and abstracts when available) from PubMed are listed chronologically, with direct links to PubMed provided. Literature compilations may be filtered by date and keyword for winnowing down large lists. A more direct way to reduce irrelevant results while increasing the recall of relevant documents is to use the text-mining tool, which implements technology developed in conjunction with the UK National Centre for Text Mining (NaCTeM), another component of the PATRIC team. This process displays search results based on indexes of UK Medline abstracts, identifying key entities from the search text (e.g., genes, proteins, metabolites, drugs, diseases, and symptoms). Results are summarized by entity type and allow progressive filtering. Abstracts are provided with key entities highlighted in different colors and contain direct links to PubMed.
Application to annotation driven by data integration: drug and vaccine targeting.
The computed proteomes of all PATRIC genomes provide rich data sets for large-scale computational analyses. One of PATRIC's major focal areas of research is the design and execution of experiments that integrate multiple levels of information from community databases for improving bacterial genome annotation (i.e., adding information beyond standard automated annotation). Importantly, while the data integrated from the community may pertain to selected high-profile pathogens, PATRIC's analysis pipelines work to propagate this information across all bacterial genomes when gene and protein homology supports such an approach. In theory, this strategy of refining functional gene and protein annotations will expand our knowledge of the factors directly involved at the interface between host and pathogen, e.g., virulence factor identification, antibiotic resistance and synthesis gene characterization, and drug and vaccine targeting. The following example illustrates this approach for the development of a drug targeting classification for all bacterial genomes.
With the list of drug and vaccine targets in the infectious-disease research community rapidly growing (53), we hypothesize that this information, combined with the comprehensive proteome of PATRIC genomes, may be utilized to propose novel antibacterial drug targets. The logic in our approach presumes that previously determined drug targets in some bacterial species might provide reasonable candidate targets for other species if structural and functional data are similar across bona fide and candidate targets. Aside from sequence-based criteria, we elected to incorporate information from protein 3-D structure into our experimental design, as there is a tendency for approved and pending bacterial drug targets to have associated structural data (NMR, cryo-electron microscopy, X-ray crystallography, etc.). We also considered the human genome in our analysis, distinguishing between drug targets with high similarity to human proteins and those with no significant human-encoded counterparts. The latter distinction is important, as selection of drug targets with some degree of similarity to human proteins would require more careful design to avoid effective targeting of both host and pathogen proteins.
To illustrate PATRIC's potential for large-scale drug target annotation, the workflow is divided into two processes. First, a data set was created containing significant matches between a set of position-specific scoring matrices (PSSMs) (23) from NCBI's Protein Clusters (28) and (i) protein sequences encoded within the human genome (47), (ii) proteins previously annotated as drug targets (29), and (iii) proteins with associated 3-D structure information (47) (see Fig. S4A in the supplemental material). A high PSSM score within a region of a sequence (query) is a good indication of a comparable biological role of this region to the domain, family, or motif characterized by the PSSM (9). Sequence similarity across query proteins and the PSSMs was evaluated using reverse-position-specific BLAST (RPSBLAST) (30) with an E-value cutoff of 0.001. This resulted in a diverse set of annotated proteins and, importantly, substantially limited the number of possible matches for transferring annotations to bacterial genes. In the second step (Fig. S4B), the set of protein sequences (total = 2,771,151) encoded within 800 bacterial genomes (794 species) was used in RPSBLAST searches against the data set constructed in the first step, with the identical search strategy and significance threshold. This resulted in the identification of bacterial genes encoding proteins with regions of significant similarity to at least one of the following: human proteins, previously described drug targets, or proteins with associated structural data (n = 454,842, or 16.4% of query proteins). Many of these bacterial proteins scored a match for two or all three of these specific groups identified using the PSSMs (see Fig. S4C).
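The second step of this workflow, transferring category labels to bacterial proteins whose hits pass the significance threshold, can be sketched in Python. The sketch assumes that search results are available in the standard 12-column tabular BLAST format (`-outfmt 6`), in which the E value is the eleventh field, and that a hit is kept when its E value is at or below 0.001. The identifiers, the `pssm_categories` lookup, and the function `classify_hits` are hypothetical.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def classify_hits(tabular_lines, pssm_categories, cutoff=E_VALUE_CUTOFF):
    """Map each query protein to the categories of its significant hits.

    `tabular_lines` are tab-separated rows in 12-column BLAST tabular
    format. `pssm_categories` maps a subject (PSSM) identifier to a set of
    labels such as 'human', 'drug_target', or 'structure'.
    """
    result = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > cutoff:
            continue  # not significant under the assumed rule
        result.setdefault(query, set()).update(pssm_categories.get(subject, ()))
    return result


# Hypothetical PSSM annotations built in the first step.
pssm_categories = {
    "pssm_001": {"human", "structure"},
    "pssm_002": {"drug_target"},
}

# Hypothetical search output rows (E value is the 11th column).
rows = [
    "prot_A\tpssm_001\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-30\t120",
    "prot_A\tpssm_002\t30.0\t90\t60\t3\t10\t99\t1\t90\t5e-4\t40",
    "prot_B\tpssm_002\t25.0\t80\t58\t4\t3\t82\t2\t81\t0.5\t22",
]

labels = classify_hits(rows, pssm_categories)
print(labels)
```

Proteins that never pass the cutoff, such as `prot_B` here, simply do not appear in the result and would receive no drug-targeting attributes.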
The result of propagating information from host, prior drug targets, and structure to novel bacterial proteins is shown for 22 NIAID category A, B, and C priority microbial pathogens (Table 1). A modest number of proteins (n = 40,180) encoded within these 22 genomes scored significant matches to the PSSMs described above, with slightly more having significant similarity to domains within human proteins (55.2%). This attests to the nature of protein conservation, particularly domain architecture, even across diverse organisms such as bacteria and vertebrates. However, of the 18,013 proteins lacking significant similarity to human proteins, only 19.7% lacked PSSMs matching previously defined drug targets and/or proteins with associated structural data. Thus, our analysis winnowed down a robust list to strictly prokaryotic protein domains with existing drug target analogs (n = 12), relevant structural information (n = 7,290), or both (n = 7,155), all of which provide candidate drug targets that can be utilized with minimal regard for host proteins. Regarding the bacterial proteins having significant similarity to human protein domains, the majority (97.8%) also contained matches to PSSMs with existing drug target analogs (n = 352), relevant structural information (n = 3,791), or both (n = 17,546). Of the latter class, the majority of proteins (67.4%) have matches to approved (versus under development) drug targets, suggesting that many of the existing drug targets may be applicable to pathogens with similarly functioning proteins encoded in their genomes.
Table 1. Drug-targeting attributes characterized within the genomes of 22 NIAID category A, B, and C priority microbial pathogens
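The percentages reported above for the drug-targeting analysis can be checked directly from the stated counts. The short Python calculation below does so. One value is an assumption of the sketch rather than a figure given in the text: the number of proteins with similarity to human proteins is derived by subtracting 18,013 from 40,180.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


total_matched = 40_180   # proteins in the 22 genomes with PSSM matches
no_human = 18_013        # proteins lacking similarity to human proteins
human_like = total_matched - no_human  # derived by subtraction (assumption)

# Proteins without human similarity that matched drug targets, structures, or both.
no_human_annotated = 12 + 7_290 + 7_155

# Human-similar proteins that matched drug targets, structures, or both.
human_annotated = 352 + 3_791 + 17_546

share_human_like = pct(human_like, total_matched)
share_unannotated = pct(no_human - no_human_annotated, no_human)
share_human_annotated = pct(human_annotated, human_like)
share_all_queries = pct(454_842, 2_771_151)

print(share_human_like)       # 55.2
print(share_unannotated)      # 19.7
print(share_human_annotated)  # 97.8
print(share_all_queries)      # 16.4
```

All four values agree with the percentages quoted in the text, which suggests that the reported counts are internally consistent.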
While currently under development, the novel set of bacterial genes annotated with drug-targeting attributes will become available to all PATRIC researchers in a future release. Similar “reverse annotation” strategies are also being employed for the curation of antibiotic synthesis and resistance genes, as well as a vast set of virulence factors defined by a novel controlled vocabulary. All of these data will be propagated across all genomes at PATRIC in a manner consistent with the provision of other associated data across the website. Improvements to genomic annotation generated from the strategy outlined above will drive the design and development of new resources at PATRIC, which will facilitate comprehensive comparative analyses for infectious-disease research.