Funded by the National Institute of Allergy and Infectious Diseases, the Pathosystems Resource Integration Center (PATRIC) is a genomics-centric relational database and bioinformatics resource designed to assist scientists in infectious-disease research. Specifically, PATRIC provides scientists with (i) a comprehensive bacterial genomics database, (ii) a plethora of associated data relevant to genomic analysis, and (iii) an extensive suite of computational tools and platforms for bioinformatics analysis. While the primary aim of PATRIC is to advance the knowledge underlying the biology of human pathogens, all publicly available genome-scale data for bacteria are compiled and continually updated, thereby enabling comparative analyses to reveal the basis for differences between infectious free-living and commensal species. Herein we summarize the major features available at PATRIC, dividing the resources into two major categories: (i) organisms, genomes, and comparative genomics and (ii) recurrent integration of community-derived associated data. Additionally, we present two experimental designs typical of bacterial genomics research and report on the execution of both projects using only PATRIC data and tools. These applications encompass a broad range of the data and analysis tools available, illustrating practical uses of PATRIC for the biologist. Finally, a summary of PATRIC's outreach activities, collaborative endeavors, and future research directions is provided.
The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Centers (BRCs) to provide scientists with genomics-centric resources for NIAID category A, B, and C priority microbial pathogens (a complete list of these priority pathogens is provided at the NIAID Biodefense and Related Programs website: http://www.niaid.nih.gov/topics/biodefenserelated/biodefense/research/pages/cata.aspx) (22). Originally, NIAID funded eight BRCs to provide annotated genomic and related data on microbes causing emerging and re-emerging infectious diseases, including bacterial, viral, and eukaryotic pathogens, as well as invertebrate vectors of infectious-disease agents. The Pathosystems Resource Integration Center (PATRIC), one of the original eight BRCs, stored and integrated data on six different bacterial and viral pathogens (40). In 2009, NIAID reorganized the BRC program through a competitive renewal for four BRCs, each one with a discrete yet all-encompassing organismal focus: bacteria, viruses, eukaryotic pathogens, and invertebrate vectors (with one exception: the Influenza Research Database [IRD] specifically focuses on the influenza virus). PATRIC was awarded the bacterial BRC (http://www.patricbrc.org).
PATRIC integrates and annotates all genomic and associated data available from most of the major bacterial lineages, allowing comparative analysis of the NIAID priority infectious agents with closely related free-living, symbiotic, and commensal species (see “Annotation FAQs” at http://enews.patricbrc.org/faqs/, which links to all FAQ subjects). With an emphasis on consistency in comparative genomic analysis, PATRIC has standardized annotation of all available bacterial genomes using the RAST (rapid annotation using subsystems technology) system (5), a product of the Fellowship for Interpretation of Genomes (FIG) SEED team, which is a component of the PATRIC team. RAST, which predicts genes, assigns gene functions, and reconstructs metabolic pathways, is powered by a robust assembly of subsystems that have been curated based on evaluation of hundreds of prokaryotic genomes and the clustering of common protein families encoded within these genomes (FIGfams). As of 1 July 2010, PATRIC had annotated 2,865 bacterial genomes using RAST (Note: the “All Bacteria” homepage at http://www.patricbrc.org/portal/portal/patric/Taxon?cType=taxon&cId=2 lists the current annotation statistics, including eight genome and protein sequence statistics and 43 genomic features). As it is anticipated that the growing number of sequenced prokaryotic genomes will continue to improve the quality of SEED subsystems, PATRIC will continue to update RAST-based gene, protein, and protein family annotations, as well as providing historical information to track future amendments.
In addition to the RAST-based annotations, PATRIC preserves and provides the historical annotations present at GenBank (RefSeq), as well as the annotations created by the specialists at the previous BRCs, referred to as “Legacy BRC” on the PATRIC site. Importantly, the different annotation methods allow comparison of many genomes using all three approaches. However, given the breadth of coverage of bacterial genomes using RAST, the Legacy BRC annotations are generally the least complete source at PATRIC, because annotation efforts by the previous BRCs ended in 2009. As such, PATRIC houses 355 genomes with annotations from the former BRCs. From GenBank, annotations from 3,230 genomes are currently included, allowing comparison of different annotation schemes across most bacterial genomes. Finally, our cyberinfrastructure technology enables PATRIC to support additional annotations that specific communities implement for their focal organisms, such as curated MetaCyc data (12). Thus, it is anticipated that comparative approaches to genome annotation will continue in the near future.
The PATRIC website is primarily organism-centric, with various levels of genomic data and associated information related to each included organism. While the PATRIC homepage lists the 22 watch list genera for easy access to data associated with many pathogenic species, compilation and organization of all relevant data for “All Bacteria” are standardized according to bacterial (NCBI) taxonomy, with options for viewing sets of genomes within the hierarchical bacterial tree. Thus, specific “Overview” pages can be accessed for selected taxa within the bacterial tree (e.g., genus, family, order, class, etc.). The “Overview” page contains genome (and associated data) information for all available genomic sequences (closed and incomplete) within a selected taxon and also lists the most recent PubMed articles pertinent to the study of the focal taxon. Each “Overview” page also contains six search tools (Genome Finder, Feature Finder, Comparative Pathway Tool, Protein Family Sorter (PFS), Gene Ontology (GO) Search, and Enzyme Commission (EC) Search) that allow quick directed searches without navigating further into the more detailed pages that house specific data for each organism. The “Genome List” page (Fig. 1, box 1) provides the compiled genomes (closed and incomplete, chromosomal and plasmid) for a given taxon, with statistics for all three different annotation methods and direct links to an interactive genome browser based on JBrowse (37, 38). The “Taxonomy” page (Fig. 1, box 2) provides classification schemes that are listed at NCBI, with assigned NCBI taxonomic identifiers used to relate associated data for each organism across the website. The “Phylogeny” page (Fig. 1, box 3) illustrates precomputed trees generated for higher-level groups (typically at the order level), which are based on concatenated alignments of multiple conserved protein families (50, 51).
The methods used to estimate organism phylogenies are more detailed than the trees generated from individual gene and protein alignments within other pages of the website (see “Phylogeny FAQs”).
Several pages encompass the majority of genomic data and present convenient platforms for comparative genomic analysis. The “Feature Table” (Fig. 1, box 4) provides the tabulation of information for each protein-encoding sequence (CDS), as well as noncoding RNAs, within a selected genome and can be visualized for each of the three different annotation methods. All columns contain user-defined sorting options, and selection of “Locus Tag” leads to specific pages for each CDS that list additional information, including links to NCBI (corresponding RefSeq locus tags), FASTA-formatted protein and nucleotide files, Uniprot mapping data for proteins, and direct interaction with the genome browser tool. Recent implementation of a “Compare Region Viewer” allows synteny analysis across all genomes encoding a selected CDS (see Fig. S1 in the supplemental material). A video tutorial for navigating a typical “Feature Table” illustrates its functionality (see “Feature Table FAQs”). The “Protein Families” page (Fig. 1, box 5) lists the orthologous groups of proteins generated across a selected number of input genomes, with SEED-derived FIGfams used for clustering conserved families (31). A genome filter tool allows user-defined inclusion/exclusion of genomes, and the annotated FIGfams are provided with the number of included genomes (and sequences) and length range for sequences within the protein clusters. An interactive two-dimensional (2-D) heat map visualization tool is also provided to give a bird's-eye (pan-proteome) view of both protein distribution across multiple genomes and relative conservation of synteny. A demonstration of the full range of the PFS, as applied to a typical genomics-driven experimental design, is illustrated in the following section. Finally, the “Pathways” page (Fig. 1, box 6) lists the cellular function and metabolic pathways that are encoded within a selected taxon, integrating information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (33). Pathways are classified according to major biological roles (e.g., carbohydrate metabolism, translation, biosynthesis of secondary metabolites, etc.) and are assigned identifications from a list of 137 unique cellular pathways. All pathways can be visualized for each of the three different annotation methods, and all annotation schemes can be simultaneously superimposed over pathway maps. For evaluation of pathway conservation across multiple genomes, components within KEGG maps (depicted by EC numbers) are color coded according to a spectrum depicting gene presence/absence across analyzed genomes.
In conjunction with the tools mentioned above, PATRIC's compilation of all public bacterial genomes provides a powerful platform for comparative genomic analysis. Such in silico experiments often shed light on factors implicated in pathogenicity, including their evolutionary trajectories and functions across diverse bacterial lineages. We selected a previously identified virulence factor associated with brucellosis to illustrate this experimental design. Originally isolated from infected bovine fetal tissues (39), the four-carbon sugar erythritol is the preferred carbon and energy source of Brucella spp. Subsequent experiments showed that erythritol stimulated in vitro growth of B. abortus and enhanced infections caused by a second species, B. melitensis (27). It is thought that erythritol uptake is linked to spontaneous abortion, a complication of Brucella infection in some hosts. Animals with low placental concentrations of erythritol do not have the overwhelming infection that is seen in species with high concentrations (39). Seminal studies on the biochemical pathway for erythritol catabolism in B. abortus (42, 43) led to a genetic characterization of the genes involved in this metabolism (36).
Four genes in the Brucella ery operon (eryABCD) encode enzymes that have been characterized in erythritol catabolism: erythritol kinase (EryA), erythritol phosphate dehydrogenase (EryB), d-erythrulose 4-phosphate dehydrogenase (EryC), and erythritol transcriptional regulator (EryD) (36). The ery operon has also been found in closely related bacteria (including some nonpathogenic species), suggesting a broader biological utilization for this sugar source. For example, genes involved in erythritol transport were recently identified in the legume symbiont Rhizobium leguminosarum (55), in which the ery genes play a role in root nodule formation. Discovery of the transporter operon (eryEFG), found adjacent to the catabolic operon in R. leguminosarum, led to the identification and reannotation of genes adjacent to the ery operon in Brucella. A third adjacent operon (deoR-tpiA2-rpiB) was also identified by Yost et al. (55) as possibly being important in erythritol catabolism. As the experiments demonstrating importance of this operon in erythritol catabolism have not yet been published, this operon was excluded from the present analysis.
Regarding Brucella spp., Brucella ovis and a vaccine strain, Brucella abortus S19, are known for their inability to oxidize erythritol. Tsolis et al. (46) identified four genes in B. ovis (eryA, eryD, eryF, and eryG) with mutations rendering them pseudogenes. Additionally, Crasta et al. (13) identified a 703-bp deletion that interrupts the coding regions of eryC and eryD in B. abortus S19. With 41 Brucella genomes now sequenced, we wanted to examine the genes considered important in erythritol catabolism and identify similar and perhaps additional problems that might exist in the newly available genomes. Given the presence of these genes in other bacteria, we extended our analysis to include all members of the order Rhizobiales, which, aside from Brucella and Rhizobium, contains an interesting assortment of pathogens, symbionts, and free-living members (51).
In the examination of erythritol catabolism among Brucella spp., eight proteins were analyzed in detail, including a protein whose annotation recently changed from “hypothetical protein” to “hypothetical lipoprotein component of the erythritol ABC transporter.” Using the PFS suite of tools available at PATRIC, as well as the multiple-sequence alignment viewer tool, BLAST (1), and the Genome Browser tool (Fig. 2), we were able to identify mutations in seven of these eight proteins (see Fig. S2A in the supplemental material). Although all mutations found are listed, we stress that some mutations found in single genomes (e.g., those of B. ovis and B. abortus S19) do not have supporting experimental evidence and could be sequencing or assembly errors. However, more weight should be given to mutations shared by phylogenetically related genomes, because sequencing and assembly errors are less likely to be conserved across various genomes. With this in mind, we were able to identify some mutations that are phylogenetically shared. Brucella ceti strains M13/05/01 and M644/93/1, which are monophyletic within the B. ceti clade, share two single-base-pair deletions, resulting in premature stop codons that affect eryA and the hypothetical lipoprotein component of the erythritol ABC transporter. An additional shared single-base-pair deletion that affects all nine members of the B. ceti clade is found in eryF. Brucella ovis and Brucella sp. strain NVSL 07-2006 are members of the same clade, yet they share only one of the mutations known to occur in B. ovis, a single-base-pair deletion that results in an altered start site for eryG. As B. ovis has mutations that alter four proteins, it is difficult to say if this single shared deletion renders strain NVSL 07-2006 incapable of catabolizing erythritol.
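To make concrete how a single-base-pair deletion can produce a premature stop codon, the following minimal Python sketch translates a short open reading frame before and after deleting one base. The 21-nucleotide sequence is invented for illustration and is not taken from any ery gene, and the helper names (`translate`, `delete_base`) are hypothetical. The codon table is the standard one, whose sense and stop codon assignments are shared with the bacterial genetic code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate from position 0 until the first stop codon or end of sequence."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon terminates translation
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_base(dna, index):
    """Return the sequence with the base at the given 0-based index removed."""
    return dna[:index] + dna[index + 1:]


# Hypothetical toy ORF (assumption, not a real ery sequence).
wild_type = "ATGAAACTGACCGGTGAATAA"
mutant = delete_base(wild_type, 3)  # single-base deletion shifts the frame

print(translate(wild_type))  # MKLTGE
print(translate(mutant))     # MN
```

In this toy case the deletion shifts the reading frame so that a TGA stop codon appears as the third codon, truncating the product from six residues to two. Whether a real deletion has this effect depends on the downstream sequence, which is why such calls are best checked against alignments of related genomes.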
One interesting finding involves B. abortus strains S19 and NCTC 8038, for which phylogeny estimation suggests monophyly within the B. abortus clade (see Fig. S2B in the supplemental material). While these may be the same strain, these genome sequences were generated by different teams: S19 by the Virginia Bioinformatics Institute (13) and NCTC 8038 by the Broad Institute (http://www.broadinstitute.org/annotation/genome/brucella_group/MultiHome.html). Curiously, the 703-bp deletion affecting both eryC and eryD (see above) is not present in the NCTC 8038 genome, which has complete open reading frames for these genes. It is currently unknown if these sequences represent two different isolations within the B. abortus S19 strain. If so, then there appears to be some variability in the presence of this deletion among isolates of this important vaccine strain. The only mutation that S19 and NCTC 8038 share is a single-base-pair deletion that results in a truncated eryG.
Looking more broadly across the order Rhizobiales, all proteins putatively involved in erythritol catabolism and transport were identified and compiled using several PATRIC tools. With the PFS, a visual representation of the presence or absence of these proteins in a heat map view was created, with the bacterial families within the order and the operons of interest annotated (Fig. 3A). Analysis of these proteins showed that the ery catabolism operon is present across all members of the families Brucellaceae, Phyllobacteriaceae, and Aurantimonadaceae, but it is only sporadically found in Rhizobiaceae and Bradyrhizobiaceae genomes. This operon, and any associated transport genes, is completely missing from the families Bartonellaceae, Xanthobacteraceae, Methylobacteriaceae, and Beijerinckiaceae. Using the 2-D heat map view, it is evident that some genomes within the Rhizobiaceae have all proteins within this operon annotated, while some are missing components. This genomic distribution has been described previously, as it has been suggested that the erythritol operon is used for root nodule formation by the non-Brucella organisms (55). Our bioinformatics analysis presents a platform for testing the hypothesis that a complete ery operon and associated transporter genes are essential for root nodule formation.
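The presence/absence logic underlying such a heat map can be sketched in a few lines of Python. This is an illustration only and does not reproduce the PFS implementation: the genome names and their gene content are invented, and the functions `presence_matrix` and `classify` are hypothetical helpers. Only the operon gene names (eryA to eryD) come from the text.

```python
# Genes of the catabolic operon as named in the text.
OPERON = ["eryA", "eryB", "eryC", "eryD"]

# Hypothetical input: protein families annotated in each genome (assumption).
genome_families = {
    "genome_1": {"eryA", "eryB", "eryC", "eryD"},
    "genome_2": {"eryA", "eryC"},
    "genome_3": set(),
}


def presence_matrix(genome_families, families):
    """Map each genome to a 0/1 row, one column per protein family."""
    return {
        genome: [int(family in annotated) for family in families]
        for genome, annotated in genome_families.items()
    }


def classify(row):
    """Summarize a row as a complete, partial, or absent operon."""
    if all(row):
        return "complete"
    if any(row):
        return "partial"
    return "absent"


matrix = presence_matrix(genome_families, OPERON)
status = {genome: classify(row) for genome, row in matrix.items()}

for genome, row in matrix.items():
    print(genome, row, status[genome])
```

Grouping the rows by bacterial family would then give the kind of summary reported above, with families whose genomes are all "complete", families with a mix of "complete" and "partial" rows, and families in which every row is "absent".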
An unexpected result of our analysis was the identification of a second set of genes putatively involved in erythritol transport. While the Brucellaceae and some of the genomes in other families have a type 2 erythritol ABC transporter, a genetically distinct system is encoded within the genomes of other families (Fig. 3A). In order to examine the evolutionary origin of the genes encoding these two divergent transporters, protein sequences from similarly named components (e.g., the permease component of either transporter 1 or 2) were assembled using the above-mentioned tools (see Fig. S2C in the supplemental material). Trees for all three components of the similarly named transporter proteins were generated (Fig. 3B). From this analysis, it is clear that in all three cases the Brucella proteins appear to be part of a broadly conserved ancestral family (type 2) and that a less conserved erythritol transport system (type 1) evolved from within this group. Because the transporter gene trees do not corroborate the Rhizobiales species tree (see Fig. S2B in the supplemental material), it is likely that horizontal transfer events have facilitated the dissemination of the type 1 erythritol transport system genes throughout Rhizobiales evolution. The biological relevance of diverse transport systems for erythritol and their possible correlations with pathogenicity (type 2) and symbiosis (type 1) remain to be elucidated.
In addition to the acquisition, annotation, integration, and bioinformatics processing of genome-scale data sets, PATRIC provides “awareness” of community-derived research and information associated with each bacterial organism. Principally, these genome-associated data are organized into three categories: disease, experimental data, and literature (Fig. 4). All of this information is made available to the researcher in a recurring and contextualized manner, such that it is continually updated (contingent on PATRIC and corresponding website updates) and provided at useful locations throughout the website. Thus, this feature provides the infectious-disease research community with an invaluable integration of research data and metadata from a multitude of sources, enabling sophisticated and comprehensive analyses across any bacterial taxon of interest at a single website with consistent tools and interfaces.
For disease-related information (Fig. 4, box 1), a catalog of PubMed literature relevant to associated diseases is provided. Additionally, medical subject headings (MeSH) disease terms are listed, allowing direct access to the National Library of Medicine MeSH Descriptor Database (32). Candidate virulence factors can be evaluated based on a strategy that integrates data from the Virulence Factor Database (VFDB) (54). Briefly, virulence factors listed at the VFDB are compiled at PATRIC and used to identify all putative homologs present within other bacterial genomes. Information is also provided on human genes associated with each disease, including genetic and chemical evidence. Integrating data from the Genetic Association Database (8, 57), the “Genetic Association Source” table lists human genes that have been shown to have some genetic association with a bacterial disease. Similarly, data from the Comparative Toxicogenomics Database (14) are integrated in the “Comparative Toxicogenomics Source,” which lists human genes associated with a bacterial disease that have been characterized via chemical treatment or exposure. Both the “Genetic Association Source” and the “Comparative Toxicogenomics Source” provide additional information about the human genes from NCBI as well as GeneCards, a comprehensive and authoritative compendium of annotative information pertaining to human genes (35). Finally, two additional tools round out the integrated information pertinent to bacterial diseases. The “Disease-Pathogen Visualization” page provides an interactive, graphical image of the relationships between pathogens, diseases, virulence genes, and disease-associated host genes. The “Disease Map” page provides a real-time global view of recent reports and outbreaks of bacterial diseases, with geolocation superimposed on an interactive global health map (11).
An example of a PATRIC disease map shows the high activity index in Europe of reported Escherichia coli infections during the recent outbreak of the German enterohemorrhagic/verocytotoxin-producing E. coli (EHEC/VTEC) strain (see Fig. S3 in the supplemental material).
A major undertaking for PATRIC is to provide a summary of the wide range of experimental data found in a variety of databases for all bacteria (Fig. 4, box 2). This information, collectively referred to as postgenomic data, encompasses transcriptomic data primarily from microarrays (in addition to serial analysis of gene expression [SAGE] and RNA-Seq), proteomics data from mass spectrometry, protein-protein interaction data, and protein 3-D structure data (X ray and nuclear magnetic resonance [NMR]). At the species and strain levels, these data are sometimes difficult to find at the associated databases. PATRIC recurrently searches select external databases using several keywords (e.g., organism name, NCBI taxonomic identifier, etc.) specific to each source and provides links to data that are continually updated at these repositories. Thus, PATRIC provides a summary of the number and types of data available at NCBI's GEO (Gene Expression Omnibus) (6, 7), EBI's ArrayExpress (26), and the legacy NIAID-funded Proteomics Resource Centers (PRCs) (56). Mass spectrometry data are accessed from Peptidome (25), PRIDE (48), and the PRCs. Current knowledge on protein-protein interactions is also retrieved from the PRCs, as well as IntAct (4). Finally, PATRIC links to protein 3-D structure data from the NCBI and the Protein Data Bank (PDB) (10).
A continual challenge for PATRIC is to provide the user with a robust and real-time list of literature and web text resources pertaining to each organism (Fig. 4, box 3). Relevant articles (and abstracts when available) from PubMed are listed chronologically, with direct links to PubMed provided. Literature compilations may be filtered by date and keyword for winnowing down large lists. A more direct way to reduce irrelevant results while increasing the recall of relevant documents is to use the text-mining tool, which implements technology developed in conjunction with the UK National Text Mining Centre (NaCTeM), another component of the PATRIC team. This process displays search results based on indexes of UK Medline abstracts, identifying key entities from the search text (e.g., genes, proteins, metabolites, drugs, diseases, symptoms, etc.). Results are summarized by entity type and allow progressive filtering. Abstracts are provided with key entities highlighted in different colors and contain direct links to PubMed.
The computed proteomes of all PATRIC genomes provide rich data sets for large-scale computational analyses. One of PATRIC's major focal areas of research is the design and execution of experiments that integrate multiple levels of information from community databases for improving bacterial genome annotation (i.e., adding information beyond standard automated annotation). Importantly, while the data integrated from the community may pertain to selected high-profile pathogens, PATRIC's analysis pipelines work to propagate this information across all bacterial genomes when gene and protein homology supports such an approach. In theory, this strategy of refining functional gene and protein annotations will expand our knowledge of the factors directly involved at the interface between host and pathogen, e.g., virulence factor identification, antibiotic resistance and synthesis gene characterization, drug and vaccine targeting, etc. The following example illustrates this approach for the development of a drug targeting classification for all bacterial genomes.
With the list of drug and vaccine targets in the infectious-disease research community rapidly growing (53), we hypothesize that this information, combined with the comprehensive proteome of PATRIC genomes, may be utilized to propose novel antibacterial drug targets. The logic in our approach presumes that previously determined drug targets in some bacterial species might provide reasonable candidate targets for other species if structural and functional data are similar across bona fide and candidate targets. Aside from sequence-based criteria, we elected to incorporate information from protein 3-D structure into our experimental design, as there is a tendency for approved and pending bacterial drug targets to have associated structural data (NMR, cryo-electron microscopy, X-ray crystallography, etc.). We also considered the human genome in our analysis, distinguishing between drug targets with high similarity to human proteins and those with no significant human-encoded counterparts. The latter distinction is important, as selection of drug targets with some degree of similarity to human proteins would require more careful design to avoid effective targeting of both host and pathogen proteins.
To illustrate PATRIC's potential for large-scale drug target annotation, the workflow is divided into two processes. First, a data set was created containing significant similarity between a set of position-specific scoring matrices (PSSMs) (23) from NCBI's Protein Clusters (28) and (i) protein sequences encoded within the human genome (47), (ii) proteins previously annotated as drug targets (29, 52), and (iii) proteins with associated 3-D structure information (47) (see Fig. S4A in the supplemental material). A high PSSM score within a region of a sequence (query) is a good indication of a comparable biological role of this region to the domain, family, or motif characterized by the PSSM (9). Sequence similarity across query proteins and the PSSMs was evaluated using reverse-position-specific BLAST (RPSBLAST) (30) with an E-value cutoff of 0.001. This resulted in a diverse set of annotated proteins and, importantly, substantially limited the number of possible matches for transferring annotations to bacterial genes. In the second step (Fig. S4B), the set of protein sequences (total = 2,771,151) encoded within 800 bacterial genomes (794 species) was used in RPSBLAST searches against the data set constructed in the first step, with the identical search strategy and significance threshold. This resulted in the identification of bacterial genes encoding proteins with regions of significant similarity to at least one of the following: human proteins, previously described drug targets, or proteins with associated structural data (n = 454,842, or 16.4% of query proteins). Many of these bacterial proteins scored a match for two or all three of these specific groups identified using the PSSMs (see Fig. S4C).
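The two-step logic of this workflow can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used for the analysis: the hit tuples stand in for parsed RPSBLAST output, all identifiers (PSSM_1, prot_A, and so on) are invented, and the function names `label_pssms` and `annotate_proteins` are hypothetical. Treating the cutoff as inclusive (E-value less than or equal to 0.001) is also an assumption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-3  # significance threshold stated in the text


def label_pssms(reference_hits, cutoff=E_VALUE_CUTOFF):
    """Step 1: record which reference categories each PSSM matches significantly."""
    labels = defaultdict(set)
    for category, pssm_id, evalue in reference_hits:
        if evalue <= cutoff:
            labels[pssm_id].add(category)
    return dict(labels)


def annotate_proteins(bacterial_hits, pssm_labels, cutoff=E_VALUE_CUTOFF):
    """Step 2: transfer PSSM labels to bacterial proteins with significant hits."""
    annotations = defaultdict(set)
    for protein_id, pssm_id, evalue in bacterial_hits:
        if evalue <= cutoff and pssm_id in pssm_labels:
            annotations[protein_id] |= pssm_labels[pssm_id]
    return dict(annotations)


# Hypothetical hits of reference proteins against PSSMs: (category, PSSM, E value).
reference_hits = [
    ("human", "PSSM_1", 1e-20),
    ("drug_target", "PSSM_1", 1e-8),
    ("structure", "PSSM_2", 5e-4),
    ("human", "PSSM_3", 0.5),  # above the cutoff, so PSSM_3 stays unlabeled
]

# Hypothetical hits of bacterial proteins against the same PSSMs.
bacterial_hits = [
    ("prot_A", "PSSM_1", 1e-30),
    ("prot_B", "PSSM_2", 1e-5),
    ("prot_C", "PSSM_3", 1e-10),  # significant, but PSSM_3 carries no label
    ("prot_D", "PSSM_2", 0.2),    # above the cutoff
]

pssm_labels = label_pssms(reference_hits)
annotations = annotate_proteins(bacterial_hits, pssm_labels)

for protein_id, categories in sorted(annotations.items()):
    print(protein_id, sorted(categories))
```

Here prot_A inherits both the "human" and "drug_target" labels through a shared PSSM, mirroring the observation that many bacterial proteins match two or all three groups, while prot_C and prot_D receive no annotation.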
The result of propagating information from host, prior drug targets, and structure to novel bacterial proteins is shown for 22 NIAID category A, B, and C priority microbial pathogens (Table 1). A modest number of proteins (n = 40,180) encoded within these 22 genomes scored significant matches to the PSSMs described above, with slightly more having significant similarity to domains within human proteins (55.2%). This attests to the nature of protein conservation, particularly domain architecture, even across diverse organisms such as bacteria and vertebrates. However, of the 18,013 proteins lacking significant similarity to human proteins, only 19.7% lacked PSSMs matching previously defined drug targets and/or proteins with associated structural data. Thus, our analysis winnowed down a robust list to strictly prokaryotic protein domains with existing drug target analogs (n = 12), relevant structural information (n = 7,290), or both (n = 7,155), all of which provide candidate drug targets that can be utilized with minimal regard for host proteins. Regarding the bacterial proteins having significant similarity to human protein domains, the majority (97.8%) also contained matches to PSSMs with existing drug target analogs (n = 352), relevant structural information (n = 3,791), or both (n = 17,546). Of the latter class, the majority of proteins (67.4%) have matches to approved (versus under development) drug targets, suggesting that many of the existing drug targets may be applicable to pathogens with similarly functioning proteins encoded in their genomes.
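The percentages quoted above follow directly from the reported counts, as the short Python check below shows. All counts are taken from the text except the number of proteins with similarity to human domains, which is not stated explicitly and is assumed here to be the difference between the total (40,180) and the proteins lacking such similarity (18,013).

```python
total_hits = 40_180  # proteins with significant PSSM matches (22 genomes)
no_human = 18_013    # proteins lacking significant similarity to human proteins
human_like = total_hits - no_human  # assumption: derived by subtraction

# Class sizes reported in the text for each group.
no_human_classes = {"drug_target": 12, "structure": 7_290, "both": 7_155}
human_classes = {"drug_target": 352, "structure": 3_791, "both": 17_546}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pct_human_like = pct(human_like, total_hits)
pct_no_human_unmatched = pct(no_human - sum(no_human_classes.values()), no_human)
pct_human_matched = pct(sum(human_classes.values()), human_like)

print(human_like)              # 22167
print(pct_human_like)          # 55.2
print(pct_no_human_unmatched)  # 19.7
print(pct_human_matched)       # 97.8
```

The three classes without human similarity sum to 14,457 proteins, leaving 3,556 (19.7%) without a match to drug targets or structural data, and the three classes with human similarity sum to 21,689 of 22,167 proteins (97.8%), in agreement with the figures reported.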
While currently under development, the novel set of bacterial genes annotated with drug-targeting attributes will become available to all PATRIC researchers in a future release. Similar “reverse annotation” strategies are also being employed for the curation of antibiotic synthesis and resistance genes, as well as a vast set of virulence factors defined by a novel controlled vocabulary. All of these data will be propagated across all genomes at PATRIC in a manner consistent with the provision of other associated data across the website. Improvements to genomic annotation generated from the strategy outlined above will drive the design and development of new resources at PATRIC, which will facilitate comprehensive comparative analyses for infectious-disease research.
The community-derived information that is integrated into PATRIC is provided through a practical, rich interface that delivers access to all the relevant data from these key public external sources. Advancing the user's experience and research capability at PATRIC is a driving force; therefore, we formally apply the structured, user-centered process known as usability engineering (24) to improve users' experience with the site. Specifically, we actively involve representative researchers and other stakeholders in formulating user-centered requirements, design, and evaluation and continue their involvement through the PATRIC operational releases, thereby ensuring a highly usable site derived from real user experience (44). To create functional areas of the website, we iteratively cocreate conceptual design sketches with researchers that organize insights from domain analysis activities and user-centered requirements. We thoroughly analyze results from these early evaluations and use them to create detailed designs that use modern technologies to provide a user-centered experience.
Throughout the development and refinement of PATRIC, we have identified three keystone design principles from the field of human-computer interaction that are well suited to serve the infectious-disease researcher community. We employed each of these principles throughout the PATRIC website. The first pertains to information integration. This approach stresses seamless accession of all organisms, tasks/tools, and data throughout the website without forcing users to go repeatedly to different pages or website areas. Second, the progressive filtering method is implemented, supporting numerous levels of filtering and drill-down, e.g., over all PATRIC data, on a single organism, on a single genome, etc. Finally, a context sensitivity approach offers options (controls, filters, tools, etc.) that are appropriate to the user's current scope (e.g., as instantiated in filters, task areas, and tabs on PATRIC's data browser page). In sum, to meet the challenge of clearly and efficiently delivering a comprehensive collection of integrated data for infectious-disease research, PATRIC's user-centered design approach has produced a usable, friendly web interface.
Recently, PATRIC has utilized the above-mentioned tools, analysis platforms, and other resources in bioinformatics investigations pertaining to various aspects of infectious-disease research, including virulence factors (2, 19, 20), comparative genomics (21, 41, 49), large-scale phylogenetics (50, 51), human–bacterial-pathogen protein interaction networks (16), text mining (3; S. Pyysalo et al., presented at the 2010 Workshop on Biomedical Natural Language Processing, ACL 2010, Uppsala, Sweden, 15 July 2010), and data integration (44). Our efforts have also been utilized in various collaborations generating experimental research (15, 45). As the scope of information integration continues to expand, PATRIC's infrastructure will continue to grow through developments driven by collaborations with the infectious-disease research community, education and outreach activities, community engagement and feedback, and continuing PATRIC-driven research. Three aspects of PATRIC's future are described below.
PATRIC conducts several activities to engage the infectious-disease research community and to drive development of further infrastructure. One important example is the Driving Biological Projects (DBPs) program. Via DBPs, we collaborate with groups within the infectious-disease research community to produce large-scale data in order to define, cocreate, develop, and deploy the infrastructure needed to support further novel data types (such as RNA-Seq) and respective integrated analyses by the community. These are competitively awarded projects that are reviewed by PATRIC's scientific working group and awarded as PATRIC subcontracts. Through this process, PATRIC further evolves into a resource that can provide researchers with analysis capabilities and integrative access to new and evolving types of data.
In 2010, PATRIC awarded two subcontracts in the inaugural round of the DBPs program (see http://enews.patricbrc.org/feature/call-for-dbp-proposals/). The first project will focus on comparative transcriptome, proteome, and phenotype microarray analysis of five divergent Clostridium difficile strains to facilitate the understanding of mechanisms of C. difficile pathogenesis. The results of the large-scale data analysis and comparisons will help verify and update C. difficile genome annotations and aid in obtaining a comprehensive overview of C. difficile core, divergent, and strain-specific genes and pathways involved in pathogenesis. In addition to its value for the C. difficile research community, this work will help expand the PATRIC data model (e.g., integration of Biolog data) through joint development, testing, and deployment of novel tools, such as an RNA-Seq analysis pipeline and visualization. These tools will be directly applicable to other bacterial projects.
The second project will aim to provide PATRIC with essential information for displaying genes characteristic of nontyphoidal Salmonella enterica serovar Typhimurium, particularly those that contribute to survival in a variety of environments, including various host species. This will be accomplished primarily through a combination of high-throughput screening and sequencing approaches and unique resources developed to annotate the S. Typhimurium genome with fitness data. The generation of S. Typhimurium transcriptomes from bacteria growing in defined environments (including rich and minimal media, at stationary phase, and under conditions that induce virulence pathways) will yield basal reference profiles to help standardize, as well as streamline, the massive amount of high-throughput transcriptomics data from forthcoming studies. Novel tools and infrastructure developed in concert with the DBPs will be incorporated into PATRIC in future releases. Future calls for DBPs will be posted on the PATRIC homepage.
We conduct additional outreach through workshops designed to teach researchers how to benefit fully from PATRIC's broad resources. Workshops include lectures on in silico experimental designs and bioinformatics tools and methods, as well as demonstrations of various analyses that can be performed using the PATRIC website. The workshops cover pathogens as well as other bacterial species and make particular use of the comparative tools described in the examples outlined above and in recent publications (for example, see references 21, 41, 49, and 50). Workshops are conducted on a recurrent basis, and their content will change as new developments are instituted at PATRIC. Our team also participates in various scientific meetings and conferences and has given numerous presentations. Web pages listing information on past and future presentations (see http://enews.patricbrc.org/category/presentations) as well as general PATRIC news feeds (see http://www.patricbrc.org) are updated on a regular basis.
Many new capabilities are already planned for PATRIC to improve the user experience and to provide the most comprehensive resource for computational analyses directed toward understanding bacterial pathogenesis and for development of antibacterial drugs, diagnostics, and vaccines. In the future, PATRIC researchers will be able to analyze and compare their own data against available data for all bacterial genomes. A complete list of future developments is beyond the scope of this introductory article but includes a more versatile multiple-sequence viewer, access to metagenomics data and annotation tools, and improved and more integrated text-mining capabilities. This growing suite of tools will enable complex analyses through workflows. Forthcoming developments at PATRIC will ensure that it meets the varied needs of the infectious-disease research community, especially teams working to develop antibacterial drugs and vaccines.
We are grateful for the constructive criticism provided by the two anonymous reviewers.
This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract no. HHSN272200900040C awarded to B.W.S. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIAID or the National Institutes of Health.
‡Supplemental material for this article may be found at http://iai.asm.org/.
Published ahead of print on 6 September 2011.
#The authors have paid a fee to allow immediate free access to this paper.