The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host–parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.
TIGRFAMs, available online at http://www.jcvi.org/tigrfams is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.
The whole genome analysis of two strains of the first intermediately pathogenic leptospiral species to be sequenced (Leptospira licerasiae strains VAR010 and MMD0835) provides insight into their pathogenic potential and deepens our understanding of leptospiral evolution. Comparative analysis of eight leptospiral genomes shows the existence of a core leptospiral genome comprising 1547 genes and 452 conserved genes restricted to infectious species (including L. licerasiae) that are likely to be pathogenicity-related. Comparisons of the functional content of the genomes suggests that L. licerasiae retains several proteins related to nitrogen, amino acid and carbohydrate metabolism which might help to explain why these Leptospira grow well in artificial media compared with pathogenic species. L. licerasiae strains VAR010T and MMD0835 possess two prophage elements. While one element is circular and shares homology with LE1 of L. biflexa, the second is cryptic and homologous to a previously identified but unnamed region in L. interrogans serovars Copenhageni and Lai. We also report a unique O-antigen locus in L. licerasiae comprised of a 6-gene cluster that is unexpectedly short compared with L. interrogans in which analogous regions may include >90 such genes. Sequence homology searches suggest that these genes were acquired by lateral gene transfer (LGT). Furthermore, seven putative genomic islands ranging in size from 5 to 36 kb are present also suggestive of antecedent LGT. How Leptospira become naturally competent remains to be determined, but considering the phylogenetic origins of the genes comprising the O-antigen cluster and other putative laterally transferred genes, L. licerasiae must be able to exchange genetic material with non-invasive environmental bacteria. The data presented here demonstrate that L. licerasiae is genetically more closely related to pathogenic than to saprophytic Leptospira and provide insight into the genomic bases for its infectiousness and its unique antigenic characteristics.
Leptospirosis is one of the most common diseases transmitted by animals worldwide and is important because it is a major cause of febrile illness in tropical areas and also occurs in epidemic form associated with natural disasters and flooding. The mechanisms through which Leptospira cause disease are not well understood. In this study we have sequenced the genomes of two strains of Leptospira licerasiae isolated from a person and a marsupial in the Peruvian Amazon. These strains were thought to be able to cause only mild disease in humans. We have compared these genomes with other leptospires that can cause severe illness and death and another leptospire that does not infect humans or animals. These comparisons have allowed us to demonstrate similarities among the disease-causing Leptospira. Studying genes that are common among infectious strains will allow us to identify genetic factors necessary for infecting, causing disease and determining the severity of disease. We have also found that L. licerasiae seems to be able to uptake and incorporate genetic information from other bacteria found in the environment. This information will allow us to begin to understand how Leptospira species have evolved.
Multiple new prokaryotic C-terminal protein-sorting signals were found that reprise the tripartite architecture shared by LPXTG and PEP-CTERM: motif, TM helix, basic cluster. Defining hidden Markov models were constructed for all. PGF-CTERM occurs in 29 archaeal species, some of which have more than 50 proteins that share the domain. PGF-CTERM proteins include the major cell surface protein in Halobacterium, a glycoprotein with a partially characterized diphytanylglyceryl phosphate linkage near its C terminus. Comparative genomics identifies a distant exosortase homolog, designated archaeosortase A (ArtA), as the likely protein-processing enzyme for PGF-CTERM. Proteomics suggests that the PGF-CTERM region is removed. Additional systems include VPXXXP-CTERM/archeaosortase B in two of the same archaea and PEF-CTERM/archaeosortase C in four others. Bacterial exosortases often fall into subfamilies that partner with very different cohorts of extracellular polymeric substance biosynthesis proteins; several species have multiple systems. Variant systems include the VPDSG-CTERM/exosortase C system unique to certain members of the phylum Verrucomicrobia, VPLPA-CTERM/exosortase D in several alpha- and deltaproteobacterial species, and a dedicated (single-target) VPEID-CTERM/exosortase E system in alphaproteobacteria. Exosortase-related families XrtF in the class Flavobacteria and XrtG in Gram-positive bacteria mark distinctive conserved gene neighborhoods. A picture emerges of an ancient and now well-differentiated superfamily of deeply membrane-embedded protein-processing enzymes. Their target proteins are destined to transit cellular membranes during their biosynthesis, during which most undergo additional posttranslational modifications such as glycosylation.
Bacteria in the 16S rRNA clade SAR86 are among the most abundant uncultivated constituents of microbial assemblages in the surface ocean for which little genomic information is currently available. Bioinformatic techniques were used to assemble two nearly complete genomes from marine metagenomes and single-cell sequencing provided two more partial genomes. Recruitment of metagenomic data shows that these SAR86 genomes substantially increase our knowledge of non-photosynthetic bacteria in the surface ocean. Phylogenomic analyses establish SAR86 as a basal and divergent lineage of γ-proteobacteria, and the individual genomes display a temperature-dependent distribution. Modestly sized at 1.25–1.7 Mbp, the SAR86 genomes lack several pathways for amino-acid and vitamin synthesis as well as sulfate reduction, trends commonly observed in other abundant marine microbes. SAR86 appears to be an aerobic chemoheterotroph with the potential for proteorhodopsin-based ATP generation, though the apparent lack of a retinal biosynthesis pathway may require it to scavenge exogenously-derived pigments to utilize proteorhodopsin. The genomes contain an expanded capacity for the degradation of lipids and carbohydrates acquired using a wealth of tonB-dependent outer membrane receptors. Like the abundant planktonic marine bacterial clade SAR11, SAR86 exhibits metabolic streamlining, but also a distinct carbon compound specialization, possibly avoiding competition.
SAR86; SAR11; metagenomic assembly; single cell genomics; proteorhodopsin; tonB receptors
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies.
Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries.
ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
Regimens targeting Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), require long courses of treatment and a combination of three or more drugs. An increase in drug-resistant strains of M. tuberculosis demonstrates the need for additional TB-specific drugs. A notable feature of M. tuberculosis is coenzyme F420, which is distributed sporadically and sparsely among prokaryotes. This distribution allows for comparative genomics-based investigations. Phylogenetic profiling (comparison of differential gene content) based on F420 biosynthesis nominated many actinobacterial proteins as candidate F420-dependent enzymes. Three such families dominated the results: the luciferase-like monooxygenase (LLM), pyridoxamine 5′-phosphate oxidase (PPOX), and deazaflavin-dependent nitroreductase (DDN) families. The DDN family was determined to be limited to F420-producing species. The LLM and PPOX families were observed in F420-producing species as well as species lacking F420 but were particularly numerous in many actinobacterial species, including M. tuberculosis. Partitioning the LLM and PPOX families based on an organism's ability to make F420 allowed the application of the SIMBAL (sites inferred by metabolic background assertion labeling) profiling method to identify F420-correlated subsequences. These regions were found to correspond to flavonoid cofactor binding sites. Significantly, these results showed that M. tuberculosis carries at least 28 separate F420-dependent enzymes, most of unknown function, and a paucity of flavin mononucleotide (FMN)-dependent proteins in these families. While prevalent in mycobacteria, markers of F420 biosynthesis appeared to be absent from the normal human gut flora. These findings suggest that M. tuberculosis relies heavily on coenzyme F420 for its redox reactions. This dependence and the cofactor's rarity may make F420-related proteins promising drug targets.
Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets.
Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization.
SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites.
The complete genomes of three strains from the phylum Acidobacteria were compared. Phylogenetic analysis placed them as a unique phylum. They share genomic traits with members of the Proteobacteria, the Cyanobacteria, and the Fungi. The three strains appear to be versatile heterotrophs. Genomic and culture traits indicate the use of carbon sources that span simple sugars to more complex substrates such as hemicellulose, cellulose, and chitin. The genomes encode low-specificity major facilitator superfamily transporters and high-affinity ABC transporters for sugars, suggesting that they are best suited to low-nutrient conditions. They appear capable of nitrate and nitrite reduction but not N2 fixation or denitrification. The genomes contained numerous genes that encode siderophore receptors, but no evidence of siderophore production was found, suggesting that they may obtain iron via interaction with other microorganisms. The presence of cellulose synthesis genes and a large class of novel high-molecular-weight excreted proteins suggests potential traits for desiccation resistance, biofilm formation, and/or contribution to soil structure. Polyketide synthase and macrolide glycosylation genes suggest the production of novel antimicrobial compounds. Genes that encode a variety of novel proteins were also identified. The abundance of acidobacteria in soils worldwide and the breadth of potential carbon use by the sequenced strains suggest significant and previously unrecognized contributions to the terrestrial carbon cycle. Combining our genomic evidence with available culture traits, we postulate that cells of these isolates are long-lived, divide slowly, exhibit slow metabolic rates under low-nutrient conditions, and are well equipped to tolerate fluctuations in soil hydration.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the ‘transparency’ of the information contained in existing genomic databases.
The dimorphic prosthecate bacteria (DPB) are α-proteobacteria that reproduce in an asymmetric manner rather than by binary fission and are of interest as simple models of development. Prior to this work, the only member of this group for which genome sequence was available was the model freshwater organism Caulobacter crescentus. Here we describe the genome sequence of Hyphomonas neptunium, a marine member of the DPB that differs from C. crescentus in that H. neptunium uses its stalk as a reproductive structure. Genome analysis indicates that this organism shares more genes with C. crescentus than it does with Silicibacter pomeroyi (a closer relative according to 16S rRNA phylogeny), that it relies upon a heterotrophic strategy utilizing a wide range of substrates, that its cell cycle is likely to be regulated in a similar manner to that of C. crescentus, and that the outer membrane complements of H. neptunium and C. crescentus are remarkably similar. H. neptunium swarmer cells are highly motile via a single polar flagellum. With the exception of cheY and cheR, genes required for chemotaxis were absent in the H. neptunium genome. Consistent with this observation, H. neptunium swarmer cells did not respond to any chemotactic stimuli that were tested, which suggests that H. neptunium motility is a random dispersal mechanism for swarmer cells rather than a stimulus-controlled navigation system for locating specific environments. In addition to providing insights into bacterial development, the H. neptunium genome will provide an important resource for the study of other interesting biological processes including chromosome segregation, polar growth, and cell aging.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (), and for download by anonymous FTP (). The InterProScan search tool is now also available via a web service at .
This meeting report summarizes the proceedings of the “eGenomics: Cataloguing our Complete Genome Collection III” workshop held September 11–13, 2006, at the National Institute for Environmental eScience (NIEeS), Cambridge, United Kingdom. This 3rd workshop of the Genomic Standards Consortium was divided into two parts. The first half of the three-day workshop was dedicated to reviewing the genomic diversity of our current and future genome and metagenome collection, and exploring linkages to a series of existing projects through formal presentations. The second half was dedicated to strategic discussions. Outcomes of the workshop include a revised “Minimum Information about a Genome Sequence” (MIGS) specification (v1.1), consensus on a variety of features to be added to the Genome Catalogue (GCat), agreement by several researchers to adopt MIGS for imminent genome publications, and an agreement by the EBI and NCBI to input their genome collections into GCat for the purpose of quantifying the amount of optional data already available (e.g., for geographic location coordinates) and working towards a single, global list of all public genomes and metagenomes.
TIGRFAMs is a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions. Each family is based on a hidden Markov model (HMM), where both cutoff scores and membership in the seed alignment are chosen so that the HMMs can classify numerous proteins according to their specific molecular functions. Most TIGRFAMs models describe ‘equivalog’ families, where both orthology and lateral gene transfer may be part of the evolutionary history, but where a single molecular function has been conserved. The Genome Properties system contains a queriable set of metabolic reconstructions, genome metrics and extractions of information from the scientific literature. Its genome-by-genome assertions of whether or not specific structures, pathways or systems are present provide high-level conceptual descriptions of genomic content. These assertions enable comparative genomics, provide a meaningful biological context to aid in manual annotation, support assignments of Gene Ontology (GO) biological process terms and help validate HMM-based predictions of protein function. The Genome Properties system is particularly useful as a generator of phylogenetic profiles, through which new protein family functions may be discovered. The TIGRFAMs and Genome Properties systems can be accessed at and .
Protein translocation to the proper cellular destination may be guided by various classes of sorting signals recognizable in the primary sequence. Detection in some genomes, but not others, may reveal sorting system components by comparison of the phylogenetic profile of the class of sorting signal to that of various protein families.
We describe a short C-terminal homology domain, sporadically distributed in bacteria, with several key characteristics of protein sorting signals. The domain includes a near-invariant motif Pro-Glu-Pro (PEP). This possible recognition or processing site is followed by a predicted transmembrane helix and a cluster rich in basic amino acids. We designate this domain PEP-CTERM. It tends to occur multiple times in a genome if it occurs at all, with a median count of eight instances; Verrucomicrobium spinosum has sixty-five. PEP-CTERM-containing proteins generally contain an N-terminal signal peptide and exhibit high diversity and little homology to known proteins. All bacteria with PEP-CTERM have both an outer membrane and exopolysaccharide (EPS) production genes. By a simple heuristic for screening phylogenetic profiles in the absence of pre-formed protein families, we discovered that a homolog of the membrane protein EpsH (exopolysaccharide locus protein H) occurs in a species when PEP-CTERM domains are found. The EpsH family contains invariant residues consistent with a transpeptidase function. Most PEP-CTERM proteins are encoded by single-gene operons preceded by large intergenic regions. In the Proteobacteria, most of these upstream regions share a DNA sequence, a probable cis-regulatory site that contains a sigma-54 binding motif. The phylogenetic profile for this DNA sequence exactly matches that of three proteins: a sigma-54-interacting response regulator (PrsR), a transmembrane histidine kinase (PrsK), and a TPR protein (PrsT).
These findings are consistent with the hypothesis that PEP-CTERM and EpsH form a protein export sorting system, analogous to the LPXTG/sortase system of Gram-positive bacteria, and correlated to EPS expression. It occurs preferentially in bacteria from sediments, soils, and biofilms. The novel method that led to these findings, partial phylogenetic profiling, requires neither global sequence clustering nor arbitrary similarity cutoffs and appears to be a rapid, effective alternative to other profiling methods.
Anaplasma (formerly Ehrlichia) phagocytophilum, Ehrlichia chaffeensis, and Neorickettsia (formerly Ehrlichia) sennetsu are intracellular vector-borne pathogens that cause human ehrlichiosis, an emerging infectious disease. We present the complete genome sequences of these organisms along with comparisons to other organisms in the Rickettsiales order. Ehrlichia spp. and Anaplasma spp. display a unique large expansion of immunodominant outer membrane proteins facilitating antigenic variation. All Rickettsiales have a diminished ability to synthesize amino acids compared to their closest free-living relatives. Unlike members of the Rickettsiaceae family, these pathogenic Anaplasmataceae are capable of making all major vitamins, cofactors, and nucleotides, which could confer a beneficial role in the invertebrate vector or the vertebrate host. Further analysis identified proteins potentially involved in vacuole confinement of the Anaplasmataceae, a life cycle involving a hematophagous vector, vertebrate pathogenesis, human pathogenesis, and lack of transovarial transmission. These discoveries provide significant insights into the biology of these obligate intracellular pathogens.
Ehrlichiosis is an acute disease that triggers flu-like symptoms in both humans and animals. It is caused by a range of bacteria transmitted by ticks or flukes. Because these bacteria are difficult to culture, however, the organisms are poorly understood. The genomes of three emerging human pathogens causing ehrlichiosis were sequenced. A database was designed to allow the comparison of these three genomes to sixteen other bacteria with similar lifestyles. Analysis from this database reveals new species-specific and disease-specific genes indicating niche adaptations, pathogenic traits, and other features. In particular, one of the organisms contains more than 100 copies of a single gene involved in interactions with the host(s). These comparisons also enabled a reconstruction of the metabolic potential of five representative genomes from these bacteria and their close relatives. With this work, scientists can study these emerging pathogens in earnest.
Clustered regularly interspaced short palindromic repeats (CRISPRs) are a family of DNA direct repeats found in many prokaryotic genomes. Repeats of 21–37 bp typically show weak dyad symmetry and are separated by regularly sized, nonrepetitive spacer sequences. Four CRISPR-associated (Cas) protein families, designated Cas1 to Cas4, are strictly associated with CRISPR elements and always occur near a repeat cluster. Some spacers originate from mobile genetic elements and are thought to confer “immunity” against the elements that harbor these sequences. In the present study, we have systematically investigated uncharacterized proteins encoded in the vicinity of these CRISPRs and found many additional protein families that are strictly associated with CRISPR loci across multiple prokaryotic species. Multiple sequence alignments and hidden Markov models have been built for 45 Cas protein families. These models identify family members with high sensitivity and selectivity and classify key regulators of development, DevR and DevS, in Myxococcus xanthus as Cas proteins. These identifications show that CRISPR/cas gene regions can be quite large, with up to 20 different, tandem-arranged cas genes next to a repeat cluster or filling the region between two repeat clusters. Distinctive subsets of the collection of Cas proteins recur in phylogenetically distant species and correlate with characteristic repeat periodicity. The analyses presented here support initial proposals of mobility of these units, along with the likelihood that loci of different subtypes interact with one another as well as with host cell defensive, replicative, and regulatory systems. It is evident from this analysis that CRISPR/cas loci are larger, more complex, and more heterogeneous than previously appreciated.
The family of clustered regularly interspaced short palindromic repeats (CRISPRs) describes a class of DNA repeats found in nearly half of all bacterial and archaeal genomes. These DNA repeat regions have a remarkably regular structure: unique sequences of constant size, called spacers, sit between each pair of repeats. The DNA repeats do not encode proteins, but appear to be transcribed and processed into small RNAs that may have any number of functions, including resistance to any phage (i.e., virus of bacteria) whose sequence matches a spacer; spacers change rapidly as microbial strains evolve. This work describes 41 new CRISPR-associated (cas) gene families, which are always found near these repeats, in addition to the four previously known. It shows that CRISPR systems belong to different classes, with different repeat patterns, sets of genes, and species ranges. Most of these seem to come and go rather rapidly from their host genomes. These possibly beneficial mobile genetic elements may play an important role in driving prokaryotic evolution.
Pseudomonas syringae pv. phaseolicola, a gram-negative bacterial plant pathogen, is the causal agent of halo blight of bean. In this study, we report on the genome sequence of P. syringae pv. phaseolicola isolate 1448A, which encodes 5,353 open reading frames (ORFs) on one circular chromosome (5,928,787 bp) and two plasmids (131,950 bp and 51,711 bp). Comparative analyses with a phylogenetically divergent pathovar, P. syringae pv. tomato DC3000, revealed a strong degree of conservation at the gene and genome levels. In total, 4,133 ORFs were identified as putative orthologs in these two pathovars using a reciprocal best-hit method, with 3,941 ORFs present in conserved, syntenic blocks. Although these two pathovars are highly similar at the physiological level, they have distinct host ranges; 1448A causes disease in beans, and DC3000 is pathogenic on tomato and Arabidopsis. Examination of the complement of ORFs encoding virulence, fitness, and survival factors revealed a substantial, but not complete, overlap between these two pathovars. Another distinguishing feature between the two pathovars is their distinctive sets of transposable elements. With access to a fifth complete pseudomonad genome sequence, we were able to identify 3,567 ORFs that likely comprise the core Pseudomonas genome and 365 ORFs that are P. syringae specific.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
The genomes of three strains of Listeria monocytogenes that have been associated with food-borne illness in the USA were subjected to whole genome comparative analysis. A total of 51, 97 and 69 strain-specific genes were identified in L.monocytogenes strains F2365 (serotype 4b, cheese isolate), F6854 (serotype 1/2a, frankfurter isolate) and H7858 (serotype 4b, meat isolate), respectively. Eighty-three genes were restricted to serotype 1/2a and 51 to serotype 4b strains. These strain- and serotype-specific genes probably contribute to observed differences in pathogenicity, and the ability of the organisms to survive and grow in their respective environmental niches. The serotype 1/2a-specific genes include an operon that encodes the rhamnose biosynthetic pathway that is associated with teichoic acid biosynthesis, as well as operons for five glycosyl transferases and an adenine-specific DNA methyltransferase. A total of 8603 and 105 050 high quality single nucleotide polymorphisms (SNPs) were found on the draft genome sequences of strain H7858 and strain F6854, respectively, when compared with strain F2365. Whole genome comparative analyses revealed that the L.monocytogenes genomes are essentially syntenic, with the majority of genomic differences consisting of phage insertions, transposable elements and SNPs.