Multiple new prokaryotic C-terminal protein-sorting signals were found that reprise the tripartite architecture shared by LPXTG and PEP-CTERM: motif, TM helix, basic cluster. Defining hidden Markov models were constructed for all. PGF-CTERM occurs in 29 archaeal species, some of which have more than 50 proteins that share the domain. PGF-CTERM proteins include the major cell surface protein in Halobacterium, a glycoprotein with a partially characterized diphytanylglyceryl phosphate linkage near its C terminus. Comparative genomics identifies a distant exosortase homolog, designated archaeosortase A (ArtA), as the likely protein-processing enzyme for PGF-CTERM. Proteomics suggests that the PGF-CTERM region is removed. Additional systems include VPXXXP-CTERM/archeaosortase B in two of the same archaea and PEF-CTERM/archaeosortase C in four others. Bacterial exosortases often fall into subfamilies that partner with very different cohorts of extracellular polymeric substance biosynthesis proteins; several species have multiple systems. Variant systems include the VPDSG-CTERM/exosortase C system unique to certain members of the phylum Verrucomicrobia, VPLPA-CTERM/exosortase D in several alpha- and deltaproteobacterial species, and a dedicated (single-target) VPEID-CTERM/exosortase E system in alphaproteobacteria. Exosortase-related families XrtF in the class Flavobacteria and XrtG in Gram-positive bacteria mark distinctive conserved gene neighborhoods. A picture emerges of an ancient and now well-differentiated superfamily of deeply membrane-embedded protein-processing enzymes. Their target proteins are destined to transit cellular membranes during their biosynthesis, during which most undergo additional posttranslational modifications such as glycosylation.
TIGRFAMs, available online at http://www.jcvi.org/tigrfams is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.
Biofilms are dense microbial communities. Although widely distributed and medically important, how biofilm cells interact with one another is poorly understood. Recently, we described a novel process whereby myxobacterial biofilm cells exchange their outer membrane (OM) lipoproteins. For the first time we report here the identification of two host proteins, TraAB, required for transfer. These proteins are predicted to localize in the cell envelope; and TraA encodes a distant PA14 lectin-like domain, a cysteine-rich tandem repeat region, and a putative C-terminal protein sorting tag named MYXO-CTERM, while TraB encodes an OmpA-like domain. Importantly, TraAB are required in donors and recipients, suggesting bidirectional transfer. By use of a lipophilic fluorescent dye, we also discovered that OM lipids are exchanged. Similar to lipoproteins, dye transfer requires TraAB function, gliding motility and a structured biofilm. Importantly, OM exchange was found to regulate swarming and development behaviors, suggesting a new role in cell–cell communication. A working model proposes TraA is a cell surface receptor that mediates cell–cell adhesion for OM fusion, in which lipoproteins/lipids are transferred by lateral diffusion. We further hypothesize that cell contact–dependent exchange helps myxobacteria to coordinate their social behaviors.
All cells interact with their environment, including other cells, to elicit cellular responses. Cell–cell interactions between eukaryotic cells are widely appreciated as large multicellular organisms coordinate cell behaviors for tissue and organ functions. In bacteria cell–cell interactions are not widely appreciated, as these organisms are relatively simple and are often depicted as single-cell entities. However, over the past decade, the concept of bacteria living in microbial communities or biofilms has received broad acceptance as a major lifestyle. As biofilm cells are packed in tight physical contact, there is an opportunity for cell–cell signaling to provide spatial and physiological clues of neighboring cells to elicit cellular responses. Although much has been learned about diffusible signals through quorum sensing, little is known about cell contact–dependent signaling in bacteria. In this report we describe a new mechanism where bacterial cells within structured biofilms form contacts that allow cellular material to be exchanged. This exchange elicits phenotypic changes, including in cell movements and development. We hypothesize that OM exchange involves kin recognition that bestows social benefits to myxobacterial populations.
As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
The rhomboid family of serine proteases occurs in all domains of life. Its members contain at least six hydrophobic membrane-spanning helices, with an active site serine located deep within the hydrophobic interior of the plasma membrane. The model member GlpG from Escherichia coli is heavily studied through engineered mutant forms, varied model substrates, and multiple X-ray crystal studies, yet its relationship to endogenous substrates is not well understood. Here we describe an apparent membrane anchoring C-terminal homology domain that appears in numerous genera including Shewanella, Vibrio, Acinetobacter, and Ralstonia, but excluding Escherichia and Haemophilus. Individual genomes encode up to thirteen members, usually homologous to each other only in this C-terminal region. The domain's tripartite architecture consists of motif, transmembrane helix, and cluster of basic residues at the protein C-terminus, as also seen with the LPXTG recognition sequence for sortase A and the PEP-CTERM recognition sequence for exosortase. Partial Phylogenetic Profiling identifies a distinctive rhomboid-like protease subfamily almost perfectly co-distributed with this recognition sequence. This protease subfamily and its putative target domain are hereby renamed rhombosortase and GlyGly-CTERM, respectively. The protease and target are encoded by consecutive genes in most genomes with just a single target, but far apart otherwise. The signature motif of the Rhombo-CTERM domain, often SGGS, only partially resembles known cleavage sites of rhomboid protease family model substrates. Some protein families that have several members with C-terminal GlyGly-CTERM domains also have additional members with LPXTG or PEP-CTERM domains instead, suggesting there may be common themes to the post-translational processing of these proteins by three different membrane protein superfamilies.
Data mining methods in bioinformatics and comparative genomics commonly rely on working definitions of protein families from prior computation. Partial phylogenetic profiling (PPP), by contrast, optimizes family sizes during its searches for the cooccurring protein families that serve different roles in the same biological system. In a large-scale investigation of the incredibly diverse radical S-adenosylmethionine (SAM) enzyme superfamily, PPP aided in building a collection of 68 TIGRFAMs hidden Markov models (HMMs) that define nonoverlapping and functionally distinct subfamilies. Many identify radical SAM enzymes as molecular markers for multicomponent biological systems; HMMs defining their partner proteins also were constructed. Newly found systems include five groupings of protein families in which at least one marker is a radical SAM enzyme while another, encoded by an adjacent gene, is a short peptide predicted to be its substrate for posttranslational modification. The most prevalent, in over 125 genomes, featuring a peptide that we designate SCIFF (six cysteines in forty-five residues), is conserved throughout the class Clostridia, a distribution inconsistent with putative bacteriocin activity. A second novel system features a tandem pair of putative peptide-modifying radical SAM enzymes associated with a highly divergent family of peptides in which the only clearly conserved feature is a run of His-Xaa-Ser repeats. A third system pairs a radical SAM domain peptide maturase with selenocysteine-containing targets, suggesting a new biological role for selenium. These and several additional novel maturases that cooccur with predicted target peptides share a C-terminal additional 4Fe4S-binding domain with PqqE, the subtilosin A maturase AlbA, and the predicted mycofactocin and Nif11-class peptide maturases as well as with activators of anaerobic sulfatases and quinohemoprotein amine dehydrogenases. Radical SAM enzymes with this additional domain, as detected by TIGR04085, significantly outnumber lantibiotic synthases and cyclodehydratases combined in reference genomes while being highly enriched for members whose apparent targets are small peptides. Interpretation of comparative genomics evidence suggests unexpected (nonbacteriocin) roles for natural products from several of these systems.
Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies.
Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries.
ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
Regimens targeting Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), require long courses of treatment and a combination of three or more drugs. An increase in drug-resistant strains of M. tuberculosis demonstrates the need for additional TB-specific drugs. A notable feature of M. tuberculosis is coenzyme F420, which is distributed sporadically and sparsely among prokaryotes. This distribution allows for comparative genomics-based investigations. Phylogenetic profiling (comparison of differential gene content) based on F420 biosynthesis nominated many actinobacterial proteins as candidate F420-dependent enzymes. Three such families dominated the results: the luciferase-like monooxygenase (LLM), pyridoxamine 5′-phosphate oxidase (PPOX), and deazaflavin-dependent nitroreductase (DDN) families. The DDN family was determined to be limited to F420-producing species. The LLM and PPOX families were observed in F420-producing species as well as species lacking F420 but were particularly numerous in many actinobacterial species, including M. tuberculosis. Partitioning the LLM and PPOX families based on an organism's ability to make F420 allowed the application of the SIMBAL (sites inferred by metabolic background assertion labeling) profiling method to identify F420-correlated subsequences. These regions were found to correspond to flavonoid cofactor binding sites. Significantly, these results showed that M. tuberculosis carries at least 28 separate F420-dependent enzymes, most of unknown function, and a paucity of flavin mononucleotide (FMN)-dependent proteins in these families. While prevalent in mycobacteria, markers of F420 biosynthesis appeared to be absent from the normal human gut flora. These findings suggest that M. tuberculosis relies heavily on coenzyme F420 for its redox reactions. This dependence and the cofactor's rarity may make F420-related proteins promising drug targets.
Enzymes in the radical SAM (rSAM) domain family serve in a wide variety of biological processes, including RNA modification, enzyme activation, bacteriocin core peptide maturation, and cofactor biosynthesis. Evolutionary pressures and relationships to other cellular constituents impose recognizable grammars on each class of rSAM-containing system, shaping patterns in results obtained through various comparative genomics analyses.
An uncharacterized gene cluster found in many Actinobacteria and sporadically in Firmicutes, Chloroflexi, Deltaproteobacteria, and one Archaeal plasmid contains a PqqE-like rSAM protein family that includes Rv0693 from Mycobacterium tuberculosis. Members occur clustered with a strikingly well-conserved small polypeptide we designate "mycofactocin," similar in size to bacteriocins and PqqA, precursor of pyrroloquinoline quinone (PQQ). Partial Phylogenetic Profiling (PPP) based on the distribution of these markers identifies the mycofactocin cluster, but also a second tier of high-scoring proteins. This tier, strikingly, is filled with up to thirty-one members per genome from three variant subfamilies that occur, one each, in three unrelated classes of nicotinoproteins. The pattern suggests these variant enzymes require not only NAD(P), but also the novel gene cluster. Further study was conducted using SIMBAL, a PPP-like tool, to search these nicotinoproteins for subsequences best correlated across multiple genomes to the presence of mycofactocin. For both the short chain dehydrogenase/reductase (SDR) and iron-containing dehydrogenase families, aligning SIMBAL's top-scoring sequences to homologous solved crystal structures shows signals centered over NAD(P)-binding sites rather than over substrate-binding or active site residues. Previous studies on some of these proteins have revealed a non-exchangeable NAD cofactor, such that enzymatic activity in vitro requires an artificial electron acceptor such as N,N-dimethyl-4-nitrosoaniline (NDMA) for the enzyme to cycle.
Taken together, these findings suggest that the mycofactocin precursor is modified by the Rv0693 family rSAM protein and other enzymes in its cluster. It becomes an electron carrier molecule that serves in vivo as NDMA and other artificial electron acceptors do in vitro. Subclasses from three different nicotinoprotein families show "only-if" relationships to mycofactocin because they require its presence. This framework suggests a segregated redox pool in which mycofactocin mediates communication among enzymes with non-exchangeable cofactors.
A new family of natural products has been described in which cysteine, serine and threonine from ribosomally-produced peptides are converted to thiazoles, oxazoles and methyloxazoles, respectively. These metabolites and their biosynthetic gene clusters are now referred to as thiazole/oxazole-modified microcins (TOMM). As exemplified by microcin B17 and streptolysin S, TOMM precursors contain an N-terminal leader sequence and C-terminal core peptide. The leader sequence contains binding sites for the posttranslational modifying enzymes which subsequently act upon the core peptide. TOMM peptides are small and highly variable, frequently missed by gene-finders and occasionally situated far from the thiazole/oxazole forming genes. Thus, locating a substrate for a particular TOMM pathway can be a challenging endeavor.
Examination of candidate TOMM precursors has revealed a subclass with an uncharacteristically long leader sequence closely related to the enzyme nitrile hydratase. Members of this nitrile hydratase leader peptide (NHLP) family lack the metal-binding residues required for catalysis. Instead, NHLP sequences display the classic Gly-Gly cleavage motif and have C-terminal regions rich in heterocyclizable residues. The NHLP family exhibits a correlated species distribution and local clustering with an ABC transport system. This study also provides evidence that a separate family, annotated as Nif11 nitrogen-fixing proteins, can serve as natural product precursors (N11P), but not always of the TOMM variety. Indeed, a number of cyanobacterial genomes show extensive N11P paralogous expansion, such as Nostoc, Prochlorococcus and Cyanothece, which replace the TOMM cluster with lanthionine biosynthetic machinery.
This study has united numerous TOMM gene clusters with their cognate substrates. These results suggest that two large protein families, the nitrile hydratases and Nif11, have been retailored for secondary metabolism. Precursors for TOMMs and lanthionine-containing peptides derived from larger proteins to which other functions are attributed, may be widespread. The functions of these natural products have yet to be elucidated, but it is probable that some will display valuable industrial or medical activities.
Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets.
Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization.
SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites.
The complete genomes of three strains from the phylum Acidobacteria were compared. Phylogenetic analysis placed them as a unique phylum. They share genomic traits with members of the Proteobacteria, the Cyanobacteria, and the Fungi. The three strains appear to be versatile heterotrophs. Genomic and culture traits indicate the use of carbon sources that span simple sugars to more complex substrates such as hemicellulose, cellulose, and chitin. The genomes encode low-specificity major facilitator superfamily transporters and high-affinity ABC transporters for sugars, suggesting that they are best suited to low-nutrient conditions. They appear capable of nitrate and nitrite reduction but not N2 fixation or denitrification. The genomes contained numerous genes that encode siderophore receptors, but no evidence of siderophore production was found, suggesting that they may obtain iron via interaction with other microorganisms. The presence of cellulose synthesis genes and a large class of novel high-molecular-weight excreted proteins suggests potential traits for desiccation resistance, biofilm formation, and/or contribution to soil structure. Polyketide synthase and macrolide glycosylation genes suggest the production of novel antimicrobial compounds. Genes that encode a variety of novel proteins were also identified. The abundance of acidobacteria in soils worldwide and the breadth of potential carbon use by the sequenced strains suggest significant and previously unrecognized contributions to the terrestrial carbon cycle. Combining our genomic evidence with available culture traits, we postulate that cells of these isolates are long-lived, divide slowly, exhibit slow metabolic rates under low-nutrient conditions, and are well equipped to tolerate fluctuations in soil hydration.
Bacteriocins are peptide antibiotics from ribosomally translated precursors, produced by bacteria often through extensive post-translational modification. Minimal sequence conservation, short gene lengths, and low complexity sequence can hinder bacteriocin identification, even during gene calling, so they are often discovered by proximity to accessory genes encoding maturation, immunity, and export functions. This work reports a new subfamily of putative thiazole-containing heterocyclic bacteriocins. It appears universal in all strains of Bacillus anthracis and B. cereus, but has gone unrecognized because it is always encoded far from its maturation protein operon. Patterns of insertions and deletions among twenty-four variants suggest a repeating functional unit of Cys-Xaa-Xaa.
This article was reviewed by Andrei Osterman and Lakshminarayan Iyer.
Bacterial and Archaeal cells use selenium structurally in selenouridine-modified tRNAs, in proteins translated with selenocysteine, and in the selenium-dependent molybdenum hydroxylases (SDMH). The first two uses both require the selenophosphate synthetase gene, selD. Examining over 500 complete prokaryotic genomes finds selD in exactly two species lacking both the selenocysteine and selenouridine systems, Enterococcus faecalis and Haloarcula marismortui. Surrounding these orphan selD genes, forming bidirectional best hits between species, and detectable by Partial Phylogenetic Profiling vs. selD, are several candidate molybdenum hydroxylase subunits and accessory proteins. We propose that certain accessory proteins, and orphan selD itself, are markers through which new selenium-dependent molybdenum hydroxylases can be found.
This article was reviewed by Arcady Mushegian and Kira Makarova.
TIGRFAMs is a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions. Each family is based on a hidden Markov model (HMM), where both cutoff scores and membership in the seed alignment are chosen so that the HMMs can classify numerous proteins according to their specific molecular functions. Most TIGRFAMs models describe ‘equivalog’ families, where both orthology and lateral gene transfer may be part of the evolutionary history, but where a single molecular function has been conserved. The Genome Properties system contains a queriable set of metabolic reconstructions, genome metrics and extractions of information from the scientific literature. Its genome-by-genome assertions of whether or not specific structures, pathways or systems are present provide high-level conceptual descriptions of genomic content. These assertions enable comparative genomics, provide a meaningful biological context to aid in manual annotation, support assignments of Gene Ontology (GO) biological process terms and help validate HMM-based predictions of protein function. The Genome Properties system is particularly useful as a generator of phylogenetic profiles, through which new protein family functions may be discovered. The TIGRFAMs and Genome Properties systems can be accessed at and .
Protein translocation to the proper cellular destination may be guided by various classes of sorting signals recognizable in the primary sequence. Detection in some genomes, but not others, may reveal sorting system components by comparison of the phylogenetic profile of the class of sorting signal to that of various protein families.
We describe a short C-terminal homology domain, sporadically distributed in bacteria, with several key characteristics of protein sorting signals. The domain includes a near-invariant motif Pro-Glu-Pro (PEP). This possible recognition or processing site is followed by a predicted transmembrane helix and a cluster rich in basic amino acids. We designate this domain PEP-CTERM. It tends to occur multiple times in a genome if it occurs at all, with a median count of eight instances; Verrucomicrobium spinosum has sixty-five. PEP-CTERM-containing proteins generally contain an N-terminal signal peptide and exhibit high diversity and little homology to known proteins. All bacteria with PEP-CTERM have both an outer membrane and exopolysaccharide (EPS) production genes. By a simple heuristic for screening phylogenetic profiles in the absence of pre-formed protein families, we discovered that a homolog of the membrane protein EpsH (exopolysaccharide locus protein H) occurs in a species when PEP-CTERM domains are found. The EpsH family contains invariant residues consistent with a transpeptidase function. Most PEP-CTERM proteins are encoded by single-gene operons preceded by large intergenic regions. In the Proteobacteria, most of these upstream regions share a DNA sequence, a probable cis-regulatory site that contains a sigma-54 binding motif. The phylogenetic profile for this DNA sequence exactly matches that of three proteins: a sigma-54-interacting response regulator (PrsR), a transmembrane histidine kinase (PrsK), and a TPR protein (PrsT).
These findings are consistent with the hypothesis that PEP-CTERM and EpsH form a protein export sorting system, analogous to the LPXTG/sortase system of Gram-positive bacteria, and correlated to EPS expression. It occurs preferentially in bacteria from sediments, soils, and biofilms. The novel method that led to these findings, partial phylogenetic profiling, requires neither global sequence clustering nor arbitrary similarity cutoffs and appears to be a rapid, effective alternative to other profiling methods.
Clustered regularly interspaced short palindromic repeats (CRISPRs) are a family of DNA direct repeats found in many prokaryotic genomes. Repeats of 21–37 bp typically show weak dyad symmetry and are separated by regularly sized, nonrepetitive spacer sequences. Four CRISPR-associated (Cas) protein families, designated Cas1 to Cas4, are strictly associated with CRISPR elements and always occur near a repeat cluster. Some spacers originate from mobile genetic elements and are thought to confer “immunity” against the elements that harbor these sequences. In the present study, we have systematically investigated uncharacterized proteins encoded in the vicinity of these CRISPRs and found many additional protein families that are strictly associated with CRISPR loci across multiple prokaryotic species. Multiple sequence alignments and hidden Markov models have been built for 45 Cas protein families. These models identify family members with high sensitivity and selectivity and classify key regulators of development, DevR and DevS, in Myxococcus xanthus as Cas proteins. These identifications show that CRISPR/cas gene regions can be quite large, with up to 20 different, tandem-arranged cas genes next to a repeat cluster or filling the region between two repeat clusters. Distinctive subsets of the collection of Cas proteins recur in phylogenetically distant species and correlate with characteristic repeat periodicity. The analyses presented here support initial proposals of mobility of these units, along with the likelihood that loci of different subtypes interact with one another as well as with host cell defensive, replicative, and regulatory systems. It is evident from this analysis that CRISPR/cas loci are larger, more complex, and more heterogeneous than previously appreciated.
The family of clustered regularly interspaced short palindromic repeats (CRISPRs) describes a class of DNA repeats found in nearly half of all bacterial and archaeal genomes. These DNA repeat regions have a remarkably regular structure: unique sequences of constant size, called spacers, sit between each pair of repeats. The DNA repeats do not encode proteins, but appear to be transcribed and processed into small RNAs that may have any number of functions, including resistance to any phage (i.e., virus of bacteria) whose sequence matches a spacer; spacers change rapidly as microbial strains evolve. This work describes 41 new CRISPR-associated (cas) gene families, which are always found near these repeats, in addition to the four previously known. It shows that CRISPR systems belong to different classes, with different repeat patterns, sets of genes, and species ranges. Most of these seem to come and go rather rapidly from their host genomes. These possibly beneficial mobile genetic elements may play an important role in driving prokaryotic evolution.
The complete 2,343,479-bp genome sequence of the gram-negative, pathogenic oral bacterium Porphyromonas gingivalis strain W83, a major contributor to periodontal disease, was determined. Whole-genome comparative analysis with other available complete genome sequences confirms the close relationship between the Cytophaga-Flavobacteria-Bacteroides (CFB) phylum and the green-sulfur bacteria. Within the CFB phyla, the genomes most similar to that of P. gingivalis are those of Bacteroides thetaiotaomicron and B. fragilis. Outside of the CFB phyla the most similar genome to P. gingivalis is that of Chlorobium tepidum, supporting the previous phylogenetic studies that indicated that the Chlorobia and CFB phyla are related, albeit distantly. Genome analysis of strain W83 reveals a range of pathways and virulence determinants that relate to the novel biology of this oral pathogen. Among these determinants are at least six putative hemagglutinin-like genes and 36 previously unidentified peptidases. Genome analysis also reveals that P. gingivalis can metabolize a range of amino acids and generate a number of metabolic end products that are toxic to the human host or human gingival tissue and contribute to the development of periodontal disease.
TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, Pfam and InterPro models. These models are designed to support both automated and manually curated annotation of genomes. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. The scope of each model is set by raising or lowering cutoff scores and choosing members of the seed alignment to group proteins sharing specific function (equivalog) or more general properties. The overall goal is to provide information with maximum utility for the annotation process. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains. The database currently contains over 1600 protein families. TIGRFAMs is available for searching or downloading at www.tigr.org/TIGRFAMs.
TIGRFAMs is a collection of protein families featuring curated
multiple sequence alignments, hidden Markov models and associated
information designed to support the automated functional identification
of proteins by sequence homology. We introduce the term ‘equivalog’ to
describe members of a set of homologous proteins that are conserved
with respect to function since their last common ancestor. Related proteins
are grouped into equivalog families where possible, and otherwise
into protein families with other hierarchically defined homology
types. TIGRFAMs currently contains over 800 protein families, available
for searching or downloading at www.tigr.org/TIGRFAMs. Classification
by equivalog family, where achievable, complements classification
by orthology, superfamily, domain or motif. It provides the information best
suited for automatic assignment of specific functions to proteins
from large-scale genome sequencing projects.
Sorting nexin 1 (SNX1) is a protein that binds to the epidermal growth factor (EGF) receptor and is proposed to play a role in directing EGF receptors to lysosomes for degradation (R. C. Kurten, D. L. Cadena, and G. N. Gill, Science 272:1008–1010, 1996). We have obtained full-length cDNAs and deduced the amino acid sequences of three novel homologous proteins, which were denoted human sorting nexins (SNX2, SNX3, and SNX4). In addition, we identified a presumed splice variant isoform of SNX1 (SNX1A). These molecules contain a conserved domain of ∼100 amino acids, which was termed the phox homology (PX) domain. Human SNX1 (522 amino acids), SNX1A (457 amino acids), SNX2 (519 amino acids), SNX3 (162 amino acids), and SNX4 (450 amino acids) are part of a larger family of hydrophilic molecules including proteins identified in Caenorhabditis elegans and Saccharomyces cerevisiae. Despite their hydrophilic nature, the sorting nexins are found partially associated with cellular membranes. They are widely expressed, although the tissue distribution of each sorting nexin mRNA varies. When expressed in COS7 cells, epitope-tagged sorting nexins SNX1, SNX1A, SNX2, and SNX4 coimmunoprecipitated with receptor tyrosine kinases for EGF, platelet-derived growth factor, and insulin. These sorting nexins also associated with the long isoform of the leptin receptor but not with the short and medium isoforms. Interestingly, endogenous COS7 transferrin receptors associated exclusively with SNX1 and SNX1A, while SNX3 was not found to associate with any of the receptors studied. Our demonstration of a large conserved family of sorting nexins that interact with a variety of receptor types suggests that these proteins may be involved in several stages of intracellular trafficking in mammalian cells.
The dimorphic prosthecate bacteria (DPB) are α-proteobacteria that reproduce in an asymmetric manner rather than by binary fission and are of interest as simple models of development. Prior to this work, the only member of this group for which genome sequence was available was the model freshwater organism Caulobacter crescentus. Here we describe the genome sequence of Hyphomonas neptunium, a marine member of the DPB that differs from C. crescentus in that H. neptunium uses its stalk as a reproductive structure. Genome analysis indicates that this organism shares more genes with C. crescentus than it does with Silicibacter pomeroyi (a closer relative according to 16S rRNA phylogeny), that it relies upon a heterotrophic strategy utilizing a wide range of substrates, that its cell cycle is likely to be regulated in a similar manner to that of C. crescentus, and that the outer membrane complements of H. neptunium and C. crescentus are remarkably similar. H. neptunium swarmer cells are highly motile via a single polar flagellum. With the exception of cheY and cheR, genes required for chemotaxis were absent in the H. neptunium genome. Consistent with this observation, H. neptunium swarmer cells did not respond to any chemotactic stimuli that were tested, which suggests that H. neptunium motility is a random dispersal mechanism for swarmer cells rather than a stimulus-controlled navigation system for locating specific environments. In addition to providing insights into bacterial development, the H. neptunium genome will provide an important resource for the study of other interesting biological processes including chromosome segregation, polar growth, and cell aging.