The pace of technical advancement in microbial genomics has been breathtaking. Since 1995, when the first complete genome sequence of a free-living organism, Haemophilus influenzae, was published,1 1554 complete bacterial genome sequences (the majority of which are from pathogens) and 112 complete archaeal genome sequences have been determined, and more than 4800 and 90, respectively, are in progress.2 A total of 41 complete eukaryotic genome sequences have been determined (19 from fungi), and more than 1100 are in progress. Complete reference genome sequences are available for 2675 viral species, and for some of these species, a large number of strains have been completely sequenced. Nearly 40,000 strains of influenza virus3 and more than 300,000 strains of human immunodeficiency virus (HIV) type 1 have been partially sequenced.4 However, the selection of microbes and viruses for genome sequencing is heavily biased toward the tiny minority that are amenable to cultivation in the laboratory, numerically dominant in particular habitats of interest (e.g., the human body), and associated with disease.
In 2006, investigators reported in-depth metagenomic sequence data from a human mixed microbial community5; in 2007 more than 1000 genes from single cells of cultivation-resistant bacteria were identified.6 Since then, a flood of such data has ensued (Fig. 1).7–9 Individual investigators can now produce a draft sequence of a bacterial genome containing 4 million base pairs in about a day.10–12 The revolution in DNA-sequencing technology has to a large extent democratized microbial genomics and altered the way infectious diseases are studied.11 However, gene annotation and error correction still take time and effort. Today, the major challenges in microbial genomics are to predict the function of gene products and the behavior of organisms and communities from their sequences and to use genomic data to develop improved tools for managing infectious diseases.
The human body contains remarkable microbial taxonomic richness, with thousands of symbiont species and strains per individual host. Of these, an estimated 90% have not yet been cultivated in the laboratory.13 Differences between closely related strains and species are responsible for virulence, host-species adaptation, and other aspects of lifestyle and account for the individualized nature of the human microbiota. For example, the gene content of pathogenic and nonpathogenic strains of Escherichia coli, as well as different pathogenic types, varies by as much as 36%.14,15 Comparisons of complete genome sequences from multiple strains of the same bacterial species reveal a set of core genes that are common to all strains and a set of dispensable genes that are absent in at least one strain.16 The sum of these genes (i.e., those represented in at least one strain) constitutes the species pangenome.
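The core, dispensable, and pangenome definitions above map directly onto set operations. The following is a minimal sketch, assuming each strain has already been reduced to a set of gene-family identifiers; the strain and gene names are hypothetical toy data, and real analyses must first cluster orthologous genes, which is not shown here.

```python
from typing import Dict, Set, Tuple


def pangenome_partition(
    strain_genes: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (core, dispensable, pangenome) gene sets for a species."""
    if not strain_genes:
        raise ValueError("need at least one strain")
    gene_sets = list(strain_genes.values())
    core = set.intersection(*gene_sets)   # present in every strain
    pan = set.union(*gene_sets)           # present in at least one strain
    dispensable = pan - core              # absent from at least one strain
    return core, dispensable, pan


# Hypothetical toy data: identifiers are illustrative, not real annotations.
strains = {
    "strain_A": {"g1", "g2", "g3", "tox1"},
    "strain_B": {"g1", "g2", "g3"},
    "strain_C": {"g1", "g2", "g4"},
}
core, dispensable, pan = pangenome_partition(strains)
print(sorted(core), sorted(dispensable), len(pan))
```

In this toy example the pangenome grows whenever a newly sequenced strain contributes a gene family not seen before, whereas the core set can only shrink or stay the same.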
As compared with the genomes of plants and animals, genomes of microbes are small and usually contain one or two chromosomes, as well as a variable number of plasmids (see Glossary). Yet, approximately 90% of a typical microbial genome encodes proteins or structural RNAs,17 whereas only about 1.1% of the human genome is coding sequence.18 As a result, some complex bacteria have more genes than some simple eukaryotes.
Microbial diversification and adaptation have been accompanied by gene loss and genome reduction, genome rearrangement, horizontal gene transfer, and gene duplication.19,20 The first two of these processes are especially evident in human-specific pathogens, such as Bordetella pertussis (the causative agent of whooping cough),21,22 Tropheryma whipplei (the agent of Whipple's disease),23 and Yersinia pestis (the agent of bubonic plague). A total of 3.7% of Y. pestis genes appear to be inactive, especially those associated with enteropathogenicity.24 The genome of Mycobacterium leprae, the cause of leprosy, provides an even more dramatic example of reductive evolution. Protein-coding genes account for less than half of its genome, whereas inactive and fragmented genes account for most of the remainder.25
Genomic islands are discrete clusters of contiguous genes found in bacterial chromosomes and plasmids, usually between 10,000 and 200,000 base pairs in length with features that suggest a history and origin distinct from other segments of the genome (see Glossary).26,27 Some islands are stably assimilated into the genome; others appear to have been acquired recently and may still be mobile. Genomic islands enhance the fitness of the recipient by providing new, accessory functions, such as pathogenicity, drug resistance, or catabolic functions.
One of the most dramatic examples of short-term genome evolution can be seen in the CRISPR (clustered regularly interspaced short palindromic repeat) loci of bacteria and archaea. CRISPRs serve as a defense against invading phages and plasmids, in a manner akin to adaptive immunity.28 These genomic loci contain segments of phage and plasmid sequences captured from previous encounters. These segments are stored within the CRISPR loci as spacer sequences and are expressed as small RNAs, which then interfere with replication of newly encountered phages and plasmids that bear the same sequences.
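The capture-and-recognize logic described above can be illustrated with a deliberately simplified model. This sketch assumes exact string matching between a stored spacer and either strand of an incoming sequence; it ignores the Cas proteins, flanking-motif requirements, and mismatch tolerance of real systems, and the sequences and the 8-base spacer length are hypothetical toy values chosen for readability.

```python
from typing import List

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def matching_spacers(spacers: List[str], invader: str) -> List[str]:
    """List stored spacers that occur on either strand of the invader."""
    forward = invader.upper()
    reverse = reverse_complement(forward)
    return [s for s in spacers if s.upper() in forward or s.upper() in reverse]


def acquire_spacer(spacers: List[str], invader: str, start: int,
                   length: int = 8) -> List[str]:
    """Return a new spacer list with a fragment of the invader prepended.

    Prepending is a modeling simplification used in this sketch.
    """
    fragment = invader[start:start + length].upper()
    return [fragment] + list(spacers)


# Hypothetical toy sequences, not real phage or plasmid DNA.
phage = "ATGCGTACCGGTTAGC"
locus = ["GGGGAAAA"]                       # no memory of this phage yet
before = matching_spacers(locus, phage)    # first encounter: no match
locus = acquire_spacer(locus, phage, start=4)
after = matching_spacers(locus, phage)     # second encounter: recognized
print(before, after)
```

The point of the sketch is the ordering of events: the locus has no match on first exposure, records a fragment, and recognizes the same sequence on re-exposure, which is what makes the comparison with adaptive immunity apt.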
Differences in the sequence and structure of genomes from members of a microbial population reflect the composite effects of mutation, recombination, and selection. With the increasing availability of genome sequences, these effects have become better characterized and more effectively exploited so as to understand the history and evolution of microbes and viruses and their sometimes intimate relationships with humans. The resulting insights have practical importance for epidemiologic investigations, forensics, diagnostics, and vaccine development.29
Y. pestis, the cause of the Black Death, arose through genome reduction and gene loss from a more genetically diverse ancestor that was related to Y. pseudotuberculosis. An analysis of approximately 1200 single-nucleotide polymorphisms (SNPs) in a worldwide collection of strains placed the origins of this monomorphic pathogen between 2600 and 28,000 years ago in China, from which it spread to other areas of the world, giving rise to country-specific lineages.30 All the Y. pestis strains that are found in the United States today are descendants of a single import that probably arrived in San Francisco in 1899. As another example, patterns of early human migration have been traced by comparing genome sequences from contemporary isolates of the chronic gastric pathogen Helicobacter pylori.31 Transmission of the pathogen is primarily from the mother or other household members to the baby, and colonization is usually lifelong; thus, pathogen sequences are reasonable markers of host ancestry and host migration. Sequence data for the H. pylori genome indicate the sequential timing and directionality of two distinct waves of human migration into the Pacific region.32 Population mosaicism in H. pylori gene sequences has been used to infer the history of social interactions in human populations.31
The power of full-genome sequencing to discriminate between closely related strains and track real-time evolution of disease-associated clonal isolates offers the possibility of tracing person-to-person transmission and identifying point sources of outbreaks. Using this approach, investigators established a previously unrecognized link among five patients with the same clonal strain of methicillin-resistant Staphylococcus aureus from a hospital in Thailand.33 A study of Vibrio cholerae genome sequences from the October 2010 cholera outbreak in Haiti suggested that the Haitian strains were clonal and more closely related to strains from Bangladesh that were isolated in 2002 and 2008 than to strains isolated in Peru in 1991 and in Mozambique in 2004. The authors concluded that the Haitian outbreak may have originated with the introduction of a V. cholerae strain from South Asia as a result of human activity rather than climatic events or other local environmental factors.12 However, the source of this outbreak has not been fully resolved; genome sequences of environmental strains and additional clinical isolates from Haiti may provide further insight.
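The strain comparisons described above rest on counting sequence differences between isolates. A minimal sketch follows, assuming the genomes have already been aligned to equal length and that a simple pairwise count of SNPs stands in for the phylogenetic methods used in the cited studies; the sequences and reference labels are hypothetical toy data.

```python
from typing import Dict


def snp_distance(a: str, b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions with gaps or ambiguous calls in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1 for x, y in zip(a.upper(), b.upper())
        if x != y and x in valid and y in valid
    )


def closest_reference(query: str, references: Dict[str, str]) -> str:
    """Return the name of the reference with the fewest SNPs to the query."""
    return min(references, key=lambda name: snp_distance(query, references[name]))


# Hypothetical aligned fragments; real comparisons span whole genomes.
outbreak_isolate = "ACGTACGTACGT"
references = {
    "ref_1": "ACGTACGTACGA",   # differs at 1 position
    "ref_2": "ACCTACGAACGA",   # differs at 3 positions
}
nearest = closest_reference(outbreak_isolate, references)
print(nearest, snp_distance(outbreak_isolate, references[nearest]))
```

A small distance to one reference shows only that it is the closest match among the strains sampled, which is why the text notes that additional environmental and clinical isolates may still change the picture.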
A major challenge is the prediction of patterns of evolution and emergence of disease agents. The antigenic evolution of influenza virus is known to follow a punctuated equilibrium model in which periods of relative virus stability around the globe are followed by periods of rapid change, requiring modification of the influenza vaccine. However, it was not clear whether variants arise first in East and Southeast Asia and then seed other geographic regions or whether strains persist locally and evolve simultaneously in a similar fashion. An analysis of the gene encoding hemagglutinin (the major antigenic determinant) from more than 1000 human influenza A (H3N2) isolates that were collected worldwide from 2002 through 2007 produced strong support for external seeding, rather than local persistence, and suggested that the source of seeding is East and Southeast Asia.34 On the basis of whole-genome sequence analysis, the novel 2009 human H1N1 influenza strain was thought to have entered the human population in January of that year after arising from multiple swine virus progenitors that had probably been circulating in swine populations undetected for at least a decade.35 Work of this type will help target efforts regarding influenza virus surveillance more effectively, refine the selection of vaccine strains, and improve predictions of future antigenic characteristics.36 Similar approaches will assist in anticipating the emergence and spread of antibiotic and antiviral resistance.
Pathogens have received most of the attention in microbial genomics, despite their relative rarity in the microbial world.17,19 As a result, we now have a more complete and deeper understanding of how microbes cause disease and of pathogen emergence, host adaptation, and spread in human populations. The study of microbial genomes reveals four themes with respect to virulence.
First, horizontal gene transfer (see Glossary) has had a major role in the acquisition of genes associated with virulence. Most genes that encode virulence factors are physically segregated in clusters and located within mobile genetic elements. In S. aureus, these genes often occur within phage-related chromosomal islands and encode a variety of superantigens, including the toxin associated with the toxic shock syndrome and staphylococcal enterotoxin B, and encode factors that mediate antibiotic resistance, biofilm induction, and other virulence-associated properties (Fig. 2).37 Genomic islands with similar features occur in other gram-positive bacteria, including streptococcus, enterococcus, and lactococcus species. The emergence of the recent Shiga toxin–producing E. coli clone in Germany was probably the result of horizontal gene transfer, when a toxin-producing phage infected an enteroaggregative E. coli strain.38
Second, symbionts and avirulent relatives of pathogens often contain many of the same virulence-associated genes as do the microbes that typically cause disease.39 The genes that we commonly associate with virulence may have been selected for the advantages they confer in promoting colonization of animal and plant hosts, in avoiding or surviving phagocytosis, and in enhancing competition against symbionts.40–42 For example, the original role for bacterial toxins may have been to protect the bacterium against predation by protozoa and nematodes. The legionella protein IcmT facilitates the escape of the bacterium from human macrophages and also from the far more ancient predator, the free-living amoeba.43 Virulence depends on the choreographed expression of particular combinations of genes at the right place and time in the right host. Commensals and other symbionts also serve as reservoirs of antibiotic resistance genes and genetic diversity.44
A third theme is the surprising diversity of genes associated with mechanisms of virulence. In a study of four closely related fungal species, all of which cause late blight disease but in different host plant species, investigators identified specific regions of the fungal genomes with evidence of accelerated rates of evolution, suggesting that these regions have been under strong positive selective pressure.45 The genes in these regions produce effector molecules (see Glossary) that interact with host plant proteins and elicit host cell death. One of these fungal pathogens, the agent responsible for the 19th-century Irish potato famine, expresses 196 related effectors of unexpected complexity and diversity.46
A fourth theme is genome reduction and pseudogene formation (see Glossary), especially in pathogens with a relatively specialized lifestyle and with restricted numbers and types of habitats, niches, and hosts.20 This is illustrated by an unusual multidrug-resistant strain of Salmonella enterica serovar Typhimurium that emerged in sub-Saharan Africa in the early 1990s to become the most common cause of invasive bacterial disease in some regions of that continent. Bacteremia and meningitis are common features of this disease, as they are for typhoid and paratyphoid fevers. The genome sequence of this strain reveals a large number of partially degraded and deleted genes, many of which are also degraded or deleted in the genomes of salmonella serovars Typhi and Paratyphi A.47
The role of the human indigenous microbiota in human health and disease has received a great deal of attention in the past 5 years.7,8,13,48 Surveys of bacterial phylogenetic diversity that are based on comparative analyses of ribosomal RNA gene sequences recovered directly from clinical specimens have confirmed habitat- and individual-specific patterns in healthy persons.49,50 Yet, core features of the indigenous microbial communities are conserved in healthy persons.51
A metagenomic analysis (see Glossary) of fecal samples from 124 healthy European subjects identified an average of 536,112 unique genes in each of these samples, 99.1% of which were bacterial and 0.8% of which were archaeal, and a total of 3.3 million unique genes overall, or 150 times the number of genes in the human genome.9 Approximately 38% of an individual's fecal gene pool is shared by at least half of all other individuals. The shared gene products are predicted to mediate degradation of complex sugars, such as pectin and sorbitol, and of glycans harvested from the host diet or intestinal lining, as well as fermentation of mannose, fructose, cellulose, and sucrose (to short-chain fatty acids) and vitamin biosynthesis. These conserved genes constitute an accessory human genome that facilitates dietary energy harvest and nutrition. Alterations in the human microbiome are associated with a number of diseases in which no single organism seems to explain either the presence or the absence of disease. For these diseases (of which Crohn's disease is a leading example), the concept of community as pathogen has been proposed.52 Elucidation of the role played by altered microbial communities in such conditions, and of the associated mechanisms, is likely to emerge from the application of genomic approaches during the next decade.
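The statistic that a given share of one person's gene pool is also found in at least half of all other individuals can be made concrete with a small calculation. This is a sketch under simplifying assumptions: each sample is represented as a set of gene identifiers, the sample and gene names are hypothetical, and the gene-calling and clustering steps of a real metagenomic pipeline are omitted.

```python
from typing import Dict, Set


def shared_fraction(samples: Dict[str, Set[str]], subject: str,
                    min_share: float = 0.5) -> float:
    """Fraction of the subject's genes found in at least `min_share`
    of the other individuals' samples."""
    own = samples[subject]
    others = [genes for name, genes in samples.items() if name != subject]
    if not own or not others:
        raise ValueError("need a non-empty subject sample and other samples")
    needed = min_share * len(others)
    shared = {
        gene for gene in own
        if sum(gene in other for other in others) >= needed
    }
    return len(shared) / len(own)


# Hypothetical toy cohort of five individuals.
cohort = {
    "s1": {"a", "b", "c", "d"},
    "s2": {"a", "b"},
    "s3": {"a", "c"},
    "s4": {"a", "b", "e"},
    "s5": {"a", "f"},
}
fraction = shared_fraction(cohort, "s1")
print(fraction)
```

Here genes a and b are each found in at least two of the four other individuals, so half of the gene pool of s1 counts as shared under this definition.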
Genomic approaches have introduced a new era in the discovery and detection of microbial pathogens. The robustness, reliability, and portability of molecular sequence-based data for phylogenetic assessments and for characterization of previously unrecognized pathogens, coupled with technology developments, recommend genomic approaches for both research and routine clinical applications53–63 (Table 1 and Fig. 3; interactive graphic, available with the full text of this article at NEJM.org). Broad-range molecular methods for microbial discovery were introduced two decades ago.54,64 Approaches for targeting differentially abundant or phylogenetically informative molecules have now been joined by less efficient but more powerful methods for broad sequence surveys of clinical and environmental samples with the use of high-density DNA microarrays55,65 and shotgun sequencing56,66 (see Glossary). The advantages of DNA microarrays include the simultaneous detection of diverse sequences with widely varying relative abundance and recovery of captured sequences of interest directly from the microarray. A panviral DNA microarray with oligonucleotides designed from all known viral genera was used to characterize the novel causative agent of the severe acute respiratory syndrome (SARS)55 and has been used to detect viruses in nasopharyngeal aspirates from children with a variety of acute respiratory syndromes.65 The disadvantages of DNA microarrays include their insensitivity to rare microbial sequences in the presence of highly abundant host sequences (i.e., those obtained from host tissues) and their reliance on previous knowledge of microbial sequence diversity for oligonucleotide design.
High-throughput shotgun sequencing offers important new opportunities for the detection and discovery of microbial pathogens. This approach has revealed both previously known viruses (e.g., rotavirus, adenovirus, calicivirus, and astrovirus) and unknown viruses (e.g., novel types of picobirnavirus, enterovirus, TT virus, and norovirus) in fecal samples from children with unexplained acute diarrhea66 and a novel Old World arenavirus that caused fatal disease in three recipients of organs from a single donor.56 Dramatic advances in sequencing technology highlight the need to understand the diversity of microbial sequences in healthy subjects and to develop better methods for distinguishing rare, genuine microbial sequences from sequencing errors.
Sequence-based characterization of pathogens enables the design and development of sensitive and specific diagnostic assays and, in some cases, methods for cultivation of the pathogen. Characterization of the 16S ribosomal RNA gene from the agent of Whipple's disease, T. whipplei, led to a molecular diagnostic assay for this disease agent.67 Subsequent determination of its complete genome sequence23,68 provided additional potential target sequences and the basis for a more sensitive diagnostic test.61 It also provided insight into the metabolic defects of this bacterium, such that cell-free growth medium could be designed to include missing, needed growth factors.69
Genome sequences provide the blueprint for essential microbial and viral components, the disruption of which can lead to growth inhibition and death. These same sequences can sometimes indicate resistance of the microbe or virus to a particular drug. Although drug susceptibility and resistance are often governed by multiple genetic components, some drug-resistance traits are encoded by single genes and can therefore be easily predicted by detecting or sequencing such genes. Examples include rifampin resistance in M. tuberculosis, methicillin resistance in S. aureus, trimethoprim–sulfamethoxazole resistance in T. whipplei,70 and resistance to some antiretroviral drugs in HIV. Genome sequences have also provided new targets and leads for the development of new antimicrobials.
The standard of care for the management of HIV infection now includes targeted drug selection with the use of a profile for HIV-drug susceptibility that is derived from the sequence of the infecting HIV species.59 Testing for genotypic resistance is recommended for patients with HIV infection when they enter care and when there is a suboptimal reduction in viral load while they are receiving first- or second-line antiretroviral regimens. Clinically important resistance mutations occur in HIV genes encoding the reverse-transcriptase, protease, envelope, and integrase proteins. Interpretation of these mutant genotypes is facilitated by several databases, including those maintained by the International Antiviral Society–USA71 and a research group at Stanford University.72 Genotypic analysis is cheaper and faster than phenotypic analysis for HIV-drug resistance and is often more sensitive for detecting resistant strains within mixtures of drug-susceptible viruses.73 However, commercial assays of both types do not routinely detect resistant viruses when they are less than 10 to 20% of the overall circulating virus population. With newer sequencing techniques, less abundant strains are easier to detect and characterize. Although the clinical relevance of rare resistant variants is not fully understood, the pretreatment detection of such variants has been shown to have clinical value.58 Traditional phenotypic testing (measuring the ability of the virus to replicate in the presence of the antiviral drug) is still recommended for patients in whom viruses are suspected of having complex drug-resistance mutation patterns.
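The detection limit described above can be expressed as a frequency threshold applied to sequencing reads at a single position. The sketch below assumes that reads have already been assigned to a wild-type or variant call at that position; the counts and variant label are hypothetical, the 20% threshold comes from the range quoted in the text, and the 1% threshold for deeper sequencing is an illustrative assumption only. Real pipelines must also model sequencing error, which this example does not.

```python
from typing import Dict, List


def variant_frequencies(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-variant read counts at one position into frequencies."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return {variant: count / total for variant, count in read_counts.items()}


def detectable_variants(read_counts: Dict[str, int], wild_type: str,
                        threshold: float) -> List[str]:
    """Non-wild-type variants whose frequency meets the detection threshold."""
    freqs = variant_frequencies(read_counts)
    return sorted(
        variant for variant, freq in freqs.items()
        if variant != wild_type and freq >= threshold
    )


# Hypothetical counts: a resistant variant present in 6% of the population.
counts = {"wild_type": 9400, "variant_X": 600}
conventional = detectable_variants(counts, "wild_type", threshold=0.20)
deep = detectable_variants(counts, "wild_type", threshold=0.01)
print(conventional, deep)
```

With these toy numbers the variant is missed at the conventional threshold and reported at the lower one, which mirrors the gain in sensitivity attributed to newer sequencing techniques.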
Schistosomiasis is a chronic and debilitating disease that affects approximately 210 million people in 76 countries around the globe and results in some 280,000 deaths per year in sub-Saharan Africa alone. Praziquantel has been the drug of choice for the treatment of schistosomiasis but is in danger of losing efficacy because of parasite resistance. Schistosoma mansoni is one of three helminths for which there is now a draft genome sequence available to the public.74 Besides enabling the study of gene and protein expression,75 the nuclear genome of S. mansoni and its approximately 11,800 putative genes point to critical compounds and processes on which the worm depends to survive in its host. These compounds and processes reveal potential new drug targets, one of which is a redox enzyme, thioredoxin–glutathione reductase.74 Quantitative high-throughput screening of small-molecule libraries for compounds with activity against the S. mansoni thioredoxin–glutathione reductase has already identified some candidate drugs.62
Microbes produce a wealth of druglike molecules, the vast majority of which remain uncharacterized.76,77 Because many of these molecules are not expressed under typical laboratory conditions, they often escape detection when laboratory culture filtrates are screened for druglike properties. Some of these molecules can now be identified by recognizing the relevant genes in the parent organism's genome with the use of computational tools and detecting the molecules with mass spectrometry techniques.78,79 Derivative compounds can then be designed and tested.
In the same way that genome sequences reveal drug-resistance profiles, vulnerabilities, and synthetic capabilities of microbes and viruses, these sequences also provide clues about antigenic repertoire. This information can be exploited for vaccine design and other immunoprophylactic interventions. Genome-based antigen discovery has also been undertaken for more complex pathogens. One approach, known as reverse vaccinology, starts with the organism's complete genome sequence and involves cloning and expressing all proteins that are predicted to be secreted or surface-associated (Fig. 3).80 After mice are immunized with each of the proteins, each of the corresponding antiserum samples is tested for its ability to neutralize or kill the original target organism. On the basis of this approach, a small group of proteins from group B meningococcus,81 a pathogen that has so far eluded vaccine development, has shown promise as a candidate multivalent subunit vaccine. A similar approach has been taken with group B streptococcus82 and extraintestinal pathogenic E. coli.83 Protective antigens that are discovered through these sorts of methods may have been previously ignored because they are not immunogenic during natural infections.
Without question, the techniques for microbial and viral genome sequencing are becoming increasingly rapid and less expensive. Genome sequencing of a microbe or virus will soon be easier than characterization of its growth-based behavior in the laboratory. In the next 3 to 5 years, direct shotgun sequencing of the DNA and RNA in a clinical sample may become a routine matter. What is less clear is how clinically relevant information will be most effectively extracted from the ensuing massive amounts of data. In the near term, genomic and metagenomic analyses of microbes are most likely to be useful in areas such as the cataloguing and understanding of microbial and viral diversity in the human body, the identification of molecular determinants of virulence and symbiosis, and real-time tracking of particular strains of pathogens. Such analyses will also provide a deeper understanding of how pathogens spread and cause disease and will identify new targets for therapies and antigens for vaccines. Thoughtfully designed clinical and epidemiologic studies will be required to see the full realization of these benefits.
Disclosure forms provided by the author are available with the full text of this article at NEJM.org.