|Home | About | Journals | Submit | Contact Us | Français|
The author(s) have made the following declarations about their contributions: Conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper: The International Aphid Genomics Consortium.
Aphids are important agricultural pests and also biological models for studies of insect-plant interactions, symbiosis, virus vectoring, and the developmental causes of extreme phenotypic plasticity. Here we present the 464 Mb draft genome assembly of the pea aphid Acyrthosiphon pisum. This first published whole genome sequence of a basal hemimetabolous insect provides an outgroup to the multiple published genomes of holometabolous insects. Pea aphids are host-plant specialists, they can reproduce both sexually and asexually, and they have coevolved with an obligate bacterial symbiont. Here we highlight findings from whole genome analysis that may be related to these unusual biological features. These findings include discovery of extensive gene duplication in more than 2000 gene families as well as loss of evolutionarily conserved genes. Gene family expansions relative to other published genomes include genes involved in chromatin modification, miRNA synthesis, and sugar transport. Gene losses include genes central to the IMD immune pathway, selenoprotein utilization, purine salvage, and the entire urea cycle. The pea aphid genome reveals that only a limited number of genes have been acquired from bacteria; thus the reduced gene count of Buchnera does not reflect gene transfer to the host genome. The inventory of metabolic genes in the pea aphid genome suggests that there is extensive metabolite exchange between the aphid and Buchnera, including sharing of amino acid biosynthesis between the aphid and Buchnera. The pea aphid genome provides a foundation for post-genomic studies of fundamental biological questions and applied agricultural problems.
Aphids are common pests of crops and ornamental plants. Facilitated by their ancient association with intracellular symbiotic bacteria that synthesize essential amino acids, aphids feed on phloem (sap). Exploitation of a diversity of long-lived woody and short-lived herbaceous hosts by many aphid species is a result of specializations that allow aphids to discover and exploit suitable host plants. Such specializations include production by a single genotype of multiple alternative phenotypes including asexual, sexual, winged, and unwinged forms. We have generated a draft genome sequence of the pea aphid, an aphid that is a model for the study of symbiosis, development, and host plant specialization. Some of the many highlights of our genome analysis include an expanded total gene set with remarkable levels of gene duplication, as well as aphid-lineage-specific gene losses. We find that the pea aphid genome contains all genes required for epigenetic regulation by methylation, that genes encoding the synthesis of a number of essential amino acids are distributed between the genomes of the pea aphid and its symbiont, Buchnera aphidicola, and that many genes encoding immune system components are absent. These genome data will form the basis for future aphid research and have already underpinned a variety of genome-wide approaches to understanding aphid biology.
Aphids are small, soft-bodied insects with elaborate life cycles that include all-female, parthenogenetic generations that alternate with sexual generations (Figure 1). Aphids feed exclusively on plant phloem sap by inserting their slender mouthparts into sieve elements, the primary food conduits of plants. Many of the ~5,000 aphid species attack agricultural plants and inflict damage both through the direct effects of feeding and by vectoring debilitating plant viruses. Annual worldwide crop losses due to aphids are estimated at hundreds of millions of dollars ,,.
Phloem sap is rich in simple sugars but contains an unbalanced mixture of amino acids. This unbalanced diet is compensated for by the intracellular mutualistic bacterium, Buchnera aphidicola (Figure 2), which has coevolved with aphids  and provides essential amino acids that are absent or rare in phloem sap . Additionally, some aphids, including the pea aphid, have facultative associations with a variety of other heritable bacterial symbionts that provide ecological benefits, such as heat tolerance and resistance to parasitoids .
Aphids, which are essentially plant parasites, have evolved complex life cycles involving extensive phenotypic plasticity . They produce individuals with multiple distinct phenotypes (polyphenism), so that individuals with identical genotypes can develop into one of several alternative phenotypes, each adapted to a particular ecological situation (Figure 1). Aphids develop as asexual live-bearing females or as sexual males and egg-laying females during different seasons. Asexual females occur as sedentary wingless forms or as winged forms specialized for dispersal. In many aphid species, individuals from different stages of the life cycle may feed on distinct sets of plant species. In addition, some aphid species produce morphs that are specialized to resist desiccation or to defend the colony. Asexual forms have evolved a highly modified meiosis that omits the reduction division of Meiosis I, allowing apomictic parthenogenesis. Parthenogenetically produced embryos develop directly within their mothers, sometimes before the birth of the mother herself, so that females can end up carrying both their daughters and their granddaughters within them. This telescoping of generations promotes short generation times, allowing aphid colonies to rapidly exploit new resources. Like other hemimetabolous insects, aphids undergo an incomplete metamorphosis from juvenile to adult stages.
Here we present the genome sequence of the pea aphid, Acyrthosiphon pisum. This aphid, which is widely used in laboratory studies, attacks legume crops (Fabaceae) and is closely related to important crop pests, including the green peach aphid (Myzus persicae) and the Russian wheat aphid (Diuraphis noxia) . This first published hemimetabolous genome, coupled with the genomes of its obligate and facultative bacterial symbionts ,,, provides a strong foundation for exploring the genetic basis of coevolved symbiotic associations, of host plant specialization, of insect-plant interactions, and of the developmental causes of extreme phenotypic plasticity. We first provide an overview of the general features of the pea aphid genome and then review findings of manual gene annotation efforts focused on genes related to symbiosis, insect-plant interactions, and development. Additional findings from these annotation projects can be found in multiple companion papers ,–.
The haploid pea aphid genome of four holocentric chromosomes (three autosomes and one X chromosome) was estimated by flow cytometry for the sequenced pea aphid line LSR1.AC.G1 to be 517 Mb (SE=3.15 Mbp, N=7). Sanger sequencing of DNA samples from line LSR1.AC.G1 produced 4.4 million raw sequence reads (6.2× genome coverage, Table S1) of which 3.05 million were in the final assembly. This Acyr 1.0 assembly contains 72,844 contigs, with an N50 length of 10.8 kb and a total length of 446.6 Mb. The scaffold N50 is 88.5 kb, and scaffolds including gaps between the ordered and oriented contigs had a total length of 464 Mb. To estimate the gene coverage of the assembly, 97,878 ESTs (5′-EST: 49,991; 3′-EST: 47,837; ) generated from a full-length A. pisum cDNA library were mapped to the Acyr 1.0 assembly. Ninety-nine percent of these EST sequences were mapped in Acyr 1.0, and 81% of the clones had both 5′- and 3′-ESTs mapping to the same scaffold with appropriate separation distance and opposite orientations. No sequences with high similarity to the ~170,000 available ESTs were found in the unassembled reads, suggesting that few protein-coding genes remain in the unassembled fraction of the dataset.
The assembled regions of the pea aphid genome have the lowest GC content of any insect genome sequenced to date; at 29.6%, pea aphid GC content is 5.2% lower than that of Apis mellifera at 34.8% . Computed over all concatenated transcripts pea aphid GC content averages 38.8% (SD=8.4, N=37,994), a value similar to that of Apis mellifera (mean=38.6%, SD=9.7, N=17,182) (Table S2).
Prior to this project, less than 200 pea aphid genes had been sequenced. Thus, we performed automated gene predictions to aid study of the pea aphid gene repertoire. High-quality gene models with either partial or full-length EST and/or protein homology support computed by NCBI's gene prediction pipeline serve as a core set of 10,249 protein-coding gene models and are integrated into the public RefSeq databases at NCBI. Since the number of gene models with EST or protein homology support is expected to be smaller than the true number of protein-coding genes in the pea aphid genome, additional gene models were calculated using six additional gene prediction programs and combined, using GLEAN , into a consensus set of 24,355 additional gene models (Table 1). When compared to 2,089 exons of known origin and sequence, the GLEAN consensus gene models contained the highest number of bases overlapping the known exons. Other details of this comparison are in Table S3, and a comparison of pea aphid and other arthropod gene structures is shown in Table S4.
Ab initio prediction requires the detection of intron/exon junctions based on rules observed from the major spliceosome machinery. However, some introns are excised by the minor spliceosome driven by the U12 small snoRNA, and these introns are poorly predicted by ab initio algorithms. We identified 134 putative U12 introns in the pea aphid genome representing the most identified in any insect. This high number of U12 introns likely complicates ab initio gene modeling in the pea aphid.
The combined total of 34,604 gene predictions includes unsupported ab initio models, partial gene models, and genes incorrectly shown as duplicated in the Acyr_1.0 assembly (see below). This estimate is likely, therefore, to exceed the true number of protein-coding genes. Nevertheless, the combined set of computational gene predictions provided a foundation for subsequent analyses, including manual annotation of 2,010 genes.
We took advantage of the first genome for a hemipteran species to perform a whole genome-based species phylogeny of the insects. The resulting phylogeny, based on 197 genes with single copy orthologs, is congruent with previous phylogenetic analyses  and places the pea aphid together with Pediculus humanus, another member of the para-neoptera clade, basal to the Holometabola (Figure 3). Comparing gene content across this phylogeny revealed that the pea aphid shares 30%–55% (e-value<10−3) of its genes in its complete gene set with other sequenced insects, with the highest overlap with Nasonia vitripennis and Tribolium castaneum (53% in both cases) (Figure 3). However, 37% of predicted pea aphid genes have no significant hits (e-value<10−3) with genes identified to date in any other species. This large number of orphan genes may reflect high rates of false positive gene predictions or distinctive properties of the aphid genome, or both.
Beyond these comparisons—which are based on BLAST searches of aphid genes against other insect gene sets—we employed a phylogeny-based homology prediction pipeline , to generate the pea aphid phylome: a phylogenetic tree and orthology prediction for every predicted, non-orphan A. pisum protein. Although rampant duplications have produced large gene families (see below), phylogeny-based orthology predictions allowed us to directly transfer GO annotations to 4,058 pea aphid genes that display one-to-one orthology relationships with annotated Drosophila melanogaster genes.
Analysis of the pea aphid phylome revealed 2,459 gene families that appear to have undergone aphid lineage-specific duplications, a number greater than that of any other sequenced insect genome (Figure 4A). Only the genome of the crustacean Daphnia pulex appears to have experienced a similar level of lineage-specific duplications . The largest gene family expansions, involving 19 families with 50 to 200 members, encode reverse transcriptase and transposase domains probably representing pieces of transposable elements (TEs). However, most gene family expansions do not involve TEs. Notable examples include approximately 200 lineage-specific paralogs of the Drosophila gene kelch, which encodes an actin-binding protein involved in ovarian follicle cell migration and oogenesis (Gene tree ACYPI51424-PA in phylomeDB), and 19 paralogs of a putative Acetyl-CoA transporter (Figure 4B). This high level of gene duplication in the pea aphid genome is widespread among different types of genes, and numerous additional examples are discussed below.
To provide a time scale for the origin of aphid-specific duplications, we estimated the synonymous distances (dS values) among all paralog pairs, which were identified using a within-genome reciprocal best blast hit. Because the sequenced line showed some heterozygosity, divergence between truly paralogous gene pairs could be confounded with allelic variation, but this should be a problem only for very close pairs of paralogs, since divergence values for allelic variants in most systems are generally very low (<1%). The large majority of gene pairs have higher divergence (dS > 0.05) than this allelic variant cut-off value, and thus can be assumed to represent true paralogs. Paralog pairs display a wide range of dS values, suggesting that gene duplication has occurred for an extended time in the pea aphid lineage. The elevated gene duplication rate appears to have started early in aphid evolution, since the oldest paralog pairs within the pea aphid genome show dS values that are comparable to the dS values for ortholog pairs between pea aphid and Aphis gossypii, a species from a different aphid subfamily (Figure 5).
The pea aphid, similar to other non-dipteran insects, possesses a single candidate telomerase gene and the canonical arthropod telomere repeat of TTAGG . Examination of raw read mate pairs revealed long stretches of TTAGG repeats at presumptive chromosome ends. Of the expected eight telomeres, we identified simple TTAGG repeats at the ends of five scaffolds: two contain relatively long repeat stretches of apparently true TTAGG simple repeat telomeres, while three are similar to the telomeres of Bombyx and Tribolium and contain non-LTR retrotransposon insertions ,.
Approximately 38% of the assembled genome is composed of TEs. We identified 13,911 consensus TE sequences in the pea aphid genome using REPET, a TE annotation pipeline. The consensus TE sequences were grouped by sequence similarity and classified according to their structural and coding features into 1,883 TE families (consisting of two or more consensus sequences) and 1,672 singletons. Within the 1,883 TE families, we manually curated 85 families including the largest families representative of widespread TE groups, such as LTRs, LINEs, SINEs, TIRs, and Helitrons (Table 2). The curated repeats account for 4% of the genome, and less complex repeat families with few sequence variants remain uncurated and account for 34% of the pea aphid genome. Of the curated repeats, most super-families represent old invasions, as indicated by the distribution of nucleotide identities between sequences within TE families (Figure 6).
Like the hymenopteran honey bee and parasitic wasp Nasonia and unlike other insects with sequenced genomes, the pea aphid has a full complement of DNA methylation genes, with orthologs for two maintenance DNA methyltransferases (Dnmt1a and Dnmt1b), two de novo DNA methyltransferases (Dnmt3a and Dnmt3X), and the Dnmt2 found in all sequenced insect genomes. In addition to the DNA methyltransferases, we also identified a single putative methyl-DNA-binding-domain-containing gene involved in the recruitment of chromatin modification enzymes.
Methylated C nucleotides in CpGs—the sites of known DNA methylation in pea aphid—are prone to deamination to uracil, after which DNA repair machinery can produce thymidine. Thus, an excess of CpG sites over those expected at random can provide evidence for purifying selection maintaining CpG sites for methylation. This approach has been used previously to successfully predict methylated genes . We investigated the frequency in aphid genes of CpG sites compared with the frequency expected based on the low overall GC content. Pea aphids, like Apis mellifera, exhibit a double peak in the frequency of genes with different ratios of observed/expected CpG content, a pattern different than that of Drosophila melanogaster and of Tribolium castaneum (Figure 7). The double peak suggests two broad classes of genes with different methylation status. Direct examination of DNA methylation states will be required to confirm that two major groups of pea aphid genes are differentially regulated by methylation.
Micro RNA and small interfering RNA gene silencing participates in regulation of eukaryotic gene expression . We identified 163 microRNAs, including 52 conserved and 111 orphan microRNAs. We also found an expansion of gene families related to miRNA-related gene regulation (Figure 8). This expansion includes four copies of pasha, a co-factor of drosha involved in the first step of miRNA biosynthesis, a duplication of dicer-1, an RNAse involved in the processing of miRNAs, and a duplication of Argonaute-1, the key protein of the multiprotein RNA Induced Silencing Complex (RISC). These gene family expansions are present in other aphid species , but no other metazoa outside the aphids appear to have duplications of these genes.
Most aphid species harbor the obligate, mutualistic, primary symbiont, Buchnera aphidicola (Gamma proteobacteria), within the cytoplasm of specialized cells called bacteriocytes. These bacteria are passed from mother to eggs during oogenesis in sexual forms and directly to developing embryos during embryogenesis of asexual morphs .
Although this sequencing project was designed to target the genome of A. pisum, the project also generated sequences of the primary symbiotic bacteria, Buchnera aphidicola APS. We obtained 24,947 sequence reads corresponding to ~20× coverage of the Buchnera genome. Assembly of this sequence and PCR-based gap closure allowed reconstruction of the complete 642,011-base-pair genome of Buchnera (Genbank Accession ACFK00000000). Compared with the first sequenced strain from Japan , the new strain (from North America) shows approximately 1,500 mismatches (0.23%) and two larger inserts (1.2 kbp and 150 bp). The newly sequenced strain is almost 100% identical to a cluster of five recently sequenced Buchnera strains from pea aphids collected in North America (CP001161; ).
Besides Buchnera, aphids often harbor facultative heritable symbiotic bacteria known as secondary symbionts, of which different strains have been shown to protect pea aphid hosts from heat stress, fungal pathogens, and parasitoid wasps . As part of the pea aphid genome project, the genomic sequence of the secondary symbiont Regiella insecticola was obtained . Along with the recently completed sequence for the secondary symbiont Hamiltonella defensa , these data contrast with the genomes of Buchnera and other obligate symbionts, illustrating the genomic underpinnings of two very different symbiotic lifestyles. Buchnera possesses a highly reduced genome largely comprised of genes essential for basic cellular processes and aphid nutrition. Its chromosome is unusually stable and completely lacks mobile elements, bacteriophage, or genes for toxin production. In contrast, H. defensa and R. insecticola possess phage genes, many mobile elements, and numerous genes predicted to encode toxins ,,. For example, about 12% of all R. insecticola genes are homologous to transposases of mobile elements, and 5% of genes are phage-related, suggesting a highly dynamic genome especially as compared to Buchnera and other small genome symbionts.
The pea aphid genome provides a first opportunity for an exhaustive search for genes of bacterial origin in the genome of a eukaryotic host showing persistent associations with heritable bacterial symbionts. Besides their ancient association with Buchnera and facultative associations with Regiella and other symbionts within the Enterobacteriaceae , aphids sometimes harbor Spiroplasma species, Rickettsia species, and Wolbachia species as heritable endosymbionts.
Screening of the genome project data for bacterial sequences revealed a large number of genes of apparent bacterial origin, even after vector contaminants had been screened out. However, a majority of these were on small contigs (mostly under 5 kb) that did not contain evident aphid sequence; PCR experiments on a subsample of such genes supported their identity as bacterial contaminants in the dataset rather than as true transferred genes (Table S5). A minority of apparent bacterial genes was present on larger contigs, some of which contained genes of evident insect origin, suggesting that these represented true transferred genes. Phylogenetic analyses, incorporating homologous genes from prokaryotes and eukaryotes, supported the bacterial origin of 12 such genes or gene fragments, extending previous findings of gene transfer from a bacterial lineage to the aphid genome ,. Apparent transferred genes included those encoding LD-carboxypeptidases (LdcA), N-acetylmuramoyl-L-alanine amidase (AmiD), 1,4-beta-N-acetylmuramidase, and rare lipoprotein A (RlpA). Several of the genes originating from bacteria were previously detected as transcripts expressed in bacteriocytes , where some are highly expressed . The coding regions of most of these genes are intact. Another source of transferred DNA is the mitochondrial genome, and aphids were one of the first animals for which transferred mitochondrial genes were reported . In the pea aphid genome, a total of 56 mitochondrial gene sequences were detected. All of these transferred mitochondrial genes have been pseudogenized through substitutions and deletions, and some transferred sequences have been duplicated.
Our findings indicate that overall aphids have acquired few functional genes via lateral gene transfer from bacteria. However, these few genes may be critical in the maintenance of the symbioses exhibited by aphids.
The pea aphid genome provides insight into the intimate metabolic associations between an insect host and obligate bacterial symbiont, revealing how the pea aphid's amino acid and purine metabolism might be adapted to support essential amino acid synthesis and nitrogen recycling by Buchnera. Manual annotation of metabolism genes reveals that, like other animals, the pea aphid lacks the capacity for de novo synthesis of nine protein-amino acids (histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine). All genes underlying the urea cycle are also missing, rendering the pea aphid incapable of synthesizing a further amino acid, arginine.
A global view of the metabolism of the pea aphid as inferred from genome sequence data is available at AcypiCyc, a dedicated BioCyc database (see http://pbil.univ-lyon1.fr/software/cycads/acypicyc/home and Table S6) . This analysis highlighted several noteworthy features of pea aphid metabolism. First, the genetic capacities of pea aphids and of Buchnera for amino acid biosynthesis are broadly complementary, an effect that can be attributed principally to gene loss from Buchnera ,. This complementarity results in several apparent instances of metabolic pathways shared between the pea aphid and Buchnera (Figure 9). For example, the aphid genome includes a gene for glutamine synthetase 2, which is highly expressed in the bacteriocytes that house Buchnera . This raises the possibility that bacteriocytes actively synthesize glutamine, which is then utilized by Buchnera as an amino donor in several metabolic pathways, including arginine synthesis. Second, the pea aphid apparently lacks two core genes of the purine salvage pathway, adenosine deaminase and purine nucleoside phosphorylase, as well as genes necessary for the urea cycle. The absence of these genes makes it unlikely that aphids can produce uric acid or urea, an inference consistent with the absence of detectable uric acid or urea in pea aphid excreta .
Analyses revealed an additional unusual trait with implications for metabolism. Neither the aphid nor Buchnera has the genetic capacity to utilize selenocysteine, the 21st protein amino acid. Selenocysteine is encoded by the codon UGA, normally a stop codon. A number of specific genes and factors comprise the selenoprotein machinery required to recode UGA to selenocysteine . Although cysteine homologs were found for some selenoproteins, no homolog was found for the known insect selenoproteins, nor did we find a tRNA for selenocysteine. Additionally we searched for the selenoprotein machinery genes (SBP2, Efsec, Secp43, pstk, SecS, SPS1, and SPS2) and found only SPS1, which appears to not function in selenocysteine biosynthesis in insects  and SecS. Buchnera does not have the genetic capacity to compensate for these gene losses. Together, these findings strongly suggest that A. pisum lacks the capacity to make selenoproteins, a trait atypical for an animal ,.
The aphid immune system is expected to be critical in determining responses to microbial symbionts . Orthologs of the key components of the immune-related Toll, Jak/Stat, and JNK signaling pathways are present in the pea aphid genome. However, other immune response pathways appear to be absent (Figure 10). Specifically, many of the genes comprising the IMD (Immunodeficiency) pathway, including IMD, dFADD, Dredd, and Relish, could not be detected in the pea aphid genome. The IMD pathway is intact in genomes of other sequenced insects , and some of these IMD pathway genes are found in the crustacean, Daphnia pulex . Furthermore, the pea aphid genome also lacks recognizable peptidoglycan recognition proteins (PGRPs), which detect certain pathogens and trigger the IMD and Toll pathways in Drosophila . Additionally, manual annotation identified few antimicrobial peptide (AMP) genes, which are produced in response to activated immune pathways. Consistent with this, studies of immune-challenged pea aphids—using a variety of assays (SSH, ESTs, HPLC) that have successfully identified AMP genes in other species—recovered no AMPs from bacteria-challenged or fungal-challenged aphids ,. These studies found that during immune challenges, aphids up-regulate few genes of known immune function and few novel genes that could be associated with an alternate immune response. Together our observations suggest that, in comparison to previously studied insects, aphids have a reduced immune repertoire. Reduced immune capabilities could facilitate the acquisition and maintenance of microbial symbionts, a hypothesis testable in other obligately symbiotic systems. An alternate possibility is that rapid reproduction and a largely microbe-free diet of phloem sap, decrease selective pressures on the aphid to maintain costly immune protection.
Plant volatiles are important cues for host plant recognition by aphids. In insects, such cues enter the antennae, bind to odorant-binding proteins (OBPs) , and are transported to chemoreceptors ,,,, which then activate a cascade of events leading to sensory neuron activity. Chemoreceptors include basal gustatory receptors (GRs) and more derived odorant receptors (ORs). Chemosensory proteins (CSPs) are also thought to be involved in chemoreception.
We identified 15 genes encoding putative OBPs and 13 putative CSP genes. By way of contrast, other insects also have more OBPs than CSPs . Zhou et al. (2009) also identified highly conserved orthologs for 10 of the 15 pea aphid OBPs in nine other aphid species .
We identified 79 genes in the OR family, including intact, partially annotated genes, and putative pseudogenes. An ortholog of the highly conserved DmOr83b gene  was named ApOr1. As in other sequenced genomes, the remainder of the OR genes represent aphid-specific expansions with no orthologs in other insects.
The pea aphid GR family contains at least 77 genes. There are six members of the well-conserved sugar receptor subfamily and no homologs of the highly conserved carbon dioxide receptors found in holometabolous insects . The remaining 71 GR genes are orphans. Overall, the number of the OR and GR chemoreceptor classes does not differ substantially from that seen in other insects. Smadja et al. found that for both the OR and GR genes, some subfamilies appear to have resulted from relatively old duplication events, whereas others represent recent duplication events . The rapid evolution of some OR and GR genes might be related to host plant specialization observed in A. pisum (for example, ,), because host plant acceptance has been shown to rely mainly on chemosensory processes .
Responsible for transmission of 28% of known plant viruses, aphids show four modes of virus transmission; (1) non-persistent (stylet-borne), (2) semi-persistent (foregut-borne), (3) persistent circulative, and (4) persistent propagative . The persistent circulative mode of transmission is exploited by members of the Luteoviridae family, which are transmitted specifically by aphids. Because luteovirids are transported by membrane trafficking mechanisms, proteins involved in endocytosis, vesicle transport, and exocytosis are potentially involved in virus transmission. As expected, we found genes for such proteins in the pea aphid genome. Of particular interest, we found 12 genes encoding a novel type of dynamin, which are large GTPases involved in membrane dynamic processes.
As an herbivore, the pea aphid is likely to overcome plant chemical defenses, at least in part, by employing detoxification enzymes, including cytochrome P450 monooxygenases (P450s), glutathione S-transferases (GSTs), and carboxyl/choline esterases (CCEs). From the genome sequence, 83 potential pea aphid P450 genes have been identified, but only 58 of these have a complete P450 domain and good homology to other insect P450s. Although previously studied insects harbor six classes of GSTs , the 20 identified pea aphid GSTs belong to only three of these classes. The CCE gene family has 29 members in the pea aphid, all of which appear to encode functional proteins. Although the pea aphid has fewer detoxification enzymes than the non-herbivorous insects whose genomes have been examined (Drosophila, Anopheles, and Tribolium), it possesses more than the pollinator Apis mellifera .
The osmotic pressure of phloem sap is significantly greater than that of aphid hemolymph , and thus sugar transport can occur down a concentration gradient. Consistent with this we find that sodium-sugar symporters, proteins that facilitate movement against concentration gradients, are absent from the pea aphid genome. Instead, sugar transport from gut to hemolymph apparently relies on uniporters, proteins that exploit favorable concentration gradients to transport sugars from the gut into epithelial cells, and from epithelial cells into the hemolymph. The pea aphid genome contains a large number of uniporter-encoding genes, including approximately 200 genes encoding proteins of the major facilitator superfamily (MFS). Companion work  found that the most abundant sugar transporter transcript encodes a uniporter with capacity to transport both fructose and glucose. The pea aphid with 34 sugar/inositol transporter genes has more than Drosophila melanogaster (15 genes), Apis mellifera (17 genes), Anopheles gambiae (22 genes), and Bombyx mori (19 genes), but less than Tribolium castaneum (54 genes) . Among these 34 pea aphid sugar/inositol transporter genes, 8 occur as either tandem repeats or inverted repeats, suggesting that they may have resulted from recent duplication events. Adaptation of aphids to an “extreme” diet requiring specialized sugar transport has likely contributed to the evolutionary expansion of this gene family.
As hemimetabolous insects, aphids undergo incomplete metamorphosis, passing through a series of molts involving four immature instars to reach the adult stage. Aphids display a wide range of adult phenotypes (Figure 1) and possess two divergent modes of embryonic development: parthenogenetic and sexual embryogenesis .
The majority of genes involved in axis formation, segmentation, neurogenesis, eye development, and germ-line specification in the embryo are well-conserved. Genes playing critical roles in Drosophila embryogenesis, but thus far not found outside the Diptera, are also missing from aphids, including oskar (germ-line specification), bicoid (anterior development), and gurken (dorso-ventral patterning). Despite the absence of these orthologs, the downstream components of the developmental pathways to which they belong are well-conserved. Lineage-specific gene losses were found for giant, huckebein, and orthodenticle-1. Orthologs of some genes involved in establishing the body plan, such as spätzle and dorsal, have undergone aphid-specific gene duplications. There are also two paralogs of torso-like, the gene encoding the most conserved molecule in the terminal patterning pathway.
In arthropods, chitin contributes to the structure of the cuticle (i.e., the lining of the tracheae, foregut, and hindgut; and the exoskeleton). There are three major classes of chitin-binding proteins. The pea aphid genome contains a large expansion of the first class, genes containing the R&R consensus sequence , and multiple copies of the second class, genes with a cysteine-based chitin-binding domain (CBD). For the third class, genes containing a chitin deacetylase domain, the pea aphid genome encodes five of the six main types. Consistent with the aphid's lack of a peritrophic membrane, the sixth type, which is located in the peritrophic membrane of other insects, is absent in the pea aphid. Compared to other insects, the pea aphid has fewer genes encoding chitinase, an enzyme with chitinolytic activities that degrades old cuticle. This difference possibly reflects the fact that hemimetabolous insects, which do not undergo a complete metamorphosis to the adult form, do not require dramatic exoskeletal reconstruction.
Genes of the highly conserved TGF-β, Wnt, EGF, and JAK/STAT signaling pathways, all utilized in development, have undergone several aphid-specific duplications and losses. Multiple paralogs of Dpp (4 paralogs), Medea (5), Mad (2), Domeless (4), STAT (2), Argos (4), and Armadillo (2) are found in the pea aphid genome. These gene expansions are of particular note because duplications of genes that encode the components of signaling pathways are rare in animals . Conversely, we identified aphid lineage-specific gene losses for several TGF-β ligands (BMP10, Maverick, and Alp23), Wnt ligands (Wnt6, Wnt10), and Sprouty (RTK signaling inhibitor).
The pea aphid genome contains 640 putative sequence-specific transcription factors. Most of the transcription factor families are similar in size and composition to those of other insects. However, the pea aphid genome encodes significantly more zinc-finger-containing proteins than other insects with sequenced genomes. Although the number of bHLH encoding genes is similar to other insects, orthologs of the achaete-scute genes, which are required for neurogenesis and bristle development in Drosophila and are found in other (holometabolous) insect genomes, were not found. All Hox complex genes are present, but Hox3 (zen) and ftz, which have evolved non-homeotic functions in insects, are highly divergent from the orthologs of other species.
JH has been implicated in regulating aphid reproductive polyphenism ,. The main enzymes responsible for the synthesis and degradation of JH are present in the pea aphid genome, and several of these developmental genes are methylated , supporting the hypothesis that methylation could play a role in the developmental plasticity of aphids as it does in other insects (Table 3). The pea aphid apparently lacks other JH associated proteins such as hexamerins, which constitute a class of JH binding proteins implicated in many physiological processes including caste regulation of lower termites .
Aphids exhibit plasticity in meiosis and the cell cycle, allowing for both sexual reproduction and parthenogenesis. Most genes involved in meiosis and the cell cycle in vertebrates and yeasts are present in the pea aphid genome, while other sequenced insect genomes show lineage-specific losses of individual genes or gene family members . While genes known to regulate the transition from G1 (growth) to S (DNA replication) phases of the cell cycle in metazoans are present in aphids (Figure 11A), the pea aphid genome also contains lineage-specific duplications of several mitotic regulators, such as Cdk1, Polo, Wee1, Cdc25, and Aurora (Figure 11B). In addition, the pea aphid genome contains lineage-specific duplications of several mitosis-related genes, including Smc6 (structural maintenance of chromosomes 6) and Topo2 (DNA Topoisomerase 2). These genes are single copy in other insects with sequenced genomes but duplicated in the Crustacean, Daphnia pulex, which is also capable of both sexual and asexual reproduction .
Neuropeptides and biogenic amines are cell-to-cell signaling molecules that act as hormones, neurotransmitters, and/or neuromodulators . By homology search, we found 42 genes encoding at least 70 neuropeptides and neurohormones. Expressed sequence tag and proteomic analyses suggest that many of these genes are active . The vasopressin (which in insects is called inotocin, from insect oxytocin/vasopressin-related peptide; ), sulfakinin, and corazonin precursor genes and their respective receptors were not found. Corazonin has been found previously in several hemipteran species  and is involved in the regulation of migratory phase transition in Locusta and Schistocerca . The pea aphid is the first sequenced insect genome lacking a sulfakinin gene. We found 18 biogenic amine G protein-coupled receptor (GPCR) genes and 42 genes encoding neuropeptide and protein hormone GPCRs. In general, there is excellent agreement between the presence or absence of neuropeptides and the presence or absence of their GPCRs.
Circadian clocks are internal oscillators governing daily cycles of activity and are proposed to underlie responses to day-night cycle, the most important cue triggering aphid reproductive polyphenism. In Drosophila, the circadian clock is regulated by two interdependent transcriptional feedback loops involving several genes of which the genes period and clock occupy a central position . All core genes from both loops were found in the pea aphid genome (Figure 12). The pea aphid Clock feedback loop shows high conservation of Clock, Vrille, and Pdp1. In contrast the period/timeless feedback loop is not well conserved. Two other participants at the core of the circadian clock, the cryptochromes Cry1 and Cry2 , are present in the pea aphid genome. Cry2, which is absent in Drosophila but present in single copy in all non-drosophilid insects, is duplicated in A. pisum, a pattern similar to that found in many vertebrates. Additional genes required for the Drosophila circadian clock, including the kinases double-time, shaggy, casein kinase 2, protein phosphatase 2a, and the protein degradation protein Supernumerary Limbs, are found in the pea aphid genome. We did not detect the F-box protein jetlag, which is necessary for light entrainment in Drosophila (Figure 12) .
Aphid sex determination is chromosomal. Females have two X chromosomes and males have only one . We searched the A. pisum genome for homologs of 32 sex-determination-related genes previously characterized in Drosophila melanogaster. Of the 32 genes, pea aphid homologs of 22 (69%) were identified. Like the honeybee, the pea aphid has homologs of the penultimate gene (transformer 2) and the DM-DNA binding domain of the ultimate gene (doublesex) genes of the D. melanogaster sex determination pathway. Multiple hits to four of the 32 genes were found in the pea aphid, all representing recent duplication events.
Major results from analyses of the pea aphid genome can be summarized as follows:
Our analysis of the pea aphid genome has begun to reveal the genetic underpinnings of this animal's complex ecology—including its capacity to parasitize agricultural crops, its association with microbial symbionts, and its developmental patterning. One project benefiting from the availability of the genome sequence is the investigation of aphid saliva proteins  thought critical for host plant feeding. This highlights the ability of the genome to facilitate future exploration of both basic and applied biological problems.
The parental line of the sequenced aphid clone, LSR1, was collected in a field of alfalfa (Medicago sativa) near Ithaca, New York, in 1998 . Aphids for DNA isolation resulted from a single generation of inbreeding to produce LSR1.AC.G1. The LSR1.AC.G1 aphid line was grown from a single female and treated with ampicillin to remove R. insecticola. Prior to DNA preparation, aphids were heat treated to reduce the number of Buchnera cells; entire aphid colonies on broad bean plants were placed in a 30°C incubator for 4 d. RT qPCR quantification of Buchnera/aphid DNA ratios revealed a significant decrease in the level of Buchnera relative to aphids not subjected to heat. Approximately 2% of the sequencing reads came from the Buchnera genome and were removed for separate assembly of Buchnera genome.
The genome size of LSR1.AC.G1 was estimated from single heads of seven asexual females by flow cytometry as described in  against D. melanogaster strain Iso-1, 1C=175 Mb (provided by Gerald Rubin, University of California, Berkley, CA, USA).
3.13 million Sanger sequence reads were produced on 3,730 sequencing (Applied Biosystems, Foster city, CA, USA) machines and assembled using the Atlas assembly pipeline, representing about 464 Mb of sequence and about 6.2× coverage of the (clonable) A. pisum genome. Two whole genome shotgun libraries, with inserts of 2–3 kb and 4–5 kb and a BAC library with insert size ~130 kb were used to produce the data. The LSR1.AC.G1 pea aphid genome sequence is available from the NCBI with project accession ABLF01000000.
We took two complementary approaches to automated gene prediction. First, for high-quality evidence-based gene models, we used the NCBI evidence-based RefSeq pipeline. Second, because EST and protein homology evidence was insufficient for the RefSeq pipeline to generate a comprehensive gene model set, we supplemented the RefSeq models with a GLEAN  consensus set of gene models based on a collection of ab initio gene predictors.
The NCBI RefSeq pipeline uses a combination of homology searching with ab initio modeling. First cDNAs and ESTs were aligned to the genomic sequences using Splign  and proteins were aligned to the genomic sequences using ProSplign . The best scoring coding sequence was identified for all cDNA alignments using the same scoring system used by Gnomon , the NCBI ab initio prediction tool. All cDNAs with a coding sequence scoring above a certain threshold were marked as coding cDNAs, and all others were marked as UTRs. Coding sequences that lack a translation initiation or termination signal were categorized as incomplete. Protein alignments were scored the same way, and coding sequences that did not satisfy the threshold criterion for a valid coding sequence were removed. After determining the UTR/CDS nature of each alignment, the alignments were assembled using a modification of the Maximal Transcript Alignment algorithm , accounting for not only exon-intron structure compatibility but also the compatibility of the reading frames. Two coding alignments were connected only if they both had open and compatible coding sequences. UTRs were connected to coding alignments only if the necessary translation initiation or termination signals were present. There were no restrictions on the connection of UTRs other than the exon-intron structure compatibility. All assembled models with a complete coding sequence, including the translation initiation and termination signals, were combined into alternatively spliced isoform groups. Incomplete or partially supported models were directed to Gnomon  for extension by ab initio prediction. Models containing a debilitating mutation such as a frameshift or nonsense mutation were categorized as either transcribed or non-transcribed pseudogenes. A subset of pseudogenes are likely to be functional genes that have errors in the Acyr_1.0 assembly and may be reclassified as protein-coding genes with subsequent improvements to the assembly and annotation. Gnomon  was also used to predict pure ab initio models in regions of the genome that lacked any cDNA, EST, or protein alignments.
Our supplemental GLEAN consensus gene model set of 36,606 was generated with input gene model sets from six different gene predictors: Augustus, FgenesH, FgenesH++, NCBI Gnomon, Maker, and NCBI RefSeq. Of these gene models, 12,251, overlapped RefSeq gene models by 100 bp or more, and in these cases, the RefSeq models were used. The final automated gene model set contains 34,604 gene models (Table 1).
Using results of computational annotation as a baseline, members of the International Aphid Genomics Consortium manually curated over 2,000 genes of biological interest. Briefly, sequences of target genes from other arthropods were utilized to blast search the RefSeq gene set, Gnomon predictions, scaffolds, and unassembled reads. Homology of putative aphid genes was verified using a combination of reciprocal blast and information garnered from phylomeDB and other phylogenetic analyses. Gene models (e.g., starts and stops, exon boundaries) were then manually refined based on available EST and full-length cDNA support, as well as alignment with homologs from other taxa. Manual curation was facilitated by an Apollo instance directly integrated with AphidBase (see below).
The pea aphid assembled genome sequence data has been comprehensively scanned and annotated to highlight transcription evidence. ESTs, EST contigs, and full-length cDNAs have been mapped to the genome using SIM-4, whereas homologs in other insect genomes or Uniprot have been identified by high-throughput BLAST searches. All of the approximately 170,000 ESTs and 200 full-length cDNAs, as well as gene models generated by different programs (Augustus, RefSeq, Genscan, Maker, Snap, GeneID, Gnomon, and Fgenesh) and RefSeq and Glean gene model repertoires, were loaded into a GMOD-Chado database , accessible at the AphidBase web portal (www.aphidbase.com; ,). Additionally, all manually curated genes are available at AphidBase.
One hundred and ninety-seven genes with single-copy orthologs in all species included in the analyses were selected to infer a species phylogeny. Alignments performed with MUSCLE described were concatenated into a super-alignment containing 14,922 positions. The removal of positions with gaps in more than 50% of the sequences resulted in a final alignment of 90,512 positions. This alignment was used for Maximum Likelihood (ML) tree reconstruction as implemented in PhyML v2.4.4 , using JTT as an evolutionary model and assuming a discrete gamma-distribution model with four rate categories and invariant sites, where the gamma shape parameter and the fraction of invariant sites were estimated from the data. Bootstrap analysis was performed on the basis of 100 replicates.
We reconstructed the complete collection of phylogenetic trees, also known as the Phylome, for all A. pisum protein-coding genes with homologs in other sequenced insect genomes. For this we used a similar automated pipeline to that described earlier for the human genome . A database was created containing the pea aphid proteome and that of 16 other species. These include 12 other insects (Tribolium castaneum, Nasonia vitripennis, Apis mellifera [from NCBI database], Drosophila pseudoobscura, Drosophila melanogaster, Drosophila mojavensis, Drosophila yakuba [from FlyBase], Pediculus humanus, Culex pipiens [from VectorBase], Anopheles gambiae, Aedes aegypti [from Ensembl], and Bombyx mori [from SILKDB]) and four outgroups (the crustacean Daphnia pulex [the GNOMON predicted set provided by the JGI], the nematode Caenorhabditis elegans, and two chordates, Ciona intestinalis and Homo sapiens [from Ensembl]). For each protein encoded in the pea aphid genome, a Smith-Waterman  search (e-val 10−3) was performed against the above mentioned proteomes. Sequences that aligned with a continuous region longer than 50% of the query sequence were selected and aligned using MUSCLE 3.6  with default parameters. Gappy positions were removed using trimAl v1.0 (http://trimal.cgenomics.org), using a gap threshold of 25% and a conservation threshold of 50%. Phylogenetic trees were estimated with Neighbor Joining (NJ) trees using scoredist distances as implemented in BioNJ  and by ML as implemented in PhyML v2.4.4 , using JTT as an evolutionary model and assuming a discrete gamma-distribution model with four rate categories and invariant sites, where the gamma shape parameter and the fraction of invariant sites were estimated from the data. Support for the different partitions was computed by approximate likelihood ratio test as implemented in PhymL (aLRT) . All trees and alignments have been deposited in PhylomeDB  (http://phylomedb.org). Additional details for this analysis can be found in .
Prediction of orthology is a fundamental step in the functional annotation of newly sequenced genomes. Reciprocal BLAST best hit is often used for genome-wide orthology detection, but phylogeny-based orthology predictions are considered more accurate, especially at large evolutionary distances or when gene duplication and loss is rampant . To overcome this, orthology and paralogy relationships among A. pisum genes and those encoded in the other considered genomes were inferred by a phylogenetic approach that uses a previously described species-overlap algorithm . This algorithm uses the level of species overlap (if there is species overlap) between the two daughter partitions of a given node to define it as a duplication or speciation (if there is no species overlap). After mapping all duplications and speciations on the phylogenetic tree of a given gene family, orthology and paralogy relationships are inferred accordingly. All orthology and paralogy predictions can be accessed through PhylomedDB .
A list of orthology-based transfer of functional annotations was built based on phylogeny-based orthology relationships with Drosophila melanogaster. Pea aphid genes with orthology relationships with annotated D. melanogaster genes were grouped according to the type of orthology relationship. Twelve percent (4,058) of aphid genes could be annotated based on a clear one-to-one orthology relationship with a drosophila gene. An additional 2,315 genes presented a many-to-one relationship with annotated drosophila genes and thus were tentatively annotated with the GO terms associated with the fly genes, with the caution that neo and or sub-functionalization may have occurred.
The duplication events defined by the above mentioned species overlap algorithm that only comprised paralogs from A. pisum were considered lineage-specific duplications. Whenever more than one round of duplication followed an A. pisum speciation event (family expansion), all resulting paralogs were grouped into a single group of “in-paralogs”. Results from all the trees in the phylome were merged into a non-redundant list of in-paralogs groups, by merging groups sharing a significant fraction of their members (50%).
Putative pairs of paralogs were identified as pairs of genes following a reciprocal best hit criterion (RBH) within the A. pisum gene set; however, due to errors in the assembly process, these may comprise allelic variants found on different scaffolds (for alleles, coding sequences are expected to be extremely similar). We filtered alignments with Gblocks  to reduce the risk of partially non-homologous alignments and estimated the pairwise dS among genes. For comparison, the same task was performed for transcripts (not considering alternative transcripts) from Drosophila and honeybee genomes.
The pea aphid has four chromosomes  with eight telomeres. Searches of the genome assembly for long stretches of the expected TTAGG telomeric repeat reveal several candidates, but only two are at the ends of reasonably long contigs in reasonably long scaffolds. They are ~480 bp stretches of TTAGG repeats at the 3′ ends of 14 kb SCAFFOLD14618 (GenBank EQ125390.1) and 11 kb SCAFFOLD13146 (EQ123918.1). The remainder of these scaffolds do not encode any genes, and the subtelomeric ~700 bp before the TTAGG repeats shows considerable sequence similarity between these two scaffolds. These are likely to be true telomeres. Unfortunately the remaining six telomeres are not assembled in scaffolds, although pieces of them might be in short single contigs. Attempts to determine their structure employed an approach similar to that utilized with the Tribolium genome assembly , involving searching of the raw reads at the Trace Archive at NCBI with a query consisting of 1000 bp of TTAGG repeats. Examination of the internal mate pairs of the first 100 such matches revealed several from the two telomeres identified above. The remainder, however, were either matches to RT domains or other regions of retrotransposons or were other simple sequence repeats. It appears, therefore, that the remaining six telomeres are rather more complicated than the two identified above, which are reminiscent of the relatively simple telomeres of the honey bee Apis mellifera . They likely involve insertion of retrotransposons into the telomeres, much like those of the silkmoth Bombyx mori  and the red flour beetle Tribolium castaneum .
TEs were identified and annotated using the “REPET” (http://urgi.versailles.inra.fr/development/repet/) pipeline, which correctly annotate nested and fragmented TEs. In the first part of the pipeline, consensus TEs were predicted ab initio by first searching for repeats with BLASTER for an all-by-all BLASTN  genome comparison and then results grouped using three clustering methods—GROUPER , RECON , and PILER —with default parameters. We then built one consensus per group with the MAFFT  multiple sequence alignment program and classified each consensus (1) according to BLASTER matches using TBLASTX and BLASTX  with the entire Repbase Update databank  and (2) according to the presence of structural features such as terminal repeats (TIR, LTR, and polyA or SSR tails).
These TE consensus sequences representing ancestral copies of TEs subfamilies were clustered into groups for family identification using the GROUPER clustering method. Each family (i.e., group) was characterized assuming that the most populated well characterized TE category in a group of consensus sequences can define the order of the group it belongs to. Eighty-five families containing at least five TE consensus sequences were then manually curated using multiple sequences alignments, phylogenies, and Hidden Markov Models . This close examination allowed us to confirm groupings and decipher specific features like chimeric TE families or subfamilies.
The pea aphid genome was annotated with all the subfamilies of TE consensus sequences using the second part of the REPET annotation pipeline. This pipeline is composed of TE detection software—BLASTER , RepeatMasker , and Censor —and satellite detection software—RepeatMasker, TRF , and Mreps . Simple repeats have been used to filter out spurious hits.
TEs often insert into other TEs fragmenting each other. A specific “long join” annotation procedure was performed, using age estimates of repeat fragments to correctly identify fragments from the same repeat. The percent identity between a fragment and its reference TE/repeat consensus can be used to estimate the age of TE fragments.
CpG analysis was performed as described in .
During the course of whole genome sequencing of pea aphid clones, LSR1.AC.G1, 24,947 sequence reads corresponding to the Buchnera genome were obtained as by-products. Using the chromatogram data of these sequences, the whole genome of Buchnera LSR1 was reconstructed in two distinct methods: de novo assembly using CAP3  and comparative (read mapping against a reference) assembly using AMOScmp of AMOS package . Results of both methods were essentially the same and the latter output was used for further analyses. Five gaps that remained after the assembly were closed by PCR reactions followed by Sanger sequencing. This Buchnera Whole Genome Shotgun project was deposited at DDBJ/EMBL/GenBank under the project accession ACFK00000000. The version described in this article is the first version, ACFK01000000.
A BioCyc metabolism database  was constructed for the pea aphid using a newly developed data management system specific for the creation and updating of Cyc databases and the BioCyc Pathway Tools. Currently, the pea aphid database, “AcypiCyc” (http://pbil.univ-lyon1.fr/software/cycads/acypicyc), utilizes the RefSeq automated annotation, complemented by three alternative annotations of the pea aphid's 34,821 proteins performed using KAAS . The AcypiCyc database allows for comparison of the pea aphid database with two other BioCyc databases: SymbioCyc for Buchnera aphidicola APS and DromeCyc for Drosophila melanogaster.
Sanger read statistics.
(0.04 MB DOC)
GC content of selected arthropod genomes.
(0.04 MB DOC)
Comparison of pea aphid gene model sets to 2089 gold standard pea aphid exons from 402 genes. bp overlap, the total number of base pairs overlapping between gold standard exons, and exons from the indicated gene model set; bp query miss, the number of bp in exons that had some overlap with the gold standard exon set but did not overlap the gold standard exon; bp target miss, the number of bp in the gold standard set that were not overlapped by the candidate gene set; any overlap, the number of gold standard exons that had 1 bp or more overlap with the gene model set in question; # correct splices, the number of gold standard exon splice sites exactly predicted by the gene model set in question; # within 6 bp, the number of splice site within 6bp, not including those exactly predicted.
(0.04 MB DOC)
Arthropod gene structure statistics. Genome size value in parentheses is total gene-containing sequence (i.e., excluding heterochromatin, scaffolds without genes, etc.). No. of genes is from the gene set examined, not necessarily the official gene set for new genomes. Gene density is calculated as the sum of coding exon bases/total gene-containing genome bases. Gene length is the span including introns and UTR. CDS size is the coding sequence length without introns or UTRs. Exons/gene and Exon size are count and size of coding exons. Sizes are given as mean in bp except for Intron size. Intergenic size is measured from distance between adjacent genes. These statistics have a standard deviation close to the mean, but Intergenic size has a much larger variance. 1 Gene part sizes and exons/gene are measured with EST-validated gene models for these noted genomes. Others are measured from reference database gene feature data. 2 Exon size distribution for Drosophila is strongly bimodal; one-exon genes average twice the size of multi-exon genes (830 bp versus 470 bp/exon). Other species show unimodal distribution of exon sizes. 3 Intron size is non-normally distributed. Intron size lists the primary and secondary peaks, mean, and the percent of introns larger than exons. It has a narrow, high peak frequency at the indicated (median) value. Fruitfly and nematode have a secondary peak at about 400 bp; mouse reverses this with its secondary peak at 90 bp. Daphnia appears to have no secondary intron size peak. 4 UTR size is an overestimate, as it is measured only where exons extend past coding sequence, and misses true cases of zero length UTRs. Genome sequences used: Aphid, Acyr. pisum (acyr1); Beetle, Tribolium castenatum (tcas3); Bee, Apis mellifera (ncbi1); Daphnia, Daphnia pulex (daphx1); Fruitfly, Drosophila melanogaster (fb5.5); Mosquito, Culex pipens (cpip12); Mouse, Mus musculus (mgi3); Wasp, Nasonia vitripennis (nvit1); Worm, Caen. elegans (wb167).
(0.05 MB DOC)
Diagnostic PCR to check the presence/absence of scaffolds that appeared to be bacterial contaminants. Among 642 PPPs located in scaffolds that appeared to be of bacterial contaminants, 46 were portions of 42 RefSeq aphid gene models. We performed diagnostic PCRs to check the presence/absence of these genes/scaffolds in the A. piusm genome. Specific primers were designed for each unique target gene. Each 30 µL PCR reaction contained 0.5 µM each primer, 0.2 µM dNTPs, 10 ng template, and 2.5 U AmpliTaq (Applied Biosystems) in 1× AmpliTaq buffer. Parameters for PCRs were: 94°C for 30 s, followed by 35 cycles of 94°C for 15 s, 50°C for 30 s, 72°C for 1.5 min, 72°C for 10 min and, 4°C hold. LdcA1 was used as a positive control. PCR primers for LdcA1 were Ap_ldcA_482F (5′-TATGATACCGTACCTGGAGGCGTT-3′) and Ap_ldcA_1127R′ (5′-GTTTTAATCACGCAGCACATGGG-3′). None of the target DNA sequences were amplified by PCR, verifying the absence of these scaffolds in the aphid genome.
(0.12 MB DOC)
Distribution of reactions in the AcypiCyc database across the six top-level categories identified by the Enzyme Commission (EC). Included in this table are all reactions in the AcypiCyc database that have been assigned either full or partial EC numbers.
(0.04 MB DOC)
Special thanks to J. Colbourne for insightful discussions on genome project organization and future directions.
The members of The International Aphid Genomics Consortium (IAGC) are as follows: Sequencing leadership: Stephen Richards1, Richard A. Gibbs†1, Project Leadership: Nicole M. Gerardo2, Nancy Moran3, Atsushi Nakabachi4, Stephen Richards1, David Stern5, Denis Tagu6, Alex C. C. Wilson7; DNA sequence and global analysis: DNA sequencing: Sequence Production: Donna Muzny1, Christie Kovar*1, Andy Cree1, Joseph Chacko1, Mimi N. Chandrabose1, Marvin Diep Dao1, Huyen H. Dinh1, Ramatu Ayiesha Gabisi1, Sandra Hines1, Jennifer Hume1, Shalini N. Jhangian1, Vandita Joshi1, Lora R. Lewis1, Yih-shin Liu1, John Lopez1, Margaret B. Morgan1, Ngoc Bich Nguyen1, Geoffrey O. Okwuonu1, San Juana Ruiz1, Jireh Santibanez1, Rita A. Wright1; Sequence Production Informatics: Gerald R. Fowler*1, Matthew E. Hitchens1, Ryan J. Lozado1, Charles Moen1, David Steffen1, James T. Warren1, Jingkun Zhang1; Sequence Library Production: Lynne V. Nazareth*1, Dean Chavez1, Clay Davis1, Sandra L. Lee1, Bella Mayurkumar Patel1, Ling-Ling Pu1, Stephanie N. Bell1, Angela Jolivet Johnson1, Selina Vattathil1, Rex L. Williams Jr.1; Full length ESTs: Shuji Shigenobu5,8, David Stern5, Stephen Richards1, Phat M. Dang9, Mizue Morioka10, Takema Fukatsu11, Toshiaki Kudo12, Shin-ya Miyagishima4, Atsushi Nakabachi*4; Genome Assembly: Huaiyang Jiang1, Stephen Richards1, Kim C. Worley*1; AphidBase and bioinformatics resources: Fabrice Legeai6, Jean-Pierre Gauthier6, Olivier Collin13, Shuji Shigenobu5,8 Denis Tagu*6; Gene prediction and consensus gene set: Fabrice Legeai6, Lan Zhang1, Jean-Pierre Gauthier6, Shuji Shigenobu5,8, Denis Tagu6, Stephen Richards*1, Hsiu-Chuan Chen14, Olga Ermolaeva14, Wratko Hlavina14, Yuri Kapustin14, Boris Kiryutin14, Paul Kitts14, Donna Maglott14, Terence Murphy14, Kim Pruitt14, Victor Sapojnikov14, Alexandre Souvorov9, Françoise Thibaud-Nissen14, Francisco Câmara15, Roderic Guigó15,16, Mario Stanke17, Victor Solovyev18, Peter Kosarev19, Don Gilbert20; Phylogenomic Analyses: Toni Gabaldón*21, Jaime Huerta-Cepas21, Marina Marcet-Houben21, Miguel Pignatelli22,23, Don Gilbert20, Andrés Moya22,23; Gene duplications: Claude Rispe*6, Morgane Ollivier6, Fabrice Legeai6, Denis Tagu6; Transposable elements: Hadi Quesneville*24, Emmanuelle Permal24, Andrés Moya22,23, Carlos Llorens22,25, Ricardo Futami25, Alex C. C. Wilson7, Dale Hedges26; Telomeres: Hugh M. Robertson27; U12-Introns and Seleno-Proteins: Tyler Alioto21, Marco Mariotti21, Roderic Guigó21; Symbiosis: Bacterial and mitochondrial genes in the aphid genome: Naruo Nikoh28, John P. McCutcheon29, Miguel Pignatelli22,23, Gaelen Burke3, Nicole M. Gerardo2, Alexandra Kamins2, Amparo Latorre22,23, Andrés Moya22,23, Toshiaki Kudo12, Shin-ya Miyagishima4, Nancy A. Moran3, Atsushi Nakabachi*4; Metabolism: Peter Ashton30, Federica Calevro31, Hubert Charles31, Stefano Colella31, Angela Douglas*32, Georg Jander33, Derek H. Jones7, Gérard Febvay31, Lars G. Kamphuis34, Philip F. Kushlan7, Sandy Macdonald30, John Ramsey33, Julia Schwartz7, Stuart Seah35, Gavin Thomas30, Augusto Vellozo31,36, Alex C. C. Wilson7; Comparative genomics of Buchnera: Shuji Shigenobu*5,8, Stephen Richards1, Nancy Moran3, Shin-ya Miyagishima4, Atsushi Nakabachi4; Genome analysis of Regiella insecticola: Bodil Cass3, Patrick Degnan3, Bonnie Hurwitz3, Teresa Leonardo5, Ryuichi Koga11, Nancy Moran*3, Stephen Richards1, David Stern5; Stress and immunity group: Boran Altincicek37, Caroline Anselme31,38, Hagop Atamian39, Seth M. Barribeau*2, Martin de Vos33, Elizabeth J. Duncan40, Jay Evans41, Toni Gabaldon21, Nicole M. Gerardo*2, Murad Ghanim*42, Abdelaziz Heddi31, Isgouhi Kaloshian39, Amparo Latorre22,23, Carole Vincent-Monegat31, Andrés Moya22,23, Atsushi Nakabachi4, Ben J. Parker2, Vicente Pérez-Brocal22,31, Miguel Pignatelli22,23, Yvan Rahbé31, John Ramsey33, Chelsea J. Spragg2, Javier Tamames22,23, Daniel Tamarit22, Cecilia Tamborindeguy43, Andreas Vilcinskas37; Development group: Shuji Shigenobu*5,8, Ryan D. Bickel44, Jennifer A. Brisson44, Thomas Butts45, Chun-che Chang46, Olivier Christiaens47, Gregory K. Davis48, Elizabeth Duncan40, David Ferrier49, Masatoshi Iga47, Ralf Janssen50, Hsiao-Ling Lu46, Alistair McGregor51, Toru Miura52, Guy Smagghe47, James Smith40,Maurijn van der Zee53, Rodrigo Velarde54, Megan Wilson40, Peter Dearden40, David Stern5; Germ line group: Chun-che Chang*46, Hsiao-Ling Lu46, Ryan D. Bickel44, Shuji Shigenobu5,8, Gregory K. Davis*48; Epigenetics and Methylation: Jennifer A. Brisson44, Owain R. Edwards34, Karl Gordon55, Roland S. Hilgarth56, Stanley Dean Rider Jr.*57, Hugh M. Robertson27, Dayalan Srinivasan5, Thomas K. Walsh*34; Wing development: Jennifer A. Brisson*44, Asano Ishikawa52, Toru Miura52; JH-related: Toru Miura*52, Jennifer A. Brisson44, Asano Ishikawa52, Stéphanie Jaubert-Possamai6, Denis Tagu6, Thomas K. Walsh34; Mitosis, meiosis and cell cycle: Dayalan Srinivasan*5, Brian Fenton58, Stéphanie Jaubert-Possamai6; Sex determination: Wenting Huang7, Derek H. Jones7, Alex C. C. Wilson*7; MicroRNA and phenotypic plasticity: Fabrice Legeai6,59, Thomas K. Walsh34, Guillaume Rizk60, Owain R. Edwards34, Karl Gordon55, Dominique Lavenier61, Jacques Nicolas59, Denis Tagu6, Stéphanie Jaubert-Possamai*6, Claude Rispe6; Aphid Plant Interactions: Chemoreceptors: Carole Smadja62, Hugh M. Robertson*27; Odorant-Binding Proteins: Jing-Jiang Zhou63, Filipe G. Vieira64, Carole Smadja62, Xiao-Li He63, Renhu Liu63, Julio Rozas64, Linda M. Field*63; Detoxification enzymes: Stanley Dean Rider Jr.57, John Ramsey33, Karl Gordon55, Thomas K. Walsh34, Martin de Vos33, Georg Jander*33; Salivary glands: Peter D. Ashton30, Peter Campbell55, James C. Carolan*65, Angela E. Douglas32, Owain R. Edwards*34,66, Carol I. J. Fitzroy65, Lars G. Kamphuis35, Karen T. Reardon65, Gerald R. Reeck66,67, Karam Singh35, Thomas L. Wilkinson65; Neuropeptides: Jurgen Huybrechts68, Mohatmed Abdel-latief69, Alain Robichon38, Jan A. Veenstra70, Frank Hauser71, Giuseppe Cazzamali71, Martina Schneider71, Michael Williamson71, Elisabeth Stafflinger71, Karina K. Hansen71, Cornelis J. P. Grimmelikhuijzen71, Denis Tagu*6; Transporters: Daniel R.G Price72, Marina Caillaud73, Eric van Fleet73, Qinghu Ren74, Yvan Rahbé31, Angela E. Douglas*32, John A. Gatehouse72; Virus transmission and transcytosis group: Véronique Brault75, Baptiste Monsion75, Marina Caillaud73, Eric Van Fleet73, Jason Diaz73, Laura Hunnicutt76, Atsushi Nakabachi4, Ho-Jong Ju77, Cecilia Tamborindeguy*43, Ximo Pechuan22, José Aguilar22, Daniel Tamarit22; Carlos Llorens22,25, Andres Moya22,23; Dynamins: Atsushi Nakabachi*4, Shin-ya Miyagishima4; Circadian rhythms group: Teresa Cortés22, Benjamín Ortiz-Rivas22, David Martínez-Torres*22; Cuticular proteins: Claude Rispe*6, Aviv Dombrovsky78, Stéphanie Jaubert-Possamai6, Denis Tagu6; Chitinase-like proteins: Atsushi Nakabachi*4, Shuji Shigenobu5,8, Shin-ya Miyagishima4; Ion Channels: Richard P. Dale*79, Thomas K. Walsh34, Cecilia Tamborindeguy43, T. G. Emyr Davies63, Linda M. Field63, Martin S. Williamson63, Andrew Jones80, David Sattelle80, Sally Williamson81, Adrian Wolstenholme81; Protease genes: Peter Campbell55, James C. Carolan*65, Owain R. Edwards34, Karl Gordon55, Carlos Llorens22,25, Andres Moya22,23, Miguel Pignatelli22,23, Yvan Rahbé31, Claude Rispe*6, Gerald R. Reeck67; AcypiCyc: Augusto Vellozo31,36, Stefano Colella31, Ludovic Cottret36, Gérard Febvay31, Federica Calevro31, Yvan Rahbé31, Angela Douglas32, Marie France Sagot36, Hubert Charles*31; Ribosomal Proteins: Claude Rispe*6, David G. Heckel82, Wayne Hunter83.
† Principal Investigator
* Analysis Group Leaders
1 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America,
2 Department of Biology, Emory University, O. Wayne Rollins Research Center, Atlanta, Georgia, United States of America
3 Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona, United States of America,
4 Advanced Science Institute, Saitama, Japan,
5 Howard Hughes Medical Institute and Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America,
6 INRA, UMR BiO3P, Domaine de La Motte, Le Rheu, France,
7 Department of Biology, University of Miami, Coral Gables, Florida, United States of America,
8 PRESTO, JST, 4-1-8 Honcho Kawaguchi, Saitama, 332-0012, and Okazaki Institute for Integrative Bioscience, National Institutes of Natural Sciences, Higashiyama, Myodaiji, Okazaki, Japan,
9 USDA, ARS, National Peanut Research Laboratory, Dawson, Georgia, United States of America,
10 Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan,
11 National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan,
12 Discovery Research Institute, Saitama, Japan,
13 CNRS, IRISA, EPI Symbiose, Rennes, France,
14 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America,
15 Center de Regulació Genòmica, Universitat Pompeu Fabra, Barcelona, Spain,
16 Research Group in Biomedical Informatics, Universitat Pompeu Fabra, Barcelona, Spain,
17 Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Göttingen, Germany,
18 Department of Computer Science, Royal Holloway, University of London, Surrey, United Kingdom,
19 Softberry Inc, Mount Kisco, New York, United States of America,
20 Department of Biology, Indiana University, Bloomington, Indiana, United States of America,
21 Bioinformatics and Genomics Programme. Centre for Genomic Regulation (CRG), Barcelona, Spain,
22 Instituto Cavanilles de Biodiversidad y Biología Evolutiva, Universitat de València, València, Spain,
23 CIBER en Epidemiología y Salud Pública (CIBEResp) and Centro Superior de Investigación en Salud Pública (CSISP), Conselleria de Sanidad (Generalitat Valenciana), València, Spain,
24 Unité de Recherches en Génomique-Info (UR INRA 1164), INRA, Centre de recherche de Versailles, Versailles Cedex, France,
25 Biotechvana, Parc Cientific, Universitat de València, València, Spain,
26 Miami Institute for Human Genomics, 1120 NW 14th Street, Miami, Florida, United States of America,
27 Department of Entomology, University of Illinois at Urbana-Champaign, Illinois, United States of America,
28 Division of Natural Sciences, The Open University of Japan, Wakaba, Chiba, Japan,
29 Center for Insect Science, University of Arizona, Tucson, Arizona, United States of America,
30 Department of Biology, University of York, York, United Kingdom,
31 UMR203 Biologie Fonctionnelle Insectes et Interactions (BF2I), IFR41, INRA, INSA-Lyon, Université de Lyon, Villeurbanne, France,
32 Department of Entomology, Cornell University, Ithaca, New York, United States of America,
33 Boyce Thompson Institute for Plant Research, Ithaca, New York, United States of America,
34 CSIRO Entomology, Centre for Environment and Life Sciences (CELS), Floreat Park, Australia,
35 CSIRO Plant Industry, Centre for Environment and Life Sciences (CELS), Floreat Park, Australia,
36 Université de Lyon, CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France,
37 Interdisciplinary Research Center, Institute of Phytopathology and Applied Zoology, Justus-Liebig-University of Giessen, Giessen, Germany,
38 UMR Interactions Biotiques et Santé Végétale, INRA 1301-CNRS 6243-Université de Nice-Sophia Antipolis, Sophia-Antipolis Cedex, France,
39 Graduate Program in Genetics, Genomics and Bioinformatics and Department of Nematology, University of California, Riverside, California, United States of America,
40 Laboratory for Evolution and Development and National Research Centre for Growth and Development, Department of Biochemistry, University of Otago, Dunedin, New Zealand,
41 USDA-ARS Bee Research Laboratoy, Beltsville, Maryland, United States of America,
42 Department of Entomology, The Volcani Center, Bet Dagan, Israel,
43 Department of Entomology, Texas A&M University, College Station, Texas, United States of America,
44 Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America,
45 Department of Zoology, University of Oxford, Oxford, United Kingdom,
46 Laboratory for Genetics and Development, Department of Entomology/Institute of Biotechnology, College of Bio-Resources and Agriculture, National Taiwan University, Taipei, Taiwan,
47 Lab Agrozoology, Department Crop Protection, Ghent University, Ghent, Belgium,
48 Department of Biology, Bryn Mawr College, Bryn Mawr, Pennsylvania, United States of America,
49 The Scottish Oceans Institute, University of St. Andrews, St. Andrews, Fife, United Kingdom,
50 Department of Earth Sciences, Palaeobiology, Uppsala University, Uppsala, Sweden,
51 University of Veterinary Medicine Vienna, Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Vienna, Austria,
52 Graduate School of Environmental Science, Hokkaido University, Sapporo, Japan,
53 Institute for Biology Leiden University, Leiden, Netherlands,
54 Department of Biology, Wake Forest University, Winston-Salem, North Carolina, United States of America,
55 CSIRO Entomology, Black Mountain Laboratories, Black Mountain, Acton, Australia,
56 Department of Pathology and Laboratory Medicine, Emory University School of Medicine, Atlanta, Georgia, United States of America,
57 Department of Pathobiology, Texas A & M University, College Station, Texas, United States of America,
58 Plant Pathology, SCRI, Invergowrie and The Scottish Crop Research Institute, Dundee, Scotland, United Kingdom,
59 INRIA, IRISA, EPI Symbiose, Rennes, France,
60 Université de Rennes 1, IRISA, EPI Symbiose, Rennes, France,
61 ENS Cachan, INRIA, EPI Symbiose, Rennes, France,
62 Animal and Plant Sciences Department, University of Sheffield, United Kingdom,
63 Department of Biological Chemistry, Rothamsted Research, Harpenden, United Kingdom,
64 Departament de Genètica, Universitat de Barcelona, Barcelona, Spain,
65 UCD School of Biology and Environmental Science, University College Dublin, Ireland,
66 Cooperative Research Centre for National Plant Biosecurity, Canberra, Australia,
67 Department of Biochemistry, Kansas State University, Manhattan, Kansas,
68 Research Group of Functional Genomics and Proteomics, Leuven, Belgium,
69 Freie Universität Berlin, Institut für Angewandte Zoologie, Berlin, Germany,
70 Université de Bordeaux, CNRS, Talence, France,
71 Center for Functional and Comparative Insect Genomics, Department of Biology, University of Copenhagen, Copenhagen, Denmark,
72 School of Biological and Biomedical Sciences, Durham University, Durham, United Kingdom,
73 Department of Biology, Ithaca College, Ithaca, New York, United States of America,
74 J. Craig Venter Institute, Rockville, Maryland, United States of America,
75 INRA UMR SVQV, Equipe Virologie Vection, Colmar, France,
76 Genomic Sciences Program, North Carolina State University, Raleigh, North Carolina, United States of America,
77 Department of Plant Pathology and Plant-Microbe Biology, Cornell University, Ithaca, New York, United States of America,
78 Department of Virology, The Volcani Center, Bet Dagan, Israel,
79 Syngenta, Jealotts Hill Research Centre, Bracknell, Berkshire, United Kingdom,
80 MRC Functional Genomics Unit, Department of Phisology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom,
81 Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom,
82 Max Planck Institute for Chemical Ecology, Jena, Germany,
83 United States Department of Agriculture, Agriculture Research Service, U.S. Horticultural Research Lab, Fort Pierce, Florida, United States of America.
The authors have declared that no competing interests exist.
Work at the Baylor Medical College Human Genome Sequencing Center was funded by grant 5-U54-HG003273 from the National Human Genome Research Institute. AphidBase is supported with funding from the French National Institute for Agricultural Research (INRA) and the French National Institute for Research in Computer Science and Control (INRIA). Pea Aphid Genome Annotation Workshop I was supported by an American Genetic Association Special Event Award and an NRI, US Department of Agriculture Cooperative State Research, Education, and Extension Service 2007-04628 award to ACCW. FgenesH models were donated by Softberry, Inc. This research was additionally supported in part by the Intramural Research Program of the NIH, National Library of Medicine. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.