|Home | About | Journals | Submit | Contact Us | Français|
Edited by Yuji Kohara
The study of the pearl oyster Pinctada fucata is key to increasing our understanding of the molecular mechanisms involved in pearl biosynthesis and biology of bivalve molluscs. We sequenced ~1150-Mb genome at ~40-fold coverage using the Roche 454 GS-FLX and Illumina GAIIx sequencers. The sequences were assembled into contigs with N50 = 1.6 kb (total contig assembly reached to 1024 Mb) and scaffolds with N50 = 14.5 kb. The pearl oyster genome is AT-rich, with a GC content of 34%. DNA transposons, retrotransposons, and tandem repeat elements occupied 0.4, 1.5, and 7.9% of the genome, respectively (a total of 9.8%). Version 1.0 of the P. fucata draft genome contains 23 257 complete gene models, 70% of which are supported by the corresponding expressed sequence tags. The genes include those reported to have an association with bio-mineralization. Genes encoding transcription factors and signal transduction molecules are present in numbers comparable with genomes of other metazoans. Genome-wide molecular phylogeny suggests that the lophotrochozoan represents a distinct clade from ecdysozoans. Our draft genome of the pearl oyster thus provides a platform for the identification of selection markers and genes for calcification, knowledge of which will be important in the pearl industry.
We sequenced the genome of the pearl oyster Pinctada fucata (Fig. 1A) with two major goals in mind. First, we sought to obtain genomic information that could be used in future studies of the molecular mechanisms underlying the biosynthesis of pearl in the bivalve mollusc. Pearls are of significant value industrially and as jewels; better understanding of the mechanism is essential to further improvement of pearl production in the fisheries industry. To date, mechanisms involved in the nacreous layer formation in the pearl oyster have been studied extensively.1–4 The shell of pearl oysters consists of two distinct structures: inner nacreous layers composed of aragonite and outer prismatic layers composed of calcite. An intriguing question in the field of bio-mineralization research is how the two polymorphs of calcium carbonate are produced in the same organism. Previous studies that explored these bio-mineralization processes identified and characterized a wide variety of genes and proteins; their functions have also been examined in association with shell formation.4–6 For example, Kinoshita et al.7 conducted an expressed sequence tag (EST) analysis of P. fucata genes expressed in the pallial mantle, which forms the nacreous layer, and in the mantle edge, which forms the prismatic layer. That study generated 29 682 unique sequences, from which the authors succeeded in identifying multiple genes, some novel, that may play roles in the pearl formation. In spite of such extensive studies, however, we still have only a limited understanding of the molecular mechanisms underlying pearl oyster shell formation. Decoding the genome of pearl oysters is therefore essential for future genome-wide analyses of these mechanisms.
Our second goal was to improve our overall understanding of the biology of bivalve molluscs, specifically those features that characterize molluscs and their evolution among metazoans and/or lophotrochozoans.8–10 Lophotrochozoa is one of the largest clades of bilaterians, together with the Ecdysozoa and Deuterostomia. The core group of Lophotrochozoa includes molluscs, annelids, nemertines, lophophorates, and platyhelminths. Since these core groups show the greatest variety of adult body plans, the relationships among the lophotrochozoan taxa are still controversial.11 Although adult morphology varies vastly across the group, many core lophotrochozoans share spiral cleavage and a trochophore larval stage. Understanding the proposed repeated losses of both spiral cleavage and trochophore larvae in the Lophotrochozoa clade requires comparative studies of genes involved in the formation of body plans, performed on the basis of full genomic information.
The phylum Mollusca itself is one of the most diverse animal groups; mollusc species exhibit a range of morphologies and sizes, from microscopic bivalves to 1-m long giant clams.8–10 The phylum includes ~100 000 described and living species. Due to their hard exoskeletal shell, molluscs appear in the fossil record dating from the Cambrian era, and the major bivalves were established by the Middle Ordovician.12,13 However, the search for bivalve ancestors has proved less straightforward, mainly due to differing interpretations of the extant microfossils. Mollusca comprise seven or eight major lineages, and three majors are Bivalvia (clams and oysters), Gastropoda (snails and slugs), and Cephalopoda (squids and octopus). A recent phylogenetics of evolutionary relationship among Mollusca by Kocot et al.14 showed a sister-taxon relationship between Bivalvia and Gastropoda.
The class Bivalvia includes ~20 000 living species; their developing embryos pass through two larval stages, trochophore and verliger. Bivalves are characterized by their possession of two separate shells, called valves. It is thought that during evolution of molluscs leading to the bivalves, a single dorsal shell in the ancestor was doubled. Recently, Kin et al.15 examined genes involved in this prominent morphological transition in the Japanese spiny oyster Saccostrea kegaki. They found that a member of transforming growth factor beta (TGF-β) family, dpp, is expressed only in the cells along the dorsal midline, and plays an important role in establishing the characteristic shape of the bivalve shell anlagen.
The genome contains all the genetic information of a given organism; therefore, sequenced genomes provide an invaluable basis for studying every aspect of biology. Since the first sequencing of a metazoan genome, the nematode Caenorhabditis elegans16 in 1998, ~35 metazoan genome sequences have been reported. However, there is an apparent bias in the selection of targeted metazoan genomes. In contrast to vertebrates,17–19 basal chordates,20,21 ecdysozoans including Drosophila melanogaster22 and Apis mellifera,23 and cnidarians,24–26 placozoans,27 and the sponge,28 there have been no reports of sequenced genomes in lophotrochozoans, although planarians, annelids, and gastropod molluscs have been the subjects of their own genome projects.
With the aims described, in the context of the current status of genome sequencing efforts, we sequenced the genome of the pearl oyster P. fucata. Because of its comparatively large size, the assembly so far obtained is still comparatively short, but the draft genome will provide a platform sufficient for exploring the molecular basis of pearl biosynthesis.
Pinctada fucata (Fig. 1A) cultured at the Pearl Research Institute of Mikimoto Co., Ltd, Shima, Japan, was used in the present study.
The number of chromosomes was determined by chromosome preparation from nuclei of embryonic cells at the blastula and gastrula stages. The chromosomes were spread according to the method described by Shoguchi et al.29 and stained with Giemsa. The number was confirmed from a total of 25 spreads.
The genome size of P. fucata was estimated by flow cytometry using sperm nuclei from the same specimen used in the genome sequencing.30 Sperm nuclei were treated with a DAPI flow cytometry kit and a BD Cycletest Plus DNA Reagent Kit (BD Biosciences) and analysed on a BD FACSAria II cell sorter (BD Biosciences). The nuclear suspension was confirmed using a fluorescence microscope. The sperm of Takifugu rubripes (365 Mb),18 Oryzias latipes31 (1100 Mb), and Danio rerio (2300 Mb)32 were used for comparison. Runs were performed with stained sperm of each species and also with mixtures of P. fucata sperm with sperm from T. rubripes, O. latipes, or D. rerio. The genome sizes of Takifugu, Oryzias, and Danio were also compared with the published values in order to detect machine-related and/or other technical problems.
The sperm was obtained from a single male during the spawning season, autumn of 2010. The sperm DNA was fragmented, libraries were prepared and sequenced by whole-genome shotgun (WGS) and paired-end read protocols (4 and 10 kb libraries) using a Roche 454 GS-FLX,33 and by WGS and mate-pair protocols (3 and 10 kb libraries) using an Illumina Genome Analyzer IIx (GAIIx) instruments.34
The 454 WGS and paired-end reads were assembled de novo by GS De Novo Assembler version 2.6 (Newbler, Roche).33 Subsequent scaffolding of the Newbler output was performed by SSPACE35 by adding the Illumina mate-pair sequence information.
RNA was isolated from nine different stages, from eggs to D-stage larvae. Total RNA was extracted using TRIzol (Invitrogen) and purified using DNase and an RNeasy micro kit (QIAGEN). mRNA was amplified using a MessageAmp II aRNA Amplification Kit (Ambion). Transcriptome libraries were sequenced using the 454 GS-FLX. Transcriptome data were also obtained from the mantle edge, pallial muscle, and pearl sac.7 All sequences were poly-A trimmed and assembled by Newbler.
A set of gene model predictions (the P. fucata Gene Model version 1.0) was generated using AUGUSTUS 22.214.171.124 All of the following were incorporated as AUGUSTUS ‘hints’: all raw transcriptome reads (except ribosomal RNA sequences); assembled ESTs of P. fucata; bivalve EST data sets available on NCBI; and putative protein-coding loci found using BLASTX on scaffolds present in the proteomes of D. melanogaster, Mus musculus, and Homo sapiens. Next, we aligned models to an annotated transposable element (TE) database using BLASTP (1e−10) to filter out genes that overlapped TE proteins over >60% of their exonic length. All the EST sequences that encode putative full-length proteins were screened by PASA,37 and ~200 EST assemblies randomly selected from the screened ESTs were used for AUGUSTUS training. The gene models were created by running AUGUSTUS on repeat-masked genome sequences produced by RepeatMasker.38
Tandem repeats were detected using Tandem Repeat Finder (version 4.04),39 and then classified using Tandem Repeats Analysis Program (version 1.1).40 A de novo repeats library was generated by RepeatScout (version 1.0.5).41 The repeats were annotated based on BLAST hits to Repbase TE library (version 16.05).42 Transposons and SINEs (short interspersed nuclear elements) in the scaffold were identified using CENSOR (version 4.2)43 by BLAST search against the P. fucata repeat library.
Three approaches, individually or in combination, were used to annotate the protein-coding genes in the P. fucada genome. Our primary approach to the identification of putative orthologues of P. fucada genes was reciprocal BLAST analysis. This was carried out on the basis of mutual best-hit in the BLAST analyses of human, mouse, or Drosophila genes against the P. fucada gene models (BLASTP) or the assembly (TBLASTN).
A second approach, used in the case of gene-encoding proteins with one or more specific protein domains, was to screen the merged models against the Pfam database (Pfam-A.hmm, release 24.0; http://pfam.sanger.ac.uk)44, which contains ~11 000 conserved domains, using HMMER (hmmer3).45
A genome browser has been established using the assembled genome sequences using the Generic Genome Browser (GBrowse), version 2.17.46
Contigs containing mitochondrial genes were searched by BLASTN and BLASTX using bivalvian mitochondrial genes as queries. Mitochondrial protein-coding sequences thereby obtained were confirmed using open reading frame (ORF) Finder (http://www.ncbi.nlm.nih.gov/projects/gorf/) using the invertebrate mitochondrial genetic code as a reference. The tRNA genes were annotated using DOGMA,47 ARWEN48 with default settings, and tRNAscan-SE 1.249 with a COVE cut-off score of 1.
We also performed reconstruction and extension of the mitochondrial DNA sequences, using the following strategy: first, mitochondrial gene sequences obtained above were set as ‘seeds’. A total of 454 raw reads that yielded BLAST hits against the seed sequences were picked up and assembled using Newbler (version 2.6). Assembled sequences were BLAST searched against mitochondrial sequences, and BLAST-hit sequences were used as seeds for the next round of BLAST-assembly workflow. This process was replicated until the contigs did not extend any further.
Based on the study by Philippe et al.50 in which only genes with a moderate rate of amino acid substitution were targeted for molecular phylogeny, we performed mutual best-hit BLAST analyses against three metazoans (H. sapiens, D. melanogaster, and P. fucata) and selected a set of 2087 orthologous genes. Next, the orthologous gene set was surveyed in the genomes of the beetle Tribolium castaneum, the mosquito Anopheles gambiae, the pea aphid Acyrthosiphon pisum, the water flea Daphnia pulex, the lancelet Branchiostoma floridae, the tunicate Ciona intestinalis, the fugu T. rubripes, the zebrafish D. rerio, the anole Anolis carolinensis, the chicken Gallus gallus, the opossum Monodelphis domestica, and the mouse M. musculus. The sea anemone Nematostella vectensis and the coral Acropora digitifera were also examined as outgroup. Most of the protein sequences were retrieved from Refseq (NCBI) through GenomeNet (http://www.genome.jp), except for N. vectensis and D. pulex (Joint Genome Institute, http://genome.jgi-psf.org).
In order to avoid mixing paralogous genes, we performed a BLAST search under stringent conditions (E-value cut-off at 1e−50). Genes detected in all species were aligned using ClustalW,51 and poorly aligned regions were excluded using Gblocks.52 Next, Neighbor Joining tree of each gene was generated with ClustalW to evaluate the branch length of the trees. Each gene with moderate branch length was concatenated and used for molecular phylogenetic analyses. We ran RAxML 7.2.853 for maximum likelihood analyses using the WAG+CAMMA+F model for the whole data matrix. Bootstrap analysis was performed on the basis of 100 replicates.
As a part of the P. fucata genome project, we examined the number of chromosomes and estimated the genome size. The chromosome number of P. fucata was previously reported by Wada54 to be 2n= 28. We confirmed this number of chromosomes (Supplementary Fig. S1).
The genome size of P. fucata was estimated by comparing genomes of other metazoans with reported size estimation. We found that the genome size of P. fucata was a little larger than that of the medaka O. latipes (Fig. 1B; Supplementary Fig. S2D and E). Since the genome size of O. latipes has been calculated to be ~1100 Mb,31 we estimated the P. fucata genome to be ~1150 Mb in size. According to the Animal Genome Size Database, Release 2.0 (http://www.genomesize.com), the genome sizes of bivalves are between 1200 and 2100 Mb. Thus, the P. fucata genome is comparatively small among bivalves.
Figure 2 shows a flow chart of sequencing and assembly; details of the sequenced and assembled data are presented in Supplementary Table S1. The Roche 454 sequencing platform generated a total of 44.5 M WGS reads, yielding 15.2 Gb of sequence (average read length was 341 bp) and 11.6 M paired-end reads (the insert sizes were 4 and 10 kb, respectively; average read lengths were 286 and 309 bp, respectively), yielding 3.5 Gb of sequence. In addition, the Illumina platform generated 76 M WGS reads (3.9 Gb) and 22.8 Gb mate-pair sequences (10.9 Gb for the 3-kb library and 11.9 Gb for the 10-kb library). A total of 45.4 Gb were sequenced. Based on the genome size estimation of 1.15 Gb (see above), this corresponds to ~40-fold coverage of the P. fucata genome.
The 454 WGS and paired-end reads were assembled (Fig. 2). A total of 18.7 Gb of sequence, which corresponds to 16.3× of the estimated genome size, was used at this stage. The reads were first screened and trimmed using the built-in filter module of Newbler in order to remove potential primer and adaptor contamination. About 17 Gb were used for the subsequent assembly procedure; 14 Gb was successfully aligned to form contigs. The assembled genome contained 1 085 563 contigs with an N50 size of 1629 bp (Table 1; Supplementary Table S1; Fig. 2). The longest contig was 44.5 kb, and ~40% of the sequences were covered by contigs of >2 kb in length (Supplementary Table S1). The assembled genome contained a total of 1024 Mb, which is comparable with the estimated genome size (Supplementary Table S1; Fig. 2). Gaps in the contig assembly were only 7.6 kb (Fig. 2).
Subsequent scaffolding of the 1 085 563-contig Newbler output was performed using the Illumina mate-pair information. The final assembly contained 800 982 scaffolds with an N50 size of 14.5 kb (Table 1; Supplementary Table S2; Fig. 2). The total length of scaffolds was 1029 Mb (a total of 1413–384 Mb gaps; Fig. 2). The size of the longest scaffold was 709.8 kb; three scaffolds extended over 500 kb in length. Approximately 44% of the sequences were covered by scaffolds of >20 kb in length (Supplementary Table S2). The scaffolds included ~384-Mb gaps containing unknown sequences (Fig. 2); gap size distribution is shown in Supplementary Table S3.
As described above, in spite of our considerable sequence coverage (~40-fold) of the P. fucata genome, its contig N50 and scaffold N50 were comparatively small. Possible reasons for this include the large genome size, the presence of various types of repetitive sequence (see Section 3.9), and high allelic polymorphism or heterozygosity. We examined whether allelic polymorphism affected the assembly; in addition, we examined the assembly of mitochondrial genome in this context. These analyses are described below.
We analysed the relationship between contig sizes and coverage depth. In haploid genomes, or diploid genomes with very low heterozygosity, this relationship would take the form of a Gaussian distribution curve with a single peak. On the other hand, in the case of a diploid genome with high allelic polymorphism, the plot would appear as a curve with two peaks with approximately double coverage−depth relation. As shown in Fig. 3 and Supplementary Fig. S3, a curve with two distinct peaks appeared in the graph, one at around ~18× the coverage depth and the other at ~10×. Most of the 100 longest contigs were present in the former group, indicating that this region consisted of sequences with lower polymorphism. This notion is also supported by the fact that the average coverage depth of this region (18.2×) was similar to the overall sequence coverage for contig construction (16.3×).
The green area of Supplementary Fig. S3, which represents nucleotides of 10× coverage depth, covers 668 837 291 bases, accounting for 65.3% of the total contig size (1 024 053 188 bases), whereas the yellow area covers 25.3% (258 687 068 bases). It is highly likely that the green area consists of contigs with high polymorphism and the yellow area those with less polymorphism. If so, nearly two-thirds of genome sequences are highly polymorphic. As this would prevent proper contig assembly, polymorphism is probably one of the major reasons why contig N50 is small.
The size of the P. fucata genome, as estimated by flow cytometry, is 1150 Mb (Fig. 1B). On the other hand, the euchromatic region of the genome is ~666 Mb (see Supplementary data), indicating that ~58% of the genome is euchromatic and the other 42% heterochromatic. The euchromatic sequences occupy ~73 and 66% of the genomes of C. intestinalis and O. latipes, respectively. Compared with these, the pearl oyster genome includes more heterochromatic regions; this may be one of the reasons why the P. fucata genome is comparatively large.
On the other hand, if the former are of less allelic polymorphism while the latter high allelic polymorphism, this result raises an intriguing possibility. As mentioned in Section 2, the pearl oyster whose genome is described in this study is from a strain that Mikimoto Co. has cultivated for more than 30 years. The strain might have been selected for better production of pearls under both qualitative and quantitative genetic pressure. If this is the case, and if genes found in the genomic area with less heterozygosity are responsible for calcification or pearl production, we may have revealed some of the reasons why this strain is advantageous for pearl production. On the other hand, polymorphic genes from the region with high heterozygosity will be useful as markers in studies of the population genetics of pearl oysters.
In contrast to the large nuclear genome, which encodes more than 20 000 protein-coding genes, the typical mitochondrial genome of metazoans is small, circular, and 15−17 kb long.55 It contains genes encoding 13 proteins of the respiratory chain [cytochorome c oxidase subunit I−III (cox1, cox2, and cox3), apocytochrome b (cytb), ATPase 6 and 8 (atp6 and atp8), and NADH dehydrogenase subunits 1–6 and 4L (nad1–6 and nad4L), two for rRNA (rrnS and rrnL), and 22 genes for tRNA (trn)].55 Although molluscs in general, and bivalves in particular, exhibit an extraordinarily high degree of mitochondrial gene order variation when compared with other clades of metazoans,56–58 it is reasonable to analyse the P. fucata mitochondrial genome to examine the degree of contig construction.
First, BLAST search, using bivalvian mitochondrial genes as queries against contigs of the P. fucata draft genome, yielded three contigs containing mitochondrial genes (Fig. 4A). The first and largest contig (no. 46136), which consisted of 11 459 bp, contained rrnL, trnE, trnM, nad5, cox1, cox2, nad6, atp6, cox3, nad3, trnC, trnP, trnH, nad1, trnL, trnN, trnF, and nad4, in that order. The second contig (no. 46133), 4127 bp in length, contained trnD, trnQ, trnG, trnA, nad2, trnR, trnT, trnL, and trnK, in that order (Fig. 4A). The third contig (no. 46134), 857 bp in length, included rrnS. This analysis did not identify atp8, trnI, trnS, trnW, trnY, or trnV. Previous studies, however, demonstrated that atp8 has been lost independently from the mitochondrial genomes of several bivalves and oysters.56–58
Secondly, we reconstructed contigs that contained only mitochondrial genes, as described in Section 2. This analysis yielded two contigs, mt1 and mt2 (Fig. 4B). mt1, 12 066 bp in length, contained genes whose arrangement is identical to contig 46136 (Fig. 4B). On the other hand, mt2 (5142 bp) contained trnD, trnQ, trnG, trnA, nad2, trnR, trnT, trnL, trnK, trnW, and rrnS; in other words, contigs 46133 and 46134 were joined via trnW into a single mitochondrial contig, mt2 (compare Fig. 4A and B).
These analyses of mitochondrial genes suggest that the coverage of the pearl oyster genome sequences (both nuclear and mitochondrial DNAs) was high. The occurrence of the three contigs obtained via BLAST (no. 46136, 46133, and 46134), compared with that of two reconstructed mitochondrial contigs, suggested that the sequence assembly and contig construction themselves were not always insufficient. This also supports the notion that the low contig N50 was caused by high allelic polymorphism in pearl oyster nuclear DNA.
The GC content was estimated to be ~34% from the assembled genome sequences (Fig. 5), and therefore the pearl oyster genome is AT-rich. Figure 5 shows GC contents of genomes of various metazoans for comparison; they ranged from low values (34% in P. fucata and C. elegans; 35% in Strongylocentrotus purpuratus) to higher values (40% in H. sapiens; 38% in B. floridae) genomes. Based on these comparisons, we conclude that P. fucata has a relatively high AT-content among metazoans.
The present P. fucata genome project used sperm DNA from a single male. In the assembled genome, we have not detected any sequences that are predicted to correspond to bacteria or other contaminating organisms. The single peak in the GC content of the raw reads (Fig. 5) further supported lack of sequence contamination.
Transcriptome analysis is essential in order to understand the genes that are expressed in a given organism. As mentioned before, a recent EST analysis of P. fucata identified 29 682 unique sequences.7 Here, we performed EST analyses of P. fucata with the aid of a Roche 454 GS-FLX. RNA was isolated from a mixture of nine developmental stages. Sequencing yielded 36 780 contigs (29.3 Mb) over 100 bp with an N50 size of 1575 bp. Of these, 33 570 (91.3%) had a BLAT alignment to the assembled genome (using default settings). Of the 36 780 contigs, 8571 contained a complete ORF of at least 450 bp. Of these putatively full-length EST contigs, 8342 (97.3%) had a BLAT alignment to the assembled scaffolds. These data were used to produce gene models and annotation.
The final set of the P. fucata gene model version 1 contained 43 760 genes (Table 2). Of them, 23 257 were complete models with the both start and stop codons (Table 2). Of these, 69.7% are substantially supported by P. fucata ESTs (Table 2).
At the present modelling stage, the average length of genes was 7700 bp. The number of exons per gene was 3.2, and average length of exons was 589 bp, suggesting that a gene consists of 1885-bp long exons on average. Thus, each gene contains on average 4815 bp of intronic sequence (6700–1885 = 4815). The long intron insertions partially explain the large genome size of P. fucata. Proteins consisted of 274 amino acids on average.
The 43 760 predicted protein-coding loci should be examined by further improvement of sequence assembly and gene prediction. Due to the short lengths of contigs and scaffolds, presumably caused by insufficient coverage of the comparatively large genome, the model may include predictions that do not represent bona fide protein-coding genes. Such spurious predictions could arise from unrecognized repetitive elements and/or splitting of genes between scaffolds. However, BLAST search of the 23 257 predicted Pinctada genes with those of other metazoans showed that 15 077 of the Pinctada gene models (65%) are homologous to other metazoan genes or ESTs.
As is the case for the pearl oyster, mollusc genomes are comparatively large in size, plausibly due to the presence of a larger number of TEs and repetitive elements in the genome. Recently Yoshida et al.59 examined whether repetitive elements caused expansion of the genome size in molluscs. They showed that the proportions of repetitive elements are 9.2, 4.0, and 3.8% in the pygmy squid, nautilus, and scallop, respectively. Since their genome sizes are estimated to be 2.1, 4.2, and 1.8 Gb, respectively, the authors concluded that the repetitive element expansion does not always explain the increase in genome size; specifically, nautilus is an outlier.
We determined the proportion of TEs and repetitive elements in the assembled genome of the pearl oyster. As shown in Fig. 6 and Supplementary Table S4, 0.37 and 1.45% of the P. fucata genome appear to originate from DNA transposons and retrotransposons, respectively. The DNA transposons included Mariner, Polinton, Helitron, Academ, and others, while retrotransposons included LTR retrotransposons such as Gypsy, DIRS, and Bel_Pao as well as non-LTR retrotransposons such as enelop and CR1. The percentages of LINEs (long interspersed nuclear elements such as L1, RTE, and Jockey) and SINEs were not significantly high in the pearl oyster genome. These ratios are lower than those of three mollusc species mentioned above. On the other hand, nearly 7.9% of the genome was occupied by tandem repeat elements (Fig. 6; Supplementary Table S4); among these repeats, microsatellites occupied ~3.7% of the genome.
It is possible that repetitive elements, especially shorter ones, were discarded during the process of sequence assembly. To examine this issue, we also determined the proportion of DNA transposons, retrotransposons, and tandem repeats in raw data generated in a one-run WGS read. We determined that the three data sets yielded similar results; one example is shown in Supplementary Table S4. The percentages of DNA transposons and retrotransposons were 0.4 and 2.7, respectively, suggesting that the assembly process did not affect so much the estimation of DNA transposons and retrotransposons in the Pinctada genome.
Although the ratio of these components is likely to be underestimated, owing to the insufficient assembly of genome sequences because of the large genome size, these data indicate that the Pinctada genome contains a comparatively small number of DNA transposons, retrotransposons, and tandem repeats. Therefore, the presence of these elements is not the cause of the large genome size.
To date, cDNAs and/or proteins for more than 80 genes have been reported as those associated with the calcification and/or bivalve shell formation.6 These include Aspein,60 ferritin,61 KRMP,62 MSI60,63 N16,64 N19,65 Nacrein,66 PFMG,67 Pfty (Pictada fucata tyrosinase-1),68 Pif177,69 Prisilkin-39,70 Prismalin-14,71 and Shematrin72 (Table 3). In order to examine the validity of the assembled genome, we searched for corresponding genes in the assembled P. fucata genome. We found the presence of 21 corresponding genes in the genome (Table 3). We found four genes corresponding to N-19 in the genome, three of which are aligned in scaffold 2495, suggesting tandem duplication of the genes (data not shown).
These results suggest that the present draft genome will be useful for exploration of the organization of various genes in the pearl oyster genome. An annotation and characterization of the pearl formation-related genes, found in previous studies, are now underway (Kinoshita et al., unpublished).
Transcription factors and signal transduction molecules play pivotal roles in the formation of animal body plans.73,74 Qualitative and quantitative changes in these molecules have been discussed in the context of the evolution of mollusc body plans.15,75 We examined transcription factors and signal transduction molecules using the Pfam domain method.
We determined the number of domains that have been identified in transcription factors and compared it with the genomes of N. vectensis, D. melanogaster, and H. sapiens (Supplementary Table S5). The domains include HLH, homeobox, nuclear hormone receptors, POU, bZIP1, Ets, Fork head, PAX, SRF-TF, GATA, HMG box, RHD, DM SRF, Runt, P53 DNA-binding domain, T-box, and others (Supplementary Table S5). This analysis illustrates several characteristic features of transcription factors in the pearl oyster genome (Supplementary Table S5). First, the genome contains major transcription factors found in other metazoan genomes. For example, in the Gene Model version 1.0, the Pinctada genome contains 43 genes for the bHLH domain, 88 genes for the homeobox domain, and 25 for nuclear hormone receptor domains. These numbers are comparable with those found in the Drosophila genome, and about half of those found in the human genome.
Recent studies of genes encoding cell−cell signalling molecules in molluscs reported unexpected complexity among the signalling genes. As in the case of identifying domains associated with transcription factors, we used the Pfam domain method to identify genes involved in cell−cell signal transduction. The results are summarized in Supplementary Table S6. The Pinctada genome contains genes that encode major signalling molecules, including members of the Wnt family, TGF-β family, and G-protein-coupled signalling family (Supplementary Table S6). For example, the Pinctada, Nematostella, and Drosophila genomes contain, respectively, 4, 6, and 8 candidates for the TGF-β family, and 9, 13, and 10 candidates for the G-protein-coupled signalling family. At present, the Pinctada genome contains six genes in the Wnt family and one gene in the FGF family (Supplementary Table S6).
Although the human genome contains genes for interleukin domains, the Nematostella and Drosophila genomes are unlikely to have genes for interleukin domains, including interleukin-2, -3, -4, -5, -7, -11, and -12 (Supplementary Table S6). The mollusc P. fucata genome also lacks interleukin genes.
Molecular phylogeny is a powerful and objective method for inferring the phylogenic relationships of extant metazoans; various molecules have recently been used, singly or in combination, to deduce mollusc phylogeny in the context of the evolution of lophotrochozoans. For example, Dunn et al.11 examined the broad phylogenic relationship among animals, using data from 77 taxa and 150 genes. This study positioned Mollusca in a clade of Lophotrochozoa, together with another group that includes Annelida, Phoronidae, Nemertea, and Entoprocta. On the other hand, Paps et al.76 reported that Mollusca forms a clade with Brachiozoa as a group of Spiralia within Lophotrochozoa.
Within the 2087 orthologus genes, 409 genes were detected in all selected organisms, which allowed us to compare 77 547 aligned amino acid positions (see Section 2). As shown in Fig. 7, the tree indicates that the mollusc forms a clade of Lophotrochozoa that is independent from a clade comprising ecdysozoans and deuterostomes. This is the first confirmation by genome-wide analyses using nuclear genes of the three major groups of bilaterians.
We sequenced the genome of the pearl oyster P. fucata using pyrosequencing technologies. The draft genome, ~1150 Mb at this draft stage, contains 23 257 complete gene models. Most likely, the assembled genome contains almost no sequences derived from other organisms such as bacteria, because the genome DNA was obtained from the sperm of a single male. In spite of the recent accumulation of genomic information in various metazoan taxa, there have been very few reports of sequenced genomes of molluscs and/or lophotrochozoans. This study therefore provides the first opportunity to obtain insight into a bivalvian mollusc genome. The P. fucata genome also provides a basic platform for further studies of the biosynthesis of pearl, which has a significant importance in the fisheries industry. Such studies are now in progress by multiple investigators, including the authors of this study.
This work was supported by the OIST Internal Fund for Marine Genomics Unit. We thank S. Brenner and R. Baughman for their support. The supercomputer work was supported by the IT Section of the OIST and the Human Genome Center at the University of Tokyo.