Users may view, print, copy, download and text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
Castor bean (Ricinus communis) is an oil crop that belongs to the spurge (Euphorbiaceae) family. Its seeds are the source of castor oil, used for the production of high-quality lubricants due to its high proportion of the unusual fatty acid ricinoleic acid. Castor bean seeds also produce ricin, a highly toxic ribosome-inactivating protein, making castor bean relevant for biosafety. We report here the 4.6X draft genome sequence of castor bean, representing the first reported Euphorbiaceae genome sequence. Our analysis shows that most key castor oil metabolism genes are single-copy while the ricin gene family is larger than previously thought. Comparative genomics analysis suggests the presence of an ancient hexaploidization event that is conserved across the dicotyledonous lineage.
Castor bean (Ricinus communis) is an oilseed crop that belongs to the family Euphorbiaceae, which is composed of 6,300 species including crops such as cassava (Manihot esculenta), rubber tree (Hevea brasiliensis), and physic nut (Jatropha curcas), as well as the invasive weed leafy spurge (Euphorbia esula) and ornamental poinsettias (Euphorbia pulcherrima). The castor bean plant is a tropical perennial shrub that originated in Africa and is cultivated in many tropical and subtropical regions around the world. It can be self- and cross-pollinated, and worldwide studies showed low genetic diversity among castor bean germplasm1,2.
Castor bean seed oil is composed of 90% ricinoleic acid, an unusual hydroxy-fatty acid3. Because of the nearly uniform content of ricinoleic acid in castor oil, and the unique chemical properties that this fatty acid confers to the oil, castor bean is a highly valued oilseed crop for lubricant, cosmetic, medical, and specialty chemical applications. Furthermore, castor bean is a potential biodiesel source, due to its high seed oil content4 and because it can be cultivated in unfavorable environments, making it an appealing crop in tropical developing countries. It is believed that castor oil was first used as an ointment 4,000 years ago in Egypt, from where it spread to other parts of the world, including Greece and Rome, where it was used as a laxative 2,500 years ago5.
An important problem of castor bean as a crop is the high seed content of ricin, an extremely toxic protein6. Ricin is considered one of the deadliest natural poisons when administered intravenously or inhaled as fine particles. Ricin was first isolated more than a century ago7. It has reportedly been used as a weapon6, and attempts to use ricin as a specific immunotoxin for therapeutic purposes in different cancers have been reported8,9. Its biochemical activity has been characterized as that of a type 2 ribosome-inactivating protein (RIP), composed of two subunits linked by a disulfide bond: a 32 kDa ricin toxin A (RTA) chain that harbors the ribosome-inactivating activity, and a 34 kDa ricin toxin B (RTB) chain, with a galactose-binding lectin domain. RTA is an N-glycosidase that depurinates a specific adenine residue of the 28S ribosomal RNA10,11. The RTB chain allows ricin to enter eukaryotic cells by binding to cell surface galactosides and subsequent endocytosis. Other RIPs are common in plants, although they are not toxic because they are usually monomeric and lack a lectin domain. These proteins constitute the type 1 RIPs12.
Ricin is synthesized as a precursor encoding both subunits in the endoplasmic reticulum of endosperm cells, and is translocated and accumulated in protein bodies13. The precursor is proteolytically processed in the endoplasmic reticulum and in the protein bodies, where it is stored as the mature heterodimer.
Ricin is very similar to the Ricinus communis agglutinin or RCA14. However, while ricin is a weak hemagglutinin, RCA has low toxicity and a strong hemagglutinin activity. In addition, RCA is a tetrameric protein composed of two RTA- and two RTB-like subunits.
Purification of ricin can be achieved through a relatively simple process, raising biosafety concerns. For this reason, the United States does not extensively produce castor oil, and it is among the world's largest importers of castor oil and its derivatives. Therefore, knowledge of the castor oil metabolism is important to advance towards using castor oil as biofuel, as well as to enable metabolic engineering to obtain safe sources of hydroxy-fatty acids without the complications of ricin.
The castor bean genome size has been estimated at 320 Mb by flow cytometry15 and it is distributed in 10 chromosomes but, to our knowledge, no genetic map is available for castor bean. Because of the lack of available genomic information for castor bean, and due to its importance for biodefense and as a model system in the Euphorbiaceae family, we set out to generate a draft sequence of the castor bean genome.
We produced approximately 2.1 million high-quality sequence reads from plasmid and fosmid libraries (see Supplementary methods), and used the Celera assembler to build consensus sequences or contigs and to link these contigs into 25,800 scaffolds using the two end-sequences from individual clones (mate-paired reads). The assembly covered the genome approximately 4.6X, spanning 350 Mb, which is consistent with previous genome size estimations. If only the 3,500 scaffolds larger than 2 kb are considered, the assembly spans 325 Mb with an N50 of 0.56 Mb (Table 1).
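The N50 statistic quoted above can be illustrated with a minimal sketch. It assumes the common definition (the length L such that scaffolds of length at least L hold half of the assembled bases); the function name and the toy scaffold lengths are hypothetical and are not taken from the castor bean assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is taken here as the length L such that sequences of
    length >= L together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in bp (illustration only).
toy_scaffolds = [800_000, 560_000, 300_000, 120_000, 40_000, 5_000]
print(n50(toy_scaffolds))
```

Because the statistic depends on which scaffolds are included, restricting the input to scaffolds above a size cutoff (as done in the text for scaffolds larger than 2 kb) changes both the span and the N50.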
The genome sequence assembly was searched for repetitive DNA using a combination of sequence alignment to databases of repetitive sequences and RepeatScout to identify repeats de novo. Overall, over 50% of the genome was identified as repetitive DNA (excluding low-complexity sequences), most of which could not be associated with known element families. One third of the repetitive elements were retrotransposons, and less than 2% were DNA transposons (Table 2). The most abundant known repeats are long terminal repeat (LTR) elements (22.7% Gypsy-type and 9.5% Copia-type).
Protein-coding genes were annotated using multiple gene-prediction programs, homology searches against sequence databases, and the cDNA spliced-alignment tool PASA (Program to Assemble Spliced Alignments). In order to aid the genome annotation, we also generated 52,165 expressed sequence tags (ESTs) from 5 non-normalized cDNA libraries. Using PASA, these and other castor bean cDNA sequences from GenBank could be aligned to 5,491 predicted genes and to 688 genomic regions where no gene had been predicted, allowing the creation of additional gene models. After all gene-prediction programs and homology searches were run, these data were consolidated into consensus gene predictions using the program Evidence Modeler (EVM; see Supplementary methods). EVM showed better sensitivity and specificity than any of the individual gene finders used (Supplementary Table 1). In this way, we identified 31,237 gene models (Table 1). Using TIGR’s Paralogous Families pipeline, 58.5% of the castor bean gene models were grouped in 3,020 predicted protein families of at least two members (Supplementary Fig. 1; Supplementary Table 2).
Although the castor bean genome assembly is fairly fragmented, it contains several megabase-sized scaffolds. Thus, we attempted to investigate the extent of genome duplications in castor bean and contribute to the elucidation of the evolutionary history of the dicotyledonous lineage. Different models have been proposed to explain the origin of genome duplications in dicots. One supports the occurrence of an ancestral hexaploidization event common to all dicots16, while the other model suggests that all dicot genomes share one duplication event17. Analysis of genomic duplications in the castor bean genome offers an opportunity to advance toward resolving this controversy. Thus, we searched for putative paralogous genes using reciprocal best BLAST matches between all castor bean genes. We then selected the 30 pairs of scaffolds that contained the highest numbers of paralogous gene pairs. The 22 unique scaffolds that make up those 30 pairs were displayed in a dot plot. This approach led to the identification of 6 triplicated regions (i.e., regions for which 2 additional paralogous regions exist in the genome). We also identified 9 duplicated regions (unmarked strings of dots) for which we cannot determine if a third paralogous region exists or not (Fig. 1). We then carried out a more precise and comprehensive search for evidence of genomic triplications by first building Jaccard clusters of paralogous genes using an all-versus-all BLASTP search. We identified and displayed blocks of syntenic genes using Sybil18 and manually inspected the results to identify triplicated regions. With this method, we identified 17 triplicated regions (Supplementary Fig. 2) that included those found using the reciprocal best BLAST matches method. The fact that the triplications were found in multiple groups of scaffolds suggests that the castor bean genome underwent a hexaploidization event.
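The reciprocal-best-match step and the ranking of scaffold pairs can be sketched as follows. This is a minimal illustration and not the pipeline used for the analysis: the input is assumed to be (query, subject, bit score) tuples from an all-versus-all search, the function names and toy gene identifiers are hypothetical, and skipping gene pairs that lie on the same scaffold is an assumption made for the example.

```python
from collections import Counter


def best_hits(hits):
    """Map each query gene to its highest-scoring non-self subject."""
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-matches in an all-versus-all search
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_pairs(hits):
    """Return gene pairs that are each other's best match."""
    best = best_hits(hits)
    return {tuple(sorted((a, b))) for a, b in best.items() if best.get(b) == a}


def rank_scaffold_pairs(pairs, scaffold_of, top=30):
    """Count paralogous gene pairs per scaffold pair, highest first."""
    counts = Counter()
    for a, b in pairs:
        sa, sb = scaffold_of[a], scaffold_of[b]
        if sa != sb:  # assumption: only between-scaffold pairs are counted
            counts[tuple(sorted((sa, sb)))] += 1
    return counts.most_common(top)


# Hypothetical toy data (gene names, scores and scaffolds are invented).
toy_hits = [
    ("g1", "g2", 200), ("g1", "g3", 150), ("g2", "g1", 210),
    ("g3", "g1", 90), ("g3", "g4", 120), ("g4", "g3", 110),
]
toy_scaffold_of = {"g1": "sA", "g2": "sB", "g3": "sA", "g4": "sB"}

toy_pairs = reciprocal_best_pairs(toy_hits)
print(rank_scaffold_pairs(toy_pairs, toy_scaffold_of))
```

Scaffold pairs with many paralogous gene pairs are the ones worth plotting against each other, since a diagonal run of dots in such a plot indicates a duplicated block.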
In order to determine if the triplication of the castor bean genome corresponds to ancestral polyploidization events previously described in the dicot lineage, we compared the castor bean triplicated regions versus the Arabidopsis19, poplar20, grapevine16, and papaya21 genomes by generating Jaccard clusters in a pair-wise manner between castor bean and each of the other genomes. Out of the 17 triplications, 8 (including 5 of the 6 triplications identified by reciprocal best BLAST matches) contained blocks of 5 or more syntenic gene pairs between each of the three castor bean regions and all of the other dicot genomes. Castor bean paralogous gene blocks generally showed a one-to-one, one-to-two, and one-to-four relationship with their grapevine, poplar, and Arabidopsis orthologues, respectively (Fig. 2 and Supplementary Fig. 3). Some exceptions were observed in the comparison with Arabidopsis that were expected due to the further rearrangements that exist in its genome19. Comparison between the castor bean and papaya genomes is less clear due to the fragmentation of both genome assemblies. Our results support the presence of a hexaploidization event common to all dicots, as well as one additional genome duplication in poplar, and two further duplications in the Arabidopsis genome.
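The idea behind Jaccard clustering can be sketched in a few lines. This is a generic illustration, not the implementation used here (the actual procedure is described in the Supplementary methods): two genes are linked when the sets of sequences they match overlap strongly, and clusters are the connected components of those links. The 0.6 threshold, the function names and the toy identifiers are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(neighbors, threshold=0.6):
    """Cluster genes whose match sets have a Jaccard coefficient >= threshold.

    `neighbors` maps each gene to the set of genes it matches in a
    BLASTP search. Each gene is added to its own set so that two genes
    matching each other share members.
    """
    sets = {gene: set(hits) | {gene} for gene, hits in neighbors.items()}
    parent = {gene: gene for gene in sets}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in combinations(sets, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for gene in sets:
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical toy input: "rc" = castor bean, "vv" = grapevine, "at" = Arabidopsis.
toy_neighbors = {
    "rc1": {"rc2", "vv1"}, "rc2": {"rc1", "vv1"}, "vv1": {"rc1", "rc2"},
    "rc3": {"at1"}, "at1": {"rc3"}, "rc4": set(),
}
print(jaccard_clusters(toy_neighbors))
```

In a pairwise comparison between two genomes, clusters that contain genes from both species give the candidate orthologous pairs from which syntenic blocks are then assembled.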
As the presence of ricin makes castor bean an important subject for biosecurity research, we analyzed the lectin gene family that includes the genes for ricin and RCA. The ricin protein gene codes for three domains: an N-terminal RIP domain and two C-terminal lectin domains. It has been reported that this gene family is composed of 6 to 8 members, detected by Southern-blot hybridization using a ricin cDNA probe22,23. However, the castor bean genome revealed 28 putative genes in the family, including potential pseudogenes or gene fragments. In order to increase the reliability of our analysis of this gene family, sequence gaps or ambiguities inside the ricin-like gene models were subjected to manual finishing work to improve the sequence and assembly quality. In this way, the sequence and assembly of 8 scaffolds were improved and the 28 gene family members (Fig. 3) were contained in a total of 17 scaffolds, each containing 1 to 5 ricin-agglutinin gene family members (Supplementary Table 3). These results suggest that the members of this lectin family tend to be clustered in the castor bean genome. The largest cluster spans 70 kb and includes a group of 5 family members interrupted by one gene that does not belong to the gene family. The other clusters contain 2 or 3 gene family members in regions ranging between 0.7 and 17 kb. Ten scaffolds contained only one gene family member, and 4 of them were longer than 250 kb, suggesting that these four genes were not part of clusters. However, it is uncertain if the other 6 scaffolds that contain only one member of the family are part of clusters because they are shorter than 12 kb. Some of these tandem duplications were probably not resolved in previous studies using Southern-blot analysis, resulting in an underestimation of the gene family size.
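One simple way to group family members into physical clusters is to chain genes on the same scaffold whose spacing stays below a cutoff. The sketch below is illustrative only: the 20 kb gap, the function name and the toy coordinates are hypothetical and are not the criteria or data behind Supplementary Table 3.

```python
from collections import defaultdict


def tandem_clusters(genes, max_gap=20_000):
    """Group genes on the same scaffold into clusters.

    `genes` is an iterable of (name, scaffold, start, end) tuples.
    Consecutive genes belong to one cluster when the distance from
    the furthest end seen so far to the next start is <= max_gap.
    """
    by_scaffold = defaultdict(list)
    for name, scaffold, start, end in genes:
        by_scaffold[scaffold].append((start, end, name))

    clusters = []
    for scaffold in sorted(by_scaffold):
        current, last_end = [], 0
        for start, end, name in sorted(by_scaffold[scaffold]):
            if current and start - last_end > max_gap:
                clusters.append((scaffold, current))
                current, last_end = [], 0
            current.append(name)
            last_end = max(last_end, end)
        clusters.append((scaffold, current))
    return clusters


# Hypothetical family members with invented coordinates (bp).
toy_genes = [
    ("a", "s1", 1_000, 3_000),
    ("b", "s1", 5_000, 7_000),
    ("c", "s1", 90_000, 92_000),
    ("d", "s2", 100, 2_000),
]
print(tandem_clusters(toy_genes))
```

A singleton on a short scaffold (such as "d" here) cannot be classified with confidence, which is the same caveat raised above for the scaffolds shorter than 12 kb.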
Furthermore, although we did not manually curate structural annotation, we found two cases in which adjacent ricin-like gene fragments could belong to pseudogenes that accumulated frame shifts and stop codons (Fig. 3). The length of the different members of the family identified by automatic annotation was variable, ranging from 66 to 584 amino acids. Although some of the shorter genes could be non-functional or pseudogenes, start and stop codons could be predicted, making it difficult to determine their functionality, and 4 of them were truncated due to being at the end of a contig or scaffold. Sequence comparison to ricin and RCA coding sequences in GenBank uncovered one full-length gene model (60629.m00002) identical to the ricin coding sequence and another full-length gene model (60637.m00004) showing 99% identity to RCA coding sequence. These gene models likely correspond to the reported ricin and RCA sequences, respectively. An additional predicted gene (60628.m00003) shows 100% identity to the ricin coding sequence, although presumably, the sequence coding for about 150 of the 576 amino acids is missing from this gene model because it is located at the end of a scaffold. Three other gene models are truncated in a similar way (60626.m00001; 60639.m00003; 60627.m00002) and show 100% identity to the ricin coding sequence, but the available sequences are much shorter (149 to 188 amino acids). Thus, it is uncertain if these genes represent complete identical copies of the ricin-coding gene. The rest of the gene family members showed different degrees of similarity to the ricin or RCA coding sequences. Overall, the proteins coded by 7 genes (including ricin and RCA) out of the 28 family members contain the RIP and the two lectin domains, 9 contain only the RIP domain, and 9 contain one or two lectin domains only (Fig. 3). 
cDNA alignments showed evidence of expression of the genes coding ricin and RCA as well as one of the homologues (60638.m00018) for which a putatively complete gene was modeled (not shown). Furthermore, evidence of RIP activity has been recently reported for the proteins coded by the 7 full-length ricin-like genes24.
Due to the importance of castor bean as an oilseed, we examined the annotation of 71 gene models that showed similarity to known genes involved in the biosynthesis of fatty acids and triacylglycerols, which in castor bean correspond mainly to ricinoleic acid and triricinolein25. Out of these 71 gene models, the annotation of 67 was manually improved (Supplementary Table 4). Castor bean has not only evolved an oleic acid hydroxylase to synthesize ricinoleic acid, but it also developed the capacity to efficiently accumulate high levels of ricinoleic acid in its seed oil. Therefore, we focused on a few key genes in the ricinoleic acid biosynthetic and metabolic pathways. The oleic acid hydroxylase gene (FAH) that produces ricinoleic acid from oleoyl-phosphatidylcholine likely evolved from the widely occurring FAD2 gene for the Δ12-oleic acid desaturase26. BLAST searches of these genes against the entire castor genome confirmed that there is only one copy of each of these genes (28035.m000362 and 29613.m000358, respectively). Among the key enzymes involved in the incorporation of ricinoleic acid into oils are diacylglycerol acyltransferases (DGATs), which catalyze the final step in triacylglycerol assembly. Two classes of endoplasmic reticulum-associated DGATs (DGAT1 and DGAT2) occur in castor bean, as well as a homolog of a soluble DGAT27,28,29. The gene models coding for these enzymes are also single-copy (29912.m005373, 29682.m000581, and 29889.m003411, respectively). In addition to DGAT-coding genes, it is likely that other genes have evolved to maintain high and specific flux of ricinoleic acid from its synthesis on phosphatidylcholine to its storage in triacylglycerols in castor bean seeds. Remarkably, even though ricinoleic acid accounts for nearly 90% of the fatty acids in castor bean seeds, it is present at less than 5% of the fatty acids in phosphatidylcholine30.
Although the mechanism for ricinoleic acid flux among lipid classes is not clear, a number of specialized acyltransferase and phosphatidylcholine metabolic enzymes likely participate in these reactions, including phospholipid:diacylglycerol acyltransferase 1 (PDAT1; 29912.m005286)31 and the recently identified phosphatidylcholine:diacylglycerol cholinephosphotransferase32 (PDCT; 29841.m002865). Information on copy number, genomic context, and regulatory regions of these and other metabolic genes will be important for the biotechnological transfer of ricinoleic acid production to established oilseed crops that lack ricin and its associated health risks. In addition, it is likely that the correct combination of specialized metabolic genes identified from the castor bean genome sequence will enable the engineering of triricinolein accumulation to amounts substantially higher than the modest levels achieved to date in model oilseeds33,34.
In order to contribute to the biotic stress research field in Euphorbiaceae, which is particularly important for cassava35, we compiled the castor bean predicted proteins whose functional annotation was related to disease resistance. One hundred and twenty-one potential disease-resistance predicted proteins could be identified (Supplementary Table 5) using our automated annotation pipeline. The majority of these predicted proteins belong to the nucleotide binding-leucine-rich repeat (NBS-LRR) class, followed by the less common extracellular LRR-containing (eLRR) proteins36, and dirigent-like proteins that have been associated with disease resistance37. The castor bean gene models coding for these resistance genes were found distributed in 69 scaffolds and were often found in clusters of genes from the same class, although in some cases (e.g., scaffold 30190), different resistance gene classes are found in the same cluster (Supplementary Table 5). These data will be useful for comparative studies on resistance genes in cassava as well as other Euphorbiaceae crops.
Due to the growing biosecurity importance of castor bean as the source of the highly toxic protein ricin38, genomic information becomes crucial for the development of improved diagnostic and forensic methods for ricin detection and cultivar identification for tracing sample origins. Molecular diagnostic methods39 and world-wide analyses of castor bean populations1,2 have been reported and the availability of the castor bean genome sequence will accelerate efforts to advance such studies and technologies.
In addition to its relevance for biosecurity, the castor bean genome information we report here can have implications for the production of biofuels and thus contribute to reducing greenhouse gas production. The industry of castor oil as a biodiesel component is being developed in Brazil4, where the use of biofuels is highly advanced. Furthermore, castor oil can also be used as a lubricity additive to replace sulfur-based lubricant components in petroleum diesel, helping to reduce sulfur emissions40.
Unfortunately, the presence of ricin poses a problem for castor bean as a widely cultivated oilseed crop. Therefore, considerable effort has been directed to engineering ricinoleic acid production in seeds of the model plant Arabidopsis as a prelude to transferring the required genes to an established ricin-free oilseed crop such as soybean. The initial strategy has involved the seed-specific expression of the castor bean FAH gene for the FAD2-related Δ12-oleic acid hydroxylase26, the key enzyme in ricinoleic acid synthesis41,42. However, transgenic expression of FAH resulted in the accumulation of ricinoleic acid and other hydroxy fatty acids to only 15 to 20% of the total fatty acids in Arabidopsis seeds41,42. Even co-expression of FAH with one additional ricinoleic acid metabolic gene, such as the castor bean gene for DGAT2, yielded only small increases in ricinoleic acid accumulation in seeds of transgenic Arabidopsis, which remained far below the levels typically found in castor bean seeds33,43. These results mirror the modest production of other unusual fatty acids that has been achieved by expression of FAD2 variants such as the Δ12 epoxygenase and fatty acid conjugases in seeds of transgenic plants44,45. These results suggest that expression of a single biosynthetic gene, such as FAH, alone or together with a gene involved in the metabolism of a given unusual fatty acid, is insufficient to reproduce the oil composition observed in castor bean seeds. Thus, additional information on regulatory and metabolic genes is needed to fully transfer high levels of unusual fatty acid production and accumulation to engineered oilseed crops43,46,47. It is envisioned that the castor bean genome sequence and its annotation constitute the foundation for identifying the regulatory and metabolic networks controlling castor oil biosynthesis.
In combination with metabolomics studies, these castor bean genome resources will enable metabolic engineering for improving castor oil production in crop plants lacking ricin.
Our analysis of the castor bean genome contributes to the debate on the polyploidization events that occurred in dicotyledonous genomes, supporting the presence of an ancestral hexaploidization event. Extending this type of analysis to cassava will have a great impact on the cassava research community, as it will synergize with the recently released genome sequence of cassava (http://www.phytozome.net/cassava), which is an important food and, more recently, industrial crop in poor, tropical countries. It has been proposed that cassava is an allopolyploid48, and preliminary comparative genomics analyses between cassava and castor bean showed evidence of genomic duplications in cassava relative to castor bean (S. Rounsley, pers. comm.). These analyses suggest that the allopolyploidization event may have occurred in the cassava genome relatively recently, after the split between the two lineages. Further genome-wide comparative studies will provide insights into the genome evolution of cassava and the Euphorbiaceae family. Such information will help advance cassava breeding, which is a key means for developing countries to generate improved cassava lines with increased levels of stress resistance and nutritional content.
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturebiotechnology.
This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health, Department of Health and Human Services, under NIAID Contract N01-AI-30071 (R. communis subproject) awarded to CMFL, JR, JRW, and PDR; Federal Bureau of Investigation grant J-FBI-04-186 to CMFL, JR, and PDR; and National Science Foundation grant DBI 0701919 to EBC. We thank the Joint Technology Center at JCVI for carrying out all sequencing work, and Ken Wurdack for his assistance with phylogenetics.
Note: Supplementary information is available on the Nature Biotechnology website.
AUTHOR CONTRIBUTIONS
A.P.C., J.C., H.L., B.J.H., and J.R.W. performed genomic analyses. Q.Z., J.O., and M.S. conducted genome annotation. D.P. worked on the genome assembly. A.M.-B., K.M.J., and J.R. made DNA preparations, library constructions, and closure work. G.C., E.B.C., and M.G. performed manual annotations. C.M.F.-L. and J.R. conceived the project. P.D.R. conceived and directed the project.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Availability of data. Sequence and annotation data have been submitted to GenBank (accession numbers AASG02000001-AASG02059013; and XP_002509419.1-XP_002540639.1), and the annotation data can also be freely accessed through the project's website (http://castorbean.jcvi.org), which includes a genome browser and a BLAST server.