GeneDesign is a set of web applications that provides public access to a nucleotide manipulation pipeline for synthetic biology. The server is public and freely accessible, and the source is available for download under the New BSD License. Since GeneDesign was published and made publicly available 3 years ago, we have made its code base more efficient, added several algorithms and modules, updated the restriction enzyme library, added batch processing capabilities, and added several command line modules, all of which we briefly describe here.
GeneDesign was originally developed as an in-house project to automate the design of oligonucleotides for the construction of individual synthetic genes (1). Gene synthesis is becoming more practical as synthesis costs decrease, and many excellent computational tools have been introduced to aid in the design of synthetic constructs, the most recent of which are compared in Table 1. In the 6 months from June 2009 to December 2009, GeneDesign was accessed over 2000 times from cities all over the globe (Figure 1). College and university networks account for 63% of GeneDesign access (Figure 2). The most popular modules are Building Block Design, Reverse Translation and Restriction Site Addition, which together account for three quarters of traffic (Figure 3). Since our original publication we have adapted GeneDesign for the construction of entire chromosomes (2). This genome-scale project has necessitated new modules, a new approach to the assembly of multikilobase genes from oligonucleotides, and significant improvements to the efficiency of the underlying code.
The relative synonymous codon usage (RSCU) value for a codon is the ratio of how often that codon is observed to how often it would be expected if all synonymous codons for the same amino acid were used equally (3). GeneDesign is equipped with a canonical set of RSCU values from early studies of codon distribution in a set of genes with observed high rates of expression (4). However, at the time that study was performed, only 809 gene sequences were available from six model organisms. Although we have enjoyed experimental success with sequences designed using the old RSCU data set, we wanted to be able to determine RSCU values for much more targeted subsets of genes; for example, in designing a yeast histone gene, we may wish to use RSCU values derived from many yeast histone genes as a reference set rather than values derived from the whole yeast genome. This module takes as input a single gene or a list of genes in FASTA format and returns a table of RSCU values, as well as a table of the most often used codon for each residue, which may be used directly in the Reverse Translation module. This module is now available as a command line script.
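The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than GeneDesign's own code: the function names `rscu` and `preferred_codons` are hypothetical, the standard genetic code is assumed, and sequences are assumed to be in frame starting at the first base. RSCU is computed as the observed count of a codon multiplied by the number of synonymous codons for its amino acid, divided by the total count for that amino acid.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in TCAG order (assumed; other codes would need another table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """Return {codon: RSCU} for in-frame coding sequences (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    values = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        for c in family:
            # observed count / (family total / family size); 0.0 if residue never seen
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values


def preferred_codons(values):
    """Return {residue: codon with the highest RSCU} (ties keep the first codon seen)."""
    best = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in best or values[codon] > values[best[aa]]:
            best[aa] = codon
    return best
```

For a toy input such as `rscu(["GCTGCTGCTGCC"])`, alanine is seen four times (three GCT, one GCC), so GCT receives an RSCU of 3.0 and GCC a value of 1.0.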
It is sometimes helpful to visualize the change in RSCU values across a gene. Significant deviations from the average RSCU value in a sub-sequence may indicate a nucleotide requirement (such as an effect on local translation rate, which might affect folding) that would be disrupted by manipulation. In Figure 4, we show the wild-type and optimized sequences for the integrase and reverse transcriptase open reading frame from the Saccharomyces cerevisiae Ty1 transposon. The wild-type sequence has a significant dip in average RSCU that is completely eliminated by optimization; this valley corresponds to the nucleotides at the boundary between the two genes, an area that has been shown to have an impact on the expression of reverse transcriptase (5). The new Codon Bias Graphing module accepts FASTA input and generates a graph of the average RSCU value across the length of each gene.
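One simple way to obtain such a profile is a sliding-window average over per-codon RSCU values. The sketch below assumes this approach; the text does not state how the module averages, so the window size, the function name `windowed_rscu` and the treatment of unknown codons as 0.0 are all assumptions made for illustration.

```python
def windowed_rscu(seq, rscu_values, window=10):
    """Average RSCU over a sliding window of codons (hypothetical helper).

    seq         -- in-frame coding sequence
    rscu_values -- {codon: RSCU}, for example from a reference gene set
    window      -- window width in codons (assumed default, not a GeneDesign setting)
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    scores = [rscu_values.get(c, 0.0) for c in codons]
    if len(scores) < window:
        return []
    # one value per window start; plot these against codon position
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]
```

Plotting the returned list for a wild-type and a recoded sequence on the same axes would show whether a local dip of the kind described above survives recoding.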
The original release of GeneDesign employed a very simple model of gene assembly. Multikilobase coding regions were designed to be synthesized as ~500 bp segments, or building blocks, that overlapped at restriction enzyme recognition sites. Building blocks were assembled by restriction digestion and ligation. This works well for genes on the small side of the multikilobase scale; for larger genes, practical restriction enzyme sites are used up rather quickly. We currently enjoy success in synthesizing ~750 bp building blocks and have updated GeneDesign's defaults to reflect this. We have also moved away from the restriction enzyme model for the assembly of large genes (and even small chromosomes) and have therefore added a module that designs oligos and primers for the assembly of building blocks based on a uracil-specific excision reagent (USER) protocol (6). This module carves up chunks of 10 kb or longer into building blocks of any desired size, and defines endpoints at sequences conforming to the consensus ANxT (an A, followed by x nucleotides of any identity, followed by a T), where x is an odd integer and where the chosen sequence is shared between adjacent building block ends. Picking an odd integer gives an overhang of odd length (x + 2), which cannot be identical to its own reverse complement; the overhang is therefore non-palindromic and will assemble in only one orientation. The default lets x be any odd number from 5 to 11 (generating USER overhangs of 7, 9, 11 or 13 nucleotides). We empirically determined that with yeast DNA, a mixture of x values results in the nearly constant building block lengths that are desirable for production synthesis. GeneDesign also ensures that all overhangs are mutually incompatible, except for the adjacent ends meant to assemble together, which facilitates assembly in a single defined order and orientation. A third option for building block synthesis is to simply have GeneDesign assign overlaps of a constant length. This can be used to assemble building blocks by overlap extension PCR or exonuclease-based methods.
This module is now available on the command line, where it offers the ability to design building blocks and oligos from multiple sequences at once.
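The search for ANxT junctions can be illustrated with a short Python sketch. It is not GeneDesign's algorithm: the helper names, the greedy choice of the site nearest each target coordinate, and the simple rule that an overhang may not repeat or match the reverse complement of one already used are all assumptions made for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_user_sites(seq, x_values=(5, 7, 9, 11)):
    """Return sorted (position, overhang) pairs for every A-N(x)-T match."""
    seq = seq.upper()
    sites = []
    for x in x_values:
        # lookahead so that overlapping matches are all reported
        pattern = re.compile(rf"(?=(A[ACGT]{{{x}}}T))")
        sites.extend((m.start(), m.group(1)) for m in pattern.finditer(seq))
    return sorted(sites)


def choose_junctions(seq, block_size=750, x_values=(5, 7, 9, 11)):
    """Greedily pick one junction near each multiple of block_size (assumed strategy)."""
    sites = find_user_sites(seq, x_values)
    chosen, used, last, target = [], set(), -1, block_size
    while target < len(seq):
        candidates = sorted((s for s in sites if s[0] > last),
                            key=lambda s: abs(s[0] - target))
        for pos, overhang in candidates:
            # reject overhangs that could anneal to a junction already chosen
            if overhang not in used and revcomp(overhang) not in used:
                chosen.append((pos, overhang))
                used.add(overhang)
                last = pos
                break
        else:
            break  # no usable site remains downstream
        target = last + block_size
    return chosen
```

Because every overhang found this way has odd length, `revcomp(overhang) != overhang` always holds, which is the non-palindromy property the design relies on.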
All code has been revised for efficiency, consistency and compatibility. In addition, all modules that output gene sequences now offer FASTA output.
The Reverse Translate module takes a protein sequence and replaces each residue with a user-determined codon. Typically, users select one of GeneDesign's codon sets, but they may also define their own. This module now offers Bacillus subtilis and Drosophila melanogaster RSCU data sets. It will now accept a set of protein sequences in the FASTA format. This module is now available as a command line script, where it offers the ability to design sequences reverse translated for more than one organism at a time.
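At its core, reverse translation is a table lookup, as the following Python sketch shows. The function names are hypothetical, and the three-entry codon table in the usage note is a toy example for illustration only, not one of GeneDesign's organism-specific codon sets.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence} (minimal parser, no validation)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def reverse_translate(protein, codon_choice):
    """Replace each residue with the codon chosen for it in codon_choice.

    codon_choice maps one-letter residues (and '*' for stop) to codons;
    a residue missing from the table raises KeyError.
    """
    return "".join(codon_choice[aa] for aa in protein.upper())


def reverse_translate_fasta(text, codon_choice):
    """Reverse translate every protein record in a FASTA string."""
    return {name: reverse_translate(seq, codon_choice)
            for name, seq in read_fasta(text).items()}
```

With a toy table `{"M": "ATG", "K": "AAG", "*": "TAA"}`, the call `reverse_translate("MK*", table)` yields `ATGAAGTAA`. Supplying one table per organism and looping over them would mimic the multi-organism behaviour described for the command line script.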
The Codon Juggling module takes a protein coding nucleotide sequence and offers several synonymous, algorithmic variations on it. This module now has a new algorithm, ‘least different RSCU’, which seeks to replace as many codons as possible while minimizing disruption of the original average RSCU value for the sequence. This module is now available as a command line script, where it offers the ability to design sequences using multiple algorithms and for more than one organism at a time.
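One plausible reading of the 'least different RSCU' idea is sketched below in Python: every codon that has a synonym is swapped for the synonymous codon whose RSCU value is closest to its own. This is an interpretation for illustration, not GeneDesign's implementation; the function name, the tie-breaking order and the handling of unknown codons are assumptions, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def least_different_rscu(seq, rscu_values):
    """Swap each codon for the synonym with the nearest RSCU value (sketch)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE.get(codon)
        alternatives = [c for c, a in CODON_TABLE.items()
                        if a == aa and c != codon]
        if not alternatives:
            # Met, Trp, or a codon with ambiguous bases: nothing to swap
            out.append(codon)
            continue
        ref = rscu_values.get(codon, 0.0)
        out.append(min(alternatives,
                       key=lambda c: abs(rscu_values.get(c, 0.0) - ref)))
    out.append(seq[usable:])  # keep any trailing partial codon unchanged
    return "".join(out)
```

Given RSCU values of 1.5 for GCT and 1.4 for GCC, the sequence `GCTATG` becomes `GCCATG`: the alanine codon moves to its nearest neighbour in RSCU terms, while ATG, which has no synonym, is left alone.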
The Restriction Site Subtraction module takes a protein coding nucleotide sequence and allows the user to specify which restriction sites will be removed without modifying the encoded protein sequence. The Subtraction module has been modified to use the ‘least different RSCU’ algorithm when replacing codons in order to minimize the impact of each edit, and to change as few codons as possible in every edit.
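The following Python sketch shows how a site might be removed with a single synonymous change chosen by RSCU proximity. It is a simplified illustration, not GeneDesign's implementation: the function name `remove_site` is hypothetical, the reading frame is assumed to start at the first base, only the given strand of the recognition sequence is searched (a non-palindromic site would also need its reverse complement checked), and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def remove_site(seq, site, rscu_values):
    """Remove every occurrence of site by one synonymous codon swap each (sketch)."""
    seq, site = seq.upper(), site.upper()
    pos = seq.find(site)
    while pos != -1:
        fixed = False
        first = pos - pos % 3  # start of the first codon overlapping the site
        for start in range(first, min(pos + len(site), len(seq) - 2), 3):
            codon = seq[start:start + 3]
            aa = CODON_TABLE.get(codon)
            ref = rscu_values.get(codon, 0.0)
            # try synonyms in order of increasing RSCU difference
            alternatives = sorted(
                (c for c, a in CODON_TABLE.items() if a == aa and c != codon),
                key=lambda c: abs(rscu_values.get(c, 0.0) - ref))
            for alt in alternatives:
                trial = seq[:start] + alt + seq[start + 3:]
                # accept only if the swap lowers the site count (no new site created)
                if trial.count(site) < seq.count(site):
                    seq, fixed = trial, True
                    break
            if fixed:
                break
        if not fixed:
            raise ValueError(f"cannot remove {site} at position {pos} synonymously")
        pos = seq.find(site)
    return seq
```

For example, removing the EcoRI recognition sequence GAATTC from `ATGGAATTCTAA` changes the glutamate codon GAA to GAG, giving `ATGGAGTTCTAA`, which encodes the same peptide.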
The restriction enzyme database used by GeneDesign has been updated to include a more current set of commercially available enzymes and their prices (15). The enzyme choosing module has been updated to add overhang palindromy, heat inactivation, star activity, optimal incubation temperature, incubation buffer and methylation sensitivity as filter criteria.
We have modified the most popular modules to use a command line interface, allowing high-throughput design of synthetic genes and enabling GeneDesign to be embedded in other software applications and synthesis pipelines. Currently, the Reverse Translation, Codon Juggling and Building Block Design modules are implemented as scripts executable on a POSIX command line that use the FASTA format for input and output. There is evidence that codon optimization, in some cases, is more involved than using the codons most common in highly expressed genes. The command line modules allow users either to use GeneDesign's built-in codon definitions or to define their own codon tables for special cases. For example, recent work indicates that Escherichia coli protein expression is optimized by using codons that are charged during amino acid starvation rather than codons that are overrepresented in highly expressed proteins (16).
An instance of GeneDesign is freely available at http://www.genedesign.org and the source is now available under the New BSD License from GitHub at http://github.com/GeneDesign.
We will continue to expand the command line interface to the GeneDesign libraries. Planned modules will select unique PCR primers that distinguish between original and recoded sequences, suggest sequencing primers, and identify hairpins in designed sequences and oligonucleotides. We regularly solicit suggestions from users for new modules and for improvements to the web interface and command line version.
Department of Energy (grant number DE-FG02097ER25308 to S.M.R.); Microsoft Research (to J.S.B.); National Science Foundation (grant numbers MCB0718846 to J.D.B. and J.S.B., MCB-0546446 to J.S.B.). Funding for open access charge: Department of Energy (grant number DE-FG02097ER25308 to S.M.R.).
Conflict of interest statement. None declared.
Many thanks to Jessica Dymond for insightful comments and enthusiastic beta testing, and to John Kloss for a suffix tree implementation.