The central dogma of molecular biology states that a cell's genetic information, stored as DNA, is transcribed into mRNA and then translated into protein. Transcription and translation are regulated processes that together dictate the amount of a specific protein found inside the cell. Transcriptional regulation has been extensively studied, can take many different forms and determines the quantity of mRNA available for translation (1). Transcriptional regulatory strategies include activation, enhancement and de-repression, with these mechanisms working during initiation (2–4). Promoter proximal stalling has recently been reported to regulate transcriptional elongation (5), demonstrating that this step can be modulated to control gene expression. Translational regulation has primarily been studied at the level of initiation, with codons being an optimal regulatory unit that cells could use to influence translation elongation. Codons are the unit of information in mRNA and, by pairing with anticodons found in tRNA, they allow for the translation of nucleic acid information into protein sequences (6). Translation elongation is an understudied process and we have previously proposed that gene-specific codon usage patterns matched to specific tRNA modifications could be used to regulate elongation steps (7).
Individual codon usage patterns have also been studied to generate regulatory information. In 1987, Sharp et al. described a method for summarizing codon usage called the codon adaptation index (CAI). In the CAI, each gene in the genome is compared with an optimal codon usage pattern inferred from a set of presumed high-expression genes. This analysis yields a quantitative measure of the high-expression codon usage bias exhibited by each gene in the genome (9). Codon usage information has also been used in correlation studies, with high-usage codons in a genome corresponding to multi-copy tRNAs with matching anticodons (10), further demonstrating a connection between codons and tRNAs and their potential influence on gene expression. The biotechnology sector has also exploited codon–anticodon interactions and developed resources to optimize these interactions by increasing the levels of specific tRNAs (11). These codon–anticodon optimization tools promote high protein expression levels and further demonstrate the potential for codon usage patterns to affect gene expression.
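The CAI calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in any cited study: each codon is scored relative to the most-used synonymous codon in a reference set of presumed high-expression genes, and a gene's CAI is the geometric mean of those scores. The function names, the pseudocount of 0.5 for unobserved codons, and the exclusion of stop codons and single-codon amino acids are choices made here for illustration.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(orf):
    """Split an in-frame coding sequence into codons, dropping a partial tail."""
    orf = orf.upper().replace("U", "T")
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def relative_adaptiveness(reference_orfs, pseudocount=0.5):
    """Score each codon against the most-used synonymous codon in the reference set.

    The pseudocount (an assumption of this sketch) keeps unobserved codons
    from producing a score of zero.
    """
    counts = Counter(c for orf in reference_orfs for c in split_codons(orf))
    families = {}
    for codon, aa in CODON_TO_AA.items():
        families.setdefault(aa, []).append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        top = max(counts[c] + pseudocount for c in family)
        for c in family:
            weights[c] = (counts[c] + pseudocount) / top
    return weights


def cai(orf, weights):
    """Geometric mean of the relative adaptiveness of the scored codons in a gene."""
    logs = [math.log(weights[c]) for c in split_codons(orf) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the preferred codons of the reference set scores 1.0, and the score falls as rarer synonymous codons are used.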
There is an abundance of single-codon data for most sequenced organisms, but understanding codon usage may require local information associated with tandem codons. Dicodons are an interesting gene-specific parameter because these tandem codons can be matched to the mRNA sequence present in the A and P sites of the ribosome. Nguyen et al. described the use of dicodons as a promising feature for gene classification. Their study analyzed 1841 human leukocyte antigen (HLA) sequences for dicodon frequencies. One conclusion of the Nguyen study was that gene-specific dicodon data provide specific local information and can be used to classify genes into biological categories. The study further speculated that the translation of dicodons could be very sensitive to tRNA levels (13). While these authors did not analyze their data from a regulatory perspective, their study does demonstrate that dicodon characteristics classify HLAs into two major groups. Noguchi et al. developed the MetaGene approach to identify genes in sequenced genomes; it uses dicodon frequencies to obtain higher open reading frame prediction accuracy than codon frequencies alone. It is interesting to note that dicodon information has found extensive use in classification-based approaches. Little information is readily available, however, to compare dicodon sequences between individual genes or among groups of genes. In addition, there is a need for genome-based resources for the analysis and comparison of specific combinations of three, four and five codons in a row.
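As a rough illustration of what a gene-to-gene dicodon comparison could look like, the sketch below builds a dicodon frequency profile for each gene and measures how far apart two profiles are. It is not the method of any study cited above. It assumes overlapping dicodons (a window that advances one codon at a time, mirroring successive A and P site occupancy), and the distance measure (half the summed absolute difference in frequencies) and both function names are hypothetical choices made for this example.

```python
from collections import Counter


def dicodon_frequencies(orf):
    """Relative frequency of each overlapping dicodon in an in-frame sequence."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    pairs = Counter(a + b for a, b in zip(codons, codons[1:]))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}


def profile_distance(freq_a, freq_b):
    """Half the summed absolute frequency difference: 0 = identical, 1 = disjoint."""
    keys = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys) / 2
```

Two genes that share no dicodons score 1, and a gene compared with itself scores 0, so the value can be read as the fraction of dicodon usage that differs between the two genes.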
In this study, we describe a bioinformatics resource that we have developed to analyze, catalog and compare gene-specific codon information: the Gene-Specific Codon Counting (GSCC) Database. We have exhaustively analyzed each Saccharomyces cerevisiae gene to identify all one-, two-, three-, four- and five-codon combinations. We have developed both genomic and gene-specific resources to analyze our data, with the latter being used to identify unique codon runs in genes previously shown to be translationally regulated by tRNA methyltransferase 9 (Trm9)-catalyzed tRNA modifications (7). We have also used functional ontology information to analyze gene sequences with distinct codon usage patterns and have demonstrated that some transcripts whose corresponding proteins are associated with translation use a minimal group of codons. We have also demonstrated that same–same dicodon usage is over-represented in smaller-than-average genes, suggesting a regulatory potential for these sequences. The GSCC database and analysis method have been developed to serve as a resource for scientists interested in studying the regulatory role of local codons and as a launching pad for studies on the regulation of translation elongation.
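The kind of gene-specific counting described above can be illustrated with a short sketch. It is an assumption-laden example rather than the GSCC implementation: this section does not specify whether windows overlap or how stop codons are handled, so the code below simply slides a window of one to five codons along the reading frame, one codon at a time, and counts every combination it sees. The function names are hypothetical.

```python
from collections import Counter


def codon_run_counts(orf, max_run=5):
    """Count every run of 1..max_run consecutive codons in an in-frame sequence.

    Windows overlap (they advance one codon at a time), which is an
    assumption of this sketch.
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    counts = {}
    for n in range(1, max_run + 1):
        counts[n] = Counter(
            "".join(codons[i:i + n]) for i in range(len(codons) - n + 1)
        )
    return counts


def same_same_dicodons(counts):
    """Dicodons in which the same codon appears twice in a row."""
    return {run: n for run, n in counts[2].items() if run[:3] == run[3:]}
```

For a gene shorter than a given run length, the corresponding counter is simply empty, so the same call works across genes of any size.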