The rules that specify how the information contained in DNA is translated into amino acid “language” during protein synthesis are called “the genetic code”, and are commonly summarized as the “Standard” or “Universal” Genetic Code Table. In fact, this coding table is not at all “universal”: different organisms use different genetic code tables, and even within the same organism the nuclear and mitochondrial genes may be subject to two different coding tables.
Results: To understand the advantages and disadvantages these coding tables may bring to an organism, we analyzed various coding tables on genes subject to mutations and estimated how these genes “survive” over generations. We used this survival as an indicator of the “evolutionary” success of each coding table. We find that the “standard” genetic code is not actually the most robust of all coding tables and, interestingly, that the Flatworm Mitochondrial Code (FMC) appears to be the highest-ranking coding table given our assumptions.
Conclusions: It is commonly hypothesized that the more robust a genetic code, the better suited it is for maintenance of the genome. Our study shows that, given the assumptions in our model, the Standard Genetic Code is quite poor in terms of robustness when compared to alternate code tables. This raises the question of why the Standard Code, rather than FMC, has been adopted by so wide a variety of organisms, which needs to be addressed for a thorough understanding of genetic code evolution.
genetic code; evolution; robustness; statistical analysis
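As a rough illustration of how coding tables can be compared for robustness, the sketch below counts, for a given table, how many single-nucleotide substitutions in sense codons are synonymous, missense, or nonsense. This is a much simpler proxy than the multi-generation survival analysis described above and is not the authors' method. The helper names (`with_reassignments`, `mutation_outcomes`) are hypothetical, and the variant table (TGA read as Trp, as in several mitochondrial codes) is chosen only for illustration; it is not the FMC table.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def with_reassignments(table, changes):
    """Return a copy of `table` with some codons reassigned (hypothetical helper)."""
    variant = dict(table)
    variant.update(changes)
    return variant


def mutation_outcomes(table):
    """Classify every single-nucleotide substitution starting from a sense codon."""
    counts = Counter()
    for codon, aa in table.items():
        if aa == "*":
            continue  # only mutations of sense codons are scored
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = table[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


# Illustrative variant only: TGA decoded as tryptophan instead of stop.
VARIANT = with_reassignments(STANDARD, {"TGA": "W"})

if __name__ == "__main__":
    for name, table in [("standard", STANDARD), ("TGA->Trp variant", VARIANT)]:
        print(name, dict(mutation_outcomes(table)))
```

For the standard table this gives 134 synonymous, 392 missense and 23 nonsense substitutions out of 549. Which of these counts should be weighted most heavily is itself a modelling assumption.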
The genetic code provides the translation table necessary to transform the information contained in DNA into the language of proteins. In this table, a correspondence between each codon and each amino acid is established: tRNA is the main adaptor that links the two. Although the genetic code is nearly universal, several variants of this code have been described in a wide range of nuclear and organellar systems, especially in metazoan mitochondria. These variants are generally found by searching for conserved positions that consistently code for a specific alternative amino acid in a new species. We have devised an accurate computational method to automate these comparisons, and have tested it with 626 metazoan mitochondrial genomes. Our results indicate that several arthropods have a new genetic code and translate the codon AGG as lysine instead of serine (as in the invertebrate mitochondrial genetic code) or arginine (as in the standard genetic code). We have investigated the evolution of the genetic code in the arthropods and found several events of parallel evolution in which the AGG codon was reassigned between serine and lysine. Our analyses also revealed correlated evolution between the arthropod genetic codes and the tRNA-Lys/-Ser, which show specific point mutations at the anticodons. These rather simple mutations, together with a low usage of the AGG codon, might explain the recurrence of the AGG reassignments.
The authors find evidence for parallel evolution of an alternate genetic code in arthropod mitochondria (AGG is translated into lysine rather than serine), and correlated co-evolution of the tRNA-Lys/Ser anticodons.
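The AGG reassignment described above can be pictured as a one-entry change to a codon table. The following minimal sketch translates the same toy sequence under three assignments of AGG (arginine, serine, lysine). Only the AGG entry is varied; the other differences between the standard and invertebrate mitochondrial codes are deliberately ignored. The function names and the example sequence are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def code_with_agg(amino_acid):
    """Copy of the standard table with only AGG reassigned (illustrative simplification)."""
    table = dict(STANDARD)
    table["AGG"] = amino_acid
    return table


def translate(dna, table):
    """Translate complete codons of `dna` in frame 0; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    toy_gene = "ATGAGGAAATAA"  # hypothetical sequence: ATG AGG AAA TAA
    for label, aa in [("AGG=Arg", "R"), ("AGG=Ser", "S"), ("AGG=Lys", "K")]:
        print(label, translate(toy_gene, code_with_agg(aa)))
```

A conserved alignment column that is consistently lysine in other species, but encoded by AGG in the species of interest, is the kind of signal the comparison method looks for.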
Published descriptions of biology protocols are often ambiguous and incomplete, making them difficult to replicate in other laboratories. However, there is increasing benefit to formalizing the descriptions of protocols, as laboratory automation systems (such as microfluidic chips) are becoming increasingly capable of executing them. Our goal in this paper is to improve both the reproducibility and automation of biology experiments by using a programming language to express the precise series of steps taken.
We have developed BioCoder, a C++ library that enables biologists to express the exact steps needed to execute a protocol. In addition to being suitable for automation, BioCoder converts the code into a readable, English-language description for use by biologists. We have implemented over 65 protocols in BioCoder; the most complex of these was successfully executed by a biologist in the laboratory using BioCoder as the only reference. We argue that BioCoder exposes and resolves ambiguities in existing protocols, and could provide the software foundations for future automation platforms. BioCoder is freely available for download at http://research.microsoft.com/en-us/um/india/projects/biocoder/.
BioCoder represents the first practical programming system for standardizing and automating biology protocols. Our vision is to change the way that experimental methods are communicated: rather than publishing a written account of the protocols used, researchers will simply publish the code. Our experience suggests that this practice is tractable and offers many benefits. We invite other researchers to leverage BioCoder to improve the precision and completeness of their protocols, and also to adapt and extend BioCoder to new domains.
At earlier stages in the evolution of the universal genetic code, fewer than 20 amino acids are thought to have been used. Although this notion is supported by a wide range of data, the actual existence and function of genetic codes with a limited set of canonical amino acids have not been addressed experimentally, in contrast to the successful development of expanded codes. Here, we constructed artificial genetic codes involving a reduced alphabet. In one of the codes, a tRNAAla variant with the Trp anticodon reassigns alanine to an unassigned UGG codon in the Escherichia coli S30 cell-free translation system lacking tryptophan. We confirmed that the efficiency and accuracy of protein synthesis by this Trp-lacking code were comparable to those of the universal genetic code, by amino acid composition analysis, green fluorescent protein fluorescence measurements and crystal structure determination. We also showed that another code, in which UGU/UGC codons are assigned to Ser, synthesizes an active enzyme. This method will provide not only new insights into primordial genetic codes, but also an essential protein engineering tool for the assessment of the early stages of protein evolution and for the improvement of pharmaceuticals.
Synthesizing the state of the art from the published literature, this review assesses the basis for employing the Internet to support the information needs of primary care. The authors survey what has been published about the information needs of clinical practice, including primary care, and discuss currently available information resources potentially relevant to primary care. Potential methods of linking information needs with appropriate information resources are described in the context of previous classifications of clinical information needs. Also described is the role that existing terminology mapping systems, such as the National Library of Medicine's Unified Medical Language System, may play in representing and linking information needs to answers.
The fields of molecular biology and computer science have cooperated over recent years to create a synergy between the cybernetic and biosemiotic relationship found in cellular genomics to that of information and language found in computational systems. Biological information frequently manifests its "meaning" through instruction or actual production of formal bio-function. Such information is called Prescriptive Information (PI). PI programs organize and execute a prescribed set of choices. Closer examination of this term in cellular systems has led to a dichotomy in its definition suggesting both prescribed data and prescribed algorithms are constituents of PI. This paper looks at this dichotomy as expressed in both the genetic code and in the central dogma of protein synthesis. An example of a genetic algorithm is modeled after the ribosome, and an examination of the protein synthesis process is used to differentiate PI data from PI algorithms.
Prescriptive Information (PI); Functional Information; algorithm; processing; language; ribosome; biocybernetics; biosemiosis; semantic information; control; regulation; automata; Frame Shift Mutation
Investigation into the sequence structure of the genetic code by means of an informatic approach is a real success story. The features of human language are also the object of investigation within the realm of formal language theories. They focus on the common rules of a universal grammar that lies behind all languages and determine the generation of syntactic structures. This universal grammar is a depiction of material reality, i.e., the hidden logical order of things and its relations determined by natural laws. Therefore mathematics is viewed not only as an appropriate tool to investigate human language and genetic code structures through computer science-based formal language theory but is itself a depiction of material reality. This confusion between language as a scientific tool to describe observations/experiences within cognitively constructed models and formal language as a direct depiction of material reality occurs not only in current approaches but was the central focus of the philosophy of science debate in the twentieth century, with rather unexpected results. This article recalls these results and their implications for more recent mathematical approaches that also attempt to explain the evolution of human language.
formal language; linguistic turn; incompleteness theorem; pragmatic turn; speech acts; biocommunication; natural genome editing
The standard genetic code table has a distinctly non-random structure, with similar amino acids often encoded by codon series that differ by a single nucleotide substitution, typically in the third or the first position of the codon. It has been repeatedly argued that this structure of the code results from selective optimization for robustness to translation errors, such that translational misreading has the minimal adverse effect. Indeed, it has been shown in several studies that the standard code is more robust than a substantial majority of random codes. However, it remains unclear how much evolution the standard code underwent, what its level of optimization is, and what the likely starting point was.
We explored possible evolutionary trajectories of the genetic code within a limited domain of the vast space of possible codes. Only codes that possess the same block structure and the same degree of degeneracy as the standard code were analyzed for robustness to translation error. This choice of a small part of the vast space of possible codes is based on the notion that the block structure of the standard code is a consequence of the structure of the complex between the cognate tRNA and the codon in mRNA, where the third base of the codon plays a minimal role as a specificity determinant. Within this part of the fitness landscape, a simple evolutionary algorithm, with elementary evolutionary steps comprising swaps of four-codon or two-codon series, was employed to investigate the optimization of codes for the maximum attainable robustness. The properties of the standard code were compared to the properties of four sets of codes, namely, purely random codes, random codes that are more robust than the standard code, and two sets of codes that resulted from optimization of the first two sets. The comparison of these sets of codes with the standard code and its locally optimized version showed that, on average, optimization of random codes yielded evolutionary trajectories that converged at the same level of robustness to translation errors as the optimization path of the standard code; however, the standard code required considerably fewer steps to reach that level than an average random code. When evolution starts from random codes whose fitness is comparable to that of the standard code, they typically reach a much higher level of optimization than the standard code, i.e., the standard code is much closer to its local minimum (fitness peak) than most of the random codes with similar levels of robustness.
Thus, the standard genetic code appears to be a point on an evolutionary trajectory from a random point (code) about half the way to the summit of the local peak. The fitness landscape of code evolution appears to be extremely rugged, containing numerous peaks with a broad distribution of heights, and the standard code is relatively unremarkable, being located on the slope of a moderate-height peak.
The standard code appears to be the result of partial optimization of a random code for robustness to errors of translation. The reason the code is not fully optimized could be the trade-off between the beneficial effect of increasing robustness to translation errors and the deleterious effect of codon series reassignment that becomes increasingly severe with growing complexity of the evolving system. Thus, evolution of the code can be represented as a combination of adaptation and frozen accident.
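The swap-based optimization described above can be sketched as a greedy descent on an error-cost function. The version below is a simplified stand-in under several assumptions that are not taken from the source: the cost is the unweighted mean squared difference in Kyte-Doolittle hydropathy over all single-nucleotide substitutions between sense codons, and an elementary step swaps the complete codon sets of two amino acids (coarser than swapping individual four-codon or two-codon series). Stop codons stay fixed, so the block structure is preserved. All function names are hypothetical.

```python
import itertools
import random

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))

# Kyte-Doolittle hydropathy index (assumed similarity measure for this sketch).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}
AMINO_ACIDS = sorted(KD)


def cost(table):
    """Mean squared hydropathy change over sense-to-sense point substitutions (lower is more robust)."""
    total, n = 0.0, 0
    for codon, aa in table.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = table[codon[:pos] + base + codon[pos + 1:]]
                if other == "*":
                    continue
                total += (KD[aa] - KD[other]) ** 2
                n += 1
    return total / n


def swap(table, a, b):
    """Exchange the codon sets of amino acids a and b."""
    relabel = {a: b, b: a}
    return {codon: relabel.get(aa, aa) for codon, aa in table.items()}


def shuffled_code(rng):
    """Random code with the standard block structure: amino acid labels are permuted."""
    perm = AMINO_ACIDS[:]
    rng.shuffle(perm)
    relabel = dict(zip(AMINO_ACIDS, perm))
    return {codon: relabel.get(aa, aa) for codon, aa in STANDARD.items()}


def hill_climb(table, max_steps=1000):
    """Repeatedly apply the single best improving swap until none remains."""
    current, current_cost = table, cost(table)
    for step in range(max_steps):
        best = None
        for a, b in itertools.combinations(AMINO_ACIDS, 2):
            candidate = swap(current, a, b)
            c = cost(candidate)
            if c < current_cost - 1e-12 and (best is None or c < best[0]):
                best = (c, candidate)
        if best is None:
            return current, current_cost, step  # local minimum of the cost
        current_cost, current = best
    return current, current_cost, max_steps


if __name__ == "__main__":
    for name, start in [("standard", STANDARD), ("random", shuffled_code(random.Random(0)))]:
        _, final_cost, steps = hill_climb(start)
        print(f"{name}: cost {cost(start):.3f} -> {final_cost:.3f} in {steps} swaps")
```

Comparing the number of swaps and the final cost for the standard code against many random starting codes mirrors, in miniature, the kind of comparison reported above; the numbers from this sketch depend entirely on the assumed cost function.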
This article was reviewed by David Ardell, Allan Drummond (nominated by Laura Landweber), and Rob Knight.
Owing to the degeneracy of the genetic code, protein-coding regions of mRNA sequences can harbour more than only amino acid information. We search the mRNA sequences of 11 human protein-coding genes for evolutionarily conserved secondary structure elements using RNA-Decoder, a comparative secondary structure prediction program that is capable of explicitly taking the known protein-coding context of the mRNA sequences into account. We detect well-defined, conserved RNA secondary structure elements in the coding regions of the mRNA sequences and show that base-paired codons strongly correlate with sparse codons. We also investigate the role of repetitive elements in the formation of secondary structure and explain the use of alternate start codons in the caveolin-1 gene by a conserved secondary structure element overlapping the nominal start codon. We discuss the functional roles of our novel findings in regulating the gene expression on mRNA level. We also investigate the role of secondary structure on the correct splicing of the human CFTR gene. We study the wild-type version of the pre-mRNA as well as 29 variants with synonymous mutations in exon 12. By comparing our predicted secondary structures to the experimentally determined splicing efficiencies, we find with weak statistical significance that pre-mRNAs with high-splicing efficiencies have different predicted secondary structures than pre-mRNAs with low-splicing efficiencies.
The 3x redundancy of the Genetic Code is usually explained as a necessity to increase the mutation resistance of the genetic information. However, recent bioinformatic observations indicate that the redundant Genetic Code contains more biological information than previously known, in addition to the 64/20 definition of amino acids. It might define the physico-chemical and structural properties of amino acids, the codon boundaries, the amino acid co-locations (interactions) in the coded proteins and the free folding energy of mRNAs. This additional information, which seems to be necessary to determine the 3D structure of coding nucleic acids as well as the coded proteins, is known as the Proteomic Code and mRNA Assisted Protein Folding.
Gene; code; codon; translation; wobble-base
Nursing Vocabulary Summit participants were challenged to consider whether reference terminology and information models might be a way to move toward better capture of data in electronic medical records. A requirement of such reference models is fidelity to representations of domain knowledge. This article discusses embedded structures in three different approaches to organizing domain knowledge: scientific reasoning, expertise, and standardized nursing languages. The concept of pressure ulcer is presented as an example of the various ways lexical elements used in relation to a specific concept are organized across systems. Different approaches to structuring information—the clinical information system, minimum data sets, and standardized messaging formats—are similarly discussed. Recommendations include identification of the polyhierarchies and categorical structures required within a reference terminology, systematic evaluations of the extent to which structured information accurately and completely represents domain knowledge, and modifications or extensions to existing multidisciplinary efforts.
We study the viability and resilience of languages, using a simple dynamical model of two languages in competition. Assuming that public action can modify the prestige of a language in order to avoid language extinction, we analyze two cases: (i) the prestige can only take two values, (ii) it can take any value but its change at each time step is bounded. In both cases, we determine the viability kernel, that is, the set of states for which there exists an action policy maintaining the coexistence of the two languages, and we define such policies. We also study the resilience of the languages and identify configurations from which the system can return to the viability kernel (finite resilience), or in which one of the languages is led to disappear (zero resilience). Within our current framework, the maintenance of a bilingual society is shown to be possible by introducing the prestige of a language as a control variable.
The genetic code appears to be optimized in its robustness to missense errors and frameshift errors. In addition, the genetic code is near-optimal in terms of its ability to carry information in addition to the sequences of encoded proteins. As evolution has no foresight, optimality of the modern genetic code suggests that it evolved from less optimal code variants. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination that can encode the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding, as well as the discovery of organisms with ambiguous and dual decoding, suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. To explore this possibility we designed a mathematical model of the evolution of primitive digital coding systems which can decode nucleotide sequences into protein sequences. These coding systems can evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point-mutations. The replication rates of such coding systems depend on the accuracy of the generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of the complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.
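The statement that three is the minimal codon length follows from simple counting: with four nucleotides, codons of length n can distinguish at most 4^n meanings. The short sketch below makes that arithmetic explicit; the function name is hypothetical.

```python
def min_codon_length(n_meanings, alphabet_size=4):
    """Smallest codon length whose alphabet_size**length is at least n_meanings."""
    length, capacity = 1, alphabet_size
    while capacity < n_meanings:
        length += 1
        capacity *= alphabet_size
    return length


if __name__ == "__main__":
    # 4**2 = 16 is too few for 20 amino acids; 4**3 = 64 is enough.
    print(min_codon_length(20))  # 3
    print(min_codon_length(21))  # 3, still enough when a stop signal is added
```

Quadruplet codons (4^4 = 256) would also suffice, which is why the minimality argument alone does not rule out longer codons in early decoding systems.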
We studied terminological phrases on surgical procedures (from coding systems, controlled vocabularies, textbooks, and medical records) from an ontological point of view. A surgical procedure can be accurately described only by a set of sentences, in textbooks or surgical reports; a terminological phrase is just a short synthesis of that description. We outline three points of view currently used to construct a phrase, based on (i) relevant phases and variants; (ii) focus on structures, functions and pathologies; (iii) evolution of information and decisions during the process of care. For each of them we discuss potential principles and mechanisms, with the aim of deriving guidelines to generate homogeneous systematic names, to organize regularities in classifications and nomenclatures, and to normalize expressions in formal languages.
The coding sequence of a protein must contain the information required for the canonical amino acid sequence. However, the redundancy of the genetic code creates potential for embedding other types of information within coding regions as well. In a genome-wide computational screen for functional motifs within coding regions based on evolutionary conservation, highly conserved motifs included some expected motifs, some novel motifs and coding region target sites for known microRNAs, which are generally presumed to target 3’ untranslated regions (UTRs) (www.SiteSifter.org). We report here an analysis of published proteomics experiments that further support a functional role for coding region microRNA binding sites, though the effects are weaker than for sites in the 3’ UTR. We also demonstrate a positional bias with greater conservation for sites at the end of the coding region, and the beginning and end of the 3’ UTR. An increased effectiveness of microRNA binding sites at the 3’ end of transcripts could reflect proximity to the poly(A) tail or interactions with the 5’ terminal 7mGpppN “cap”, which is physically adjacent to this region once the message is circularized. The effectiveness of 3’ UTR sites could reflect a cooperative role for RNA binding proteins. Finally, increased microRNA conservation near the stop codon suggests to us the possible involvement of proteins that execute nonsense-mediated decay, since this process is activated by tagging of termination codons with factors that induce transcript degradation.
microRNA; coding region; evolutionary conservation; RNA binding protein; nonsense mediated decay
Alström syndrome is a rare autosomal recessive genetic disorder characterized by cone-rod dystrophy, hearing loss, childhood truncal obesity, insulin resistance and hyperinsulinemia, type 2 diabetes, hypertriglyceridemia, short stature in adulthood, cardiomyopathy, and progressive pulmonary, hepatic, and renal dysfunction. Symptoms first appear in infancy and progressive development of multi-organ pathology leads to a reduced life expectancy. Variability in age of onset and severity of clinical symptoms, even within families, is likely due to genetic background.
Alström syndrome is caused by mutations in ALMS1, a large gene comprising 23 exons and coding for a protein of 4,169 amino acids. In general, ALMS1 gene defects include insertions, deletions, and nonsense mutations that lead to protein truncations and are found primarily in exons 8, 10 and 16. Multiple alternate splice forms exist. ALMS1 protein is found in centrosomes, basal bodies, and cytosol of all tissues affected by the disease. The identification of ALMS1 as a ciliary protein explains the range of observed phenotypes and their similarity to those of other ciliopathies such as Bardet-Biedl syndrome.
Studies involving murine and cellular models of Alström syndrome have provided insight into the pathogenic mechanisms underlying obesity and type 2 diabetes, and other clinical problems. Ultimately, research into the pathogenesis of Alström syndrome should lead to better management and treatments for individuals, and have potentially important ramifications for other rare ciliopathies, as well as more common causes of obesity and diabetes, and other conditions common in the general population.
ALMS1; Alström syndrome; ciliopathy; truncal obesity.
Language-mediated visual attention describes the interaction of two fundamental components of the human cognitive system, language and vision. Within this paper we present an amodal shared resource model of language-mediated visual attention that offers a description of the information and processes involved in this complex multimodal behavior and a potential explanation for how this ability is acquired. We demonstrate that the model is not only sufficient to account for the experimental effects of Visual World Paradigm studies but also that these effects are emergent properties of the architecture of the model itself, rather than requiring separate information processing channels or modular processing systems. The model provides an explicit description of the connection between the modality-specific input from language and vision and the distribution of eye gaze in language-mediated visual attention. The paper concludes by discussing future applications for the model, specifically its potential for investigating the factors driving observed individual differences in language-mediated eye gaze.
language; vision; computational modeling; attention; eye movements; semantics
Insulin promotes muscle anabolism, but it is still unclear whether it stimulates muscle protein synthesis in humans. We hypothesized that insulin can increase muscle protein synthesis only if it increases muscle amino acid availability. We measured muscle protein and amino acid metabolism using stable-isotope methodologies in 19 young healthy subjects at baseline and during insulin infusion in one leg at low (LD, 0.05), intermediate (ID, 0.15), or high (HD, 0.30 mU·min⁻¹·100 ml⁻¹) doses. Insulin was infused locally to induce muscle hyperinsulinemia within the physiological range while minimizing the systemic effects. Protein and amino acid kinetics across the leg were assessed using stable isotopes and muscle biopsies. The LD did not affect phenylalanine delivery to the muscle (−9 ± 18% change over baseline), muscle protein synthesis (16 ± 26%), breakdown, or net balance. The ID increased (P < 0.05) phenylalanine delivery (+63 ± 38%), muscle protein synthesis (+157 ± 54%), and net protein balance, with no change in breakdown. The HD did not change phenylalanine delivery (+12 ± 11%) or muscle protein synthesis (+9 ± 19%), and reduced muscle protein breakdown (−17 ± 15%), thus improving net muscle protein balance but to a lesser degree than the ID. Changes in muscle protein synthesis were strongly associated with changes in muscle blood flow and phenylalanine delivery and availability. In conclusion, physiological hyperinsulinemia promotes muscle protein synthesis as long as it concomitantly increases muscle blood flow, amino acid delivery and availability.
metabolism; muscle perfusion
Standardized medical terminologies are gaining importance in the representation of medical data. In this paper, we present the evaluation of the SNOMED3.5 medical terminology to code concepts routinely used in chest radiology reports. Integration of this terminology mapper into a radiology reporting workstation that incorporates a speech recognition system and a natural language processor is also discussed. A total of 700 anatomical location terms (including synonyms) were tested and 72% of the terms had corresponding SNOMED terms. Of the 28% that did not result in a match, 16% were either morphological variants of SNOMED terms or could be found from a combination of terms from two or more SNOMED axes. Only 12% of the terms (primarily specialized radiology terms) were concepts not actually included in the SNOMED terminology.
Quantitative descriptions of amino acid similarity, expressed as probabilistic models of evolutionary interchangeability, are central to many mainstream bioinformatic procedures such as sequence alignment, homology searching, and protein structural prediction. Here we present a web-based, user-friendly analysis tool that allows any researcher to quickly and easily visualize relationships between these bioinformatic metrics and to explore their relationships to underlying indices of amino acid molecular descriptors.
We demonstrate the three fundamental types of question that our software can address by taking as a specific example the connections between 49 measures of amino acid biophysical properties (e.g., size, charge and hydrophobicity), a generalized model of amino acid substitution (as represented by the PAM74-100 matrix), and the mutational distance that separates amino acids within the standard genetic code (i.e., the number of point mutations required for interconversion during protein evolution). We show that our software allows a user to recapture the insights from several key publications on these topics in just a few minutes.
Our software facilitates rapid, interactive exploration of three interconnected topics: (i) the multidimensional molecular descriptors of the twenty proteinaceous amino acids, (ii) the correlation of these biophysical measurements with observed patterns of amino acid substitution, and (iii) the causal basis for differences between any two observed patterns of amino acid substitution. This software acts as an intuitive bioinformatic exploration tool that can guide more comprehensive statistical analyses relating to a diverse array of specific research questions.
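The mutational distance mentioned above (the number of point mutations required to interconvert two amino acids within the standard genetic code) can be computed directly from the codon table. The sketch below is an independent illustration, not code from the software described; the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in STANDARD.items() if aa == amino_acid]


def hamming(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def mutational_distance(aa1, aa2):
    """Minimum number of point mutations separating any codon of aa1 from any codon of aa2."""
    return min(hamming(c1, c2) for c1 in codons_for(aa1) for c2 in codons_for(aa2))


if __name__ == "__main__":
    for pair in [("F", "L"), ("M", "W"), ("M", "Y")]:
        print(pair, mutational_distance(*pair))
```

Phe and Leu are one mutation apart (TTT to TTA, for example), Met and Trp are two apart, and Met and Tyr require all three positions to change. A full 20 x 20 matrix of these distances is what would then be correlated with biophysical descriptors or substitution scores.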
A challenge of systems biology is to integrate incomplete knowledge on pathways with existing experimental data sets and relate these to measured phenotypes. Research on ageing often generates such incomplete data, creating difficulties in integrating RNA expression with information about biological processes and the phenotypes of ageing, including longevity. Here, we develop a logic-based method that employs Answer Set Programming, and use it to infer signalling effects of genetic perturbations, based on a model of the insulin signalling pathway. We apply our method to RNA expression data from Drosophila mutants in the insulin pathway that alter lifespan in a foxo-dependent fashion. We use this information to deduce how the pathway influences lifespan in the mutant animals. We also develop a method for inferring the largest common sub-paths within each of our signalling predictions. Our comparisons reveal consistent homeostatic mechanisms across both long- and short-lived mutants. The transcriptional changes observed in each mutation usually provide negative feedback to signalling predicted for that mutation. We also identify an S6K-mediated feedback in two long-lived mutants that suggests a crosstalk between these pathways in mutants of the insulin pathway, in vivo. By formulating the problem as a logic-based theory in a qualitative fashion, we are able to use the efficient search facilities of Answer Set Programming, allowing us to explore larger pathways, combine molecular changes with pathways and phenotype, and infer effects on signalling in whole-organism mutants in vivo, where direct signalling stimulation assays are difficult to perform. Our methods are available in the web-service NetEffects: http://www.ebi.ac.uk/thornton-srv/software/NetEffects.
Genes do not act in isolation but perform their biological functions within genetic pathways that are connected in larger networks. Investigation of nucleotide variation within genetic pathways and networks has shown that topology can affect the rate of protein evolution; however, it remains unclear whether the same pattern of nucleotide variation is expected within functionally similar networks and whether it may be due to similar or different biological mechanisms. We address these questions by investigating nucleotide variation in the context of the structure of the insulin/Tor-signaling pathway in Caenorhabditis, which is well characterized and is functionally conserved across phylogeny. In Drosophila and vertebrates, the rate of protein evolution is negatively correlated with the position of a gene within the insulin/Tor pathway. Similarly, we find that in Caenorhabditis, the rate of amino acid replacement is lower for downstream genes. However, in Caenorhabditis, the rate of synonymous substitution is also strongly affected by the position of a gene in the pathway, and we show that the distribution of selective pressure along the pathway is driven by differential expression level. A full understanding of the effect of pathway structure on selective constraints is therefore likely to require inclusion of specific biological function into more general network models.
network; aging; molecular evolution; gene expression; selection
All the information necessary for protein folding is supposed to be present in the amino acid sequence. It is still not possible to provide specific ab initio structure predictions by bioinformatic methods. It is suspected that additional folding information is present in protein-coding nucleic acid sequences, but this is not represented by the known genetic code.
Nucleic acid subsequences comprising the 1st and/or 3rd codon residues in mRNAs express significantly higher free folding energy (FFE) than the subsequence containing only the 2nd residues (p < 0.0001, n = 81). This periodic FFE difference is not present in introns. It is therefore a specific physico-chemical characteristic of coding sequences and might contribute to unambiguous definition of codon boundaries during translation. The FFEs of the 1st and 3rd residues are additive, which suggests that these residues contain a significant number of complementary bases, which may contribute to selection for local RNA secondary structures in coding regions. This periodic, codon-related structure formation of mRNAs indicates a connection between the structures of exons and the corresponding (translated) proteins. The folding energy dot plots of RNAs and the residue contact maps of the coded proteins are indeed similar. Residue contact statistics using 81 different protein structures confirmed that amino acids that are coded by partially reverse and complementary codons (Watson-Crick (WC) base pairs at the 1st and 3rd codon positions and translated in reverse orientation) are preferentially co-located in protein structures.
Exons are distinguished from introns, and codon boundaries are physico-chemically defined, by periodically distributed FFE differences between codon positions. There is a selection for local RNA secondary structures in coding regions and this nucleic acid structure resembles the folding profiles of the coded proteins. The preferentially (specifically) interacting amino acids are coded by partially complementary codons, which strongly supports the connection between mRNA and the corresponding protein structures and indicates that there is protein folding information in nucleic acids that is not present in the genetic code. This might suggest an additional explanation of codon redundancy.
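The codon-position subsequences and the "partially reverse and complementary" codon relation described above can be illustrated with a minimal Python sketch. The function names are hypothetical and not taken from the original work, and the pairing test encodes one reading of the definition (Watson-Crick pairs between the 1st base of one codon and the 3rd base of the other, and vice versa, with the 2nd base ignored). The folding-energy calculation itself requires an RNA folding program and is not reproduced here.

```python
# Watson-Crick partners for RNA bases.
WC_PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codon_position_subsequences(cds):
    """Split a coding sequence into three subsequences, one per codon position.

    Trailing bases that do not complete a codon are dropped.
    """
    usable = len(cds) - len(cds) % 3
    cds = cds[:usable].upper()
    return cds[0::3], cds[1::3], cds[2::3]


def partially_reverse_complementary(codon_a, codon_b):
    """Return True if the two codons pair at positions 1 and 3 when one
    of them is read in reverse orientation; the 2nd base is ignored.

    This is an assumed interpretation of the definition in the text.
    """
    a = codon_a.upper().replace("T", "U")
    b = codon_b.upper().replace("T", "U")
    return WC_PAIRS[a[0]] == b[2] and WC_PAIRS[a[2]] == b[0]


first, second, third = codon_position_subsequences("AUGGCCUAA")
print(first, second, third)
print(partially_reverse_complementary("GCU", "AAC"))
```

Each of the three subsequences could then be passed separately to a folding tool to compare their folding energies, which is the comparison the results describe.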
In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is not higher than that of stochastic regular grammars. These grammars, like other state-of-the-art methods, cannot cover any higher-order dependencies such as nested and crossing relationships that are common in proteins. In order to overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences where grammars are induced using a genetic algorithm.
This framework was implemented in a system aiming at the production of binding site descriptors. These descriptors not only allow detection of protein regions that are involved in these sites, but also provide insight into their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed some structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described either by PROSITE patterns, domain profiles or a set of patterns. Results show that the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to deal with patterns shared by non-homologous proteins.
A new Stochastic Context Free Grammar based framework has been introduced, allowing the production of binding site descriptors for analysis of protein sequences. Experiments have shown that not only is this new approach valid, but it also produces human-readable descriptors for binding sites, which have been beyond the capability of current machine learning techniques.
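To make concrete how a stochastic context-free grammar scores a sequence, the following Python sketch implements the standard inside algorithm for a grammar in Chomsky normal form. It is an illustration under stated assumptions rather than the framework described above: the grammar is a hand-written toy (the framework induces its grammars with a genetic algorithm, which is not shown), and the terminals "h" and "p" are hypothetical stand-ins for two amino acid property classes. The toy grammar captures a nested dependency of the kind that regular grammars cannot express.

```python
from collections import defaultdict

# Toy grammar (hypothetical): generates h^n p^n, a nested dependency.
# Unary rules map (nonterminal, terminal) to a probability.
TOY_UNARY = {("H", "h"): 1.0, ("P", "p"): 1.0}
# Binary rules map (parent, left child, right child) to a probability.
TOY_BINARY = {
    ("S", "H", "P"): 0.6,
    ("S", "H", "T"): 0.4,
    ("T", "S", "P"): 1.0,
}


def inside_probability(sequence, unary_rules, binary_rules, start="S"):
    """Probability that `start` derives `sequence` under the grammar."""
    n = len(sequence)
    if n == 0:
        return 0.0
    # table[i][j][A] = probability that A derives sequence[i..j] inclusive.
    table = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    # Spans of length 1 come from the unary (terminal) rules.
    for i, symbol in enumerate(sequence):
        for (lhs, terminal), prob in unary_rules.items():
            if terminal == symbol:
                table[i][i][lhs] += prob
    # Longer spans combine two adjacent shorter spans via binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for (lhs, left, right), prob in binary_rules.items():
                    p_left = table[i][k].get(left, 0.0)
                    p_right = table[k + 1][j].get(right, 0.0)
                    if p_left and p_right:
                        table[i][j][lhs] += prob * p_left * p_right
    return table[0][n - 1].get(start, 0.0)


for seq in ("hp", "hhpp", "hph"):
    print(seq, inside_probability(seq, TOY_UNARY, TOY_BINARY))
```

In a grammar-induction setting, a score of this kind computed over a set of training sequences could serve as part of a fitness function, although the fitness measure actually used in the framework is not specified here.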
The gene coding for cyclohexanone monooxygenase from Acinetobacter sp. strain NCIB 9871 was isolated by immunological screening methods. We located and determined the nucleotide sequence of the gene. The structural gene is 1,626 nucleotides long and codes for a polypeptide of 542 amino acids; 389 nucleotides 5' and 108 nucleotides 3' of the coding region are also reported. The complete amino acid sequence of the enzyme was derived by translation of the nucleotide sequence. From a comparison of the amino acid sequence with consensus sequences of nucleotide-binding folds, we identified a potential flavin-binding site at the NH2 terminus of the enzyme (residues 6 to 18) and a potential nicotinamide-binding site extending from residue 176 to residue 208 of the protein. An overproduction system for the gene to facilitate genetic manipulations was also constructed by using the tac promoter vector pKK223-3 in Escherichia coli.