Theory in Biosciences
Theory Biosci. 2009 August; 128(3): 165–170.
Published online 2009 June 26. doi:  10.1007/s12064-009-0067-y
PMCID: PMC2766041

Defining genes: a computational framework


The precise elucidation of the gene concept has become the subject of intense discussion in light of results from several large, high-throughput surveys of transcriptomes and proteomes. In previous work, we proposed an approach for constructing gene concepts that combines genomic heritability with elements of function. Here, we introduce a definition of the gene within a computational framework of cellular interactions. The definition seeks to satisfy the practical requirements imposed by annotation, capture logical aspects of regulation, and encompass the evolutionary property of homology.

Keywords: Gene concept, Homology, Computation, I/O-relations


The concept of the gene has come under intense scrutiny in recent years. This is largely in response to the recognition that the “standard” model of genes as beads on a genomic DNA string is inconsistent with the findings of high-throughput transcriptomics (see, for example, Pearson 2006; Pennisi 2007). As a consequence, several modifications of the concept of the gene have been explored, ranging from purely structural definitions in terms of groups of transcripts (Gerstein et al. 2007), through the consideration of transcripts themselves as the central operational units of the genome (Gingeras 2007), to functional notions (Scherrer and Jost 2007). In Prohaska and Stadler (2008) we suggested that a “useful” gene concept should satisfy several criteria:

  • The gene concept combines structural and functional components.
    • The gene concept is based on a well-defined notion of function that is amenable to experimental measurement.
    • The gene has a well-defined structural representation at the genomic sequence.
  • Genes are heritable (not to imply that all inheritance is embodied in genes). In particular, the concept must be compatible with a suitable notion of (phylogenetic) homology.
  • The gene concept is embedded in a larger framework that views “gene expression” as a form of computation.
  • Genes are “expressed” from the DNA, hence genes are associated with transcripts and/or further processing products.
  • The gene concept relates genomic mutations to changes in a gene product, and thereby allows for the explicit construction of genotype–phenotype maps.

In this short paper, we introduce a framework that satisfies these requirements. We do not claim that this framework is unique or optimal. We view this as an exercise in deriving a concrete model using the road map outlined in Prohaska and Stadler (2008).

A chemical/computational framework

The basis for our construction is an abstract computational model of regulation. We start with the observation that cellular processes can be described as chemical reactions. This includes the interconversions of metabolites, the interactions of regulators, the aggregation of supermolecular structures, and the transport of molecules. There will be no need to operate at the level of individual molecules. It is more practical to employ coarse-grained representations. For instance, transcription could be viewed as an input/output (I/O)-relation that takes genomic DNA and a set of transcription factors as input, and results in a specific output transcript. In this way, we emphasize the computational aspects of bulk chemical reactions.

More formally, each I/O-relation is a quadruple (χ, [x], [p], [y]), which we write in the form

$$[y] \leftarrow \chi\bigl([x];\,[p]\bigr) \qquad (1)$$

where [x] is a list of material components (inputs) transformed into a list of material outputs [y] by means of a process χ that depends on a list [p] of additional influences. We call [x] the arguments and [p] the parameters of χ. Equation 1 is an abstract, and arbitrarily coarse-grained representation of a chemical reaction. In chemical notation, we could write it in the form,

$$[x] \xrightarrow{\;[p]\;} [y]$$

Equation 1 can also represent transport “reactions”, where input and output describe the same object(s) in different spatial locations or compartments, as well as other high-level aggregate processes including replication, transcription, translation, or the production of biomass (if one chooses not to model such parts of the system in detail). In contrast to an implementation at the finest level, that of elementary chemical reactions, the I/O-relations are not required to satisfy conservation of mass or atom types. We are able, for instance, to ignore ubiquitous chemical species (such as H2O, CO2, or coenzyme A) and energy and redox currencies ATP and NADH, if we choose. Our framework is consistent with, but will be more coarse-grained than, a full-fledged representation of all chemical reactions. This is a common coarse graining in Systems Biology models (Palsson 2006). For our purposes, it will be convenient to model transcription and translation as I/O-relations that “produce” primary transcripts from a DNA template and a polypeptide from an RNA template. Equation 1 may also include compartment/spatial information and thus can describe cellular processes of more than one cell or organism, including a complete microbial community or even entire ecologies with complex predator–prey dynamics. Note that some or all elements of the output list [y] of χ will typically appear as inputs [x] and/or parameters [p] of other I/O-relations ξ.

A system Ξ of I/O-relations over a given domain of “objects” X has a natural interpretation as a model of computations on X (Berry and Boudol 1992; Taylor 1998). This gives us considerable freedom in implementing a model of cellular processes in the form of Eq. 1 depending on: (1) the level of aggregation or abstraction beyond elementary chemical reactions; and (2) the effect that a parameter p must have on the outcome of χ to be considered relevant. For example, we may define p to be relevant to a particular I/O-relation χ if the absence of p makes the transformation χ impossible. Alternatively, we could consider p a relevant influence whenever it affects the reaction rate.

Before proceeding, a formal issue requires attention. Each process χ links a particular triplet of input, output, and parameter lists. Hence, transformations utilizing the same input [x] to produce different outputs [y′] ≠ [y] are necessarily two distinct reactions χ and χ′. Here, we admit only physical objects as elements of the input and output lists [x] and [y]. The parameters [p], on the other hand, may be either objects or physical quantities such as temperature or pH. The parameter list may be empty, [p] = ∅, e.g., in spontaneous chemical reactions or transport by diffusion.

If an object a appears both as an argument, a ∈ [x], and as a parameter, a ∈ [p], in the same I/O-relation χ, this implies an autocatalytic mechanism. The argument and the parameter are necessarily two different instantiations of the object type a. The simplistic distinction between arguments and parameters in the formalism is akin to the notions of cis and trans action in molecular biology. Note, however, that the concepts are not equivalent in all cases.
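As an illustration, an I/O-relation can be held as a record with four fields. The following is a minimal sketch only: the class name `IORelation`, its field names, and the three example reactions are assumptions made for this sketch and are not part of the framework itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IORelation:
    """One I/O-relation (chi, [x], [p], [y]); all names are illustrative."""
    name: str                      # label of the process chi
    inputs: Tuple[str, ...]        # arguments [x]: physical objects that are transformed
    params: Tuple[str, ...] = ()   # parameters [p]: objects or quantities such as pH
    outputs: Tuple[str, ...] = ()  # material outputs [y]

    def is_autocatalytic(self) -> bool:
        # An object type occurring both as argument and as parameter
        # signals an autocatalytic mechanism.
        return any(obj in self.params for obj in self.inputs)


# Coarse-grained transcription: DNA plus transcription factors yield a transcript.
transcription = IORelation(
    name="transcription",
    inputs=("genomic_DNA",),
    params=("TF_A", "RNA_polymerase"),
    outputs=("primary_transcript",),
)

# Transport by diffusion: same object in two compartments, empty parameter list.
diffusion = IORelation(
    name="diffusion",
    inputs=("O2_outside",),
    outputs=("O2_inside",),
)

# Hypothetical case in which one copy of a molecule acts on another copy.
self_cleavage = IORelation(
    name="self_cleavage",
    inputs=("ribozyme",),
    params=("ribozyme",),
    outputs=("fragment_5p", "fragment_3p"),
)

print(transcription.is_autocatalytic())
print(self_cleavage.is_autocatalytic())
```

Keeping arguments and parameters in separate fields mirrors the distinction drawn above; a single merged "reactant" list would lose the information needed later to decide which objects influence a process without being transformed by it.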

Information metabolism

The crucial assumption in our exposition is that, given a suitable collection of I/O-relations, we can single out the transformations among informational molecules (i.e., heteropolymers that are capable of encoding information, such as RNA, DNA, and polypeptides) from the generic “reaction soup” of all I/O-relations. To this end, we identify those reactions in which both the input list [x] and the output list [y] contain informational molecules (DNA, RNA, or peptide) of a single type. Depending on whether the informational molecules in [x] and [y] are of the same type or not, χ represents either processing or one of several information transfer mechanisms (translation, transcription, reverse transcription, and replication). Each informational molecule has an explicit representation as a sequence of nucleotides or amino acids x = (z1, z2, …, zn). Among the transformations of informational molecules, we single out those reactions that satisfy an additional property of traceability.

Definition 1 The I/O-relation χ is traceable if and only if for each informational molecule y ∈ [y] in the output and each letter yk ∈ y we can uniquely determine whether

  1. yk is an encoded letter, i.e., its identity can be traced to a single letter or an interval (e.g., a codon) on an input sequence x ∈ [x]; or
  2. yk is not an encoded letter, in which case its identity is determined by the reaction χ.

The collection of traceable I/O-relations involving informational molecules represents the “linear” part of information metabolism. We suggest that it is not only well defined but also encompasses important processing steps in the “life-history” of a transcript, including, among others, primary transcription, splicing, translation, insertion editing, cleavage, intein extraction, chemical modification, and poly-adenylation. Thus we can interpret this collection as the subsystem of gene expression in a cell, organism, or ecosystem.

In a traceable reaction, we can determine, for any collection of sequence intervals in the output [y], the collection of all those sequence intervals in the input [x] that gave rise to the encoded letters in the output. This inverse map χ⁻¹ gives us a well-defined “footprint” of each stretch of output sequence on the input. The concatenation of such inverse maps is well defined and allows each sequence position in an informational molecule x to be traced back to its genomic source, the genomic footprint Γ(x) of x. This construction will provide us with the structural part of our gene concept. Any letter xk of x that is not contained in the genomic footprint Γ(x) is identified as being inserted or appended at a particular stage in the production of x. The genomic “source” of Γ(x) can be one of the cell’s genomes or, for instance, an intruding viral transcript.
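The concatenation of inverse maps can be sketched in a few lines. This is an illustration under simplifying assumptions: each encoded output letter is traced to a single input position (a codon would need an interval instead), positions are zero-based integers, and the three processing steps and all coordinates are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

# For one traceable step: output position -> encoding input position,
# or None for a letter that the reaction itself adds (not encoded).
InverseMap = Dict[int, Optional[int]]


def trace_to_genome(steps: List[InverseMap], length: int) -> List[Optional[int]]:
    """Trace each position of the final product back through all steps.

    `steps` is ordered from the genome towards the product; the last
    entry describes the reaction that produced the final molecule.
    """
    trace: List[Optional[int]] = list(range(length))
    for inverse in reversed(steps):
        trace = [None if pos is None else inverse[pos] for pos in trace]
    return trace


def as_intervals(positions: List[Optional[int]]) -> List[Tuple[int, int]]:
    """Collapse the encoded positions into inclusive genomic intervals."""
    intervals: List[Tuple[int, int]] = []
    for p in sorted({q for q in positions if q is not None}):
        if intervals and p == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], p)  # extend the current interval
        else:
            intervals.append((p, p))               # start a new interval
    return intervals


# Transcription copies genome positions 10..19 into transcript positions 0..9.
transcription: InverseMap = {i: 10 + i for i in range(10)}
# Splicing removes transcript positions 4..6 (a hypothetical intron).
splicing: InverseMap = {0: 0, 1: 1, 2: 2, 3: 3, 4: 7, 5: 8, 6: 9}
# Polyadenylation appends two letters that have no template.
polyadenylation: InverseMap = {**{i: i for i in range(7)}, 7: None, 8: None}

trace = trace_to_genome([transcription, splicing, polyadenylation], length=9)
footprint = as_intervals(trace)
print(trace)      # [10, 11, 12, 13, 17, 18, 19, None, None]
print(footprint)  # [(10, 13), (17, 19)]
```

The `None` entries are exactly the letters that lie outside the genomic footprint, i.e., those inserted or appended during processing.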

In some cases, a product x that appears in the linear part of the information metabolism may have an empty genomic footprint, Γ(x) = ∅. This is the case, e.g., for the so-called non-ribosomal peptides (NRPs), which are synthesized de novo without using the templating function of a messenger RNA (Walton 2006).

Theoretically, the definition of the genomic footprint Γ(x) requires detailed knowledge of the complete gene expression pathway leading to the production of x. In practice, however, Γ(x) can be approximated by mapping the sequence of a biopolymer x to the underlying genome. Current procedures of genome annotation do this in a way that incorporates knowledge of the genetic code, splicing, end processing, editing, etc. In other words, given the data provided by proteomics and transcriptomics, computational procedures can already produce reasonable estimates of genomic footprints. These are used in current genome annotations and genome browser systems (Furey 2006; Karolchik et al. 2008). In line with the prevailing simplified model of the transcriptome and proteome, genome browsers restrict themselves to co-linear arrangements of footprints of a given product, thereby neglecting rearrangements, trans-splicing, and conceivably other “non-monotonous” processing mechanisms.


Within the framework of I/O-relations, the function of an object z ∈ X becomes a derived property. It appears natural to identify the function of z with those processes that it influences. (Structural proteins are captured by formulating pseudo-reactions that describe the formation and reorganization of supramolecular structures.) We insist that “being processed” (i.e., being an input for an I/O-relation), “being produced” (i.e., appearing as output of an I/O-relation), and “encoding information” are not in themselves functions. The reason for this distinction is that we need to avoid the trivial notion that everything is functional just because it is present.

Formally, let us denote by param(χ) the set of parameters of the I/O-relation χ. This leads us to

Definition 2 The function Fct(z) of an object z ∈ X is the set of I/O-relations

$$\mathrm{Fct}(z) = \{\chi \in \Xi \mid z \in \mathrm{param}(\chi)\}$$

While this looks somewhat contrived, it forms the basis of the official Nomenclature of Enzymes (see NC-ICBMB and Webb 1992 and the annual supplements), in which enzymes are named for the chemical reactions that they catalyze: an alcohol dehydrogenase, for instance, is by definition an enzyme that catalyzes the dehydrogenation of alcohols. In our setting, this would be represented by associating z with a set of I/O-relations Fct(z) that consists (largely) of chemical reactions χ describing the dehydrogenation of various alcohols. Also note that nothing precludes an object from having multiple disparate functions in this framework, i.e., Fct(z) can be very large and contain several semantically different groups of I/O-relations.

Applying this notion of function, we identify the collection of functional informational molecules as those biopolymer sequences x for which Fct(x) ≠ ∅. Note, again, that in this way we pin function only to physical objects that influence transformations (of other objects). “Being transformed”, on the other hand, by construction does not count as a function in itself. Furthermore, an informational molecule does not acquire a function, for example, by housing a cis-regulatory element that regulates its own subsequent processing step. Our model declares that the function of regulating or influencing a subsequent processing step ξ is attributed to the parameter(s) of ξ, not to the argument of ξ. Intermediate processing products thereby will not typically have a function. For instance, if x is the mRNA coding for a protein p, we will often observe that Fct(x) = ∅, while the translation product p of x typically will have some function as an enzyme, signal molecule, or structural protein, such that Fct(p) ≠ ∅ (Fig. 1). Not all proteins are necessarily functional. The precursor p of one or more small hormone peptides p1′, …, pr′, for instance, may not have any function alone (Dicou 2008). In this case, we have Fct(pi′) ≠ ∅ for the hormone peptides pi′, but Fct(p) = ∅ for the prohormone protein; see c and d in Fig. 1.

Fig. 1
Functional objects a to e and relationships with their genomic footprints Γ(a) to Γ(e). A functional RNA molecule (e.g., a miRNA) with function Fct(a) is processed in two steps from an intronic sequence. Its image on the DNA is the genomic ...

In practice, Fct(z) depends on the experimental and computational methodologies employed to determine the processes (I/O-relations) that are dependent upon z. Improved measurements thus have the potential to change our representation of Fct(z).

We note, finally, that this simple notion of function brings with it a completely natural definition of functional equivalence: two objects p and q are functionally equivalent if Fct(p) = Fct(q).
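Because Fct(z) is defined purely by membership in parameter lists, it can be computed directly from a table of I/O-relations. The sketch below assumes a toy system in which each relation is reduced to its name and its parameter set; the reaction names and molecule names are invented for illustration.

```python
from typing import Dict, FrozenSet, Set

# param(chi) for a toy system Xi; every name here is hypothetical.
PARAMS: Dict[str, FrozenSet[str]] = {
    "translation_of_mRNA_x": frozenset({"ribosome"}),
    "ethanol_dehydrogenation": frozenset({"ADH1", "ADH2"}),
    "propanol_dehydrogenation": frozenset({"ADH1", "ADH2"}),
    "prohormone_cleavage": frozenset({"convertase"}),
}


def fct(z: str) -> Set[str]:
    """Fct(z): all I/O-relations that list z among their parameters."""
    return {chi for chi, params in PARAMS.items() if z in params}


def functionally_equivalent(p: str, q: str) -> bool:
    """Two objects are functionally equivalent if Fct(p) = Fct(q)."""
    return fct(p) == fct(q)


print(fct("mRNA_x"))   # empty: the mRNA is only an argument, never a parameter
print(fct("ADH1"))     # the two dehydrogenation reactions
print(functionally_equivalent("ADH1", "ADH2"))
```

Here the mRNA and the prohormone have an empty function set because they appear only as arguments, whereas the two hypothetical dehydrogenases share the same function set and are therefore functionally equivalent in the sense just defined.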


We are now in a position to define a gene.

Definition 3 A gene on a given genome is the pair (Γ(z),z) consisting of a functional informational molecule z and its genomic footprint Γ(z), i.e., the collection of intervals on the genome that give rise to the encoded letters in the sequence z through a sequence of I/O-relations in Ξ.

In Definition 3, we require that there is at least one sequence of I/O-relations linking Γ(z) to the gene product z. Alternatively, we might want to include the specific sequence of I/O-relations in the definition of the gene. The distinction between these two points of view is whether we require that every gene has a unique way of being processed (one particular sequence of I/O-relations linking Γ(z) to z), or whether we allow that a gene (Γ(z), z) can be expressed in alternative ways. At present, we lack sufficient evidence of alternative transcripts processed into the same functional “gene product” to decide which version is biologically more useful.

Definition 3 of course allows overlapping genes and, in particular, different genes with the same genomic footprint: if the same collection of genomic intervals gives rise to a different product (necessarily via a different processing cascade), we have two distinct genes. Thus, as in the proposal of Scherrer and Jost (2007), we label distinct (functional) splice variants as distinct genes. Similarly, if the same product z can be produced from different genomic footprints, we also speak of two distinct genes; examples are some pairs of paralogous microRNAs with identical mature products (Griffiths-Jones et al. 2008).
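The identity rules that follow from Definition 3 can be made concrete by treating a gene as an immutable pair. In this sketch the type name `Gene`, the coordinates, and the sequences are all hypothetical, and the check that the product is functional (Fct(z) ≠ ∅) is assumed to have been made elsewhere.

```python
from typing import FrozenSet, NamedTuple, Tuple

Interval = Tuple[int, int]  # inclusive genome coordinates


class Gene(NamedTuple):
    footprint: FrozenSet[Interval]  # Gamma(z)
    product: str                    # sequence of the functional molecule z


# Two hypothetical paralogous loci with an identical mature product.
mir_locus_1 = Gene(frozenset({(1200, 1211)}), "UGAGGUAGUAGG")
mir_locus_2 = Gene(frozenset({(98400, 98411)}), "UGAGGUAGUAGG")

# One hypothetical footprint whose product differs after an editing step.
unedited = Gene(frozenset({(500, 505), (700, 705)}), "ACGUACGUACGU")
edited = Gene(frozenset({(500, 505), (700, 705)}), "ACGUACGUACGG")

# Tuple equality compares both components, so all four are distinct genes.
genes = {mir_locus_1, mir_locus_2, unedited, edited}
print(len(genes))
```

Comparing the pair as a whole, rather than the footprint or the product alone, is what yields two genes both for paralogous loci with one product and for one locus with two products.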

Functional similarity and homology

In taking an extreme “functional” point of view, one might want to interpret genes (in the above sense) as “the same” if products are functionally equivalent. As far as we can see, this choice leads to problems with notions of homology. We now discuss the connection of our gene definition with the homology concept.

In order to analyze the notion of a function in detail, we assume that there is a distance function D that allows us to measure how different two I/O-relations χ′ and χ′′ are. In the analysis of chemical reaction networks, distance functions between reactions are typically based on a notion of differences Δ among underlying objects (Maggiora and Shanmugasundaram 2004). Given a measure of dissimilarity of objects, one constructs a dissimilarity for input, output, and parameter lists, which are finally combined into the desired measure D (Tohsato and Nishimura 2007). We can use D to cluster the elements of Fct(z) into distinct functional classes and to construct a measure $\mathcal{D}$ of the functional differences between two objects. The distance D between I/O-relations can be extended naturally to sets of I/O-relations and hence implies a notion of functional distance $\mathcal{D}(p, q) = D(\mathrm{Fct}(p), \mathrm{Fct}(q))$ between objects p and q, which is conceptually related to network distance (Forst et al. 2006).

In the case of informational molecules, we can think of the object distance Δ as an edit or alignment distance that measures sequence dissimilarity. In contrast, the functional distance $\mathcal{D}$ between objects, which can be derived from Δ and the system of I/O-relations Ξ, measures functional dissimilarity. Homology-based gene annotation is based on the observation that Δ and $\mathcal{D}$ are correlated in practice. Thus similar sequences (of functional informational molecules, or of their genomic footprints) often, but not always, give rise to products with similar functions. Deviations from this rule exist in nature and indicate either that closely related sequences have acquired large changes in function (small Δ, large $\mathcal{D}$) or that a gene has been horizontally replaced by a functionally equivalent one (small $\mathcal{D}$, large Δ). An example of the first type is the imprinting-related ncRNA Xist in Mammalia, which originated by pseudogenization of the lnx3 transcript whose primary product is a PDZ-like ring finger protein (Duret et al. 2006). Examples of the latter type are several unrelated “clans” of serine proteases that share only the common catalytic triad Ser-His-Asp (Krem and Di Cera 2001).
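The two kinds of deviation can be illustrated with toy numbers. The sketch below makes two assumptions that the text does not prescribe: the sequence distance Δ is taken to be the Levenshtein edit distance, and the functional distance between two objects is taken to be the Jaccard distance between their function sets (the simplest set-level extension, in which two I/O-relations are either identical or not). All sequences and reaction names are invented.

```python
from typing import Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, standing in for the sequence distance Delta."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def functional_distance(fct_p: Set[str], fct_q: Set[str]) -> float:
    """Jaccard distance between two function sets (an assumed choice)."""
    union = fct_p | fct_q
    if not union:
        return 0.0
    return 1.0 - len(fct_p & fct_q) / len(union)


# Small Delta, large functional distance: near-identical sequences, disjoint functions.
delta_close = edit_distance("ACGUACGUAC", "ACGUACGUAG")
func_far = functional_distance({"silencing_of_locus_A"}, {"ubiquitination_of_B"})

# Large Delta, small functional distance: unrelated sequences, same function.
delta_far = edit_distance("MKTLLVAG", "GSWPHQRD")
func_close = functional_distance({"peptide_bond_hydrolysis"},
                                 {"peptide_bond_hydrolysis"})

print(delta_close, func_far)
print(delta_far, func_close)
```

A graded distance D between individual I/O-relations, as described above, would replace the all-or-nothing set comparison used here; the Jaccard form is only the limiting case that needs no further modeling choices.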

The concept of homology is the subject of intensive discussion, with several competing definitions; see, e.g., Laubichler (2000) and Brigandt and Griffiths (2007). In the context of evolutionary and molecular biology, one requires that homologous characters are linked by common descent. In the strict “phylogenetic” definition, this is the only requirement. In our framework, phylogenetic homology of (Γ(x), x) and (Γ(y), y) is naturally established by the existence of a common ancestor of the genomic footprints Γ(x) and Γ(y). In practice, evidence for homology of genes can be evaluated by comparative sequence analysis of Γ(x) and Γ(y), i.e., in terms of Δ, in the same way as is done for protein, RNA, or DNA sequences. The only modification is a more precise recipe for delimiting the sequences that need to be compared.

The strictly functional notion that identifies functionally equivalent genes runs into an insurmountable problem because there is nothing to prevent it from identifying objects that do not share common descent. This would lead to a gene concept that is not compatible with (phylogenetic) homology.

Our framework also provides a starting point for formalizing notions of homology that postulate functional similarities in addition to common descent, e.g., via a suitable concept of homology for I/O-relations. We view this as a research agenda beyond the scope of this contribution.


In this contribution, we have introduced a formal framework that satisfies several of the intuitive requirements of a gene definition.

  • We have emphasized a functional/computational notion that remains instantiated and identifiable in genomic material. We argue that both properties are necessary: purely structural gene definitions are useless at best and harmful at worst for annotation purposes because they tend to blur the information provided by transcriptomics and proteomics data. On the other hand, modern Molecular Biology can only work with a gene concept that is firmly rooted in sequence information and therefore annotatable at a genomic level.
  • The gene concept includes genomic heritability and hence can be used to establish homology relationships over large phylogenetic distances.
  • As far as we can tell, the concept is consistent with a fine-grained system of homology concepts that distinguish between sequence homology (of the genomic footprint), homology of the gene (in terms of both sequence and function), and concepts that also include homologies between intermediate processing products. This will be valuable when extending this approach to, e.g., Developmental Biology.
  • Our framework suggests a definition of the phenotype as the collection of functions “performed” by an organism, Φ = {Fct(z)}. This phenotype is in turn evaluated by a complex fitness function f(Φ) to determine the viability and selective properties of the phenotype Φ. This provides compatibility with operational models of evolution such as Population Genetics.
  • Our construction is consistent with the concept of genetic engineering, which is based on the assumption that genomic changes can be transferred (most of the time) in an unambiguous way into gene products.
  • There is a relatively simple relationship between “classical genes” and our concept: for instance, the genomic CDSs of functional proteins, as well as the genomic loci of mature microRNAs, tRNAs, rRNAs, are all by default (genomic footprints of) genes in our sense.
  • Pragmatically, the currently available methods of data analysis, e.g., in comparative genomics, just need to be applied somewhat more carefully to selected data.
  • This proposal does not require the introduction of an army of auxiliary concepts unlikely to be adopted by practicing biologists.

This approach does require, however, that we forfeit properties that might be desirable in certain cases. For instance, the genomic footprints of genes are in general proper subsets of the footprints of transcripts, which describe a more inclusive functional set. We reject, however, the notion that a gene comprises all regions of DNA required to produce a function. The reason for doing this is to keep the “sphere of influence” of a gene limited. In an all-inclusive view that includes everything necessary to unfold a given product, large parts of the cellular machinery (and their DNA loci) would become constituents of all genes. We find such an approach untenable because it does not lead to “genes” upon which one can perform meaningful, discriminatory experiments.


Acknowledgments

This work originated in the aftermath of the Workgroup on the Complexity of the Gene Concept, which took place at the Santa Fe Institute 8–11 March 2009 thanks to funding by the James S. McDonnell Foundation in Robustness. We are grateful for the intensive discussions on the topic with all the participants of this workshop. Sven Findeiß’ comments on the manuscript are gratefully acknowledged. This work was funded in part by a grant from the Volkswagen Foundation on “Evolution of networks: robustness, complexity and adaptability” to PFS, and a grant on Innovation in Biological and Technological Systems from the David and Lucile Packard Foundation to DCK.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Contributor Information

Peter F. Stadler, studla@bioinf.uni-leipzig.de.

Sonja J. Prohaska, sonja@bioinf.uni-leipzig.de.

Christian V. Forst, christian.forst@utsouthwestern.edu.

David C. Krakauer, krakauer@santafe.edu.


References

  • Berry G, Boudol G (1992) The chemical abstract machine. Theor Comput Sci 96:217–248

  • Brigandt I, Griffiths P (2007) The importance of homology for biology and philosophy. Biol Philos 22:633–641

  • Dicou E (2008) Biologically active, non membrane-anchored precursors: an overview. FEBS J 275:1960–1975

  • Duret L, Chureau C, Samain S, Weissenbach J, Avner P (2006) The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312:1653–1655

  • Forst CV, Flamm C, Hofacker IL, Stadler PF (2006) Algebraic comparison of metabolic networks, phylogenetic inference, and metabolic innovation. BMC Bioinf 7:67

  • Furey TS (2006) Comparison of human (and other) genome browsers. Hum Genomics 2:266–270

  • Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17:669–681

  • Gingeras TR (2007) Origin of phenotypes: genes and transcripts. Genome Res 17:682–690

  • Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36:D154–D158

  • Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ (2008) The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 36:D773–D779

  • Krem MM, Di Cera E (2001) Molecular markers of serine protease evolution. EMBO J 20:3036–3045

  • Laubichler MD (2000) Homology in development and the development of the homology concept. Int Comp Biol 40:777–788

  • Maggiora GM, Shanmugasundaram V (2004) Molecular similarity measures. In: Bajorath J (ed) Chemoinformatics: concepts, methods, and tools for drug discovery. Methods in molecular biology, vol 275. Springer, Heidelberg, pp 1–50

  • NC-ICBMB, Webb EC (ed) (1992) Enzyme Nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Academic Press, San Diego

  • Palsson BO (2006) Systems biology. Cambridge University Press, Cambridge

  • Pearson H (2006) Genetics: what is a gene? Nature 441:398–401

  • Pennisi E (2007) DNA study forces rethink of what it means to be a gene. Science 316:1556–1557

  • Prohaska SJ, Stadler PF (2008) Genes. Theor Biosci 127:215–221

  • Scherrer K, Jost J (2007) The gene and the genon concept: a conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theor Biosci 126:65–113

  • Taylor RG (1998) Models of computation and formal languages. Oxford University Press, New York

  • Tohsato Y, Nishimura Y (2007) Metabolic pathway alignment based on similarity between chemical structures. IPSJ Digit Courier 3:736–745

  • Walton JD (2006) HC-toxin. Phytochemistry 67:1406–1413
