Darwin’s fundamental insight into the nature of evolution has been one of the cornerstones of modern biology. The diversity of species and individuals in the fossil record shows the evolution of complex features, not once, but many times, demonstrating that evolution is a powerful, creative force. Starting in the late 1950s, computer scientists began using the concepts of evolution and natural selection to find creative solutions for computing problems (Fraser 1957, Barricelli 1957, Friedberg 1958, Bremermann 1962). The key idea of these algorithms was the creation of a population of candidate solutions, together with some method of varying the composition of the individuals in the population based on a measure of fitness, that is, how well each individual performs on a particular task. Newly created, more fit individuals would replace less fit individuals, so that over time the overall fitness of the population would increase.
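The shared loop of these early systems (create a population, score it by fitness, replace the less fit) can be sketched in a few lines of Python. This is purely illustrative: the toy task, fitness function, and parameter values below are invented for the example and do not come from any of the cited systems.

```python
import random

def evolve(fitness, mutate, random_solution, pop_size=20, steps=2000):
    """Minimal evolutionary loop: mutate a random parent, keep the child
    if it beats the current worst individual, return the best found."""
    population = [random_solution() for _ in range(pop_size)]
    for _ in range(steps):
        child = mutate(random.choice(population))
        worst = min(population, key=fitness)
        if fitness(child) > fitness(worst):
            population[population.index(worst)] = child
    return max(population, key=fitness)

# Toy task: evolve an integer toward the target value 42
random.seed(0)
best = evolve(
    fitness=lambda x: -abs(x - 42),               # closer to 42 is fitter
    mutate=lambda x: x + random.choice([-1, 1]),  # small incremental change
    random_solution=lambda: random.randint(0, 100),
)
```

Even with such a crude scheme, the population's best member drifts steadily toward the target, which is the essential behavior the early pioneers observed.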
The genetic algorithm, in which a population of individuals is encoded as strings of bits representing parameter values, was introduced in (Holland 1962) and has since been used in many different disciplines, including engineering design (Selig 1996), economic modeling (Arifovic 1994), and genetics (Hill 2005). A key component of genetic algorithms is the emphasis on crossover, or recombination, of individual solutions in a population rather than on small, incremental mutations. Recombination takes the bit strings of individuals in a population and swaps large pieces of the bit strings that make up the “genetic material” with those of other individuals to create offspring solutions. While this was not a new idea, the emphasis on crossover as a key component of the process, the representation of complex structures as bit strings, and Holland’s development of a theory describing the effect of crossover put genetic algorithms at the forefront of evolutionary computing.
Shortly thereafter, a new effort to “evolve” computer programs began. In essence, this was an effort to create computer programs whose outputs were themselves computer programs that solved a problem. This contrasts with earlier evolutionary algorithms, which mainly focused on optimizing numerical solutions by finding the parameter values of computational models that produced the best results. So, while a genetic algorithm might be used to find the optimal set of pressures and mix of reactants in a chemical reaction, these new algorithms would try to find programs that could create computer models of chemical reactions learned from a set of inputs and the resultant products. The fitness of such a program would be measured by how accurately it predicted the resulting products across a number of starting conditions.
Rechenberg (1965) created evolution strategies, which started with a population of computer programs, altered or “mutated” individual programs in the population, and tested whether the mutated program was better than its parent using a fitness measure. If it was more fit, the mutated program replaced its parent in the population. Fogel (1966) described a system of evolutionary programming that mutated individuals within a population of programs and put them into competition with the other members of the population, not just their parents. By the 1980s, more complex methods for evolving computer programs had been proposed, including those of Forsyth (1981) and Cramer (1985). The most widely used of these systems, proposed by John Koza (1989), is called genetic programming. Genetic programming (GP) uses a fixed representation for programs and, based on this representation, a recombination operation that can combine programs to create offspring programs. This followed the pattern established in genetic algorithms and meant that offspring programs could combine useful features of their parents. In this case, the “building blocks” exchanged during crossover are functional pieces of the programs. The flexibility of the representation proposed by Koza has made GP the most common form used to evolve computer programs.
Figure 1 shows the representation used by GP for simple programs and the method of recombining programs to produce offspring programs. In 1a, the simple function A/B > C is a Boolean function that produces a TRUE/FALSE result when given input values for the variables A, B, and C. It is represented as a tree structure in which individual variables, or “terminals,” are combined with mathematical functions. This is a standard representation used internally by computer compilers to transform high-level languages like C or C++ into the machine code of executable programs.
Figure 1 Genetic Programming process. Programs are represented by tree structures in A. B shows the process of evolution starting from input data until output rules are produced. After rules are produced they are reviewed for biological insights and relationships.
The figure also shows how two such programs may be combined to produce offspring programs during crossover. Here the program (A*B) > (C+D) is combined with the program G < (E/F) by swapping parts of the program trees to produce the offspring programs (A*B) > (E/F) and G < (C+D). While not shown, mutation is achieved by stochastically selecting a portion of the tree and randomly replacing it. For example, mutation might change (E/F) in one of the offspring programs to (H–I) after the crossover is completed. It is important to note that both crossover and mutation are stochastic operations, applied with probabilities set by the control parameters of the GP process. Typical rates are 10%–50% for crossover and 1%–5% for mutation.
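Using nested lists as a stand-in for the program trees, the crossover and mutation just described can be sketched as follows. Koza's original system used LISP S-expressions; the representation and function names here are a simplified analogue for illustration only, with the crossover and mutation points passed in explicitly rather than chosen stochastically.

```python
import copy

# A tree is [operator, left_subtree, right_subtree]; leaves are variables
parent_1 = ['>', ['*', 'A', 'B'], ['+', 'C', 'D']]   # (A*B) > (C+D)
parent_2 = ['<', 'G', ['/', 'E', 'F']]               # G < (E/F)

def subtree_crossover(tree_a, tree_b, i, j):
    """Swap subtree i of tree_a with subtree j of tree_b (on copies).
    A full GP system would choose the crossover points stochastically."""
    child_a, child_b = copy.deepcopy(tree_a), copy.deepcopy(tree_b)
    child_a[i], child_b[j] = child_b[j], child_a[i]
    return child_a, child_b

def point_mutation(tree, i, new_subtree):
    """Replace subtree i with a (normally randomly generated) subtree."""
    child = copy.deepcopy(tree)
    child[i] = new_subtree
    return child

offspring_1, offspring_2 = subtree_crossover(parent_1, parent_2, 2, 2)
# offspring_1 is (A*B) > (E/F); offspring_2 is G < (C+D)
mutant = point_mutation(offspring_1, 2, ['-', 'H', 'I'])  # (A*B) > (H-I)
```

The swapped pieces are complete, well-formed subtrees, which is what keeps every offspring a syntactically valid program.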
Because GP usually operates in a supervised learning mode, a training set of data with known outcomes is necessary to test the fitness of each program. For example, microarray expression data from tumor tissues, together with patient outcome information, could be used to develop a cancer prognostic program that predicts which patients are likely to have a post-operative recurrence. The GP system would “learn” by example, using the evolutionary process to search for a combination of mathematical functions that can model recurrence features.
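A fitness measure for such a supervised setting can be as simple as prediction accuracy over the training set. The sketch below is hypothetical; the marker name and toy data are invented for illustration.

```python
def accuracy_fitness(program, training_set):
    """Fitness = fraction of training cases the program predicts correctly.
    `program` is any callable mapping a feature dict to True/False."""
    correct = sum(program(features) == outcome
                  for features, outcome in training_set)
    return correct / len(training_set)

# Toy training set: (expression features, recurrence outcome) pairs
training_set = [({'marker': 0.9}, True), ({'marker': 0.2}, False),
                ({'marker': 0.7}, True), ({'marker': 0.1}, False)]

# A candidate "program" stands in for an evolved GP tree
score = accuracy_fitness(lambda f: f['marker'] > 0.5, training_set)
```

A program that classifies every training case correctly scores 1.0; GP's search amounts to seeking program trees that maximize this score.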
Figure 1b is a flowchart that shows how the evolutionary process works. The initial step is to create a population of program trees that are randomly assembled from the available inputs in an unbiased way. This allows a broad sampling of possible inputs and begins the search for productive combinations. Though the initial population will probably perform poorly overall, by random chance some individuals will perform better than others because of the features selected or the way they are combined mathematically.
Once the initial population is created, a small number of programs are randomly selected from the population and applied to the training data, producing outputs that can be compared with the true results. In the prognostic example above, this small sub-population of selected programs makes a prediction about each patient in the training set as to whether he or she will have a recurrence within the bounding timeframe. The results of these predictions are used to calculate a measure of effectiveness, or fitness, for each individual in this group. The fitness measure is used to select the two best individuals from this sub-population, which produce new “offspring” programs by applying the crossover operation, with a chance of mutation, as described above. The generated offspring programs then replace the two least fit individuals in the population. Each of these operations has a separate probability of being applied to the mating pair after selection, which creates a stochastic element in the evolutionary process.
This cycle of sub-group selection, fitness testing, recombination, mutation and replacement is repeated until a halting condition is reached. Examples of such criteria are a pre-determined level of prediction accuracy or a number of replacements equal to the number of individuals in the main population (a “generation”). While the initial, randomly created programs of the population perform very poorly, the internal cycle of evolution quickly improves the fitness of the programs in the population, and usually focuses on a small combination of input features that are most useful in solving the problem.
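Putting the pieces together, this cycle can be sketched as a steady-state loop. The crossover and mutation rates below follow the typical ranges given earlier; everything else, including the toy problem where integers stand in for program trees, is illustrative.

```python
import random

def steady_state_evolve(population, fitness, crossover, mutate,
                        group_size=5, crossover_rate=0.5, mutation_rate=0.05,
                        target_fitness=1.0, max_replacements=10_000):
    """Repeatedly select a random sub-group, breed its two fittest members,
    and replace the two least fit individuals until a halting condition."""
    replacements = 0
    while replacements < max_replacements:
        # 1. Random sub-group, ranked by fitness on the training data
        group = sorted(random.sample(population, group_size),
                       key=fitness, reverse=True)
        parent_a, parent_b = group[0], group[1]
        # 2. Stochastically apply crossover and mutation
        child_a, child_b = parent_a, parent_b
        if random.random() < crossover_rate:
            child_a, child_b = crossover(parent_a, parent_b)
        if random.random() < mutation_rate:
            child_a = mutate(child_a)
        # 3. Replace the two least fit individuals
        population.sort(key=fitness)
        population[0], population[1] = child_a, child_b
        replacements += 2
        # 4. Halt early if a sufficiently fit individual appears
        if max(fitness(child_a), fitness(child_b)) >= target_fitness:
            break
    return max(population, key=fitness)

# Toy demonstration: fitness rewards closeness to 42
random.seed(1)
population = [random.randint(0, 100) for _ in range(20)]
best = steady_state_evolve(
    population,
    fitness=lambda x: 1.0 - abs(x - 42) / 100,
    crossover=lambda a, b: ((a + b) // 2, (a + b + 1) // 2),
    mutate=lambda x: x + random.choice([-1, 1]),
)
```

Note that only two individuals change per cycle; this steady-state style of replacement contrasts with generational schemes that rebuild the whole population at once.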
Almal et al (2008) describe a surprising similarity in behavior between genetic programming and natural evolution despite the significant differences in genetic representation and evolutionary mechanisms. This hints at the power of the basic, underlying mechanisms of evolution, regardless of the medium in which it operates.
From an analytic point of view, some of the desirable features of genetic programming include:
- The ability to control the bias incorporated into feature selection – this can be varied from a completely unbiased approach in feature selection to a partial bias that uses prior knowledge, or even a paradigm that imposes complete constraint on feature selection.
- The ability to integrate diverse data to produce mixed models, such as models integrating demographic, clinical and molecular data.
- The ability to create human-readable results, in contrast to “black box” solutions.
- The ability to create solutions whose form is not constrained; it is equally likely to create linear or non-linear models from an analysis.
While genetic programming is not the only analytic approach to have many of these features, it is almost unique in providing all of them in a single system. Because of these features, genetic programming has a rich history of applications. It has been used for novel design, including patented antenna designs (Lohn et al 2004), patented analog electronic circuits (Koza et al 2004), and small molecule design (Nachbar 2000). It is also being used commercially to characterize dynamic processes such as chemical processes (Hinchliffe and Willis 2003), financial markets (Becker et al 2006) and image processing (Zhang 2007).
Given that genetic programming is one among many possible options for machine learning and data analysis, one must ask where it fits among the possible choices for analyzing data. One way to view this is suggested by Vladislavleva (2008), who points out that the choice of analytic tool is driven in part by one’s knowledge about the likely solution. In this assessment, there are usually two main issues: whether the key factors in the solution are known, and whether the structure of the solution is known. For example, if the key factors of a problem are known and it is clear that a linear solution is likely to work, then linear regression is probably the best choice: it is computationally inexpensive, reliable and interpretable. Similarly, if a non-linear solution is likely but the factors are known, then non-linear regression analysis is probably the best choice.
However, if the key features are known but large in number, or the form of the solution is unknown, then machine learning approaches such as artificial neural networks, support vector machines (SVMs), or fuzzy rule systems would be a better fit. This is because such algorithms can flexibly assemble a model from a fairly limited and known set of variables, “learning” by testable example, such as how well they predict an outcome. If the key features are not known, and the number of features is too large to provably identify the key features, then these techniques will usually require a combined approach, such as selecting likely features with univariate or multivariate statistical analysis, or a hybrid approach such as combining a genetic algorithm with a neural network to find and build a usable model (Yao 1999).
Genetic programming has the advantage of extreme flexibility, but at a high computational cost. It can model systems where the key features are not known (though there are upper limits on the number of features it can sift through (Moore 2006
)) and can model systems where the structure and complexity of the desired models are not known (Kordon 2002
). With GP, you can use whole-genome chip data sets as inputs and allow the system to search for models from a range of possible model structures.
However, it is important to realize that while the selection pressure driving the evolutionary process tends to produce better and better solutions, a solution can never be said to be optimal unless it clearly meets all desired criteria. This is because the flexibility that allows GP to search a very large space of possible solutions cannot guarantee that it has found the best one. The algorithm cannot make an exhaustive search of all possible programs of more than trivial size, so its stochastic nature samples the space while fitness and selection pressure focus the search on preferred areas of that space. Consequently, we can seldom guarantee that there is not a better solution “just around the corner”; we merely have a solution that is “good enough” for our purposes. GP should therefore be viewed not as an optimization algorithm, but as a powerful search algorithm that can discover novel and useful solutions to problems.
However, GP has been used as a way to discover both the key features and the structure of a model that is a good approximation of a first-principles model. This insight allows first-principles modelers to arrive at a solution more quickly than would otherwise be possible. Kordon (2002) estimates that using GP to develop an approximate model from empirical data reduces the time it takes to develop a first-principles model for chemical processes from a year to 3–4 months, a significant saving of time and resources. In the context of biology, such models may suggest interactions that can be used in a “systems biology” approach to create better models of disease or disease processes.