Homology is the relationship between two biological entities - for example, two sequences or two anatomical characters - that are derived from a common ancestor. In 1970, Walter Fitch introduced the concepts of orthology and paralogy to distinguish two types of homology relationship between biological sequences [1]. Orthologous sequences are those that derive from their common ancestor by a speciation event, whereas the origin of paralogous sequences can be traced back to a gene-duplication event. Despite this clear definition, orthology and paralogy are often misinterpreted by biologists. This is partly because what seems simple when comparing pairs of closely related species easily becomes complicated when wider groups of distantly related species are involved. It is sometimes wrongly claimed, for example, that only two sequences from the same species can be regarded as paralogs, or that two sequences from different species are orthologous to each other only if they perform the same biological function. I will briefly summarize here the main misunderstandings that can arise when dealing with properties of orthologous sequences (see [7] for a more thorough discussion), which are key to understanding why some of the methods discussed later are more appropriate than others.
The first clarification is that orthology is a purely evolutionary concept, certainly related to, but not based on, the functionality of the sequences involved. All homologous proteins have a common ancestry and thus are expected to have similar three-dimensional structures and to perform related functions. But changes in functionality within a homologous family of proteins caused by sequence variation or context-dependency are not rare [10]. This is especially true in the case of paralogs, because processes of neo- or subfunctionalization may favor the retention of duplicate genes [11]. Orthologous sequences derived by speciation are, therefore, less prone to functional shifts but are definitely not free from them.
A second important point to note is that the orthology or paralogy relationship between two genes will extend to their descendants as they disperse by further speciation or duplication events. Thus, groups of orthologs, and not just pairs, may more adequately represent the ancestral relationships of the genes in a set of organisms. An important corollary of its definition is that orthology, in contrast to homology, is not transitive. If a gene A is orthologous to B and B to C, A and C are not necessarily orthologous to each other. For instance, if A and C are related by a duplication event, they will be paralogous to each other while both being co-orthologous to B. This is best explained with a graphical example (Figure 1). The human tumor suppressor protein p53 belongs to a wider family of proteins that also includes p73 and p73L. The tree shown in Figure 1 depicts the evolutionary relationships among several metazoan members of the family, ranging from insects to mammals. As can be inferred from the tree, several duplications (nodes marked with gray circles) occurred at different periods. Most significantly, two consecutive duplications at the base of the vertebrates gave rise to three sister groups (shadowed regions in the tree) that correspond to the p53, p73 and p73L subfamilies. Human p53 can be considered orthologous to the sequences in other vertebrates that cluster within the same shadowed region, because they all derive from one another by speciation events. Paralogous relationships can be drawn between human p53 and human p73 and p73L, because their common ancestral node always corresponds to a duplication node. The same reasoning can be used to infer paralogous relationships between any sequence within the p53 subfamily and those in the p73 and p73L subfamilies, even though they might not be encoded in the same genome, such as human p53 and mouse p73L. The only criterion for marking them as paralogs is that they derived from the duplication of an ancestral gene.
Human p53 is also orthologous to either of the two Ciona intestinalis sequences, because they diverged at a speciation node (marked with an arrow). Note that this is the only node that matters in defining their orthology relationship; the fact that both lineages experienced duplication events after that speciation is not considered. These later duplication events are, however, important for identifying other proteins at the same orthology level. In fact, human p53, p73 and p73L are all orthologous to either of the sequences in C. intestinalis because they diverged at the same speciation node. To accurately define the orthology relationships between human and C. intestinalis members of this family one should say that human p53, p73L and p73 are all co-orthologous to the two C. intestinalis proteins.
Figure 1. p53 phylogeny. Phylogenetic tree representing the evolutionary relationships among p53 and related proteins. Sequences were obtained from the p53 tree at phylomeDB (entry code Hsa0012331). After selecting a group of representative sequences, a maximum ...
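The reasoning applied to the p53 tree can be sketched in a few lines of Python: two sequences are called orthologs if their last common ancestral node is a speciation, and paralogs if it is a duplication. This is only an illustrative sketch, not the procedure of any particular tool. The `Node` class, the `relation` function, the toy topology and the sequence labels (including `Cin_a` and `Cin_b` for the two C. intestinalis sequences) are hypothetical, and the branching order among the p53, p73 and p73L subfamilies is assumed, since the text only states that two consecutive duplications produced them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    event: str                  # "S" = speciation, "D" = duplication, "leaf" = sequence
    name: str = ""
    children: list = field(default_factory=list)


def leaf(name):
    return Node("leaf", name)


def leaf_names(node):
    """Return the set of sequence names found below (or at) a node."""
    if node.event == "leaf":
        return {node.name}
    names = set()
    for child in node.children:
        names |= leaf_names(child)
    return names


def relation(root, a, b):
    """Classify two sequences by the event at their last common ancestor."""
    if a == b or not {a, b} <= leaf_names(root):
        raise ValueError("need two distinct sequences present in the tree")
    node = root
    while True:
        for child in node.children:
            if {a, b} <= leaf_names(child):
                node = child        # both sequences are still together: descend
                break
        else:
            # No child holds both sequences, so this node is their last common ancestor.
            return "orthologs" if node.event == "S" else "paralogs"


# Toy tree (assumed topology): a speciation separates C. intestinalis from the
# vertebrates; each lineage then undergoes its own duplication(s).
tree = Node("S", children=[
    Node("D", children=[leaf("Cin_a"), leaf("Cin_b")]),
    Node("D", children=[
        Node("S", children=[leaf("human_p53"), leaf("mouse_p53")]),
        Node("D", children=[
            Node("S", children=[leaf("human_p73"), leaf("mouse_p73")]),
            Node("S", children=[leaf("human_p73L"), leaf("mouse_p73L")]),
        ]),
    ]),
])

print(relation(tree, "human_p53", "mouse_p53"))   # orthologs
print(relation(tree, "human_p53", "mouse_p73L"))  # paralogs, despite different species
print(relation(tree, "human_p53", "Cin_a"))       # orthologs
print(relation(tree, "human_p73", "Cin_a"))       # orthologs
print(relation(tree, "human_p53", "human_p73"))   # paralogs
```

The last three calls illustrate non-transitivity: human p53 and human p73 are each orthologous to `Cin_a`, yet paralogous to each other, which is why they are better described as co-orthologs of the C. intestinalis sequences.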
Yet another complication in defining orthology relationships among proteins is that they often comprise distinct domains that may have followed different evolutionary histories [12]. Such evolutionary chimeras can be created by fusion and recombination events between different genes and may lead to situations in which, for example, a single member of a given protein family has recently acquired a new domain through recombination with another family. In such cases the different domains should, in principle, be treated as independent evolutionary units and orthology relationships be delineated accordingly. Thus, in multidomain families, orthology relationships should be first established among core domains and then extended, where possible, to adjacent regions.