PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of narLink to Publisher's site
 
Nucleic Acids Res. 2013 February; 41(4): 2105–2120.
Published online 2013 January 8. doi:  10.1093/nar/gks1456
PMCID: PMC3575792

Functional characterization of motif sequences under purifying selection

Abstract

Diverse life forms are driven by the evolution of gene regulatory programs including changes in regulator proteins and cis-regulatory elements. Alterations of cis-regulatory elements are likely to dominate the evolution of the gene regulatory networks, as they are subjected to smaller selective constraints compared with proteins and hence may evolve quickly to adapt the environment. Prior studies on cis-regulatory element evolution focus primarily on sequence substitutions of known transcription factor-binding motifs. However, evolutionary models for the dynamics of motif occurrence are relatively rare, and comprehensive characterization of the evolution of all possible motif sequences has not been pursued. In the present study, we propose an algorithm to estimate the strength of purifying selection of a motif sequence based on an evolutionary model capturing the birth and death of motif occurrences on promoters. We term this measure as the ‘evolutionary retention coefficient’, as it is related yet distinct from the canonical definition of selection coefficient in population genetics. Using this algorithm, we estimate and report the evolutionary retention coefficients of all possible 10-nucleotide sequences from the aligned promoter sequences of 27 748. orthologous gene families in 34 mammalian species. Intriguingly, the evolutionary retention coefficients of motifs are intimately associated with their functional relevance. Top-ranking motifs (sorted by evolutionary retention coefficients) are significantly enriched with transcription factor-binding sequences according to the curated knowledge from the TRANSFAC database and the ChIP-seq data generated from the ENCODE Consortium. Moreover, genes harbouring high-scoring motifs on their promoters retain significantly coherent expression profiles, and those genes are over-represented in the functional classes involved in gene regulation. The validation results reveal the dependencies between natural selection and functions of cis-regulatory elements and shed light on the evolution of gene regulatory networks.

INTRODUCTION

Diverse life forms are largely driven by conservation and variations of the gene regulatory circuits. Recent progress in high-throughput technologies such as next-generation sequencing platforms and DNA microarrays enables biologists to map the regulatory networks and investigate their evolution across multiple species. For instance, studies in evolutionary developmental biology (EvoDevo) compared the gene regulatory networks for animal development and discovered conserved cores responsible for body plan formation and variable modules modifying species-specific phenotypes such as the shapes of limbs or wings [e.g., (1,2)].

One remarkable feature from the gene regulatory networks of multiple species is the conservation of their constituent proteins (3). Most proteins possess multiple functions (pleiotropic), hence are subjected to tight selective constraints. Alterations on protein sequences (e.g., changes on the DNA-binding domain of a transcription factor) may affect many partners (e.g., changes on the bindings of all targets of a transcription factor), thus are likely to be deleterious. In contrast, alterations on cis-regulatory elements have local effects and thus enable the systems to evolve in an incremental fashion. Consequently, evolution of non protein-coding regions in general and cis-regulatory elements in particular plays a critical role in the evolution of the gene regulatory systems.

Early studies of cis-regulatory element evolution focus on identification of conserved transcription factor-binding motifs (4) and detection of conserved regions on gene promoters (5). Sequence conservation alone, however, does not suffice to account for the evolution of gene regulatory systems. Comparison of known cis-regulatory elements on closely related species indicates high rates of turnover and divergence (6–10). These changes may yield gains and losses of cis-regulatory elements (19), modify the regulatory programs (1,2) or are accompanied by compensatory mutations to maintain stable regulatory programs (11,12).

Like protein-coding regions, evolution of cis-regulatory elements is driven by a variety of mechanisms including sequence substitutions (18), gene duplications (13), tandem repeat insertions and deletions (14). Cis-regulatory elements are added or deleted on the promoters/enhancers according to these mechanisms. Ideally, a complete model for the evolution of cis-regulatory elements should be based on the models of all individual mechanisms for molecular evolution. In practice, mechanisms other than sequence substitutions are hard to formalize. Consequently, the majority of quantitative models for cis-regulatory element evolution are derived from sequence substitution processes. Several studies use simulations to examine the effects of sequence mutations on the rates for cis-regulatory element evolution [e.g., (15,16)]. Others start with sequence substitution models in population genetics and attempt to identify the cis-regulatory elements under selection [e.g., (17–19)]. Despite the fruitful outcomes generated from these studies, they suffer from two major limitations. First, they focus primarily on the deviation of observed sequences from a known regulatory element (e.g., a transcription factor-binding motif) rather than the changes of regulatory element occurrence on promoters. Alterations on motif counts can be more critical for gene regulation than specific sequence variations, as the former modulate the number of transcription factors bound on promoters. Second, all the current studies only examine a collection of known transcription factor-binding motifs. Complete characterization of the evolution of all possible motif sequences of a fixed length is lacking. This characterization, however, is critical for discovering new regulatory elements and comprehending their evolution on genomes.

Previously, we proposed an evolutionary model and an algorithm to quantify the strength of natural selection of a motif sequence (20). The evolution of motif occurrence was formulated as a birth–death process, whereas the rates of motif additions and deletions were derived from substitutions of their constituent sequences. The evolutionary retention coefficient of a motif was defined as a penalty to slow down motif death, and the evolutionary retention coefficient value maximizing the log likelihood of the data was estimated. In the present study, we extend this model and evaluate the evolutionary retention coefficients of all the 410 = 1 048 576 10-nucleotide sequences on the promoters of 27 748 orthologous gene families from 34 mammalian species. Intriguingly, evolutionary retention coefficients of the 10-mer sequences are significantly associated with the tendency of transcription factor-binding events and expression coherence of the genes harbouring the motifs. By examining the annotations of the top-ranking motifs, we find many of them match the GC-rich binding sequences of the transcription factors. Furthermore, genes harbouring the top-ranking motifs are highly enriched with the processes of transcriptional regulation. The results provide a comprehensive picture of the evolution of cis-regulatory elements.

MATERIALS AND METHODS

Data sources

Aligned 5 kb upstream sequences of 27 748 orthologous gene families from 34 mammalian species were extracted from the UCSC Genome Browser (21). Supplementary Table S1 and Figure S1 report the names and the phylogenetic tree of the selected species.

To validate the functional relevance of the high-scoring motifs, we downloaded external datasets from the following sources: the consensus motifs of transcription factor-binding sequences from the TRANSFAC database (22), 407 ChIP-seq data files from the ENCODE database (23), DNA microarray data of human and mouse tissue gene expressions (24) and RNA-seq data of human tissue gene expressions (25), the annotations and member genes of 3201 Gene Ontology (GO) categories (27) and pathway information from three databases (28–30).

Quantifying the strength of natural selection of motif sequences

We define a motif as a collection of sequences with the same length. Over time motifs are created, annihilated or maintained in a specified region (e.g., a gene promoter) by sequence substitutions of the constituting nucleotides. Motifs undergoing purifying selection would possess slower rates of annihilation than those without selective constraints. Accordingly, we quantify the strength of natural selection of a motif by comparing the empirical distribution of its occurrences over multiple species with the one generated by a neutral evolutionary model. The evolutionary model of motif occurrences and the algorithm of evaluating the evolutionary retention coefficients of motifs are described below.

A Poisson process model of sequence substitution

We adopt the simplest model of sequence substitution assuming all nucleotides at all positions and across all lineages transition with an equal rate (31). In an infinitesimal time interval dt, the nucleotide sequence of a position transitions to another base with probability λdt. An external file that holds a picture, illustration, etc.
Object name is gks1456i1.jpg denotes the cumulative number of sequence changes at time t. The transitions within the time interval [t, t + dt] is as follows

equation image
(1)

and An external file that holds a picture, illustration, etc.
Object name is gks1456i2.jpg has a Poisson distribution

equation image
(2)

The maximum likelihood estimate of An external file that holds a picture, illustration, etc.
Object name is gks1456i3.jpg is simply the total number of sequence changes divided by the total length of the time interval considered. In this work we estimated An external file that holds a picture, illustration, etc.
Object name is gks1456i4.jpg from the aligned 5 kb promoter sequences of the 27 748 gene families over the 34 mammalian species. For each position of the aligned promoters in each gene family, we observed the sequences in the terminal nodes (the 34 extant species) of the species tree and inferred the sequences in the internal nodes (ancestral species) by a dynamic programming algorithm (32). We then counted the total number of sequence changes along all branches of the species tree for all positions and all gene families and the total lengths of the time intervals, and calculated An external file that holds a picture, illustration, etc.
Object name is gks1456i5.jpg accordingly. From the empirical data, An external file that holds a picture, illustration, etc.
Object name is gks1456i6.jpg = 0.8371.

A birth–death model for the evolution of motif occurrences

A motif An external file that holds a picture, illustration, etc.
Object name is gks1456i7.jpg is defined as a collection of nucleotide sequences of fixed length An external file that holds a picture, illustration, etc.
Object name is gks1456i8.jpg. We first consider the sequence evolution of An external file that holds a picture, illustration, etc.
Object name is gks1456i9.jpg consecutive positions. There are An external file that holds a picture, illustration, etc.
Object name is gks1456i10.jpg possible sequences that can occur in this An external file that holds a picture, illustration, etc.
Object name is gks1456i11.jpg-mer window, and each sequence An external file that holds a picture, illustration, etc.
Object name is gks1456i12.jpg can be labelled as either a member of the motif An external file that holds a picture, illustration, etc.
Object name is gks1456i13.jpg or not An external file that holds a picture, illustration, etc.
Object name is gks1456i14.jpg. These sequences comprise an undirected graph G = (V, E), where a node An external file that holds a picture, illustration, etc.
Object name is gks1456i15.jpg denotes an An external file that holds a picture, illustration, etc.
Object name is gks1456i16.jpg-mer sequence and an edge e = (v1, v2) denotes a sequence pair v1 and v2 different at one position. The evolution of An external file that holds a picture, illustration, etc.
Object name is gks1456i17.jpg-mer sequences can be viewed as a Markov random walk on G. In an infinitesimal time interval, a sequence can only transition to a neighboring node in G and the rate of transition is An external file that holds a picture, illustration, etc.
Object name is gks1456i18.jpg.

A motif An external file that holds a picture, illustration, etc.
Object name is gks1456i19.jpg constitutes a subset of nodes in G (black nodes in the left diagram of Figure 1), while the remaining nodes are non-motif sequences (white nodes in the left diagram of Figure 1). We are interested in the transition rate from non-motif sequences to motif sequences and vice versa. With a simplifying approximation, we characterize these transitions with two numbers: r01 as the fraction of all non-motif → motif transitions among all transitions from non-motifs, and r10 as the fraction of all motif → non-motif transitions among all transitions from motifs. PA, PC, PG and PT denote the background frequencies of the four nucleotides obtained from all promoters of the 34 species. r01 and r10 are calculated by the following formulas:

equation image
(3)

Where δ(·) is an indicator function and ω(ν1, ν2) is the nucleotide background probability of ν2 at the position where ν1 and ν2 differ. For instance, ω(AGGC, AGTC) = PT. ω(ν1, ν2)’s rescale the weights of transitions according to the frequencies of the destination sequences. For instance, transitions to GC-rich sequences are more likely to occur on mammalian promoters, as they are over-represented in the CpG islands (33).

Figure 1.
Left: A sequence space of fixed length as a graph. A node denotes a sequence, and an edge denotes two sequences differing at one position. Black nodes are members of a motif and white nodes are non-motifs. Dotted edges denote transitions between motifs ...

n(t) denotes the number of motif occurrence at time An external file that holds a picture, illustration, etc.
Object name is gks1456i24.jpg. In an An external file that holds a picture, illustration, etc.
Object name is gks1456i25.jpg-mer window, An external file that holds a picture, illustration, etc.
Object name is gks1456i26.jpg as the sequence is either a motif or not. The transitions of An external file that holds a picture, illustration, etc.
Object name is gks1456i27.jpg hence conform with the following equations

equation image
(4)

We now extend the analysis to the entire promoter of length An external file that holds a picture, illustration, etc.
Object name is gks1456i28.jpg. Suppose motif occurrence at time t is An external file that holds a picture, illustration, etc.
Object name is gks1456i29.jpg and the An external file that holds a picture, illustration, etc.
Object name is gks1456i30.jpg occurring motifs are not overlapped. Each motif instantiation can be annihilated with a rate An external file that holds a picture, illustration, etc.
Object name is gks1456i31.jpg. Hence the ‘death rate’ on the entire promoter is the rate on an An external file that holds a picture, illustration, etc.
Object name is gks1456i32.jpg-mer window multiplied by An external file that holds a picture, illustration, etc.
Object name is gks1456i33.jpg:

equation image
(5)

There are An external file that holds a picture, illustration, etc.
Object name is gks1456i34.jpg positions unoccupied by motif sequences, and the maximum number of (possibly overlapped) An external file that holds a picture, illustration, etc.
Object name is gks1456i35.jpg-mer windows is An external file that holds a picture, illustration, etc.
Object name is gks1456i36.jpg. Each of these An external file that holds a picture, illustration, etc.
Object name is gks1456i37.jpg-mer windows can generate a new motif. Hence the ‘birth rate’ on the entire promoter is approximately the rate on an An external file that holds a picture, illustration, etc.
Object name is gks1456i38.jpg-mer window multiplied by An external file that holds a picture, illustration, etc.
Object name is gks1456i39.jpg:

equation image
(6)

Equations (6) and (5) specify a birth–death process (34) of motif occurrence on a promoter of length An external file that holds a picture, illustration, etc.
Object name is gks1456i40.jpg. The distribution An external file that holds a picture, illustration, etc.
Object name is gks1456i41.jpg of motif occurrences over time can be expressed as a system of differential-difference equations:

equation image
(7)

The system is illustrated by the right diagram of Figure 1.

The aforementioned model assumes that sequences randomly drift and henceforth no selective pressure is exerted on the evolution of motif occurrence. In contrast, purifying selection should penalize decrements of motif occurrence. Therefore, the evolutionary model of motif occurrence under purifying selection largely resembles the model for neutral evolution (Equation 7) except for a modification of the motif death rate:

equation image
(8)

The motif death rate An external file that holds a picture, illustration, etc.
Object name is gks1456i42.jpg under selection slows down the neutral motif death rate An external file that holds a picture, illustration, etc.
Object name is gks1456i43.jpg by a factor s > 1. We term s as the evolutionary retention coefficient of a motif. Notably, this definition is related yet distinct from the canonical definition of selection coefficient in population genetics (35). In population genetics, the selection coefficient is the decline of the relative fitness of a selectively disadvantageous genotype compared with that of a selectively favoured genotype. In a sufficiently large population, the selectively advantageous genotype will appear with a higher frequency than that of a genotype without selection. In this regard, both the canonical selection coefficient and evolutionary retention coefficient aim for capturing the strength of purifying selection from the observed genotypes. However, despite the common goals shared by the two measures, the evolutionary retention coefficient is distinct from the canonical selection coefficient in two aspects. First, the evolutionary retention coefficient bypasses the abstract notion of relative fitness and directly tackles the consequences of purifying selection—elevation of motif occurrence frequencies. Second, the canonical selection coefficient examines the allele frequencies of a single site, whereas the evolutionary retention coefficient is inferred from the frequencies of motif occurrence in a consecutive region of the genome. We will further clarify the relation between these two scores in simulation studies.

Estimating the evolutionary retention coefficients from empirical data

We estimated the evolutionary retention coefficient of a An external file that holds a picture, illustration, etc.
Object name is gks1456i44.jpg-mer motif from the aligned 5 kb promoter sequences of 34 mammalian species with the following procedures. First, we divided a promoter into multiple segments of fixed length An external file that holds a picture, illustration, etc.
Object name is gks1456i45.jpg. Segments with >10% gaps in any species were discarded. This partition reduces the number of valid terms in Equation (7), hence greatly simplifies subsequent estimation.

Second, we treated humans as the reference species and assumed that alterations of motif counts from the reference to another species followed the birth–death process. t denotes the distance between humans and another species x in the phylogenetic tree, An external file that holds a picture, illustration, etc.
Object name is gks1456i46.jpg and An external file that holds a picture, illustration, etc.
Object name is gks1456i47.jpg the motif counts in the segments of humans and species x, respectively. For each combination of t (or species x), An external file that holds a picture, illustration, etc.
Object name is gks1456i48.jpg and An external file that holds a picture, illustration, etc.
Object name is gks1456i49.jpg, we then counted An external file that holds a picture, illustration, etc.
Object name is gks1456i50.jpg, the total number of segments with An external file that holds a picture, illustration, etc.
Object name is gks1456i51.jpg and An external file that holds a picture, illustration, etc.
Object name is gks1456i52.jpg motif instances in the counterparts of humans and species x.

Third, the log likelihood of motif occurrences can be expressed as

equation image
(9)

where C is a constant and An external file that holds a picture, illustration, etc.
Object name is gks1456i53.jpg denotes the conditional probability derived from the birth–death model under selection by 6. applying the death rate of Equation (8) in Equation (7). Given the relatively short length of segments, we only considered motif occurrences up to 3 and restricted the terms in Equation (7) to An external file that holds a picture, illustration, etc.
Object name is gks1456i54.jpg accordingly. For each fixed value of evolutionary retention coefficient s, we solved An external file that holds a picture, illustration, etc.
Object name is gks1456i55.jpg numerically by the finite difference method for ordinary differential equations. To estimate s that maximizes the log likelihood in Equation (9), we used a binary search to find the optimal s over the interval An external file that holds a picture, illustration, etc.
Object name is gks1456i56.jpg.

Comparison of selection coefficients and evolutionary retention coefficients in simulated data

To elucidate the relation between selection coefficients and evolutionary retention coefficients, we simulated haploid sequence evolution with varying selection coefficients and compared the evolutionary retention coefficients estimated from the observed data with the given selection coefficients. Given a promoter sequence of fixed length (30 nucleotides) and a motif (10 nucleotides), we define the relative fitness of the promoter as An external file that holds a picture, illustration, etc.
Object name is gks1456i57.jpg, where k is the number of motif occurrence on the promoter and An external file that holds a picture, illustration, etc.
Object name is gks1456i58.jpg the selection coefficient. The sequence containing An external file that holds a picture, illustration, etc.
Object name is gks1456i59.jpg motif instances possesses the highest fitness, whereas the sequences containing 1 and 0 motif instance possess intermediate and low fitness, respectively.

We simulated promoter evolution according to both sequence substitutions of individual positions and purifying selection dictated by motif occurrence. One promoter sequence was randomly generated in the first generation. In each of the following generations, each sequence produced 10 progenies with a Poisson mutation rate of 0.02 per position. Among the progenies from the same cohort, 100 of them were selected with probabilities proportional to their relative fitness, and the remaining sequences were eliminated. This process of sequence substitutions and selection lasted for 100 generations. For each selection coefficient, we generated randomly 20 motif and initial promoter sequences and simulated their evolution separately. Finally, we repeated simulations for the following selection coefficient values: 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.49.

In each simulation, the coalescent tree of the 100 observed sequences was recorded. We chose the sequence closest to other observed sequences as the reference and evaluated their phylogenetic distances according to the structures and branch lengths of the coalescent tree. The evolutionary retention coefficient of each designated motif can be consequently inferred from the simulated sequences.

An exhaustive evaluation of evolutionary retention coefficients of all 10-mers on mammalian promoters

Using the aforementioned algorithm, we estimated the evolutionary retention coefficients of all An external file that holds a picture, illustration, etc.
Object name is gks1456i60.jpg 10-mer sequences on the aligned 5 kb promoter sequences of 34 mammalian species. To accelerate computations, we ran the estimation procedures on two PC cluster systems simultaneously. Eight jobs were assigned in parallel to the HP DL360 G7 servers containing dual Intel Xeon E5520 CPUs with 2.27 GHz and 24 GB main memory, and 10 jobs were assigned in parallel to the HP BL460C servers containing Intel Xeon CPUs with 3.16 GHz and 16 GB main memory. The total running time was 6048 hours. Notably, although the theoretical framework we present can handle more general motifs (i.e., a collection of nucleotide sequences), in this work we only investigate the single sequence motifs (i.e., occurrences of a particular 10-mer sequence), as their evolutionary retention coefficients can be exhaustively calculated with limited computing resources.

Functional validation of motif sequences under selective pressure

We incurred the following four tests to validate the functional relevance of motif sequences under selection.

Enrichment of TRANSFAC motifs

First, we demonstrated that evolutionary retention coefficients were correlated with enrichment of transcription factor-binding motifs extracted from TRANSFAC (22). A non-parametric statistical test was used to evaluate the enrichment of transcription factor-binding motifs in high-scoring sequences. In brief, all the 104 857 610-mer sequences were sorted by their evolutionary retention coefficients in a descending order. 168 397 of these sequences matched completely or partially with transcription factor-binding motifs in TRANSFAC. We defined An external file that holds a picture, illustration, etc.
Object name is gks1456i61.jpg over the normalized rank An external file that holds a picture, illustration, etc.
Object name is gks1456i62.jpg resembling the cumulative distribution function (CDF) of TRANSFAC motifs over the sorted 10-mer sequences.

equation image
(10)

An external file that holds a picture, illustration, etc.
Object name is gks1456i63.jpg should have a high area under the curve if TRANSFAC motifs are enriched in the top-ranking sequences. In contrast, the null hypothesis stipulates that TRANSFAC motifs are evenly distributed along the sorted sequences and the corresponding CDF is An external file that holds a picture, illustration, etc.
Object name is gks1456i64.jpg. The maximum deviation between An external file that holds a picture, illustration, etc.
Object name is gks1456i65.jpg and An external file that holds a picture, illustration, etc.
Object name is gks1456i66.jpg gives rise to a statistic of the Kolmogorov–Smirnov test. This method is similar to the Gene Set Enrichment Analysis (GSEA) (36).

Enrichment of protein-binding sites from the ENCODE data

Second, we demonstrated that the top-ranking motifs were enriched in the protein-binding sites of the human genome reported from the ChIP-seq data generated by the ENCODE consortium (23). 407 ChIP-seq data files were extracted from the ENCODE website (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/). Each file reports the sequences of fragments containing the binding sites of one protein in one cell type. For each motif, we constructed a simple null model assuming its occurrence on the entire human genome followed a Poisson process. The rate of motif occurrence per position is An external file that holds a picture, illustration, etc.
Object name is gks1456i67.jpg, where N denotes the number of motif occurrence in the entire genome and L denotes the genome length. Suppose in a ChIP-seq file, the total length of fragments is An external file that holds a picture, illustration, etc.
Object name is gks1456i68.jpg and the number of motif occurrence is An external file that holds a picture, illustration, etc.
Object name is gks1456i69.jpg. Then the Poisson rate of motif occurrence in the designated fragments is An external file that holds a picture, illustration, etc.
Object name is gks1456i70.jpg and the P-value for motif enrichment is

equation image
(11)

Coherence of expression profiles in human and mouse genes

Third, we showed that human and mouse genes harbouring high-scoring motif sequences tended to have coherent expression profiles compared with genes harbouring low-scoring motif sequences. We used both oligonucleotide microarray data (24) and RNA-seq data (25) in defining expression levels and co-expression of human and mouse genes. For the microarray data, we obtained the expression information of human genes and mouse genes from the Gene Atlas V2 dataset (http://symatlas.gnf.org/SymAtlas/). This dataset comprises oligonucleotide microarray data in 63 human and 58 mouse normal tissues sampled from animal bodies. We assigned the expression data from probe sets to corresponding Ensembl genes following (37,38). The expression levels of a gene in a specific tissue were averaged among replicates. For the RNA-seq data, we obtained that of 11 human tissues from GEO Series GSE13652 from the University of Toronto (25) (brain/liver/muscle/cerebral cortex) and GSE12946 from MIT (26) (adipose/breast/colon/heart/lung/lymph node/testes). The raw 32-mer RNA-seq sequence reads were mapped to human genome (Ensembl version v56), and RNA-seq-based gene expression levels were calculated according to (39,40).

The expression profile divergence between two genes in the human or mouse genome was defined by An external file that holds a picture, illustration, etc.
Object name is gks1456i71.jpg, where r is the Pearson’s correlation coefficient of expression levels across the tissues. In the present study, we specifically examined co-expression of genes that are not paralogs or genes located on different chromosomes. The chromosomal coordinates and annotations of paralogous relationships of human and mouse genes based on Ensembl v62 were obtained through BioMart (http://www.biomart.org/).

Functional enrichment of genes harbouring high-scoring motifs

Fourth, we showed that human genes harbouring high-scoring motif sequences were enriched with certain functional classes. Four sources pertaining to functional information of human genes were extracted: the GO categories (27), the curated pathway databases of Reactome (28), Biocarta (29) and the NCI-Nature curations (30). For a given motif, we extracted the genes harbouring the motif on their promoters and calculated hyper-geometric P-values of enrichment for each GO category and pathway. The enriched functional classes for both top-ranking motifs (evolutionary retention coefficient An external file that holds a picture, illustration, etc.
Object name is gks1456i72.jpg, 231 motifs) and control motifs (231 motifs surrounding the median of the sorted list) were reported.

RESULTS

Evolutionary retention coefficients are correlated with selection coefficients in simulated data

We first demonstrate the resemblance between canonical selection coefficients and motif evolutionary retention coefficients with simulated data. For each of 11 pre-determined selection coefficient values, we simulated the evolution of 20 phylogenies, where each phylogeny constituted 100 generations and 100 observed promoter sequences in their leaves. We then inferred the motif evolutionary retention coefficient from the observed promoter sequences of each simulated phylogeny. The left part of Figure 2 shows the (pre-determined) canonical selection coefficients and the (inferred) motif evolutionary retention coefficients of the 220 phylogeny instances. Overall, the two scores exhibit a positive correlation coefficient (r = 0.678). Because direct evaluation of relative fitness in a population is often challenging, the high correlation between the two quantities indicates that the motif retention coefficient is a reasonable measure for the selective strength of motifs.

Figure 2.
Left: The scatter plot of canonical selection coefficients and evolutionary retention coefficients on simulated data. Each point denotes the scores obtained from 100 simulated sequences derived from one common ancestor over 100 generations. Right: Empirical ...

Summary of evolutionary retention coefficients of 10-mer motif sequences

We evaluated the evolutionary retention coefficients of all An external file that holds a picture, illustration, etc.
Object name is gks1456i74.jpg 10-mer sequences among the 5 kb promoters of 27 748 gene families in 34 mammalian species. The right part of Figure 2 shows the empirical distribution of evolutionary retention coefficients, and Supplementary Table S2 reports the sorted evolutionary retention coefficients of these sequences. As expected, the majority of sequences possess low evolutionary retention coefficients: the median value is 0.508 and the scores of about 80% of the sequences (834 552 of 1 048 576) are below 1.0. We considered the first 231 (0.022%) sequences with evolutionary retention coefficients An external file that holds a picture, illustration, etc.
Object name is gks1456i75.jpg as the top-ranking motif sequences and employed further analysis to these sequences.

Motifs with high evolutionary retention coefficients have slower death rates, thus should exhibit high level conservation on promoters. For each motif, we define a conservation measure as the probability of its presence on the promoter of a mammalian species, conditioned on its presence on the orthologous promoter of humans. Figure 3 displays the conditional probabilities of motif presence of the 231 top-ranking motifs (top panel) and 231 control motifs (bottom panel) with evolutionary retention coefficients near the global median and with An external file that holds a picture, illustration, etc.
Object name is gks1456i76.jpg instances on human promoters. The species (indices in the horizontal axis) are sorted by their phylogenetic distances to humans. All (both high-scoring and control) motifs have high level conservation between humans and anthropoid primates (chimpanzees, gorillas, orangutans, macaques, indices 2–5) and marmosets (Callithrix jacchus, index 6), and the conditional probabilities drop abruptly beyond marmosets. For instance, the median conditional probabilities between humans and chimpanzees (index 2), orangutans (index 4) and marmosets (index 6) are 0.861, 0.697 and 0.320 respectively, whereas the median conditional probability between humans and the Philippine tarsiers (Tarsius syrichta, index 7) drops below 0.074. However, the top-ranking motifs retain considerably higher level conservation than control motifs in all selected mammals. For instance, the median conditional probabilities of the top-ranking motifs between humans and guinea pigs (Cavia porcellus, index 14), horses (Equus caballus, index 21) and African elephants (Loxodonta africana, index 28) are 0.074, 0.138 and 0.070 respectively, whereas those of the control motifs are 0.030, 0.072 and 0.031 respectively.

Figure 3.
Conservation of motif occurrence between humans and another species [P(motif occurs in a species | motif occurs in humans)] for the top 231 motifs and 231 control motifs from the middle of the ranked list. The horizontal axis denotes the species index ...

Notably, in Figure 3 all but one of the top 23 motifs have relatively low level conservation compared with the remaining top-ranking motifs. By examining those sequences (Table 1), we found they all belonged to the Alu-J repeat elements (41). They exhibit background level conservation between humans and most other species but undergo a massive number of insertions on the promoters of gray mouse lemurs (Microcebus murinus, index 8) and small-eared galagos (Otolemur garnettii, index 9) (Supplementary Table S3). These insertions violate the Poisson process model of sequence substitutions, increase the counts of An external file that holds a picture, illustration, etc.
Object name is gks1456i77.jpg where An external file that holds a picture, illustration, etc.
Object name is gks1456i78.jpg, and therefore elevate the evolutionary retention coefficients.

Table 1.
Annotations of top-ranking motifs in terms of evolutionary retention coefficients

High-scoring motifs are enriched with transcription factor-binding sites

Transcription factor-binding sites likely accommodate some motif sequences under selective pressure. We confirmed the dependency of transcription factor-binding sites and evolutionary retention coefficients with two external datasets. First, we verified that sequences with higher evolutionary retention coefficients tended to match the transcription factor-binding motifs reported in the TRANSFAC database (22). A simple check on sequences sorted by evolutionary retention coefficients provides obvious evidence: 77 of the top 231 10-mer sequences match TRANSFAC motifs, whereas only 35 of the 231 sequences in the middle and 22 of the 231 sequences in the bottom of the sorted list match TRANSFAC motifs. Supplementary Table S4 reports the TRANSFAC match on top-ranking, middle and bottom control motifs.

In addition to observations on small subsets of sequences, we also quantified this dependency on the entire sorted list. Denote An external file that holds a picture, illustration, etc.
Object name is gks1456i79.jpg a random variable indicating the match of sorted sequences with TRANSFAC motifs. An external file that holds a picture, illustration, etc.
Object name is gks1456i79a.jpg indicates the probability that a sequence with normalized rank An external file that holds a picture, illustration, etc.
Object name is gks1456i81.jpg matches TRANSFAC motifs. If evolutionary retention coefficients are uncorrelated with the presence of TRANSFAC motifs, then all TRANSFAC motifs should be evenly distributed along the normalized ranks, and An external file that holds a picture, illustration, etc.
Object name is gks1456i82.jpg follows a uniform distribution. Therefore, enrichment of TRANSFAC motifs on high-scoring sequences is quantified by the deviation of the empirical distribution of An external file that holds a picture, illustration, etc.
Object name is gks1456i83.jpg from a uniform distribution.

Figure 4 plots the empirical CDF of X (F1(x) in Equation 10) and the CDF of a uniform distribution (F0(x)). F1(x) lies above F0(x) for all x [set membership] [0,1], indicating that sequences with high evolutionary retention coefficients are more likely to match TRANSFAC motifs than those with low evolutionary retention coefficients. The P-value of the Kolmogorov–Smirnov test <10−325.

Figure 4.
Enrichment of TRANSFAC motifs in high-scoring sequences. The blue curve shows the distribution of TRANSFAC motif occurrences along the normalized rank of the sorted 10-mer sequences [An external file that holds a picture, illustration, etc.
Object name is gks1456i84.jpg in equation 10]. The red curve shows the CDF of a uniform distribution ...

Second, we showed that the top-ranking motifs were enriched in the protein-binding DNA fragments reported from the ENCODE data (23). We downloaded 407 files from the ENCODE website. Each file reports the protein-binding DNA fragments generated by one ChIP-seq experiment with a specified antibody and cell type. The 407 files cover 59 proteins (transcription factors, RNA polymerase II, nucleosome-binding proteins, etc.), where multiple ChIP-seq experiments with distinct cell types and replicates were undertaken for each protein. For each motif, we quantified the significance of its enrichment in an ENCODE file with a null model assuming that its occurrence followed a Poisson process with a rate An external file that holds a picture, illustration, etc.
Object name is gks1456i86.jpg, where N denoted the number of motif occurrence in the entire genome, An external file that holds a picture, illustration, etc.
Object name is gks1456i87.jpg the genome length and An external file that holds a picture, illustration, etc.
Object name is gks1456i88.jpg the total fragment length in the file.

We evaluated the enrichment P-values for the top-ranking motifs in each ENCODE file. About 12% of the motif-file combinations (11 063 of 94 017) exhibit significant enrichment (PAn external file that holds a picture, illustration, etc.
Object name is gks1456i89.jpg). To reduce errors generated by individual ChIP-seq experiments, we grouped the results of the same proteins together and counted the fractions of files in each group with significant enrichment. There are 182 motif-protein combinations with at least 5 ENCODE files and significant enrichment P-values (≤An external file that holds a picture, illustration, etc.
Object name is gks1456i90.jpg) in at least 80% of the constituent ENCODE files. The number of enriched motif–protein combinations drops considerably in control motifs. Among the 231 control motifs in the middle of the sorted list, 80 motif–protein combinations retain the same level of enrichment. Furthermore, among the 231 control motifs in the bottom of the sorted list, only 38 motif–protein combinations retain the same level of enrichment. Intriguingly, both TRANSFAC and ENCODE data indicate that levels of enrichment shrink by half from the top to the middle and from the middle to the bottom of the sorted list.

Table 2 shows the 182 motif–protein combinations with significant enrichment. Six motifs are enriched in the ChIP-seq data of at least 10 proteins. These motifs are heavily biased toward GC-rich sequences: GCGCCTGCGC (index 27), GGGGCGGGGC (index 33), GCCCCGCCCC (index 56), GGGCGGGGCC (index 65), GCGCATGCGC (index 76) and GGCCCCGCCC (index 134). The GC-rich sequences match the binding motifs of several proteins such as SP1 (42), AP2 (43), NRF1 (44) and E2F1 (44). Reciprocally, the ChIP-seq files of seven proteins contain at least 10 enriched motif sequences: ELF1, GABP, YY1, ERG1, RAD21, POL2 and PAX5. Some of these proteins (such as SP1, AP2, POL2, YY1, RAD21) ubiquitously regulate many genes, thus their binding motifs yield high evolutionary retention coefficients.

Table 2.
Enrichment of top-ranking motifs in ENCODE ChIP-seq data

Four motif–protein combinations enriched in ENCODE files correspond to exact match with the TRANSFAC data. Motifs 33 (GGGGCGGGGC) and 36 (GGGGGCGGGG) match the SP1-binding motif in TRANSFAC. Motif 7 (CCGCCATCTT) matches the YY1-binding motif, and motif 170 (CTTCCTCTTT) matches the PU.1-binding motif.

Genes harbouring high-scoring motifs tend to retain functional coherence

Genes sharing the same protein-binding sequences on their promoters are likely co-regulated by the same transcription factors. Consequently, we expect genes harbouring high-scoring motifs to possess functional coherence. We validated this prediction with two tests using external data. First, using the method described in (39,40), we evaluated the divergence of expression profiles of genes from two human expression datasets and one mouse expression dataset. The distribution of expression divergence on genes harbouring the top 5000 motifs was compared with the distribution on the genes harbouring the bottom 5000 motifs. Intriguingly, genes harbouring the top 5000 motifs have consistently lower expression divergence than genes harbouring bottom 5000 motifs across all three datasets. The Wilcox test P-values of the deviation between the two gene sets are significant across the three datasets: 2.047 × 10−14, 2.414 × 10−4 and 1.670 × 10−12 respectively. Furthermore, by ruling out the two confounding factors for co-expression—co-localization of genes on the same chromosomes and paralogous genes sharing the same ancestry—the deviation of expression divergence between genes harbouring top 5000 and bottom 5000 motifs remains pronounced. Table 3 reports the significance of the deviation of expression divergence between gene pairs harbouring top 5000 and bottom 5000 motifs. The deviation of expression divergence suggests that genes harbouring motifs of high evolutionary retention coefficients tend to retain functional coherence.

Table 3.
Comparison of the distributions of expression divergence between gene pairs harbouring top 5000 motifs and bottom 5000 motifs

Second, we extracted the human genes harbouring each of the top 231 motif sequences and assessed their over-representations in 3201 GO categories and 889 pathways from three sources. Supplementary Table S5 reports the functional categories and pathways significantly enriched (hyper-geometric P An external file that holds a picture, illustration, etc.
Object name is gks1456i103.jpg) with each top-ranking motif. There are 45 motif-functional class pairs with significant enrichment. In contrast, there are only 9 significant motif-functional class pairs among the 231 control motifs in the middle of the sorted list.

By examining the functional enrichment results in Supplementary Table S5, we found that many top-ranking motifs were highly enriched in functional classes related to transcriptional regulation such as nucleosome assembly (motif CCAGCTCCAG, P-value An external file that holds a picture, illustration, etc.
Object name is gks1456i104.jpg), transcription factor activity (motif CCCCTCCCCC, P-value An external file that holds a picture, illustration, etc.
Object name is gks1456i105.jpg) and chromatin modification (motif CCGCCGCCGC, P-value An external file that holds a picture, illustration, etc.
Object name is gks1456i106.jpg). To justify this observation, we categorized the GO terms into four classes: regulators (transcription factors and signalling proteins, 2563 genes), enzymes (6947 genes), transporters (1087 genes) and structural proteins (571 genes), and evaluated the enrichment P-values of top-ranking and control motif targets in each class. Figure 5 reports the enrichment of motif targets on the four major categories. Strikingly, among the targets of the top 231 motif sequences, 20 are significantly enriched with known regulators (P An external file that holds a picture, illustration, etc.
Object name is gks1456i107.jpg). In contrast, the targets of only one motif are enriched with enzymes, transporters and structural proteins, respectively. Furthermore, among the targets of the 231 control motifs in the middle of the sorted list, only 3 have significant enrichment in regulators and none has significant enrichment in other classes. The results suggest that regulators tend to harbour motifs under stronger selective pressure on their promoters.

Figure 5.
Enrichment of four functional classes—regulators, enzymes, structural proteins and transporters—among the genes harbouring the top-ranking and control motifs. The horizontal axis denotes the four functional classes. The vertical axis denotes ...

Motif sequences with high evolutionary retention coefficients are derived by diverse causes

Beyond statistical validations on the motif sequences sorted by evolutionary retention coefficients, we also examined the individual top-ranking motifs and annotated them with known regulatory sequences or repeat elements. Table 1 reports the functional annotations of the top 231 motifs. Several remarkable features emerge from the annotations. First, 26 10-mers constitute two blocks of 13 consecutive nucleotides (TAGCTCACAGCAACCTCAAACT and AGTTTGAGGTTGCTGTGAGCTA, respectively). These two blocks match exactly the Alu-J repeat elements (41). As shown in Figure 2, they have background level conservation between humans and other species but undergo a massive number of insertions in gray mouse lemurs and small-eared galagos. Second, another ten 10-mers constitute a block of 19 consecutive nucleotides (TGCAGCAGCTGCTGCTGCT). Unlike Alu-J this block does not hit human repeats or gene sequences with significant blast E-values. This block largely coincides with many binding sites of MITF and AP4 according to the cisRED database of genome-wide regulatory module and element predictions (44). Third, three 10-mers (motifs 24, 26, 28) are three phases of the GCC-repeat sequences, and they coincide with many binding sites of ERG1 and DEAF1 according to cisRED. Fourth, four 10-mers (motifs 82, 32, 37, 38) form a 13-nucleotide consecutive block (CTCTGATTGGCTG) and coincide with NF-Y binding sites. Three additional 10-mers also coincide with NF-Y binding sites. Fifth, seven 10-mers are dominated by Cs and Gs (motif 27, GCGCCTGCGC; motif 33, GGGGCGGGGC; motif 36, GGGGGCGGGG; motif 56, GCCCCGCCCC; motif 65, GGGCGGGGCC; motif 70, CCCCGCCCCC; motif 134, GGCCCCGCCC), and most of these motif sequences coincide with the binding sites of SP1, AP2 and ERG1. Sixth, two 10-mers (motif 7, CCGCCATCTT; motif 53, GCCGCCATCT) form a consecutive block (GCCGCCATCTT) and largely coincide with the binding sites of YY1 (45).

DISCUSSION

Evolution of cis-regulatory elements is an essential and critical aspect of the evolution of the gene regulatory systems. Prior models and studies focus primarily on the sequence evolution of selected known cis-regulatory elements but do not characterize the evolution of all possible regulatory sequences. In this work, we propose a model to quantify the strength of purifying selection for motif sequences of a fixed length and estimate the evolutionary retention coefficients of all 10-mer sequences from the aligned promoters of 34 mammalian species. A series of validation tests confirm the functional relevance of the proposed evolutionary retention coefficients. High-scoring motifs are enriched with transcription factor-binding sites according to curated information from TRANSFAC and ChIP-seq experimental data from ENCODE. Furthermore, genes harbouring high-scoring motifs retain more coherent expression profiles in human and mouse and are over-represented in the functional categories and pathways involved in transcriptional regulation.

Many high-scoring motif sequences are bound by regulatory proteins with versatile or prevalent functions: POL2, SP1, YY1, RAD21 and AP2. POL2 encodes the RNA polymerase II that interacts ubiquitously with DNAs. SP1 encodes a zinc-finger transcription factor involved in many cellular processes, including cell differentiation, cell growth, apoptosis, immune responses, response to DNA damage and chromatin remodelling. YY1 encodes a ubiquitously distributed transcription factor belonging to the GLI-Kruppel class of zinc finger proteins. RAD21 encodes a nuclear protein involved in the repair of DNA double-strand breaks, as well as in chromatid cohesion during mitosis. AP2 (TFAP2A) encodes a transcription factor that interacts with enhancer elements. Furthermore, the high-scoring motifs interacting with these proteins are highly biased toward GC-rich sequences. Ubiquitous presence of these motifs on many target genes probably accounts for their high evolutionary retention coefficients. In contrast, binding motifs of the transcription factors with small numbers of specific targets will not exhibit high evolutionary retention coefficients, as there are only a few instances on promoters.

A bias toward abundant sequences among the high-scoring motifs may be relieved by adjusting the rates of the birth–death process to fit the background frequencies of motifs. Currently, the rates of drifting into and out of a motif depend primarily on the relative volume of the motif and frequencies of single nucleotides in sequence space (r01 and r10). To further reduce this bias, we can adjust the weights of transitions in Equation (3) according to the background frequencies of motifs in the entire genomes.

The birth–death model of neutral evolution (Equation 7) is based on the simplest model of sequence substitution assuming all nucleotides transition with an equal rate. This assumption is incongruent with various observed biases in sequence evolution. For instance, on single nucleotides, the rates of transitions (purine → purine or pyrimidine → pyrimidine) are higher than those of transversions (purine → pyrimidine or pyrimidine → purine). On dinucleotides, the mutation rates of CpG → TpG tend to be higher as methylated cytosines deaminate to form thymines. The current model can be extended to incorporate these biases with a price of complexity. In an extended version, the two parameters specifying relative transition rates from motifs to non-motifs and vice versa (r01 and r10) depend not only on motif complexity but also on its constituting sequences. Transversions between purines and pyrimidines are penalized while substitutions from CpG to TpG are rewarded. For the computational cost for implementing and running this extended model, we decided to leave this task in the future work.

Conservation of motif occurrences can be viewed as an aggregation of two factors: (i) the fraction of functional motif instances among all motif occurrences and (ii) the level of conservation among the functional motif instances. The function of a motif in eukaryotic genomes depends on various contextual factors such as the presence of other transcription factor-binding sites and enhancers, nucleosome positions and chromatin configurations. Hence only a fraction of motif occurrences are likely to be functional and are subjected to selective constraints. Moreover, among these functional instances, differential levels of conservation may be manifested on distinct sites. Some regulatory subsystems are less tolerant with dysregulation and thus undergo a stronger selective pressure, whereas others may be more flexible and thus retain a lower level of conservation. Disentangling these two factors from sequence data alone is very challenging, as the contextual information often cannot be determined by sequences. Alternatively, functional information such as ChIP-seq data can provide additional clues about the first factor. By examining the overlaps of selected motifs and transcription factor-binding sites, we can estimate the fraction of functional motif instances. The level of conservation of a few transcription factor-binding motifs has been investigated in the prior studies mentioned in Introduction.

It is puzzling that the top-ranking motifs are over-represented on the promoters of regulators (transcription factors and other DNA-binding proteins, signalling proteins) but not on other functional categories (enzymes, transporters, structural proteins). The results suggest that the regulatory circuits of regulators possess elements with high selective constraints, whereas those of other proteins do not. We provide two speculations to explain this observation. First, many regulators are involved in processes with pervasive impacts such as chromatin modification, nucleosome assembly and multicellular organismal development. Constituent genes of these processes are thus subjected to tighter selective constraints and are regulated by motifs with high evolutionary retention coefficients. Second, many motifs overrepresented in regulators are the GC-rich binding sequences of the aforementioned proteins. Regulatory programs of regulators may result from the combinatorial interactions between these generic motifs and other process-specific motifs. In contrast, the regulatory programs of other proteins may be dominated by process-specific motifs. Further experimental data and analysis are required in order to verify these speculations.

In the present study, the log likelihood function is obtained by comparing motif occurrences between a reference species (human) and all the other species (Equation 9). This is not an exact form of the joint log likelihood function, as it assumes the motif occurrences of other species are independent conditioned on human data and ignores their dependencies owing to a shared phylogeny. A more accurate form is to sum up the log conditional probabilities along all branches of the phylogenetic tree:

equation image
(12)

where An external file that holds a picture, illustration, etc.
Object name is gks1456i108.jpg denotes a branch in the phylogenetic tree T with length An external file that holds a picture, illustration, etc.
Object name is gks1456i109.jpg, An external file that holds a picture, illustration, etc.
Object name is gks1456i110.jpg and An external file that holds a picture, illustration, etc.
Object name is gks1456i111.jpg denote the motif occurrences at initial and terminal time points of the branch, respectively. Motif occurrences on ancestral nodes of the phylogenetic tree are not directly observed. Hence a dynamic programming algorithm can be applied to either reconstruct the maximum likelihood promoter sequences of ancestors or marginalize over the probability distributions of all possible sequence configurations (32). However, this accurate formulation requires reconstruction of 5 kb upstream sequences (or their probability distributions) of 27 748 orthologous gene families from 34 mammalian species. Owing to its tremendous computational cost, we decided to implement the simplified approximation for an exhaustive screening of all 10-mer motif sequences on all orthologous gene families. For subsequent studies on specific motifs and selected gene families, the accurate version in Equation (12) should be adopted.

In the present study, we consider each motif as one unique 10-mer nucleotide sequence. The formulation of the motif evolutionary model, however, does not impose this restriction. A motif can be a collection of sequences represented by one or multiple strings of 15 IUPAC symbols (e.g., the TP53 binding motif is NGRCWTGYCY, where R denotes A or G, W denotes A or T, Y denotes C or T and N denotes any base). The choice of investigating single sequence motifs is based on the concern of computational efficiency. Exhaustive evaluations of selection coefficients of all composite motifs are beyond the computing capacity accessible by the authors. For instance, there are An external file that holds a picture, illustration, etc.
Object name is gks1456i112.jpg motifs represented by 10-mers of IUPAC symbols without gaps. The number of these composite motifs is 54 994 folds as the number of 10-mer single sequence motifs, which would take about 333 million CPU hours using the current computing infrastructure. This number will grow exponentially when combinations of 10-mer IUPAC strings and gaps are considered. Therefore, simplifications without exhausting all possible sequences are required when extending the analysis into composite motif sequences.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–5 and Supplementary Figure 1.

FUNDING

Academia Sinica (to C.H.Y. and D.H.C.); National Science Council in Taiwan [98-2118-M-001-025-MY2 to C.H.Y.]. Funding for open access charge: National Science Council, Taiwan [100-2118-M-001-008-MY2].

Conflict of interest statement. None declared.

Supplementary Material

Supplementary Data:

ACKNOWLEDGEMENTS

We thank Ker-Chau Li, Robert Shin-Sheng Yuan and Guan-I Wu for logistic assistance.

REFERENCES

1. Carroll SB. Evolution at two levels: on genes and form. PLoS Biol. 2005;3:e245. [PMC free article] [PubMed]
2. Davidson EH, Erwin DH. Gene regulatory networks and the evolution of animal body plans. Science. 2006;311:796–800. [PubMed]
3. King MC, Wilson AC. Evolution at two levels in humans and chimpanzees. Science. 1975;188:107–116. [PubMed]
4. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory motifs. Nature. 2003;423:241–254. [PubMed]
5. Siepel A, Haussler D. Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. 2004;11:413–428. [PubMed]
6. Emberly E, Rajewsky N, Siggia ED. Conservation of regulatory elements between two species of Drosophila. BMC Bioinf. 2003;4:57. [PMC free article] [PubMed]
7. Tautz D. Evolution of transcriptional regulation. Curr. Opin. Genet. Dev. 2000;10:575–579. [PubMed]
8. Dermitzakis ET, Clark AG. Evolution of transcription factor binding sites in mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 2002;19:1114–1121. [PubMed]
9. Dermitzakis ET, Bergman CM, Clark AG. Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Mol. Biol. Evol. 2003;20:703–714. [PubMed]
10. Chen K, Rajewsky N. The evolution of gene regulation by transcription factors and microRNAs. Nature Genet. 2007;8:93–103. [PubMed]
11. Ludwig MZ, Bergman C, Patel NH, Kreitman M. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature. 2000;403:564–567. [PubMed]
12. Ruvinsky I, Ruvkun G. Functional tests of enhancer conservation between distantly related species. Development. 2003;130:5133–5142. [PubMed]
13. Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA. Evolutio after gene duplication: models, mechanisms, sequences, systems, and organisms. J. Exp. Zool. B. Mol. Dev. Evol. 2007;308:58–73. [PubMed]
14. Sinha S, Siggia ED. Sequence turnover and tandem repeats in cis-regulatory modules in Drosophia. Mol. Biol. Evol. 2005;22:874–885. [PubMed]
15. Stone JR, Wray GA. Rapid evolution of cis-regulatory sequences via local point mutations. Mol Biol. Evol. 2001;18:1764–1770. [PubMed]
16. MacArthur S, Brookfield JFY. Expected rates and modes of evolution of enhancer sequences. Mol. Biol. Evol. 2004;21:1064–1073. [PubMed]
17. Berg J, Willmann S, Lassig M. Adaptive evolution of transcription factor binding sites. BMC Evol. Biol. 2004;4:42. [PMC free article] [PubMed]
18. Mustonen V, Lassig M. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc. Natl Acad. Sci. USA. 2005;102:15936–15941. [PubMed]
19. Doniger SW, Fay JC. Frequent gain and loss of functional transcription factor binding sites. PLoS Comput. Biol. 2007;3:e99. [PMC free article] [PubMed]
20. Yeang CH. Proceedings of the 10th Workshop on Algorithms in Bioinformatics (WABI) Liverpool, UK: 2010. Quantifying the strength of natural selection of a motif sequence. 2010.
21. Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 2009;37:D755–D761. [PMC free article] [PubMed]
22. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–378. [PMC free article] [PubMed]
23. Myers RM, Stamatoyannopoulos J, Snyder M, Dunham I, Hardison RC, Bernstein BE, Gingeras TR, Kent WJ, Birney E, et al. ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE) PLoS Biol. 2011;9:e1001046. [PubMed]
24. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden J, Hayakawa M, Kreiman G, et al. A gene atlas of the mouse and human protein-coding transcriptomes. Proc. Natl Acad. Sci. USA. 2004;101:6062–6067. [PubMed]
25. Pan Q, Shai O, Lee LJ, Frey BJ, Belencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008;40:1413–1415. [PubMed]
26. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. [PMC free article] [PubMed]
27. The Gene Ontology Consortium. The gene ontology project in 2008. Nucleic Acids Res. 2008;36:D440–D444. [PMC free article] [PubMed]
28. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005;33:D428–D432. [PMC free article] [PubMed]
29. Yu N, Seo J, Rho K, Jang Y, Park J, Kim WK, Lee S. hiPathDB: a human-integrated pathway databasenwitl facile visualization. Nucleic Acids Res. 2012;40:D797–D802. [PMC free article] [PubMed]
30. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the pathway interaction database. Nucleic Acids Res. 2009;37:D674–D679. [PMC free article] [PubMed]
31. Zuckerandl E, Pauling L. Evolutionary divergence and convergence in proteins. In: Bryson V, Vogel HJ, editors. Evolving Genes and Proteins. New York: Academic Press; 1965. pp. 97–166.
32. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. [PubMed]
33. Bird AP. CpG islands as gene markers in the vertebrate nucleus. Trends Genet. 1987;3:342–347.
34. Kendall DG. On the generalized birth-death process. Ann. Math. Stat. 1948;19:1–15.
35. Gillespie J. Population genetics – a concise guide. Baltimore, Maryland, USA: The Johns Hopkins University Press; 2004.
36. Subramanian A, Tamayo P, Mootha VK, Mukherhjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. [PubMed]
37. Liao BY, Zhang J. Coexpression of linked genes in Mammalian genomes is generally disadvantageous. Mol. Biol. Evol. 2008;25:1555–1565. [PMC free article] [PubMed]
38. Liao BY, Chang AY. Mammalian genes preferentially co-retained in radiation hybrid panels tend to avoid coexpression. PLoS One. 2012;7:e32284. [PMC free article] [PubMed]
39. Chang AY, Liao BY. DNA methylation rebalances gene dosage after Mammalian gene duplications. Mol. Biol. Evol. 2012;29:133–144. [PubMed]
40. Qian W, Liao BY, Chang AY, Zhang J. Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends Genet. 2010;26:425–430. [PMC free article] [PubMed]
41. Jurka J, Smith T. A fundamental division in the Alu family of repeated sequences. Proc. Natl Acad. Sci. USA. 1988;85:4775–4778. [PubMed]
42. Kriwacki RW, Schultz SC, Steitz TA, Caradonna JP. Sequence-specific recognition of DNA by zinc-finger peptides derived from transcription factor Sp1. Proc. Natl Acad. Sci. USA. 1992;89:9759–9763. [PubMed]
43. Roesler WJ, Vandenbark GR, Hanson RW. Cyclic AMP and the induction of eukaryotic gene transcription. J. Biol. Chem. 1988;263:9063–9066. [PubMed]
44. Robertson AG, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith OL, Zhang X, et al. cisRED: A database system for genome scale computational discovery of regulatory elements. Nucleic Acids Res. 2006;34:D68:73. [PMC free article] [PubMed]
45. Kim JD, Kim J. YY1’s longer DNA-binding motifs. Genomics. 2009;93:152–158. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press