The development of animals from zygotes to adults and the differentiation of cells into distinct tissues and organs require the expression of a specific set of genes at each developmental stage and in each cell type1. The features distinguishing humans from apes have long been attributed to differences in gene expression2, and aberrant gene expression lies at the heart of multiple diseases. Thus, identifying the DNA sequences required for regulating gene expression, called cis-regulatory modules (CRMs), can both expand our understanding of biology and have applications in several fields including evolution and medicine. For example, most of the genetic variants significantly associated with susceptibility to disease do not lie in protein-coding regions3, and we surmise that many affect the regulation of gene expression.
Three major approaches have emerged for predicting CRMs. The first is to search genomic DNA for clusters of short motifs that are needed for the specific binding of transcription factors (TFs). Although CRMs should contain multiple such motifs, this approach to identifying CRMs has had limited success. A second approach for identifying CRMs involves comparing homologous, noncoding DNA sequences between related species. These methods can reveal important subsets of conserved CRMs that are under purifying selection, such as developmental enhancers, but they miss lineage-specific ones. More recently, high-throughput, direct assays for DNA sequences that have epigenetic features characteristic of regulatory regions provide a third approach that has potentially high predictive power for identifying CRMs. This method, which involves mapping the locations of TF-binding and histone modifications in a wide range of tissues and developmental stages, yields an unbiased genomic view of potential gene-regulatory regions that is not restricted to conserved regions or those with known regulatory motifs.
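To make the first approach concrete, the following is a minimal sketch of a motif-cluster scan, assuming exact-match motif strings and a fixed window size. The function name `find_motif_clusters`, the parameters `window` and `min_hits`, and the two motif strings in the usage note are hypothetical placeholders chosen for illustration and are not taken from any particular published tool. Practical motif scanners typically score candidate sites with position weight matrices rather than exact strings.

```python
import re


def find_motif_clusters(sequence, motifs, window=200, min_hits=3):
    """Report groups of motif hits whose start positions fall within `window` bp.

    Returns a list of (first_hit_start, last_hit_start, hit_count) tuples.
    Motifs are treated as regular-expression patterns (an assumption of this
    sketch); plain DNA strings therefore behave as exact matches.
    """
    # Collect the start of every motif occurrence. The lookahead lets
    # overlapping occurrences of the same motif be counted separately.
    hits = sorted(
        m.start()
        for motif in motifs
        for m in re.finditer(f"(?={motif})", sequence)
    )

    clusters = []
    left = 0
    for right, pos in enumerate(hits):
        # Advance the left edge until all hits from left..right fit in the window.
        while pos - hits[left] >= window:
            left += 1
        count = right - left + 1
        if count >= min_hits:
            clusters.append((hits[left], pos, count))
    return clusters
```

For example, with the placeholder motifs `"TGACTCA"` and `"GATAAG"`, a sequence carrying three hits within about 40 bp and a fourth hit 500 bp downstream yields a single reported cluster when `window=100` and `min_hits=3`. A scan of this kind flags any dense group of matches, whether or not the region is functional, which is one reason motif clustering alone has had limited success.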
We briefly review the major types of CRMs being studied in animals and then review the strengths and weaknesses of the three approaches to CRM prediction, assessing the success rates of each. We suggest ways to use the three approaches in combination to improve predictions, and discuss important questions for future research. Improvements in CRM prediction and classification are already leading to advances in understanding how genetic variants affect susceptibility to disease4–7. Our emphasis in this review is to assess the efficacy of these methods and suggest ways in which they can be improved. Readers are referred to other recent reviews for more details on the biochemical features of chromatin around CRMs8–13, prediction methods that are based on conservation and motifs14,15, and earlier comparisons of the different approaches16,17.