We have described a methodology to search for transcription factor binding sites that is particularly suited for factors that have a flexible DNA binding profile: the intersection of microarray data, HMM genome scan, and UCSC conservation tracks. Hidden Markov Models are widely used in bioinformatics for applications that range from exon boundary prediction to conservation analyses
[39]. The use of HMMs to search for transcription factor binding sites has been reported by several groups
[27],
[40]–
[42]; however, there is no study that integrates genomic HMM scans with conservation and expression data, or that corroborates predicted novel targets with experimental validation. Moreover, no publicly available software allows the search of entire genomes for matches to a custom HMM. Thus, this study constitutes the first large-scale expression analysis for COUP-TFI/NR2F1 with validation of direct COUP-TFI regulation
in vitro and
in vivo of novel target genes predicted by a bioinformatic approach.
We address the uncertainty regarding the performance of different microarray normalization methods
[15] by comparing the results of two algorithms -GCRMA and dChIP. Although the correlation between these two methods was limited, we reconciled both sets of results by extracting gene probes with a significant genotype difference and the best p-value correlation and validated the results by quantitative real-time RT-PCR using inner ear tissue.
The number of total COUP-TF HMM hits in the entire mouse genome is several orders of magnitude higher than the number of binding sites reported by ChIP-based location analyses
[2]. However, the number of COUP-TF HMM matches found within a random nucleotide sequence is negligible
(), and most of the sites are enriched within the non-coding genomic regions. These facts support the idea that these DNA-binding compatible sequences are not found in error, but rather that this is their true distribution and density within the genome. Other researches have reached similar conclusions
[31]. The prevailing dogma posits that most transcription factor binding sites found
in silico – although they would readily bind the protein
in vitro- will not have a functional role
in vivo (a concept defined as the “futility theorem”
[31]). Indeed, most ChIP-based assays have only a partial overlap with binding sites found by computational searches, and most sites predicted
in silico are not occupied by the cognate transcription factor when pursued experimentally. However, three factors account for at least part of this disparity in the number of transcription factor binding sites. First, most ChIP-based datasets are considered to be incomplete, since the information is restricted to a particular tissue or cell line and developmental time point. For example, some nuclear receptor binding sites may be functional only in a tissue-specific or developmental-time specific manner, and subordinate to chromatin modifications that occur at a certain stage of cell fating, during development, aging or disease state. Second, ChIP assays have specific technical limitations, such as the sensitivity of the antibody and the particular experimental conditions, and they may not detect transient protein-DNA interactions. In fact, high-throughput ChIP assays performed for the same nuclear receptor under similar experimental conditions have only partial overlap in their results
[2],
[43],
[44]. Third, it is likely that the functional role of a given DNA-binding site is also determined by the specific content of the surrounding DNA, and that unique combinatorial interactions between several transcription factors determine the expression profile of each cell. This concept is also supported by available nuclear receptor location data
[43]. However, we envision that it is feasible to increase the selectivity of bioinformatics searches by modeling higher order cis-regulatory modules (CRMs)
[31], consisting of several transcription factor binding sites, using the same HMM statistical principles and software described herein.
Our bioinformatic search for transcription factor binding sites is limited by the fact that it will not detect regulatory sites where recruitment to the DNA is dependent on tethering to other transcriptional cofactor(s), a mechanism that is estimated to account for 30% of nuclear receptor-dependent gene regulation
[2]. Physical association assays based on the chromatin immunoprecipitation technique are required to detect this type of downstream targets. Another constraint of our approach is that we limited our scan for COUP-TF HMM matches within the proximal regulatory regions, and therefore our analysis excludes regulatory sites located far from the gene's transcription start site, and it is susceptible to any mis-annotations in the corresponding genome assembly. Location assays have indeed identified that a significant proportion of nuclear receptor binding sites are distant from the regulated genes
[43]. Thus, chromatin immunoprecipitations remain the gold standard to detect DNA-protein interactions. We present the ChromAnalyzer software tool not as a substitute for ChIP-based assays, but as a reasonable alternative to perform quick and inexpensive genome-wide nuclear receptor location analyses, particularly for those laboratories with a limited budget or who are technically limited by the size of their cell population, such as stem cell studies or our own work on the developing and newborn cochlear sensory epithelia, which is only composed of 16,000 mechanosensory hair cells. Following ChromAnalyzer bioinformatic examination, the tissue and time-specific binding sites can be subsequently filtered with the aid of expression and physical association assays, as we have demonstrated. The definition of proximal regulatory elements can be revised and the microarray correlation analysis repeated iteratively with new parameters as necessary, although ChIP-based approaches are still necessary to identify very distant or non-canonical binding sites.
As the third layer of correlation, we used phylogenetic footprinting to filter the HMM hits, extracting the location of binding sites near microarray gene hits that are also conserved across evolution. Although phylogenetic footprinting aids in prioritizing target lists and increases specificity by ~90%
[31], it is important to bear in mind that it will miss conserved sites in regions that are not easily aligned. Therefore, absence of a conserved site – or absence of a site altogether – does not fully discard the possibility of regulation by COUP-TFI. Furthermore, the site might be missed by a particularly low
j value or consist of an atypical nucleotide distribution. At present, our genomic HMM scans are not intended to be used in isolation, but rather as one layer of evidence to guide the design of experiments, accelerating the otherwise arduous process of gene-wise transcription factor binding site validation. Furthermore, ChromAnalyzer HMM scans are meant to be an iterative process: each newfound, validated COUP-TFI DNA binding site will be incorporated into the HMM training list, and thus the COUP-TF HMM parameters will be fine-tuned to reflect the exact profile of the known COUP-TF binding sites.
We first describe the detailed and mechanistic experimental validation
in vitro and
in vivo of one of the highest-ranked candidate COUP-TFI targets identified by our analysis: the fatty-acid binding protein 7 (Fabp7) gene. Fabp7 is one of 9 intracellular fatty acid binding proteins, and it is expressed in the embryonic and adult nervous system, retina and mammary gland
[33]. Fabps act as lipid chaperones, promoting cellular uptake and transport of fatty acids and targeting them to specific organelles and metabolic pathways, including the nucleus for transcriptional regulation
[45]. Fabp7 is a marker for radial glial cells –the neuronal and glial precursor stem cells- during embryonic neurogenesis
[33],
[46],
[47], where COUP-TFI and COUP-TFII are also expressed and required for the timely neurogenesis to gliogenesis switch
[10]. Although Fabp7
−/− mice have no anatomical neurologic abnormalities –possibly due to compensatory overexpression of other Fabp homologues-, they do exhibit increased anxiety and fear memory
[48], decreased prepulse inhibition (a schizophrenia endophenotype), and decreased neurogenesis in the adult hippocampus
[49].
Although the role and expression pattern of Fabp7 is not known in all tissues, our results suggest that COUP-TFI lies upstream of Fapb7 and mediates the regulation of proliferation and differentiation of stem cells during neurogenesis and inner ear development. Several FABP proteins interact with orphan receptors to deliver their ligands in the nucleus, and their transcription rate is in turn regulated in a positive feedback loop by the activated nuclear receptor. For example, the
Fabp1 gene is a PPARα transcriptional target, and FABP1 transports PPARα ligands and can physically interact with this receptor in the nucleus
[50]. FABP7 binds long-chain fatty acids (LCFAs), with highest affinity (10
−9) for docosahexanoic acid (DHA)
[51], an essential fatty acid and an abundant lipid component in the nervous system. It is possible that Fabp7 transports a lipid ligand that activates COUP-TFI. A recent study demonstrated that supraphysiologic levels of retinoic acid activate COUP-TFII and used X-ray crystallographic data to propose that ligand binding releases COUP-TFII from a repressed state to allow cofactor binding and regulation of its target genes
[52]. However, endogenous ligands for these orphan receptors remain elusive.
As further ‘proof-of–concept’ of our methodology, 3 novel direct COUP-TFI targets
in vitro -
Crabp1,
Sod1 and
Casq1-, as well as one novel target
in vivo –
Foxo3a- were validated. These genes are known to be expressed in the cochlea and are relevant for inner ear physiology:
Crabp1 is expressed during cochlear development and controls the intracellular concentration of free retinoids
[53],
[54] (key hormone regulators of inner ear embryogenesis
[55]);
Casq1 participates in the calcium buffering system that is upregulated to protect hair cells from noise trauma and lies within DFNA49, an autosomal dominant non-syndromic hearing loss locus
[56]–
[58]; and mutations in
Sod1 are associated with early-onset hearing loss in mice
[59]. Interestingly,
Foxo3a regulates metabolic processes and is linked to longevity in species ranging from
C. Elegans to humans
[60],
[61], and it is deregulated in the inner ears of calorie-restricted mice
[62]. As stated, we will continue the experimental validation of candidate target genes, and will iteratively complement our approach with ChIP-seq analyses for COUP-TFI target genes throughout cochlear and brain development to refine the identification of COUP-TFI targets.
The bioinformatics tool we describe can be modified to scan for response elements of any transcription factor, combinations of several types of response elements, or any kind of flexible DNA pattern. A primary requirement is the availability of a pool of annotated binding sites or SELEX dataset in order to train a Hidden Markov Model. We propose that given the vast number of expression datasets and experimental data already available, this approach can be readily applied to expand the pool of candidate target genes for other nuclear receptors as well.