Understanding the regulation of gene expression is a complex problem and one of the most challenging domains of biological and biomedical research. Intensive ongoing studies aim to elucidate the detailed mechanisms of transcriptional regulation in eukaryotes. Transcription factors (TFs) are proteins that regulate the activity of a gene at the level of mRNA synthesis. These factors bind to specific DNA sequences at positions in the genome near the gene and either reduce or enhance its transcription rate [1]. The binding of TFs to DNA requires specific short cis-regulatory sequences (binding sites), usually located upstream of the 5' end of the gene (the gene promoter). Binding sites may also be located in the promoter-proximal region or more distally from the target gene [2]. Different DNA binding sites for a specific TF often share common features called sequence motifs [3]. The binding site motifs are often highly degenerate, which makes it challenging to build reliable models for these DNA-encoded signals [4]. A common approach to building these models is the use of position weight matrices (PWMs) [5].
A crucial limitation of the PWM approach is the scarcity of high-confidence, experimentally verified binding sites. One way to address this problem is to include additional transcription factor binding sites (TFBS) identified computationally, namely genomic sequences with substantial similarity to the PWM of a particular TF [14].
Several methods to build PWMs have been described. One of the most successful was proposed by Staden [8]. It uses a collection of aligned TFBS to calculate a base frequency table. The table comprises four rows, one for each nucleotide (A, T, G and C), and its columns correspond to the positions along the binding site. The weight matrix holds the logarithms of the probabilities of finding each base at each position in a signal. Correspondingly, the PWM is an estimate of the log-probabilities of each base occurring at each position in the aligned TFBS.
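To make the construction above concrete, the following is a minimal sketch of a log-probability matrix built from aligned sites. It is not the original Staden implementation: the function names `build_pwm` and `score`, the pseudocount (an assumption added here so that unseen bases do not produce log(0)), and the four toy sites are all hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return one dict per position mapping base -> log-probability.

    `pseudocount` is an assumption of this sketch, not part of the
    description in the text; it keeps every probability above zero.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for pos in range(length):
        # Base frequency table column: one count per nucleotide.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Store the logarithm of each estimated base probability.
        pwm.append({base: math.log(counts[base] / total) for base in BASES})
    return pwm

def score(pwm, seq):
    """Sum the per-position log-probabilities of `seq` (same length as the PWM)."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Hypothetical toy alignment, not real binding-site data.
toy_sites = ["AGATAA", "TGATAG", "AGATAG", "CGATAA"]
toy_pwm = build_pwm(toy_sites)
```

A candidate sequence of the same length can then be scored with `score(toy_pwm, candidate)`; under this sketch, sequences resembling the aligned sites receive higher (less negative) scores.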
The Staden method does not define an optimal cutoff that minimizes the level of false-positive predictions for a given level of true positives [4]. Bucher described a method to optimize the cutoff value of the PWM [16] that was extended by Tsunoda and Takagi [17]. They calculated the optimal cutoff values for 205 vertebrate TFs from TRANSFAC. The method proposed by Gershenzon et al. [18] is another extension of the Staden-Bucher method [8]. It optimizes various PWM parameters, including the cutoff, and calculates the sensitivity and specificity of the derived PWM [19]. In the present study we adopt the method of Gershenzon et al. [18], first to use a PWM built on experimental binding-site data from JASPAR to identify probable GATA-3 binding sites within promoters, and subsequently to incorporate additional binding-site information into the PWM, thereby achieving better sensitivity and/or specificity of putative binding motif prediction by the optimized PWM. Analysis of high-throughput ChIP data opens additional opportunities for PWM optimization. The study by Leping et al. [21] used ChIP data (human and mouse Oct4 and human p53) for PWM optimization with a genetic algorithm. However, the main problem with using ChIP data for PWM optimization is its low resolution, which may result in a high level of false-positive predictions by the optimized PWM. To overcome this problem, we assume that GATA-3 binding sites are more likely to be located in a relatively narrow area of the promoter region. Our method would also be useful for optimizing PWMs of TFs for which ChIP data are not yet available.
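The idea of choosing a cutoff that minimizes false positives for a given level of true positives can be illustrated with a short sketch. This is a generic threshold selection, not the procedure of Bucher, Tsunoda and Takagi, or Gershenzon et al.; the function name `choose_cutoff`, the parameter `min_sensitivity`, and the toy score lists are assumptions introduced here. Because raising the cutoff can only remove predictions, the highest cutoff that still retains the required fraction of known sites is also the one with the fewest false positives under that constraint.

```python
import math

def choose_cutoff(positive_scores, background_scores, min_sensitivity=0.8):
    """Pick the highest cutoff that keeps at least `min_sensitivity` of known sites.

    positive_scores:   PWM scores of known (true) binding sites
    background_scores: PWM scores of sequences assumed not to be binding sites
    Returns (cutoff, sensitivity, false_positive_rate).
    """
    ranked = sorted(positive_scores, reverse=True)
    # Number of known sites that must score at or above the cutoff.
    needed = max(1, math.ceil(min_sensitivity * len(ranked)))
    cutoff = ranked[needed - 1]
    sensitivity = sum(s >= cutoff for s in positive_scores) / len(positive_scores)
    false_positive_rate = (
        sum(s >= cutoff for s in background_scores) / len(background_scores)
    )
    return cutoff, sensitivity, false_positive_rate

# Hypothetical toy scores, not real data.
known_site_scores = [9.0, 8.0, 7.0, 2.0]
background_site_scores = [1.0, 3.0, 7.5, 0.5, 2.5]
cutoff, sens, fpr = choose_cutoff(
    known_site_scores, background_site_scores, min_sensitivity=0.75
)
```

In practice the background set matters a great deal: the false-positive rate is only as meaningful as the assumption that the background sequences contain no true sites.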
A standard PWM approach is based on the assumption that individual nucleotides contribute independently and additively to the binding of a TF to a given DNA motif [3]. Yet previous studies [18] demonstrate that some TFBS nucleotides are mutually dependent. To account for such non-additive effects, we proposed that di-nucleotide PWMs may be more accurate [18]. In our analysis, we optimize both the mono-nucleotide and the di-nucleotide matrices. (See Materials and Methods for details.)
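One simple way to capture dependence between neighbouring nucleotides is to tabulate adjacent base pairs instead of single bases. The sketch below illustrates that idea only; it is not necessarily the di-nucleotide formulation used in this study (which is described in Materials and Methods), and the names `build_dinucleotide_pwm` and `score_dinucleotide`, the pseudocount, and the toy sites are hypothetical.

```python
import math
from itertools import product

# All 16 ordered pairs of nucleotides.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]

def build_dinucleotide_pwm(sites, pseudocount=1.0):
    """Return one dict per adjacent position pair: dinucleotide -> log-probability.

    A site of length L yields L - 1 columns. The pseudocount is an
    assumption of this sketch that avoids taking log(0).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for pos in range(length - 1):
        counts = {pair: pseudocount for pair in DINUCLEOTIDES}
        for site in sites:
            counts[site[pos:pos + 2]] += 1
        total = sum(counts.values())
        pwm.append({pair: math.log(counts[pair] / total) for pair in DINUCLEOTIDES})
    return pwm

def score_dinucleotide(pwm, seq):
    """Sum log-probabilities over the overlapping adjacent pairs of `seq`."""
    return sum(column[seq[i:i + 2]] for i, column in enumerate(pwm))

# Hypothetical toy alignment, not real binding-site data.
toy_sites = ["AGATAA", "TGATAG", "AGATAG", "CGATAA"]
toy_di_pwm = build_dinucleotide_pwm(toy_sites)
```

With 16 entries per column rather than 4, a di-nucleotide matrix needs more aligned sites to be estimated reliably, which is one reason the number of verified binding sites matters.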
We implemented the method of Gershenzon et al. [18] to analyze the known binding sites for GATA-3 and to identify novel GATA-3 TFBS. We selected this TF because of its important role in T-cell development [25] and in the differentiation of T-cells into effector subsets [27]. The factor is involved in three differentiation steps: specification, T cell receptor (TCRαβ)-dependent positive selection, and the activation of T helper cell (Th2) programs in mature T-cells. In addition, the method we adopted [18] originally dealt with the Sp1 factor, which has a broad positional distribution of binding motifs with a single peak around the transcription start site (TSS) in the interval (-499 to +100 bp). The GATA-3 occurrence distribution, however, has two peaks instead of one. We compared the distributions of several factors and found that they exhibit either one peak in the promoter area, like the ubiquitous factors Sp1 and E2F, or two peaks, like GATA-3, TCF1 and Ets-1, which are specific to the T-cell lineage. Hence GATA-3 is an attractive candidate for this study, also because it may be considered a typical representative of the variety of T-cell-specific TFs with a specific positional distribution.
Some of the binding sites discovered in the present study were previously confirmed to be important binding sites [28] for GATA-3, as mutating them causes complete loss of enhancer activity [30]. Nonetheless, they are not incorporated in existing databases of TFBSs and thus were neither part of the original PWM nor predicted by it. Identification of these sites by our optimized PWM provides experimental evidence for the superiority of our TFBS prediction approach over existing techniques.