1.  The Impact of Trans-Regulation on the Evolutionary Rates of Metazoan Proteins 
Nucleic Acids Research  2013;41(13):6371-6380.
Transcription factors (TFs) and microRNAs (miRNAs) are two crucial classes of trans-regulatory factors that coordinately control gene expression. Understanding the impact of these two factors on the rate of protein sequence evolution is of great importance in evolutionary biology. Although many biological factors associated with evolutionary rate variation have been studied, no evolutionary analysis has yet accounted for TF and miRNA regulation simultaneously across metazoans. Here, we provide a series of statistical analyses to assess the influences of TF and miRNA regulation on evolutionary rates across metazoans (human, mouse and fruit fly). Our results reveal that the negative correlations between trans-regulation and evolutionary rates hold well across metazoans, but the strength of TF regulation as a rate indicator becomes weak when other confounding factors that may affect evolutionary rates are controlled. We show that miRNA regulation tends to be a more essential indicator of evolutionary rates than TF regulation, and the combination of TF and miRNA regulations has a significant dependent effect on protein evolutionary rates. We also show that trans-regulation (especially miRNA regulation) is much more important in human/mouse than in fruit fly in determining protein evolutionary rates, suggesting a considerable variation in rate determinants between vertebrates and invertebrates.
PMCID: PMC3711421  PMID: 23658220
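The idea of testing a rate indicator while "controlling" a confounding factor can be illustrated with a first-order partial correlation. The sketch below is only an assumption about one common way to do this; the abstract does not state which statistic the study used, and the variable roles (x as number of regulators, y as evolutionary rate, z as a confounder such as expression level) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z.

    Hypothetical roles: x = number of regulators per gene,
    y = evolutionary rate, z = a confounding factor.
    Assumes |r_xz| < 1 and |r_yz| < 1 (otherwise the denominator is zero).
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

When the confounder is uncorrelated with both variables, the partial correlation equals the plain correlation; a raw correlation that shrinks toward zero after conditioning is the pattern the abstract describes for TF regulation.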
2.  Associations between intronic non-B DNA structures and exon skipping 
Nucleic Acids Research  2013;42(2):739-747.
Non-B DNA structures are abundant in the genome and are often associated with critical biological processes, including gene regulation, chromosome rearrangement and genome stabilization. In particular, G-quadruplex (G4) may affect alternative splicing based on its ability to impede the activity of RNA polymerase II. However, the specific role of non-B DNA structures in splicing regulation still awaits investigation. Here, we provide a genome-wide and cross-species investigation of the associations between five non-B DNA structures and exon skipping. Our results indicate a statistically significant correlation of each of the examined non-B DNA structures with exon skipping in both human and mouse. We further show that the contributions of non-B DNA structures to exon skipping are influenced by the region in which they occur. These correlations and contributions are also significantly different in human and mouse. Finally, we detail the effects of G4 by showing that its occurrence on the template strand and the length of the G-run, which is closely related to the stability of a G4 structure, are significantly correlated with exon skipping activity. We thus show that, in addition to the well-known effects of RNA and protein structure, the relative positional arrangement of intronic non-B DNA structures may also impact exon skipping.
PMCID: PMC3902930  PMID: 24153112
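To make the notions of G-run length and strand concrete, the following minimal sketch locates putative G4 motifs in a DNA string. It assumes the widely used canonical pattern (four runs of at least three G's separated by loops of 1 to 7 nucleotides); the abstract does not say which motif definition or tool the study used, and the function names are hypothetical. Which strand counts as the template strand depends on gene orientation, which is not modeled here.

```python
import re

# Assumed canonical G4 motif: four G-runs (>= 3 G's) joined by 1-7 nt loops.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}?G{3,}){3}")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_g4(seq):
    """Return (start, end, strand, longest_g_run) for each putative G4.

    Coordinates are 0-based, end-exclusive, and always refer to the
    given sequence; '-' means the G-runs lie on the opposite strand.
    """
    hits = []
    n = len(seq)
    for match in G4_PATTERN.finditer(seq):
        longest = max(len(run) for run in re.findall(r"G+", match.group()))
        hits.append((match.start(), match.end(), "+", longest))
    for match in G4_PATTERN.finditer(reverse_complement(seq)):
        longest = max(len(run) for run in re.findall(r"G+", match.group()))
        # Convert reverse-complement coordinates back to the given strand.
        hits.append((n - match.end(), n - match.start(), "-", longest))
    return hits
```

For example, `find_g4("TTGGGAGGGTGGGCGGGTT")` reports a single hit on the `+` strand whose longest G-run is 3.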
3.  Evolution of cis-regulatory elements in yeast de novo and duplicated new genes 
BMC Genomics  2012;13:717.
New genes that originate from non-coding DNA rather than being duplicated from parent genes are called de novo genes. Their short evolution time and lack of parent genes provide a chance to study the evolution of cis-regulatory elements in the initial stage of gene emergence. Although a few reports have discussed cis-regulatory elements in new genes, knowledge of the characteristics of these elements in de novo genes is lacking. Here, we conducted a comprehensive investigation to depict the emergence and establishment of cis-regulatory elements in de novo yeast genes.
In a genome-wide investigation, we found that the number of transcription factor binding sites (TFBSs) in de novo genes of S. cerevisiae increased rapidly and quickly became comparable to the number of TFBSs in established genes. This phenomenon might have resulted from certain characteristics of de novo genes; namely, a relatively frequent gain of TFBSs, an unexpectedly high number of preexisting TFBSs, or lower selection pressure in the promoter regions of the de novo genes. Furthermore, we identified differences in the promoter architecture between de novo genes and duplicated new genes, suggesting that distinct regulatory strategies might be employed by genes of different origin. Finally, our functional analyses of the yeast de novo genes revealed that they might be related to reproduction.
Our observations showed that de novo genes and duplicated new genes possess mutually distinct regulatory characteristics, implying that these two types of genes might have different roles in evolution.
PMCID: PMC3553024  PMID: 23256513
De novo gene; Regulatory evolution; TFBS turnover; Promoter architecture
4.  Evidence of association between Nucleosome Occupancy and the Evolution of Transcription Factor Binding Sites in Yeast 
Divergence of transcription factor binding sites is considered to be an important source of regulatory evolution. The associations between transcription factor binding sites and phenotypic diversity have been investigated in many model organisms. However, the understanding of other factors that contribute to this divergence is still limited. Recent studies have elucidated the effect of chromatin structure on the molecular evolution of genomic DNA. Although the profound impact of nucleosome positions on gene regulation has been reported, their influence on transcriptional evolution remains less explored. With the availability of genome-wide nucleosome maps in yeast species, it is thus desirable to investigate their impact on transcription factor binding site evolution. Here, we present a comprehensive analysis of the role of nucleosome positioning in the evolution of transcription factor binding sites.
We compared the transcription factor binding site frequency in nucleosome occupied regions and nucleosome depleted regions in promoters of old (orthologs among Saccharomycetaceae) and young (Saccharomyces specific) genes, and in duplicate gene pairs. We demonstrated that nucleosome occupied regions accommodate greater binding site variations than nucleosome depleted regions in young genes and in duplicate genes. This finding was confirmed by measuring the difference in evolutionary rates of binding sites in sensu stricto yeasts at nucleosome occupied regions and nucleosome depleted regions. The binding sites at nucleosome occupied regions exhibited a consistently higher evolution rate than those at nucleosome depleted regions, corroborating the difference in the selection constraints at the two regions. Finally, through site-directed mutagenesis experiments, we found that binding site gain or loss events at nucleosome depleted regions may cause more expression differences than those in nucleosome occupied regions.
Our study indicates that binding sites at nucleosome occupied regions are under different selection constraints from those at nucleosome depleted regions. We found that the binding sites evolve at different rates in nucleosome occupied and depleted regions. Finally, using site-directed mutagenesis of transcription factor binding sites, we confirmed the difference in the impact of binding site changes on expression at these regions. Thus, our work demonstrates the importance of composite analysis of chromatin and transcriptional evolution.
PMCID: PMC3124427  PMID: 21627806
5.  Reanalyze unassigned reads in Sanger based metagenomic data using conserved gene adjacency 
BMC Bioinformatics  2010;11:565.
Investigation of metagenomes provides greater insight into uncultured microbial communities. The improvement in sequencing technology, which yields a large amount of sequence data, has led to major breakthroughs in the field. However, at present, taxonomic binning tools for metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST cut-offs. In an attempt to provide a comprehensive overview of metagenomic data, we re-analyzed the discarded metagenomes by using less stringent cut-offs. Additionally, we introduced a new criterion, namely, the evolutionary conservation of adjacency between neighboring genes. To evaluate the feasibility of our approach, we re-analyzed discarded contigs and singletons from several environments with different levels of complexity. We also compared the consistency between our taxonomic binning and those reported in the original studies.
Among the discarded data, we found that 23.7 ± 3.9% of singletons and 14.1 ± 1.0% of contigs were assigned to taxa. The recovery rates for singletons were higher than those for contigs. The Pearson correlation coefficient revealed a high degree of similarity (0.94 ± 0.03 at the phylum rank and 0.80 ± 0.11 at the family rank) between the proposed taxonomic binning approach and those reported in original studies. In addition, an evaluation using simulated data demonstrated the reliability of the proposed approach.
Our findings suggest that taking account of conserved neighboring gene adjacency improves taxonomic assignment when analyzing metagenomes using Sanger sequencing. In other words, utilizing the conserved gene order as a criterion will reduce the amount of data discarded when analyzing metagenomes.
PMCID: PMC3098102  PMID: 21083935
6.  Roles of Trans and Cis Variation in Yeast Intraspecies Evolution of Gene Expression 
Molecular Biology and Evolution  2009;26(11):2533-2538.
Both cis and trans mutations contribute to gene expression divergence within and between species. We used Saccharomyces cerevisiae as a model organism to estimate the relative contributions of cis and trans variations to the expression divergence between a laboratory (BY) and a wild (RM) strain of yeast. We examined whether genes regulated by a single transcription factor (TF; single input module, SIM genes) or genes regulated by multiple TFs (multiple input module, MIM genes) are more susceptible to trans variation. Because a SIM gene is regulated by a single immediate upstream TF, the chance for a change to occur in its trans-acting factors would, on average, be smaller than that for a MIM gene. We chose 232 genes that exhibited expression divergence between BY and RM to test this hypothesis. We examined the expression patterns of these genes in a BY–RM coculture system and in a BY–RM diploid hybrid. We found that trans variation is far more important than cis variation for expression divergence between the two strains. However, because in 75% of the genes studied, cis variation has significantly contributed to expression divergence, cis change also plays a significant role in intraspecific expression evolution. Interestingly, we found that the proportion of genes with diverged expression between BY and RM is larger for MIM genes than for SIM genes; in fact, the proportion tends to increase with the number of transcription factors that regulate the gene. Moreover, MIM genes are, on average, subject to stronger trans effects than SIM genes, though the difference between the two types of genes is not conspicuous.
PMCID: PMC2767097  PMID: 19648464
cis-regulation; trans-regulation; yeast; expression evolution
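The logic of separating cis from trans effects with a diploid hybrid can be sketched as follows. This is an illustration of a commonly used allele-specific decomposition, not necessarily the procedure of the study (which also used a BY–RM coculture system): in a hybrid both alleles share the same trans environment, so the allelic ratio reflects cis effects, and whatever remains of the parental difference is attributed to trans. The function and parameter names are hypothetical.

```python
import math


def cis_trans_components(parent_by, parent_rm, hybrid_by_allele, hybrid_rm_allele):
    """Split a log2 expression difference into cis and trans parts.

    Assumption (not taken from the abstract): expression values are
    positive and already normalized, and effects are additive on the
    log2 scale.
    """
    total = math.log2(parent_by / parent_rm)                # divergence between strains
    cis = math.log2(hybrid_by_allele / hybrid_rm_allele)    # shared trans environment
    trans = total - cis                                     # remainder attributed to trans
    return cis, trans


# Toy values: a 4-fold parental difference, 2-fold of it allele-specific.
cis, trans = cis_trans_components(8.0, 2.0, 4.0, 2.0)
```

With these toy numbers the 4-fold divergence splits evenly, one log2 unit each for cis and trans.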
7.  Impact of DNA-binding position variants on yeast gene expression 
Nucleic Acids Research  2009;37(21):6991-7001.
Transcription factors (TFs) regulate gene expression by binding to specific binding sites (TFBSs) in gene promoters. TFBS motifs may contain one or more variable positions. Although the prevailing assumption is that nucleotide variants at such positions are functionally equivalent, there is increasing evidence that such variants play a role in regulation of gene expression. In this article, we propose a method for studying the relationship between the expression of target genes and nucleotide variants in TFBS motifs at a genome-wide scale in Saccharomyces cerevisiae, especially the combinatorial effects of variants at two positions. Our analysis shows that nucleotide variations in more than one-third of variable positions and in 20% of dependent position pairs are highly correlated to gene expression. We define such positions as ‘functional’. However, some positions are only functional as dependent pairs, but not individually. In addition, a significant proportion of the functional positions have been well conserved across all yeast-related species studied. We also find that some positions require the presence of co-occurring TFs, while others maintain their functionality in the absence of a co-occurring TF. Our analysis supports the importance of nucleotide variants at variable positions of TFBSs in gene regulation.
PMCID: PMC2790881  PMID: 19767613
8.  Co-Expression of Neighboring Genes in the Zebrafish (Danio rerio) Genome 
Neighboring genes in the eukaryotic genome have a tendency to express concurrently, and the proximity of two adjacent genes is often considered a possible explanation for their co-expression behavior. However, the actual contribution of the physical distance between two genes to their co-expression behavior has yet to be defined. To further investigate this issue, we studied the co-expression of neighboring genes in zebrafish, which has a compact genome and has experienced a whole genome duplication event. Our analysis shows that the proportion of highly co-expressed neighboring pairs (Pearson’s correlation coefficient R > 0.7) is low (0.24% to 0.67%); however, it is still significantly higher than that of random pairs. In particular, the statistical result implies that the co-expression tendency of neighboring pairs is negatively correlated with their physical distance. Our findings therefore suggest that physical distance may play an important role in the co-expression of neighboring genes. Possible mechanisms related to the neighboring genes’ co-expression are also discussed.
PMCID: PMC2812831  PMID: 20111688
gene expression; co-expression; neighboring genes; promoter; zebrafish
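The proportion of highly co-expressed neighboring pairs described above can be computed along the following lines. This is a minimal sketch under stated assumptions: genes are given in chromosomal order, each gene has an expression profile over the same conditions, and the R > 0.7 cut-off from the abstract is applied; the function names and input layout are hypothetical and do not come from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpressed_neighbor_fraction(ordered_genes, expression, threshold=0.7):
    """Fraction of adjacent gene pairs whose profiles correlate above threshold.

    ordered_genes: gene IDs in chromosomal order (one chromosome).
    expression: dict mapping gene ID -> list of expression values.
    """
    pairs = list(zip(ordered_genes, ordered_genes[1:]))
    high = [
        (g1, g2)
        for g1, g2 in pairs
        if pearson(expression[g1], expression[g2]) > threshold
    ]
    return len(high) / len(pairs), high
```

Comparing this fraction with the one obtained from randomly drawn gene pairs gives the kind of neighbor-versus-random contrast the abstract reports.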
9.  Evolutionary conservation of DNA-contact residues in DNA-binding domains 
BMC Bioinformatics  2008;9(Suppl 6):S3.
DNA-binding proteins are of utmost importance to gene regulation. The identification of DNA-binding domains is useful for understanding the regulation mechanisms of DNA-binding proteins. In this study, we proposed a method to determine whether a domain or a protein has DNA-binding capability by considering the evolutionary conservation of DNA-binding residues.
Our method achieves high precision and recall for 66 families of DNA-binding domains, with a false positive rate less than 5% for 250 non-DNA-binding proteins. In addition, experimental results show that our method is able to identify the different DNA-binding behaviors of proteins in the same SCOP family based on the use of evolutionary conservation of DNA-contact residues.
This study shows the conservation of DNA-contact residues in DNA-binding domains. We conclude that the members in the same subfamily bind DNA specifically and the members in different subfamilies often recognize different DNA targets. Additionally, we observe the co-evolution of DNA-contact residues and interacting DNA base-pairs.
PMCID: PMC2423444  PMID: 18541056
10.  Co-expression of adjacent genes in yeast cannot be simply attributed to shared regulatory system 
BMC Genomics  2007;8:352.
Adjacent gene pairs in the yeast genome have a tendency to express concurrently. Sharing of regulatory elements within the intergenic region of those adjacent gene pairs was often considered the major mechanism responsible for such co-expression. However, it is still debated to what extent common transcription factors (TFs) contribute to the co-expression of adjacent genes. In order to resolve the evolutionary aspect of this issue, we investigated the conservation of adjacent pairs in five yeast species. By using the information for TF binding sites in promoter regions available from the MYBS database, the ratios of TF-sharing pairs among all the adjacent pairs in yeast genomes were analyzed. The levels of co-expression in different adjacent patterns were also compared.
Our analyses showed that the proportion of adjacent pairs conserved in five yeast species is relatively low compared to that in the mammalian lineage. The proportion was also low for adjacent gene pairs with shared TFs. Particularly, the statistical analysis suggested that co-expression of adjacent gene pairs was not noticeably associated with the sharing of TFs in these pairs. We further proposed a case of the PAC (polymerase A and C) and RRPE (rRNA processing element) motifs which co-regulate divergent/bidirectional pairs, and found that the shared TFs were not significantly relevant to co-expression of divergent promoters among adjacent genes.
Our findings suggested that the commonly shared cis-regulatory system does not solely contribute to the co-expression of adjacent gene pairs in the yeast genome. Therefore, we believe that during evolution yeasts have developed a sophisticated regulatory system that integrates both TF-based and non-TF-based mechanism(s) for concurrent regulation of neighboring genes in response to various environmental changes.
PMCID: PMC2045684  PMID: 17910772
11.  MYBS: a comprehensive web server for mining transcription factor binding sites in yeast 
Nucleic Acids Research  2007;35(Web Server issue):W221-W226.
Correct interactions between transcription factors (TFs) and their binding sites (TFBSs) are of central importance to gene regulation. Recently developed chromatin-immunoprecipitation DNA chip (ChIP-chip) techniques and the phylogenetic footprinting method provide ways to identify TFBSs with high precision. In this study, we constructed a user-friendly interactive platform for dynamic binding site mapping using ChIP-chip data and phylogenetic footprinting as two filters. MYBS (Mining Yeast Binding Sites) is a comprehensive web server that integrates an array of both experimentally verified and predicted position weight matrixes (PWMs) from eleven databases, including 481 binding motif consensus sequences and 71 PWMs that correspond to 183 TFs. MYBS users can search within this platform for motif occurrences (possible binding sites) in the promoters of genes of interest via simple motif or gene queries in conjunction with the above two filters. In addition, MYBS enables users to visualize in parallel the potential regulators for a given set of genes, a feature useful for finding potential regulatory associations between TFs. MYBS also allows users to identify target gene sets of each TF pair, which could be used as a starting point for further explorations of TF combinatorial regulation. MYBS is available at
PMCID: PMC1933147  PMID: 17537814
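Searching a promoter for motif occurrences with a position weight matrix can be sketched as below. This is a generic log-odds scan written for illustration, not the MYBS implementation; the toy matrix, the uniform background of 0.25, and the score threshold are all assumptions, and the ChIP-chip and phylogenetic footprinting filters described in the abstract are not modeled.

```python
import math

# Toy PWM (hypothetical): per-position base probabilities favoring "TGA".
# All probabilities must be > 0 so that the log-odds score is defined.
TOY_PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]


def scan_promoter(seq, pwm, background=0.25, threshold=0.0):
    """Return (position, window, score) for windows scoring >= threshold.

    The score is the sum of log2(probability / background) over the
    motif positions; seq is assumed to contain only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(
            math.log2(pwm[j][base] / background)
            for j, base in enumerate(window)
        )
        if score >= threshold:
            hits.append((i, window, score))
    return hits


hits = scan_promoter("ACTGACTTGA", TOY_PWM, threshold=4.0)
```

With this toy matrix and threshold only the two exact "TGA" windows pass; lowering the threshold admits windows with one mismatch.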

