The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to workshop participants. The simulating model included both common and rare variants with minor allele frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding change. For each of the two samples (unrelated individuals and families), 200 replicates of the phenotypes were simulated.
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
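The collapsing step described in this abstract (summing the rare SNPs in a gene below an allele frequency threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0/1/2 minor-allele coding, and the toy threshold are assumptions made here.

```python
def minor_allele_frequency(genotypes):
    """Frequency of the minor allele for one SNP.

    genotypes: minor-allele counts (0, 1, or 2), one per individual
    (assumed coding, not specified in the abstract).
    """
    return sum(genotypes) / (2 * len(genotypes))


def collapse_rare(genotype_matrix, maf_threshold=0.05):
    """Sum minor-allele counts over the rare SNPs of one gene.

    genotype_matrix: rows are individuals, columns are SNPs in the gene.
    Returns one burden score per individual, usable as a single
    predictor in a regression on a quantitative trait.
    """
    n_snps = len(genotype_matrix[0])
    rare = [
        j for j in range(n_snps)
        if minor_allele_frequency([row[j] for row in genotype_matrix]) < maf_threshold
    ]
    return [sum(row[j] for j in rare) for row in genotype_matrix]


# Toy gene with three SNPs; a large threshold is used only because
# the toy sample has four individuals.
toy = [[0, 1, 1], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
burden = collapse_rare(toy, maf_threshold=0.2)
```

Changing `maf_threshold` changes which SNPs enter the sum, which is why the choice of threshold affects the result of the collapsing method.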
Identifying rare variants that are responsible for complex disease has been promoted by advances in sequencing technologies. However, statistical methods that can handle the vast amount of data generated and that can interpret the complicated relationship between disease and these variants have lagged. We apply a zero-inflated Poisson regression model to take into account the excess of zeros caused by the extremely low frequency of the 24,487 exonic variants in the Genetic Analysis Workshop 17 data. We grouped the 697 subjects in the data set as Europeans, Asians, and Africans based on principal components analysis and found the total number of rare variants per gene for each individual. We then analyzed these collapsed variants based on the assumption that rare variants are enriched in a group of people affected by a disease compared to a group of unaffected people. We also tested the hypothesis with quantitative traits Q1, Q2, and Q4. Analyses performed on the combined 697 individuals and on each ethnic group yielded different results. For the combined population analysis, we found that UGT1A1, which was not part of the simulation model, was associated with disease liability and that FLT1, which was a causal locus in the simulation model, was associated with Q1. Of the causal loci in the simulation models, FLT1 and KDR were associated with Q1 and VNN1 was correlated with Q2. No significant genes were associated with Q4. These results show the feasibility and capability of our new statistical model to detect multiple rare variants influencing disease risk.
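For orientation, the zero-inflated Poisson distribution underlying this kind of model can be written down in a few lines. The sketch below covers only the intercept-only distribution for a per-gene rare-variant count; the regression on disease status or traits used in the study is not reproduced, and the parameter names `zero_prob` and `rate` are assumptions made here.

```python
import math


def zip_pmf(count, zero_prob, rate):
    """Probability of `count` under a zero-inflated Poisson.

    zero_prob: probability of a structural (excess) zero.
    rate: mean of the Poisson component.
    """
    poisson = math.exp(-rate) * rate ** count / math.factorial(count)
    if count == 0:
        # Zeros arise from the structural-zero part or from the Poisson part.
        return zero_prob + (1.0 - zero_prob) * poisson
    return (1.0 - zero_prob) * poisson


def zip_log_likelihood(counts, zero_prob, rate):
    """Log-likelihood of observed rare-variant counts."""
    return sum(math.log(zip_pmf(c, zero_prob, rate)) for c in counts)
```

With `zero_prob = 0` the model reduces to an ordinary Poisson; a positive `zero_prob` puts more mass on zero, which is the excess-of-zeros feature the abstract refers to.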
Rare causal variants are believed to significantly contribute to the genetic basis of common diseases or quantitative traits. Appropriate statistical methods are required to discover the highest possible number of disease-relevant variants in a genome-wide screening study. The publicly available Genetic Analysis Workshop 17 data set consists of 697 individuals and 24,487 genetic variants. It includes a simulated complex disease model with intermediate quantitative phenotypes. We compare four gene-wise scoring methods with respect to ranking of causal genes under variable allele frequency thresholds for collapsing of rare variants and considering whether or not rare variants were included. We also compare causal genes for which the ranks differ clearly between scoring methods regarding such characteristics as number and strength of causal variants. We corroborated our findings with additional simulations. We found that the maximum statistics method was superior in assigning high ranks to genes with a single strong causal variant. Hotelling’s T² test was superior for genes with several independent causal variants. This was consistent for all phenotypes and was confirmed by single-gene analyses and additional simulations. The multivariate analysis performed similarly to Hotelling’s T² test. The least absolute shrinkage and selection operator (LASSO) analysis was largely comparable with the maximum statistics method. We conclude that the maximum statistics method is a superior alternative to Hotelling’s T² test if one expects only one independent causal variant per gene with a dominating effect. Such a variant could also be a supermarker derived by collapsing rare variants. Because the true nature of the genetic effect is unknown for real data, both methods need to be taken into consideration.
Next-generation sequencing technologies now make it possible to genotype and measure hundreds of thousands of rare genetic variations in individuals across the genome. Characterization of high-density genetic variation facilitates control of population genetic structure on a finer scale before large-scale genotyping in disease genetics studies. Population structure is a well-known, prevalent, and important factor in common variant genetic studies, but its relevance in rare variants is unclear. We perform an extensive population structure analysis using common and rare functional variants from the Genetic Analysis Workshop 17 mini-exome sequence. The analysis based on common functional variants required 388 principal components to account for 90% of the variation in population structure. However, an analysis based on rare variants required 532 significant principal components to account for similar levels of variation. Using rare variants, we detected fine-scale substructure beyond the population structure identified using common functional variants. Our results show that the level of population structure embedded in rare variant data is different from the level embedded in common variant data and that correcting for population structure is only as good as the level one wishes to correct.
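The count of principal components needed to reach a given share of variation, as reported in this abstract, comes from the cumulative eigenvalue proportions. A minimal sketch, assuming the eigenvalues of the genotype covariance matrix have already been computed elsewhere (the function name and the toy values are illustrative only):

```python
def components_for_variance(eigenvalues, target=0.90):
    """Smallest number of leading components whose eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Toy spectrum: cumulative shares are 0.4, 0.8, 0.9, 1.0.
needed = components_for_variance([4.0, 4.0, 1.0, 1.0], target=0.85)
```

A flatter eigenvalue spectrum needs more components to reach the same target, which is one way to read the larger component count reported for rare variants.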
Next-generation sequencing has opened up new avenues for the genetic study of complex traits. However, because of the small number of observations for any given rare allele and high sequencing error, it is a challenge to identify functional rare variants associated with the phenotype of interest. Recent research shows that grouping variants by gene and incorporating computationally predicted functions of variants may provide higher statistical power. On the other hand, many algorithms are available for predicting the damaging effects of nonsynonymous variants. Here, we use the simulated mini-exome data of Genetic Analysis Workshop 17 to study and compare the effects of incorporating the functional predictions of single-nucleotide polymorphisms using two popular algorithms, SIFT and PolyPhen-2, into a gene-based association test. We also propose a simple mixture model that can effectively combine test results based on different functional prediction algorithms.
Aitkin recently proposed an integrated Bayesian/likelihood approach that he claims is general and simple. We have applied this method, which does not rely on informative prior probabilities or large-sample results, to investigate the evidence of association between disease and the 16 variants in the KDR gene provided by Genetic Analysis Workshop 17. Based on the likelihood of logistic regression models and considering noninformative uniform prior probabilities on the coefficients of the explanatory variables, we used a random walk Metropolis algorithm to simulate the distributions of deviance and deviance difference. The distribution of probability values and the distribution of the proportions of positive deviance differences showed different locations, but the direction of the shift depended on the genetic factor. For the variant with the highest minor allele frequency and for any rare variant, standard logistic regression showed a higher power than the novel approach. For the two variants with the strongest effects on Q1 under a type I error rate of 1%, the integrated approach showed a higher power than standard logistic regression. The advantages and limitations of the integrated Bayesian/likelihood approach should be investigated using additional regions and considering alternative regression models and collapsing methods.
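The sampling step described above (random walk Metropolis on logistic regression coefficients with flat priors, recording the deviance at each draw) can be sketched roughly as follows. This is a toy version under stated assumptions: a single predictor, a Gaussian proposal with a fixed step size, and hypothetical function names; it is not the authors' code.

```python
import math
import random


def log_likelihood(beta0, beta1, xs, ys):
    """Logistic regression log-likelihood for one predictor, y in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta0 + beta1 * x
        # Numerically stable log(1 + exp(eta)).
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y * eta - log1pexp
    return total


def random_walk_metropolis(xs, ys, n_iter=2000, step=0.5, seed=0):
    """Return the deviance (-2 log-likelihood) at each posterior draw."""
    rng = random.Random(seed)
    beta = [0.0, 0.0]
    current = log_likelihood(beta[0], beta[1], xs, ys)
    deviances = []
    for _ in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_likelihood(proposal[0], proposal[1], xs, ys)
        # Flat prior and symmetric proposal: accept with the likelihood ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        deviances.append(-2.0 * current)
    return deviances
```

Running the sampler once for a model with the variant and once for a model without it gives two sets of deviance draws, whose differences form the kind of deviance-difference distribution the abstract mentions.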
Currently there is a great deal of interest in developing methods for testing the role that rare variation plays in disease development. Here we propose a weighted association test that accumulates genetic variation across a signaling pathway. We evaluate our approach by analyzing simulated phenotype data from an exome sequencing study of 697 unrelated individuals from the Genetic Analysis Workshop 17 (GAW17) data set. Although our weighted approach identifies several interesting pathways associated with phenotype Q1, so does an alternative unweighted accumulation approach. Such a result is not unexpected because there is no systematic relationship between the allele frequency of a variant and its effect on phenotype in the GAW17 simulation model.
High-dimensional datasets with large amounts of redundant information are now available for hypothesis-free exploration of scientific questions. A particular case is genome-wide association analysis, where variations in the genome are searched for effects on disease or other traits. Bayesian variable selection has been demonstrated as an analysis approach that can account for the multifactorial nature of the genetic effects in a linear regression model.
Yet the computation presents a challenge, and application to large-scale data is not routine. Here, we study aspects of the computation using the Metropolis-Hastings algorithm for the variable selection: finite adaptation of the proposal distributions, multistep moves for changing the inclusion state of multiple variables in a single proposal, and multistep move size adaptation. We also experiment with a delayed rejection step for the multistep moves. Results on simulated and real data show an increase in sampling efficiency. We also demonstrate that, with application-specific proposals, the approach can overcome a specific mixing problem in real data with 3,822 individuals and 1,051,811 single-nucleotide polymorphisms and uncover a variant pair with a synergistic effect on the studied trait. Moreover, we illustrate multimodality in the real dataset related to a restrictive prior distribution on the genetic effect sizes and advocate a more flexible alternative.
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
1000 Genomes Project; association; collapsing methods; next-generation sequencing
Genome-wide association studies have been successful in identifying common variants for common complex traits in recent years. However, common variants have generally failed to explain substantial proportions of the trait heritabilities. Rare variants, structural variations, and gene-gene and gene-environment interactions, among others, have been suggested as potential sources of the so-called missing heritability. With the advent of exome-wide and whole-genome next-generation sequencing technologies, finding rare variants in functionally important sites (e.g., protein-coding regions) becomes feasible. We investigate the role of linkage information to select families enriched for rare variants using the simulated Genetic Analysis Workshop 17 data. In each replicate of simulated phenotypes Q1 and Q2 on 697 subjects in 8 extended pedigrees, we select the pedigree with the largest family-specific LOD score. Across all 200 replicates, we compare the probability that rare causal alleles will be carried in the selected pedigree versus a randomly chosen pedigree. One example of successful enrichment was exhibited for gene VEGFC. The causal variant had a minor allele frequency of 0.0717% in the simulated unrelated individuals and explained about 0.1% of the phenotypic variance. However, it explained 7.9% of the phenotypic variance in the eight simulated pedigrees and 23.8% in the family that carried the minor allele. The carrier’s family was selected in all 200 replicates. Our results therefore show that family-specific linkage information is useful for selecting families for sequencing, thereby ensuring that rare functional variants are segregating in the sequencing samples.
Next-generation sequencing allows for a new focus on rare variant density for conducting analyses of association to disease and for narrowing down the genomic regions that show evidence of functionality. In this study we use the 1000 Genomes Project pilot data as distributed by Genetic Analysis Workshop 17 to compare rare variant densities across seven populations. We made the comparisons using regressions of rare variants on total variant counts per gene for each population and Tajima’s D values calculated for each gene in each population, using data on 3,205 genes. We found that the populations clustered by continent for both the regression slopes and Tajima’s D values, with the African populations (Yoruba and Luhya) showing the highest density of rare variants, followed by the Asian populations (Han and Denver Chinese followed by the Japanese) and the European populations (CEPH [European-descent] and Tuscan) with the lowest densities. These significant differences in rare variant densities across populations seem to translate to measures of the rare variant density more commonly used in rare variant association analyses, suggesting the need to adjust for ancestry in such analyses. The selection signal was high for AHNAK, HLA-A, RANBP2, and RGPD4, among others. RANBP2 and RGPD4 showed a marked difference in rare variant density and potential selection between the Luhya and the other populations. This may suggest that differences between populations should be considered when delimiting genomic regions according to functionality and that these differences can create potential for disease heterogeneity.
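Tajima's D for one gene in one population, as used in this abstract, can be computed from the minor-allele counts at the gene's variant sites. The sketch below uses the standard definition of the statistic; the input layout (one minor-allele count per site and a common number of sampled chromosomes) and the function name are assumptions made for illustration.

```python
import math


def tajimas_d(minor_counts, n_chromosomes):
    """Tajima's D from per-site minor-allele counts.

    minor_counts: number of copies of the minor allele at each site.
    n_chromosomes: number of sampled chromosomes (assumed equal at all sites).
    """
    n = n_chromosomes
    segregating = [k for k in minor_counts if 0 < k < n]
    s = len(segregating)
    if n < 4 or s == 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    # Mean number of pairwise differences, summed over sites.
    pi_hat = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in segregating)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi_hat - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants pulls the pairwise-difference estimate below the estimate based on the number of segregating sites and gives a negative D, so the statistic doubles as a per-gene summary of rare variant density.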
The allelic architecture of complex traits is likely to be underpinned by a combination of multiple common frequency and rare variants. Targeted genotyping arrays and next-generation sequencing technologies at the whole-genome sequencing (WGS) and whole-exome scales (WES) are increasingly employed to access sequence variation across the full minor allele frequency (MAF) spectrum. Different study design strategies that make use of diverse technologies, imputation and sample selection approaches are an active target of development and evaluation efforts. Initial insights into the contribution of rare variants in common diseases and medically relevant quantitative traits point to low-frequency and rare alleles acting either independently or in aggregate and in several cases alongside common variants. Studies conducted in population isolates have been successful in detecting rare variant associations with complex phenotypes. Statistical methodologies that enable the joint analysis of rare variants across regions of the genome continue to evolve with current efforts focusing on incorporating information such as functional annotation, and on the meta-analysis of these burden tests. In addition, population stratification, defining genome-wide statistical significance thresholds and the design of appropriate replication experiments constitute important considerations for the powerful analysis and interpretation of rare variant association studies. Progress in addressing these emerging challenges and the accrual of sufficiently large data sets are poised to help the field of complex trait genetics enter a promising era of discovery.
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Large-scale, deep resequencing may be the next logical step in the genetic investigation of common complex diseases. Because each individual is likely to carry many thousands of variants, the identification of causal alleles requires an efficient strategy to reduce the number of candidate variants. Under many genetic models, causal alleles can be expected to reside within identity-by-descent (IBD) regions shared by affected relatives. In distant relatives, IBD regions constitute a small portion of the genome and can thus greatly reduce the search space for causal alleles. However, the effectiveness of this strategy is unknown. We test the simulated mini-exome data set in extended pedigrees provided by Genetic Analysis Workshop 17. At the fourth- and fifth-degree level of relatedness, case-case pairs shared between 1% and 9% of the genome identical by descent. As expected, no genes were shared identical by descent by all case subjects, but 43 genes were shared by many case subjects across at least 50 replicates. We filtered variants in these genes based on population frequency, function, informativeness, and evidence of association using the family-based association test. This analysis highlighted five genes previously implicated in triglyceride, lipid, and cholesterol metabolism. Comparison with the list of true risk alleles revealed that strict IBD filtering followed by association testing of the rarest alleles was the most sensitive strategy. IBD filtering may be a useful strategy for narrowing down the list of candidate variants in exome data, but the optimal degree of relatedness of affected pairs will depend on the genetic architecture of the disease under study.
Family-based study designs are again becoming popular as new next-generation sequencing technologies make whole-exome and whole-genome sequencing projects economically and temporally feasible. Here we evaluate the statistical properties of linkage analyses and family-based tests of association for the Genetic Analysis Workshop 17 mini-exome sequence data. Based on our results, the linkage methods using relative pairs or nuclear families had low power, with the best results coming from variance components linkage analysis in nuclear families and Elston-Stewart model-based linkage analysis in extended pedigrees. For family-based tests of association, both ASSOC and ROMP performed well for genes with large effects, but ROMP had the advantage of not requiring parental genotypes in the analysis. For the linkage analyses we conclude that genome-wide significance levels appear to control type I error well but that “suggestive” significance levels do not. Methods that make use of the extended pedigrees are well powered to detect major loci segregating in the families even when there is substantial genetic heterogeneity and the trait is mainly polygenic. However, large numbers of such pedigrees will be necessary to detect all major loci. The family-based tests of association found the same major loci as the linkage analyses and detected low-frequency loci with moderate effect sizes, but control of type I error was not as stringent.
Novel technologies allow sequencing of whole genomes and are considered as an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
Genotyping of rare variants on a large scale is now possible using next-generation sequencing. Sample selection is a crucial step in designing the genetic study of a complex disease, and knowledge of the efficiency and limitations of population-based and family-based designs can help in making the appropriate choice.
The nine contributions to Group 5 of Genetic Analysis Workshop 17 evaluated the population-based and family-based designs by comparing the results obtained with various methods applied to the mini-exome simulations. These simulations consisted of 200 replicates comprising unrelated individuals and 8 extended pedigrees with genotypes and various phenotypes. The methods tested for association with a population-based and/or a family-based design, tested for linkage with a family-based design, or estimated heritability.
In this paper, we summarize the strengths and weaknesses of both designs. While a population-based design seems more suitable for detecting the effect of multiple rare variants, a family-based design can potentially enrich the sample in very rare variants, for which the effect would be concealed at the population level. However, as of today, the main limitation is still the high cost of next-generation sequencing.
Genetic Analysis Workshop; 1000 Genomes Project; next-generation sequencing; linkage; association; aggregation; familial relatedness; population stratification; heritability
Genome-wide association studies are a powerful approach used to identify common variants for complex disease. However, the traditional genome-wide association methods may not be optimal when they are applied to rare variants because of the rare variants’ low frequencies and weak signals. To alleviate the difficulty, investigators have proposed many methods that collapse rare variants. In this paper, we propose a novel ranking method, which we call stability selection based on random collapsing, to rank the candidate rare variants. We use the simulated mini-exome data sets of unrelated individuals from Genetic Analysis Workshop 17 for the analysis. The numerical results suggest that the selection based on a random collapsing method is promising for identifying functional rare variants in genome-wide association studies. Further research to examine the error control property of the proposed method is underway.
Next-generation sequencing technology provides new opportunities and challenges in the search for genetic variants that underlie complex traits. It will also presumably uncover many new rare variants, but exactly how these variants should be incorporated into the data analysis remains a question. Several papers in our group from Genetic Analysis Workshop 17 evaluated different methods of rare variant analysis, including single-variant, gene-based, and pathway-based analyses and analyses that incorporated biological information. Although the performance of some of these methods strongly depends on the underlying disease model, integration of known biological information is helpful in detecting causal genes. Two work groups demonstrated that use of a Bayesian network and a collapsing receiver operating characteristic curve approach improves risk prediction when a disease is caused by many rare variants. Another work group suggested that modeling local rather than global ancestry may be beneficial when controlling the effect of population structure in rare variant association analysis.
rare variant; association analysis; risk prediction model; population structure; biological information; receiver operating characteristic; Bayesian network
The selective genotyping approach in quantitative genetics means genotyping only individuals with extreme phenotypes. This approach is considered an efficient way to perform gene mapping, and can be applied in both linkage and association studies. Selective genotyping in association mapping of quantitative trait loci was proposed to increase the power of detecting rare alleles of large effect. However, using this approach, only common variants have been detected. Studies on selective genotyping have been limited to single-locus scenarios. In this study we aim to investigate the power of selective genotyping in a genome-wide association study scenario, and we specifically study the impact of minor allele frequency of variants on the power of this approach. We use the Genetic Analysis Workshop 16 rheumatoid arthritis whole-genome data from the North American Rheumatoid Arthritis Consortium. Two quantitative traits, anti-cyclic citrullinated peptide and rheumatoid factor immunoglobulin M, and one binary trait, rheumatoid arthritis affection status, are used in the analysis. The power of selective genotyping is explored as a function of three parameters: sampling proportion, minor allele frequency of single-nucleotide polymorphism, and test level. The results show that the selective genotyping approach is more efficient in detecting common variants than detecting rare variants, and it is efficient only when the level of declaring significance is not stringent. In summary, the selective genotyping approach is most suitable for detecting common variants in candidate gene-based studies.
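The sampling step of selective genotyping (keeping only individuals with extreme phenotypes) can be sketched as below. The abstract does not say whether the sampling proportion is per tail or in total; the sketch assumes it is the total fraction, split evenly between the two tails, and the function name is hypothetical.

```python
def select_extremes(phenotypes, proportion):
    """Indices of individuals in the lower and upper phenotype tails.

    proportion: total fraction of the sample to genotype, split evenly
    between the two tails (an assumption of this sketch).
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    n = len(phenotypes)
    per_tail = min(max(1, round(n * proportion / 2)), n // 2)
    ranked = sorted(range(n), key=lambda i: phenotypes[i])
    return ranked[:per_tail], ranked[n - per_tail:]


low, high = select_extremes([5, 1, 9, 3, 7, 2, 8, 4, 6, 0], 0.4)
```

Only the individuals indexed by `low` and `high` would then be genotyped and carried into the association test.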
Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of sequence data with rare variants in a cost-effective way. Statistical methods that test variants individually are underpowered to detect rare variants, so it is desirable to perform association analysis of rare variants by combining the information from all variants. In this study, we use a Bayesian regression method to model all variants simultaneously to identify rare variants in a data set from Genetic Analysis Workshop 17. We studied the association between the quantitative risk traits Q1, Q2, and Q4 and the single-nucleotide polymorphisms and identified several positive single-nucleotide polymorphisms for traits Q1 and Q2. However, the model also generated several apparent false positives and missed many true positives, suggesting that there is room for improvement in this model.
Genetic Analysis Workshop 17 provided simulated phenotypes and exome sequence data for 697 independent individuals (209 case subjects and 488 control subjects). The disease liability in these data was influenced by multiple quantitative traits. We addressed the lack of statistical power in this small data set by limiting the genomic variants included in the study to those with potential disease-causing effect, thereby reducing the problem of multiple testing. After this adjustment, we could readily detect two common variants that were strongly associated with the quantitative trait Q1 (C13S523 and C13S522). However, we found no significant associations with the affected status or with any of the other quantitative traits, and the relationship between disease status and genomic variants remained obscure. To address the challenge of the multivariate phenotype, we used propensity scores to combine covariates with genetic risk factors into a single risk factor and created a new phenotype variable, the probability of being affected given the covariates. Using the propensity score as a quantitative trait in the case-control analysis, we again could identify the two common single-nucleotide polymorphisms (C13S523 and C13S522). In addition, this analysis captured the correlation between Q1 and the affected status and reduced the problem of multiple testing. Although the propensity score was useful for capturing and clarifying the genetic contributions of common variants to the disease phenotype and the mediating role of the quantitative trait Q1, the analysis did not increase power to detect rare variants.
Using the exome sequencing data from 697 unrelated individuals and their simulated disease phenotypes from Genetic Analysis Workshop 17, we develop and apply a gene-based method to identify the relationship between a gene with multiple rare genetic variants and a phenotype. The method is based on the Mantel test, which assesses the correlation between two distance matrices using a permutation procedure. Using up to 100,000 permutations to estimate the statistical significance in 200 replicate data sets, we found that the method had a 5.1% type I error rate at an α level of 0.05 and had varying power to detect genes with simulated genetic associations. FLT1 and KDR had the most significant correlations with Q1 and were replicated 170 and 24 times, respectively, in 200 simulated data sets using a Bonferroni-corrected p-value of 0.05 as a threshold. These results suggest that the distance correlation method can be used to identify genotype-phenotype association when multiple rare genetic variants in a gene are involved.
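A Mantel test of the kind this abstract builds on can be sketched in plain Python. The version below correlates the upper triangles of two distance matrices and permutes the individuals of the second matrix to obtain a one-sided p-value. How the genotype and phenotype distances were defined in the study is not stated here, so the matrices are taken as given, and the function names are illustrative.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabeling individuals by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    fixed = upper_triangle(dist_a, identity)
    observed = pearson(fixed, upper_triangle(dist_b, identity))
    at_least = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of dist_b together
        if pearson(fixed, upper_triangle(dist_b, perm)) >= observed:
            at_least += 1
    # Counting the observed arrangement keeps the p-value above zero.
    return observed, (at_least + 1) / (n_perm + 1)
```

Permuting whole individuals rather than single matrix entries preserves the dependence among distances that share an individual, which is why the test relies on permutation instead of a standard correlation test.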
Unraveling how regulatory divergence contributes to species differences and adaptation requires identifying functional variants from among millions of genetic differences. Analysis of allelic imbalance (AI) reveals functional genetic differences in cis regulation and has demonstrated differences in cis regulation within and between species. Regulatory mechanisms are often highly conserved, yet differences between species in gene expression are extensive. What evolutionary forces explain widespread divergence in cis regulation? AI was assessed in Drosophila melanogaster–Drosophila simulans hybrid female heads using RNA-seq technology. Mapping bias was virtually eliminated by using genotype-specific references. Allele representation in DNA sequencing was used as a prior in a novel Bayesian model for the estimation of AI in RNA. Cis regulatory divergence was common in the organs and tissues of the head with 41% of genes analyzed showing significant AI. Using existing population genomic data, the relationship between AI and patterns of sequence evolution was examined. Evidence of positive selection was found in 30% of cis regulatory divergent genes. Genes involved in defense, RNAi/RISC complex genes, and those that are sex regulated are enriched among adaptively evolving cis regulatory divergent genes. For genes in these groups, adaptive evolution may play a role in regulatory divergence between species. However, there is no evidence that adaptive evolution drives most of the cis regulatory divergence that is observed. The majority of genes showed patterns consistent with stabilizing selection and neutral evolutionary processes.
Cis regulatory divergence; Drosophila melanogaster; Drosophila simulans; allele-specific expression; adaptive evolution