A subset (~3–5%) of patients with cystic fibrosis (CF) develops severe liver disease (CFLD) with portal hypertension.
To assess whether any of 9 polymorphisms in 5 candidate genes (SERPINA1, ACE, GSTP1, MBL2, and TGFB1) are associated with severe liver disease in CF patients.
Design, Setting, and Participants
A 2-stage design was used in this case–control study. CFLD subjects were enrolled from 63 U.S., 32 Canadian, and 18 CF centers outside of North America, with the University of North Carolina at Chapel Hill (UNC) as the coordinating site. In the initial study, we studied 124 CFLD patients (enrolled 1/1999–12/2004) and 843 CF controls (patients without CFLD) by genotyping 9 polymorphisms in 5 genes previously implicated as modifiers of liver disease in CF. In the second stage, the SERPINA1 Z allele and TGFB1 codon 10 genotype were tested in an additional 136 CFLD patients (enrolled 1/2005–2/2007) and 1088 CF controls.
Main Outcome Measures
We compared differences in distribution of genotypes in CF patients with severe liver disease versus CF patients without CFLD.
The initial study showed CFLD to be associated with the SERPINA1 (also known as α1-antiprotease and α1-antitrypsin) Z allele (P value=3.3×10−6; odds ratio (OR) 4.72, 95% confidence interval (CI) 2.31–9.61), and with transforming growth factor β-1 (TGFB1) codon 10 CC genotype (P=2.8×10−3; OR 1.53, CI 1.16–2.03). In the replication study, CFLD was associated with the SERPINA1 Z allele (P=1.4×10−3; OR 3.42, CI 1.54–7.59), but not with TGFB1 codon 10. A combined analysis of the initial and replication studies by logistic regression showed CFLD to be associated with SERPINA1 Z allele (P=1.5×10−8; OR 5.04, CI 2.88–8.83).
The SERPINA1 Z allele is a risk factor for liver disease in CF. Patients who carry the Z allele have greater odds (OR ~5) of developing severe liver disease with portal hypertension.
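The odds ratios and confidence intervals reported above can be illustrated with a minimal sketch for a single 2x2 table. This is not the authors' analysis (the combined result came from logistic regression, and the abstract does not say how the per-stage intervals were computed); the Wald interval on the log scale and the counts below are assumptions chosen only for illustration.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: cases carrying the allele      b: cases not carrying it
    c: controls carrying the allele   d: controls not carrying it
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR), Woolf's formula.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts, NOT taken from the study.
or_hat, lower, upper = odds_ratio_wald_ci(a=10, b=90, c=20, d=880)
print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1 corresponds to a nominally significant association at the matching level.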
Exome sequencing has become a powerful and effective strategy for discovery of genes underlying Mendelian disorders1. However, use of exome sequencing to identify variants associated with complex traits has been more challenging, partly because the sample sizes needed for adequate power may be very large2. One strategy to increase efficiency is to sequence individuals who are at both ends of a phenotype distribution (i.e., extreme phenotypes). Because alleles that contribute to the trait are enriched in one or both extremes of phenotype, a modest sample size can potentially identify novel candidate genes/alleles3. As part of the National Heart, Lung, and Blood Institute Exome Sequencing Project (ESP), we used an extreme phenotype design to discover that variants in DCTN4, encoding a dynactin protein, are associated with time to first Pseudomonas aeruginosa (P. aeruginosa) airway infection, chronic P. aeruginosa infection and mucoid P. aeruginosa among individuals with cystic fibrosis (MIM219700).
A shift in toxicity testing from in vivo to in vitro may efficiently prioritize compounds, reveal new mechanisms, and enable predictive modeling. Quantitative high-throughput screening (qHTS) is a major source of data for computational toxicology, and our goal in this study was to aid in the development of predictive in vitro models of chemical-induced toxicity, anchored on interindividual genetic variability. Eighty-one human lymphoblast cell lines from 27 Centre d’Etude du Polymorphisme Humain trios were exposed to 240 chemical substances (12 concentrations, 0.26 nM–46.0 μM) and evaluated for cytotoxicity and apoptosis. qHTS in the genetically defined population produced robust and reproducible results, which allowed for cross-compound, cross-assay, and cross-individual comparisons. Some compounds were cytotoxic to all cell types at similar concentrations, whereas others exhibited interindividual differences in cytotoxicity. Specifically, qHTS in a population-based human in vitro model system has several unique aspects that are of utility for toxicity testing, chemical prioritization, and high-throughput risk assessment. First, standardized and high-quality concentration-response profiling, with reproducibility confirmed by comparison with previous experiments, enables prioritization of chemicals by the interindividual range in cytotoxicity. Second, genome-wide association analysis of cytotoxicity phenotypes allows exploration of the potential genetic determinants of interindividual variability in toxicity. Furthermore, highly significant associations identified through the analysis of population-level correlations between basal gene expression variability and chemical-induced toxicity suggest plausible mode of action hypotheses for follow-up analyses.
We conclude that as the improved resolution of genetic profiling can now be matched with high-quality in vitro screening data, the evaluation of the toxicity pathways and the effects of genetic diversity are now feasible through the use of human lymphoblast cell lines.
chemical cytotoxicity; apoptosis; HapMap; lymphoblasts; qHTS
Head and neck squamous cell carcinoma (HNSCC) is a frequently fatal heterogeneous disease. Beyond the role of human papillomavirus (HPV), no validated molecular characterization of the disease has been established. Using an integrated genomic analysis and validation methodology we confirm four molecular classes of HNSCC (basal, mesenchymal, atypical, and classical) consistent with signatures established for squamous carcinoma of the lung, including deregulation of the KEAP1/NFE2L2 oxidative stress pathway, differential utilization of the lineage markers SOX2 and TP63, and preference for the oncogenes PIK3CA and EGFR. For potential clinical use the signatures are complementary to classification by HPV infection status as well as the putative high risk marker CCND1 copy number gain. A molecular etiology for the subtypes is suggested by statistically significant chromosomal gains and losses and differential cell of origin expression patterns. Model systems representative of each of the four subtypes are also presented.
Resampling-based expression pathway analysis techniques have been shown to preserve type I error rates, in contrast to simple gene-list approaches that implicitly assume the independence of genes in ranked lists. However, resampling is intensive in computation time and memory requirements. We describe accurate analytic approximations to permutations of score statistics, including novel approaches for Pearson's correlation, and summed score statistics, that have good performance for even relatively small sample sizes. Our approach preserves the essence of permutation pathway analysis, but with greatly reduced computation. Extensions for inclusion of covariates and censored data are described, and we test the performance of our procedures using simulations based on real datasets. These approaches have been implemented in the new R package safeExpress.
Gene sets; Multiple hypothesis testing; Permutation approximation
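For context, the brute-force resampling that such analytic approximations aim to replace can be sketched as a permutation test for Pearson's correlation. This is a generic illustration, not the safeExpress implementation; the function names, the two-sided statistic, and the toy data are all assumptions.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation of x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between x and y
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(x, y_perm)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): one gene's expression against a phenotype.
expression = [2.1, 2.9, 3.8, 5.2, 6.1, 6.8, 8.3, 9.0]
phenotype = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
p = permutation_pvalue(expression, phenotype)
print(p)
```

The cost grows with the number of permutations times the number of genes and gene sets, which is the computational burden the abstract refers to.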
Expression quantitative trait locus (eQTL) analysis is rapidly moving from a cutting-edge concept in genomics to a mature area of investigation, with important connections to genome-wide association studies for human disease, pharmacogenomics and toxicogenomics. Despite the importance of the topic, many investigators must develop their own code or use tools not specifically suited for eQTL analysis. Convenient computational tools are becoming available, but they are not widely publicized, and investigators who are interested in discovery of eQTLs, or in using them to interpret genome-wide association study results, may have difficulty navigating the available resources. The purpose of this review is to help investigators find appropriate programs for eQTL analysis and interpretation.
bioinformatics; fast linear modeling; gene expression
Motivation: A number of penalization and shrinkage approaches have been proposed for the analysis of microarray gene expression data. Similar techniques are now routinely applied to RNA sequence transcriptional count data, although the value of such shrinkage has not been conclusively established. If penalization is desired, the explicit modeling of mean–variance relationships provides a flexible testing regimen that ‘borrows’ information across genes, while easily incorporating design effects and additional covariates.
Results: We describe BBSeq, which incorporates two approaches: (i) a simple beta-binomial generalized linear model, which has not been extensively tested for RNA-Seq data and (ii) an extension of an expression mean–variance modeling approach to RNA-Seq data, involving modeling of the overdispersion as a function of the mean. Our approaches are flexible, allowing for general handling of discrete experimental factors and continuous covariates. We report comparisons with other alternate methods to handle RNA-Seq data. Although penalized methods have advantages for very small sample sizes, the beta-binomial generalized linear model, combined with simple outlier detection and testing approaches, appears to have favorable characteristics in power and flexibility.
Availability: An R package containing examples and sample datasets is available at http://www.bios.unc.edu/research/genomic_software/BBSeq
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
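As a rough illustration of the beta-binomial model mentioned above, the following sketch evaluates a beta-binomial log-probability for a read count. The (mean, overdispersion) parameterization used here is one common choice and is an assumption; the abstract does not state the parameterization BBSeq uses, and this is not the package's code.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, mu, phi):
    """Log-probability of k reads for a gene out of n total reads.

    mu:  mean proportion, 0 < mu < 1
    phi: overdispersion, 0 < phi < 1 (assumed parameterization,
         phi = 1 / (a + b + 1); phi near 0 approaches the binomial)
    """
    scale = (1 - phi) / phi  # equals a + b
    a, b = mu * scale, (1 - mu) * scale
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


# Sanity check on toy values: probabilities sum to 1 and the mean is n * mu.
n, mu, phi = 20, 0.3, 0.1
probs = [exp(betabinom_logpmf(k, n, mu, phi)) for k in range(n + 1)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
print(total, mean)
```

Under this parameterization the variance is n·mu·(1−mu)·(1+(n−1)·phi), so phi directly controls how much the counts exceed binomial variation.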
In genome-wide association studies, population stratification is recognized as producing inflated type I error due to the inflation of test statistics. Principal component-based methods applied to genotypes provide information about population structure, and have been widely used to control for stratification. Here we explore the precise relationship between genotype principal components and inflation of association test statistics, thereby drawing a connection between principal component-based stratification control and the alternative approach of genomic control. Our results provide an inherent justification for the use of principal components, but call into question the popular practice of selecting principal components based on significance of eigenvalues alone. We propose a new approach, called EigenCorr, which selects principal components based on both their eigenvalues and their correlation with the (disease) phenotype. Our approach tends to select fewer principal components for stratification control than does testing of eigenvalues alone, providing substantial computational savings and improvements in power. Analyses of simulated and real data demonstrate the usefulness of the proposed approach.
Genomic Control; GWAS; PCA; Population Stratification
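The genomic control approach referred to above can be sketched in a few lines: the inflation factor is the median of the observed 1-df chi-square association statistics divided by the median of the null chi-square distribution (about 0.4549). This illustrates genomic control only, not EigenCorr; the function names are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Ratio of the observed median statistic to its null expectation."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Deflate statistics by the inflation factor (never inflate them)."""
    lam = max(genomic_inflation_factor(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]


# Toy statistics (hypothetical), inflated twofold relative to the null median.
stats = [2 * CHI2_1DF_MEDIAN] * 5
print(genomic_inflation_factor(stats))
```

A value near 1 suggests little inflation; values well above 1 are the kind of inflation that stratification can produce.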
Genetic studies of lung disease in Cystic Fibrosis are hampered by the
lack of a severity measure that accounts for chronic disease progression and
mortality attrition. Further, combining analyses across studies requires common
phenotypes that are robust to study design and patient ascertainment.
Using data from the North American Cystic Fibrosis Modifier Consortium
(Canadian Consortium for CF Genetic Studies, Johns Hopkins University CF Twin
and Sibling Study, and University of North Carolina/Case Western Reserve
University Gene Modifier Study), the authors calculated age-specific CF
percentile values of FEV1 which were adjusted for CF age-specific mortality.
The phenotype was computed for 2061 patients representing the Canadian CF
population, 1137 extreme phenotype patients in the UNC/Case Western study, and
1323 patients from multiple CF sib families in the CF Twin and Sibling Study.
Despite differences in ascertainment and median age, our phenotype score was
distributed in all three samples in a manner consistent with ascertainment
differences, reflecting the lung disease severity of each individual in the
underlying population. The new phenotype score was highly correlated with the
previously recommended complex phenotype, but the new phenotype is more robust
for shorter follow-up and for extreme ages.
A disease progression and mortality adjusted phenotype reduces the need
for stratification or additional covariates, increasing statistical power and
avoiding possible distortions. This approach will facilitate large scale genetic
and environmental epidemiological studies which will provide targeted
therapeutic pathways for the clinical benefit of patients with CF.
Forced Expiratory Volume; Age Effects; Severity of Illness Index
MicroRNAs are short, non-coding RNA sequences that regulate genes at the post-transcriptional level and have been shown to be important in development, tissue differentiation, and disease. Limited attention has been given to the natural variation in miRNA expression across genetically diverse populations even though it is well established that genetic polymorphisms can have a profound effect on mRNA levels. Expression level of 577 miRNAs in the livers of 70 strains of inbred mice was assessed, and we found that miRNA expression is highly stable across different strains. Globally, the expression of miRNA target transcripts does not correlate with miRNA expression, primarily due to the low variance of miRNA but high variance of mRNA expression across strains. Our results show that there is little genetic effect on the baseline miRNA levels in murine liver. The stability of mouse liver miRNA expression in a genetically diverse population suggests that treatment-induced disruptions in liver miRNA expression, a phenomenon established for a large number of toxicants, may indicate an important mechanism for the disturbance of normal liver function, and may prove to be a useful genetic background-independent biomarker of toxicant effect.
micro RNA; liver; mouse; gene expression
Rectal cancer is often clinically resistant to radiotherapy, and there would be value in identifying molecular markers to define the biological basis for this phenomenon. NF-κB is a potentially anti-apoptotic transcription factor that has been associated with resistance to radiotherapy in model systems. This study was designed to evaluate NF-κB activation in rectal cancers being treated with chemoradiation to determine whether NF-κB activity correlates with outcome in rectal cancer.
Methods and Materials
Twenty-two patients were biopsied at multiple time points in a prospective study, and another 50 were analyzed retrospectively. Pre-treatment tumor tissue was analyzed for multiple NF-κB subunits by immunohistochemistry (IHC). Serial tumor biopsies were analyzed for NF-κB-regulated gene expression by RT-PCR and for NF-κB subunit nuclear localization by IHC.
Several NF-κB target genes (Bcl-2, cIAP-2, IL-8 and TRAF1) were significantly upregulated by a single fraction of radiotherapy at 24 hours, demonstrating for the first time that NF-κB is activated by radiotherapy in human rectal tumors. Baseline NF-κB p50 nuclear expression did not correlate with pathologic response to radiotherapy, but increasing baseline p50 was prognostic for overall survival (HR 2.15, p = 0.040).
NF-κB nuclear expression at baseline in rectal cancer is prognostic for overall survival but not predictive of response to radiotherapy. Larger patient numbers would be needed to assess the effect of NF-κB target gene upregulation on response to RT. Our results suggest that NF-κB may play an important role in tumor metastasis as opposed to resistance to chemoradiotherapy.
Variants associated with meconium ileus in cystic fibrosis (CF) were identified in 3,763 patients by GWAS. Five SNPs at two loci near SLC6A14 (min P=1.28×10−12 at rs3788766), chr Xq23-24 and SLC26A9 (min P=9.88×10−9 at rs4077468), chr 1q32.1 accounted for ~5% of the phenotypic variability, and were replicated in an independent patient collection (n=2,372; P=0.001 and 0.0001 respectively). By incorporating the knowledge that disease-causing mutations in CFTR alter electrolyte and fluid flux across epithelia into a hypothesis-driven genome-wide analysis (GWAS-HD), we identified the same SLC6A14 and SLC26A9 associated SNPs, while establishing evidence for the involvement of SNPs in a third solute carrier gene, SLC9A3. In addition, GWAS-HD provided evidence of association between meconium ileus and multiple constituents of the apical plasma membrane where CFTR resides (P=0.0002, testing 155 apical genes jointly and replicated, P=0.022). These findings suggest that modulating activities of apical membrane constituents could complement current therapeutic paradigms for cystic fibrosis.
A combined genome-wide association and linkage study was used to identify loci causing variation in CF lung disease severity. A significant association (P=3.34×10−8) near EHF and APIP (chr11p13) was identified in F508del homozygotes (n=1,978). The association replicated in F508del homozygotes (P=0.006) from a separate family-based study (n=557), with P=1.49×10−9 for the three-study joint meta-analysis. Linkage analysis of 486 sibling pairs from the family-based study identified a significant QTL on chromosome 20q13.2 (LOD=5.03). Our findings provide insight into the causes of variation in lung disease severity in CF and suggest new therapeutic targets for this life-limiting disorder.
It is generally presumed that the Cystic Fibrosis (CF) population is relatively homogeneous, and predominantly of European origin. The complex ethnic make-up observed in the CF patients collected by the North American CF Modifier Gene Consortium has brought this assumption into question, and suggested the potential for population substructure in the three CF study samples collected from North America. It is well appreciated that population substructure can result in spurious genetic associations.
To understand the ethnic composition of the North American CF population, and to assess the need for population structure adjustment in genetic association studies with North American CF patients.
Genome-wide single-nucleotide polymorphisms on 3076 unrelated North American CF patients were used to perform population structure analyses. We compared self-reported ethnicity to genotype-inferred ancestry, and also examined whether geographic distribution and CFTR mutation type could explain the structure observed.
Although largely Caucasian, our analyses identified a considerable number of CF patients with admixed African-Caucasian, Mexican-Caucasian and Indian-Caucasian ancestries. Population substructure was present and comparable across the three studies of the consortium. Neither geographic distribution nor mutation type explained the population structure.
Given the ethnic diversity of the North American CF population, it is essential to carefully detect, estimate and adjust for population substructure to guard against potential spurious findings in CF genetic association studies. Other Mendelian diseases that are presumed to predominantly affect single ethnic groups may also benefit from careful analysis of population structure.
ethnicity; principal component analysis; population substructure; population stratification
Immortalized human lymphoblastoid cell lines have been used to demonstrate that it is possible to use an in vitro model system to identify genetic factors that affect responses to xenobiotics. To extend the application of such studies to investigative toxicology by assessing interindividual and population-wide variability and heritability of chemical-induced toxicity phenotypes, we have used cell lines from the Centre d'Etude du Polymorphisme Humain (CEPH) trios assembled by the HapMap Consortium. Our goal is to aid in the development of predictive in vitro genetics-anchored models of chemical-induced toxicity. Cell lines from the CEPH trios were exposed to three concentrations of 14 environmental chemicals. We assessed ATP production and caspase-3/7 activity 24 h after treatment. Replicate analyses were used to evaluate experimental variability and classify responses. We show that variability of response across the cell lines exists for some, but not all, chemicals, with perfluorooctanoic acid (PFOA) and phenobarbital eliciting the greatest degree of interindividual variability. Although the data for the chemicals used here do not show evidence for broad-sense heritability of toxicity response phenotypes, substantial cell line variation was found, and candidate genetic factors contributing to the variability in response to PFOA were investigated using genome-wide association analysis. The approach of screening chemicals for toxicity in a genetically defined yet diverse in vitro human cell-based system is potentially useful for identification of chemicals that may pose the highest risk, the extent of within-species variability in the population, and genetic loci of interest that potentially contribute to chemical susceptibility.
chemical toxicity; human; lymphoblasts; population variability
Mouse models play a crucial role in the study of human behavioral traits and diseases. Variation of gene expression in brain may play a critical role in behavioral phenotypes, and thus it is of great importance to understand regulation of transcription in mouse brain. In this study, we analyzed the role of two important factors influencing steady-state transcriptional variation in mouse brain. First, we considered the effect of assessing whole brain vs. discrete regions of the brain. Second, we investigated the genetic basis of strain effects on gene expression. We examined the transcriptome of three brain regions using Affymetrix expression arrays: whole brain, forebrain, and hindbrain in adult mice from two common inbred strains (C57BL/6J vs. NOD/ShiLtJ) with eight replicates for each brain region and strain combination. We observed significant differences between the transcriptomes of forebrain and hindbrain. In contrast, the transcriptomes of whole brain and forebrain were very similar. Using 4.3 million single-nucleotide polymorphisms identified through whole-genome sequencing of C57BL/6J and NOD/ShiLtJ strains, we investigated the relationship between strain effect in gene expression and DNA sequence similarity. We found that cis-regulatory effects play an important role in gene expression differences between strains and that the cis-regulatory elements are more often located in 5′ and/or 3′ transcript boundaries, with no apparent preference for either the 5′ or 3′ end.
Mouse Genetic Resource; Mouse Collaborative Cross; mouse; gene expression; whole brain; forebrain; hindbrain; sequence variation
Summary: seeQTL is a comprehensive and versatile eQTL database, including various eQTL studies and a meta-analysis of HapMap eQTL information. The database presents eQTL association results in a convenient browser, using both segmented local-association plots and genome-wide Manhattan plots.
Availability and implementation: seeQTL is freely available for non-commercial use at http://www.bios.unc.edu/research/genomic_software/seeQTL/.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Sporadic amyotrophic lateral sclerosis (sALS) is a motor neuron disease with poorly understood etiology. Results of gene expression profiling studies of whole blood from ALS patients have not been validated and are difficult to relate to ALS pathogenesis because gene expression profiles depend on the relative abundance of the different cell types present in whole blood. We conducted microarray analyses using Agilent Human Whole Genome 4 × 44k Arrays on a more homogeneous cell population, namely purified peripheral blood lymphocytes (PBLs), from ALS patients and healthy controls to identify molecular signatures possibly relevant to ALS pathogenesis.
Differentially expressed genes were determined by LIMMA (Linear Models for MicroArray) and SAM (Significance Analysis of Microarrays) analyses. The SAFE (Significance Analysis of Function and Expression) procedure was used to identify molecular pathway perturbations. Proteasome inhibition assays were conducted on cultured peripheral blood mononuclear cells (PBMCs) from ALS patients to confirm alteration of the Ubiquitin/Proteasome System (UPS).
For the first time, using SAFE in a global gene ontology analysis (gene set size 5-100), we show significant perturbation of the KEGG (Kyoto Encyclopedia of Genes and Genomes) ALS pathway of motor neuron degeneration in PBLs from ALS patients. This was the only KEGG disease pathway significantly upregulated among 25, and contributing genes, including SOD1, represented 54% of the encoded proteins or protein complexes of the KEGG ALS pathway. Further SAFE analysis, including gene set sizes >100, showed that only neurodegenerative diseases (4 out of 34 disease pathways) including ALS were significantly upregulated. Changes in UBR2 expression correlated inversely with time since onset of disease and directly with ALSFRS-R, implying that UBR2 was increased early in the course of ALS. Cultured PBMCs from ALS patients accumulated more ubiquitinated proteins than PBMCs from healthy controls in a serum-dependent manner confirming changes in this pathway.
Our study indicates that PBLs from sALS patients are strong responders to systemic signals or local signals acquired by cell trafficking, representing changes in gene expression similar to those present in brain and spinal cord of sALS patients. PBLs may provide a useful means to study ALS pathogenesis.
Variability in cystic fibrosis (CF) lung disease is partially due to non-CFTR genetic modifiers. Mucin genes are very polymorphic, and mucins play a key role in the pathogenesis of CF lung disease; therefore, mucin genes are strong candidates as genetic modifiers. DNA from CF patients recruited for extremes of lung phenotype was analyzed by Southern blot or PCR to define variable number tandem repeat (VNTR) length polymorphisms for MUC1, MUC2, MUC5AC, and MUC7. VNTR length polymorphisms were tested for association with lung disease severity and for linkage disequilibrium (LD) with flanking single nucleotide polymorphisms (SNPs). No strong associations were found for MUC1, MUC2, or MUC7. A significant association was found between the overall distribution of MUC5AC VNTR length and CF lung disease severity (p = 0.025; n = 468 patients); in addition, there was a robust association of the specific 6.4 kb HinfI VNTR fragment with severity of lung disease (p = 6.2×10−4 after Bonferroni correction). There was strong LD between MUC5AC VNTR length modes and flanking SNPs. The severity-associated 6.4 kb VNTR allele of MUC5AC was confirmed to be genetically distinct from the 6.3 kb allele, as it showed significantly stronger association with nearby SNPs. These data provide detailed respiratory mucin gene VNTR allele distributions in CF patients. Our data also show a novel link between the MUC5AC 6.4 kb VNTR allele and severity of CF lung disease. The LD pattern with surrounding SNPs suggests that the 6.4 kb allele contains, or is linked to, important functional genetic variation.
Association studies using unrelated individuals have become the most popular design for mapping complex traits. One of the major challenges of association mapping is avoiding spurious association due to population stratification. Principal component analysis (PCA) on genome-wide marker genotypes is one of the most popular population stratification control methods. It implicitly assumes that the markers are in linkage equilibrium, a condition that is rarely satisfied and that we plan to relax.
We carefully examined the impact of linkage disequilibrium (LD) on PCA, and proposed a simple modification of the standard PCA to automatically adjust for the correlations among markers.
We demonstrated that LD patterns in genome-wide association datasets can distort the techniques for stratification control, showing ‘subpopulations’ reflecting localized LD phenomena rather than plausible population structure. We showed that the proposed method effectively removes the artifactual effect of LD patterns, and successfully recovers underlying population structure that is not apparent from standard PCA.
PCA is highly influenced by sets of SNPs with high LD, obscuring the true population substructure. Our shrinkage PCA applies to all available markers, regardless of the LD patterns. The proposed method is easier to implement than most existing LD adjusted PCA methods.
PCA; Loadings; GWAS
A number of settings arise in which it is of interest to predict Principal Component (PC) scores for new observations using data from an initial sample. In this paper, we demonstrate that naive approaches to PC score prediction can be substantially biased towards 0 in the analysis of large matrices. This phenomenon is largely related to known inconsistency results for sample eigenvalues and eigenvectors as both dimensions of the matrix increase. For the spiked eigenvalue model for random matrices, we expand the generality of these results, and propose bias-adjusted PC score prediction. In addition, we compute the asymptotic correlation coefficient between PC scores from sample and population eigenvectors. Simulation and real data examples from the genetics literature show the improved bias and numerical properties of our estimators.
PCA; PC Scores; Random Matrix; PC Regression
Motivation: DNA copy number gains and losses are commonly found in tumor tissue, and some of these aberrations play a role in tumor genesis and development. Although high resolution DNA copy number data can be obtained using array-based techniques, no single method is widely used to distinguish between recurrent and sporadic copy number aberrations.
Results: Here we introduce Discovering Copy Number Aberrations Manifested In Cancer (DiNAMIC), a novel method for assessing the statistical significance of recurrent copy number aberrations. In contrast to competing procedures, the testing procedure underlying DiNAMIC is carefully motivated, and employs a novel cyclic permutation scheme. Extensive simulation studies show that DiNAMIC controls false positive discoveries in a variety of realistic scenarios. We use DiNAMIC to analyze two publicly available tumor datasets, and our results show that DiNAMIC detects multiple loci that have biological relevance.
Availability: Source code implemented in R, as well as text files containing examples and sample datasets are available at http://www.bios.unc.edu/research/genomic_software/DiNAMIC.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
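One way to picture a cyclic permutation null for recurrent copy number aberrations is sketched below: each sample's profile is rotated by a random offset, which preserves within-sample structure while breaking alignment across samples. The abstract gives no details of the DiNAMIC procedure, so the test statistic (maximum column sum), the function names, and the toy data are all assumptions rather than a description of the method.

```python
import random


def cyclic_shift(profile, offset):
    """Rotate a list of marker values by offset positions (with wrap-around)."""
    k = offset % len(profile)
    return profile[k:] + profile[:k]


def max_column_sum(matrix):
    """Largest across-sample sum at any single marker."""
    return max(sum(column) for column in zip(*matrix))


def cyclic_shift_pvalue(matrix, n_perm=1000, seed=0):
    """Permutation p-value for the maximum column sum under random shifts."""
    rng = random.Random(seed)
    observed = max_column_sum(matrix)
    n_markers = len(matrix[0])
    hits = 0
    for _ in range(n_perm):
        # Shift every sample independently; adjacent markers stay adjacent.
        shifted = [cyclic_shift(row, rng.randrange(n_markers)) for row in matrix]
        if max_column_sum(shifted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 6 samples, 10 markers, a gain shared at marker 3.
samples = [[1 if j == 3 else 0 for j in range(10)] for _ in range(6)]
p_value = cyclic_shift_pvalue(samples)
print(p_value)
```

Because whole profiles are rotated rather than shuffled marker by marker, the correlation between neighboring markers within a sample is retained under the null.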
Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon.
We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus in order to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence assumption-based gene set testing procedure produces very high false positive rates when applied to data sets for which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data.
These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling based gene set testing criteria in the peer reviewed biomedical literature.
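The kind of independence-assumption test criticized above can be illustrated with a hypergeometric overrepresentation calculation, which treats the genes in a set as if they were drawn independently. This is a generic sketch, not the specific procedure evaluated in the study; the function name and example numbers are assumptions.

```python
from math import comb


def hypergeom_upper_tail(n_genes, n_in_set, n_significant, n_overlap):
    """P(overlap >= n_overlap) when n_significant genes are drawn at random.

    n_genes:       genes on the array
    n_in_set:      genes belonging to the gene set
    n_significant: genes declared individually significant
    n_overlap:     significant genes that fall in the set
    """
    total = comb(n_genes, n_significant)
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 40 of 500 significant genes fall in a 100-gene set
# on a 10,000-gene array (5 would be expected by chance).
p_over = hypergeom_upper_tail(10_000, 100, 500, 40)
print(p_over)
```

When genes within a set are positively correlated, the true null variance of the overlap is larger than this calculation assumes, which is consistent with the inflated false positive rates the abstract reports.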
In quantitative-trait linkage studies using experimental crosses, the conventional normal location-shift model or other parameterizations may be unnecessarily restrictive. We generalize the mapping problem to a genuine nonparametric setup and provide a robust estimation procedure for the situation where the underlying phenotype distributions are completely unspecified. Classical Wilcoxon-Mann-Whitney statistics are employed for point and interval estimation of QTL positions and effects.
GEE; Generalized least squares estimate; Quantitative trait; Weighted least squares estimate; Wilcoxon-Mann-Whitney statistic
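For readers unfamiliar with the building block named above, the following sketch computes the two-sample Wilcoxon-Mann-Whitney U statistic and the classical Hodges-Lehmann shift estimate. It shows only the basic statistic, not the authors' estimating procedure for QTL positions and effects; the genotype groups and phenotype values are hypothetical.

```python
import statistics


def mann_whitney_u(x, y):
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences x - y, a rank-based shift estimate."""
    return statistics.median([a - b for a in x for b in y])


# Hypothetical phenotypes for two marker genotype groups in a backcross.
group_aa = [4.0, 5.0, 6.0]
group_ab = [1.0, 2.0, 3.0]
u_stat = mann_whitney_u(group_aa, group_ab)
shift = hodges_lehmann_shift(group_aa, group_ab)
print(u_stat, shift)
```

Both quantities depend only on the ordering of pairwise comparisons, so neither requires a normal or location-shift assumption about the phenotype distributions.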