Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets.
Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results.
Availability and implementation:
Supplementary data are available at Bioinformatics online.
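The entropy measure that EMVC minimizes can be illustrated with a minimal sketch. It assumes the measure is the Shannon entropy of a gene set's membership across one disjoint clustering; the abstract does not give the exact definition, and the names `cluster_entropy` and `cluster_of` are hypothetical. The bootstrap resampling and the range of cluster sizes described above are omitted.

```python
import math
from collections import Counter

def cluster_entropy(gene_set, cluster_of):
    """Shannon entropy (bits) of a gene set's membership across disjoint clusters.

    Assumption: this is one plausible entropy measure, not necessarily the
    one used by EMVC. `cluster_of` maps gene name -> cluster label.
    """
    counts = Counter(cluster_of[g] for g in gene_set if g in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical clustering of four genes into two clusters.
cluster_of = {"A": 0, "B": 0, "C": 1, "D": 1}
spread_out = cluster_entropy(["A", "B", "C", "D"], cluster_of)  # split evenly
concentrated = cluster_entropy(["A", "B"], cluster_of)          # one cluster only
```

Under this assumption, a gene set whose members fall in a single cluster has zero entropy, so dropping annotations that land in other clusters lowers the measure.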
Motivation: Epistasis, the presence of gene–gene interactions, has been hypothesized to be at the root of many common human diseases, but current genome-wide association studies largely ignore its role. Multifactor dimensionality reduction (MDR) is a powerful model-free method for detecting epistatic relationships between genes, but computational costs have made its application to genome-wide data difficult. Graphics processing units (GPUs), the hardware responsible for rendering computer games, are powerful parallel processors. Using GPUs to run MDR on a genome-wide dataset allows for statistically rigorous testing of epistasis.
Results: The implementation of MDR for GPUs (MDRGPU) includes core features of the widely used Java software package, MDR. This GPU implementation allows for large-scale analysis of epistasis at a dramatically lower cost than the standard CPU-based implementations. As a proof-of-concept, we applied this software to a genome-wide study of sporadic amyotrophic lateral sclerosis (ALS). We discovered a statistically significant two-SNP classifier and subsequently replicated the significance of these two SNPs in an independent study of ALS. MDRGPU makes the large-scale analysis of epistasis tractable and opens the door to statistically rigorous testing of interactions in genome-wide datasets.
Availability: MDRGPU is open source and available free of charge from http://www.sourceforge.net/projects/mdr.
Supplementary information: Supplementary data are available at Bioinformatics online.
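The core MDR step for a single SNP pair can be sketched as follows. This is a simplified illustration and not the MDRGPU or Java MDR implementation: it labels each two-locus genotype cell as high risk when its case:control ratio exceeds the overall ratio, then scores the resulting classifier by balanced accuracy on the same data. Cross-validation and the tie-handling rule are omitted, and the function name `mdr_pair` is hypothetical.

```python
from collections import defaultdict

def mdr_pair(geno_a, geno_b, status):
    """Return (high-risk cells, balanced accuracy) for one SNP pair.

    geno_a, geno_b: genotype codes per subject (e.g. 0, 1, 2).
    status: 1 for case, 0 for control.
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    threshold = n_cases / n_controls  # overall case:control ratio

    # Count [cases, controls] in each two-locus genotype cell.
    cells = defaultdict(lambda: [0, 0])
    for a, b, y in zip(geno_a, geno_b, status):
        cells[(a, b)][0 if y == 1 else 1] += 1

    # A cell is high risk when its case:control ratio exceeds the threshold.
    # Written as a product so cells with zero controls need no special case.
    high_risk = {cell for cell, (ca, co) in cells.items() if ca > threshold * co}

    true_pos = sum(cells[cell][0] for cell in high_risk)
    true_neg = sum(co for cell, (ca, co) in cells.items() if cell not in high_risk)
    balanced_accuracy = 0.5 * (true_pos / n_cases + true_neg / n_controls)
    return high_risk, balanced_accuracy
```

A genome-wide scan repeats this for every SNP pair, which is the part that parallelizes naturally across GPU threads.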
Motivation: The sequencing of the human genome has made it possible to identify an informative set of >1 million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWASs). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving health care through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype–phenotype relationship that is characterized by significant heterogeneity and gene–gene and gene–environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.
Populations in sub-Saharan Africa are shifting from rural to increasingly urban. Although the burden of cardiovascular disease is expected to increase with this changing landscape, few large studies have assessed a wide range of risk factors in urban and rural populations, particularly in West Africa. We conducted a cross-sectional, population-based survey of 3317 participants from Ghana (≥18 years old), of whom 2265 (57% female) were from a mid-sized city (Sunyani, population ~250,000) and 1052 (55% female) were from surrounding villages (populations <5000). We measured canonical cardiovascular disease risk factors (BMI, blood pressure, fasting glucose, lipids) and fibrinolytic markers (PAI-1 and t-PA), and assessed how their distributions and related clinical outcomes (including obesity, hypertension and diabetes) varied with urban residence and sex. Urban residence was strongly associated with obesity (OR: 7.8, 95% CI: 5.3–11.3), diabetes (OR 3.6, 95% CI: 2.3–5.7), and hypertension (OR 3.2, 95% CI: 2.6–4.0). Among the quantitative measures, most affected were total cholesterol (+0.81 standard deviations, 95% CI 0.73–0.88), LDL cholesterol (+0.89, 95% CI: 0.79–0.99), and t-PA (+0.56, 95% CI: 0.48–0.63). Triglycerides and HDL cholesterol profiles were similarly poor in both urban and rural environments, but significantly worse among rural participants after BMI-adjustment. For most of the risk factors, the strength of the association with urban residence did not vary with sex. Obesity was a major exception, with urban women at particularly high risk (26% age-standardized prevalence) compared to urban men (7%). Overall, urban residents had substantially worse cardiovascular risk profiles, with some risk factors at levels typically seen in the developed world.
Bladder cancer is a common disease with a complex etiology that is likely due to many different genetic and environmental factors. The goal of this study was to embrace this complexity using a bioinformatics analysis pipeline designed to use machine learning to measure synergistic interactions between single nucleotide polymorphisms (SNPs) in two genome-wide association studies (GWAS) and then to assess their enrichment within functional groups defined by Gene Ontology. The significance of the results was evaluated using permutation testing and those results that replicated between the two GWAS data sets were reported.
In the first step of our bioinformatics pipeline, we estimated the pairwise synergistic effects of SNPs on bladder cancer risk in both GWAS data sets using the Multifactor Dimensionality Reduction (MDR) machine learning method, which is designed specifically for this purpose. Statistical significance was assessed using a 1000-fold permutation test. Each single SNP was assigned a p-value based on its strongest pairwise association. Each SNP was then mapped to one or more genes using a window of 500 kb upstream and downstream from each gene boundary. This window was chosen to capture as many regulatory variants as possible. Using Exploratory Visual Analysis (EVA), we then carried out a gene set enrichment analysis at the gene level to identify those genes with an overabundance of significant SNPs relative to the size of their mapped regions. Each gene was assigned to a biological functional group defined by Gene Ontology (GO). We next used EVA to evaluate the overabundance of significant genes in biological functional groups. Our study yielded one GO category, carboxy-lyase activity (GO:0016831), that was significant in analyses from both GWAS data sets. Interestingly, only the gamma-glutamyl carboxylase (GGCX) gene from this GO group was significant in both the detection and replication data, highlighting the complexity of the pathway-level effects on risk. The GGCX gene is expressed in the bladder, but has not been previously associated with bladder cancer in univariate GWAS. However, there is some experimental evidence that carboxy-lyase activity might play a role in cancer and that genes in this pathway should be explored as drug targets. This study provides a genetic basis for that observation.
Our machine learning analysis of genetic associations in two GWAS for bladder cancer identified numerous associations with pairs of SNPs. Gene set enrichment analysis found aggregation of risk-associated SNPs in genes and significant genes in GO functional groups. This study supports a role for decarboxylase protein complexes in bladder cancer susceptibility. Previous research has implicated decarboxylases in bladder cancer etiology; however, the genes that we found to be significant in the detection and replication data are not known to have direct influence on bladder cancer, suggesting some novel hypotheses. This study highlights the need for a complex systems approach to the genetic and genomic analysis of common diseases such as cancer.
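The SNP-to-gene mapping step described above (a 500 kb window upstream and downstream of each gene boundary) can be sketched as follows. The gene names and coordinates are made up for illustration, and the function name `map_snp_to_genes` is hypothetical; this is not the pipeline's actual code.

```python
WINDOW_BP = 500_000  # 500 kb flank on each side of the gene boundary

def map_snp_to_genes(snp_chrom, snp_pos, genes, window=WINDOW_BP):
    """Return names of genes whose boundaries, extended by `window`, contain the SNP.

    genes: iterable of (name, chromosome, start, end) tuples, positions in bp.
    A SNP can map to more than one gene, or to none.
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom == snp_chrom and start - window <= snp_pos <= end + window:
            hits.append(name)
    return hits

# Hypothetical gene coordinates, for illustration only.
example_genes = [
    ("G1", "1", 1_000_000, 1_050_000),
    ("G2", "1", 1_400_000, 1_450_000),
    ("G3", "2", 1_000_000, 1_050_000),
]
mapped = map_snp_to_genes("1", 1_200_000, example_genes)
```

A window this wide means that nearby genes share many SNPs, which is one reason the gene-level enrichment is judged relative to the size of each mapped region.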
Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identifies the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions.
We systematically tested our approach on a simulation study with datasets possessing various genetic constraints including heritability, number of SNPs, sample size, etc. Our methodology showed high success rates for detecting the interaction SNP pair. We also applied our approach to two bladder cancer datasets, which showed consistent results with well-studied methodologies, such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which we used nodes to represent SNPs and edges to indicate interactions.
We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology showed great potential for detecting gene-gene interactions and studying underlying genetic architectures in a scale-free way, which could help uncover complex disease mechanisms.
Random forest; GWAS; Machine learning; Scale invariant
Genome‐wide association studies (GWAS) have led to the discovery of over 200 single nucleotide polymorphisms (SNPs) associated with type 2 diabetes mellitus (T2DM). Additionally, East Asians develop T2DM at a higher rate, younger age, and lower body mass index than their European ancestry counterparts. The reason behind this occurrence remains elusive. With comprehensive searches through the National Human Genome Research Institute (NHGRI) GWAS catalog literature, we compiled a database of 2,800 ancestry‐specific SNPs associated with T2DM and 70 other related traits. Manual data extraction was necessary because the GWAS catalog reports statistics such as odds ratio and P‐value, but does not consistently include ancestry information. Currently, many statistics are derived by combining initial and replication samples from study populations of mixed ancestry. Analysis of all‐inclusive data can be misleading, as not all SNPs are transferable across diverse populations. We used ancestry data to construct ancestry‐specific human phenotype networks (HPN) centered on T2DM. Quantitative and visual analyses of network models reveal the genetic disparities between ancestry groups. Of the 27 phenotypes in the East Asian HPN, six phenotypes were unique to the network, revealing the underlying ancestry‐specific nature of some SNPs associated with T2DM. We studied the relationship between T2DM and five phenotypes unique to the East Asian HPN to generate new interaction hypotheses in a clinical context. The genetic differences found in our ancestry‐specific HPNs suggest different pathways are involved in the pathogenesis of T2DM among different populations. Our study underlines the importance of ancestry in the development of T2DM and its implications in pharmacogenetics and personalized medicine.
complex disease; East Asian populations; GWAS; human phenotype network; type 2 diabetes
Algorithmic scalability is a major concern for any machine learning strategy in this age of ‘big data’. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
Learning Classifier System; Scalability; Evolutionary Algorithm; Data Mining; Classification; Prediction
To test the hypothesis that maternal complications significantly affect gut colonization patterns in very low birth weight infants.
49 serial stool samples were obtained weekly from 9 extremely premature infants enrolled in a prospective longitudinal study. Sequencing of the bacterial 16S rRNA gene from stool samples was performed to approximate the intestinal microbiome. Linear mixed effects models were used to evaluate relationships between perinatal complications and intestinal microbiome development.
Subjects with prenatal exposure to a non-sterile intrauterine environment, i.e. PPPROM and chorioamnionitis exposure, were found to have a relatively higher abundance of potentially pathogenic bacteria in the stool across all time points compared to subjects without those exposures, irrespective of exposure to postnatal antibiotics. Compared with those delivered by Caesarean section, vaginally delivered subjects were found to have significantly lower diversity of stool microbiota across all time points, with lower abundance of many genera, most in the family Enterobacteriaceae.
We identified persistently increased potential pathogen abundance in the developing stool microbiota of subjects exposed to a non-sterile uterine environment. Maternal complications appear to significantly influence the diversity and bacterial composition of the stool microbiota of premature infants, with findings persisting over time.
obstetrical complications; obstetrical interventions; prematurity; microbiome; PPPROM; chorioamnionitis; antibiotics
Modern technologies are capable of generating enormous amounts of data that measure complex biological systems. Computational biologists and bioinformatics scientists are increasingly being asked to use these data to reveal key systems-level properties. We review the extent to which curricula are changing in the era of big data. We identify key competencies that scientists dealing with big data are expected to possess across fields, and we use this information to propose courses to meet these growing needs. While bioinformatics programs have traditionally trained students in data-intensive science, we identify areas of particular biological, computational and statistical emphasis important for this era that can be incorporated into existing curricula. For each area, we propose a course structured around these topics, which can be adapted in whole or in parts into existing curricula. In summary, specific challenges associated with big data provide an important opportunity to update existing curricula, but we do not foresee a wholesale redesign of bioinformatics training programs.
big data; bioinformatics; data science; education
Electronic health records (EHRs) have become a vital source of patient outcome data, but the widespread prevalence of missing data presents a major challenge. Different causes of missing data in the EHR data may introduce unintentional bias. Here, we compare the effectiveness of popular multiple imputation strategies with a deeply learned autoencoder using the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT). To evaluate performance, we examined imputation accuracy for known values simulated to be either missing completely at random or missing not at random. We also compared ALS disease progression prediction across different imputation models. Autoencoders showed strong performance for imputation accuracy and contributed to the strongest disease progression predictor. Finally, we show that despite clinical heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.
Simulation plays an essential role in the development of new computational and statistical methods for the genetic analysis of complex traits. Most simulations start with a statistical model using methods such as linear or logistic regression that specify the relationship between genotype and phenotype. This is appealing due to its simplicity and because these statistical methods are commonly used in genetic analysis. It is our working hypothesis that simulations need to move beyond simple statistical models to more realistically represent the biological complexity of genetic architecture. The goal of the present study was to develop a prototype genotype-phenotype simulation method and software that are capable of simulating complex genetic effects within the context of a hierarchical biology-based framework. Specifically, our goal is to simulate multilocus epistasis or gene-gene interaction where the genetic variants are organized within the framework of one or more genes, their regulatory regions and other regulatory loci. We introduce here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating data in this manner. This approach combines a biological hierarchy, a flexible mathematical framework, a liability threshold model for defining disease endpoints and a heuristic search strategy for identifying high-order epistatic models of disease susceptibility. We provide several simulation examples using genetic models exhibiting independent main effects and three-way epistatic effects.
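The liability threshold idea mentioned above can be sketched in a few lines. This is not the HIBACHI software: the three-locus score used here (parity of the summed genotype codes) is a hypothetical stand-in for an interaction model, and the function names, noise model and threshold value are all assumptions made for illustration.

```python
import random

def genetic_score(genotype):
    """Hypothetical three-locus interaction: parity of the summed genotype codes."""
    return float(sum(genotype) % 2)

def simulate_phenotypes(genotypes, threshold, noise_sd=0.0, seed=0):
    """Liability threshold model: affected (1) when liability exceeds the threshold.

    Liability = genetic score + Gaussian noise with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    phenotypes = []
    for genotype in genotypes:
        liability = genetic_score(genotype) + rng.gauss(0.0, noise_sd)
        phenotypes.append(1 if liability > threshold else 0)
    return phenotypes

# Four subjects, three loci each, genotypes coded 0/1/2.
subjects = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)]
noise_free = simulate_phenotypes(subjects, threshold=0.5)
```

Raising `noise_sd` weakens the link between genotype and disease status, which is one simple way to vary the strength of the simulated genetic effect.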
Statistical interactions between markers of genetic variation, or gene‐gene interactions, are believed to play an important role in the etiology of many multifactorial diseases and other complex phenotypes. Unfortunately, detecting gene‐gene interactions is extremely challenging due to the large number of potential interactions and ambiguity regarding marker coding and interaction scale. For many data sets, there is insufficient statistical power to evaluate all candidate gene‐gene interactions. In these cases, a global test for gene‐gene interactions may be the best option. Global tests have much greater power relative to multiple individual interaction tests and can be used on subsets of the markers as an initial filter prior to testing for specific interactions. In this paper, we describe a novel global test for gene‐gene interactions, the global epistasis test (GET), that is based on results from random matrix theory. As we show via simulation studies based on previously proposed models for common diseases including rheumatoid arthritis, type 2 diabetes, and breast cancer, our proposed GET method has superior performance characteristics relative to existing global gene‐gene interaction tests. A glaucoma GWAS data set is used to demonstrate the practical utility of the GET method.
gene‐gene interaction; random matrix theory; global test
Although gene‐environment (G×E) interactions play an important role in many biological systems, detecting these interactions within genome‐wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening‐testing approach for G×E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G×E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G×E interactions. Second, to support the analysis of truly genome‐wide data sets, we incorporate a score statistic‐based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP‐education interactions relative to Alzheimer's disease status using genome‐wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
gene‐environment interactions; screening testing; hierarchical FDR; penalized regression
There are several factors that are known to affect research productivity; some of them imply the need for large financial investments and others are related to work styles. There are some articles that provide suggestions for early career scientists (PhD students and postdocs) but few publications are oriented to professors about scientific leadership. As academic mentoring might be useful at all levels of experience, in this note we suggest several key considerations for higher efficiency and productivity in academic and research activities. More research is needed into the main work style features that differentiate highly productive scientists and research groups, as some of them could be innate and others could be transferable. As funding agencies, universities and research centers invest large amounts of money in order to have a better scientific productivity, a deeper understanding of these factors will be of high academic and societal impact.
Scientometrics; Biomedical research; Bibliometrics; Global science; Scientific productivity; Academic mentoring
Although recent genome-wide studies have provided valuable insights into the genetic basis of human disease, they have explained relatively little of the heritability of most complex traits, and the variants identified through these studies have small effect sizes. This has led to the important and hotly debated issue of where the ‘missing heritability’ of complex diseases might be found. Here, seven leading geneticists offer their opinion about where this heritability is likely to lie, what this could tell us about the underlying genetic architecture of common diseases and how this could inform research strategies for uncovering genetic risk factors.
Metabolic syndrome (MetS) is diagnosed by the presence of at least 3 of the following: obesity, hypertension, hyperglycemia, hypertriglyceridemia, and low high‐density lipoprotein. Individuals with MetS also typically have elevated plasma levels of the antifibrinolytic factor, plasminogen activator inhibitor‐1 (PAI‐1), but the relationships between PAI‐1 and MetS diagnostic criteria are not clear. Understanding these relationships can elucidate the relevance of MetS to cardiovascular disease risk, because PAI‐1 is associated with ischemic events and directly involved in thrombosis.
Methods and Results
In a cross‐sectional analysis of 2220 Ghanaian men and women from urban and rural locales, we found the age‐standardized prevalence of MetS to be as high as 21.4% (urban women). PAI‐1 level increased exponentially as the number of diagnostic criteria increased linearly (P<10−13), supporting the conclusion that MetS components have a joint effect that is stronger than their additive contributions. Body mass index, triglycerides, and fasting glucose were more strongly correlated with PAI‐1 than with canonical MetS criteria, and this pattern did not change when pair‐wise correlations were conditioned on all other risk factors, supporting an independent role for PAI‐1 in MetS. Finally, whereas the correlations between conventional risk factors did not vary significantly by sex or across urban and rural environments, correlations with PAI‐1 were generally stronger among urban participants.
MetS prevalence in the West African population we studied was comparable to that of the industrialized West. PAI‐1 may serve as a key link between MetS, as currently defined, and the endpoints with which it is associated. Whether this association is generalizable will require follow‐up.
diabetes mellitus; epidemiology; fibrinolysis; hypertension; lipids; obesity; Epidemiology; Thrombosis; Diabetes, Type 2; Obesity; Hypertension
Modern cohort studies include self-reported measures on disease, behavior and lifestyle, sensor-based observations from mobile phones and wearables, and rich -omics data. Follow-up is often achieved through electronic health record (EHR) linkages across primary and secondary healthcare providers. Historically however, researchers typically only get to see the tip of the iceberg: coded administrative data relating to healthcare claims which mainly record billable diagnoses and procedures. The rich data generated during the clinical pathway remain submerged and inaccessible. While some institutions and initiatives have made good progress in unlocking such deep phenotypic data within their institutional realms, access at scale still remains challenging. Here we outline and discuss the main technical and social challenges associated with accessing these data for data mining and hauling the entire iceberg.
As the cost of genome-wide genotyping decreases, the number of genome-wide association studies (GWAS) has increased considerably. However, the transition from GWAS findings to the underlying biology of various phenotypes remains challenging. As a result, due to its system-level interpretability, pathway analysis has become a popular tool for gaining insights on the underlying biology from high-throughput genetic association data. In pathway analyses, gene sets representing particular biological processes are tested for significant associations with a given phenotype. Most existing pathway analysis approaches rely on single-marker statistics and assume that pathways are independent of each other. As biological systems are driven by complex biomolecular interactions, the complex relationships between single-nucleotide polymorphisms (SNPs) and pathways need to be taken into account. To incorporate the complexity of gene-gene interactions and pathway-pathway relationships, we propose a system-level pathway analysis approach, synthetic feature random forest (SF-RF), which is designed to detect pathway-phenotype associations without making assumptions about the relationships among SNPs or pathways. In our approach, the genotypes of SNPs in a particular pathway are aggregated into a synthetic feature representing that pathway via Random Forest (RF). Multiple synthetic features are analyzed using RF simultaneously and the significance of a synthetic feature indicates the significance of the corresponding pathway. We further complement SF-RF with pathway-based Statistical Epistasis Network (SEN) analysis that evaluates interactions among pathways. By investigating the pathway SEN, we hope to gain additional insights into the genetic mechanisms contributing to the pathway-phenotype association. We apply SF-RF to a population-based genetic study of bladder cancer and further investigate the mechanisms that help explain the pathway-phenotype associations using SEN.
The bladder cancer associated pathways we found are both consistent with existing biological knowledge and reveal novel and plausible hypotheses for future biological validations.
interactions; epistasis; pathway analysis; synthetic feature random forest (SF-RF); statistical epistasis network (SEN)
Many colleges and universities across the globe now offer bachelors, masters, and doctoral degrees, along with certificate programs in bioinformatics. While there is some consensus surrounding curricula competencies, programs vary greatly in their core foci, with some leaning heavily toward the biological sciences and others toward quantitative areas. This allows prospective students to choose a program that best fits their interests and career goals. In the digital age, most scientific fields are facing an enormous growth of data, and as a consequence, the goals and challenges of bioinformatics are rapidly changing; this requires that bioinformatics education also change. In this workshop, we seek to ascertain current trends in bioinformatics education by asking the question, “What are the core competencies all bioinformaticians should have at the end of their training, and how successful have programs been in placing students in desired careers?”
Bladder cancer is the 4th most common cancer among men in the U.S. and more than half of patients experience recurrences within 5 years after initial diagnosis. Additional clinically informative and actionable biomarkers of the recurrent bladder cancer phenotypes are needed to improve screening and molecular therapeutic approaches for recurrence prevention. MicroRNA-34a (miR-34a) is a short non-coding regulatory RNA with tumor suppressive attributes. We leveraged our unique, large, population-based prognostic study of bladder cancer in New Hampshire, U.S. to evaluate miR-34a expression levels in individual tumor cells to assess prognostic value. We collected detailed exposure and medical history data, as well as tumor tissue specimens from bladder cancer patients and followed them long-term for recurrence, progression and survival. Fluorescence-based in situ hybridization assays were performed on urothelial carcinoma tissue specimens (n=229). A larger proportion of the non-muscle invasive tumors had high levels of miR-34a within the carcinoma cells compared to those tumors that were muscle invasive. Patients with high miR-34a levels in their baseline non-muscle invasive tumors experienced lower risks of recurrence (adjusted hazard ratio (HR) 0.57, 95% CI 0.34–0.93). Consistent with these observations, we demonstrated a functional tumor suppressive role for miR-34a in cultured urothelial cells, including reduced matrigel invasion and growth in soft agar. Our results highlight the need for further clinical studies of miR-34a as a guide for recurrence screening and as a possible candidate therapeutic target in the bladder.
miR; miRNA; bladder cancer; urothelial carcinoma; recurrence
Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.
Gene set testing; gene set enrichment; screening-testing; principal component analysis; random matrix theory; Tracy-Widom
The large volume of GWAS data poses great computational challenges for analyzing genetic interactions associated with common human diseases. We propose a computational framework for characterizing epistatic interactions among large sets of genetic attributes in GWAS data. We build the human phenotype network (HPN) and focus it on a disease of interest. In this study, we use the GLAUGEN glaucoma GWAS dataset and apply the HPN as a biological knowledge-based filter to prioritize genetic variants. Then, we use the statistical epistasis network (SEN) to identify a significant connected network of pairwise epistatic interactions among the prioritized SNPs. These clearly highlight the complex genetic basis of glaucoma. Furthermore, we identify key SNPs by quantifying structural network characteristics. Through functional annotation of these key SNPs using Biofilter, a software tool that accesses multiple publicly available human genetic data sources, we find supporting biomedical evidence linking glaucoma to an array of genetic diseases, proving our concept. We conclude by suggesting hypotheses for a better understanding of the disease.
GWAS; Epistasis; Gene-gene interaction; Human phenotype network; Statistical epistasis network; Biofilter; Eye diseases; Glaucoma; SNPs; Pathways