The rate of preterm birth in the United States is now 12%. Multiple genes, gene networks, and variants have been associated with this disease. Using a custom database for preterm birth (dbPTB) with a refined set of genes extensively curated from the literature and biological databases, we analyzed a GWAS of preterm birth with complete genotype data on nearly 2000 preterm and term mothers. We used both the curated genes and a genome-wide approach to carry out a pathway-based analysis. Nineteen significant pathways that withstood FDR correction for multiple testing were identified by both the curated-gene and the genome-wide approaches. The analysis based on the curated genes was more significant than the genome-wide analysis in 15 of the 19 pathways. This approach demonstrates the use of a validated set of genes, in the analysis of otherwise unsuccessful GWAS data, to identify gene-gene interactions in a way that enhances statistical power and discovery.
Genome-wide association studies (GWAS) enable investigation of the genetic associations underlying complex diseases without a priori hypotheses [1, 2]. Advances in high-throughput genotyping and sequencing technology, together with developments in computational power, have enhanced the feasibility of large case-control studies and reduced costs. Because they have the potential to identify novel genetic variants, GWAS have become a popular approach to the investigation of complex diseases. By the second quarter of 2011, there were 1449 reports in the Catalog of Published Genome-Wide Association Studies (http://www.genome.gov/gwastudies/) covering hundreds of associations of common genetic variants with complex traits. These reports have provided valuable insights into the genetic architecture of disease, including inflammatory bowel disease, macular degeneration, and obesity [5–7]. Nonetheless, GWAS for complex diseases have had only measured success. While many loci have been identified and replicated in GWAS, many studies have failed to identify significant associations. Likewise, the genetic markers that have been identified through the GWAS approach are rarely functional variants in the diseases with which they are associated. In addition, most common variants identified by GWAS are responsible for only a small portion of the genetic variation, and thus there remains a large amount of “missing heritability” [8, 9]. If the “common disease, common variant hypothesis” underlying the GWAS approach does not explain the genetic contributions to complex diseases, then what does [8, 10]? It is likely that rare variants and/or genetic interactions (epistasis) underlie a significant portion of the “missing heritability” not revealed by conventional GWAS analyses [11–13].
It is also likely that complex mechanisms and higher orders of gene-gene interactions underlie the pathogenesis of many (perhaps most) complex diseases and lead to variations in the phenotype [14–17]. Identification of the multiple genes contributing to disease pathogenesis may help in understanding the effects on phenotype and in the search for missing heritability. Nonetheless, the GWAS-based interrogation of large numbers of anonymous single nucleotide polymorphisms (SNPs) severely limits power, thus weakening our computational ability to examine combinatorial gene-gene interactions [19–21].
We are interested in the genetic contribution(s) to preterm birth. Preterm birth is an important, poorly understood clinical problem [22–25]. The incidence of preterm birth (PTB) in the United States is now 12%, or 1 in 8 women. It creates enormous clinical, economic and psychological burdens. The pathogenesis has remained elusive. Clinical tests and interventions to identify patients at risk for preterm birth have relied heavily on assessment of common pathways associated with labor, such as myometrial contractility, cervical ripening, and decidual/membrane activation. Interventions to prevent preterm birth are aimed at these common pathways. However, most of these interventions have proven ineffective. Multiple genes, gene networks, and variants have been associated with preterm birth; however, single genes, single pathways and simple patterns of inheritance are inadequate to explain the pathogenesis of the majority of preterm births. The pathogenesis of PTB may be better understood if the analysis incorporates a more complex model that entails a host of genes, with environmental triggers overlaying the genetics as well.
We have developed an approach for identifying a parsimonious set of genes for the study of preterm birth, validated by a priori biological information. We used a semantic data mining and natural language processing approach to extract all published articles related to preterm birth. Then, the genes identified from public databases and archives of expression arrays were aggregated with the gene set curated from the literature. Lastly, pathway analysis was used to impute genes from pathways identified during curation. The curated articles and collected genetic information form a unique resource for investigators interested in preterm birth, the Database for Preterm Birth (dbPTB), publicly accessible at http://ptbdb.cs.brown.edu/dbPTBv1.php. Recently, results from a genome-wide study of preterm birth, “GENEVA,” became available in dbGaP. The dataset includes phenotypic information and complete genotype data on nearly 2000 mothers, with deliveries ranging from 20 to 42 weeks gestation. Since it has been demonstrated that the genetic risk of preterm birth segregates heavily to the maternal genome, we have concentrated our analysis only on maternal genotype information. Using the curated genes from dbPTB, we have analyzed the GENEVA data set from mothers only. The results for the refined curated genes were further analyzed by gene set enrichment analysis.
We applied standard case/control allelic testing in PLINK v1.07 to analyze the association of individual SNPs with preterm birth. In the first analysis, we used only SNPs that belonged to the curated dbPTB genes. We included SNPs within the genomic region encompassing each gene as well as SNPs within 5 kb upstream or downstream. Of the 617 genes identified in dbPTB, 551 were mapped onto the Illumina 660 quad platform, encompassing 9077 tag SNPs. In the second analysis, we ran a genome-wide comparison using all of the SNPs on the Illumina platform (n=560,768 SNPs). Preterm women were divided into three gestational age categories: less than 30 weeks gestation (n=92), less than 34 weeks (n=446), and less than 37 weeks gestation (n=884). Each group was compared to women who delivered at greater than or equal to 38 weeks gestation (n=960). A Manhattan plot showing the results for all three preterm gestational age groups across the dbPTB SNPs is shown in Figure 1. As can be seen from this representative plot of chromosome 1, there were multiple regions where associations were seen for all three patient groups. Similar results were seen for the other chromosomes. While several regions demonstrated −log P values greater than 2.5 (p<0.0032), no single variants in the dbPTB set of curated genes withstood Bonferroni correction for multiple comparisons (p<5.5×10−6). The lowest P-value was 1.45×10−4, for the SNP rs5742637, which belongs to the insulin-like growth factor 1 gene (IGF1). In the genome-wide analysis, only a single variant reached the Bonferroni-corrected significance threshold (p<8.9×10−8). The P-value for this SNP (rs12682166) was 4.99×10−8. This SNP did not map within any known gene, nor were there any genes within 5 kb upstream or downstream of this variant.
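The two Bonferroni thresholds quoted above follow directly from the number of SNPs tested in each analysis. The short sketch below reproduces that arithmetic; it assumes a family-wise error rate of 0.05 (the text does not state the value explicitly) and a base-10 logarithm for the −log P scale, and the function name is illustrative only.

```python
import math

ALPHA = 0.05  # assumed family-wise error rate; not stated explicitly in the text


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# SNP counts taken from the text
curated = bonferroni_threshold(9077)        # tag SNPs in the curated dbPTB genes
genome_wide = bonferroni_threshold(560768)  # all SNPs on the genotyping array

print(f"curated threshold:     {curated:.2e}")
print(f"genome-wide threshold: {genome_wide:.2e}")

# A -log10(P) of 2.5 corresponds to this raw P-value
print(f"P at -log10(P) = 2.5:  {10 ** -2.5:.4f}")

# Compare the best reported P-values with their thresholds
print("rs5742637 passes curated threshold:     ", 1.45e-4 < curated)
print("rs12682166 passes genome-wide threshold:", 4.99e-8 < genome_wide)
```

Restricting the analysis to the curated SNPs relaxes the per-test threshold by roughly a factor of 60, which is the sense in which the curated set reduces the multiple-testing burden.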
Gene set enrichment analysis was carried out using the 9077 SNPs within the 515 curated dbPTB genes. We compared the cases delivering at less than 30 weeks gestation with the controls described in the previous section. For the GSEA analysis we selected the following analytical options: tag SNPs within 5 kb upstream and downstream of each gene; gene sets including “canonical pathways, GO biological process, GO molecular function, GO cellular component;” and gene set sizes ranging from 5 to 200. For comparison, we performed GSEA on the genome-wide SNP data (n=560,768 SNPs). All parameters were kept the same when running the genome-wide analysis.
From the dbPTB-based pathway analysis, we identified a total of 30 pathways with high confidence values (false discovery rate correction for multiple comparisons, FDR, <0.05); see Supplemental Table 1. From the whole-genome-based pathway analysis, 39 pathways with high confidence were identified. When we compared the analyses, there were 37 shared pathways. The pathway results for both analyses are shown in Figure 2 and in Table 1. The vertical axis represents the −log P values for each of the pathways. The statistical values for the pathways identified by dbPTB are shown in dark blue and those for the whole-genome analysis in light blue. Each of the pathways is shown along the horizontal axis. The threshold value of −log P=1.3 corresponds to an FDR p value less than 0.05, which has already been adjusted for multiple comparisons. There were nineteen shared pathways that reached significance by either analysis. The dbPTB analysis showed greater significance in 15 of the 19 shared significant pathways (Figure 2). There were 13 significant pathways that were identified only using the dbPTB curated set of genes and 33 significant pathways that were identified only in the whole-genome analysis. Remarkably, most of the pathways involving inflammation were either shared or identified only by the dbPTB-based pathway analysis (Supplemental Table 1). Prominent among the results from the genome-wide analysis were metabolomic pathways including phospholipase A2 activity, amino acid derivative biosynthetic processes, the trans-Golgi network, the enzyme inhibitor pathway, lipase activity, nitrogen compound biosynthetic processes, carboxyl esterase activity, and hypotaurine metabolism (Supplemental Table 2). A summary comparison of the results of the pathway analyses from both dbPTB and the genome-wide data is shown in Supplemental Table 3 and Supplemental Figure 1.
Although there have been some successes, GWAS-based approaches have failed to provide comprehensive explanations for the genetic basis of many complex diseases. There are many challenges in identifying the causative genes. As noted above for complex diseases, gene-gene interactions are a far more likely model, as complex molecular networks and metabolic pathways are involved in polygenic diseases [31, 33–35]. For our approach, we took into consideration the a priori biological information about genes involved in preterm birth from the published literature and from available expression arrays. In addition to these initial steps, we included pathway analysis to impute additional genes likely to be involved from pathways identified during curation. Combining these three sources powered the curated gene set for our disease of interest, preterm birth. Although we increased our power by focusing on a smaller number of comparisons, none of the identified single gene variants reached statistical significance. By employing pathway-based permutation testing, we identified important genes and their variants in this important disorder. Moreover, by using a more parsimonious, curated set of genes or variants with demonstrated biological significance, we greatly enhanced our statistical power. This was most evident in the statistical validation of pathways involved in inflammation. Those pathways were not evident in the genome-wide analysis but were identified solely by using the curated set of genes for permutation testing.
Since a portion of the ‘missing heritability’ is likely explained by gene-gene interactions, we employed a pathway-based approach to analyze the results from a large GWAS on preterm birth. Our pathway-based approach used the SNPs selected for the dbPTB set of genes as well as the whole-genome SNPs from the GENEVA data. To enhance our likelihood of success by selecting the most “extreme phenotype”, we restricted our analysis to a comparison of controls who delivered at 38 weeks gestation or later with patients who delivered at 30 weeks gestation or earlier.
In order to generate the “p-values” needed for the pathway analysis, we first carried out single variant analysis using both the dbPTB curated genes and the whole-genome data. As already noted, we did not find a significant single variant associated with any known gene using either the dbPTB curated gene set or the genome-wide data. By comparison, the pathway-based approach yielded rich and significant results which replicate the findings from other studies. Among the ranked list of SNPs in the dbPTB curated gene analysis, the best SNP (rs5742637) mapped onto the IGF1 gene. IGF1 was identified in the dbPTB gene set from a single manuscript which sought candidate genes associated with coagulation and inflammatory pathways in preterm birth. In that report, 1536 SNPs in 130 candidate genes were interrogated and IGF1 was one of the significant findings. In the pathway analysis, there were a total of 3 significant pathways which included IGF1: the erythrocyte differentiation pathway, prostate cancer and PIP3 signaling in cardiac myocytes. These overlap with the pathways with which IGF1 has been more broadly associated, as listed in the preterm birth database, which include hypertrophic cardiomyopathy, the mTOR signaling pathway and prostate cancer. Also of note, IGF1R was identified in the preterm birth database as associated with preterm birth. This was the result of a large linkage analysis done in the Finnish population. In the latter report, the association of IGF1R with preterm birth was verified by haplotype analysis in a larger, independent group of patients. It is likely that the failure to identify IGF1R in both our curated gene study and the pathway analysis is due to the omission of these tag SNPs from our genotyping platform. Nonetheless, the importance of this pathway is suggested by our results and those of others.
IGF1 and its signaling pathway were included in previous candidate analyses because of their participation in the decidua-chorioamniotic and systemic inflammation signaling pathways, which involve the PI3 kinase and mTOR signaling pathways, both of which were prominent in our results.
The pathways identified in our analysis are not independent but instead show a rich network of connectivity. This can best be seen graphically in Figure 3. Gephi was used to make the network maps. In this figure, the blue nodes represent the shared pathways and the orange nodes the pathways identified only by dbPTB. The genes forming the connectivity are likewise displayed. AKT1 was the most connected gene, being identified as contributing significantly to preterm birth in 15 pathways. This is shown in Figure 4. A listing of the other highly connected dbPTB genes is provided in Supplemental Table 4. An alternative way to view the strength of the pathway analysis is not to look solely at the gene contributing to the most pathways, but to identify which pathways had the most genes contributing to their significance and which genes these were. These results are shown in Figures 5A and 5B. The breast cancer estrogen signaling and oxidoreductase activity pathways each had 10 different genes contributing significantly to their involvement in preterm birth. These two pathways were seen solely in the dbPTB analysis.
While our approach to pathway analysis was hypothesis-free, inspection of the genes which showed a significant relationship to preterm birth reveals that several of the traditional mechanisms for preterm birth were highly represented. These include inflammation and metabolomic disorders. Infection and inflammation have been strongly linked to preterm birth. Genes involved in inflammatory mechanisms that emerged from the pathway analysis include IL6, TGFβ2, NOS1, NFKB1, AKT1, IRAK1, TLR3, TLR7, TP53, IFNG, and AR2. These are all important genes, receptors and signaling elements in inflammation. Many of the associations of these inflammatory genes with preterm birth emerged from their involvement in other, related pathways, including GSK3 signaling, small cell lung cancer, organism processes in response to biotic stimulus, PI3K signaling, protein serine/threonine kinase activity and the NFAT pathway (Figure 3). These genes were not identified in the only other published candidate and pathway-based interrogation, although that study did identify pathways associated with inflammation, including JAK-STAT signaling, MAP kinase signaling, T cell receptor signaling and the Toll-like receptor signaling pathway.
Metabolomics has recently been identified as an emerging technology that may provide clues to the pathogenesis of preterm birth that were not previously apparent [38, 39]. In a recent report, gas chromatography-mass spectrometry was used to profile low molecular weight compounds in the amniotic fluid of patients delivering preterm with and without intra-amniotic inflammation. A classification profile was developed which subsequently allowed correct classification of patients with preterm birth. We identified several pathways involved in metabolomics that may provide a clue to the genetic architecture underlying the role of metabolomics in preterm birth. These include electron carrier activity, arginine and proline metabolism, the signaling pathways involving GSK3 and PI3K, tyrosine metabolism, response to biotic stimuli, the oxidoreductase pathway (Figure 5B), protein oligomerization and serine/threonine kinase activity. Of the genes associated with these pathways, the most prominent were NOS1 (which is also involved in inflammation), protein kinase C-alpha and ALK, which were associated with phosphotransferase activity (alcohol group as acceptor), kinase activity, and transferase activity (transferring phosphorus-containing groups). The strength of our approach can be seen through the inclusion of NOS1. NOS1 was not identified during the literature curation process or during the aggregation of genes from transcriptome-wide arrays; it was included in the dbPTB curated genes through the pathway imputation process. Remarkably, NOS1 then appeared in multiple pathways in the GSEA analysis. NOS1 contributed prominently to the significance of the smooth muscle contraction pathway, the oxidoreductase activity pathway, arginine and proline metabolism, and small cell lung cancer. In similar fashion, AKT1 was included in the dbPTB set of curated genes through pathway imputation.
What is likewise remarkable is that AKT1 was the most frequently identified gene whose variants contributed to significance in the pathway analysis (Figure 4). AKT1 contributed to the identification of signaling pathways as diverse as GSK3 signaling, tight junctions, prostate cancer, small cell lung cancer, PI3K signaling in cardiac myocytes, and telomerase pathways. AKT1 was also prominent in the pathways that were shared between the dbPTB analysis and the whole-genome-based pathway analysis, including PI3K signaling, HSA04150 mTOR signaling, the eIF4 pathway, protein serine/threonine kinase activity, phosphotransferase activity (alcohol group as acceptor), general kinase activity, melanoma, transferase activity (transferring phosphorus-containing groups), and the NFAT pathway.
We are especially interested in comparing the pathway results from the dbPTB curated genes with the pathway results from the genome-wide analysis. As noted above, while there were 37 shared pathways, there were 33 significant pathways that were identified only in the whole-genome-based pathway analysis. Prominent among these were metabolomic pathways including phospholipase A2 activity, amino acid derivative biosynthetic processes, the trans-Golgi network, the enzyme inhibitor pathway, lipase activity, nitrogen compound biosynthetic processes, carboxyl esterase activity, and hypotaurine metabolism. In contrast, most of the pathways involving inflammation were either shared or identified only by the dbPTB-based pathway analysis. This demonstrates the strength of our hybrid approach to the identification of relevant genes and pathways in complex diseases. There have been other efforts to collate information on preterm birth. PTBGene is a publicly available database which stores published information on genetic associations with preterm birth. The database currently includes 84 genes with 189 polymorphisms. Using meta-analyses, these investigators reported 5 significant variants. Four of them were maternal and one was in the newborn.
In summary, we used a bioinformatically driven strategy to identify a parsimonious set of genes associated with preterm birth. By aggregating genes from literature curation and publicly available databases (most often from transcriptome-wide analysis) and then using pathway-based imputation, we identified 617 genes for which there was a priori biological evidence for involvement in preterm birth. The tag SNPs associated with these genes were then used in traditional candidate gene association testing using data from the GENEVA genome-wide association study. While we increased our power by focusing on a smaller number of comparisons, none of the identified single gene variants reached statistical significance. We did, however, corroborate the best of those curated genes, IGF1, in the pathway-based analyses, in both the dbPTB pathway analysis and the genome-wide analysis. The database for preterm birth was built to support analysis of gene/gene interactions. With extremely large sets of SNPs, it is computationally expensive to carry out even pairwise comparisons of genes. Moreover, the knowledge-based association of genetic variation with disease dictates that not all variants interact with each other. Rather, gene/gene interactions occur on the basis of known biological information. This body of information has been built into robust databases including KEGG, Biocarta, DAVID, GO and Ingenuity. Although pathway-based analysis methods help us in understanding and evaluating GWAS data, improvements are forthcoming. Better summary statistics will help to evaluate the results more robustly, as described by Wang et al. Likewise, gene-level p-values, which usually depend on SNP association tests, are limited by the number and selection of SNPs on the arrays.
Another limitation is the difficulty of identifying which SNP best represents a given gene, which requires considering not only the best association p-value but also the combined effects of SNPs in linkage disequilibrium.
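To give a sense of scale for the pairwise-comparison cost mentioned above, the following sketch counts the unordered SNP pairs in each analysis. It is an illustration only: it counts SNP pairs rather than gene pairs, uses the SNP totals reported in the text, and says nothing about the cost of any individual interaction test.

```python
import math

# SNP counts taken from the text
curated_snps = 9077     # tag SNPs in the curated dbPTB genes
genome_snps = 560768    # all SNPs on the genotyping array

# Number of unordered pairs: n choose 2 = n * (n - 1) / 2
curated_pairs = math.comb(curated_snps, 2)
genome_pairs = math.comb(genome_snps, 2)
ratio = genome_pairs / curated_pairs

print(f"curated SNP pairs:     {curated_pairs:,}")
print(f"genome-wide SNP pairs: {genome_pairs:,}")
print(f"ratio:                 {ratio:,.0f}x")
```

Because the pair count grows quadratically with the number of SNPs, a roughly 60-fold reduction in SNPs yields a reduction of several thousand-fold in pairwise tests.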
The gene set enrichment approach allowed us to interrogate the known biological associations annotated by several of these databases. By permutation testing, we compared the association of the single nucleotide polymorphisms tagging the genes in our dataset between cases and controls. Even given the anticipated improvements in pathway-based methods, the results were extraordinary. We identified a large number of significant pathways in which biologically relevant curated genes and their associated variants showed significant segregation between preterm and full-term births. Moreover, the curated genes from the dbPTB dataset gave much stronger associations than the genome-wide analysis in all but a few of these pathways. These results provide important confirmation of the role of genetic architecture in the risk of preterm birth. They also provide important mechanistic insights and curated genes which are suitable for future genetic association testing and are ideal targets for more thorough evaluation, including targeted re-sequencing. We recognize that, due to the lack of a replication dataset, this study should be considered hypothesis generating and that these results will need to be replicated in an appropriate dataset.
We identified 186 genes using the literature-based curation, 215 genes from publicly available databases and an additional 216 genes from the pathway-based imputation. These 617 genes represent a robust set of genes for which there is good prior biological evidence for involvement in preterm birth.
We analyzed the single nucleotide polymorphism (SNP) genotyping data from a prospective cohort study in Denmark. The data were derived from the Gene Environment Association Studies initiative (GENEVA) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI). The data from GENEVA cover approximately 4000 Danish women and children and include phenotype and genotype information from a genome-wide case/control study using approximately 1000 preterm mother-child pairs. There are also data from 1000 control mother-child pairs in which the child was born at greater than or equal to 38 weeks’ gestation. All data were deposited into the Database for Genotypes and Phenotypes (dbGaP). Genome-wide SNP genotyping was performed using the Illumina Human 660W-Quad_v1_A (n=560,768 SNPs) at the Center for Inherited Disease Research, Baltimore, MD. As reported in the data set release, genotypes were not reported for any SNP which had a call rate less than 85% or which had more than 1 replicate error as defined with the HapMap control samples.
We ran basic SNP association tests in PLINK to obtain individual marker P-values. The basic association test is based on comparing allele frequencies between cases and controls. PLINK is a free, open-source whole genome association analysis toolset which performs a range of basic, large-scale analyses. The SNP association analyses were conducted in PLINK using only the SNPs in the curated genes from dbPTB, as well as using all the SNPs from the genome-wide analysis. For these analyses, the study “controls” consisted of the 960 mothers who had delivered at 38 weeks gestation or later. For comparison, we carried out the same curated gene analysis using three different patient groups from the GENEVA study. We analyzed the single SNP association with PTB by comparing the controls with the 884 patients delivering at less than 37 weeks, the 446 patients delivering at less than 34 weeks, and the 92 patients delivering at less than 30 weeks.
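As an illustration of what a basic allelic association test computes, the sketch below implements the textbook Pearson chi-square test (1 degree of freedom, no continuity correction) on a 2×2 table of allele counts in cases and controls. It is not PLINK's source code, and the allele counts in the example are hypothetical; only the group sizes (92 cases, 960 controls, two alleles per mother) come from the text.

```python
import math


def allelic_chi2(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int) -> tuple[float, float]:
    """Pearson chi-square (1 df) comparing allele counts in cases vs. controls.

    Returns (chi-square statistic, P-value). Assumes every row and column
    total is non-zero, i.e. the SNP is polymorphic in the combined sample.
    """
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    n = rows[0] + rows[1]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: 92 cases (184 alleles), 960 controls (1920 alleles)
chi2, p = allelic_chi2(case_a=60, case_b=124, ctrl_a=480, ctrl_b=1440)
print(f"chi2 = {chi2:.3f}, P = {p:.4f}")
```

Running this once per SNP yields the per-marker P-values that the pathway analysis takes as input.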
In recent years, gene set enrichment analysis (GSEA) has become increasingly popular to support analysis of gene-gene interactions and to help in understanding the individual contribution(s) of biological pathways to genetic architecture. GSEA employs a new way of considering the SNPs in GWAS. Instead of analyzing single SNPs individually, GSEA tests for disease association among genetic variants in functionally related genes by jointly analyzing the genes that belong to the same pathway, which may reveal SNP or gene associations with complex diseases. We first performed GSEA on the GENEVA GWAS data, using the SNP P-values from the dbPTB curated genes with i-GSEA4GWAS. The i-GSEA4GWAS web server implements i-GSEA to explore GWAS data efficiently. “i-GSEA” is an enhanced application and extension of GSEA. The program runs the analysis in three steps. First, it maps the variants to the genes; each gene is represented by the −log (P-value) of closely spaced SNPs in the gene. Second, i-GSEA is performed to identify the pathways correlated with traits, based on the distribution of enrichment scores generated by permutation. FDR is calculated and used to correct for multiple testing. The threshold of FDR < 0.25 denotes the confidence of ‘possible’ or ‘hypothesis’, while the threshold of FDR < 0.05 is regarded as ‘high confidence’ or ‘with statistical significance’. Finally, the program lists the significant pathways, the genes within those pathways which contributed to the significant associations, and the SNPs that contributed to the association.
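The first step described above, mapping SNPs to genes and summarizing each gene by a −log (P-value), can be sketched as follows. This is a simplified illustration rather than the i-GSEA4GWAS implementation: it assumes that each gene is represented by the largest −log10(P) among the SNPs lying within 5 kb of the gene, and the class names, gene name, coordinates and P-values are all hypothetical.

```python
import math
from dataclasses import dataclass

WINDOW_BP = 5_000  # 5 kb upstream/downstream, as used in the analysis options


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


def gene_scores(genes, snps, window=WINDOW_BP):
    """Map SNPs to genes and keep the best SNP per gene.

    Returns {gene name: (max -log10(P), rsid of that SNP)}.
    Genes with no SNP inside the window are omitted.
    """
    scores = {}
    for gene in genes:
        best = None
        for snp in snps:
            in_window = (
                snp.chrom == gene.chrom
                and gene.start - window <= snp.pos <= gene.end + window
            )
            if not in_window:
                continue
            score = -math.log10(snp.p_value)
            if best is None or score > best[0]:
                best = (score, snp.rsid)
        if best is not None:
            scores[gene.name] = best
    return scores


# Hypothetical gene and SNPs, for illustration only
genes = [Gene("GENE_A", "1", 100_000, 120_000)]
snps = [
    Snp("rs_a", "1", 96_000, 1e-2),   # 4 kb upstream: inside the window
    Snp("rs_b", "1", 110_000, 1e-4),  # inside the gene
    Snp("rs_c", "1", 130_000, 1e-6),  # 10 kb downstream: outside the window
]
scores = gene_scores(genes, snps)
print(scores)
```

The nested loop is quadratic in the worst case; for array-scale data, sorting SNPs by position and using a binary search per gene would be the natural refinement.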
AU built the original curation database and web based interfaces, participated in weekly curation meetings, had direct involvement in all genetic analysis, wrote and edited the paper.
ATD provided oversight and guidance on the initial genetic approaches, contributed to single variant analysis, gave feedback and guidance on GSEA analysis, edited the paper.
SI aided in identification of tag SNPs, provided alternative semantic data mining results for cross-checking the original data, participated in monthly discussions of the evolving data and analyses, contributed to the original funding, edited the paper.
JFP developed the initial concepts, participated in the weekly curation meetings, had direct involvement in all genetic analyses, wrote and edited the paper.
This work was supported by the National Foundation March of Dimes Prematurity Initiative # 21-FY08-563, and National Institutes of Health Grants NIH-5T35HL094308-02 and NIH-NCRR P20 RR018728. GWAS data for preterm birth were analyzed from the Genome-Wide Association Studies of Prematurity and Its Complications, dbGaP study accession number phs000103.v1.p1. This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University.