|Home | About | Journals | Submit | Contact Us | Français|
Summary: Associations between DNA polymorphisms and mRNA abundance are a natural target of genetic investigations, and microarrays facilitate genome-wide and transcriptome-wide surveys of these associations. This work is motivated by emerging requirements for data architectures and algorithm interfaces to allow flexible exploration of public and private archives of genotyping and expression arrays. Using R/Bioconductor facilities, Phase II HapMap genotypes and Illumina 47K expression assay results archived on multiple populations may be interactively explored and analyzed using commodity hardware.
Availability and Implementation: Open Source. Bioconductor 2.3 packages GGtools, GGBase, GGdata, hmyriB36. Freely available on the web at http://www.bioconductor.org
Numerous publications have addressed relationships between DNA polymorphisms and gene expression distributions (Cheung et al., 2005; Dixon et al., 2007; Göring et al., 2007; Schadt et al., 2008; Stranger et al., 2007). Interpretation of associations discovered in such analyses is complex and incomplete (Williams et al., 2007). Given the ease with which high-density genotypes and transcriptome-wide mRNA abundance measures can be obtained using microarrays, public and private archives of experiments in ‘integrative genomics’ will grow in number and scope. In this article, we describe a transparent and flexible approach to managing and interrogating data from integrative genomics experiments using R/Bioconductor.
The basic data structures are
Bioconductor's Biobase package defines the eSet class for unified representation of genome-scale assay results, sample-level annotation and experiment-level metadata. The GGtools package extends the eSet class to smlSet in which expression array results are coupled to a highly efficient representation of high-density genotypes due to Clayton and Leung (2007), in the snpMatrix package.
The gwSnpTests method is defined for various parameter sets to control execution of statistical tests over the transcriptome/SNP genome data contained in an smlSet instance. The most basic calls take the form gwSnpTests(f, sms, cnum) where f is an R linear modeling formula, sms is an smlSet and cnum is a chromosome label. The formula f has the form y[~x1+…] where y may be a gene symbol or name of a gene set as defined by the GSEABase package, and [~x1+…] is an (optional) linear predictor specification based on variables contained in the phenoData of sms. Such calls lead to estimation of generalized linear models relating expression in y to (implicitly) all SNP genotypes on chromosome cnum, optionally controlling for confounders in x1+… if present. The underlying computations use byte-level genotype representations as provided in the snp.rhs.tests function of snpMatrix.
Schadt et al. (2008) describe a set of genes in the HLA class II family with evidence of cis- or trans-associated SNP. We use the GSEABase package to define the set of genes represented on the Illumina WG-6 V1 expression array in an R variable hla2set. We filter the HapMap CEU cohort to N = 60 parents in an R variable hmFou. The GGtools method gwSnpTests computes associations between all HLA class II genes and all Phase II SNP with syntax:
Using a 64-bit Sun Blade with 8 GB RAM, running R 2.9.0, the 9 × 4 · 106 linear model calculations specified above complete in under 8 min. On a 4 GB 2.6 GHz MacBook Pro the wall clock time is 4 min and 56 s, when the task is split up into collection in chunks of up to four genes at a time with intermittent garbage collection performed manually. Figure 1 is the output of the masterSnps function applied to hla2run to indicate locations for genes and SNPs with associations possessing sufficiently small P-values. P-values may be adjusted for multiplicities using Bioconductor's multtest package as desired.
Figure 2 employs the rtracklayer package to visualize a component of hla2run as computed above. The fine structure of the series of association tests is visible at user-selectable resolution because the association scores are automatically converted to a custom track in the UCSC browser. A high-scoring SNP is seen to be in the vicinity of OREGAnno element 13 725, a CTCF binding site.
In summary, the adoption of efficient container designs for genome-scale assays of mRNA abundance and SNP genotype facilitates interactive analysis of reasonably large integrative genomics studies (N ≈ 103, G ≈ 105, S ≈ 106) on commodity hardware. Adoption of object designs in R for inferential outcomes facilitates downstream amalgamation of association patterns with diverse annotation resources, simplifying programmatic interpretation processes.
Funding: National Institutes of Health (P41 HG004059-01 and R01 HL086601-01).
Conflict of Interest: none declared.