Peripheral blood genetical genomics study populations
The peripheral blood genetical genomics study population contained 1,469 unrelated individuals from the United Kingdom and the Netherlands. Some of these are healthy controls while others are patient samples. The 49 ulcerative colitis (UC) cases in this study are part of the inflammatory bowel disease (IBD) cohort of the University Medical Centre Groningen. The 111 celiac disease samples were collected within the Barts and the London NHS Trust and the Oxford Radcliffe Hospitals NHS Trust. The 453 chronic obstructive pulmonary disease (COPD) samples were collected within the NELSON study. The 856 amyotrophic lateral sclerosis (ALS) cases and controls were collected in the University Medical Centre Utrecht. All samples were collected after informed consent and approved by local ethical review boards. Individual sample information is provided in Table S11
Peripheral blood (2.5 ml) for all samples was collected with the PAXgene system (PreAnalytix GmbH, UK). PAXgene vials were chosen to prevent density gradient centrifugation, immortalization or in vitro cell culture artifacts changing mRNA profiles. PAXgene tubes were mixed gently and incubated at room temperature for two hours. After collection, tubes were frozen at −20°C for at least 24 hours followed by storage at −80°C. RNA was isolated using the PAXgene Blood RNA isolation kit (PreAnalytix GmbH, UK). RNA was quantified using the Nanodrop (Nanodrop Technologies, USA). Total RNA integrity was analyzed using an Agilent Bioanalyzer (Agilent Technologies, USA).
Peripheral blood SNP genotyping
Peripheral blood samples were either genotyped using the Illumina (Illumina, San Diego, USA) HumanHap300, HumanHap370 or 610 Quad platform. Genotyping was performed according to standard protocols from Illumina. Although the different genotype oligonucleotide arrays differ, they share 294,757 SNPs, to which the analysis was confined. In addition, SNPs with a minor allele frequency of <5%, or a call-rate <95%, or deviating from Hardy-Weinberg equilibrium (exact p-value <0.001) were excluded, resulting in 289,044 SNPs for further analysis. Genotype calling for each SNP was performed by a previously described algorithm 
Peripheral blood Illumina expression profiling
Anti-sense RNA was synthesized, amplified and purified using the Ambion Illumina TotalPrep Amplification Kit (Ambion, USA) following the manufacturers' protocol. Complementary RNA was either hybridized to Illumina HumanRef-8 v2 arrays (229 samples, further referred to as H8v2) or Illumina HumanHT-12 arrays (1,240 samples, further referred to as HT12) and scanned on the Illumina BeadArray Reader. Raw probe intensities were extracted using Illumina's BeadStudio Gene Expression module v3.2 (No background correction was applied, nor did we remove probes with low expression). The raw expression data of the 1,240 HT12 peripheral blood samples were combined with the raw expression data of 296 replication samples (described in detail in paragraph ‘Trans-eQTL replication dataset’). Both datasets (H8v2 and HT12) were quantile normalized separately to the median distribution and expression values were subsequently log2 transformed. Subsequently, the probes were centered to zero and linearly scaled such that each probe had a standard deviation of one.
Integration of the Illumina H8V2 and HT12 peripheral blood expression platform identifiers
The HT12 and H8v2 arrays share a considerable number of probes with identical probe sequences. However, in a considerable number of occasions the two platforms use different probe identifiers for the same probe sequences. More importantly, although probe identifiers are often identical, they sometimes represent different probe sequences. In order to permit a meta-analysis incorporating data from both arrays, we decided on the following naming convention: if an H8v2 probe had the same sequence as an HT12 probe, the HT12 ‘ArrayAddressID’ probe identifier was used. If not, the original H8v2 probe identifier was used, but with the prefix “Human_RefSeq-8_v2-” to prevent any potential probe identifier ambiguity. A total of 52,061 unique probes were used for further analysis, representing 19,609 unique genes according to HUGO gene nomenclature.
Initial genomic mapping of Illumina expression probe sequences
Various mapping strategies were used for the expression probes to get a mapping location that was as unambiguous as possible: if probes have been mapped incorrectly, or cross-hybridize to multiple genomic loci, it might be that an eQTL will be incorrectly deemed a trans-eQTL, while in fact it is a cis-eQTL or primer polymorphisms. We used Ensembl database version 52 (NCBI 36.3 assembly) to obtain, for each annotated gene, the transcript with the largest number of exons and included this main spliced transcript in our reference set. Second, we added one sequence per intron, extending intron boundaries 40 bp on each side to allow mapping of the 50 bp probe sequences that overlapping exon-intron junctions. Last, a version of the reference DNA genome with masked annotated transcripts was included. Probe sequences were mapped using NOVOALIGN V2.05.12 for all the sequences (main transcript, introns, and non standard exon-exon junctions) originating from the same transcript (parameters −t 150 −v 20 20 200 [>]( [ ^_]*)_). For each probe it was determined whether it was mapping uniquely to one particular genomic locus, or, if multiple hits were present whether all these mappings resided in each other vicinity (<250 kb). Probes that did not map at all, or mapped to multiple different loci were excluded from further analyses. Using this approach, 43,202 of the 48,751 probes on the HT12 and 21,316 of the 22,185 probes on the H8v2 platform were eventually mapped to a single genomic location.
In order to detect cis-eQTLs, analysis was confined to those probe-SNP combinations for which the distance from the probe transcript midpoint to SNP genomic location was ≤250 kb. For trans-eQTLs, analysis was confined to those probe-SNP combinations for which the distance from probe transcript midpoint to SNP genomic location was ≥5 Mb (to exclude the possibility of accidentally detecting cis-eQTLs due to long ranging linkage disequilibrium). Additionally, for the trans-eQTL analysis the effects of the significant cis-eQTLs were removed from the expression data by keeping the residual expression after linear regression.
Association for cis- and trans-eQTL was tested with a non-parametric Spearman's rank correlation. For directly genotyped SNPs we coded genotypes as 0, 1 or 2, while for imputed SNPs we used SNP dosage values, ranging between 0 and 2. When a particular probe-SNP pair was present in both the HT12 and H8v2 datasets, an overall, joint p-value was calculated using a weighted (square root of the dataset sample number) Z-method.
To correct for multiple testing, we controlled the false-discovery rate (FDR) at 0.05: the distribution of observed p-values was used to calculate the FDR, by comparison with the distribution obtained from permuting expression phenotypes relative to genotypes 100 times within the HT12 and H8v2 dataset for both the cis
- and trans
- analyses 
In order to increase the number of detectable cis
- and trans
-eQTLs we applied a principal component analysis (PCA) on the sample correlation matrix. We, among others 
, argue that the dominant PCs, capturing the larger part of the total variation, will primarily capture sample differences in expression that reflect physiological or environmental variation as well as systematic experimental variation (e.g. batch and technical effects). Figure S6
shows for the 1,240 HT12 samples what per individual the PC scores are. It is evident there are, especially among the first PCs, strong batch effects are still present after proper quantile-quantile normalization. By removing the variation captured by these PCs, we expected that the residual expression is more strongly determined by genetic variants and the number of significantly detected cis
- and trans
-eQTLs will increase. An aspect to consider is that with the removal of more PCs from the data, the degrees of freedom of the data will decrease. Furthermore, it is not immediately clear which PCs will actually capture physiological, environmental, and systematic variation, which might lead to removal of genetically determined expression variation as well. Therefore a tradeoff has to be made on the number of PCs to subtract from the data. We assessed this systematically, by removing up to 100 PCs from the genetical genomics dataset (in steps of 5).
shows that the number of significantly detected cis
-eQTL probes increases two-fold when 50 PCs were removed from the expression data. There is a long plateau visible (around PC50), where the number of detected cis
-eQTLs probes remains approximately constant, irrespective of removing for instance 10 fewer or 10 extra PCs (reported numbers in this figure also include false-positive eQTLs due to potential primer polymorphisms, as we here wanted to solely compare the performance of removing different numbers of PCs). Figure S7B
shows that of the initial 5,950 significantly detected cis
-eQTL probes (no PCs removed), 4,965 (83.5%) were still detected with 50 PCs subtracted. The 985 initially detected cis
-eQTLs probes, yet no longer detected when 50 PCs had been removed from the expression data, all had a low significance (Figure S8
). As we controlled the FDR at 0.05 in all analyses it is therefore likely that a considerable amount of these reflect false-positives. Figure S8C
shows that for all the overlapping 4,965 detected cis
-eQTLs probes between the different analyses, the allelic direction was identical, and effect size on expression correlate well (Pearson r
0.95) although these were nearly always stronger after having subtracted 50 PCs.
We assessed this for trans
-eQTLs as well. An important aspect to consider is that trans
-eQTL SNPs might affect multiple genes. If these effects are substantial (either in effect size or the number of affected genes), it is likely that a certain PC will capture this. Removal of such PCs from the expression data will therefore unintentionally result in the inability to detect these trans
-eQTLs. In order to avoid such false-negatives
we first performed a QTL analysis on the first 50 PCs (that had been removed from the expression data for the cis
-eQTL analysis) to assess whether some of these PCs are under genetic control (genome-wide analysis, controlling FDR at 0.05). We did this for the large HT12 and the smaller H8v2 expression data separately, as PCA had been applied independently to these datasets. We observed that out of the first 25 PCs in the HT12 data three PCs and in the H8v2 two PCs were to some extent genetically determined (r2
>5%). This was different for PCAs 26–50 in the HT12 data: 11 PCs were under substantial genetic control (Figure S9a
We therefore assumed that most trans
-eQTLs could be detected when removing approximately 25 PCs. We quantified this systematically, by removing increasing amounts of PCs from the expression data and conducting a full genome-wide trans
-eQTL mapping. Indeed, in these analyses at most 244 significant trans
-eQTLs could be detected (at FDR 0.05, with potential false-positives due to cross-hybridizations removed), when removing 25 PCs (Figure S9b
). The overlap with the expression with no PCs removed was substantial: 62 of the 82 trans
-eQTLs (77%), detected in the original analysis were detected as well in the analysis with 25 PCs removed (Figure S9c
), all with identical allelic directions (Figure S9d
Identification of false eQTLs due to primer polymorphisms and cross-hybridization
One should be aware that sequence polymorphisms can cause many false cis
. Such false cis
-eQTLs do not reflect actual expression differences caused by sequence polymorphisms in cis
-acting factors that affect mRNA levels. Instead they indicate hybridization differences caused by sequence polymorphisms in the mRNA region that is targeted by the microarray expression probes. Therefore, SNP-probe combinations were excluded from the cis
-eQTL analysis when the 50 bp long expression probe mapped to a genomic location that contained a known SNP that was showing at least some LD (r2
>0.1) with the cis
-SNP. We used SNP data from the 1000 Genomes Projects, as it contains LD information for 9,633,115 SNPs (April 2009 release, based on 57 CEU samples of European descent).
-eQTLs might also reflect false-positives, although we initially had attempted to map the expression probes as accurately as possible, by using the aforementioned three different mapping strategies: it is still well possible that some of the identified, putative trans
-eQTLs in fact reflect very subtle cross-hybridization (e.g.
pertaining to only a small subsequence of the probe). We therefore tried to falsify each of the putative trans
-eQTLs by attempting to map each trans
-probe into the vicinity of the SNP probe location, by using a highly relaxed mapping approach. All putative Illumina trans
-expression probes were mapped using SHRiMP 
, which uses a global alignment approach, to the human reference genome (NCBI 36.3 build). The mapping settings were chosen very loosely to permit the identification of nearly all potential hybridization locations: match score was 10, the mismatch score was 0, the gap open penalty was −250, the gap extension penalty was −100, Smith and Waterman minimum identical alignment threshold was 30.0%, while other SHRiMP parameters were left at default. Using these settings all mappings with a minimum overlap of 15 bases, or with 20 matches with one mismatch, or 30 matches with 2 mismatches, or full-length (50 bp) probe hybridizations with no more than 15 mismatches were accepted. Any trans
-eQTL was discarded, if the expression probe had a mapping that was within 2 Mb of the SNP that showed the trans
-eQTL effect. Once these potential false-positive trans
-eQTLs had been removed from the real, non-permuted data, we repeated the multiple testing correction (again controlling the FDR at 0.05).
Using this strategy we observed several instances where only 20 out the 50 bases of a probe sequence mapped in the vicinity of the trans-SNP (data not shown). For these trans-eQTLs the Spearman's rank correlation p was often lower than 10−100, which would imply these SNPs explain over 25% of the total expression variation of the corresponding trans-genes. Given the small amount of trans-eQTLs we detected in total, such effect sizes are quite unlikely and therefore provide circumstantial evidence these indeed reflect cross-hybridization artifacts.
We also assessed whether any of the Illumina SNPs that constitute trans-eQTLs might map to a different position than what is reported in dbSNP. As such we mapped the 50 bp Illumina SNP probe sequences to the genome assembly, permitting up to four mismatches per 50 bp SNP probe sequence. We did not observe any SNP that could map (with some mismatches) to the same chromosome of the trans-probe.
It is still possible that some of the trans-eQTLs for which we did not find any evidence of cross-hybridization, still are false positives, e.g. by missing some cross-hybridizations due to imperfections in the NCBI v36 assembly we used. Although we have identified numerous occasions where a SNP affects two different probes within the same gene in trans, substantiating the likelihood these trans-eQTLs are real, providing unequivocal evidence that all our reported trans-eQTLs are real is not straightforward.
Enrichment analysis of trait-associated SNPs and SNPs located within the HLA region
To assess enrichment of trait-associated SNPs, we used a collection of 1,262 unique SNPs from 'A Catalog of Published Genome-Wide Association Studies' (accessed 09 February 2010, and each having at least one reported association p-value <5.0×10−7). We could successfully impute the genotypes for 1,167 of these SNPs and therefore confined all analyses to these SNPs. Of these SNPs 572 had been directly genotyped on the Illumina HumanHap300 platform, with a MAF>0.05, an HWE exact p-value >0.0001 and call-rate >95%.
To ascertain whether these SNPs are more often constituting an eQTL than expected, we used a methodology that is not affected by the following potential confounders: non-even distribution of SNP markers and expression probe markers across the genome, differences in MAF between SNPs and LD structure within the genotype date and correlation between probes in the expression data. Additionally, this methodology is also not confounded by the fact that for certain traits different SNPs in strong LD can have been reported, due to differences in the platforms that were used to identify these loci.
We first determined how many unique eQTL SNPs had been identified in the original eQTL mapping (with an FDR<0.05) and how many of these are trait-associated. Subsequently we permuted the expression phenotypes relative to the genotypes (thus keeping the correlation structure within the genotype data and the correlation structure within the expression data intact, yet assigning the genotypes of a sample to the expression data of a randomly chosen sample) and reran the eQTL mapping, sorting all tested eQTLs on highest significance. We then took an equal number of top associated, but permuted, eQTL SNPs and determined how many of these permuted eQTL SNPs are trait-associated. By performing 100 permutations we obtained an empiric distribution of the number of trait-associated SNPs expected by chance. We subsequently fitted a generalized extreme value distribution (EVD, using the EVD add-on package for R), permitting us to estimate realistic enrichment significance estimates (called EVD p throughout the manuscript).
For the MHC enrichment analysis the followed procedure was identical, with the difference that we looked for enrichment for SNPs within the MHC, defined as SNPs physically mapping between 20 Mb and 40 Mb on chromosome 6 (NCBI 36 assembly).
Trans-eQTL replication datasets
Replication of the detected eQTLs was performed in monocytes from 1,490 different samples 
and in an independent population of 86 morbidly obese individuals that underwent elective bariatric surgery (Department of general surgery, Maastricht University Medical Centre, the Netherlands). Both these datasets also used the same Illumina HumanHT-12 expression platform.
For the 1,490 monocyte samples eQTL P-Values summary statistics were available for all monocyte trans-eQTLs with a nominal p<1.0×10−5. We ascertained how many of the trans-eQTLs we had found in our peripheral blood data had a nominal eQTL p<1.0×10−5 in this monocyte dataset.
We also assessed trans-eQTLs in four different tissues from the 86 morbidly obese individuals that underwent bariatric surgery. DNA was extracted from blood samples using the Chemagic Magnetic Separation Module 1 (Chemagen) integrated with a Multiprobe II Pipeting robot (PerkinElmer). All samples were genotyped using both Illumina HumanCytoSNP-12 BeadChips and Illumina HumanOmni1-Quad BeadChips (QC was identical as was applied to the peripheral blood samples). We imputed HapMap 2 genotypes using Impute version 2.0. In addition expression profiling was performed for four different tissues for each of these individuals using the Illumina HumanHT-12 arrays. Wedge biopsies of liver, visceral adipose tissue (VAT, omentum majus), subcutaneous adipose tissue (SAT, abdominal), and muscle (musculus rectus abdominis) were taken during surgery. RNA was isolated using the Qiagen Lipid Tissue Mini Kit (Qiagen, UK, 74804). Assessment of RNA quality and concentration was done with an Agilent Bioanalyzer (Agilent Technologies USA). Starting with 200 ng of RNA, the Ambion Illumina TotalPrep Amplification Kit was used for anti-sense RNA synthesis, amplification, and purification according to the protocol provided by the manufacturer (Ambion, USA). 750 ng of complementary RNA was hybridized to Illumina HumanHT12 BeadChips and scanned on the Illumina BeadArray Reader. Expression data preprocessing was as mentioned before. We first attempted to replicate the trait-associated trans-eQTLs per tissue, using an FDR of 0.05 and 100 permutations. Subsequently we conducted a meta-analysis, combining the four tissues. Per trans-eQTL we used a weighted Z-method to combine the four individual p-values. However, these four datasets are not independent, as they reflect the same individuals. We resolved this by conducting the permutations in such a way that in every permutation round the samples were permuted in exactly the same way for each of the four tissues. By doing this we retained the correlations that exist between the different tissues per sample, and were able to get a realistic empiric (null-)distribution of expected test-statistics.
Per trait we assessed all the SNPs that have been reported to be associated with that particular trait. We analyzed per trait all possible SNP-pairs. If a pair of SNPs was not in LD (r2<0.001) we assessed whether they affected the same gene in cis or trans. When using the trait-associated cis- and trans-eQTLs that had been identified when controlling the FDR at 0.05, we identified 7 unique pairs of SNPs that caused both the same phenotype and also affected the same gene(s). When using a somewhat more relaxed set of trans-eQTLs, identified when controlling the FDR at 0.5, we identified 18 unique pairs of SNPs that affect the same downstream gene.
We assessed whether these numbers were significantly higher than expected, by using the same strategy that we had used to assess the enrichment of trait-associated SNPs and the HLA; we ran 100 permutations. We kept per permutation the cis-eQTL list as it was, but generated a permuted set of trans-eQTLs, equal in size to the original set of non-permuted trans-eQTLs. This enabled us to determine per permutation round how many unique pairs of SNPs converge on the same gene(s). We subsequently fitted a generalized extreme value distribution, permitting us to estimate realistic enrichment significance estimates.
Co-expression between genes, based on HT12 peripheral blood co-expression
If a particular SNP is cis
- or trans
-acting on multiple genes, it is plausible that those genes are biologically related. Co-expression between these genes provides circumstantial evidence this is the case, strengthening the likelihood such cis
- and trans
-eQTLs are real. We assessed this in the peripheral blood data, by using the expression data of the 1,240 samples, run on the comprehensive HT12 expression platform. As we had removed 25 PCs (to remove physiological, environmental variation, and systematic experimental variation) for the trans
-eQTL analyses, we decided to confine co-expression analyses to this expression dataset. As there are 43,202 HT12 probes that we mapped to a known genomic location, 43,202×43,201/2
933,184,801 probe-pairs exist. Given 1,240 samples, a Pearson correlation coefficient r≥0.19 corresponds to a p<0.05 when applying stringent Bonferroni correction for these number of probe-pairs.
Expression data for both the peripheral blood and the four non-blood datasets have been deposited in GEO with accession numbers GSE20142 (1,240 peripheral blood samples, hybridized to HT12 arrays), GSE20332 (229 peripheral blood samples, hybridized to H8v2 arrays) and GSE22070 (subcutaneous adipose, visceral adipose, muscle and liver samples).