For each cohort, the first HDL-C measure was included in the phenotype file, along with age at exam and smoking status. HDL-C measures were blanked for individuals using cholesterol-lowering drugs. In total there were 6334 individuals with both HDL-C measures and genotype data, with HDL-C measures ranging from 16 to 206 (mean 53.6 ± 0.2). Within our analysis dataset, age at exam ranged from 5 to 72 (mean 38.3 ± 0.1). There were 6301 individuals with data on both HDL-C and age and 6152 individuals with both HDL-C and smoking status.
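The phenotype construction described above can be sketched as follows. This is a minimal illustration only: the record layout and the field names (`hdl`, `age`, `smoker`, `on_lipid_drug`) are hypothetical, exams are assumed to be supplied in chronological order, and blanking is assumed to apply to the exam from which the first HDL-C measure is taken.

```python
from typing import Dict, List, Optional


def first_hdl_record(exams: List[Dict]) -> Optional[Dict]:
    """Build one phenotype record from the first exam that has an HDL-C value.

    `exams` is assumed to be in chronological order. HDL-C is blanked (None)
    when the individual was on a cholesterol-lowering drug at that exam.
    """
    for exam in exams:
        if exam.get("hdl") is None:
            continue  # no HDL-C measured at this exam; try the next one
        hdl = None if exam.get("on_lipid_drug") else exam["hdl"]
        return {"hdl": hdl, "age": exam.get("age"), "smoker": exam.get("smoker")}
    return None  # individual has no HDL-C measure at any exam
```

Covariates are taken from the same exam as the HDL-C value, so an individual can carry an HDL-C measure while lacking age or smoking status, consistent with the differing sample counts reported above.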
The genes considered in this investigation were those corresponding to the 22,413 transcripts identified in the expression profiling. Of these, there were 17,350 gene regions with at least one effective SNP located within a 25-kb extension on either side of the physical gene location (NCBI build 36.3). SNP counts ranged from 1 to 597, with an average of 21 ± 1 SNPs per gene region. The 25-kb extension of the boundaries was selected to maximize the number of SNPs that may influence the target gene while minimizing the number of overlapping gene regions; this parameter is investigator-driven and can be adjusted as required.
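As an illustration of the gene-region definition, the sketch below collects the SNPs that fall within a gene's boundaries extended by a configurable flank. The function name and arguments are hypothetical, positions are assumed to be base-pair coordinates on a single chromosome, and the sketch counts raw SNPs rather than effective SNPs.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List

FLANK_BP = 25_000  # 25-kb extension on each side; investigator-adjustable


def snps_in_gene_region(
    snp_positions: Iterable[int],
    gene_start: int,
    gene_end: int,
    flank: int = FLANK_BP,
) -> List[int]:
    """Return sorted SNP positions within [gene_start - flank, gene_end + flank]."""
    positions = sorted(snp_positions)
    lo = bisect_left(positions, max(0, gene_start - flank))  # first SNP inside region
    hi = bisect_right(positions, gene_end + flank)           # one past last SNP inside
    return positions[lo:hi]
```

Widening `flank` captures more potentially regulatory SNPs at the cost of more overlap between neighboring gene regions, which is the trade-off described above.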
Of the 17,350 gene regions tested, 14 were significantly associated with HDL-C from the measured genotype analysis, following correction of the p-value for the effective number of SNPs within the region, at a 1% FDR. These results are shown in Table .
The 14 measured genotype results for HDL-C significant at a 1% FDR
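The two-stage correction behind these results (a per-gene adjustment for the effective number of SNPs, followed by FDR control across gene regions) can be sketched as below. The specific procedures are not stated in this section, so the Šidák adjustment and the Benjamini-Hochberg step-up rule used here are assumptions chosen as common stand-ins, and the function names are hypothetical.

```python
from typing import Dict, Set


def sidak_correct(min_p: float, n_effective: float) -> float:
    """Adjust a gene's best SNP p-value for n_effective independent tests (Sidak)."""
    return 1.0 - (1.0 - min_p) ** n_effective


def benjamini_hochberg(p_values: Dict[str, float], fdr: float = 0.01) -> Set[str]:
    """Return the genes declared significant by the BH step-up rule at the given FDR."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes its threshold
    return {gene for gene, _ in ranked[:cutoff_rank]}
```

In this sketch, each gene region's smallest SNP p-value would first be passed through `sidak_correct` with that region's effective SNP count, and the resulting per-gene values would then be passed together to `benjamini_hochberg`.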
In the joint test, 39 genes were significant at a highly conservative 1% FDR, including 9 from the significant measured genotype set and 23 with expression that was significantly correlated with HDL-C. Seven genes identified as significant in the joint test were not identified by either the association or expression tests independently (among them ABCG1 and PRPF38A). The results of the joint test are shown in Additional File 1.
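To show how a gene can reach significance jointly without doing so on either test alone, the sketch below combines an association p-value and an expression p-value using Fisher's method. This is an illustrative stand-in and not necessarily the joint test used in this study, whose form is not described in this section; it also assumes the two p-values are independent. The function name is hypothetical.

```python
import math


def fisher_combine_two(p_assoc: float, p_expr: float) -> float:
    """Combine two independent p-values with Fisher's method.

    The statistic -2 * (ln p1 + ln p2) follows a chi-square distribution with
    4 degrees of freedom under the null, whose upper tail has the closed form
    x * (1 - ln x) with x = p1 * p2.
    """
    if not (0.0 < p_assoc <= 1.0 and 0.0 < p_expr <= 1.0):
        raise ValueError("p-values must lie in (0, 1]")
    product = p_assoc * p_expr
    return product * (1.0 - math.log(product))
```

For example, two modest p-values of 0.01 combine to roughly 0.001 under this rule, which is how moderate evidence on both dimensions can outrank stronger evidence on only one.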
The genes shown in Table 2 are prime candidates for resequencing and variant typing, empirically selected based on evidence both from transcriptional profiling and genome-wide association. One of the most significant genes is CETP (cholesteryl ester transfer protein), a well-known cholesterol-binding gene. In total, there are seven well-known lipid metabolism genes prioritized by the joint test (ABCB4, ABCG1, CETP, CYP51A1, IL8, IL1R2, and LPL). Interestingly, the list also prioritizes a number of genes of little-known function, such as NLRC5 (NLR family CARD domain containing 5), TCTN1 (tectonic family member 1), and TPPP3 (tubulin polymerization-promoting protein family member 3), which would not be selected by any form of candidate gene approach.
Some genes show a highly significant correlation between their expression and HDL-C but no evidence of association at the physical location of the gene, such as IL1R2 (Table 2). Conversely, there are cases (SMURF1) where the association information drives the combined test. We have retained all genes that exhibit combined significance. An individual reader may choose to further focus on only those genes that exhibit at least nominal significance on each dimension.
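A reader wishing to apply the stricter filter mentioned above could do so along the following lines. The record layout, the field names (`gene`, `p_assoc`, `p_expr`), and the nominal threshold of 0.05 are assumptions made for illustration and are not taken from the study.

```python
from typing import Dict, List

NOMINAL_ALPHA = 0.05  # conventional nominal threshold; an assumption, not from the text


def nominal_on_both(genes: List[Dict], alpha: float = NOMINAL_ALPHA) -> List[str]:
    """Keep genes whose association AND expression p-values are both below alpha."""
    return [
        g["gene"]
        for g in genes
        if g["p_assoc"] < alpha and g["p_expr"] < alpha
    ]
```

Genes driven by a single dimension, as described for IL1R2 and SMURF1, would be dropped by this filter even though they are significant in the combined test.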
While this approach shows great potential for speeding gene identification, it also has several limitations. One potential weakness is the focus on regulatory variation. While there is a growing belief that much of quantitative phenotypic variation may stem from regulatory variation, other types of mechanisms (e.g., structural variation that alters protein-protein interactions) can also be involved. Similarly, genes whose expression is not detected in the target tissue may be missed. Thus, as with all discovery-based approaches, only positive findings admit interpretation. A gene cannot be ruled out using these methods.
This paper combines information from two different population studies. Both samples, however, are ascertained without regard to phenotype. It is possible that the relationship between expression levels and disease-related phenotypes may vary across populations. However, we would expect this to diminish signal rather than yield false positives. Optimally, expression and association results would come from the same data set.