Strong statistical interactions among genes are often observed in association with diseases of public health importance. Gene interactions can potentially improve disease classification accuracy. When gene expression differences across classes are not large enough, it becomes especially important to make use of gene interactions in disease classification analyses. However, most gene selection algorithms in classification analyses focus only on genes whose expression levels differ across classes, and ignore the discriminatory information from gene interactions. In this study, we develop a two-stage algorithm that takes gene interactions into account during gene selection. Its main advantage is that, by using “Bayes error” as the gene selection criterion, it can exploit discriminatory information from gene interactions as well as from gene expression differences. Using simulated and real microarray data sets, we demonstrate that gene interactions can improve classification accuracy, and show that the proposed algorithm can yield small informative sets of genes while achieving highly accurate classification results. Our study may thus offer novel insight for future gene selection algorithms for discriminating human diseases.
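As a toy illustration of why a Bayes-error criterion can credit gene interactions, the sketch below computes the exact Bayes error for two hypothetical genes discretized to low (0) or high (1) expression. The probability tables, the equal class priors, and the function names are assumptions made for illustration; they are not taken from the study, and the estimator used for continuous expression data may differ.

```python
def bayes_error(class_dists, priors):
    """Exact Bayes error for discrete features.

    class_dists: one dict per class, mapping outcome -> P(outcome | class).
    priors: prior probability of each class, in the same order.
    """
    outcomes = set()
    for dist in class_dists:
        outcomes.update(dist)
    # The Bayes rule picks, for each outcome, the class with the largest
    # joint probability; everything else is misclassified.
    correct = sum(
        max(p * dist.get(o, 0.0) for p, dist in zip(priors, class_dists))
        for o in outcomes
    )
    return 1.0 - correct


def marginal(dist, gene_index):
    """Marginal distribution of one gene from a joint table keyed by tuples."""
    out = {}
    for state, prob in dist.items():
        key = state[gene_index]
        out[key] = out.get(key, 0.0) + prob
    return out


# Hypothetical joint tables: in class 0 the two genes tend to agree,
# in class 1 they tend to disagree. Each gene alone looks identical
# in both classes.
class0 = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
class1 = {(0, 1): 0.45, (1, 0): 0.45, (0, 0): 0.05, (1, 1): 0.05}
priors = [0.5, 0.5]

single_gene_error = bayes_error([marginal(class0, 0), marginal(class1, 0)], priors)
pair_error = bayes_error([class0, class1], priors)
print(single_gene_error, pair_error)
```

In this constructed case neither gene shows any marginal difference between classes, so a selection rule based only on per-gene differences would discard both, while the pair is jointly informative.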
Peripheral blood monocytes (PBMs) play multiple and critical roles in the immune response, and abnormalities in PBMs have been linked to a variety of human disorders. However, the DNA methylation landscape in PBMs is largely unknown. In this study, we characterized epigenome-wide DNA methylation profiles in purified PBMs.
Materials & methods
PBMs were isolated from freshly collected peripheral blood from 18 unrelated healthy postmenopausal Caucasian females. Epigenome-wide DNA methylation profiles (the methylome) were characterized by using methylated DNA immunoprecipitation combined with high-throughput sequencing.
Distinct patterns were revealed at different genomic features. For instance, promoters were commonly (~58%) found to be unmethylated; whereas protein coding regions were largely (~84%) methylated. Although CpG-rich and -poor promoters showed distinct methylation patterns, interestingly, a negative correlation between promoter methylation levels and gene transcription levels was consistently observed across promoters with high to low CpG densities. Importantly, we observed substantial interindividual variation in DNA methylation across the individual PBM methylomes and the pattern of this interindividual variation varied between different genomic features, with highly variable regions enriched for repetitive DNA elements. Furthermore, we observed a modest but significant excess (p < 2.2 × 10^−16) of genes showing a negative correlation between interindividual promoter methylation and transcription levels. These significant genes were enriched in biological processes that are closely related to PBM functions, suggesting that alteration in DNA methylation is likely to be an important mechanism contributing to the interindividual variation in PBM function, and PBM-related phenotypic and disease-susceptibility variation in humans.
This study represents a comprehensive analysis of the human PBM methylome and its interindividual variation. Our data provide a valuable resource for future epigenomic and multiomic studies, exploring biological and disease-related regulatory mechanisms in PBMs.
DNA methylation; interindividual variation; peripheral blood monocyte
Bone size (BS) contributes significantly to the risk of osteoporotic fracture. Osteoporotic spine fracture is one of the most disabling outcomes of osteoporosis. This study aims to identify genomic loci underlying spine BS variation in humans.
We performed a genome-wide association scan in 2,286 unrelated Caucasians using Affymetrix 6.0 SNP arrays. Areal BS (cm^2) at the lumbar spine was measured using dual energy X-ray absorptiometry scanners. SNPs of interest were subjected to replication analyses and meta-analyses with two additional independent Caucasian populations (N = 1,000 and 2,503) and one Chinese population (N = 1,627).
In the initial GWAS, 91 SNPs were associated with spine BS (P < 1.0E-4). Eight contiguous SNPs clustered in a haplotype block within the UQCC gene (ubiquinol-cytochrome c reductase complex chaperone). The association of these eight SNPs with spine BS was replicated in one Caucasian and one Chinese population. Meta-analyses (N = 7,416) generated much stronger association signals for these SNPs (e.g., P = 1.86E-07 for SNP rs6060373), supporting association of UQCC with spine BS across ethnicities.
This study identified a novel locus, i.e., the UQCC gene, for spine BS variation in humans. Future functional studies will contribute to elucidating the mechanisms by which UQCC regulates bone growth and development.
Spine bone size; GWAS; UQCC
Bone and muscle, the two major tissue types of the musculoskeletal system, have strong genetic determination. Abnormalities in bone and/or muscle may cause musculoskeletal diseases such as osteoporosis and sarcopenia. Bone size phenotypes (BSPs), such as hip bone size (HBS) and appendicular bone size (ABS), are genetically correlated with body lean mass (mainly muscle mass). However, the specific genes shared by these phenotypes are largely unknown. In this study, we aimed to identify the specific genes with pleiotropic effects on BSPs and appendicular lean mass (ALM).
We performed a bivariate genome-wide association study (GWAS) by analyzing ~690,000 SNPs in 1,627 unrelated Han Chinese adults (802 males and 825 females) followed by a replication study in 2,286 unrelated US Caucasians (558 males and 1728 females).
We identified 14 single nucleotide polymorphisms (SNPs) of interest that may contribute to variation of both BSPs and ALM, with p values <10^−6 in the discovery stage. Among them, the association of three SNPs (rs2507838, rs7116722, and rs11826261) in/near the GLYAT (glycine-N-acyltransferase) gene was replicated in US Caucasians, with p values ranging from 1.89×10^−3 to 3.71×10^−4 for ALM-ABS and from 5.14×10^−3 to 1.11×10^−2 for ALM-HBS. Meta-analyses yielded stronger association signals for rs2507838, rs7116722, and rs11826261, with pooled p values of 1.68×10^−8, 7.94×10^−8, and 6.80×10^−8 for ALM-ABS and 1.22×10^−4, 9.85×10^−5, and 3.96×10^−4 for ALM-HBS, respectively. Haplotype allele ATA based on these three SNPs was also associated with ALM-HBS and ALM-ABS in both discovery and replication samples. Interestingly, GLYAT was previously found to be essential to glucose metabolism and energy metabolism, suggesting the gene’s dual role in both bone development and muscle growth.
Our findings, together with the prior biological evidence, suggest the importance of the GLYAT gene in the co-regulation of bone phenotypes and body lean mass.
Bivariate GWAS; Bone size; Lean mass; GLYAT
Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies of DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging.
We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability, and 3) recombination hotspot position and intensity accuracy. Although ms is a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, compensates for this deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequence. Simcoal2, based on a discrete generation-by-generation approach, can simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms that approximate the standard coalescent, are much more efficient while keeping the salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they are advantageous over the other programs for a spectrum of evolutionary models. To validate recombination hotspots, the LDhat 2.2 rhomap package, sequenceLDhot, and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance on both real and simulated data.
While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.
Coalescent; Population genetics; Linkage disequilibrium; Recombination; Single nucleotide polymorphism
Aims: Some of the well-known functional alcohol dehydrogenase (ADH) gene variants (e.g. ADH1B*2, ADH1B*3 and ADH1C*2) that significantly affect the risk of alcohol dependence are rare variants in most populations. In the present study, we comprehensively examined the associations between rare ADH variants [minor allele frequency (MAF) <0.05] and alcohol dependence, with several other neuropsychiatric and neurological disorders as reference. Methods: A total of 49,358 subjects in 22 independent cohorts with 11 different neuropsychiatric and neurological disorders were analyzed, including 3 cohorts with alcohol dependence. The entire ADH gene cluster (ADH7–ADH1C–ADH1B–ADH1A–ADH6–ADH4–ADH5 at Chr4) was imputed in all samples using the same reference panels that included whole-genome sequencing data. We stringently cleaned the phenotype and genotype data to obtain a total of 870 single nucleotide polymorphisms with 0< MAF <0.05 for association analysis. Results: We found that a rare variant constellation across the entire ADH gene cluster was significantly associated with alcohol dependence in European-Americans (Fp1: simulated global P = 0.045), European-Australians (Fp5: global P = 0.027; collapsing: P = 0.038) and African-Americans (Fp5: global P = 0.050; collapsing: P = 0.038), but not with any other neuropsychiatric disease. Association signals in this region came principally from ADH6, ADH7, ADH1B and ADH1C. In particular, a rare ADH6 variant constellation showed a replicable association with alcohol dependence across these three independent cohorts. No individual rare variants were statistically significantly associated with any disease examined after group- and region-wide correction for multiple comparisons. Conclusion: We conclude that rare ADH variants are specific for alcohol dependence. The ADH gene cluster may harbor a causal variant(s) for alcohol dependence.
Deng and Lynch (1, 2) proposed to characterize deleterious genomic mutations from changes in the mean and genetic variance of fitness traits upon selfing in outcrossing populations. Such observations can be readily acquired in cyclical parthenogens. Selfing and life-table experiments were performed for two such Daphnia populations. A significant inbreeding depression and an increase of genetic variance for all traits analyzed were observed. Deng and Lynch's (2) procedures were employed to estimate the genomic mutation rate (U), mean dominance coefficient, selection coefficient, and scaled genomic mutational variance. The estimates of these four parameters (^ indicates an estimate) are 0.84, 0.30, 0.14 and 4.6E-4, respectively. For the true values, the [...] are lower bounds, and [...]
The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNPs), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly smaller than the number of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group).
We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulation and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
The CCA-sparse group method incorporates group effects of features into the correlation analysis while performing individual feature selection simultaneously. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with true correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer true correlated features, while CCA-group tends to select more redundant features.
Group sparse CCA; Genomic data integration; Feature selection; SNP
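To make the difference between the l-1, group, and sparse group penalties concrete, the sketch below shows the shrinkage (proximal) operator for a penalty of the form lam1·||w||_1 + lam2·Σ_g ||w_g||_2, which is a common building block in iterative solvers for such models. The penalty form without group-size weights, the parameter names, and the example weights are assumptions for illustration; the abstract does not specify the numerical algorithm, so this should not be read as the authors' procedure.

```python
import math


def soft_threshold(values, t):
    """Element-wise l-1 shrinkage: pulls each entry toward zero by t."""
    out = []
    for x in values:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def group_shrink(values, t):
    """Group (l-2) shrinkage: scales the whole block, or zeroes it entirely."""
    norm = math.sqrt(sum(x * x for x in values))
    if norm <= t:
        return [0.0] * len(values)
    scale = 1.0 - t / norm
    return [scale * x for x in values]


def sparse_group_prox(weights, groups, lam1, lam2):
    """Shrinkage for lam1*||w||_1 + lam2*sum_g ||w_g||_2.

    groups: list of index lists that partition the weight vector
    (hypothetical example: SNPs belonging to the same gene).
    Soft-thresholding followed by group shrinkage gives the exact
    proximal operator for this penalty.
    """
    out = [0.0] * len(weights)
    for idx in groups:
        block = soft_threshold([weights[i] for i in idx], lam1)
        block = group_shrink(block, lam2)
        for i, val in zip(idx, block):
            out[i] = val
    return out


# Hypothetical loadings for five SNPs in two genes.
w = [3.0, -0.5, 4.0, 0.2, -0.1]
genes = [[0, 1, 2], [3, 4]]
shrunk = sparse_group_prox(w, genes, lam1=1.0, lam2=1.0)
print(shrunk)
```

With lam2 = 0 the operator reduces to plain l-1 shrinkage, and with lam1 = 0 to pure group shrinkage, which mirrors how a more general formulation can contain the simpler sparse models as special cases.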
The BMP and Wnt/β-catenin signaling pathways cooperatively regulate osteoblast differentiation and bone formation. Although BMP signaling regulates gene expression of the Wnt pathway, much less is known about whether Wnt signaling modulates BMP expression in osteoblasts. Given the presence of putative Tcf/Lef response elements that bind β-catenin/TCF transcription complex in the BMP2 promoter, we hypothesized that the Wnt/β-catenin pathway stimulates BMP2 expression in osteogenic cells. In this study, we showed that Wnt/β-catenin signaling is active in various osteoblast or osteoblast precursor cell lines, including MC3T3-E1, 2T3, C2C12, and C3H10T1/2 cells. Furthermore, crosstalk between the BMP and Wnt pathways affected BMP signaling activity, osteoblast differentiation, and bone formation, suggesting Wnt signaling is an upstream regulator of BMP signaling. Activation of Wnt signaling by Wnt3a or overexpression of β-catenin/TCF4 both stimulated BMP2 transcription at promoter and mRNA levels. In contrast, transcription of BMP2 in osteogenic cells was decreased by either blocking the Wnt pathway with DKK1 and sFRP4, or inhibiting β-catenin/TCF4 activity with FWD1/β-TrCP, ICAT, or ΔTCF4. Using a site-directed mutagenesis approach, we confirmed that Wnt/β-catenin transactivation of BMP2 transcription is directly mediated through the Tcf/Lef response elements in the BMP2 promoter. These results, which demonstrate that the Wnt/β-catenin signaling pathway is an upstream activator of BMP2 expression in osteoblasts, provide novel insights into the nature of functional cross talk integrating the BMP and Wnt/β-catenin pathways in osteoblastic differentiation and maintenance of skeletal homeostasis.
BMP; Wnt/β-catenin; Gene expression; Osteogenesis
There is growing evidence for a link between energy and bone metabolism. The nuclear receptor subfamily 5 member A2 (NR5A2) is involved in lipid metabolism and modulates the expression of estrogen-related genes in some tissues. The objective of this study was to explore the influence of NR5A2 on bone cells and to determine whether its allelic variations are associated with bone mineral density (BMD).
Analyses of gene expression by quantitative PCR and inhibition of NR5A2 expression by siRNAs were used to explore the effects of NR5A2 in osteoblasts. Femoral neck BMD and 30 single nucleotide polymorphisms (SNPs) were first analyzed in 935 postmenopausal women, and the association of NR5A2 genetic variants with BMD was then explored in another 1284 women in replication cohorts.
NR5A2 was highly expressed in bone. The inhibition of NR5A2 confirmed that it modulates the expression of osteocalcin, osteoprotegerin, and podoplanin in osteoblasts. Two SNPs were associated with BMD in the Spanish discovery cohort (rs6663479, P=0.0014, and rs2816948, P=0.0012). A similar trend was observed in another Spanish cohort, with statistically significant differences across genotypes in the combined analysis (P=0.03). However, the association in a cohort from the United States was rather weak. Electrophoretic mobility assays and studies with luciferase reporter vectors confirmed the existence of differences in the binding of nuclear proteins and the transcriptional activity of rs2816948 alleles.
NR5A2 modulates gene expression in osteoblasts and some allelic variants are associated with bone mass in Spanish postmenopausal women.
The present study searched for replicable risk genomic regions for alcohol and nicotine co-dependence using a genome-wide association strategy. The data contained a total of 3,143 subjects including 818 European-American (EA) cases with alcohol and nicotine co-dependence, 1,396 EA controls, 449 African-American (AA) cases and 480 AA controls. We performed separate genome-wide association analyses in EAs and AAs and a meta-analysis to derive combined p values, and calculated the genome-wide false discovery rate (FDR) for each SNP. Regions with p<5×10^-7 together with FDR<0.05 in the meta-analysis were examined to detect all replicable risk SNPs across EAs, AAs and meta-analysis. These SNPs were followed with a series of functional expression quantitative trait locus (eQTL) analyses. We found a unique genome-wide significant gene region, SH3BP5-NR2C2, that was enriched with 11 replicable risk SNPs for alcohol and nicotine co-dependence. The distributions of -log(p) values for all SNP-disease associations within this region were consistent across EAs, AAs, and meta-analysis (0.315≤r≤0.868; 8.1×10^-52≤p≤3.6×10^-5). In the meta-analysis, this region was the only association peak throughout chromosome 3 at p<0.0001. All replicable risk markers available for eQTL analysis had nominal cis- and trans-acting regulatory effects on gene expression. The transcript expression of the genes in this region was regulated partly by several nicotine dependence-related genes and significantly correlated with transcript expression of many alcohol and nicotine dependence-related genes. We concluded that the SH3BP5-NR2C2 region on Chromosome 3 might harbor causal loci for alcohol and nicotine co-dependence.
GWAS; alcohol and nicotine co-dependence
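The per-SNP FDR mentioned above can be sketched with the Benjamini-Hochberg step-up procedure. The abstract does not state which FDR method was used, so treating it as Benjamini-Hochberg is an assumption, and the p values below are made up for illustration.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p values for four SNPs.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = bh_qvalues(example_p)

# SNPs passing a joint filter of the kind described in the text
# (a p-value cutoff together with FDR < 0.05); the p cutoff here is illustrative.
passing = [i for i, (p, qv) in enumerate(zip(example_p, example_q))
           if p < 0.02 and qv < 0.05]
print(example_q, passing)
```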
Copy number variation (CNV) is an important type of structural variation (SV) in the human genome. Various studies have shown that CNVs are associated with complex diseases. Traditional CNV detection methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution. Next generation sequencing (NGS) promises higher-resolution detection of CNVs, and several methods were recently proposed to realize this promise. However, the performance of these methods is not robust under some conditions; e.g., some of them may fail to detect short CNVs. There has been a strong demand for reliable detection of CNVs from high-resolution NGS data.
A novel and robust method to detect CNVs from short sequencing reads is proposed in this study. CNV detection is modeled as change-point detection in the read depth (RD) signal derived from the NGS data, which is fitted with a total variation (TV) penalized least squares model. The performance (e.g., sensitivity and specificity) of the proposed approach is evaluated by comparison with several recently published methods on both simulated data and real data from the 1000 Genomes Project.
The experimental results showed that, in contrast to several existing methods, both the true positive rate and the false positive rate of the proposed detection method do not change significantly for CNVs with different copy numbers and lengths. Therefore, our proposed approach results in more reliable detection of CNVs than the existing methods.
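The TV-penalized least squares fit described above can be illustrated on a toy read-depth signal. The sketch below minimizes 0.5·Σ(x_i − y_i)² + lam·Σ|x_{i+1} − x_i| by projected gradient steps on the dual problem, then reports positions where the fitted signal jumps. The solver choice, the parameter values, the jump threshold, and the example signal are all assumptions for illustration; the abstract does not describe the optimization scheme or how the penalty is tuned.

```python
def _primal_from_dual(y, z):
    """Recover x = y - D^T z, where (D x)[i] = x[i+1] - x[i]."""
    n = len(y)
    x = []
    for j in range(n):
        left = z[j - 1] if j > 0 else 0.0
        right = z[j] if j < n - 1 else 0.0
        x.append(y[j] + right - left)
    return x


def tv_denoise(y, lam, n_iter=5000):
    """Piecewise-constant fit of y under a total variation penalty.

    Projected gradient on the dual: each dual variable z[i] is tied to one
    adjacent pair of bins and is kept inside [-lam, lam].
    """
    n = len(y)
    z = [0.0] * (n - 1)
    step = 0.25  # safe step size: the largest eigenvalue of D D^T is below 4
    for _ in range(n_iter):
        x = _primal_from_dual(y, z)
        for i in range(n - 1):
            z[i] = max(-lam, min(lam, z[i] + step * (x[i + 1] - x[i])))
    return _primal_from_dual(y, z)


def change_points(x, min_jump):
    """Indices where a new segment starts (jump larger than min_jump)."""
    return [i + 1 for i in range(len(x) - 1) if abs(x[i + 1] - x[i]) > min_jump]


# Hypothetical normalized read depth in 15 bins with a gain in bins 5..9.
depth = [0.0] * 5 + [4.0] * 5 + [0.0] * 5
fit = tv_denoise(depth, lam=1.0)
print(change_points(fit, min_jump=0.5))
```

A larger lam merges more bins into common segments and suppresses short, low-amplitude events, so in practice the penalty weight trades sensitivity to short CNVs against false positives.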
Therapeutic interventions in prediabetes are important in the primary prevention of type 2 diabetes (T2D) and its chronic complications. However, little is known about the pharmacogenetic effect of traditional herbs on prediabetes treatment. A total of 194 impaired glucose tolerance (IGT) subjects were treated with traditional hypoglycemic herbs (Tianqi Jiangtang) for 12 months in this study. DNA samples were genotyped for 184 mutations in 34 genes involved in drug metabolism or transportation. Multinomial logistic regression analysis indicated that rs1142345 (A > G) in the thiopurine S-methyltransferase (TPMT) gene was significantly associated with the hypoglycemic effect of the drug (P = 0.001, FDR P = 0.043). The “G” allele frequencies of rs1142345 in the healthy (subjects reverted from IGT to normal glucose tolerance), maintenance (subjects still had IGT), and deterioration (subjects progressed from IGT to T2D) groups were 0.094, 0.214, and 0.542, respectively. Binary logistic regression analysis indicated that rs1142345 was also significantly associated with the hypoglycemic effect of the drug between the healthy and maintenance groups (P = 0.027, OR = 4.828) and between the healthy and deterioration groups (P = 0.001, OR = 7.811). Therefore, rs1142345 was associated with the clinical effect of traditional hypoglycemic herbs. Results also suggested that TPMT was probably involved in the pharmacological mechanisms of T2D.
Copy number variation (CNV) has played an important role in studies of susceptibility or resistance to complex diseases. Traditional methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution of genomic regions. Following the emergence of next generation sequencing (NGS) technologies, CNV detection methods based on the short read data have recently been developed. However, due to the relatively young age of the procedures, their performance is not fully understood. To help investigators choose suitable methods to detect CNVs, comparative studies are needed. We compared six publicly available CNV detection methods: CNV-seq, FREEC, readDepth, CNVnator, SegSeq and event-wise testing (EWT). They are evaluated both on simulated and real data with different experiment settings. The receiver operating characteristic (ROC) curve is employed to demonstrate the detection performance in terms of sensitivity and specificity, box plot is employed to compare their performances in terms of breakpoint and copy number estimation, Venn diagram is employed to show the consistency among these methods, and F-score is employed to show the overlapping quality of detected CNVs. The computational demands are also studied. The results of our work provide a comprehensive evaluation on the performances of the selected CNV detection methods, which will help biological investigators choose the best possible method.
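One plausible reading of an F-score for the overlapping quality of detected CNVs is a base-pair-level score: precision is the fraction of called bases that fall inside true CNVs, and recall is the fraction of true CNV bases that are called. The sketch below follows that reading. The exact definition used in the comparison is not given in the abstract, so this definition, the half-open interval convention, and the coordinates are assumptions.

```python
def total_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Bases shared between two interval sets.

    Assumes intervals within each set do not overlap one another.
    """
    return sum(
        max(0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def overlap_f_score(truth, calls):
    """Harmonic mean of base-pair precision and recall."""
    shared = overlap_length(truth, calls)
    if shared == 0:
        return 0.0
    precision = shared / total_length(calls)
    recall = shared / total_length(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical coordinates on one chromosome.
true_cnvs = [(100, 200)]
called_cnvs = [(150, 250)]
print(overlap_f_score(true_cnvs, called_cnvs))
```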
Genotype imputation is an important tool in human genetics studies. It uses reference sets with known genotypes and prior knowledge of linkage disequilibrium and recombination rates to infer untyped alleles for human genetic variants at low cost. The reference sets used by current imputation approaches are based on HapMap data and/or on recently available next-generation sequencing (NGS) data, such as data generated by the 1000 Genomes Project. However, because coverage and call rates differ across NGS data sets, integrating NGS data sets of different accuracy with previously available reference data as imputation references is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation, for both simulated and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, we considered three strategies: strategy 1 uses one NGS data set as a reference; strategy 2 imputes samples by using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, with the samples based on the high-accuracy reference selected when overlapping occurs; and strategy 3 combines multiple available data sets into a single reference after imputing each against the other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding the application of imputation methods in next-generation association analyses.
Previous studies using SAGE (the Study of Addiction: Genetics and Environment) and COGA (the Collaborative Study on the Genetics of Alcoholism) genome-wide association study (GWAS) data sets reported several risk loci for alcohol dependence (AD), which have not yet been well replicated independently or confirmed by functional studies. We combined these two data sets, now publicly available, to increase the study power, in order to identify replicable, functional, and significant risk regions for AD. A total of 4116 subjects (1409 European-American (EA) cases with AD, 1518 EA controls, 681 African-American (AA) cases, and 508 AA controls) underwent association analysis. An additional 443 subjects underwent expression quantitative trait locus (eQTL) analysis. Genome-wide association analysis was performed in EAs to identify significant risk genes. All available markers in the genome-wide significant risk genes were tested in AAs for associations with AD, and in six HapMap populations and two European samples for associations with gene expression levels. We identified a unique genome-wide significant gene, KIAA0040, that was enriched with many replicable risk SNPs for AD, all of which had significant cis-acting regulatory effects. The distributions of −log(p) values for SNP-disease and SNP-expression associations for all markers in the TNN–KIAA0040 region were consistent across EAs, AAs, and five HapMap populations (0.369⩽r⩽0.824; 2.8 × 10^−9⩽p⩽0.032). The most significant SNPs in these populations were in high LD, concentrating in KIAA0040. Finally, expression of KIAA0040 was significantly (1.2 × 10^−11⩽p⩽1.5 × 10^−6) associated with the expression of numerous genes in the neurotransmitter systems or metabolic pathways previously associated with AD. We concluded that KIAA0040 might harbor a causal variant for AD and thus might directly contribute to risk for this disorder.
KIAA0040 might also contribute to the risk of AD via neurotransmitter systems or metabolic pathways that have previously been implicated in the pathophysiology of AD. Alternatively, KIAA0040 might regulate the risk via some interactions with flanking genes TNN and TNR. TNN is involved in neurite outgrowth and cell migration in hippocampal explants, and TNR is an extracellular matrix protein expressed primarily in the central nervous system.
risk region; alcohol dependence; cis-eQTL; GWAS; alcohol & alcoholism; neurogenetics; addiction & substance abuse; biological psychiatry
Uncovering the genetic determination of complex diseases caused by rare variants has been a research focus. As the vast majority of genomic variants represent background variation, highlighting potentially causal mutations through a weighting scheme is critical to the success of association studies aimed at rare variants. In this study, we propose a novel Bayesian marker selection approach to perform a weighting-based association test. In this approach, the individual association signal and its direction are used to weight variants. In addition, the predicted biological function of variants is taken as prior information to direct the selection of likely causal variants. Simulation studies show that the proposed method has improved power over several existing methods under certain conditions. Analyses of two empirical datasets demonstrate its applicability.
weighting; Bayesian marker selection; rare variants; association
Osteoporosis (OP) is characterized by low bone mineral density (BMD) and has strong genetic determination. However, specific genetic variants influencing BMD and contributing to the pathogenesis of osteoporosis are largely uncharacterized. Current genetic studies in the bone field, which aim to identify OP risk genes, mostly focus on the DNA, RNA, or protein level individually, and lack integrative evidence from the three levels of genetic information flow to confidently ascertain the significance of genes for osteoporosis. Our previous proteomics study discovered that superoxide dismutase 2 (SOD2) in circulating monocytes (CMCs, i.e., potential osteoclast precursors) was significantly up-regulated at the protein level in vivo in Chinese subjects with low vs. high hip BMD. Herein, at the mRNA level, we found that SOD2 gene expression was also up-regulated in CMCs (p < 0.05) in Chinese subjects with low vs. high hip BMD. At the DNA level, in 1,627 unrelated Chinese subjects, we identified eight SNPs at the SOD2 gene locus that were suggestively associated with hip BMD (peak signal at rs11968525, p = 0.048). Among the eight SNPs, three (rs7754103, rs7754295, and rs2053949) were associated with SOD2 mRNA expression level (p < 0.05), suggesting that they are expression quantitative trait loci (eQTLs) regulating SOD2 gene expression. In conclusion, the present integrative evidence from the DNA, RNA, and protein levels supports SOD2 as a susceptibility gene for osteoporosis.
Osteoporosis; SOD2; eQTL; BMD
Various types of genomic data (e.g., SNPs and mRNA transcripts) have been employed to identify risk genes for complex diseases. However, the analysis of these data has largely been performed in isolation. Combining multiple data types for integrative analysis can take advantage of complementary information and thus can have higher power to identify genes (and/or their functions) that would otherwise be impossible to identify with individual data analysis. Due to the different nature, structure, and format of diverse sets of genomic data, integrating multiple genomic data sets is challenging. Here we address the problem by developing a sparse representation based clustering (SRC) method for integrative data analysis. As an example, we applied the SRC method to the integrative analysis of 376821 SNPs in 200 subjects (100 cases and 100 controls) and expression data for 22283 genes in 80 subjects (40 cases and 40 controls) to identify significant genes for osteoporosis (OP). Comparing our results with previous studies, we identified some genes known to be related to OP risk (e.g., ‘THSD4’, ‘CRHR1’, ‘HSD11B1’, ‘THSD7A’, ‘BMPR1B’, ‘ADCY10’, ‘PRL’, ‘CA8’, ‘ESRRA’, ‘CALM1’, ‘SPARC’, and ‘LRP1’). Moreover, we uncovered novel osteoporosis susceptibility genes (‘DICER1’, ‘PTMA’, etc.) that had not been reported as OP risk genes but that, according to existing studies, play functionally important roles in osteoporosis etiology. In addition, the genes identified by the SRC method can lead to higher accuracy in the diagnosis/classification of osteoporosis subjects than those identified by the traditional t-test and Fisher's exact test, which further validates the proposed SRC approach for integrative analysis.
Many lines of evidence suggest that mitochondrial DNA (mtDNA) variants are involved in the pathogenesis of human complex diseases, especially age-related disorders. Osteoporosis is a typical age-related complex disease. However, the role of mtDNA variants in susceptibility to osteoporosis is largely unknown. In this study, we performed a mitochondria-wide association study for osteoporosis in Caucasians. A total of 445 mitochondrial single nucleotide polymorphisms (mtSNPs) were genotyped in a large sample of 2,286 unrelated Caucasian subjects by using the Affymetrix Genome-Wide SNP Array 6.0, and 72 mtSNPs survived quality control. We first tested for association between single mtSNPs and bone mineral density (BMD), and found that an mtSNP within the NADH dehydrogenase 2 gene (ND2), the mt4823 C/A polymorphism, was strongly associated with hip BMD (P = 2.05 × 10^-4), even after conservative Bonferroni correction. The C allele of mt4823 was associated with reduced hip BMD and the effect size (β) was estimated to be ~0.044. Another SNP, mt15885, within the Cytochrome b gene (Cytb) was found to be associated with both spine BMD (P = 1.66 × 10^-3) and hip BMD (P = 0.023). The T allele of mt15885 had a protective effect on spine (β = 0.064) and hip BMD (β = 0.038). Next, we classified subjects into the nine common European haplogroups and conducted association analyses. Subjects classified as haplogroup X had significantly lower mean hip BMD values than others (P = 0.040). Our results highlight the importance of mtDNA variants in influencing BMD variation and the risk of osteoporosis.
mtSNP; haplogroup; osteoporosis; BMD; association
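The Bonferroni statement in the mitochondria-wide association abstract above can be checked with simple arithmetic. The following is a minimal sketch, assuming a family-wise error rate of 0.05 and a correction across the 72 mtSNPs that passed quality control for a single phenotype; the abstract does not state either choice explicitly, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Number of mtSNPs that passed quality control (from the abstract).
N_TESTS = 72
# Assumed family-wise error rate; not stated in the abstract.
ALPHA = 0.05

threshold = bonferroni_threshold(ALPHA, N_TESTS)

# Reported p-value for mt4823 vs. hip BMD (from the abstract).
p_mt4823_hip = 2.05e-4

print(f"per-test threshold: {threshold:.2e}")
print(f"mt4823 passes: {p_mt4823_hip < threshold}")
print(f"mt4823 adjusted p: {bonferroni_adjust(p_mt4823_hip, N_TESTS):.4f}")
```

Under these assumptions the per-test threshold is 0.05 / 72, and the reported mt4823 p-value falls below it, which is consistent with the abstract's claim. A stricter correction that also counts multiple phenotypes would lower the threshold further.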
Motivation: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions.
Results: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of individual tools is summarized and tentative guidelines for optimal selection of different assembly tools, under different conditions, are provided.
Supplementary information: Supplementary data are available at Bioinformatics online.
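N50 length, one of the performance criteria named in the assembly comparison above, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below uses that common definition; the abstract does not say how the authors computed it, and the contig lengths are made-up illustrative values.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


# Illustrative contig lengths (not from the study); total length is 300.
example = [80, 70, 50, 40, 30, 20, 10]
print(n50(example))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

A larger N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why the comparison also reports sequence coverage and assembly accuracy.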
Genotype imputation is often used in the meta-analysis of genome-wide association studies (GWAS), for combining data from different studies and/or genotyping platforms, in order to improve the ability to detect disease variants with small to moderate effects. However, how genotype imputation affects the performance of the meta-analysis of GWAS is largely unknown. In this study, we investigated the effects of genotype imputation on the performance of meta-analysis through simulations based on empirical data from the Framingham Heart Study. We found that when fixed-effects models were used, considerable between-study heterogeneity was detected when causal variants were typed in only some but not all individual studies, resulting in up to ∼25% reduction of detection power. In certain situations, the power of the meta-analysis can be even less than that of individual studies. Additional analyses showed that the detection power was slightly improved when between-study heterogeneity was partially controlled through the random-effects model, relative to that of the fixed-effects model. Our study may aid in the planning, data analysis, and interpretation of GWAS meta-analysis results when genotype imputation is necessary.
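The contrast between fixed-effects and random-effects meta-analysis can be made concrete with a small sketch. It assumes inverse-variance weighting, Cochran's Q as the heterogeneity statistic, and the DerSimonian-Laird estimator for the between-study variance; the abstract does not state which estimators were used, so these are illustrative choices, and the per-study effect sizes are invented toy values rather than results from the study.

```python
import math
from typing import List, Tuple


def fixed_effects(betas: List[float], ses: List[float]) -> Tuple[float, float]:
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def cochran_q(betas: List[float], ses: List[float]) -> float:
    """Cochran's Q: weighted squared deviations from the fixed-effects mean."""
    beta_fe, _ = fixed_effects(betas, ses)
    return sum((b - beta_fe) ** 2 / se ** 2 for b, se in zip(betas, ses))


def dl_tau2(betas: List[float], ses: List[float]) -> float:
    """DerSimonian-Laird estimate of between-study variance (assumed estimator)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    c = total - sum(w ** 2 for w in weights) / total
    q = cochran_q(betas, ses)
    return max(0.0, (q - (len(betas) - 1)) / c)


def random_effects(betas: List[float], ses: List[float]) -> Tuple[float, float]:
    """Pooled estimate with between-study variance added to each study's variance."""
    tau2 = dl_tau2(betas, ses)
    weights = [1.0 / (se ** 2 + tau2) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


# Toy per-study effects: one study has a larger effect than the other two,
# mimicking a causal variant typed in one study and imputed in the others.
toy_betas = [0.30, 0.05, 0.02]
toy_ses = [0.05, 0.05, 0.05]

print("Q:", round(cochran_q(toy_betas, toy_ses), 3))
print("fixed:", fixed_effects(toy_betas, toy_ses))
print("random:", random_effects(toy_betas, toy_ses))
```

When the studies disagree, Q exceeds its degrees of freedom, the estimated between-study variance becomes positive, and the random-effects standard error widens relative to the fixed-effects one. This sketch only shows the mechanics of the two models and makes no claim about their relative power, which the study assessed by simulation.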
MicroRNAs (miRNAs) regulate posttranscriptional gene expression, usually by binding to 3'-untranslated regions (3'-UTRs) of target messenger RNAs (mRNAs). Hence genetic polymorphisms in the 3'-UTRs of mRNAs may alter the binding affinity between miRNAs and their target 3'-UTRs, thereby altering translational regulation and/or degradation of target mRNAs and leading to differential protein expression of target genes. Based on a database that catalogues predicted polymorphisms in miRNA target sites (poly-miRTSs), we selected 568 polymorphisms within 3'-UTRs of target mRNAs and performed association analyses between these selected poly-miRTSs and osteoporosis in 997 white subjects who were genotyped by Affymetrix Human Mapping 500K arrays. Initial discovery (in the 997 subjects) and replication (in 1728 white subjects) association analyses identified three poly-miRTSs (rs6854081, rs1048201, and rs7683093) in the fibroblast growth factor 2 (FGF2) gene that were significantly associated with femoral neck bone mineral density (BMD). These three poly-miRTSs serve as potential binding sites for 9 miRNAs (e.g., miR-146a and miR-146b). Further gene expression analyses demonstrated that the FGF2 gene was differentially expressed between subjects with high versus low BMD in three independent sample sets. Our initial and replication association studies and subsequent gene expression analyses support the conclusion that these three polymorphisms of the FGF2 gene may contribute to susceptibility to osteoporosis, most likely through their effects on binding affinity for specific miRNAs. © 2011 American Society for Bone and Mineral Research.
MICRORNA; OSTEOPOROSIS; ASSOCIATION; POLYMORPHISM
DNA microarray gene expression and microarray-based comparative genomic hybridization (aCGH) have been widely used for biomedical discovery. Because of the large number of genes and the complex nature of biological networks, various analysis methods have been proposed. One such method is "gene shaving," a procedure that identifies subsets of genes with coherent expression patterns and large variation across samples. Since combining genomic information from multiple sources can improve classification and prediction of diseases, in this paper we propose a new method, "ICA gene shaving" (ICA, independent component analysis), for jointly analyzing gene expression and copy number data. First, we used ICA to analyze joint measurements (gene expression and copy number) of a biological system and projected the data onto statistically independent biological processes. Next, we used these results to identify patterns of variation in the data and then applied an iterative shaving method. We investigated the properties of our proposed method by analyzing both simulated and real data. Using simulated data, we demonstrated the robustness of our method to noise. Using breast cancer data, we showed that our method is superior to the Generalized Singular Value Decomposition (GSVD) gene shaving method for identifying genes associated with breast cancer.
Clustering Technique; Comparative Genomic Hybridization (CGH); Copy Number Variation (CNV); Generalized Singular Value Decomposition (GSVD); Gene Expression; Gene Shaving; Independent Component Analysis (ICA)