The best-documented example of transmission distortion (TD) to normal offspring is provided by the t haplotypes on mouse chromosome 17. In healthy humans, TD has been described for whole chromosomes and for particular loci, but multiple comparisons have presented a statistical obstacle in wide-ranging analyses. Here we provide six high-resolution TD maps of the short arm of human chromosome 6 (Hsa6p), based on single-nucleotide polymorphism (SNP) data from 60 trio families belonging to two ethnicities that are available through the International HapMap Project. We tested all approximately 70 000 previously genotyped SNPs within Hsa6p by the transmission disequilibrium test. TagSNP selection followed by permutation testing was performed to adjust for multiple testing. Statistically significant evidence for TD was observed among male parents of European ancestry, due to strong and wide-ranging skewed segregation in a 730-kb region containing the transcription factor-encoding genes SUPT3H and RUNX2, as well as the microRNA locus MIRN586. We also observed that this chromosomal segment coincides with pronounced linkage disequilibrium (LD), suggesting a relationship between TD and LD. The fact that TD may be taking place in samples not selected for a genetic disease implies that linkage studies must be assessed with particular caution in chromosomal segments with evidence of TD.
transmission distortion; linkage disequilibrium; human chromosome 6p; SUPT3H; MIRN586; RUNX2
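A minimal sketch of the transmission disequilibrium test statistic mentioned above, assuming the usual McNemar-type form for a biallelic marker: among heterozygous parents, b transmit one allele and c transmit the other, and (b − c)²/(b + c) is compared with a chi-square distribution with 1 degree of freedom. The function names and the counts are illustrative and are not taken from the study; tagSNP selection and permutation testing are not shown.

```python
import math


def tdt_statistic(b: int, c: int) -> float:
    """McNemar-type TDT statistic for one biallelic marker.

    b: heterozygous parents transmitting allele A
    c: heterozygous parents transmitting allele a
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


def chi2_1df_pvalue(x: float) -> float:
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, P(Z^2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: 40 transmissions of A versus 20 of a.
stat = tdt_statistic(40, 20)
p_value = chi2_1df_pvalue(stat)
print(f"TDT chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Because roughly 70 000 SNPs are tested, a nominal p-value from this statistic would still need the multiple-testing adjustment described in the abstract.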
In order to assess whether gene expression variability could be influenced by several SNPs acting in cis, either through additive or more complex haplotype effects, a systematic genome-wide search for cis haplotype expression quantitative trait loci (eQTL) was conducted in a sample of 758 individuals, part of the Cardiogenics Transcriptomic Study, for which genome-wide monocyte expression and GWAS data were available. A total of 19,805 RNA probes were assessed for cis haplotypic regulation through investigation of ∼2.1×10^9 haplotypic combinations. Of these, 2,650 probes demonstrated haplotypic p-values more than 10^4-fold smaller than the best single-SNP p-value. Replication of significant haplotype effects was tested for 412 probes for which SNPs (or proxies) that defined the detected haplotypes were available in the Gutenberg Health Study, composed of 1,374 individuals. At the Bonferroni correction level of 1.2×10^−4 (∼0.05/412), 193 haplotypic signals replicated. 1000G imputation was then conducted, and 105 haplotypic signals still remained more informative than imputed SNPs. In-depth analysis of these 105 cis eQTL revealed that at 76 loci genetic associations were compatible with additive effects of several SNPs, while for the 29 remaining regions data could be compatible with a more complex haplotypic pattern. As 24 of the 105 cis eQTL have previously been reported to be disease-associated loci, this work highlights the need for conducting haplotype-based and 1000G imputed cis eQTL analysis before commencing functional studies at disease-associated loci.
In order to assess whether gene expression variability could be influenced by the presence of more than one cis-acting SNP, we have conducted a systematic genome-wide search for haplotypic cis eQTL effects in a sample of 758 individuals and replicated the findings in an independent sample of 1,374 subjects. In both studies, genome-wide monocyte expression and genotype data were available. We identified 105 genes whose monocyte expression was under the influence of multiple cis-acting SNPs. About 75% of the detected genetic effects were related to independent additive SNP effects and the remaining quarter to more complex haplotype effects. Of note, 24 of the genes identified as affected by multiple cis eSNPs have previously been reported to reside at disease-associated loci. This suggests that such multiple locus-specific genetic effects could contribute to the susceptibility to human diseases.
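The two numerical criteria in the haplotype eQTL abstract can be written out as a short sketch: the Bonferroni replication threshold of 0.05/412, and the screening rule that a haplotypic p-value must be more than 10^4-fold smaller than the best single-SNP p-value. The function names and the example p-values are hypothetical and only illustrate the arithmetic.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


def is_more_informative(haplotype_p: float, best_snp_p: float,
                        fold: float = 1e4) -> bool:
    """True if the haplotypic p-value is more than `fold` times smaller
    than the best single-SNP p-value (fold = 1e4 follows the abstract)."""
    return haplotype_p * fold < best_snp_p


threshold = bonferroni_threshold(0.05, 412)
print(f"{threshold:.1e}")  # 1.2e-04

# Hypothetical probe: haplotype p = 1e-12, best single SNP p = 1e-7.
print(is_more_informative(1e-12, 1e-7))  # True
```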
Like human infants, songbirds learn their species-specific vocalizations through imitation learning. The birdsong system has emerged as a widely used experimental animal model for understanding the underlying neural mechanisms responsible for vocal production learning. However, how neural impulses are translated into the precise motor behavior of the complex vocal organ (syrinx) to create song is poorly understood. First and foremost, we lack a detailed understanding of syringeal morphology.
To fill this gap we combined non-invasive (high-field magnetic resonance imaging and micro-computed tomography) and invasive techniques (histology and micro-dissection) to construct the annotated high-resolution three-dimensional dataset, or morphome, of the zebra finch (Taeniopygia guttata) syrinx. We identified and annotated syringeal cartilage, bone and musculature in situ in unprecedented detail. We provide interactive three-dimensional models that greatly improve the communication of complex morphological data and our understanding of syringeal function in general.
Our results show that the syringeal skeleton is optimized for low weight driven by physiological constraints on song production. The present refinement of muscle organization and identity elucidates how apposed muscles actuate different syringeal elements. Our dataset allows for more precise predictions about muscle co-activation and synergies and has important implications for muscle activity and stimulation experiments. We also demonstrate how the syrinx can be stabilized during song to reduce mechanical noise and, as such, enhance repetitive execution of stereotypic motor patterns. In addition, we identify a cartilaginous structure suited to play a crucial role in the uncoupling of sound frequency and amplitude control, which permits a novel explanation of the evolutionary success of songbirds.
Microarray profiling of gene expression is widely applied in molecular biology and functional genomics. Experimental and technical variations make meta-analysis of different studies challenging. In a total of 3358 samples, all from German population-based cohorts, we investigated the effect of data preprocessing and the variability due to sample processing in whole blood cell and blood monocyte gene expression data, measured on the Illumina HumanHT-12 v3 BeadChip array.
Gene expression signal intensities were similar after applying the log2 or the variance-stabilizing transformation. In all cohorts, the first principal component (PC) explained more than 95% of the total variation. Technical factors substantially influenced signal intensity values, especially the Illumina chip assignment (33–48% of the variance), the RNA amplification batch (12–24%), the RNA isolation batch (16%), and the sample storage time, in particular the time between blood donation and RNA isolation for the whole blood cell samples (2–3%), and the time between RNA isolation and amplification for the monocyte samples (2%). White blood cell composition parameters were the strongest biological factors influencing the expression signal intensities in the whole blood cell samples (3%), followed by sex (1–2%) in both sample types. Known single nucleotide polymorphisms (SNPs) were located in 38% of the analyzed probe sequences and 4% of them included common SNPs (minor allele frequency >5%). Out of the tested SNPs, 1.4% significantly modified the probe-specific expression signals (Bonferroni corrected p-value<0.05), but in almost half of these events the signal intensities were even increased despite the occurrence of the mismatch. Thus, the vast majority of SNPs within probes had no significant effect on hybridization efficiency.
In summary, adjustment for a few selected technical factors greatly improved reliability of gene expression analyses. Such adjustments are particularly required for meta-analyses.
Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research.
To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics.
On the occasion of the 50th year of Methods of Information in Medicine, a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field.
The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology.
Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers.
Biomedical informatics; data mining; data analysis; data-driven methods; translational bioinformatics
We aimed to assess whether pri-miRNA SNPs (miSNPs) could influence monocyte gene expression, either through marginal association or by interacting with polymorphisms located in 3'UTR regions (3utrSNPs). We therefore conducted a genome-wide search for marginal miSNP effects and pairwise miSNP × 3utrSNP interactions in a sample of 1,467 individuals for which genome-wide monocyte expression and genotype data were available. Statistical associations that survived multiple testing correction were tested for replication in an independent sample of 758 individuals with both monocyte gene expression and genotype data. In both studies, the hsa-mir-1279 rs1463335 was found to modulate in cis the expression of LYZ and in trans the expression of CNTN6, CTRC, COPZ2, KRT9, LRRFIP1, NOD1, PCDHA6, ST5 and TRAF3IP2 genes, supporting the role of hsa-mir-1279 as a regulator of several genes in monocytes. In addition, we identified two robust miSNP × 3utrSNP interactions, one involving HLA-DPB1 rs1042448 and hsa-mir-219-1 rs107822, the other H1F0 rs1894644 and hsa-mir-659 rs5750504, modulating the expression of the associated genes.
As some of the aforementioned genes have previously been reported to reside at disease-associated loci, our findings provide novel arguments supporting the hypothesis that the genetic variability of miRNAs could also contribute to the susceptibility to human diseases.
After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-012-1194-y) contains supplementary material, which is available to authorized users.
Biomarkers are of increasing importance for personalized medicine, with applications including diagnosis, prognosis, and selection of targeted therapies. Their use is extremely diverse, ranging from pharmacodynamics to treatment monitoring. Following a concise review of terminology, we provide examples and current applications of three broad categories of biomarkers: DNA biomarkers, DNA tumor biomarkers, and other general biomarkers. We outline clinical trial phases for identifying and validating diagnostic and prognostic biomarkers. Predictive biomarkers, more generally termed companion diagnostic tests, predict treatment response in terms of efficacy and/or safety. We consider the suitability of clinical trial designs for predictive biomarkers, including a detailed discussion of validation study designs, with emphasis on interpretation of study results. We specifically discuss the interpretability of treatment effects if a large set of DNA biomarker profiles is available and the number of therapies is identical to the number of different profiles.
Evidence exists that motor observation activates the same cortical motor areas that are involved in the performance of the observed actions. The so-called “mirror neuron system” has been proposed to be responsible for this phenomenon. We employ this neural system and its capability to re-enact stored motor representations as a tool for rehabilitating motor control. In our new neurorehabilitative schema (videotherapy) we combine observation of daily actions with concomitant physical training of the observed actions, focusing on the upper limbs. Following a pilot study in chronic patients in an ambulatory setting, we have now designed a new multicenter clinical study dedicated to patients in the sub-acute state after stroke using home-based, self-induced training. Within our protocol we assess 1) the capability of action observation to elicit rehabilitational effects in the motor system, and 2) the capacity of this schema to be performed by patients without assistance from a physiotherapist. The results of this study would be of high health and economic relevance.
A controlled, randomized, multicenter, parallel-group study with 6-month follow-up will be conducted on three groups of patients: one group will be given the experimental treatment whereas the other two will participate in control treatments. All patients will undergo their usual rehabilitative treatment in addition to participation in the study. The experimental condition consists of the observation and immediate imitation of common daily hand and arm actions. The two parallel control groups are a placebo group and a group receiving usual rehabilitation without any trial-related treatment. Trial randomization is provided via external data management. The primary efficacy endpoint is the improvement of the experimental group in a standardized motor function test (Wolf Motor Function Test) relative to control groups. Further assessments refer to subjective and qualitative rehabilitational scores. This study has been reviewed and approved by the ethics committee of Aachen University.
This therapy provides an extension of therapeutic procedures for recovery after stroke and emphasizes the importance of action perception in neurorehabilitation. The results of the study could be implemented in general physiotherapeutic practice, for example as an individualized add-on therapy.
To examine the association of polymorphisms in ATM (codon 158), GSTP1 (codon 105), SOD2 (codon 16), TGFB1 (position −509), XPD (codon 751), and XRCC1 (codon 399) with the risk of severe erythema after breast conserving radiotherapy.
Methods and materials
Retrospective analysis of 83 breast cancer patients treated with breast conserving radiotherapy. A total dose of 50.4 Gy was administered, applying 1.8 Gy/fraction within 42 days. Erythema was evaluated according to the Radiation Therapy Oncology Group (RTOG) score. DNA was extracted from blood samples and polymorphisms were determined using either the polymerase chain reaction-based restriction fragment length polymorphism (PCR-RFLP) technique or matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF). Relative excess heterozygosity (REH) was investigated to check compatibility of genotype frequencies with Hardy-Weinberg equilibrium (HWE). In addition, p-values from the standard exact HWE lack-of-fit test were calculated using 100,000 permutations. HWE analyses were performed using R.
Fifty-six percent (46/83) of all patients developed erythema of grade 2 or 3, with this risk being higher for patients with large breast volume (odds ratio, OR = 2.55, 95% confidence interval, CI: 1.03–6.31, p = 0.041). No significant association between SNPs and risk of erythema was found when all patients were considered. However, in patients with small breast volume the TGFB1 SNP was associated with erythema (p = 0.028), whereas the SNP in XPD showed an association in patients with large breast volume (p = 0.046). A risk score based on all risk alleles was neither significant in all patients nor in patients with small or large breast volume. Risk alleles of most SNPs were different compared to a previously identified risk profile for fibrosis.
The genetic risk profile for erythema appears to differ between patients with small and large breast volume. This risk profile seems to be specific for erythema as compared to a risk profile for fibrosis.
Single nucleotide polymorphisms (SNPs); Erythema; Breast cancer; Radiotherapy
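The relative excess heterozygosity check named in the methods can be illustrated with a small sketch. It assumes the commonly used definition REH = p_AB / (2·sqrt(p_AA·p_BB)), which equals 1 under Hardy-Weinberg equilibrium; the abstract does not spell out the formula, and the study's analyses were run in R, so both the definition and the Python function below are assumptions made for illustration. The genotype counts are hypothetical.

```python
import math


def relative_excess_heterozygosity(n_aa: int, n_ab: int, n_bb: int) -> float:
    """REH estimated from genotype counts (assumed definition, see lead-in).

    REH = p_AB / (2 * sqrt(p_AA * p_BB)); the sample size cancels, so raw
    counts can be used directly. Values above 1 indicate an excess of
    heterozygotes, values below 1 a deficit.
    """
    if n_aa <= 0 or n_bb <= 0:
        raise ValueError("both homozygote counts must be positive")
    return n_ab / (2.0 * math.sqrt(n_aa * n_bb))


# Hypothetical counts in exact HWE proportions (p = 0.6, n = 100).
print(relative_excess_heterozygosity(36, 48, 16))  # 1.0
# Hypothetical counts with a heterozygote deficit.
print(round(relative_excess_heterozygosity(30, 40, 30), 3))  # 0.667
```

The exact lack-of-fit test with 100,000 permutations mentioned in the methods is a separate step and is not reproduced here.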
Genome-wide association studies have successfully elucidated the genetic background of complex diseases, but X chromosomal data have usually not been analyzed. A reason for this is that there is no consensus approach for the analysis taking into account the specific features of X chromosomal data. This contribution evaluates test statistics proposed for X chromosomal markers regarding type I error frequencies and power.
We performed extensive simulation studies covering a wide range of different settings. Besides characteristics of the general population, we investigated sex-balanced or unbalanced sampling procedures as well as sex-specific effect sizes, allele frequencies and prevalence. Finally, we applied the test statistics to an association data set on Crohn's disease.
Simulation results imply that in addition to standard quality control, sex-specific allele frequencies should be checked to control for type I errors. Furthermore, we observed distinct differences in power between test statistics which are determined by sampling design and sex specificity of effect sizes. Analysis of the Crohn's disease data detects two previously unknown genetic regions on the X chromosome.
Although no test is uniformly most powerful under all settings, recommendations are offered as to which test performs best under certain conditions.
Crohn's disease; Genetic association; Genome-wide association; Sex specific; X chromosome
Detection of epistatic interaction between loci has been postulated to provide a more in-depth understanding of the complex biological and biochemical pathways underlying human diseases. Studying the interaction between two loci is the natural progression following traditional and well-established single locus analysis. However, the added costs and time duration required for the computation involved have thus far deterred researchers from pursuing a genome-wide analysis of epistasis. In this paper, we propose a method allowing such analysis to be conducted very rapidly. The method, dubbed EPIBLASTER, is applicable to case–control studies and consists of a two-step process in which the difference in Pearson's correlation coefficients is computed between controls and cases across all possible SNP pairs as an indication of significant interaction warranting further analysis. For the subset of interactions deemed potentially significant, a second-stage analysis is performed using the likelihood ratio test from the logistic regression to obtain the P-value for the estimated coefficients of the individual effects and the interaction term. The algorithm is implemented using the parallel computational capability of commercially available graphical processing units to greatly reduce the computation time involved. In the current setup and example data sets (211 cases, 222 controls, 299468 SNPs; and 601 cases, 825 controls, 291095 SNPs), this coefficient evaluation stage can be completed in roughly 1 day. Our method allows for exhaustive and rapid detection of significant SNP pair interactions without imposing significant marginal effects of the single loci involved in the pair.
Epistasis; genome-wide interaction analysis; graphical processing unit
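The first stage described for EPIBLASTER, scoring every SNP pair by the difference in Pearson correlation between cases and controls, can be sketched in plain Python. This is a toy CPU version under stated assumptions: genotypes are coded 0/1/2, a monomorphic SNP is given correlation 0, and the function names, the `top_k` cut-off and the data are hypothetical. The published method runs on graphics processing units, and the second-stage logistic regression likelihood ratio test is not shown.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: undefined correlation is treated as 0
    return sxy / math.sqrt(sxx * syy)


def screen_pairs(cases, controls, top_k):
    """Rank SNP pairs by |r_cases - r_controls| and return the top_k.

    cases, controls: lists of SNP columns; each column holds one genotype
    (0, 1 or 2) per individual in that group.
    """
    scored = []
    for i, j in combinations(range(len(cases)), 2):
        diff = pearson(cases[i], cases[j]) - pearson(controls[i], controls[j])
        scored.append((abs(diff), i, j))
    scored.sort(reverse=True)
    return [(i, j, d) for d, i, j in scored[:top_k]]


# Toy data: SNPs 0 and 1 are positively correlated in cases and
# negatively correlated in controls; SNP 2 is unrelated to both.
cases = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [1, 0, 2, 2, 0, 1]]
controls = [[0, 1, 2, 0, 1, 2], [2, 1, 0, 2, 1, 0], [1, 0, 2, 2, 0, 1]]
print(screen_pairs(cases, controls, top_k=1))
```

With roughly 300,000 SNPs the number of pairs is on the order of 4.5×10^10, which is why the exhaustive screen is parallelized in the original work.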
Lymphadenectomy is performed to assess patient prognosis and to prevent metastasis. Recently, it was questioned whether lymph node metastases are capable of metastasizing and, therefore, whether lymphadenectomy is still adequate. We evaluated whether the nodal status affects the occurrence of distant metastases by analyzing a highly selected cohort of colon cancer patients.
1,395 patients underwent surgery exclusively for colon cancer at the University of Lübeck between 01/1993 and 12/2008. The following exclusion criteria were applied: synchronous metastasis, R1-resection, prior/synchronous second carcinoma, age < 50 years, positive family history, inflammatory bowel disease, FAP, HNPCC, and follow-up < 5 years. The remaining 421 patients were divided into groups with (TM+, n = 75) or without (TM-, n = 346) the occurrence of metastasis throughout a 5-year follow-up.
Five-year survival rates for TM+ and TM- were 21% and 73%, respectively (p < 0.0001). Survival rates differed significantly for N0 vs. N2, grading 2 vs. 3, UICC-I vs. -II and UICC-I vs. -III (p < 0.05). Regression analysis revealed that higher age at diagnosis, increasing N-category and increasing T-category significantly affected recurrence-free survival, while increasing N- and T-category were significant parameters for the risk of developing metastases within 5 years after surgery (HR 1.97 and 1.78; p < 0.0001).
Besides a higher T-category, a positive N-stage independently implies a higher probability of developing distant metastases and correlates with poor survival. Our data thus show a prognostic relevance of lymphadenectomy, which should therefore be retained until conclusive studies show lymphadenectomy to be unimportant.
Colon cancer; Lymph nodes; Metastasis; Prognosis; Survival; Recurrence free survival; Regression analysis
The crustacean cuticle consists of a complex organic matrix and a mineral phase. The physical and chemical properties of the cuticle are correlated with the specific functions of cuticular elements, leading to a large variety in its structure and composition. Investigation of the structure-function relationship in crustacean cuticle requires sophisticated methodological tools for the analysis of different aspects of the cuticular architecture. In the present paper we report improved preparation methods that, in combination with various electron microscopic techniques, have led to new insights into cuticle structure and composition in the tergite cuticle of Porcellio scaber. We used thin sections of non-decalcified tergites and decalcified resin-embedded material for transmission electron microscopy and scanning transmission electron microscopy. Etched sagittal planes of bulk tergite samples were analysed with field emission scanning electron microscopy. We have found a distinct distal region within the exocuticle that differs from the subjacent proximal exocuticle in the arrangement of fibres. Within this distal exocuticle, chitin-protein fibrils assemble into fibres with diameters between 15 and 50 nm that are embedded in a mineral matrix. In the proximal exocuticle and the endocuticle, fibrils do not assemble into fibres and are individually surrounded by mineral. Furthermore, we show that the pore canals are filled with mineral, and demonstrate that mild etching of polished sagittal cuticle surfaces reveals regions containing mineral of diverse solubility.
Isopoda; cuticle; ultrastructure; Porcellio scaber
Background: The incidence of therapy-related acute leukaemia (TRAL) in mitoxantrone treatment in multiple sclerosis (MS) remains controversial.
Methods and results: In a retrospective meta-analysis from six centres, we observed six cases of acute myeloid leukaemia (AML) (incidence 0.41% for patients with mean follow-up after end of treatment of 3.6 years, n = 1,156; incidence 0.25% for all patients, n = 2,261). Potential influencing factors such as myelotoxic or glucocorticosteroid pretreatment/cotreatment were present in all but one case of TRAL. Between 1990 and 2010, 11 cases of TRAL were reported to the Drug Commission of the German Medical Association (estimated risk of 0.09–0.13%).
Conclusions: Regional differences in reported TRAL incidence may point to confounding cofactors such as administration protocols and cotreatments.
escalation therapy; leukaemia; mitoxantrone; multiple sclerosis; risk profile
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
1000 Genomes Project; association; collapsing methods; next-generation sequencing
With the advent of novel sequencing technologies, interest in the identification of rare variants that influence common traits has increased rapidly. Standard statistical methods, such as the Cochran-Armitage trend test or logistic regression, fail in this setting for the analysis of unrelated subjects because of the rareness of the variants. Recently, various alternative approaches have been proposed that circumvent the rareness problem by collapsing rare variants in a defined genetic region or sets of regions. We provide an overview of these collapsing methods for association analysis and discuss the use of permutation approaches for significance testing of the data-adaptive methods.
association; collapsing methods; collection of rare variants; common disease/rare variant hypothesis; contingency table; generalized linear model; next-generation sequencing; pooling methods
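To make the idea of collapsing concrete, the following sketch reduces all rare variants in a region to one carrier indicator per subject and assesses the case-control difference in carrier frequency by permuting disease status. It is one simple illustrative variant, not a specific method from the overview; the MAF threshold of 0.01, the test statistic, the function names and the toy data are all assumptions.

```python
import random


def collapse(genotypes, maf, maf_threshold=0.01):
    """Return 1 per subject carrying at least one rare minor allele, else 0.

    genotypes: one list of minor-allele counts per subject (one per variant)
    maf: minor allele frequency of each variant
    """
    rare = [k for k, f in enumerate(maf) if f < maf_threshold]
    return [int(any(g[k] > 0 for k in rare)) for g in genotypes]


def carrier_diff(carrier, status):
    """Carrier frequency in cases (status 1) minus controls (status 0)."""
    cases = [c for c, s in zip(carrier, status) if s == 1]
    controls = [c for c, s in zip(carrier, status) if s == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_pvalue(carrier, status, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the carrier-frequency difference."""
    rng = random.Random(seed)
    observed = abs(carrier_diff(carrier, status))
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case-control labels
        if abs(carrier_diff(carrier, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Toy region with three variants; the third is common and is ignored.
genotypes = [[1, 0, 0], [0, 0, 2], [0, 1, 0], [0, 0, 0]]
maf = [0.005, 0.004, 0.2]
carrier = collapse(genotypes, maf)
print(carrier)  # [1, 0, 1, 0]
print(permutation_pvalue(carrier, [1, 0, 1, 0], n_perm=200))
```

Permuting the labels keeps the genotype correlation structure intact, which is the reason permutation is attractive for data-adaptive collapsing statistics whose null distribution is not known in closed form.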
Genetic Analysis Workshop 17 (GAW17) focused on the transition from genome-wide association study designs and methods to the study designs and statistical genetic methods that will be required for the analysis of next-generation sequence data including both common and rare sequence variants. In the 166 contributions to GAW17, a wide variety of statistical methods were applied to simulated traits in population- and family-based samples, and results from these analyses were compared to the known generating model. In general, many of the statistical genetic methods used in the population-based sample identified causal sequence variants (SVs) when the estimated locus-specific heritability, as measured in the population-based sample, was greater than about 0.08. However, SVs with locus-specific heritabilities less than 0.03 were rarely identified consistently. In the family-based samples, many of the methods detected SVs that were rarer than those detected in the population-based sample, but the estimated locus-specific heritabilities for these rare SVs, as measured in the family-based samples, were substantially higher (>0.2) than their corresponding heritabilities in the population-based samples. Substantial inflation of the type I error rate was observed across a wide variety of statistical methods. Although many of the contributions found little inflation in type I error for Q4, a trait with no causal SVs, type I error rates for Q1 and Q2 were well above their nominal levels with the inflation for Q1 being higher than that for Q2. It seems likely that this inflation in type I error is due to correlations among SVs.
linkage; association; next-generation sequencing; computer simulation
One major expectation from the transcriptome in humans is to characterize the biological basis of associations identified by genome-wide association studies. So far, few cis expression quantitative trait loci (eQTLs) have been reliably related to disease susceptibility. Trans-regulating mechanisms may play a more prominent role in disease susceptibility. We analyzed 12,808 genes detected in at least 5% of circulating monocyte samples from a population-based sample of 1,490 unrelated European subjects. We applied a method for extracting expression patterns, independent component analysis, to identify sets of co-regulated genes. These patterns were then related to 675,350 SNPs to identify major trans-acting regulators. We detected three genomic regions significantly associated with co-regulated gene modules. Association of these loci with multiple expression traits was replicated in Cardiogenics, an independent study in which expression profiles of monocytes were available in 758 subjects. The locus 12q13 (lead SNP rs11171739), previously identified as a type 1 diabetes locus, was associated with a pattern including two cis eQTLs, RPS26 and SUOX, and 5 trans eQTLs, one of which (MADCAM1) is a potential candidate for mediating T1D susceptibility. The locus 12q24 (lead SNP rs653178), which has demonstrated extensive disease pleiotropy, including type 1 diabetes, hypertension, and celiac disease, was associated with a pattern strongly correlated with blood pressure level. The strongest trans eQTL in this pattern was CRIP1, a known marker of cellular proliferation in cancer. The locus 12q15 (lead SNP rs11177644) was associated with a pattern driven by two cis eQTLs, LYZ and YEATS4, and including 34 trans eQTLs, several of them tumor-related genes.
This study shows that a method exploiting the structure of co-expressions among genes can help identify genomic regions involved in trans regulation of sets of genes and can provide clues for understanding the mechanisms linking genome-wide association loci to disease.
One major expectation from the transcriptome in humans is to help characterize the biological basis of associations identified by genome-wide association studies. Here, we take advantage of recent technical and methodological advances to examine the influence of natural genetic variability on >12,000 genes expressed in the monocyte, a blood cell playing a key role in immunity-related disorders and atherosclerosis. By examining 1,490 European population-based subjects, we identify three regions of the genome reproducibly associated with specific patterns of gene expression. Two of these regions overlap genetic variants previously known to be involved in the susceptibility to type 1 diabetes, celiac disease, and hypertension. Genes whose expression is modulated by these genetic variants may act as mediators in the causal relationship linking the variability of the genome to complex disease. These findings illustrate how integration of genetic and transcriptomic data at an epidemiological scale can help decipher the genetic basis of complex diseases.
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
Novel technologies allow sequencing of whole genomes and are an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain part of the genetic basis of disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Within each gene, variants are collapsed under different scenarios defined by minor allele frequency (MAF) thresholds and by variant functionality. To compare the approaches, we consider the family-wise error rate and the power. Most of the methods maintained the nominal type I error level well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
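The collapsing idea can be illustrated with a minimal sketch. It is not one of the five approaches compared in the abstract, which are not named there; it assumes the simplest carrier-indicator scheme, in which a subject counts as a carrier if they have at least one minor allele at any variant in the gene below the MAF threshold, followed by a two-sided Fisher exact test on the resulting 2x2 table. The function names and the default threshold of 0.01 are hypothetical choices made for illustration.

```python
from math import comb


def collapse_carriers(genotypes, mafs, maf_threshold):
    """Return 1 for each subject carrying >= 1 minor allele at any rare variant.

    genotypes: one row per subject, minor-allele counts (0/1/2) per variant.
    mafs: minor allele frequency of each variant (same order as the columns).
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def collapsing_test(genotypes, mafs, status, maf_threshold=0.01):
    """Carrier-indicator collapsing test; status is 1 for cases, 0 for controls."""
    carriers = collapse_carriers(genotypes, mafs, maf_threshold)
    a = sum(1 for c, s in zip(carriers, status) if s == 1 and c == 1)  # case carriers
    b = sum(1 for c, s in zip(carriers, status) if s == 1 and c == 0)  # case non-carriers
    c_ = sum(1 for c, s in zip(carriers, status) if s == 0 and c == 1)  # control carriers
    d = sum(1 for c, s in zip(carriers, status) if s == 0 and c == 0)  # control non-carriers
    return fisher_exact_two_sided(a, b, c_, d)
```

Changing `maf_threshold`, or restricting the variant columns to a functional class before calling `collapsing_test`, gives the kind of per-gene scenarios the abstract describes.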
The diagnostic accuracy of a genetic test involving multiple disease genes is evaluated using sensitivity and specificity. Estimating both requires data from affected and unaffected subjects. For early-onset diseases such as autism spectrum disorder, only data from families with affected offspring are available. To enable estimation of specificity when no data for unaffected offspring are available (single affected offspring, SAO, data), we combine the pseudocontrol method of Cordell and Clayton [2002 Am J Hum Genet 70:124-41] with the approach of DeLong et al. [1985 Biometrics 41:947-58] in a logistic regression model for disease outcome, with a risk score (RS) constructed from genotype information as the prognostic variable. The area under the receiver operating characteristic curve (AUC) is then computed using the nonparametric Mann-Whitney method. Extensive simulation studies show that, analogously to other approaches utilizing pseudocontrols, the resulting estimates of AUC using SAO data are slightly conservative when compared to estimates computed using the full population-based data. The method is illustrated using data from a study of autism spectrum disorder.
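The nonparametric Mann-Whitney estimate of the AUC mentioned above can be sketched in a few lines. The sketch assumes that a risk score has already been computed for each affected offspring and for each pseudocontrol; it does not construct pseudocontrols or fit the logistic regression model, and it ignores the matching of cases to their own pseudocontrols. The function name is hypothetical.

```python
def mann_whitney_auc(case_scores, control_scores):
    """Estimate the AUC as P(case score > control score) + 0.5 * P(tie).

    case_scores: risk scores of affected subjects.
    control_scores: risk scores of unaffected subjects or pseudocontrols.
    """
    if not case_scores or not control_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for x in case_scores:
        for y in control_scores:
            if x > y:
                wins += 1.0   # case ranked above control
            elif x == y:
                wins += 0.5   # ties count as half
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to a risk score that does not discriminate between the two groups, and 1.0 to perfect separation.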
Family data; genetic profile; diagnostic test; transmission disequilibrium test
The hypothesis of dosage compensation of genes of the X chromosome, supported by previous microarray studies, was recently challenged by RNA-sequencing data. It was suggested that microarray studies were biased toward an over-estimation of X-linked expression levels as a consequence of the filtering of genes below the detection threshold of microarrays.
To investigate this hypothesis, we used microarray expression data from circulating monocytes in 1,467 individuals. In total, 25,349 and 1,156 probes were unambiguously assigned to autosomes and the X chromosome, respectively. Globally, X-linked expression was clearly shifted toward lower levels than autosomal expression. We compared the ratio of expression levels of X-linked to autosomal transcripts (X∶AA) using two different filtering methods: (1) genes were filtered out using a detection threshold irrespective of chromosomal location (the standard method in microarrays); (2) equal proportions of genes were filtered out separately on the X chromosome and on the autosomes. For a wide range of filtering proportions, the X∶AA ratio estimated with the first method was not significantly different from 1, the value expected if dosage compensation were achieved, whereas it was significantly lower than 1 with the second method, leading to the rejection of the hypothesis of dosage compensation. We further showed in simulated data that the choice of the most appropriate method depended on biological assumptions regarding the proportion of actively expressed genes on the X chromosome relative to the autosomes and the extent of dosage compensation.
This study shows that the method used for filtering out lowly expressed genes in microarrays may have a major impact, depending on the hypothesis investigated. The hypothesis of dosage compensation of X-linked genes cannot be firmly accepted or rejected using microarray-based data.
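The difference between the two filtering methods can be illustrated with a small sketch. The abstract does not state how the X∶AA ratio was estimated, so the sketch assumes the ratio of median expression levels on a linear scale; the function names and the toy values in the usage note are hypothetical.

```python
from statistics import median


def ratio_common_threshold(x_expr, auto_expr, threshold):
    """Method 1: apply one detection threshold to all genes, then take X:AA."""
    x_kept = [v for v in x_expr if v >= threshold]
    a_kept = [v for v in auto_expr if v >= threshold]
    return median(x_kept) / median(a_kept)


def drop_lowest(values, proportion):
    """Remove the given proportion of lowest-expressed genes from one group."""
    n_drop = int(len(values) * proportion)
    return sorted(values)[n_drop:]


def ratio_equal_proportion(x_expr, auto_expr, proportion):
    """Method 2: drop the same proportion separately on X and on autosomes."""
    x_kept = drop_lowest(x_expr, proportion)
    a_kept = drop_lowest(auto_expr, proportion)
    return median(x_kept) / median(a_kept)
```

With the invented values `x = [1, 2, 9, 11]` and `auto = [3, 8, 9, 10, 10, 11, 12, 12]`, a common threshold of 5 removes half of the X-linked genes but only one autosomal gene, while dropping 25% from each group keeps the low X-linked value 2 in the comparison. This is the mechanism by which a common threshold can raise the estimated ratio when X-linked expression is shifted toward lower levels.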