A key component of genetic architecture is the allelic spectrum influencing trait variability. For autism spectrum disorder (henceforth autism) the nature of its allelic spectrum is uncertain. Individual risk genes have been identified from rare variation, especially de novo mutations1–8. From this evidence one might conclude that rare variation dominates its allelic spectrum, yet recent studies show that common variation, individually of small effect, has substantial impact en masse9,10. At issue is how much of an impact relative to rare variation. Using a unique epidemiological sample from Sweden, novel methods that distinguish total narrow-sense heritability from that due to common variation, and by synthesizing results from other studies, we reach several conclusions about autism’s genetic architecture: its narrow-sense heritability is ≈54% and most traces to common variation; rare de novo mutations contribute substantially to individuals’ liability; still their contribution to variance in liability, 2.6%, is modest compared to heritable variation.
Predicting the prognosis of prostate cancer through gene expression analysis is receiving increasing interest. In many cases, such analyses are based on formalin-fixed, paraffin-embedded (FFPE) core needle biopsy material on which Gleason grading for diagnosis has been conducted. Since each patient typically has multiple biopsy samples, and since Gleason grading is an operator-dependent procedure known to be difficult, the impact of the operator's choice of biopsy was evaluated.
Multiple biopsy samples from 43 patients were evaluated using a previously reported gene signature of IGFBP3, F3 and VGLL3 with potential prognostic value in estimating overall survival at diagnosis of prostate cancer. A four multiplex one-step qRT-PCR test kit, designed and optimized for measuring the signature in FFPE core needle biopsy samples, was used. Concordance of gene expression levels between primary and secondary Gleason tumor patterns, as well as benign tissue specimens, was analyzed.
The gene expression levels of IGFBP3 and F3 in prostate cancer epithelial cell-containing tissue representing the primary and secondary Gleason patterns were high and consistent, while the low expressed VGLL3 showed more variation in its expression levels.
The assessment of IGFBP3 and F3 gene expression levels in prostate cancer tissue is independent of Gleason patterns, meaning that the impact of the operator's choice of biopsy is low.
Aspirin (ASA) use has been associated with improved breast cancer survival in several prospective studies.
We conducted a nested case–control study of ASA use after a breast cancer diagnosis among women using Swedish National Registries. We assessed prospectively recorded ASA exposure during several different time windows following cancer diagnosis using conditional logistic regression with breast cancer death as the main outcome. Within each six-month period of follow-up, we categorized dispensed ASA doses into three groups: 0, less than 1, and 1 or more daily doses.
We included 27,426 women diagnosed with breast cancer between 2005 and 2009; 1,661 died of breast cancer when followed until Dec 31, 2010. There was no association between ASA use and breast cancer death when exposure was assessed either shortly after diagnosis, or 3–12 months before the end of follow-up. Only during the period 0–6 months before the end of follow-up was ASA use at least daily compared with non-use associated with a decreased risk of breast cancer death: HR (95% CI) = 0.69 (0.56-0.86). However, in the same time-frame, those using ASA less than daily had an increased risk of breast cancer death: HR (95% CI) = 1.43 (1.09-1.87).
Contrary to other studies, we did not find that ASA use was associated with a lower risk of death from breast cancer, except when assessed short term with no delay to death/end of follow-up, which may reflect discontinuation of ASA during terminal illness.
Aspirin; Breast neoplasms; Survival; Prospective study; Sweden; Registries
Previously, we developed a radiosensitivity molecular signature (RSI) that was clinically validated in three independent datasets (rectal, esophageal, head and neck) in 118 patients. Here, we test RSI in radiotherapy (RT)-treated breast cancer patients.
RSI was tested in two previously published breast cancer datasets. Patients were treated at the Karolinska University Hospital (n=159) and Erasmus Medical Center (n=344). RSI was applied as previously described.
We tested RSI in RT-treated patients (Karolinska). Patients predicted to be radiosensitive (RS) had an improved 5-year relapse-free survival when compared with radioresistant (RR) patients (95% vs. 75%, p=0.0212), but there was no difference between RS/RR patients treated without RT (71% vs. 77%, p=0.6744), consistent with RSI being RT-specific (interaction term RSIxRT, p=0.05). Similarly, in the Erasmus dataset RT-treated RS patients had an improved 5-year distant-metastasis-free survival over RR patients (77% vs. 64%, p=0.0409), but no difference was observed in patients treated without RT (RS vs. RR, 80% vs. 81%, p=0.9425). Multivariable analysis showed RSI is the strongest variable in RT-treated patients (Karolinska, HR=5.53, p=0.0987; Erasmus, HR=1.64, p=0.0758), and in backward selection (removal alpha of 0.10) RSI was the only variable remaining in the final model. Finally, RSI is an independent predictor of outcome in RT-treated ER+ patients (Erasmus, multivariable analysis, HR=2.64, p=0.0085).
RSI is validated in two independent breast cancer datasets totaling 503 patients. Including prior data, RSI is validated in five independent cohorts (621 patients) and represents, to our knowledge, the most extensively validated molecular signature in radiation oncology.
radiosensitivity; predictive biomarkers; gene expression; molecular signature; breast cancer
Systemic inflammation and sequestration of parasitized erythrocytes are central processes in the pathophysiology of severe Plasmodium falciparum childhood malaria. However, it is still not understood why some children are at greater risk of developing malaria complications than others. To identify human proteins in plasma related to childhood malaria syndromes, multiplex antibody suspension bead arrays were employed. Out of the 1,015 proteins analyzed in plasma from more than 700 children, 41 differed between malaria-infected children and community controls, whereas 13 discriminated uncomplicated malaria from severe malaria syndromes. Markers of oxidative stress were found to be related to severe malaria anemia, while markers of endothelial activation, platelet adhesion and muscular damage were identified in relation to children with cerebral malaria. These findings suggest the presence of generalized vascular inflammation, vascular wall modulations, activation of endothelium and unbalanced glucose metabolism in severe malaria. The increased levels of specific muscle proteins in plasma implicate potential muscle damage and microvasculature lesions during the course of cerebral malaria.
Why do some malaria-infected children develop severe and lethal forms of the disease, while others only have mild forms? To look for potential answers or clues to this question, we analyzed more than 1,000 different human proteins in the blood of more than 500 malaria-infected children from Ibadan in Nigeria, a holoendemic malaria region. We identified several proteins that were present at higher levels in the blood of the children who developed severe malaria than in the blood of those who did not. Some of the most interesting identified proteins were muscle-specific proteins, which indicate that damaged muscles could be a discriminatory pathologic event in cerebral malaria compared to other malaria cases. These findings will hopefully lead to an increased understanding of the disease and may contribute to the development of clinical algorithms that could predict which children are at greater risk of severe malaria. This in turn will be of high value in the management of these children in already overloaded tertiary-care health facilities in large, densely populated urban sub-Saharan cities with holoendemic malaria, such as Ibadan and Lagos.
Approaches exploiting extremes of the trait distribution may reveal novel loci for common traits, but it is unknown whether such loci are generalizable to the general population. In a genome-wide search for loci associated with upper vs. lower 5th percentiles of body mass index, height and waist-hip ratio, as well as clinical classes of obesity including up to 263,407 European individuals, we identified four new loci (IGFBP4, H6PD, RSRC1, PPP2R2A) influencing height detected in the tails and seven new loci (HNF4G, RPTOR, GNAT2, MRPS33P4, ADCY9, HS6ST3, ZZZ3) for clinical classes of obesity. Further, we show that there is large overlap in terms of genetic structure and distribution of variants between traits based on extremes and the general population and little etiologic heterogeneity between obesity subgroups.
A persistent debate in psychiatry concerns whether schizophrenia and bipolar disorder are the clinical realizations of discrete versus shared etiological processes.
We linked the Multi-Generation Register, containing information about children and their parents of all Swedes, and the Hospital Discharge Register, covering all public psychiatric inpatient hospitalizations in Sweden. We identified 9,009,202 unique individuals in more than 2 million nuclear families. Risks for schizophrenia, bipolar disorder and their co-morbidity were calculated for biological and adoptive parents, offspring, full siblings and half-siblings of probands with the diseases. A multivariate generalized linear mixed model was used to estimate genetic and environmental contributions to liability for schizophrenia, bipolar disorder, and their co-morbidity.
There were increased risks of both schizophrenia and bipolar disorder to first degree relatives of probands with either disorder. Half-sibs had a significantly increased risk, albeit substantially lower than the full-siblings. When relatives of probands with bipolar disorder were analysed, increased risks for schizophrenia were present for all relationships, including offspring adopted away. Heritability for schizophrenia was 64% and for bipolar disorder 59%. Shared environmental effects were small but significant for both disorders. The co-morbidity between the disorders was primarily (63%) due to additive genetic effects common to both disorders.
Similar to molecular genetic studies, we found compelling evidence that schizophrenia and bipolar disorder partially share a common genetic etiology. These results challenge the current nosological dichotomy between schizophrenia and bipolar disorder, and are consistent with a reappraisal of these disorders as distinct diagnostic entities.
Non-small cell lung cancer (NSCLC), a leading cause of cancer deaths, represents a heterogeneous group of neoplasms, mostly comprising squamous cell carcinoma (SCC), adenocarcinoma (AC) and large-cell carcinoma (LCC). The objectives of this study were to utilize integrated genomic data including copy-number alteration, mRNA, microRNA expression and candidate-gene full sequencing data to characterize the molecular distinctions between AC and SCC.
Comparative genomic hybridization followed by mutational analysis, gene expression and miRNA microarray profiling were performed on 123 paired tumor and non-tumor tissue samples from patients with NSCLC.
At the DNA, mRNA and miRNA levels we could identify molecular markers that discriminated significantly between the various histopathological entities of NSCLC. We identified 34 genomic clusters using aCGH data; several genes exhibited a different profile of aberrations between AC and SCC, including the PIK3CA, SOX2, THPO, TP63 and PDGFB genes. Gene expression profiling analysis identified SPP1, CTHRC1 and GREM1 as potential biomarkers for early diagnosis of the cancer, and SPINK1 and BMP7 to distinguish between AC and SCC in small biopsies or in blood samples. Using an integrated genomics approach, we found in recurrently altered regions a list of three potential driver genes, MRPS22, NDRG1 and RNF7, which were consistently over-expressed in amplified regions, had widespread correlation with an average of ~800 genes throughout the genome, and were highly associated with histological types. Using a network enrichment analysis, the targets of these potential drivers were seen to be involved in DNA replication, cell cycle, mismatch repair, the p53 signalling pathway and other lung cancer related signalling pathways, and many immunological pathways. Furthermore, we also identified one potential driver miRNA, hsa-miR-944.
Integrated molecular characterization of AC and SCC helped identify clinically relevant markers and potential drivers, which are recurrent and stable changes at DNA level that have functional implications at RNA level and have strong association with histological subtypes.
NSCLC; AC; SCC; LCC; Systems biology
Children born to older fathers are at higher risk to develop severe psychopathology (e.g., schizophrenia and bipolar disorder), possibly due to increased de novo mutations during spermatogenesis with older paternal age. Since severe psychopathology is correlated with antisocial behavior, we examined possible associations between advancing paternal age and offspring violent offending.
Interlinked Swedish national registers provided information on fathers’ age at childbirth and violent criminal convictions in all offspring born 1958–1979 (n=2,359,921). We used ever committing a violent crime and number of violent crimes as indices of violent offending. The data included information on multiple levels; we compared differentially exposed siblings in within-family analyses to rigorously test causal influences.
In the entire population, advancing paternal age predicted offspring violent crime according to both indices. Congruent with a causal effect, this association remained for rates of violent crime in within-family analyses. However, in within-family analyses, we found no association with ever committing a violent crime, suggesting that factors shared by siblings (genes and environment) confounded this association. Life-course-persistent criminality has been proposed to have a partly biological etiology; our results agree with a stronger biological effect (i.e., de novo mutations) on persistent violent offending.
Paternal age; Violent criminality; De novo mutations; Sibling comparison
Methods are needed for determining program endpoints or postprogram surveillance for any elimination program. Cysticercosis has the necessary effective strategies and diagnostic tools for establishing an elimination program; however, tools to verify program endpoints have not been determined. Using a statistical approach, the present study proposed that taeniasis and porcine cysticercosis antibody assays could be used to determine with a high statistical confidence whether an area is free of disease. Confidence would be improved by using secondary tests such as the taeniasis coproantigen assay and necropsy of the sentinel pigs.
Gene-set enrichment analyses (GEA or GSEA) are commonly used for biological characterization of an experimental gene-set. This is done by finding known functional categories, such as pathways or Gene Ontology terms, that are over-represented in the experimental set; the assessment is based on an overlap statistic. Rich biological information in terms of gene interaction network is now widely available, but this topological information is not used by GEA, so there is a need for methods that exploit this type of information in high-throughput data analysis.
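The overlap statistic mentioned above is commonly assessed with a hypergeometric (one-sided Fisher) test. The following is a minimal sketch of that idea, not the procedure of any particular GEA tool; the function name `overlap_pvalue`, the gene identifiers and the universe size of 20 are illustrative assumptions.

```python
from math import comb

def overlap_pvalue(experimental, category, universe_size):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Assumes both sets are drawn from the same universe of `universe_size` genes.
    """
    experimental, category = set(experimental), set(category)
    k = len(experimental & category)              # observed overlap
    n, K, N = len(experimental), len(category), universe_size
    # P(X >= k) when n genes are drawn without replacement from N genes,
    # K of which belong to the functional category
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, tail / total

# Toy example with hypothetical gene names
experimental = {"g1", "g2", "g3", "g4"}
category = {"g2", "g3", "g4", "g9", "g10"}
k, p = overlap_pvalue(experimental, category, universe_size=20)
print(k, round(p, 4))
```

Only shared membership enters this statistic; genes that are linked in a network but absent from the category contribute nothing, which is the limitation the abstract points to.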
We developed a method of network enrichment analysis (NEA) that extends the overlap statistic in GEA to network links between genes in the experimental set and those in the functional categories. For the crucial step in statistical inference, we developed a fast network randomization algorithm in order to obtain the distribution of any network statistic under the null hypothesis of no association between an experimental gene-set and a functional category. We illustrate the NEA method using gene and protein expression data from a lung cancer study.
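The idea of replacing gene overlap by network links can be sketched as follows. This is only an illustration under stated assumptions: the null distribution is generated by degree-preserving double-edge swaps, which is a generic randomization scheme and may differ from the fast algorithm developed in the paper; the function names and the toy network are hypothetical.

```python
import random

def count_links(edges, set_a, set_b):
    """Number of edges with one endpoint in set_a and the other in set_b."""
    return sum(1 for u, v in edges
               if (u in set_a and v in set_b) or (u in set_b and v in set_a))

def rewire(edges, n_swaps, rng):
    """Randomize an undirected network by double-edge swaps (node degrees are kept)."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue                      # edges share a node: skip to avoid self-loops
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue                      # swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)   # a-b, c-d  ->  a-d, c-b
    return edges

def nea_pvalue(edges, experimental, category, n_perm=200, seed=0):
    """Permutation p-value for the number of links between two gene sets."""
    rng = random.Random(seed)
    observed = count_links(edges, experimental, category)
    hits = 0
    for _ in range(n_perm):
        null_edges = rewire(edges, n_swaps=10 * len(edges), rng=rng)
        if count_links(null_edges, experimental, category) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy network
toy_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("f", "g"), ("g", "h")]
observed, p = nea_pvalue(toy_edges, {"a", "b"}, {"c", "d"}, n_perm=50)
```

Keeping node degrees fixed during randomization matters because highly connected genes would otherwise appear enriched for links to almost any category.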
The results indicate that the NEA method is more powerful than the traditional GEA, primarily because the relationships between gene sets were more strongly captured by network connectivity rather than by simple overlaps.
Research has consistently found lower cognitive ability to be related to increased risk for violent and other antisocial behaviour. Since this association has remained when adjusting for childhood socioeconomic position, ethnicity, and parental characteristics, it is often assumed to be causal, potentially mediated through school adjustment problems and conduct disorder. Socioeconomic differences are notoriously difficult to quantify, however, and it is possible that the association between intelligence and delinquency suffers from substantial residual confounding.
We linked longitudinal Swedish total population registers to study the association of general cognitive ability (intelligence) at age 18 (the Conscript Register, 1980–1993) with the incidence proportion of violent criminal convictions (the Crime Register, 1973–2009), among all men born in Sweden 1961–1975 (N = 700,514). Using probit regression, we controlled for measured childhood socioeconomic variables, and further employed sibling comparisons (family pedigree data from the Multi-Generation Register) to adjust for shared familial characteristics.
Cognitive ability in early adulthood was inversely associated with having been convicted of a violent crime (β = −0.19, 95% CI: −0.19; −0.18); the association remained when adjusting for childhood socioeconomic factors (β = −0.18, 95% CI: −0.18; −0.17). The association was somewhat lower within half-brothers raised apart (β = −0.16, 95% CI: −0.18; −0.14), within half-brothers raised together (β = −0.13, 95% CI: −0.15; −0.11), and lower still in full-brother pairs (β = −0.10, 95% CI: −0.11; −0.09). The attenuation among half-brothers raised together and full brothers was too strong to be attributed solely to attenuation from measurement error.
Our results suggest that the association between general cognitive ability and violent criminality is confounded partly by factors shared by brothers. However, most of the association remains even after adjusting for such factors.
Prostate-specific antigen screening has led to enormous overtreatment of prostate cancer because of the inability to distinguish potentially lethal disease at diagnosis. We reasoned that by identifying an mRNA signature of Gleason grade, the best predictor of prognosis, we could improve prediction of lethal disease among men with moderate Gleason 7 tumors, the most common grade, and the most indeterminate in terms of prognosis.
Using the complementary DNA–mediated annealing, selection, extension, and ligation assay, we measured the mRNA expression of 6,100 genes in prostate tumor tissue in the Swedish Watchful Waiting cohort (n = 358) and Physicians' Health Study (PHS; n = 109). We developed an mRNA signature of Gleason grade comparing individuals with Gleason ≤ 6 to those with Gleason ≥ 8 tumors and applied the model among patients with Gleason 7 to discriminate lethal cases.
We built a 157-gene signature using the Swedish data that predicted Gleason with low misclassification (area under the curve [AUC] = 0.91); when this signature was tested in the PHS, the discriminatory ability remained high (AUC = 0.94). In men with Gleason 7 tumors, who were excluded from the model building, the signature significantly improved the prediction of lethal disease beyond knowing whether the Gleason score was 4 + 3 or 3 + 4 (P = .006).
Our expression signature and the genes identified may improve our understanding of the de-differentiation process of prostate tumors. Additionally, the signature may have clinical applications among men with Gleason 7, by further estimating their risk of lethal prostate cancer and thereby guiding therapy decisions to improve outcomes and reduce overtreatment.
Substantial progress has been made in human genetics and genomics research over the past ten years since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major developments have been made in cataloguing genetic variation, the International HapMap Project, and with respect to advances in genotyping technologies. These developments are vital for the emergence of genome-wide association studies in the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. The high-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome sequencing and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade.
Human Genome Project; International HapMap Project; 1000 Genomes Project; genome-wide association studies; single nucleotide polymorphisms; copy number variations; next-generation sequencing technologies; cancer genome sequencing; exome sequencing; complex disease; Mendelian disorders; personalised genomic medicine
The majority of prostate cancers harbor gene fusions of the 5′-untranslated region of the androgen-regulated transmembrane protease, serine 2 (TMPRSS2) promoter with erythroblast transformation specific (ETS) transcription factor family members. The common fusion between TMPRSS2 and v-ets erythroblastosis virus E26 oncogene homolog [avian] (ERG), TMPRSS2–ERG, is associated with a more aggressive clinical phenotype, implying the existence of a distinct subclass of prostate cancer defined by this fusion.
We used cDNA-mediated annealing, selection, ligation, and extension to determine the expression profiles of 6144 transcriptionally informative genes in archived biopsy samples from 455 prostate cancer patients in the Swedish Watchful Waiting cohort (1987–1999) and the US-based Physicians Health Study cohort (1983–2003). A gene expression signature for prostate cancers with the TMPRSS2-ERG fusion was determined using partitioning and classification models and used in computational functional analysis. Cell proliferation and TMPRSS2-ERG expression in androgen receptor–negative (NCI-H660) and –positive (VCaP-ERβ) prostate cancer cells after treatment with vehicle or estrogenic compounds were assessed by viability assays and quantitative polymerase chain reaction, respectively. All statistical tests were two-sided.
We identified an 87-gene expression signature that distinguishes TMPRSS2-ERG fusion prostate cancer as a discrete molecular entity (area under the curve = 0.80, 95% confidence interval [CI] = 0.792 to 0.81; P<.001). Computational analysis suggested that this fusion signature was associated with estrogen receptor (ER) signaling. Viability of NCI-H660 cells decreased after treatment with estrogen (viability normalized to day 0, estrogen vs vehicle at day 8, mean = 2.04 vs 3.40, difference = 1.36, 95% CI = 1.12 to 1.62) or ERβ agonist (ERβ agonist vs vehicle at day 8, mean = 1.86 vs 3.40, difference = 1.54, 95% CI = 1.39 to 1.69) but increased after ERα agonist treatment (ERα agonist vs vehicle at day 8, mean = 4.36 vs 3.40, difference = 0.96, 95% CI = 0.68 to 1.23). Similarly, expression of TMPRSS2-ERG decreased after ERβ agonist treatment (fold change over internal control, ERβ agonist vs vehicle at 24 hours, NCI H660, mean = 0.57-fold vs 1.0-fold, difference = 0.43, 95% CI = 0.29-fold to 0.57-fold) and increased after ERα agonist treatment (ERα agonist vs vehicle at 24 hours, mean = 5.63-fold vs 1.0-fold, difference = 4.63-fold, 95% CI = 4.34-fold to 4.92-fold).
TMPRSS2-ERG fusion prostate cancer is a distinct molecular subclass. TMPRSS2-ERG expression is regulated by a novel ER-dependent mechanism.
A fundamental question in human genetics is the degree to which the polygenic character of complex traits derives from polymorphism in genes with similar or with dissimilar functions. The many genome-wide association studies now being performed offer an opportunity to investigate this, and although early attempts are emerging, new tools and modeling strategies still need to be developed and deployed. Towards this goal we implemented a new algorithm to facilitate the transition from genetic marker lists (principally those generated by PLINK) to pathway analyses of representational gene sets in either threshold or threshold-free downstream applications (e.g. DAVID, GSEA-P, and Ingenuity Pathway Analysis). This was applied to several large genome-wide association studies covering diverse human traits that included type 2 diabetes, Crohn’s disease, and plasma lipid levels. Validation of this approach was obtained for plasma HDL levels, where functional categories related to lipid metabolism emerged as the most significant in two independent studies. From analyses of these samples we highlight and address numerous issues related to this strategy, including appropriate gene-based correction statistics, the utility of imputed vs. non-imputed marker sets, and the apparent enrichment of pathways due solely to the positional clustering of functionally related genes. The latter in particular emphasizes the importance of studies that directly tie genetic variation to functional characteristics of specific genes. The freely provided software, which we have called ProxyGeneLD, may resolve an important bottleneck in pathway-based analyses of genome-wide association data. This has allowed us to identify at least one replicable case of pathway enrichment but also to highlight functional gene clustering as a potentially serious problem that may lead to spurious pathway findings if not corrected for.
pathway; genome-wide; association; gene; enrichment; ontology
Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes.
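To make the NIPALS-based idea concrete, the sketch below computes a first sparse loading vector by alternating score and loading updates and thresholding the loadings at each pass. It is a simplified stand-in under explicit assumptions: plain soft-thresholding with a fixed level `lam` replaces the random-effect penalty, the singular-value shrinkage step is omitted, and the function names and toy matrix are hypothetical.

```python
import math

def soft_threshold(x, lam):
    """Shrink x towards zero by lam; values with |x| <= lam become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def sparse_pc1(X, lam, n_iter=100):
    """First sparse loading vector of a column-centred matrix X (list of rows)."""
    n, p = len(X), len(X[0])
    v = [1.0 / math.sqrt(p)] * p                       # starting loading vector
    for _ in range(n_iter):
        # NIPALS-style score update: u = X v, scaled to unit length
        u = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        norm_u = math.sqrt(sum(a * a for a in u)) or 1.0
        u = [a / norm_u for a in u]
        # Loading update: w = X^T u, followed by soft-thresholding for sparsity
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        w = [soft_threshold(a, lam) for a in w]
        norm_w = math.sqrt(sum(a * a for a in w))
        if norm_w == 0.0:
            return [0.0] * p                           # threshold removed every loading
        v = [a / norm_w for a in w]
    return v

# Toy data: two strongly co-varying variables and one near-constant variable
X = [[ 2.0,  2.1,  0.1],
     [-2.0, -1.9, -0.1],
     [ 1.0,  0.9, -0.1],
     [-1.0, -1.1,  0.1]]
loadings = sparse_pc1(X, lam=0.5)
```

In this toy case the third variable carries almost no signal, so its loading is thresholded to exactly zero while the first two share the component.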
The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
Algorithms and software for CNV detection have been developed, but they detect the CNV regions sample-by-sample with individual-specific breakpoints, while common CNV regions are likely to occur at the same genomic locations across different individuals in a homogeneous population. Current algorithms to detect common CNV regions do not account for the varying reliability of the individual CNVs, typically reported as confidence scores by SNP-based CNV detection algorithms. General methodologies for identifying these recurrent regions, especially those directed at SNP arrays, are still needed.
In this paper, we describe two new approaches for identifying common CNV regions based on (i) the frequency of occurrence of reliable CNVs, where reliability is determined by high confidence scores, and (ii) a weighted frequency of occurrence of CNVs, where the weights are determined by the confidence scores. In addition, motivated by the fact that we often observe partially overlapping CNV regions as a mixture of two or more distinct subregions, regions identified using the two approaches can be fine-tuned to smaller sub-regions using a clustering algorithm. We compared the performance of the methods with sequencing-based results in terms of discordance rates, rates of departure from Hardy-Weinberg equilibrium (HWE) and average frequency and size of the identified regions. The discordance rates as well as the rates of departure from HWE decrease when we select CNVs with higher confidence scores. We also performed comparisons with two previously published methods, STAC and GISTIC, and showed that the methods we consider are better at identifying low-frequency but high-confidence CNV regions.
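Approach (ii), the weighted frequency of occurrence, can be illustrated with a simple sweep over CNV breakpoints. This is a sketch only: it assumes confidence scores already scaled to [0, 1], at most one call per individual at any position, and half-open intervals; the function name, the threshold and the toy calls are hypothetical and the weighting used in the paper may differ.

```python
def weighted_common_regions(calls, n_individuals, min_weighted_freq):
    """Return regions whose confidence-weighted CNV frequency reaches a threshold.

    calls: list of (individual_id, start, end, confidence), intervals [start, end).
    """
    # Each call adds its confidence at its start and removes it at its end
    events = []
    for _, start, end, conf in calls:
        events.append((start, conf))
        events.append((end, -conf))
    events.sort()

    regions, weight, prev = [], 0.0, None
    for pos, delta in events:
        # The segment [prev, pos) carries the current summed confidence
        if prev is not None and pos > prev and weight / n_individuals >= min_weighted_freq:
            if regions and regions[-1][1] == prev:
                regions[-1] = (regions[-1][0], pos)    # extend the adjacent region
            else:
                regions.append((prev, pos))
        weight += delta
        prev = pos
    return regions

# Hypothetical calls from four individuals
calls = [("s1", 100, 200, 0.9), ("s2", 120, 220, 0.8),
         ("s3", 150, 250, 0.3), ("s4", 400, 450, 0.95)]
common = weighted_common_regions(calls, n_individuals=4, min_weighted_freq=0.4)
print(common)
```

Setting every confidence to 1 for calls above a cut-off and 0 otherwise turns the same sweep into approach (i), the frequency of reliable CNVs.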
The proposed methods for identifying common CNV regions in multiple individuals perform well compared to existing methods. The identified common regions can be used for downstream analyses such as group comparisons in association studies.
Current prostate cancer prognostic models are based on pre-treatment prostate specific antigen (PSA) levels, biopsy Gleason score, and clinical staging but in practice are inadequate to accurately predict disease progression. Hence, we sought to develop a molecular panel for prostate cancer progression by reasoning that molecular profiles might further improve current clinical models.
We analyzed a Swedish Watchful Waiting cohort with up to 30 years of clinical follow up using a novel method for gene expression profiling. This cDNA-mediated annealing, selection, ligation, and extension (DASL) method enabled the use of formalin-fixed paraffin-embedded transurethral resection of prostate (TURP) samples taken at the time of the initial diagnosis. We determined the expression profiles of 6100 genes for 281 men divided in two extreme groups: men who died of prostate cancer and men who survived more than 10 years without metastases (lethals and indolents, respectively). Several statistical and machine learning models using clinical and molecular features were evaluated for their ability to distinguish lethal from indolent cases.
Surprisingly, none of the predictive models using molecular profiles significantly improved over models using clinical variables only. Additional computational analysis confirmed that molecular heterogeneity within both the lethal and indolent classes is widespread in prostate cancer as compared to other types of tumors.
The molecularly dominant tumor nodule may be missed by sampling at the time of initial diagnosis, may not yet be present at that time, or may emerge only as the disease progresses, making the development of molecular biomarkers for prostate cancer progression challenging.
Family data are used extensively in quantitative genetic studies to disentangle the genetic and environmental contributions to various diseases. Many family studies base their analyses on population-based registers containing a large number of individuals composed of small family units. For binary trait analyses, exact marginal likelihood is a common approach, but, due to the computational demand of the enormous data sets, it allows only a limited number of effects in the model. This makes it particularly difficult to perform joint estimation of variance components for a binary trait and the potential confounders. We have developed a data-reduction method of ascertaining informative families from population-based family registers. We propose a scheme where the ascertained families match the full cohort with respect to some relevant statistics, such as the risk to relatives of an affected individual. The ascertainment-adjusted analysis, which we implement using a pseudo-likelihood approach, is shown to be efficient relative to the analysis of the whole cohort and robust to mis-specification of the random effect distribution.
Segregation analysis; Mixed models; Variance components; Probit models
A great majority of genetic markers discovered in recent genome-wide association studies have small effect sizes, and they explain only a small fraction of the genetic contribution to the diseases. How many more variants can we expect to discover and what study sizes are needed? We derive the connection of the cumulative risk of the SNP variants to the latent genetic risk model and the heritability of the disease. We determine the sample size required for case-control studies in order to achieve a certain expected number of discoveries in a collection of most significant SNPs. Assuming allele frequencies and effect sizes similar to those of the currently validated SNPs, a complex phenotype such as type-2 diabetes would need approximately 800 variants to explain its 40% heritability. Much smaller numbers of variants are needed if we assume rare-variant but higher-penetrance models. We estimate that up to 50,000 cases and an equal number of controls are needed to discover 800 common low-penetrance variants among the top 5000 SNPs. Under common and rare low-penetrance models, the very large studies required to discover the numerous variants are probably at the limit of practical feasibility. Under rare-variant models with medium to high penetrance (odds ratios between 1.6 and 4.0), studies comparable in size to many existing studies are adequate provided the genotyping technology can interrogate more and rarer variants.
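The arithmetic linking variant counts to heritability can be illustrated with a minimal sketch. It assumes a purely additive model on a liability scale with unit variance, where a biallelic variant with allele frequency p and per-allele effect a contributes 2p(1-p)a^2 to the variance. This is the textbook additive-variance formula and not necessarily the latent genetic risk model derived in the paper; the function names and the example frequency and effect size are hypothetical.

```python
import math

def per_variant_variance(p, a):
    """Additive variance from one biallelic variant under Hardy-Weinberg equilibrium.

    p: allele frequency; a: per-allele effect on a liability scale with unit
    variance (both are illustrative assumptions, not values from the paper).
    """
    return 2.0 * p * (1.0 - p) * a * a

def variants_needed(h2, var_per_variant):
    """Number of independent, equal-effect variants needed to explain heritability h2."""
    # The small tolerance guards against floating-point round-off at exact multiples.
    return math.ceil(h2 / var_per_variant - 1e-9)

# Figures quoted in the text: about 800 variants for 40% heritability,
# so each variant explains on average 0.05% of the liability variance.
implied = 0.40 / 800
print(f"implied variance per variant: {implied:.4%}")
print("variants needed:", variants_needed(0.40, implied))

# Hypothetical common variant: frequency 0.3, effect 0.035 liability SD per allele.
v = per_variant_variance(0.3, 0.035)
print(f"example variant explains {v:.4%}; variants needed: {variants_needed(0.40, v)}")
```

Under this simplification the required number of variants scales inversely with the square of the per-allele effect, which is why higher-penetrance models need far fewer variants.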
Joint analysis of transcriptomic and proteomic data taken from the same samples has the potential to elucidate complex biological mechanisms. Most current methods that integrate these datasets allow for the computation of the correlation between a gene and protein but only after a one-to-one matching of genes and proteins is done. However, genes and proteins are connected via biological pathways and their relationship is not necessarily one-to-one. In this paper, we investigate the use of Correlated Factor Analysis (CFA) for modeling the correlation of genome-scale gene and protein data. Unlike existing approaches, CFA considers all possible gene-protein pairs and utilizes all gene and protein information in its modeling framework. The Generalized Singular Value Decomposition (gSVD) is another method which takes into account all available transcriptomic and proteomic data. Comparison is made between CFA and gSVD.
Our simulation study indicates that the CFA estimates can consistently capture the dominant patterns of correlation between two sets of measurements; in contrast, the gSVD estimates cannot. Applied to real cancer data, the list of co-regulated genes and proteins identified by CFA has a biologically meaningful interpretation, with both the gene and protein expression pointing to the same processes. Among the GO terms for which the genes and proteins are most correlated, we observed blood vessel morphogenesis and development.
We demonstrate that CFA is a useful tool for gene-protein data integration and modeling, where the main question is in finding which patterns of gene expression are most correlated with protein expression.
While prostate cancer is a leading cause of cancer death, most men die with and not from their disease, underscoring the urgency to distinguish potentially lethal from indolent prostate cancer. We tested the prognostic value of a previously identified multigene signature of prostate cancer progression to predict cancer-specific death. The Örebro Watchful Waiting Cohort included 172 men with localized prostate cancer of whom 40 died of prostate cancer. We quantified protein expression of the markers in tumor tissue by immunohistochemistry, and stratified the cohort by quintiles according to risk classification. We accounted for clinical parameters (age, Gleason, nuclear grade, tumor volume) using Cox regression, and calculated receiver operating characteristic (ROC) curves to compare discriminatory ability. The hazard ratio of prostate cancer death increased with increasing risk classification by the multigene model, with a 16-fold greater risk comparing highest versus lowest risk strata, and the model predicted outcome independent of clinical factors (p=0.002). The best discrimination came from combining information from the multigene markers and clinical data, which perfectly classified the lowest risk stratum where no one developed lethal disease; using the two lowest risk groups as referent, the hazard ratio (95% confidence interval) was 11.3 (4.0–32.8) for the highest risk group and difference in mortality at 15 years was 60% (50–70%). The combined model provided greater discriminatory ability (AUC 0.78) than the clinical model alone (AUC 0.71), p=0.04. Molecular tumor markers can add to clinical parameters to help distinguish lethal and indolent prostate cancer, and hold promise to guide treatment decisions.
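As a rough illustration of how an AUC summarizes discriminatory ability, the sketch below computes it from risk scores with the Mann-Whitney pairwise comparison. It treats the outcome as a simple binary label and ignores follow-up time and censoring, so it is not the survival analysis used in the study. The function name and all scores are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney pairwise comparison.

    scores: risk scores (higher means higher predicted risk)
    labels: 1 for an event (e.g. lethal disease), 0 otherwise
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one observation in each class")
    # Count event/non-event pairs that are ranked correctly; ties count one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores for illustration only (not data from the cohort).
labels = [1, 1, 1, 0, 0, 0, 0]
clinical = [0.8, 0.4, 0.3, 0.5, 0.2, 0.35, 0.1]
combined = [0.9, 0.6, 0.3, 0.5, 0.2, 0.25, 0.1]
print("clinical AUC:", auc(clinical, labels))
print("combined AUC:", auc(combined, labels))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of events from non-events.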
From 1968 to 2002, Singapore experienced an almost four-fold increase in prostate cancer incidence. This paper examines the incidence, mortality and survival patterns for prostate cancer among all residents in Singapore from 1968 to 2002.
This is a retrospective population-based cohort study including all prostate cancer cases aged over 20 (n = 3613) reported to the Singapore Cancer Registry from 1968 to 2002. Age-standardized incidence, mortality rates and 5-year Relative Survival Ratios (RSRs) were obtained for each 5-year period. Follow-up was ascertained by matching with the National Death Register until 2002. A weighted linear regression was performed on the log-transformed age-standardized incidence and mortality rates against calendar period.
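A regression of log-transformed rates on calendar time yields an annual percentage change of 100 × (exp(slope) − 1). The sketch below shows that calculation with a closed-form weighted least-squares fit. The period midpoints, rates and weights are hypothetical, since the abstract does not state which weights were used, and the function name is an assumption of this example.

```python
import math

def annual_percent_change(years, rates, weights=None):
    """Weighted least-squares fit of log(rate) on calendar year.

    Returns the estimated percentage change in the rate per year,
    100 * (exp(slope) - 1).
    """
    if weights is None:
        weights = [1.0] * len(years)
    log_rates = [math.log(r) for r in rates]
    total_w = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, years)) / total_w
    y_mean = sum(w * y for w, y in zip(weights, log_rates)) / total_w
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, years, log_rates))
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, years))
    slope = sxy / sxx
    return 100.0 * (math.exp(slope) - 1.0)

# Hypothetical rates per 100,000 at period midpoints, growing by exactly 5% a year.
midpoints = [1970, 1975, 1980, 1985, 1990, 1995, 2000]
rates = [4.0 * 1.05 ** (year - 1970) for year in midpoints]
weights = [10, 14, 20, 30, 45, 60, 80]  # hypothetical, e.g. case counts per period
print(round(annual_percent_change(midpoints, rates, weights), 2))  # prints 5.0
```

Fitting on the log scale means a constant slope corresponds to a constant proportional change per year, which is why the results are naturally reported as percentages.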
The percentage increase in the age-standardized incidence rate per year was 5.0%, 5.6%, 4.0% and 1.9% for all residents, Chinese, Malays and Indians respectively. The percentage increase in age-standardized mortality rate per year was 5.7%, 6.0%, 6.6% and 2.5% for all residents, Chinese, Malays and Indians respectively. When all Singapore residents were considered, the RSRs for prostate cancer were fairly constant across the study period with slight improvement from 1995 onwards among the Chinese.
Ethnic differences in prostate cancer incidence, mortality and survival patterns were observed. There has been a substantial improvement in RSRs since the 1990s for the Chinese.
It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to different sensitivities to violation of the assumptions. Illustrating the lack of robustness, in a striking spike-in experiment all existing normalization methods fail because of an imbalance between up- and down-regulated genes. This means it is still important to develop a normalization method that is robust against violation of the standard assumptions.
We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in the robust linear model fit of pre-normalized probe-level data. The LVS genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summaries, such as MAS5 or RMA.
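The general idea of normalizing against a least-variant reference set can be sketched as follows. This is a deliberately simplified stand-in, not the published algorithm: gene variability is taken as the plain variance of summarized log expression across arrays (instead of the array-to-array component from a robust linear model fit on probe-level data), and each array is mapped to the reference with a straight line (instead of a non-linear smoother). The function names, the `proportion` parameter and the toy data are all hypothetical.

```python
from statistics import median, pvariance

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def lvs_like_normalize(expr, proportion=0.4):
    """Normalize a genes-by-arrays matrix (list of rows of log expression values).

    Returns the normalized matrix and the sorted indices of the reference genes.
    """
    n_genes, n_arrays = len(expr), len(expr[0])
    # 1. Rank genes by variability across arrays and keep the least variable ones.
    order = sorted(range(n_genes), key=lambda g: pvariance(expr[g]))
    lvs = order[: max(2, int(proportion * n_genes))]
    # 2. Reference profile: per-gene median across arrays.
    ref = [median(row) for row in expr]
    # 3. For each array, fit reference ~ array on the reference genes only,
    #    then apply the fitted map to every gene on that array.
    normalized = [row[:] for row in expr]
    for j in range(n_arrays):
        slope, intercept = linear_fit([expr[g][j] for g in lvs],
                                      [ref[g] for g in lvs])
        for g in range(n_genes):
            normalized[g][j] = slope * expr[g][j] + intercept
    return normalized, sorted(lvs)

# Toy data: 10 genes on 3 arrays; arrays 2 and 3 are distorted copies of array 1,
# and the last two genes are strongly up-regulated on array 2 only.
base = [4.0 + 0.8 * g for g in range(10)]
expr = [[b, 1.1 * b + 0.5, 0.9 * b - 0.3] for b in base]
expr[8][1] += 5.0
expr[9][1] += 5.0
normalized, lvs = lvs_like_normalize(expr)
print("reference genes:", lvs)
print("gene 0 after normalization:", [round(v, 3) for v in normalized[0]])
```

Because the fit uses only the reference genes, the two up-regulated genes do not pull the normalization curve, which is the property that matters when up- and down-regulation are imbalanced.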
We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available.