Methods are needed for determining program endpoints or for postprogram surveillance in any elimination program. Effective intervention strategies and diagnostic tools for establishing a cysticercosis elimination program exist; however, tools to verify program endpoints have not been determined. Using a statistical approach, the present study proposed that taeniasis and porcine cysticercosis antibody assays could be used to determine with high statistical confidence whether an area is free of disease. Confidence would be improved by using secondary tests such as the taeniasis coproantigen assay and necropsy of sentinel pigs.
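The required confidence that an area is free of disease translates directly into a survey sample size: if disease were present at some design prevalence, the sample must be large enough that at least one infected, test-positive subject would almost certainly be found. A minimal sketch of that standard calculation, with illustrative prevalence and sensitivity values that are not figures from this study:

```python
import math

def freedom_sample_size(design_prevalence, sensitivity, confidence=0.95):
    """Number of subjects to test so that, if disease is present at
    `design_prevalence`, at least one positive is detected with the given
    confidence.  Assumes perfect specificity and an effectively infinite
    population (binomial approximation)."""
    p_detect_one = design_prevalence * sensitivity
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_detect_one))

# e.g. 1% design prevalence, a 90%-sensitive antibody assay, 95% confidence
n = freedom_sample_size(0.01, 0.90, 0.95)
```

The classic "59 samples detect 5% prevalence with a perfect test" rule of thumb falls out of the same formula.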
Gene-set enrichment analyses (GEA or GSEA) are commonly used for biological characterization of an experimental gene-set. This is done by finding known functional categories, such as pathways or Gene Ontology terms, that are over-represented in the experimental set; the assessment is based on an overlap statistic. Rich biological information in terms of gene interaction network is now widely available, but this topological information is not used by GEA, so there is a need for methods that exploit this type of information in high-throughput data analysis.
We developed a method of network enrichment analysis (NEA) that extends the overlap statistic in GEA to network links between genes in the experimental set and those in the functional categories. For the crucial step in statistical inference, we developed a fast network randomization algorithm in order to obtain the distribution of any network statistic under the null hypothesis of no association between an experimental gene-set and a functional category. We illustrate the NEA method using gene and protein expression data from a lung cancer study.
The results indicate that the NEA method is more powerful than the traditional GEA, primarily because the relationships between gene sets are captured more strongly by network connectivity than by simple overlaps.
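The core of NEA is a permutation test on the number of network links between the experimental set and a functional category. A minimal sketch, assuming a simple edge list; note that relabelling nodes at random is a simplification of the published algorithm, which instead rewires edges so that node degrees are preserved:

```python
import random

def cross_links(edges, set_a, set_b):
    """Count network links with one endpoint in set_a and the other in set_b."""
    a, b = set(set_a), set(set_b)
    return sum(1 for u, v in edges
               if (u in a and v in b) or (u in b and v in a))

def nea_pvalue(edges, nodes, set_a, set_b, n_perm=2000, seed=1):
    """Permutation p-value for enrichment of links between set_a and set_b.
    Random node relabelling is a simplification: the published NEA
    algorithm rewires edges while preserving node degrees."""
    rng = random.Random(seed)
    observed = cross_links(edges, set_a, set_b)
    nodes = list(nodes)
    hits = 0
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))
        perm_edges = [(relabel[u], relabel[v]) for u, v in edges]
        if cross_links(perm_edges, set_a, set_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With densely interconnected sets the observed link count is rarely matched under the null, giving a small p-value.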
Research has consistently found lower cognitive ability to be related to increased risk for violent and other antisocial behaviour. Since this association has remained when adjusting for childhood socioeconomic position, ethnicity, and parental characteristics, it is often assumed to be causal, potentially mediated through school adjustment problems and conduct disorder. Socioeconomic differences are notoriously difficult to quantify, however, and it is possible that the association between intelligence and delinquency suffers from substantial residual confounding.
We linked longitudinal Swedish total population registers to study the association of general cognitive ability (intelligence) at age 18 (the Conscript Register, 1980–1993) with the incidence proportion of violent criminal convictions (the Crime Register, 1973–2009), among all men born in Sweden 1961–1975 (N = 700,514). Using probit regression, we controlled for measured childhood socioeconomic variables, and further employed sibling comparisons (family pedigree data from the Multi-Generation Register) to adjust for shared familial characteristics.
Cognitive ability in early adulthood was inversely associated with having been convicted of a violent crime (β = −0.19, 95% CI: −0.19; −0.18), and the association remained when adjusting for childhood socioeconomic factors (β = −0.18, 95% CI: −0.18; −0.17). The association was somewhat lower within half-brothers raised apart (β = −0.16, 95% CI: −0.18; −0.14) and within half-brothers raised together (β = −0.13, 95% CI: −0.15; −0.11), and lower still in full-brother pairs (β = −0.10, 95% CI: −0.11; −0.09). The attenuation among half-brothers raised together and among full brothers was too strong to be attributed solely to measurement error.
Our results suggest that the association between general cognitive ability and violent criminality is confounded partly by factors shared by brothers. However, most of the association remains even after adjusting for such factors.
Prostate-specific antigen screening has led to enormous overtreatment of prostate cancer because of the inability to distinguish potentially lethal disease at diagnosis. We reasoned that by identifying an mRNA signature of Gleason grade, the best predictor of prognosis, we could improve prediction of lethal disease among men with moderate Gleason 7 tumors, the most common grade, and the most indeterminate in terms of prognosis.
Patients and Methods
Using the complementary DNA–mediated annealing, selection, extension, and ligation assay, we measured the mRNA expression of 6,100 genes in prostate tumor tissue in the Swedish Watchful Waiting cohort (n = 358) and Physicians' Health Study (PHS; n = 109). We developed an mRNA signature of Gleason grade comparing individuals with Gleason ≤ 6 to those with Gleason ≥ 8 tumors and applied the model among patients with Gleason 7 to discriminate lethal cases.
We built a 157-gene signature using the Swedish data that predicted Gleason with low misclassification (area under the curve [AUC] = 0.91); when this signature was tested in the PHS, the discriminatory ability remained high (AUC = 0.94). In men with Gleason 7 tumors, who were excluded from the model building, the signature significantly improved the prediction of lethal disease beyond knowing whether the Gleason score was 4 + 3 or 3 + 4 (P = .006).
Our expression signature and the genes identified may improve our understanding of the de-differentiation process of prostate tumors. Additionally, the signature may have clinical applications among men with Gleason 7, by further estimating their risk of lethal prostate cancer and thereby guiding therapy decisions to improve outcomes and reduce overtreatment.
Substantial progress has been made in human genetics and genomics research over the past ten years, since the publication of the draft sequence of the human genome in 2001. Findings emanating directly from the Human Genome Project, together with those from follow-on studies, have had an enormous impact on our understanding of the architecture and function of the human genome. Major advances have been made in cataloguing genetic variation, notably through the International HapMap Project, and in genotyping technologies. These developments were vital to the emergence of genome-wide association studies for the investigation of complex diseases and traits. In parallel, the advent of high-throughput sequencing technologies has ushered in the 'personal genome sequencing' era for both normal and cancer genomes, and made possible large-scale genome sequencing studies such as the 1000 Genomes Project and the International Cancer Genome Consortium. High-throughput sequencing and sequence-capture technologies are also providing new opportunities to study Mendelian disorders through exome and whole-genome sequencing. This paper reviews these major developments in human genetics and genomics over the past decade.
Human Genome Project; International HapMap Project; 1000 Genomes Project; genome-wide association studies; single nucleotide polymorphisms; copy number variations; next-generation sequencing technologies; cancer genome sequencing; exome sequencing; complex disease; Mendelian disorders; personalised genomic medicine
The majority of prostate cancers harbor gene fusions of the 5′-untranslated region of the androgen-regulated transmembrane protease, serine 2 (TMPRSS2) promoter with erythroblast transformation specific (ETS) transcription factor family members. The common fusion with the v-ets erythroblastosis virus E26 oncogene homolog (avian) (ERG), TMPRSS2-ERG, is associated with a more aggressive clinical phenotype, implying the existence of a distinct subclass of prostate cancer defined by this fusion.
We used cDNA-mediated annealing, selection, ligation, and extension to determine the expression profiles of 6144 transcriptionally informative genes in archived biopsy samples from 455 prostate cancer patients in the Swedish Watchful Waiting cohort (1987–1999) and the US-based Physicians' Health Study cohort (1983–2003). A gene expression signature for prostate cancers with the TMPRSS2-ERG fusion was determined using partitioning and classification models and used in computational functional analysis. Cell proliferation and TMPRSS2-ERG expression in androgen receptor–negative (NCI-H660) and –positive (VCaP-ERβ) prostate cancer cells after treatment with vehicle or estrogenic compounds were assessed by viability assays and quantitative polymerase chain reaction, respectively. All statistical tests were two-sided.
We identified an 87-gene expression signature that distinguishes TMPRSS2-ERG fusion prostate cancer as a discrete molecular entity (area under the curve = 0.80, 95% confidence interval [CI] = 0.792 to 0.81; P<.001). Computational analysis suggested that this fusion signature was associated with estrogen receptor (ER) signaling. Viability of NCI-H660 cells decreased after treatment with estrogen (viability normalized to day 0, estrogen vs vehicle at day 8, mean = 2.04 vs 3.40, difference = 1.36, 95% CI = 1.12 to 1.62) or ERβ agonist (ERβ agonist vs vehicle at day 8, mean = 1.86 vs 3.40, difference = 1.54, 95% CI = 1.39 to 1.69) but increased after ERα agonist treatment (ERα agonist vs vehicle at day 8, mean = 4.36 vs 3.40, difference = 0.96, 95% CI = 0.68 to 1.23). Similarly, expression of TMPRSS2-ERG decreased after ERβ agonist treatment (fold change over internal control, ERβ agonist vs vehicle at 24 hours, NCI H660, mean = 0.57-fold vs 1.0-fold, difference = 0.43, 95% CI = 0.29-fold to 0.57-fold) and increased after ERα agonist treatment (ERα agonist vs vehicle at 24 hours, mean = 5.63-fold vs 1.0-fold, difference = 4.63-fold, 95% CI = 4.34-fold to 4.92-fold).
TMPRSS2-ERG fusion prostate cancer is a distinct molecular subclass. TMPRSS2-ERG expression is regulated by a novel ER-dependent mechanism.
A fundamental question in human genetics is the degree to which the polygenic character of complex traits derives from polymorphism in genes with similar or with dissimilar functions. The many genome-wide association studies now being performed offer an opportunity to investigate this, and although early attempts are emerging, new tools and modeling strategies still need to be developed and deployed. Towards this goal we implemented a new algorithm to facilitate the transition from genetic marker lists (principally those generated by PLINK) to pathway analyses of representational gene sets in either threshold or threshold-free downstream applications (e.g. DAVID, GSEA-P, and Ingenuity Pathway Analysis). This was applied to several large genome-wide association studies covering diverse human traits that included type 2 diabetes, Crohn’s disease, and plasma lipid levels. Validation of this approach was obtained for plasma HDL levels, where functional categories related to lipid metabolism emerged as the most significant in two independent studies. From analyses of these samples we highlight and address numerous issues related to this strategy, including appropriate gene-based correction statistics, the utility of imputed vs. non-imputed marker sets, and the apparent enrichment of pathways due solely to the positional clustering of functionally related genes. The latter in particular emphasizes the importance of studies that directly tie genetic variation to functional characteristics of specific genes. The freely provided software, which we have called ProxyGeneLD, may resolve an important bottleneck in pathway-based analyses of genome-wide association data. It has allowed us to identify at least one replicable case of pathway enrichment, but also to highlight functional gene clustering as a potentially serious problem that may lead to spurious pathway findings if not corrected for.
pathway; genome-wide; association; gene; enrichment; ontology
Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset, which contains 21,225 genes.
The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
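The essential goal of a sparse loading vector can be illustrated with a generic shrinkage step: soft-threshold the first right singular vector of the centered data matrix. This is not the random-effect penalty or the modified NIPALS algorithm of the paper, only a sketch of how shrinkage zeroes out most coefficients:

```python
import numpy as np

def sparse_loading(X, threshold):
    """First principal loading with soft-thresholded coefficients.
    A generic sparsification step, not the paper's random-effect penalty;
    it only illustrates how shrinkage drives most loadings to zero."""
    X = X - X.mean(axis=0)                 # column-center the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]                              # dense first loading vector
    shrunk = np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)
    norm = np.linalg.norm(shrunk)
    return shrunk / norm if norm > 0 else shrunk
```

On data where only a few variables carry the dominant signal, the thresholded loading vector is nonzero only on those variables, which is exactly the interpretability gain the abstract describes.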
Algorithms and software for CNV detection have been developed, but they detect the CNV regions sample-by-sample with individual-specific breakpoints, while common CNV regions are likely to occur at the same genomic locations across different individuals in a homogeneous population. Current algorithms to detect common CNV regions do not account for the varying reliability of the individual CNVs, typically reported as confidence scores by SNP-based CNV detection algorithms. General methodologies for identifying these recurrent regions, especially those directed at SNP arrays, are still needed.
In this paper, we describe two new approaches for identifying common CNV regions based on (i) the frequency of occurrence of reliable CNVs, where reliability is determined by high confidence scores, and (ii) a weighted frequency of occurrence of CNVs, where the weights are determined by the confidence scores. In addition, motivated by the fact that we often observe partially overlapping CNV regions as a mixture of two or more distinct subregions, regions identified using the two approaches can be fine-tuned to smaller sub-regions using a clustering algorithm. We compared the performance of the methods with sequencing-based results in terms of discordance rates, rates of departure from Hardy-Weinberg equilibrium (HWE) and average frequency and size of the identified regions. The discordance rates as well as the rates of departure from HWE decrease when we select CNVs with higher confidence scores. We also performed comparisons with two previously published methods, STAC and GISTIC, and showed that the methods we consider are better at identifying low-frequency but high-confidence CNV regions.
The proposed methods for identifying common CNV regions in multiple individuals perform well compared to existing methods. The identified common regions can be used for downstream analyses such as group comparisons in association studies.
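The weighted-frequency idea in the second approach can be sketched directly: sum the confidence scores of all individual CNV calls covering each genomic position, then report maximal runs of positions whose summed weight clears a threshold. The coordinates and the weight threshold below are illustrative, not values tuned in the paper:

```python
import numpy as np

def common_cnv_regions(cnv_calls, genome_length, min_weight):
    """Identify common CNV regions from per-individual calls.
    Each call is (start, end, confidence); positions whose summed
    confidence reaches `min_weight` form candidate common regions."""
    weight = np.zeros(genome_length)
    for start, end, conf in cnv_calls:
        weight[start:end] += conf
    above = weight >= min_weight
    # extract maximal runs of positions above the threshold
    regions, in_run, run_start = [], False, 0
    for pos, flag in enumerate(above):
        if flag and not in_run:
            in_run, run_start = True, pos
        elif not flag and in_run:
            in_run = False
            regions.append((run_start, pos))
    if in_run:
        regions.append((run_start, genome_length))
    return regions
```

Setting every confidence to 1 recovers the unweighted frequency-of-occurrence variant (the first approach); the clustering step that splits partially overlapping regions into subregions is not shown.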
Current prostate cancer prognostic models are based on pre-treatment prostate specific antigen (PSA) levels, biopsy Gleason score, and clinical staging but in practice are inadequate to accurately predict disease progression. Hence, we sought to develop a molecular panel for prostate cancer progression by reasoning that molecular profiles might further improve current clinical models.
We analyzed a Swedish Watchful Waiting cohort with up to 30 years of clinical follow up using a novel method for gene expression profiling. This cDNA-mediated annealing, selection, ligation, and extension (DASL) method enabled the use of formalin-fixed paraffin-embedded transurethral resection of prostate (TURP) samples taken at the time of the initial diagnosis. We determined the expression profiles of 6100 genes for 281 men divided into two extreme groups: men who died of prostate cancer and men who survived more than 10 years without metastases (lethals and indolents, respectively). Several statistical and machine learning models using clinical and molecular features were evaluated for their ability to distinguish lethal from indolent cases.
Surprisingly, none of the predictive models using molecular profiles significantly improved over models using clinical variables only. Additional computational analysis confirmed that molecular heterogeneity within both the lethal and indolent classes is widespread in prostate cancer as compared to other types of tumors.
The molecularly dominant tumor nodule may be missed by sampling at the time of initial diagnosis, may not yet be present at initial diagnosis, or may arise as the disease progresses, making the development of molecular biomarkers for prostate cancer progression challenging.
Family data are used extensively in quantitative genetic studies to disentangle the genetic and environmental contributions to various diseases. Many family studies base their analyses on population-based registers containing a large number of individuals composed of small family units. For binary trait analyses, exact marginal likelihood is a common approach, but, due to the computational demand of the enormous data sets, it allows only a limited number of effects in the model. This makes it particularly difficult to perform joint estimation of variance components for a binary trait and the potential confounders. We have developed a data-reduction method of ascertaining informative families from population-based family registers. We propose a scheme where the ascertained families match the full cohort with respect to some relevant statistics, such as the risk to relatives of an affected individual. The ascertainment-adjusted analysis, which we implement using a pseudo-likelihood approach, is shown to be efficient relative to the analysis of the whole cohort and robust to mis-specification of the random effect distribution.
Segregation analysis; Mixed models; Variance components; Probit models
A great majority of genetic markers discovered in recent genome-wide association studies have small effect sizes, and they explain only a small fraction of the genetic contribution to the diseases. How many more variants can we expect to discover, and what study sizes are needed? We derive the connection between the cumulative risk of the SNP variants, the latent genetic risk model and the heritability of the disease. We determine the sample size required for case-control studies in order to achieve a certain expected number of discoveries in a collection of most significant SNPs. Assuming allele frequencies and effect sizes similar to those of the currently validated SNPs, a complex phenotype such as type 2 diabetes would need approximately 800 variants to explain its 40% heritability. Much smaller numbers of variants are needed under rare-variant but higher-penetrance models. We estimate that up to 50,000 cases and an equal number of controls are needed to discover 800 common low-penetrance variants among the top 5000 SNPs. Under common and rare low-penetrance models, the very large studies required to discover the numerous variants are probably at the limit of practical feasibility. Under rare-variant models with medium to high penetrance (odds ratios between 1.6 and 4.0), studies comparable in size to many existing studies are adequate, provided the genotyping technology can interrogate more, and rarer, variants.
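The dependence of study size on allele frequency and penetrance can be illustrated with the usual normal approximation to the per-allele log odds ratio in a case-control design. This is a generic power calculation, not the paper's model of expected discoveries among top-ranked SNPs, and the genome-wide alpha of 5e-8 is the conventional threshold rather than a value from the study:

```python
from math import log, ceil
from statistics import NormalDist

def cases_needed(maf, odds_ratio, alpha=5e-8, power=0.8):
    """Cases (= controls) for a per-allele case-control test of one SNP,
    via the normal approximation to the log odds ratio.  `maf` is the
    control allele frequency; alpha is the genome-wide threshold."""
    p0 = maf
    p1 = odds_ratio * p0 / (1.0 + p0 * (odds_ratio - 1.0))  # case allele freq
    z = NormalDist().inv_cdf(1.0 - alpha / 2) + NormalDist().inv_cdf(power)
    var_unit = 1.0 / (2 * p1 * (1 - p1)) + 1.0 / (2 * p0 * (1 - p0))
    return ceil(z * z * var_unit / log(odds_ratio) ** 2)
```

Under these assumptions a rare, higher-penetrance variant (MAF 1%, OR 3) needs fewer cases than a common, low-penetrance one (MAF 30%, OR 1.2), matching the qualitative conclusion above.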
Joint analysis of transcriptomic and proteomic data taken from the same samples has the potential to elucidate complex biological mechanisms. Most current methods that integrate these datasets allow for the computation of the correlation between a gene and protein but only after a one-to-one matching of genes and proteins is done. However, genes and proteins are connected via biological pathways and their relationship is not necessarily one-to-one. In this paper, we investigate the use of Correlated Factor Analysis (CFA) for modeling the correlation of genome-scale gene and protein data. Unlike existing approaches, CFA considers all possible gene-protein pairs and utilizes all gene and protein information in its modeling framework. The Generalized Singular Value Decomposition (gSVD) is another method which takes into account all available transcriptomic and proteomic data. Comparison is made between CFA and gSVD.
Our simulation study indicates that the CFA estimates can consistently capture the dominant patterns of correlation between two sets of measurements; in contrast, the gSVD estimates cannot do that. Applied to real cancer data, the list of co-regulated genes and proteins identified by CFA has biologically meaningful interpretation, where both the gene and protein expressions are pointing to the same processes. Among the GO terms for which the genes and proteins are most correlated, we observed blood vessel morphogenesis and development.
We demonstrate that CFA is a useful tool for gene-protein data integration and modeling, where the main question is in finding which patterns of gene expression are most correlated with protein expression.
While prostate cancer is a leading cause of cancer death, most men die with and not from their disease, underscoring the urgency to distinguish potentially lethal from indolent prostate cancer. We tested the prognostic value of a previously identified multigene signature of prostate cancer progression to predict cancer-specific death. The Örebro Watchful Waiting Cohort included 172 men with localized prostate cancer of whom 40 died of prostate cancer. We quantified protein expression of the markers in tumor tissue by immunohistochemistry, and stratified the cohort by quintiles according to risk classification. We accounted for clinical parameters (age, Gleason, nuclear grade, tumor volume) using Cox regression, and calculated receiver operating characteristic (ROC) curves to compare discriminatory ability. The hazard ratio of prostate cancer death increased with increasing risk classification by the multigene model, with a 16-fold greater risk comparing highest versus lowest risk strata, and predicted outcome independent of clinical factors (p=0.002). The best discrimination came from combining information from the multigene markers and clinical data, which perfectly classified the lowest risk stratum where no one developed lethal disease; using the two lowest risk groups as referent, the hazard ratio (95% confidence interval) was 11.3 (4.0–32.8) for the highest risk group and difference in mortality at 15 years was 60% (50–70%). The combined model provided greater discriminatory ability (AUC 0.78) than the clinical model alone (AUC 0.71), p=0.04. Molecular tumor markers can add to clinical parameters to help distinguish lethal and indolent prostate cancer, and hold promise to guide treatment decisions.
From 1968 to 2002, Singapore experienced an almost four-fold increase in prostate cancer incidence. This paper examines the incidence, mortality and survival patterns for prostate cancer among all residents in Singapore from 1968 to 2002.
This is a retrospective population-based cohort study including all prostate cancer cases aged over 20 (n = 3613) reported to the Singapore Cancer Registry from 1968 to 2002. Age-standardized incidence, mortality rates and 5-year Relative Survival Ratios (RSRs) were obtained for each 5-year period. Follow-up was ascertained by matching with the National Death Register until 2002. A weighted linear regression was performed on the log-transformed age-standardized incidence and mortality rates over period.
The percentage increase in the age-standardized incidence rate per year was 5.0%, 5.6%, 4.0% and 1.9% for all residents, Chinese, Malays and Indians respectively. The percentage increase in age-standardized mortality rate per year was 5.7%, 6.0%, 6.6% and 2.5% for all residents, Chinese, Malays and Indians respectively. When all Singapore residents were considered, the RSRs for prostate cancer were fairly constant across the study period with slight improvement from 1995 onwards among the Chinese.
Ethnic differences in prostate cancer incidence, mortality and survival patterns were observed. There has been a substantial improvement in RSRs since the 1990s for the Chinese.
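The annual percentage changes reported above come from a weighted linear regression of log-transformed age-standardized rates on calendar period. A minimal sketch of that computation; in practice the weights would typically be inverse variances of the rates, while equal weights are used here when none are supplied:

```python
import numpy as np

def annual_percent_change(periods, rates, weights=None):
    """Annual percent change from a weighted linear regression of
    log(rate) on calendar time in years.  Equal weights by default;
    pass inverse-variance weights for a proper weighted fit."""
    x = np.asarray(periods, dtype=float)
    y = np.log(np.asarray(rates, dtype=float))
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    # np.polyfit applies w to residuals, so pass sqrt of the WLS weights
    slope = np.polyfit(x, y, 1, w=np.sqrt(w))[0]
    return 100.0 * (np.exp(slope) - 1.0)
```

For a rate series growing exactly 5% per year the function recovers 5.0, since log(rate) is then exactly linear in time.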
It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to differing sensitivities to violation of those assumptions. Illustrating the lack of robustness, in a striking spike-in experiment all existing normalization methods failed because of an imbalance between up- and down-regulated genes. This means it is still important to develop a normalization method that is robust against violation of the standard assumptions.
We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in the robust linear model fit of pre-normalized probe-level data. The genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summaries, such as MAS5 or RMA.
We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available.
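The LVS idea can be sketched in a few lines: rank genes by their across-array variance, take the least-variant fraction as a reference set, and map each array onto a common reference using only those genes. The linear fit below stands in for the non-linear normalization curve and the robust probe-level model fit of the actual method, and the LVS fraction is an illustrative choice:

```python
import numpy as np

def lvs_normalize(expr, lvs_fraction=0.4):
    """Normalize arrays against a least-variant set (LVS) of genes.
    `expr` is a genes x arrays matrix on the log scale.  Genes with the
    smallest across-array variance form the reference set; each array is
    mapped onto the mean reference values by a fit on those genes only.
    The published method uses a non-linear curve and a robust model fit;
    the linear map here is a simplification."""
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=1)
    n_lvs = max(2, int(lvs_fraction * expr.shape[0]))
    lvs = np.argsort(variances)[:n_lvs]        # least-variant genes
    reference = expr[lvs].mean(axis=1)         # target values on LVS genes
    out = np.empty_like(expr)
    for j in range(expr.shape[1]):
        slope, intercept = np.polyfit(expr[lvs, j], reference, 1)
        out[:, j] = slope * expr[:, j] + intercept
    return out
```

Because only the least-variant genes drive the fit, genuine differential expression in the remaining genes does not distort the normalization, which is the point of the method.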
Many recent microarrays hold an enormous number of probe sets, thus raising many practical and theoretical problems in controlling the false discovery rate (FDR). Biologically, it is likely that most probe sets are associated with un-expressed genes, so the measured values are simply noise due to non-specific binding; also many probe sets are associated with non-differentially-expressed (non-DE) genes. In an analysis to find DE genes, these probe sets contribute to the false discoveries, so it is desirable to filter out these probe sets prior to analysis. In the methodology proposed here, we first fit a robust linear model for probe-level Affymetrix data that accounts for probe and array effects. We then develop a novel procedure called FLUSH (Filtering Likely Uninformative Sets of Hybridizations), which excludes probe sets that have statistically small array-effects or large residual variance. This filtering procedure was evaluated on a publicly available data set from a controlled spiked-in experiment, as well as on a real experimental data set of a mouse model for retinal degeneration. In both cases, FLUSH filtering improves the sensitivity in the detection of DE genes compared to analyses using unfiltered, presence-filtered, intensity-filtered and variance-filtered data. A freely-available package called FLUSH implements the procedures and graphical displays described in the article.
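The FLUSH criterion can be illustrated on a single probe set: decompose the probes-by-arrays matrix into probe and array effects, then keep the probe set only if the array effect carries appreciable signal and the residual variance is modest. The simple two-way mean decomposition and the cutoffs below are illustrative stand-ins for the robust probe-level model fit and thresholds of the actual procedure:

```python
import numpy as np

def keep_probeset(probe_matrix, min_signal=0.05, max_noise=0.5):
    """FLUSH-style filter for one probe set (probes x arrays, log scale).
    Row/column means give probe and array effects; the probe set is kept
    only if the array effect shows real variation across arrays and the
    residual noise is modest.  Cutoffs are illustrative."""
    m = probe_matrix - probe_matrix.mean()
    probe_eff = m.mean(axis=1, keepdims=True)
    array_eff = m.mean(axis=0, keepdims=True)
    resid = m - probe_eff - array_eff
    return bool(array_eff.var() >= min_signal and resid.var() <= max_noise)
```

Probe sets for un-expressed genes show almost no array effect (pure non-specific binding noise) and are filtered out, which is how removing them reduces false discoveries downstream.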
Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) have introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (optimal discovery procedure, ODP), but uses standard FDR assessment; Ploner et al. (2006) have introduced a multivariate local FDR that allows incorporation of standard error information, but uses the standard t-statistic (fdr2d). The relationship and relative performance of these methods in two-sample comparisons is currently unknown.
Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d.
For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real data sets.
The ODP can be improved by including the standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.
Molecular markers and the rich biological information they contain have great potential for cancer diagnosis, prognostication and therapy prediction. So far, however, they have not superseded routine histopathology and staging criteria, partly because the few studies performed on molecular subtyping have had little validation and limited clinical characterization.
We obtained gene expression and clinical data for 412 breast cancers obtained from population-based cohorts of patients from Stockholm and Uppsala, Sweden. Using the intrinsic set of approximately 500 genes derived in the Norway/Stanford breast cancer data, we validated the existence of five molecular subtypes – basal-like, ERBB2, luminal A/B and normal-like – and characterized these subtypes extensively with the use of conventional clinical variables.
We found an overall 77.5% concordance between the centroid prediction of the Swedish cohort by using the Norway/Stanford signature and the k-means clustering performed internally within the Swedish cohort. The highest rate of discordant assignments occurred between the luminal A and luminal B subtypes and between the luminal B and ERBB2 subtypes. The subtypes varied significantly in terms of grade (p < 0.001), p53 mutation (p < 0.001) and genomic instability (p = 0.01), but surprisingly there was little difference in lymph-node metastasis (p = 0.31). Furthermore, current users of hormone-replacement therapy were strikingly over-represented in the normal-like subgroup (p < 0.001). Separate analyses of the patients who received endocrine therapy and those who did not receive any adjuvant therapy supported the previous hypothesis that the basal-like subtype responded to adjuvant treatment, whereas the ERBB2 and luminal B subtypes were poor responders.
We found that the intrinsic molecular subtypes of breast cancer are broadly present in a diverse collection of patients from a population-based cohort in Sweden. The intrinsic gene set, originally selected to reveal stable tumor characteristics, was shown to have a strong correlation with progression-related properties such as grade, p53 mutation and genomic instability.
Postmenopausal hormone-replacement therapy (HRT) increases breast-cancer risk. The influence of HRT on the biology of the primary tumor, however, is not well understood.
We obtained breast-cancer gene expression profiles using Affymetrix human genome U133A arrays. We examined the relationship between HRT-regulated gene profiles, tumor characteristics, and recurrence-free survival in 72 postmenopausal women.
HRT use in patients with estrogen receptor (ER) protein positive tumors (n = 72) was associated with an altered regulation of 276 genes. Expression profiles based on these genes clustered ER-positive tumors into two molecular subclasses, one of which was associated with HRT use and had significantly better recurrence-free survival despite lower ER levels. A comparison with external data suggested that gene regulation in tumors associated with HRT was negatively correlated with gene regulation induced by short-term estrogen exposure, but positively correlated with the effect of tamoxifen.
Our findings suggest that post-menopausal HRT use is associated with a distinct gene expression profile related to better recurrence-free survival and lower ER protein levels. Tentatively, HRT-associated gene expression in tumors resembles the effect of tamoxifen exposure on MCF-7 cells.
Adjuvant breast cancer therapy significantly improves survival, but overtreatment and undertreatment are major problems. Breast cancer expression profiling has so far mainly been used to identify women with a poor prognosis as candidates for adjuvant therapy but without demonstrated value for therapy prediction.
We obtained the gene expression profiles of 159 population-derived breast cancer patients, and used hierarchical clustering to identify the signature associated with prognosis and impact of adjuvant therapies, defined as distant metastasis or death within 5 years. Independent datasets of 76 treated population-derived Swedish patients, 135 untreated population-derived Swedish patients and 78 Dutch patients were used for validation. The inclusion and exclusion criteria for the studies of population-derived Swedish patients were defined.
Among the 159 patients, a subset of 64 genes was found to give an optimal separation of patients with good and poor outcomes. Hierarchical clustering revealed three subgroups: patients who did well with therapy, patients who did well without therapy, and patients who failed to benefit from given therapy. The expression profile gave significantly better prognostication (odds ratio, 4.19; P = 0.007) (breast cancer end-points odds ratio, 10.64) compared with the Elston–Ellis histological grading (odds ratio of grade 2 vs 1 and grade 3 vs 1, 2.81 and 3.32 respectively; P = 0.24 and 0.16), tumor stage (odds ratio of stage 2 vs 1 and stage 3 vs 1, 1.11 and 1.28; P = 0.83 and 0.68) and age (odds ratio, 0.11; P = 0.55). The risk groups were consistent and validated in the independent Swedish and Dutch data sets used with 211 and 78 patients, respectively.
We have identified discriminatory gene expression signatures that work on both untreated and systemically treated primary breast cancer patients, with the potential to spare patients from adjuvant therapy.
There are currently a number of competing techniques for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method to assess whether a particular technique is appropriate for a specific data set, without reference to external data.
We analyzed coregulation between genes in order to detect insufficient normalization between arrays, where coregulation is measured in terms of statistical correlation. In a large collection of genes, a random pair of genes should on average have zero correlation, hence allowing a correlation test. For all data sets that we evaluated, and for the three most commonly used low-level processing procedures (MAS5, RMA and MBEI), the housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization at the probe-set level improved normalization significantly throughout.
Previous evaluation of low-level processing in the literature has been limited to artificial spike-in and mixture data sets. In the absence of a known gold-standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.
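The correlation criterion itself is simple to compute: draw many random gene pairs and average their between-gene correlations across arrays. The average should be near zero for well-normalized data, while residual array-level effects push it upward. A sketch, where the number of pairs sampled is an arbitrary illustrative choice:

```python
import numpy as np

def mean_random_pair_correlation(expr, n_pairs=2000, seed=0):
    """Average correlation, across arrays, of many random gene pairs
    from a genes x arrays matrix.  Near zero for well-normalized data;
    a clearly positive mean suggests residual array-level effects."""
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    corrs = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_genes, size=2, replace=False)
        corrs.append(np.corrcoef(expr[i], expr[j])[0, 1])
    return float(np.mean(corrs))
```

Adding the same simulated array effect to every gene, mimicking failed normalization, drives the mean pair correlation sharply positive, which is exactly the signal the test detects.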