1.  Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics 
Science (New York, N.Y.)  2013;342(6154):1235587.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
doi:10.1126/science.1235587
PMCID: PMC3947637  PMID: 24092746
2.  SPOP Mutations in Prostate Cancer across Demographically Diverse Patient Cohorts
Neoplasia (New York, N.Y.)  2014;16(1):14-20.
Background
Recurrent mutations in the Speckle-Type POZ Protein (SPOP) gene occur in up to 15% of prostate cancers. However, the frequency and features of cancers with these mutations across different populations are unknown.
Objective
To investigate SPOP mutations across diverse cohorts and validate a series of assays employing high-resolution melting (HRM) analysis and Sanger sequencing for mutational analysis of formalin-fixed paraffin-embedded material.
Design, Setting, and Participants
720 prostate cancer samples from six international cohorts spanning Caucasian, African American, and Asian patients, including both prostate-specific antigen-screened and unscreened populations, were screened for their SPOP mutation status. Status of SPOP was correlated to molecular features (ERG rearrangement, PTEN deletion, and CHD1 deletion) as well as clinical and pathologic features.
Results and Limitations
Overall frequency of SPOP mutations was 8.1% (4.6% to 14.4%), SPOP mutation was inversely associated with ERG rearrangement (P < .01), and SPOP mutant (SPOPmut) cancers had higher rates of CHD1 deletions (P < .01). There were no significant differences in biochemical recurrence in SPOPmut cancers. Limitations of this study include missing mutational data due to sample quality and lack of power to identify a difference in clinical outcomes.
Conclusion
SPOP is mutated in 4.6% to 14.4% of patients with prostate cancer across different ethnic and demographic backgrounds. There was no significant association of SPOP mutations with ethnicity or with clinical or pathologic parameters. Mutual exclusivity of SPOP mutation with ERG rearrangement as well as a high association with CHD1 deletion reinforces SPOP mutation as defining a distinct molecular subclass of prostate cancer.
PMCID: PMC3924544  PMID: 24563616
3.  Identification of Molecular Tumor Markers in Renal Cell Carcinomas with TFE3 Protein Expression by RNA Sequencing
Neoplasia (New York, N.Y.)  2013;15(11):1231-1240.
TFE3 translocation renal cell carcinoma (tRCC) is defined by chromosomal translocations involving the TFE3 transcription factor at chromosome Xp11.2. Genetically proven TFE3 tRCCs have a broad histologic spectrum with features that overlap those of other renal tumor subtypes. In this study, we aimed to characterize RCC with TFE3 protein expression. Using next-generation whole transcriptome sequencing (RNA-Seq) as a discovery tool, we analyzed fusion transcripts, gene expression profile, and somatic mutations in frozen tissue of one TFE3 tRCC. By applying a computational analysis developed to call chimeric RNA molecules from paired-end RNA-Seq data, we confirmed the known TFE3 translocation. Its fusion partner SFPQ has already been described as a fusion partner in tRCCs. In addition, an RNA read-through chimera between TMED6 and COG8 as well as MET and KDR (VEGFR2) point mutations were identified. An EGFR mutation, but no chromosomal rearrangements, was identified in a control group of five clear cell RCCs (ccRCCs). The TFE3 tRCC could be clearly distinguished from the ccRCCs by RNA-Seq gene expression measurements using a previously reported tRCC gene signature. In validation experiments using reverse transcription-PCR, TMED6-COG8 chimera expression was significantly higher in nine TFE3 translocated and six TFE3-expressing/non-translocated RCCs than in 24 ccRCCs (P < .001) and 22 papillary RCCs (P < .05–.07). Immunohistochemical analysis of selected genes from the tRCC gene signature showed significantly higher eukaryotic translation elongation factor 1 alpha 2 (EEF1A2) and Contactin 3 (CNTN3) expression in 16 TFE3 translocated and six TFE3-expressing/non-translocated RCCs than in over 200 ccRCCs (P < .0001, both).
PMCID: PMC3859447  PMID: 24339735
4.  VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment 
Bioinformatics  2012;28(17):2267-2269.
Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment.
Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org.
Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts368
PMCID: PMC3426844  PMID: 22743228
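The "simple intersection of genomic coordinates with genomic features" that the VAT summary takes as its starting point can be sketched as follows. This is a minimal illustration and not VAT's implementation (VAT itself is written in C and PHP); the `Feature` and `Variant` types, the half-open coordinate convention, and the feature names are all assumptions made for the example. It also shows why the naive view is incomplete: a single deletion can hit exons of two alternatively spliced transcripts at once.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical annotated region, half-open interval [start, end)."""
    chrom: str
    start: int
    end: int
    name: str


class Variant(NamedTuple):
    """Hypothetical variant: position plus number of reference bases affected."""
    chrom: str
    pos: int
    length: int  # 1 for a single-nucleotide variant


def overlapping_features(variant: Variant, features: List[Feature]) -> List[str]:
    """Return names of all features whose interval overlaps the variant."""
    v_start = variant.pos
    v_end = variant.pos + max(variant.length, 1)
    return [
        f.name
        for f in features
        if f.chrom == variant.chrom and f.start < v_end and v_start < f.end
    ]


# Illustrative annotation: two transcripts of one gene with overlapping exons.
features = [
    Feature("chr1", 100, 200, "geneA:transcript1:exon1"),
    Feature("chr1", 150, 250, "geneA:transcript2:exon1"),
    Feature("chr2", 100, 200, "geneB:transcript1:exon1"),
]

snv = Variant("chr1", 120, 1)        # touches only transcript1
deletion = Variant("chr1", 195, 10)  # spans exons of both transcripts

snv_hits = overlapping_features(snv, features)
deletion_hits = overlapping_features(deletion, features)
```

A linear scan is enough for a sketch; an interval tree or sorted index would be the usual choice at genome scale.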
5.  Recurrent NCOA2 gene rearrangements in congenital/infantile spindle cell rhabdomyosarcoma 
Genes, chromosomes & cancer  2013;52(6):538-550.
Spindle cell rhabdomyosarcoma (RMS) is a rare form of RMS whose clinical characteristics and behavior differ between children and adult patients. Its genetic hallmark remains unknown, and it remains debatable whether there is a pathogenetic relationship between the spindle cell and the so-called sclerosing RMS. We studied two pediatric and one adult spindle cell RMS by next generation RNA sequencing and used FusionSeq for data analysis to detect novel fusions. An SRF-NCOA2 gene fusion was detected in a spindle cell RMS from the posterior neck in a 7 month-old child. The fusion matched the tumor karyotype and was further confirmed by fluorescence in situ hybridization (FISH) and by RT-PCR, which showed fusion of SRF exon 6 to NCOA2 exon 12. An additional 14 spindle cell (from 8 children and 6 adults) and 4 sclerosing (from 2 children and 2 adults) RMS were tested by FISH for the presence of abnormalities in NCOA2, SRF, as well as for PAX3 and NCOA1, identifying NCOA2 rearrangements in two additional spindle cell RMS from a 3 month-old and a 4 week-old child, both arising in the chest wall. In the latter tumor, TEAD1 was identified by rapid amplification of cDNA ends (RACE) to be the NCOA2 gene fusion partner. None of the adult tumors were positive for NCOA2 rearrangement. Despite similar histomorphology in adults and young children, these results suggest that spindle cell RMS is a heterogeneous disease genetically as well as clinically. Our findings also support a relationship between NCOA2-rearranged spindle cell RMS occurring in young childhood and the so-called congenital RMS, which often displays rearrangements at the 8q13 locus (NCOA2).
doi:10.1002/gcc.22050
PMCID: PMC3734530  PMID: 23463663
rhabdomyosarcoma; spindle cell; NCOA2; SRF; TEAD1; translocation; infantile
6.  Epigenomic Alterations in Localized and Advanced Prostate Cancer
Neoplasia (New York, N.Y.)  2013;15(4):373-383.
Although prostate cancer (PCa) is the second leading cause of cancer death among men worldwide, not all men diagnosed with PCa will die from the disease. A critical challenge, therefore, is to distinguish indolent PCa from more advanced forms to guide appropriate treatment decisions. We used Enhanced Reduced Representation Bisulfite Sequencing, a genome-wide high-coverage single-base resolution DNA methylation method to profile seven localized PCa samples, seven matched benign prostate tissues, and six aggressive castration-resistant prostate cancer (CRPC) samples. We integrated these data with RNA-seq and whole-genome DNA-seq data to comprehensively characterize the PCa methylome, detect changes associated with disease progression, and identify novel candidate prognostic biomarkers. Our analyses revealed the correlation of cytosine guanine dinucleotide island (CGI)-specific hypermethylation with disease severity and association of certain breakpoints (deletion, tandem duplications, and interchromosomal translocations) with DNA methylation. Furthermore, integrative analysis of methylation and single-nucleotide polymorphisms (SNPs) uncovered widespread allele-specific methylation (ASM) for the first time in PCa. We found that most DNA methylation changes occurred in the context of ASM, suggesting that variations in tumor epigenetic landscape of individuals are partly mediated by genetic differences, which may affect PCa disease progression. We further selected a panel of 13 CGIs demonstrating increased DNA methylation with disease progression and validated this panel in an independent cohort of 20 benign prostate tissues, 16 PCa, and 8 aggressive CRPCs. These results warrant clinical evaluation in larger cohorts to help distinguish indolent PCa from advanced disease.
PMCID: PMC3612910  PMID: 23555183
7.  Molecular Characterization of Neuroendocrine Prostate Cancer and Identification of New Drug Targets 
Cancer discovery  2011;1(6):487-495.
Neuroendocrine prostate cancer (NEPC) is an aggressive subtype of prostate cancer that most commonly evolves from preexisting prostate adenocarcinoma (PCA). Using Next Generation RNA-sequencing and oligonucleotide arrays, we profiled 7 NEPC, 30 PCA, and 5 benign prostate tissue (BEN), and validated findings on tumors from a large cohort of patients (37 NEPC, 169 PCA, 22 BEN) using IHC and FISH. We discovered significant overexpression and gene amplification of AURKA and MYCN in 40% of NEPC and 5% of PCA, respectively, and evidence that they cooperate to induce a neuroendocrine phenotype in prostate cells. There was dramatic and enhanced sensitivity of NEPC (and MYCN overexpressing PCA) to Aurora kinase inhibitor therapy both in vitro and in vivo, with complete suppression of neuroendocrine marker expression following treatment. We propose that alterations in Aurora kinase A and N-myc are involved in the development of NEPC, and future clinical trials will help determine the efficacy of Aurora kinase inhibitor therapy.
doi:10.1158/2159-8290.CD-11-0130
PMCID: PMC3290518  PMID: 22389870
neuroendocrine prostate cancer; aurora kinase A; n-myc; drug targets
8.  The real cost of sequencing: higher than you think! 
Genome Biology  2011;12(8):125.
Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation'. But is this sufficient to ensure cost-effective and efficient 'knowledge generation'?
doi:10.1186/gb-2011-12-8-125
PMCID: PMC3245608  PMID: 21867570
Bioinformatics; costs of sequencing; data analysis; experimental design; next-generation sequencing; sample collection
9.  Genomic Analysis of the Hydrocarbon-Producing, Cellulolytic, Endophytic Fungus Ascocoryne sarcoides 
PLoS Genetics  2012;8(3):e1002558.
The microbial conversion of solid cellulosic biomass to liquid biofuels may provide a renewable energy source for transportation fuels. Endophytes represent a promising group of organisms, as they are a mostly untapped reservoir of metabolic diversity. They are often able to degrade cellulose, and they can produce an extraordinary diversity of metabolites. The filamentous fungal endophyte Ascocoryne sarcoides was shown to produce potential-biofuel metabolites when grown on a cellulose-based medium; however, the genetic pathways needed for this production are unknown and the lack of genetic tools makes traditional reverse genetics difficult. We present the genomic characterization of A. sarcoides and use transcriptomic and metabolomic data to describe the genes involved in cellulose degradation and to provide hypotheses for the biofuel production pathways. In total, almost 80 biosynthetic clusters were identified, including several previously found only in plants. Additionally, many transcriptionally active regions outside of genes showed condition-specific expression, offering more evidence for the role of long non-coding RNA in gene regulation. This is one of the highest quality fungal genomes and, to our knowledge, the only thoroughly annotated and transcriptionally profiled fungal endophyte genome currently available. The analyses and datasets contribute to the study of cellulose degradation and biofuel production and provide the genomic foundation for the study of a model endophyte system.
Author Summary
A renewable source of energy is a pressing global need. The biological conversion of lignocellulose to biofuels by microorganisms presents a promising avenue, but few organisms have been studied thoroughly enough to develop the genetic tools necessary for rigorous experimentation. The filamentous-fungal endophyte A. sarcoides produces metabolites when grown on a cellulose-based medium that include eight-carbon volatile organic compounds, which are potential biofuel targets. Here we use broadly applicable methods including genomics, transcriptomics, and metabolomics to explore the biofuel production of A. sarcoides. These data were used to assemble the genome into 16 scaffolds, to thoroughly annotate the cellulose-degradation machinery, and to make predictions for the production pathway for the eight-carbon volatiles. Extremely high expression of the gene swollenin when grown on cellulose highlights the importance of accessory proteins in addition to the enzymes that catalyze the breakdown of the polymers. Correlation of the production of the eight-carbon biofuel-like metabolites with the expression of lipoxygenase pathway genes suggests the catabolism of linoleic acid as the mechanism of eight-carbon compound production. This is the first fungal genome to be sequenced in the family Helotiaceae, and A. sarcoides was isolated as an endophyte, making this work also potentially useful in fungal systematics and the study of plant–fungus relationships.
doi:10.1371/journal.pgen.1002558
PMCID: PMC3291568  PMID: 22396667
10.  IQSeq: Integrated Isoform Quantification Analysis Based on Next-Generation Sequencing 
PLoS ONE  2012;7(1):e29175.
With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present the theoretical results on which IQSeq is based, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.
doi:10.1371/journal.pone.0029175
PMCID: PMC3253133  PMID: 22238592
11.  Genomics and Privacy: Implications of the New Reality of Closed Data for the Field 
PLoS Computational Biology  2011;7(12):e1002278.
Open source and open data have been driving forces in bioinformatics in the past. However, privacy concerns may soon change the landscape, limiting future access to important data sets, including personal genomics data. Here we survey this situation in some detail, describing, in particular, how the large scale of the data from personal genomic sequencing makes it especially hard to share data, exacerbating the privacy problem. We also go over various aspects of genomic privacy: first, there is basic identifiability of subjects having their genome sequenced. However, even for individuals who have consented to be identified, there is the prospect of very detailed future characterization of their genotype, which, unanticipated at the time of their consent, may be more personal and invasive than the release of their medical records. We go over various computational strategies for dealing with the issue of genomic privacy. One can “slice” and reformat datasets to allow them to be partially shared while securing the most private variants. This is particularly applicable to functional genomics information, which can be largely processed without variant information. For handling the most private data there are a number of legal and technological approaches—for example, modifying the informed consent procedure to acknowledge that privacy cannot be guaranteed, and/or employing a secure cloud computing environment. Cloud computing in particular may allow access to the data in a more controlled fashion than the current practice of downloading and computing on large datasets. Furthermore, it may be particularly advantageous for small labs, given that the burden of many privacy issues falls disproportionately on them in comparison to large corporations and genome centers. Finally, we discuss how education of future genetics researchers will be important, with curriculums emphasizing privacy and data security. 
However, teaching personal genomics with identifiable subjects in the university setting will, in turn, create additional privacy issues and social conundrums.
doi:10.1371/journal.pcbi.1002278
PMCID: PMC3228779  PMID: 22144881
12.  The genomic complexity of primary human prostate cancer 
Nature  2011;470(7333):214-220.
Prostate cancer is the second most common cause of male cancer deaths in the United States. Here we present the complete sequence of seven primary prostate cancers and their paired normal counterparts. Several tumors contained complex chains of balanced rearrangements that occurred within or adjacent to known cancer genes. Rearrangement breakpoints were enriched near open chromatin, androgen receptor and ERG DNA binding sites in the setting of the ETS gene fusion TMPRSS2-ERG, but inversely correlated with these regions in tumors lacking ETS fusions. This observation suggests a link between chromatin or transcriptional regulation and the genesis of genomic aberrations. Three tumors contained rearrangements that disrupted CADM2, and four harbored events disrupting either PTEN (unbalanced events), a prostate tumor suppressor, or MAGI2 (balanced events), a PTEN interacting protein not previously implicated in prostate tumorigenesis. Thus, genomic rearrangements may arise from transcriptional or chromatin aberrancies to engage prostate tumorigenic mechanisms.
doi:10.1038/nature09744
PMCID: PMC3075885  PMID: 21307934
13.  Estrogen-dependent signaling in a molecularly distinct subclass of aggressive prostate cancer 
Background
The majority of prostate cancers harbor gene fusions of the 5′-untranslated region of the androgen-regulated transmembrane protease, serine 2 (TMPRSS2) promoter with erythroblast transformation specific (ETS) transcription factor family members. The common v-ets erythroblastosis virus E26 oncogene homolog [avian] (TMPRSS2–ERG) fusion is associated with a more aggressive clinical phenotype, implying the existence of a distinct subclass of prostate cancer defined by this fusion.
Methods
We used cDNA-mediated annealing, selection, ligation, and extension to determine the expression profiles of 6144 transcriptionally informative genes in archived biopsy samples from 455 prostate cancer patients in the Swedish Watchful Waiting cohort (1987–1999) and the US-based Physicians Health Study cohort (1983–2003). A gene expression signature for prostate cancers with the TMPRSS2-ERG fusion was determined using partitioning and classification models and used in computational functional analysis. Cell proliferation and TMPRSS2-ERG expression in androgen receptor–negative (NCI-H660) and –positive (VCaP-ERβ) prostate cancer cells after treatment with vehicle or estrogenic compounds were assessed by viability assays and quantitative polymerase chain reaction, respectively. All statistical tests were two-sided.
Results
We identified an 87-gene expression signature that distinguishes TMPRSS2-ERG fusion prostate cancer as a discrete molecular entity (area under the curve = 0.80, 95% confidence interval [CI] = 0.792 to 0.81; P<.001). Computational analysis suggested that this fusion signature was associated with estrogen receptor (ER) signaling. Viability of NCI-H660 cells decreased after treatment with estrogen (viability normalized to day 0, estrogen vs vehicle at day 8, mean = 2.04 vs 3.40, difference = 1.36, 95% CI = 1.12 to 1.62) or ERβ agonist (ERβ agonist vs vehicle at day 8, mean = 1.86 vs 3.40, difference = 1.54, 95% CI = 1.39 to 1.69) but increased after ERα agonist treatment (ERα agonist vs vehicle at day 8, mean = 4.36 vs 3.40, difference = 0.96, 95% CI = 0.68 to 1.23). Similarly, expression of TMPRSS2-ERG decreased after ERβ agonist treatment (fold change over internal control, ERβ agonist vs vehicle at 24 hours, NCI H660, mean = 0.57-fold vs 1.0-fold, difference = 0.43, 95% CI = 0.29-fold to 0.57-fold) and increased after ERα agonist treatment (ERα agonist vs vehicle at 24 hours, mean = 5.63-fold vs 1.0-fold, difference = 4.63-fold, 95% CI = 4.34-fold to 4.92-fold).
Conclusions
TMPRSS2-ERG fusion prostate cancer is a distinct molecular subclass. TMPRSS2-ERG expression is regulated by a novel ER-dependent mechanism.
doi:10.1093/jnci/djn150
PMCID: PMC3073404  PMID: 18505969
14.  RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries 
Bioinformatics  2010;27(2):281-283.
Summary: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses.
Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/.
Contact: lukas.habegger@yale.edu; mark.gerstein@yale.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq643
PMCID: PMC3018817  PMID: 21134889
15.  FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data 
Genome Biology  2010;11(10):R104.
We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.
doi:10.1186/gb-2010-11-10-r104
PMCID: PMC3218660  PMID: 20964841
16.  Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays 
BMC Genomics  2010;11:383.
Background
Tiling arrays have been the tool of choice for probing an organism's transcriptome without prior assumptions about the transcribed regions, but RNA-Seq is becoming a viable alternative as the costs of sequencing continue to decrease. Understanding the relative merits of these technologies will help researchers select the appropriate technology for their needs.
Results
Here, we compare these two platforms using a matched sample of poly(A)-enriched RNA isolated from the second larval stage of C. elegans. We find that the raw signals from these two technologies are reasonably well correlated but that RNA-Seq outperforms tiling arrays in several respects, notably in exon boundary detection and dynamic range of expression. By exploring the accuracy of sequencing as a function of depth of coverage, we found that about 4 million reads are required to match the sensitivity of two tiling array replicates. The effects of cross-hybridization were analyzed using a "nearest neighbor" classifier applied to array probes; we describe a method for determining potential "black list" regions whose signals are unreliable. Finally, we propose a strategy for using RNA-Seq data as a gold standard set to calibrate tiling array data. All tiling array and RNA-Seq data sets have been submitted to the modENCODE Data Coordinating Center.
Conclusions
Tiling arrays effectively detect transcript expression levels at a low cost for many species while RNA-Seq provides greater accuracy in several regards. Researchers will need to carefully select the technology appropriate to the biological investigations they are undertaking. It will also be important to reconsider a comparison such as ours as sequencing technologies continue to evolve.
doi:10.1186/1471-2164-11-383
PMCID: PMC3091629  PMID: 20565764
17.  Distinct Genomic Aberrations Associated With ERG Rearranged Prostate Cancer 
Genes, chromosomes & cancer  2009;48(4):366-380.
Emerging molecular and clinical data suggest that ETS fusion prostate cancer represents a distinct molecular subclass, driven most commonly by a hormonally regulated promoter and characterized by an aggressive natural history. The study of the genomic landscape of prostate cancer in the light of ETS fusion events is required to understand the foundation of this molecularly and clinically distinct subtype. We performed genome-wide profiling of 49 primary prostate cancers and identified 20 recurrent chromosomal copy number aberrations, mainly occurring as genomic losses. Co-occurring events included losses at 19q13.32 and 1p22.1. We discovered 3 genomic events associated with ERG rearranged prostate cancer, affecting 6q, 7q, and 16q. 6q loss in non-rearranged prostate cancer is accompanied by gene expression deregulation in an independent dataset and by protein deregulation of MYO6. To analyze copy number alterations within the ETS genes, we performed a comprehensive analysis of all 27 ETS genes and of the 3 Mbp genomic area between ERG and TMPRSS2 (21q) with an unprecedented resolution (30 bp). We demonstrate that high-resolution tiling arrays can be used to pinpoint breakpoints leading to fusion events. This study provides further support to defining a distinct molecular subtype of prostate cancer based on the presence of ETS gene rearrangements.
doi:10.1002/gcc.20647
PMCID: PMC2674964  PMID: 19156837
ETS genes; prostate cancer; gain; loss
18.  Molecular sampling of prostate cancer: a dilemma for predicting disease progression 
Background
Current prostate cancer prognostic models are based on pre-treatment prostate specific antigen (PSA) levels, biopsy Gleason score, and clinical staging but in practice are inadequate to accurately predict disease progression. Hence, we sought to develop a molecular panel for prostate cancer progression by reasoning that molecular profiles might further improve current clinical models.
Methods
We analyzed a Swedish Watchful Waiting cohort with up to 30 years of clinical follow up using a novel method for gene expression profiling. This cDNA-mediated annealing, selection, ligation, and extension (DASL) method enabled the use of formalin-fixed paraffin-embedded transurethral resection of prostate (TURP) samples taken at the time of the initial diagnosis. We determined the expression profiles of 6100 genes for 281 men divided in two extreme groups: men who died of prostate cancer and men who survived more than 10 years without metastases (lethals and indolents, respectively). Several statistical and machine learning models using clinical and molecular features were evaluated for their ability to distinguish lethal from indolent cases.
Results
Surprisingly, none of the predictive models using molecular profiles significantly improved over models using clinical variables only. Additional computational analysis confirmed that molecular heterogeneity within both the lethal and indolent classes is widespread in prostate cancer as compared to other types of tumors.
Conclusions
The determination of the molecularly dominant tumor nodule may be limited by sampling at time of initial diagnosis, may not be present at time of initial diagnosis, or may occur as the disease progresses making the development of molecular biomarkers for prostate cancer progression challenging.
doi:10.1186/1755-8794-3-8
PMCID: PMC2855514  PMID: 20233430
19.  Optimizing copy number variation analysis using genome-wide short sequence oligonucleotide arrays 
Nucleic Acids Research  2010;38(10):3275-3286.
The detection of copy number variants (CNV) by array-based platforms provides valuable insight into human diversity. However, suboptimal study design and data processing negatively affect CNV assessment. We quantitatively evaluate their impact when short-sequence oligonucleotide arrays are applied (Affymetrix Genome-Wide Human SNP Array 6.0) by evaluating 42 HapMap samples for CNV detection. Several processing and segmentation strategies are implemented, and results are compared to CNV assessment obtained using an oligonucleotide array CGH platform designed to query CNVs at high resolution (Agilent). We quantitatively demonstrate that different reference models (e.g. single versus pooled sample reference) used to detect CNVs are a major source of inter-platform discrepancy (up to 30%) and that CNVs residing within segmental duplication regions (higher reference copy number) are significantly harder to detect (P < 0.0001). After adjusting Affymetrix data to mimic the Agilent experimental design (reference sample effect), we applied several common segmentation approaches and evaluated differential sensitivity and specificity for CNV detection, which ranged from 39–77% and 86–100%, respectively, for non-segmental duplication regions and from 18–55% and 39–77% for segmental duplications. Our results are relevant to any array-based CNV study and provide guidelines to optimize performance based on study-specific objectives.
doi:10.1093/nar/gkq073
PMCID: PMC2879534  PMID: 20156996
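The sensitivity and specificity figures quoted in the abstract above follow the standard definitions. As a minimal sketch, assuming CNV calls and reference calls are reduced to sets of region identifiers (the function name, the identifiers, and the toy data below are hypothetical and are not taken from the study), the two measures could be computed as:

```python
def sensitivity_specificity(called, truth, universe):
    """Compare CNV calls against a reference set.

    called   -- set of region ids called as CNV by the platform under test
    truth    -- set of region ids reported as CNV by the reference platform
    universe -- set of all region ids that were evaluated
    (All three inputs are illustrative assumptions, not the paper's data model.)
    """
    tp = len(called & truth)             # called and truly variant
    fn = len(truth - called)             # truly variant but missed
    fp = len(called - truth)             # called but not in the reference
    tn = len(universe - called - truth)  # correctly left uncalled

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 10 regions, 4 true CNVs, 4 calls of which 3 are correct.
sens, spec = sensitivity_specificity(
    called={0, 1, 2, 7},
    truth={0, 1, 2, 3},
    universe=set(range(10)),
)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Treating regions as discrete identifiers sidesteps the question of partial overlap between segments, which a real comparison between array platforms would need to define explicitly.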
20.  N-myc Downstream Regulated Gene 1 (NDRG1) Is Fused to ERG in Prostate Cancer 
Neoplasia (New York, N.Y.)  2009;11(8):804-811.
A step toward the molecular classification of prostate cancer was the discovery of recurrent erythroblast transformation-specific rearrangements, most commonly fusing the androgen-regulated TMPRSS2 promoter to ERG. The TMPRSS2-ERG fusion is observed in around 90% of tumors that overexpress the oncogene ERG. The goal of the current study was to complete the characterization of these ERG-overexpressing prostate cancers. Using fluorescence in situ hybridization and reverse transcription-polymerase chain reaction assays, we screened 101 prostate cancers, identifying 34 cases (34%) with the TMPRSS2-ERG fusion. Seven cases demonstrated ERG rearrangement by fluorescence in situ hybridization without the presence of TMPRSS2-ERG fusion messenger RNA transcripts. Screening for known 5′ partners, we determined that three cases harbored the SLC45A3-ERG fusion. To discover novel 5′ partners in these ERG-overexpressing and ERG-rearranged cases, we used paired-end RNA sequencing. We first confirmed the utility of this approach by identifying the TMPRSS2-ERG fusion in a known positive prostate cancer case and then discovered a novel fusion involving the androgen-inducible tumor suppressor, NDRG1 (N-myc downstream regulated gene 1), and ERG in two cases. Unlike TMPRSS2-ERG and SLC45A3-ERG fusions, the NDRG1-ERG fusion is predicted to encode a chimeric protein. Like TMPRSS2, SLC45A3 and NDRG1 are inducible not only by androgen but also by estrogen. This study demonstrates that most ERG-overexpressing prostate cancers harbor hormonally regulated TMPRSS2-ERG, SLC45A3-ERG, or NDRG1-ERG fusions. Broader implications of this study support the use of RNA sequencing to discover novel cancer translocations.
PMCID: PMC2713587  PMID: 19649210
21.  The role of disorder in interaction networks: a structural analysis 
Recent studies have emphasized the value of incorporating structural information into the topological analysis of protein networks. Here, we utilized structural information to investigate the role of intrinsic disorder in these networks. Hub proteins tend to be more disordered than other proteins (i.e. the proteome average); however, we find this is true only for those with one or two binding interfaces (‘single’-interface hubs). In contrast, the distribution of disordered residues in multi-interface hubs is indistinguishable from the overall proteome. Surprisingly, we find that the binding interfaces in single-interface hubs are highly structured, as is the case for multi-interface hubs. However, the binding partners of single-interface hubs tend to have a higher level of disorder than the proteome average, suggesting that their binding promiscuity is related to the disorder of their binding partners. In turn, the higher level of disorder of single-interface hubs can be partly explained by their tendency to bind to each other in a cascade. A good illustration of this trend can be found in signaling pathways and, more specifically, in kinase cascades. Finally, our findings have implications for the current controversy related to party and date-hubs.
doi:10.1038/msb.2008.16
PMCID: PMC2290937  PMID: 18364713
hubs; intrinsic disorder; structural networks
22.  Modeling Clinical Judgment and Implicit Guideline Compliance in the Diagnosis of Melanomas Using Machine Learning 
We explore several machine learning techniques to model the clinical decision making of 6 dermatologists diagnosing melanoma in 177 pigmented skin lesions (76 malignant, 101 benign). In particular, we apply Support Vector Machine (SVM) classifiers to model clinician judgments, Markov Blanket and SVM feature selection to eliminate clinical features that are effectively ignored by the dermatologists, and a novel explanation technique whereby regression tree induction is run on the reduced SVM model’s output to explain the physicians’ implicit patterns of decision making. Our main findings include: (a) clinician judgments can be accurately predicted, (b) subtle decision-making rules are revealed, enabling the explanation of differences of opinion among physicians, and (c) physician judgment is non-compliant with the diagnostic guidelines that physicians self-report as guiding their decision making.
PMCID: PMC1560780  PMID: 16779123