Summary: TREAT (Targeted RE-sequencing Annotation Tool) is a tool for facile navigation and mining of the variants from both targeted resequencing and whole exome sequencing. It provides a rich integration of publicly available as well as in-house developed annotations and visualizations for variants, variant-hosting genes and host-gene pathways.
Availability and implementation: TREAT is freely available to non-commercial users as either a stand-alone annotation and visualization tool, or as a comprehensive workflow integrating sequencing alignment and variant calling. The executables, instructions and the Amazon Cloud Images of TREAT can be downloaded at the website: http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm
Contact: Hossain.Asif@mayo.edu; Kocher.JeanPierre@mayo.edu
Supplementary information: Supplementary data are provided at Bioinformatics online.
Zebrafish (Danio rerio) is a popular vertebrate model organism largely deployed using outbred laboratory animals. The nonisogenic nature of the zebrafish as a model system offers the opportunity to understand natural variations and their effect in modulating phenotype. In an effort to better characterize the range of natural variation in this model system and to complement the zebrafish reference genome project, the whole genome sequence of a wild zebrafish at 39-fold genome coverage was determined. Comparative analysis with the zebrafish reference genome revealed approximately 5.2 million single nucleotide variations and over 1.6 million insertion–deletion variations. This dataset thus represents a new catalog of genetic variations in the zebrafish genome. Further analysis revealed selective enrichment for variations in genes involved in immune function and response to the environment, suggesting genome-level adaptations to environmental niches. We also show that human disease gene orthologs in the sequenced wild zebrafish genome show a lower ratio of nonsynonymous to synonymous single nucleotide variations.
As the cost of whole genome sequencing (WGS) decreases, clinical laboratories will be looking at broadly adopting this technology to screen for variants of clinical significance. To fully leverage this technology in a clinical setting, results need to be reported quickly, as the turnaround rate could potentially impact patient care. The latest sequencers can sequence a whole human genome in about 24 hours. However, depending on the computing infrastructure available, the processing of data can take several days, with the majority of computing time devoted to aligning reads to genomics regions that are to date not clinically interpretable. In an attempt to accelerate the reporting of clinically actionable variants, we have investigated the utility of a multi-step alignment algorithm focused on aligning reads and calling variants in genomic regions of clinical relevance prior to processing the remaining reads on the whole genome. This iterative workflow significantly accelerates the reporting of clinically actionable variants with no loss of accuracy when compared to genotypes obtained with the OMNI SNP platform or to variants detected with a standard workflow that combines Novoalign and GATK.
Structural variation (SV) represents a significant, yet poorly understood contribution to an individual’s genetic makeup. Advanced next-generation sequencing technologies are widely used to discover such variations, but there is no single detection tool that is considered a community standard. In an attempt to fulfil this need, we developed an algorithm, SoftSearch, for discovering structural variant breakpoints in Illumina paired-end next-generation sequencing data. SoftSearch combines multiple strategies for detecting SV including split-read, discordant read-pair, and unmated pairs. Co-localized split-reads and discordant read pairs are used to refine the breakpoints.
We developed and validated SoftSearch using real and synthetic datasets. SoftSearch’s key features are 1) not requiring secondary (or exhaustive primary) alignment, 2) portability into established sequencing workflows, and 3) is applicable to any DNA-sequencing experiment (e.g. whole genome, exome, custom capture, etc.). SoftSearch identifies breakpoints from a small number of soft-clipped bases from split reads and a few discordant read-pairs which on their own would not be sufficient to make an SV call.
We show that SoftSearch can identify more true SVs by combining multiple sequence features. SoftSearch was able to call clinically relevant SVs in the BRCA2 gene not reported by other tools while offering significantly improved overall performance.
Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression?
Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.
Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimally sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate for their own experiments.
Insulin regulates many cellular processes, but the full impact of insulin deficiency on cellular functions remains to be defined. Applying a mass spectrometry–based nontargeted metabolomics approach, we report here alterations of 330 plasma metabolites representing 33 metabolic pathways during an 8-h insulin deprivation in type 1 diabetic individuals. These pathways included those known to be affected by insulin such as glucose, amino acid and lipid metabolism, Krebs cycle, and immune responses and those hitherto unknown to be altered including prostaglandin, arachidonic acid, leukotrienes, neurotransmitters, nucleotides, and anti-inflammatory responses. A significant concordance of metabolome and skeletal muscle transcriptome–based pathways supports an assumption that plasma metabolites are chemical fingerprints of cellular events. Although insulin treatment normalized plasma glucose and many other metabolites, there were 71 metabolites and 24 pathways that differed between nondiabetes and insulin-treated type 1 diabetes. Confirmation of many known pathways altered by insulin using a single blood test offers confidence in the current approach. Future research needs to be focused on newly discovered pathways affected by insulin deficiency and systemic insulin treatment to determine whether they contribute to the high morbidity and mortality in T1D despite insulin treatment.
Formalin fixed, paraffin embedded tissues are most commonly used for routine pathology analysis and for long term tissue preservation in the clinical setting. Many institutions have large archives of Formalin fixed, paraffin embedded tissues that provide a unique opportunity for understanding genomic signatures of disease. However, genome-wide expression profiling of Formalin fixed, paraffin embedded samples have been challenging due to RNA degradation. Because of the significant heterogeneity in tissue quality, normalization and analysis of these data presents particular challenges. The distribution of intensity values from archival tissues are inherently noisy and skewed due to differential sample degradation raising two primary concerns; whether a highly skewed array will unduly influence initial normalization of the data and whether outlier arrays can be reliably identified.
Two simple extensions of common regression diagnostic measures are introduced that measure the stress an array undergoes during normalization and how much a given array deviates from the remaining arrays post-normalization. These metrics are applied to a study involving 1618 formalin-fixed, paraffin-embedded HER2-positive breast cancer samples from the N9831 adjuvant trial processed with Illumina’s cDNA-mediated Annealing Selection extension and Ligation assay.
Proper assessment of array quality within a research study is crucial for controlling unwanted variability in the data. The metrics proposed in this paper have direct biological interpretations and can be used to identify arrays that should either be removed from analysis all together or down-weighted to reduce their influence in downstream analyses.
High-dimensional array quality; Formalin-Fixed; Paraffin-embedded tissue; Outlier detection
Thousands of novel transcripts have been identified using deep transcriptome sequencing. This discovery of large and ‘hidden’ transcriptome rejuvenates the demand for methods that can rapidly distinguish between coding and noncoding RNA. Here, we present a novel alignment-free method, Coding Potential Assessment Tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic and hexamer usage bias. CPAT software outperformed (sensitivity: 0.96, specificity: 0.97) other state-of-the-art alignment-based software such as Coding-Potential Calculator (sensitivity: 0.99, specificity: 0.74) and Phylo Codon Substitution Frequencies (sensitivity: 0.90, specificity: 0.63). In addition to high accuracy, CPAT is approximately four orders of magnitude faster than Coding-Potential Calculator and Phylo Codon Substitution Frequencies, enabling its users to process thousands of transcripts within seconds. The software accepts input sequences in either FASTA- or BED-formatted data files. We also developed a web interface for CPAT that allows users to submit sequences and receive the prediction results almost instantly.
To extract physician-asserted drug side effects from electronic medical record clinical narratives.
Materials and methods
Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System.
The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction.
The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction.
Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
Natural language processing; machine learning; information extraction; electronic medical record; Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; bioinformatics; Ontologies; Knowledge representations, Controlled terminologies and vocabularies; Information Retrieval; HIT Data Standards; Human-computer interaction and human-centered computing; Providing just-in-time access to the biomedical literature and other health information; Applications that link biomedical knowledge from diverse primary sources (includes automated indexing); Linking the genotype and phenotype
Summary: Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS sequencing data is challenging and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good quality and analyzable data is still lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting and visualization. This package facilitates a rapid transition from sequencing reads to a fully annotated CpG methylation report to biological interpretation.
Availability and implementation: SAAP-RRBS is freely available to non-commercial users at the web site http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.
firstname.lastname@example.org or email@example.com
Supplementary data are available at Bioinformatics online.
We introduce a promising methodology to identify new therapeutic targets in cancer. Proteins bind to nanoparticles to form a protein corona. We modulate this corona by using surface-engineered nanoparticles, and identify protein composition to provide insight into disease development.
Using a family of structurally homologous nanoparticles we have investigated the changes in the protein corona around surface-functionalized gold nanoparticles (AuNPs) from normal and malignant ovarian cell lysates. Proteomics analysis using mass spectrometry identified hepatoma-derived growth factor (HDGF) that is found exclusively on positively charged AuNPs (+AuNPs) after incubation with the lysates. We confirmed expression of HDGF in various ovarian cancer cells and validated binding selectivity to +AuNPs by Western blot analysis. Silencing of HDGF by siRNA resulted s inhibition in proliferation of ovarian cancer cells.
We investigated the modulation of protein corona around surface-functionalized gold nanoparticles as a promising approach to identify new therapeutic targets. The potential of our method for identifying therapeutic targets was demonstrated through silencing of HDGF by siRNA, which inhibited proliferation of ovarian cancer cells. This integrated proteomics, bioinformatics, and nanotechnology strategy demonstrates that protein corona identification can be used to discover novel therapeutic targets in cancer.
Runs of homozygosity (ROH) represents extended length of homozygotes on a long genomic distance. In oncology, it is known as loss of heterozygosity (LOH) if identified exclusively in cancer cell rather than in matched control cell. Studies have identified several genomic regions which show consistent ROH in different kinds of carcinoma. To query whether this consistency can be observed on broader spectrum, both in more cancer types and in wider genomic regions, we investigated ROH patterns in the National Cancer Institute 60 cancer cell line panel (NCI-60) and HapMap Caucasian healthy trio families. Using results from Affymetrix 500 K SNP arrays, we report a genome wide significant association of ROH regions between the NCI-60 and HapMap samples, with much a higher level of ROH (11 fold) in the cancer cell lines. Analysis shows that more severe ROH found in cancer cells appears to be the extension of existing ROH in healthy state. In the HapMap trios, the adult subgroup had a slightly but significantly higher level (1.02 fold) of ROH than did the young subgroup. For several ROH regions we observed the co-occurrence of fragile sites (FRAs). However, FRA on the genome wide level does not show a clear relationship with ROH regions.
OBJECTIVE: To create a cohort for cost-effective genetic research, the Mayo Genome Consortia (MayoGC) has been assembled with participants from research studies across Mayo Clinic with high-throughput genetic data and electronic medical record (EMR) data for phenotype extraction.
PARTICIPANTS AND METHODS: Eligible participants include those who gave general research consent in the contributing studies to share high-throughput genotyping data with other investigators. Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. A genome-wide association study to identify genetic variants associated with total bilirubin levels was conducted to test the genetic research capability of the MayoGC.
RESULTS: Genome-wide significant results were observed on 2q37 (top single nucleotide polymorphism, rs4148325; P=5.0 × 10–62) and 12p12 (top single nucleotide polymorphism, rs4363657; P=5.1 × 10–8) corresponding to a gene cluster of uridine 5′-diphospho-glucuronosyltransferases (the UGT1A cluster) and solute carrier organic anion transporter family, member 1B1 (SLCO1B1), respectively.
CONCLUSION: Genome-wide association studies have identified genetic variants associated with numerous phenotypes but have been historically limited by inadequate sample size due to costly genotyping and phenotyping. Large consortia with harmonized genotype data have been assembled to attain sufficient statistical power, but phenotyping remains a rate-limiting factor in gene discovery research efforts. The EMR consists of an abundance of phenotype data that can be extracted in a relatively quick and systematic manner. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.
SnowShoes-FTD, developed for fusion transcript detection in paired-end mRNA-Seq data, employs multiple steps of false positive filtering to nominate fusion transcripts with near 100% confidence. Unique features include: (i) identification of multiple fusion isoforms from two gene partners; (ii) prediction of genomic rearrangements; (iii) identification of exon fusion boundaries; (iv) generation of a 5′–3′ fusion spanning sequence for PCR validation; and (v) prediction of the protein sequences, including frame shift and amino acid insertions. We applied SnowShoes-FTD to identify 50 fusion candidates in 22 breast cancer and 9 non-transformed cell lines. Five additional fusion candidates with two isoforms were confirmed. In all, 30 of 55 fusion candidates had in-frame protein products. No fusion transcripts were detected in non-transformed cells. Consideration of the possible functions of a subset of predicted fusion proteins suggests several potentially important functions in transformation, including a possible new mechanism for overexpression of ERBB2 in a HER-positive cell line. The source code of SnowShoes-FTD is provided in two formats: one configured to run on the Sun Grid Engine for parallelization, and the other formatted to run on a single LINUX node. Executables in PERL are available for download from our web site: http://mayoresearch.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.
Linkage Disequilibrium (LD) bin-tagging algorithms identify a reduced set of tag SNPs that can capture the genetic variation in a population without genotyping every single SNP. However, existing tag SNP selection algorithms for designing custom genotyping panels do not take into account all platform dependent factors affecting the likelihood of a tag SNP to be successfully genotyped and many of the constraints that can be imposed by the user.
SNPPicker optimizes the selection of tag SNPs from common bin-tagging programs to design custom genotyping panels. The application uses a multi-step search strategy in combination with a statistical model to maximize the genotyping success of the selected tag SNPs. User preference toward functional SNPs can also be taken into account as secondary criteria. SNPPicker can also optimize tag SNP selection for a panel tagging multiple populations. SNPPicker can optimize custom genotyping panels including all the assay-specific constraints of Illumina's GoldenGate and Infinium assays.
A new application has been developed to maximize the success of custom multi-population genotyping panels. SNPPicker also takes into account user constraints including options for controlling runtime. Perl Scripts, Java source code and executables are available under an open source license for download at http://mayoresearch.mayo.edu/mayo/research/biostat/software.cfm
Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4).
Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers.
Using MPI with OpenMP multithreading, mpAD4 scales with near linearity on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarkian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.
Microarray measurements are susceptible to a variety of experimental artifacts, some of which give rise to systematic biases that are spatially dependent in a unique way on each chip. It is likely that such artifacts affect many SNP arrays, but the normalization methods used in currently available genotyping algorithms make no attempt at spatial bias correction. Here, we propose an effective single-chip spatial bias removal procedure for Affymetrix 6.0 SNP arrays or platforms with similar design features. This procedure deals with both extreme and subtle biases and is intended to be applied before standard genotype calling algorithms.
Application of the spatial bias adjustments on HapMap samples resulted in higher genotype call rates with equal or even better accuracy for thousands of SNPs. Consequently the normalization procedure is expected to lead to more meaningful biological inferences and could be valuable for genome-wide SNP analysis.
Spatial normalization can potentially rescue thousands of SNPs in a genetic study at the small cost of computational time. The approach is implemented in R and available from the authors upon request.
The human genome displays extensive copy-number variation (CNV). Recent discoveries have shown that large segments of DNA, ranging in size from hundreds to thousands of nucleotides, are either deleted or duplicated. This CNV may encompass genes, leading to a change in phenotype, including drug response phenotypes. Gemcitabine and 1-β-D-arabinofuranosylcytosine (AraC) are cytidine analogues used to treat a variety of cancers. Previous studies have shown that genetic variation may influence response to these drugs. In the present study, we set out to test the hypothesis that variation in copy number might contribute to variation in cytidine analogue response phenotypes.
We used a cell-based model system consisting of 197 ethnically-defined lymphoblastoid cell lines for which genome-wide SNP data were obtained using Illumina 550 and 650 K SNP arrays to study cytidine analogue cytotoxicity. 775 CNVs with allele frequencies > 1% were identified in 102 regions across the genome. 87/102 of these loci overlapped with previously identified regions of CNV. Association of CNVs with gemcitabine and AraC IC50 values identified 11 regions with permutation p-values < 0.05. Multiplex ligation-dependent probe amplification assays were performed to verify the 11 CNV regions that were associated with this phenotype; with false positive and false negative rates for the in-silico findings of 1.3% and 0.04%, respectively. We also had basal mRNA expression array data for these same 197 cell lines, which allowed us to quantify mRNA expression for 41 probesets in or near the CNV regions identified. We found that 7 of those 41 genes were highly expressed in our lymphoblastoid cell lines, and one of the seven genes (SMYD3) that was significant in the CNV association study was selected for further functional experiments. Those studies showed that knockdown of SMYD3, in pancreatic cancer cell lines increased gemcitabine and AraC resistance during cytotoxicity assay, consistent with the results of the association analysis.
These results suggest that CNVs may play a role in variation in cytidine analogue effect. Therefore, association studies of CNVs with drug response phenotypes in cell-based model systems, when paired with functional characterization, might help to identify CNV that contributes to variation in drug response.
The patient’s medication history and status changes play essential roles in medical treatment. A notable amount of medication status information typically resides in unstructured clinical narratives that require a sophisticated approach to automated classification. In this paper, we investigated rule-based and machine learning methods of medication status change classification from clinical free text. We also examined the impact of balancing training data in machine learning classification when using the data from skewed class distribution.
We conducted a genome-wide association study to identify novel genes influencing diastolic blood pressure (BP) response to hydrochlorothiazide, a commonly prescribed thiazide diuretic preferred for the treatment of high BP. Affymetrix GeneChip Human Mapping 100K Arrays were used to measure single nucleotide polymorphisms across the 22 autosomes in 194 non-Hispanic black subjects and 195 non-Hispanic white subjects with essential hypertension selected from opposite tertiles of the race- and sex-specific distributions of age-adjusted diastolic BP response to hydrochlorothiazide (25 mg daily, PO, for 4 weeks). The black sample consisted of 97 “good” responders (diastolic BP response [mean±SD]=-18.3±4.2 mm Hg; age=47.1±6.1 years; 51.5% women) and 97 “poor” responders (diastolic BP response=-0.18±4.3; age=47.4±6.5 years; 51.5% women). Haplotype trend regression identified a region of chromosome 12q15 in which haplotypes constructed from 3 successive single nucleotide polymorphisms (rs317689, rs315135, and rs7297610) in proximity to lysozyme (LYZ), YEATS domain containing 4 (YEATS4), and fibroblast growth receptor substrate 2 (FRS2) were significantly associated with diastolic BP response (nominal P=2.39×10-7; Bonferroni corrected P=0.024; simulated experiment-wise P=0.040). Genotyping of 35 additional single nucleotide polymorphisms selected to “tag” linkage disequilibrium blocks in these genes provided corroboration that variation in LYZ and YEATS4 was associated with diastolic BP response in a statistically independent data set of 291 black subjects and in the sample of 294 white subjects. These results support the use of genome-wide association analyses to identify novel genes influencing antihypertensive drug responses.
hypertension; pharmacogenetics; diuretic; blood pressure; genome
The developments of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNP) have the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases the chance of interrogating functional SNPs that are themselves causative or in linkage disequilibrium with causal SNPs, commonly used single SNP-association approaches suffer from serious multiple hypothesis testing problems and provide limited insights into combinations of loci that may contribute to complex diseases. Drawing inspiration from Gene Set Enrichment Analysis developed for gene expression data, we have developed a method, named GLOSSI (Gene-loci Set Analysis), that integrates prior biological knowledge into the statistical analysis of genotyping data to test the association of a group of SNPs (loci-set) with complex disease phenotypes. The most significant loci-sets can be used to formulate hypotheses from a functional viewpoint that can be validated experimentally.
In a simulation study, GLOSSI showed sufficient power to detect loci-sets with less than 10% of SNPs having moderate-to-large effect sizes and intermediate minor allele frequency values. When applied to a biological dataset where no single SNP-association was found in a previous study, GLOSSI was able to identify several loci-sets that are significantly related to blood pressure response to an antihypertensive drug.
GLOSSI is valuable for association of SNPs at multiple genetic loci with complex disease phenotypes. In contrast to methods based on the Kolmogorov-Smirnov statistic, the approach is parametric and only utilizes information from within the interrogated loci-set. It properly accounts for dependency among SNPs and allows the testing of loci-sets of any size.
Advanced cholangiocarcinoma continues to harbor a difficult prognosis and therapeutic options have been limited. During the course of a clinical trial of whole genomic sequencing seeking druggable targets, we examined six patients with advanced cholangiocarcinoma. Integrated genome-wide and whole transcriptome sequence analyses were performed on tumors from six patients with advanced, sporadic intrahepatic cholangiocarcinoma (SIC) to identify potential therapeutically actionable events. Among the somatic events captured in our analysis, we uncovered two novel therapeutically relevant genomic contexts that when acted upon, resulted in preliminary evidence of anti-tumor activity. Genome-wide structural analysis of sequence data revealed recurrent translocation events involving the FGFR2 locus in three of six assessed patients. These observations and supporting evidence triggered the use of FGFR inhibitors in these patients. In one example, preliminary anti-tumor activity of pazopanib (in vitro FGFR2 IC50≈350 nM) was noted in a patient with an FGFR2-TACC3 fusion. After progression on pazopanib, the same patient also had stable disease on ponatinib, a pan-FGFR inhibitor (in vitro, FGFR2 IC50≈8 nM). In an independent non-FGFR2 translocation patient, exome and transcriptome analysis revealed an allele specific somatic nonsense mutation (E384X) in ERRFI1, a direct negative regulator of EGFR activation. Rapid and robust disease regression was noted in this ERRFI1 inactivated tumor when treated with erlotinib, an EGFR kinase inhibitor. FGFR2 fusions and ERRFI mutations may represent novel targets in sporadic intrahepatic cholangiocarcinoma and trials should be characterized in larger cohorts of patients with these aberrations.
Cholangiocarcinoma is a cancer that affects the bile ducts. Unfortunately, many patients diagnosed with cholangiocarcinoma have disease that cannot be treated with surgery or has spread to other parts of the body, thus severely limiting treatment options. New advances in drug treatment have enabled treatment of these cancers with “targeted therapy” that exploits an error in the normal functioning of a tumor cell, compared to other cells in the body, thus allowing only tumor cells to be killed by the drug. We sought to identify changes in the genetic material of cholangiocarcinoma patient tumors in order to identify potential errors in cellular functioning by utilizing cutting edge genetic sequencing technology. We identified three patient tumors possessing an FGFR2 gene that was aberrantly fused to another gene. Two of these patients were able to receive targeted therapy for FGFR2 with resulting tumor shrinkage. A fourth tumor contained an error in a gene that controls a very important cellular mechanism in cancer, termed epidermal growth factor pathway (EGFR). This patient received therapy targeting this mechanism and also demonstrated response to treatment. Thus, we have been able to utilize cutting edge technology with targeted drug treatment to personalize medical treatment for cancer in cholangiocarcinoma patients.
The sequencing by the PolyA selection is the most common approach for library preparation. With limited amount or degraded RNA, alternative protocols such as the NuGEN have been developed. However, it is not yet clear how the different library preparations affect the downstream analyses of the broad applications of RNA sequencing.
Methods and Materials
Eight human mammary epithelial cell (HMEC) lines with high quality RNA were sequenced by Illumina’s mRNA-Seq PolyA selection and NuGEN ENCORE library preparation. The following analyses and comparisons were conducted: 1) the numbers of genes captured by each protocol; 2) the impact of protocols on differentially expressed gene detection between biological replicates; 3) expressed single nucleotide variant (SNV) detection; 4) non-coding RNAs, particularly lincRNA detection; and 5) intragenic gene expression.
Sequences from the NuGEN protocol had lower (75%) alignment rate than the PolyA (over 90%). The NuGEN protocol detected fewer genes (12–20% less) with a significant portion of reads mapped to non-coding regions. A large number of genes were differentially detected between the two protocols. About 17–20% of the differentially expressed genes between biological replicates were commonly detected between the two protocols. Significantly higher numbers of SNVs (5–6 times) were detected in the NuGEN samples, which were largely from intragenic and intergenic regions. The NuGEN captured fewer exons (25% less) and had higher base level coverage variance. While 6.3% of reads were mapped to intragenic regions in the PolyA samples, the percentages were much higher (20–25%) for the NuGEN samples. The NuGEN protocol did not detect more known non-coding RNAs such as lincRNAs, but targeted small and “novel” lincRNAs.
Different library preparations can have significant impacts on downstream analysis and interpretation of RNA-seq data. The NuGEN provides an alternative for limited or degraded RNA but it has limitations for some RNA-seq applications.