1.  Use of log-skew-normal distribution in analysis of continuous data with a discrete component at zero 
Statistics in medicine  2008;27(18):3643-3655.
The problem of analyzing a continuous variable with a discrete component is addressed within the framework of the mixture model proposed by Moulton and Halsey. The model can be generalized by the introduction of the log-skew-normal distribution for the continuous component, and the fit can be significantly improved by its use, while retaining the interpretation of regression parameter estimates. Simulation studies and application to a real data set are used for demonstration.
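As an illustration of the kind of likelihood such a two-part model involves, the following is a minimal sketch and not the authors' implementation. It assumes a simplified model without censoring: each observation is zero with probability `p_zero`, and otherwise its logarithm follows a skew-normal distribution with location `xi`, scale `omega`, and shape `alpha`. All function and parameter names are hypothetical, and regression covariates are omitted.

```python
import math

def skew_normal_logpdf(x, xi, omega, alpha):
    """Log density of a skew-normal distribution with location xi, scale omega, shape alpha."""
    z = (x - xi) / omega
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    # Standard normal CDF evaluated at alpha * z, written with the error function.
    cdf = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    cdf = max(cdf, 1e-300)  # guard against log(0) in the far tail
    return math.log(2.0) - math.log(omega) + log_phi + math.log(cdf)

def two_part_loglik(ys, p_zero, xi, omega, alpha):
    """Log-likelihood of a simplified two-part model (assumption: no censoring).

    Zeros contribute log(p_zero); positive values contribute
    log(1 - p_zero) plus the log-skew-normal log density.
    """
    total = 0.0
    for y in ys:
        if y == 0:
            total += math.log(p_zero)
        else:
            log_y = math.log(y)
            # The Jacobian of the log transform contributes -log(y).
            total += (math.log1p(-p_zero)
                      + skew_normal_logpdf(log_y, xi, omega, alpha)
                      - log_y)
    return total
```

Setting `alpha = 0` reduces the continuous component to an ordinary log-normal distribution, which is one way to see the log-skew-normal as a generalization.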
PMCID: PMC2758628  PMID: 18186536
censoring; skew-normal distribution; two-part model
2.  Association of MAPT haplotypes with Alzheimer’s disease risk and MAPT brain gene expression levels 
MAPT encodes for tau, the predominant component of neurofibrillary tangles that are neuropathological hallmarks of Alzheimer’s disease (AD). Genetic association of MAPT variants with late-onset AD (LOAD) risk has been inconsistent, although insufficient power and incomplete assessment of MAPT haplotypes may account for this.
We examined the association of MAPT haplotypes with LOAD risk in more than 20,000 subjects (n-cases = 9,814, n-controls = 11,550) from Mayo Clinic (n-cases = 2,052, n-controls = 3,406) and the Alzheimer’s Disease Genetics Consortium (ADGC, n-cases = 7,762, n-controls = 8,144). We also assessed associations with brain MAPT gene expression levels measured in the cerebellum (n = 197) and temporal cortex (n = 202) of LOAD subjects. Six single nucleotide polymorphisms (SNPs) which tag MAPT haplotypes with frequencies greater than 1% were evaluated.
H2-haplotype tagging rs8070723-G allele associated with reduced risk of LOAD (odds ratio, OR = 0.90, 95% confidence interval, CI = 0.85-0.95, p = 5.2E-05) with consistent results in the Mayo (OR = 0.81, p = 7.0E-04) and ADGC (OR = 0.89, p = 1.26E-04) cohorts. rs3785883-A allele was also nominally significantly associated with LOAD risk (OR = 1.06, 95% CI = 1.01-1.13, p = 0.034). Haplotype analysis revealed significant global association with LOAD risk in the combined cohort (p = 0.033), with significant association of the H2 haplotype with reduced risk of LOAD as expected (p = 1.53E-04) and suggestive association with additional haplotypes. MAPT SNPs and haplotypes also associated with brain MAPT levels in the cerebellum and temporal cortex of AD subjects with the strongest associations observed for the H2 haplotype and reduced brain MAPT levels (β = -0.16 to -0.20, p = 1.0E-03 to 3.0E-03).
These results confirm the previously reported MAPT H2 associations with LOAD risk in two large series, show that this haplotype has the strongest effect on brain MAPT expression among those tested, and identify additional haplotypes with suggestive associations, which require replication in independent series. These biologically congruent results provide compelling evidence to screen the MAPT region for regulatory variants which confer LOAD risk by influencing its brain gene expression.
PMCID: PMC4198935  PMID: 25324900
3.  An Integrated Model of the Transcriptome of HER2-Positive Breast Cancer 
PLoS ONE  2013;8(11):e79298.
Our goal in these analyses was to use genomic features from a test set of primary breast tumors to build an integrated transcriptome landscape model that makes relevant hypothetical predictions about the biological and/or clinical behavior of HER2-positive breast cancer. We interrogated RNA-Seq data from benign breast lesions, ER+, triple negative, and HER2-positive tumors to identify 685 differentially expressed genes, 102 alternatively spliced genes, and 303 genes that expressed single nucleotide sequence variants (eSNVs) that were associated with the HER2-positive tumors in our survey panel. These features were integrated into a transcriptome landscape model that identified 12 highly interconnected genomic modules, each of which represents a cellular process pathway that appears to define the genomic architecture of the HER2-positive tumors in our test set. The generality of the model was confirmed by the observation that several key pathways were enriched in HER2-positive TCGA breast tumors. The ability of this model to make relevant predictions about the biology of breast cancer cells was established by the observation that integrin signaling was linked to lapatinib sensitivity in vitro and strongly associated with risk of relapse in the NCCTG N9831 adjuvant trastuzumab clinical trial dataset. Additional modules from the HER2 transcriptome model, including ubiquitin-mediated proteolysis, TGF-beta signaling, RHO-family GTPase signaling, and M-phase progression, were linked to response to lapatinib and paclitaxel in vitro and/or risk of relapse in the N9831 dataset. These data indicate that an integrated transcriptome landscape model derived from a test set of HER2-positive breast tumors has potential for predicting outcome and for identifying novel potential therapeutic strategies for this breast cancer subtype.
PMCID: PMC3815156  PMID: 24223926
4.  Novel late-onset Alzheimer disease loci variants associate with brain gene expression 
Neurology  2012;79(3):221-228.
Recent genome-wide association studies (GWAS) of late-onset Alzheimer disease (LOAD) identified 9 novel risk loci. Discovery of functional variants within genes at these loci is required to confirm their role in Alzheimer disease (AD). Single nucleotide polymorphisms that influence gene expression (eSNPs) constitute an important class of functional variants. We therefore investigated the influence of the novel LOAD risk loci on human brain gene expression.
We measured gene expression levels in the cerebellum and temporal cortex of autopsied AD subjects and those with other brain pathologies (∼400 total subjects). To determine whether any of the novel LOAD risk variants are eSNPs, we tested their cis-association with expression of 6 nearby LOAD candidate genes detectable in human brain (ABCA7, BIN1, CLU, MS4A4A, MS4A6A, PICALM) and an additional 13 genes ±100 kb of these SNPs. To identify additional eSNPs that influence brain gene expression levels of the novel candidate LOAD genes, we identified SNPs ±100 kb of their location and tested for cis-associations.
CLU rs11136000 (p = 7.81 × 10⁻⁴) and MS4A4A rs2304933/rs2304935 (p = 1.48 × 10⁻⁴ to 1.86 × 10⁻⁴) significantly influence temporal cortex expression levels of these genes. The LOAD-protective CLU and risky MS4A4A locus alleles associate with higher brain levels of these genes. There are other cis-variants that significantly influence brain expression of CLU and ABCA7 (p = 4.01 × 10⁻⁵ to 9.09 × 10⁻⁹), some of which also associate with AD risk (p = 2.64 × 10⁻² to 6.25 × 10⁻⁵).
CLU and MS4A4A eSNPs may at least partly explain the LOAD risk association at these loci. CLU and ABCA7 may harbor additional strong eSNPs. These results have implications in the search for functional variants at the novel LOAD risk loci.
PMCID: PMC3398432  PMID: 22722634
Hypertension  2012;59(6):1204-1211.
To identify genes influencing blood pressure response to an angiotensin II receptor blocker, single nucleotide polymorphisms identified by genome-wide association analysis of the response to candesartan were validated by opposite direction associations with the response to a thiazide diuretic, hydrochlorothiazide. 198 White and 193 African Americans with primary hypertension were sampled from opposite tertiles of the race-sex-specific distributions of age-adjusted diastolic blood pressure response to candesartan. 285 polymorphisms associated with the response to candesartan at p<10⁻⁴ in Whites. 273 of the 285 polymorphisms, which were available for analysis in a separate sample of 196 Whites, validated for opposite direction associations with the response to hydrochlorothiazide (Fisher’s χ² 1-sided p=0.02). Among the 273 polymorphisms, those in the chromosome 11q21 region were the most significantly associated with response to candesartan in Whites (e.g., rs11020821 near FUT4, p=8.98×10⁻⁷), had the strongest opposite direction associations with response to hydrochlorothiazide (e.g., rs3758785 in GPR83, p=7.10×10⁻³), and had same direction associations with response to candesartan in the 193 African Americans (e.g., rs16924603 near FUT4, p=1.52×10⁻²). Also notable among the 273 polymorphisms was rs11649420 on chromosome 16 in the amiloride-sensitive sodium channel subunit SCNN1G involved in mediating renal sodium reabsorption and maintaining blood pressure when the renin-angiotensin system is inhibited by candesartan. These results support the utility of genomewide association analyses to identify novel genes predictive of opposite direction associations with blood pressure responses to inhibitors of the renin-angiotensin and renal sodium transport systems.
PMCID: PMC3530397  PMID: 22566498
Hypertension; pharmacogenetics; diuretic; blood pressure; genome
6.  Concordance of Changes in Metabolic Pathways Based on Plasma Metabolomics and Skeletal Muscle Transcriptomics in Type 1 Diabetes 
Diabetes  2012;61(5):1004-1016.
Insulin regulates many cellular processes, but the full impact of insulin deficiency on cellular functions remains to be defined. Applying a mass spectrometry–based nontargeted metabolomics approach, we report here alterations of 330 plasma metabolites representing 33 metabolic pathways during an 8-h insulin deprivation in type 1 diabetic individuals. These pathways included those known to be affected by insulin such as glucose, amino acid and lipid metabolism, Krebs cycle, and immune responses and those hitherto unknown to be altered including prostaglandin, arachidonic acid, leukotrienes, neurotransmitters, nucleotides, and anti-inflammatory responses. A significant concordance of metabolome and skeletal muscle transcriptome–based pathways supports an assumption that plasma metabolites are chemical fingerprints of cellular events. Although insulin treatment normalized plasma glucose and many other metabolites, there were 71 metabolites and 24 pathways that differed between nondiabetes and insulin-treated type 1 diabetes. Confirmation of many known pathways altered by insulin using a single blood test offers confidence in the current approach. Future research needs to be focused on newly discovered pathways affected by insulin deficiency and systemic insulin treatment to determine whether they contribute to the high morbidity and mortality in T1D despite insulin treatment.
PMCID: PMC3331761  PMID: 22415876
7.  Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus 
To evaluate data fragmentation across healthcare centers with regard to the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm developed to differentiate (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes.
Materials and methods
This population-based study identified all Olmsted County, Minnesota residents in 2007. We used provider-linked electronic medical record data from the two healthcare centers that provide >95% of all care to County residents (ie, Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA). Subjects were limited to residents with one or more encounters from January 1, 2006 through December 31, 2007 at both healthcare centers. DM-relevant data on diagnoses, laboratory results, and medication from both centers were obtained during this period. The algorithm was first executed using data from both centers (ie, the gold standard) and then from Mayo Clinic alone. Positive predictive values and false-negative rates were calculated, and the McNemar test was used to compare categorization when data from the Mayo Clinic alone were used with the gold standard. Age and sex were compared between true-positive and false-negative subjects with T2DM. Statistical significance was accepted as p<0.05.
With data from both medical centers, 765 subjects with T2DM (4256 non-DM subjects) were identified. When single-center data were used, 252 T2DM subjects (1573 non-DM subjects) were missed; an additional false-positive 27 T2DM subjects (215 non-DM subjects) were identified. The positive predictive values and false-negative rates were 95.0% (513/540) and 32.9% (252/765), respectively, for T2DM subjects and 92.6% (2683/2898) and 37.0% (1573/4256), respectively, for non-DM subjects. Age and sex distribution differed between true-positive (mean age 62.1; 45% female) and false-negative (mean age 65.0; 56.0% female) T2DM subjects.
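The reported percentages follow directly from the counts in the abstract. The short sketch below reproduces the arithmetic; the function and variable names are ours and purely illustrative.

```python
def ppv_percent(true_pos, false_pos):
    """Positive predictive value, as a percentage."""
    return 100.0 * true_pos / (true_pos + false_pos)

def fnr_percent(false_neg, gold_standard_total):
    """False-negative rate relative to the gold-standard count, as a percentage."""
    return 100.0 * false_neg / gold_standard_total

# T2DM: 765 by gold standard, 252 missed and 27 false positives with single-center data.
t2dm_true_pos = 765 - 252  # 513
print(round(ppv_percent(t2dm_true_pos, 27), 1))   # 513/540
print(round(fnr_percent(252, 765), 1))            # 252/765

# Non-DM: 4256 by gold standard, 1573 missed and 215 false positives.
nondm_true_pos = 4256 - 1573  # 2683
print(round(ppv_percent(nondm_true_pos, 215), 1))  # 2683/2898
print(round(fnr_percent(1573, 4256), 1))           # 1573/4256
```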
The findings show that application of an HTCP algorithm using data from a single medical center contributes to misclassification. These findings should be considered carefully by researchers when developing and executing HTCP algorithms.
PMCID: PMC3277630  PMID: 22249968
Algorithms; electronic medical record; research techniques; type 2 diabetes mellitus; EMR secondary and meaningful use; EHR; information retrieval; modeling; data mining; medical informatics; infection control; phenotyping; biomedical informatics; ontologies; knowledge representations; controlled terminologies and vocabularies; information retrieval; HIT data standards
8.  Homozygosity Mapping and Exome Sequencing Reveal GATAD1 Mutation in Autosomal Recessive Dilated Cardiomyopathy 
Dilated cardiomyopathy (DCM) is a heritable, genetically heterogeneous disorder, typically exhibiting autosomal dominant inheritance. Genomic strategies enable discovery of novel, unsuspected molecular underpinnings of familial DCM. We performed genome-wide mapping and exome sequencing in a unique family wherein DCM segregated as an autosomal recessive (AR) trait.
Methods and Results
Echocardiography in 17 adult descendants of first cousins revealed DCM in two female siblings and idiopathic left ventricular enlargement in their brother. Genotyping and linkage analysis mapped an AR DCM locus to chromosome 7q21, which was validated and refined by high-density homozygosity mapping. Exome sequencing of the affected sisters was then employed as a complementary strategy for mutation discovery. An iterative bioinformatics process was used to filter >40,000 genetic variants, revealing a single shared homozygous missense mutation localized to the 7q21 critical region. The mutation, absent in HapMap, 1000Genomes and 474 ethnically matched controls, altered a conserved residue of GATAD1, encoding GATA zinc finger domain-containing protein 1. Thirteen relatives were heterozygous mutation-carriers with no evidence of myocardial disease, even at advanced ages. Immunohistochemistry demonstrated nuclear localization of GATAD1 in left ventricular myocytes, yet subcellular expression and nuclear morphology were aberrant in the proband.
Linkage analysis and exome sequencing were used as synergistic genomic strategies to identify GATAD1 as a gene for AR DCM. GATAD1 binds to a histone modification site that regulates gene expression. Consistent with murine DCM caused by genetic disruption of histone deacetylases, our data implicate an inherited basis for epigenetic dysregulation in human heart failure.
PMCID: PMC3248690  PMID: 21965549
Cardiomyopathy; Genetics; Genomics; Epigenetics; Next generation sequencing
9.  Brain Expression Genome-Wide Association Study (eGWAS) Identifies Human Disease-Associated Variants 
PLoS Genetics  2012;8(6):e1002707.
Genetic variants that modify brain gene expression may also influence risk for human diseases. We measured expression levels of 24,526 transcripts in brain samples from the cerebellum and temporal cortex of autopsied subjects with Alzheimer's disease (AD, cerebellar n = 197, temporal cortex n = 202) and with other brain pathologies (non–AD, cerebellar n = 177, temporal cortex n = 197). We conducted an expression genome-wide association study (eGWAS) using 213,528 cisSNPs within ±100 kb of the tested transcripts. We identified 2,980 cerebellar cisSNP/transcript level associations (2,596 unique cisSNPs) significant in both ADs and non–ADs (q<0.05, p = 7.70×10⁻⁵ to 1.67×10⁻⁸²). Of these, 2,089 were also significant in the temporal cortex (p = 1.85×10⁻⁵ to 1.70×10⁻¹⁴¹). The top cerebellar cisSNPs had 2.4-fold enrichment for human disease-associated variants (p<10⁻⁶). We identified novel cisSNP/transcript associations for human disease-associated variants, including progressive supranuclear palsy SLCO1A2/rs11568563, Parkinson's disease (PD) MMRN1/rs6532197, Paget's disease OPTN/rs1561570; and we confirmed others, including PD MAPT/rs242557, systemic lupus erythematosus and ulcerative colitis IRF5/rs4728142, and type 1 diabetes mellitus RPS26/rs1701704. In our eGWAS, there was 2.9–3.3 fold enrichment (p<10⁻⁶) of significant cisSNPs with suggestive AD–risk association (p<10⁻³) in the Alzheimer's Disease Genetics Consortium GWAS. These results demonstrate the significant contributions of genetic factors to human brain gene expression, which are reliably detected across different brain regions and pathologies. The significant enrichment of brain cisSNPs among disease-associated variants advocates gene expression changes as a mechanism for many central nervous system (CNS) and non–CNS diseases. Combined assessment of expression and disease GWAS may provide complementary information in discovery of human disease variants with functional implications.
Our findings have implications for the design and interpretation of eGWAS in general and the use of brain expression quantitative trait loci in the study of human disease genetics.
Author Summary
Genetic variants that regulate gene expression levels can also influence human disease risk. Discovery of genomic loci that alter brain gene expression levels (brain expression quantitative trait loci = eQTLs) can be instrumental in the identification of genetic risk underlying both central nervous system (CNS) and non–CNS diseases. To systematically assess the role of brain eQTLs in human disease and to evaluate the influence of brain region and pathology in eQTL mapping, we performed an expression genome-wide association study (eGWAS) in 773 brain samples from the cerebellum and temporal cortex of ∼200 autopsied subjects with Alzheimer's disease (AD) and ∼200 with other brain pathologies (non–AD). We identified ∼3,000 significant associations between cisSNPs near ∼700 genes and their cerebellar transcript levels, which replicate in ADs and non–ADs. More than 2,000 of these associations were reproducible in the temporal cortex. The top cisSNPs are enriched for both CNS and non–CNS disease-associated variants. We identified novel and confirmed previous cisSNP/transcript associations for many disease loci, suggesting gene expression regulation as their mechanism of action. These findings demonstrate the reproducibility of the eQTL approach across different brain regions and pathologies, and advocate the combined use of gene expression and disease GWAS for identification and functional characterization of human disease-associated variants.
PMCID: PMC3369937  PMID: 22685416
10.  Glutathione S-transferase omega genes in Alzheimer and Parkinson disease risk, age-at-diagnosis and brain gene expression: an association study with mechanistic implications 
Glutathione S-transferase omega-1 and 2 genes (GSTO1, GSTO2), residing within an Alzheimer and Parkinson disease (AD and PD) linkage region, have diverse functions including mitigation of oxidative stress and may underlie the pathophysiology of both diseases. GSTO polymorphisms were previously reported to associate with risk and age-at-onset of these diseases, although inconsistent follow-up study designs make interpretation of results difficult. We assessed two previously reported SNPs, GSTO1 rs4925 and GSTO2 rs156697, in AD (3,493 ADs vs. 4,617 controls) and PD (678 PDs vs. 712 controls) for association with disease risk (case-controls), age-at-diagnosis (cases) and brain gene expression levels (autopsied subjects).
We found that rs156697 minor allele associates with significantly increased risk (odds ratio = 1.14, p = 0.038) in the older ADs with age-at-diagnosis > 80 years. The minor allele of GSTO1 rs4925 associates with decreased risk in familial PD (odds ratio = 0.78, p = 0.034). There was no other association with disease risk or age-at-diagnosis. The minor alleles of both GSTO SNPs associate with lower brain levels of GSTO2 (p = 4.7 × 10⁻¹¹ to 1.9 × 10⁻²⁷), but not GSTO1. Pathway analysis of significant genes in our brain expression GWAS identified significant enrichment for glutathione metabolism genes (p = 0.003).
These results suggest that GSTO locus variants may lower brain GSTO2 levels and consequently confer AD risk in older age. Other glutathione metabolism genes should be assessed for their effects on AD and other chronic, neurologic diseases.
PMCID: PMC3393625  PMID: 22494505
GSTO genes; Disease risk; Gene expression; Association
11.  Mayo Genome Consortia: A Genotype-Phenotype Resource for Genome-Wide Association Studies With an Application to the Analysis of Circulating Bilirubin Levels 
Mayo Clinic Proceedings  2011;86(7):606-614.
OBJECTIVE: To create a cohort for cost-effective genetic research, the Mayo Genome Consortia (MayoGC) has been assembled with participants from research studies across Mayo Clinic with high-throughput genetic data and electronic medical record (EMR) data for phenotype extraction.
PARTICIPANTS AND METHODS: Eligible participants include those who gave general research consent in the contributing studies to share high-throughput genotyping data with other investigators. Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. A genome-wide association study to identify genetic variants associated with total bilirubin levels was conducted to test the genetic research capability of the MayoGC.
RESULTS: Genome-wide significant results were observed on 2q37 (top single nucleotide polymorphism, rs4148325; P=5.0 × 10⁻⁶²) and 12p12 (top single nucleotide polymorphism, rs4363657; P=5.1 × 10⁻⁸) corresponding to a gene cluster of uridine 5′-diphospho-glucuronosyltransferases (the UGT1A cluster) and solute carrier organic anion transporter family, member 1B1 (SLCO1B1), respectively.
CONCLUSION: Genome-wide association studies have identified genetic variants associated with numerous phenotypes but have been historically limited by inadequate sample size due to costly genotyping and phenotyping. Large consortia with harmonized genotype data have been assembled to attain sufficient statistical power, but phenotyping remains a rate-limiting factor in gene discovery research efforts. The EMR consists of an abundance of phenotype data that can be extracted in a relatively quick and systematic manner. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.
PMCID: PMC3127556  PMID: 21646302
12.  Deep Sequence Analysis of Non-Small Cell Lung Cancer: Integrated Analysis of Gene Expression, Alternative Splicing, and Single Nucleotide Variations in Lung Adenocarcinomas with and without Oncogenic KRAS Mutations 
KRAS mutations are highly prevalent in non-small cell lung cancer (NSCLC), and tumors harboring these mutations tend to be aggressive and resistant to chemotherapy. We used next-generation sequencing technology to identify pathways that are specifically altered in lung tumors harboring a KRAS mutation. Paired-end RNA-sequencing of 15 primary lung adenocarcinoma tumors (8 harboring mutant KRAS and 7 with wild-type KRAS) was performed. Sequences were mapped to the human genome, and genomic features, including differentially expressed genes, alternate splicing isoforms and single nucleotide variants, were determined for tumors with and without KRAS mutation using a variety of computational methods. Network analysis was carried out on genes showing differential expression (374 genes), alternate splicing (259 genes), and SNV-related changes (65 genes) in NSCLC tumors harboring a KRAS mutation. Genes exhibiting two or more connections from the lung adenocarcinoma network were used to carry out integrated pathway analysis. The most significant signaling pathways identified through this analysis were the NFκB, ERK1/2, and AKT pathways. A 27-gene mutant KRAS-specific subnetwork was extracted based on gene–gene connections from the integrated network, and interrogated for druggable targets. Our results confirm previous evidence that mutant KRAS tumors exhibit activated NFκB, ERK1/2, and AKT pathways and may be preferentially sensitive to therapeutics targeted toward these pathways. In addition, our analysis indicates novel, previously unappreciated links between mutant KRAS and the TNFR and PPARγ signaling pathways, suggesting that targeted PPARγ antagonists and TNFR inhibitors may be useful therapeutic strategies for treatment of mutant KRAS lung tumors. Our study is the first to integrate genomic features from RNA-Seq data from NSCLC and to define a first draft genomic landscape model that is unique to tumors with oncogenic KRAS mutations.
PMCID: PMC3356053  PMID: 22655260
transcriptome sequencing; RNA-Seq; KRAS mutation; NSCLC; bioinformatics; network analysis; data integration and computational methods
13.  Batch effect correction for genome-wide methylation data with Illumina Infinium platform 
BMC Medical Genomics  2011;4:84.
Genome-wide methylation profiling has led to more comprehensive insights into gene regulation mechanisms and potential therapeutic targets. Illumina Human Methylation BeadChip is one of the most commonly used genome-wide methylation platforms. Similar to other microarray experiments, methylation data is susceptible to various technical artifacts, particularly batch effects. To date, little attention has been given to issues related to normalization and batch effect correction for this kind of data.
We evaluated three common normalization approaches and investigated their performance in batch effect removal using three datasets with different degrees of batch effects generated from HumanMethylation27 platform: quantile normalization at average β value (QNβ); two step quantile normalization at probe signals implemented in "lumi" package of R (lumi); and quantile normalization of A and B signal separately (ABnorm). Subsequent Empirical Bayes (EB) batch adjustment was also evaluated.
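For readers unfamiliar with quantile normalization, the following is a minimal, dependency-free sketch of the basic idea applied to per-sample β values. It is not the "lumi" implementation or the authors' code: it assumes one list of values per sample, all of equal length, and breaks ties by sort order rather than averaging tied ranks. The function name is hypothetical.

```python
def quantile_normalize(samples):
    """Force every sample to share the same empirical distribution.

    samples: list of equal-length lists, one list of values per sample.
    Returns a new list of lists in the original probe order.
    """
    n_probes = len(samples[0])
    n_samples = len(samples)
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(s[rank] for s in sorted_samples) / n_samples
        for rank in range(n_probes)
    ]
    normalized = []
    for s in samples:
        # Probe indices ordered from the smallest to the largest value.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples with three probes each.
result = quantile_normalize([[0.1, 0.5, 0.9], [0.2, 0.8, 0.4]])
```

After normalization each sample contains exactly the same set of values, only arranged according to that sample's own probe ranking. Batch effects that change rank order within a sample are therefore not removed by this step.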
Each normalization could remove a portion of batch effects and their effectiveness differed depending on the severity of batch effects in a dataset. For the dataset with minor batch effects (Dataset 1), normalization alone appeared adequate and "lumi" showed the best performance. However, all methods left substantial batch effects intact in the datasets with obvious batch effects and further correction was necessary. Without any correction, 50 and 66 percent of CpGs were associated with batch effects in Datasets 2 and 3, respectively. After QNβ, lumi or ABnorm, the proportion of CpGs associated with batch effects was reduced to 24, 32, and 26 percent for Dataset 2; and 37, 46, and 35 percent for Dataset 3, respectively. Additional EB correction effectively removed such remaining non-biological effects. More importantly, the two-step procedure almost tripled the numbers of CpGs associated with the outcome of interest for the two datasets.
Genome-wide methylation data from Infinium Methylation BeadChip can be susceptible to batch effects with profound impacts on downstream analyses and conclusions. Normalization can reduce part but not all batch effects. EB correction along with normalization is recommended for effective batch effect removal.
PMCID: PMC3265417  PMID: 22171553
14.  TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data 
Bioinformatics  2011;28(2):277-278.
Summary: TREAT (Targeted RE-sequencing Annotation Tool) is a tool for facile navigation and mining of the variants from both targeted resequencing and whole exome sequencing. It provides a rich integration of publicly available as well as in-house developed annotations and visualizations for variants, variant-hosting genes and host-gene pathways.
Availability and implementation: TREAT is freely available to non-commercial users as either a stand-alone annotation and visualization tool, or as a comprehensive workflow integrating sequencing alignment and variant calling. The executables, instructions and the Amazon Cloud Images of TREAT can be downloaded at the website:
Supplementary information: Supplementary data are provided at Bioinformatics online.
PMCID: PMC3259432  PMID: 22088845
15.  A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines 
Nucleic Acids Research  2011;39(15):e100.
SnowShoes-FTD, developed for fusion transcript detection in paired-end mRNA-Seq data, employs multiple steps of false positive filtering to nominate fusion transcripts with near 100% confidence. Unique features include: (i) identification of multiple fusion isoforms from two gene partners; (ii) prediction of genomic rearrangements; (iii) identification of exon fusion boundaries; (iv) generation of a 5′–3′ fusion spanning sequence for PCR validation; and (v) prediction of the protein sequences, including frame shift and amino acid insertions. We applied SnowShoes-FTD to identify 50 fusion candidates in 22 breast cancer and 9 non-transformed cell lines. Five additional fusion candidates with two isoforms were confirmed. In all, 30 of 55 fusion candidates had in-frame protein products. No fusion transcripts were detected in non-transformed cells. Consideration of the possible functions of a subset of predicted fusion proteins suggests several potentially important functions in transformation, including a possible new mechanism for overexpression of ERBB2 in a HER-positive cell line. The source code of SnowShoes-FTD is provided in two formats: one configured to run on the Sun Grid Engine for parallelization, and the other formatted to run on a single LINUX node. Executables in PERL are available for download from our web site:
PMCID: PMC3159479  PMID: 21622959
16.  Spatial normalization improves the quality of genotype calling for Affymetrix SNP 6.0 arrays 
BMC Bioinformatics  2010;11:356.
Microarray measurements are susceptible to a variety of experimental artifacts, some of which give rise to systematic biases that are spatially dependent in a unique way on each chip. It is likely that such artifacts affect many SNP arrays, but the normalization methods used in currently available genotyping algorithms make no attempt at spatial bias correction. Here, we propose an effective single-chip spatial bias removal procedure for Affymetrix 6.0 SNP arrays or platforms with similar design features. This procedure deals with both extreme and subtle biases and is intended to be applied before standard genotype calling algorithms.
Application of the spatial bias adjustments on HapMap samples resulted in higher genotype call rates with equal or even better accuracy for thousands of SNPs. Consequently the normalization procedure is expected to lead to more meaningful biological inferences and could be valuable for genome-wide SNP analysis.
Spatial normalization can potentially rescue thousands of SNPs in a genetic study at the small cost of computational time. The approach is implemented in R and available from the authors upon request.
PMCID: PMC2910027  PMID: 20587065
17.  Copy number variation and cytidine analogue cytotoxicity: A genome-wide association approach 
BMC Genomics  2010;11:357.
The human genome displays extensive copy-number variation (CNV). Recent discoveries have shown that large segments of DNA, ranging in size from hundreds to thousands of nucleotides, are either deleted or duplicated. This CNV may encompass genes, leading to a change in phenotype, including drug response phenotypes. Gemcitabine and 1-β-D-arabinofuranosylcytosine (AraC) are cytidine analogues used to treat a variety of cancers. Previous studies have shown that genetic variation may influence response to these drugs. In the present study, we set out to test the hypothesis that variation in copy number might contribute to variation in cytidine analogue response phenotypes.
We used a cell-based model system consisting of 197 ethnically-defined lymphoblastoid cell lines for which genome-wide SNP data were obtained using Illumina 550 and 650 K SNP arrays to study cytidine analogue cytotoxicity. A total of 775 CNVs with allele frequencies > 1% were identified in 102 regions across the genome. 87/102 of these loci overlapped with previously identified regions of CNV. Association of CNVs with gemcitabine and AraC IC50 values identified 11 regions with permutation p-values < 0.05. Multiplex ligation-dependent probe amplification assays were performed to verify the 11 CNV regions that were associated with this phenotype, with false positive and false negative rates for the in-silico findings of 1.3% and 0.04%, respectively. We also had basal mRNA expression array data for these same 197 cell lines, which allowed us to quantify mRNA expression for 41 probesets in or near the CNV regions identified. We found that 7 of those 41 genes were highly expressed in our lymphoblastoid cell lines, and one of the seven genes (SMYD3) that was significant in the CNV association study was selected for further functional experiments. Those studies showed that knockdown of SMYD3 in pancreatic cancer cell lines increased gemcitabine and AraC resistance in cytotoxicity assays, consistent with the results of the association analysis.
These results suggest that CNVs may play a role in variation in cytidine analogue effect. Therefore, association studies of CNVs with drug response phenotypes in cell-based model systems, when paired with functional characterization, might help to identify CNV that contributes to variation in drug response.
PMCID: PMC2894803  PMID: 20525348
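Entry 17 reports permutation p-values for the association of CNVs with IC50 values but does not describe the test statistic. The following is a minimal sketch of one way such a p-value can be obtained, assuming an absolute Pearson correlation between copy number and a drug-response phenotype as the statistic. The function names, the toy data, and the choice of statistic are illustrative assumptions, not the authors' procedure.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(copy_number, phenotype, n_perm=2000, seed=0):
    """Two-sided permutation p-value for |r| (hypothetical helper).

    Phenotype values are shuffled across cell lines to break any link
    with copy number; the p-value is the share of shuffles whose |r|
    is at least as large as the observed one (with the +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(copy_number, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(copy_number, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (simulated, not from the study): 60 cell lines, copy number
# 1 to 3, and a phenotype that rises with copy number plus noise.
toy_rng = random.Random(1)
copy_number = [toy_rng.choice([1, 2, 3]) for _ in range(60)]
log_ic50 = [1.5 * c + toy_rng.gauss(0, 1) for c in copy_number]

p_assoc = permutation_p_value(copy_number, log_ic50)
print(f"permutation p-value: {p_assoc:.4f}")
```

With `n_perm` shuffles the smallest attainable p-value is 1/(`n_perm` + 1), so the number of permutations limits the resolution of the test.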
18.  Genomic Association Analysis Suggests Chromosome 12 Locus Influencing Antihypertensive Response to Thiazide Diuretic 
Hypertension  2008;52(2):359-365.
We conducted a genome-wide association study to identify novel genes influencing diastolic blood pressure (BP) response to hydrochlorothiazide, a commonly prescribed thiazide diuretic preferred for the treatment of high BP. Affymetrix GeneChip Human Mapping 100K Arrays were used to measure single nucleotide polymorphisms across the 22 autosomes in 194 non-Hispanic black subjects and 195 non-Hispanic white subjects with essential hypertension selected from opposite tertiles of the race- and sex-specific distributions of age-adjusted diastolic BP response to hydrochlorothiazide (25 mg daily, PO, for 4 weeks). The black sample consisted of 97 “good” responders (diastolic BP response [mean±SD]=-18.3±4.2 mm Hg; age=47.1±6.1 years; 51.5% women) and 97 “poor” responders (diastolic BP response=-0.18±4.3; age=47.4±6.5 years; 51.5% women). Haplotype trend regression identified a region of chromosome 12q15 in which haplotypes constructed from 3 successive single nucleotide polymorphisms (rs317689, rs315135, and rs7297610) in proximity to lysozyme (LYZ), YEATS domain containing 4 (YEATS4), and fibroblast growth factor receptor substrate 2 (FRS2) were significantly associated with diastolic BP response (nominal P=2.39×10⁻⁷; Bonferroni corrected P=0.024; simulated experiment-wise P=0.040). Genotyping of 35 additional single nucleotide polymorphisms selected to “tag” linkage disequilibrium blocks in these genes provided corroboration that variation in LYZ and YEATS4 was associated with diastolic BP response in a statistically independent data set of 291 black subjects and in the sample of 294 white subjects. These results support the use of genome-wide association analyses to identify novel genes influencing antihypertensive drug responses.
PMCID: PMC2692710  PMID: 18591461
hypertension; pharmacogenetics; diuretic; blood pressure; genome
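Entry 18 gives a nominal P of 2.39×10⁻⁷ and a Bonferroni corrected P of 0.024. The sketch below shows how the two relate: a Bonferroni correction multiplies the nominal p-value by the number of tests and caps the result at 1. The abstract does not state the number of tests, so the count derived here is a back-calculation from the two reported values, and the round figure of 100,000 is an assumption used only to check the arithmetic.

```python
def bonferroni(p_nominal, n_tests):
    """Bonferroni-adjusted p-value: nominal p times test count, capped at 1."""
    return min(1.0, p_nominal * n_tests)


p_nominal = 2.39e-7    # reported in the abstract
p_corrected = 0.024    # reported in the abstract

# Number of tests implied by the two reported values (not stated in the abstract).
implied_tests = p_corrected / p_nominal

# Assumed round number of tests, used only to check the arithmetic.
check = bonferroni(p_nominal, 100_000)

print(f"implied number of tests: {implied_tests:,.0f}")
print(f"corrected p with 100,000 tests: {check:.4f}")
```

The implied count is on the order of 10⁵, which is in line with the 100K array named in the abstract, although the exact number of haplotype windows tested is not given there.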
19.  GLOSSI: a method to assess the association of genetic loci-sets with complex diseases 
BMC Bioinformatics  2009;10:102.
The development of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNPs), has the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases the chance of interrogating functional SNPs that are themselves causative or in linkage disequilibrium with causal SNPs, commonly used single SNP-association approaches suffer from serious multiple hypothesis testing problems and provide limited insights into combinations of loci that may contribute to complex diseases. Drawing inspiration from Gene Set Enrichment Analysis developed for gene expression data, we have developed a method, named GLOSSI (Gene-loci Set Analysis), that integrates prior biological knowledge into the statistical analysis of genotyping data to test the association of a group of SNPs (loci-set) with complex disease phenotypes. The most significant loci-sets can be used to formulate hypotheses from a functional viewpoint that can be validated experimentally.
In a simulation study, GLOSSI showed sufficient power to detect loci-sets with less than 10% of SNPs having moderate-to-large effect sizes and intermediate minor allele frequency values. When applied to a biological dataset where no single SNP-association was found in a previous study, GLOSSI was able to identify several loci-sets that are significantly related to blood pressure response to an antihypertensive drug.
GLOSSI is valuable for association of SNPs at multiple genetic loci with complex disease phenotypes. In contrast to methods based on the Kolmogorov-Smirnov statistic, the approach is parametric and only utilizes information from within the interrogated loci-set. It properly accounts for dependency among SNPs and allows the testing of loci-sets of any size.
PMCID: PMC2678095  PMID: 19344520
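Entry 19 describes a parametric test that uses only the SNPs inside a loci-set, but the abstract does not give the statistic. As a rough illustration of that idea, the sketch below combines per-SNP p-values with Fisher's method, which is an assumed stand-in and not necessarily what GLOSSI does. It also assumes independent SNPs, whereas the abstract states that GLOSSI accounts for dependency among SNPs. All function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square distribution with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the
    null hypothesis, assuming the k p-values are independent."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


# Toy loci-set of four SNP p-values (made up for illustration).
loci_set_p = [0.04, 0.20, 0.01, 0.35]
set_p = fisher_combined_p(loci_set_p)
print(f"loci-set p-value: {set_p:.4f}")
```

Because the reference distribution depends only on the number of SNPs in the set, the same calculation applies to sets of any size. With correlated SNPs the chi-square reference would be too liberal, which is why a dependency adjustment matters in practice.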

