1.  Airway PI3K Pathway Activation Is an Early and Reversible Event in Lung Cancer Development 
Science translational medicine  2010;2(26):26ra25.
Although only a subset of smokers develop lung cancer, we cannot determine which smokers are at highest risk for cancer development, nor do we know the signaling pathways altered early in the process of tumorigenesis in these individuals. On the basis of the concept that cigarette smoke creates a molecular field of injury throughout the respiratory tract, this study explores oncogenic pathway deregulation in cytologically normal proximal airway epithelial cells of smokers at risk for lung cancer. We observed a significant increase in a genomic signature of phosphatidylinositol 3-kinase (PI3K) pathway activation in the cytologically normal bronchial airway of smokers with lung cancer and smokers with dysplastic lesions, suggesting that PI3K is activated in the proximal airway before tumorigenesis. Further, PI3K activity is decreased in the airway of high-risk smokers who had significant regression of dysplasia after treatment with the chemopreventive agent myo-inositol, and myo-inositol inhibits the PI3K pathway in vitro. These results suggest that deregulation of the PI3K pathway in the bronchial airway epithelium of smokers is an early, measurable, and reversible event in the development of lung cancer and that genomic profiling of these relatively accessible airway cells may enable personalized approaches to chemoprevention and therapy. Our work further suggests that additional lung cancer chemoprevention trials either targeting the PI3K pathway or measuring airway PI3K activation as an intermediate endpoint are warranted.
PMCID: PMC3694402  PMID: 20375364
3.  A pharmacogenomic method for individualized prediction of drug sensitivity 
Using valproic acid as an example, the authors demonstrate that drug response signatures derived from genome-wide expression data can identify individuals likely to respond to a drug, and propose that this method could select optimal populations for clinical trials of new therapies.
Drug response signatures that accurately reflect the cellular response to a drug can be generated from Connectivity Map and publicly available gene expression data. Predictions from the drug response signature for valproic acid correlate with sensitivity to valproic acid in breast cancer cell lines and patient tumors grown in three-dimensional culture and mouse xenografts. The MATCH algorithm provides an efficient approach for using genome-wide gene expression data to identify a target population for a drug prior to clinical trials. MATCH can predict drug sensitivity in tumors without knowledge of mechanism of action.
Unlike traditional chemotherapy, targeted cancer therapies are expected to work in only a subset of people with a particular cancer. However, biomarkers of response are not always known before clinical trial initiation. We present MATCH (Merging genomic and pharmacologic Analyses for Therapy CHoice), an algorithm for using genome-wide gene expression data to identify and validate a genomic biomarker of sensitivity (see Figure 1). Our proof-of-principle example is valproic acid (VPA), but we also show that an estrogen blocking drug currently used for breast cancer and a B-RAF inhibitor in trials for melanoma give predictions that correspond to their clinical uses.
We use genome-wide gene expression data from treated and untreated samples from the Connectivity Map to generate a VPA response signature. We validate that the VPA signature can identify treated and untreated cells in an independent data set of normal cells and in independent samples from the Connectivity Map. The AUC for the ROC curve is 0.86. We then apply the VPA signature to publicly available data sets from a panel of cancer cell lines and from primary tumor and normal tissue samples. These data suggest that there is a subset of women with breast cancer who will be sensitive to VPA. Finally, we validate that our predictions correlate with sensitivity to VPA in breast cancer cell lines grown in two-dimensional culture, primary breast tumor samples grown in three-dimensional culture, and in vivo mouse breast cancer xenografts. Together, these studies show that MATCH can identify cancer patients most likely to respond to a specific drug treatment.
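As a rough illustration of the validation statistic quoted above, the area under the ROC curve can be read as the probability that a randomly chosen treated sample receives a higher signature score than a randomly chosen untreated one. The sketch below computes that quantity by pairwise comparison. The function name and the example scores are hypothetical and are not taken from the study, and this is not claimed to be the procedure the authors used.

```python
from typing import Sequence


def roc_auc(treated_scores: Sequence[float],
            untreated_scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Counts the fraction of (treated, untreated) pairs in which the treated
    sample has the higher signature score; ties count as half a win.
    """
    if not treated_scores or not untreated_scores:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for t in treated_scores:
        for u in untreated_scores:
            if t > u:
                wins += 1.0
            elif t == u:
                wins += 0.5
    return wins / (len(treated_scores) * len(untreated_scores))


# Hypothetical signature scores, for illustration only.
example_treated = [0.9, 0.8, 0.4]
example_untreated = [0.7, 0.3, 0.2]
example_auc = roc_auc(example_treated, example_untreated)
```

A value of 0.5 corresponds to a signature that cannot separate the two groups, and 1.0 to perfect separation.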
Identifying the best drug for each cancer patient requires an efficient individualized strategy. We present MATCH (Merging genomic and pharmacologic Analyses for Therapy CHoice), an approach using public genomic resources and drug testing of fresh tumor samples to link drugs to patients. Valproic acid (VPA) is highlighted as a proof-of-principle. To predict specific tumor types with high probability of drug sensitivity, we create drug response signatures using publicly available gene expression data and assess sensitivity in a data set of >40 cancer types. Next, we evaluate drug sensitivity in matched tumor and normal tissue and exclude cancer types that are no more sensitive than normal tissue. From these analyses, breast tumors are predicted to be sensitive to VPA. A meta-analysis across breast cancer data sets shows that aggressive subtypes are most likely to be sensitive to VPA, but all subtypes have sensitive tumors. MATCH predictions correlate significantly with growth inhibition in cancer cell lines and three-dimensional cultures of fresh tumor samples. MATCH accurately predicts reduction in tumor growth rate following VPA treatment in patient tumor xenografts. MATCH uses genomic analysis with in vitro testing of patient tumors to select optimal drug regimens before clinical trial initiation.
PMCID: PMC3159972  PMID: 21772261
biomarkers; cancer; pharmacogenomics
4.  An integration of complementary strategies for gene-expression analysis to reveal novel therapeutic opportunities for breast cancer 
Perhaps the major challenge in developing more effective therapeutic strategies for the treatment of breast cancer patients is confronting the heterogeneity of the disease, recognizing that breast cancer is not one disease but multiple disorders with distinct underlying mechanisms. Gene-expression profiling studies have been used to dissect this complexity, and our previous studies identified a series of intrinsic subtypes of breast cancer that define distinct populations of patients with respect to survival. Additional work has also used signatures of oncogenic pathway deregulation to dissect breast cancer heterogeneity as well as to suggest therapeutic opportunities linked to pathway activation.
We used genomic analyses to identify relations between breast cancer subtypes, pathway deregulation, and drug sensitivity. For these studies, we used three independent breast cancer gene-expression data sets to measure an individual tumor phenotype. Correlations between pathway status and subtype were examined and linked to predictions for response to conventional chemotherapies.
We reveal patterns of pathway activation characteristic of each molecular breast cancer subtype, including within the more aggressive subtypes in which novel therapeutic opportunities are critically needed. Whereas some oncogenic pathways have high correlations to breast cancer subtype (RAS, CTNNB1, p53, HER1), others have high variability of activity within a specific subtype (MYC, E2F3, SRC), reflecting biology independent of common clinical factors. Additionally, we combined these analyses with predictions of sensitivity to commonly used cytotoxic chemotherapies to provide additional opportunities for therapeutics specific to the intrinsic subtype that might be better aligned with the characteristics of the individual patient.
Genomic analyses can be used to dissect the heterogeneity of breast cancer. We use an integrated analysis of breast cancer that combines independent methods of genomic analyses to highlight the complexity of signaling pathways underlying different breast cancer phenotypes and to identify optimal therapeutic opportunities.
PMCID: PMC2750116  PMID: 19638211
5.  Smoking-induced gene expression changes in the bronchial airway are reflected in nasal and buccal epithelium 
BMC Genomics  2008;9:259.
Cigarette smoking is a leading cause of preventable death and a significant cause of lung cancer and chronic obstructive pulmonary disease. Prior studies have demonstrated that smoking creates a field of molecular injury throughout the airway epithelium exposed to cigarette smoke. We have previously characterized gene expression in the bronchial epithelium of never smokers and identified the gene expression changes that occur in the mainstem bronchus in response to smoking. In this study, we explored relationships in whole-genome gene expression between extrathoracic (buccal and nasal) and intrathoracic (bronchial) epithelium in healthy current and never smokers.
Using genes that have been previously defined as being expressed in the bronchial airway of never smokers (the "normal airway transcriptome"), we found that bronchial and nasal epithelium from non-smokers were most similar in gene expression when compared to other epithelial and nonepithelial tissues, with several antioxidant, detoxification, and structural genes being highly expressed in both the bronchus and nose. Principal component analysis of previously defined smoking-induced genes from the bronchus suggested that smoking had a similar effect on gene expression in nasal epithelium. Gene set enrichment analysis demonstrated that this set of genes was also highly enriched among the genes most altered by smoking in both nasal and buccal epithelial samples. The expression of several detoxification genes was commonly altered by smoking in all three respiratory epithelial tissues, suggesting a common airway-wide response to tobacco exposure.
Our findings support a relationship between gene expression in extra- and intrathoracic airway epithelial cells and extend the concept of a smoking-induced field of injury to epithelial cells that line the mouth and nose. This relationship could potentially be utilized to develop a non-invasive biomarker for tobacco exposure as well as a non-invasive screening or diagnostic tool providing information about individual susceptibility to smoking-induced lung diseases.
PMCID: PMC2435556  PMID: 18513428
6.  High-precision high-coverage functional inference from integrated data sources 
BMC Bioinformatics  2008;9:119.
Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques, and (2) development of a decision rule for the constructed FLN to optimize functional annotation.
We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates functional coupling of linked proteins more accurately than individual data sources do alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverages. In particular, at low coverage, all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30% down to 3%. In addition, a scoring scheme to estimate the precisions of individual predictions is also provided. Finally, tests of the robustness of the framework indicate that our framework can be successfully applied to less studied organisms.
We provide a general two-step function-annotation framework, and show that high coverage, high precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration followed by applying a maximum weight decision rule.
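One plausible reading of the maximum weight decision rule is sketched below: a protein inherits the annotation of the annotated neighbor to which it is joined by the heaviest FLN edge. The data layout, the function name, and the protein and function labels are hypothetical, and the abstract does not specify details such as tie-breaking, so this is an assumption-laden illustration and not the authors' implementation.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[str, str]  # undirected edge between two protein identifiers


def annotate_by_max_weight(
    protein: str,
    edges: Dict[Edge, float],
    known_functions: Dict[str, Set[str]],
) -> Optional[Tuple[Set[str], float]]:
    """Transfer functions from the annotated neighbor with the heaviest edge.

    Returns (functions, edge_weight), or None when the protein has no
    annotated neighbor. Ties keep the first heaviest edge encountered
    (an assumption; the abstract does not say how ties are handled).
    """
    best_weight: Optional[float] = None
    best_functions: Optional[Set[str]] = None
    for (a, b), weight in edges.items():
        if protein == a:
            neighbor = b
        elif protein == b:
            neighbor = a
        else:
            continue  # edge does not touch the query protein
        if neighbor not in known_functions:
            continue  # unannotated neighbors cannot donate a function
        if best_weight is None or weight > best_weight:
            best_weight = weight
            best_functions = known_functions[neighbor]
    if best_weight is None or best_functions is None:
        return None
    return set(best_functions), best_weight


# Hypothetical toy network, for illustration only.
toy_edges = {("P1", "P2"): 0.9, ("P1", "P3"): 0.4, ("P4", "P1"): 0.95}
toy_known = {"P2": {"transport"}, "P3": {"kinase"}}
toy_result = annotate_by_max_weight("P1", toy_edges, toy_known)
```

In the toy network the heaviest edge of P1 leads to P4, but P4 is unannotated, so the rule falls back to the heaviest edge that reaches an annotated neighbor.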
PMCID: PMC2292694  PMID: 18298847
7.  Towards the identification of essential genes using targeted genome sequencing and comparative analysis 
BMC Genomics  2006;7:265.
The identification of genes essential for survival is of theoretical importance in the understanding of the minimal requirements for cellular life, and of practical importance in the identification of potential drug targets in novel pathogens. With the great time and expense required for experimental studies aimed at constructing a catalog of essential genes in a given organism, a computational approach which could identify essential genes with high accuracy would be of great value.
We gathered numerous features which could be generated automatically from genome sequence data and assessed their relationship to essentiality, and subsequently utilized machine learning to construct an integrated classifier of essential genes in both S. cerevisiae and E. coli. When looking at single features, phyletic retention, a measure of the number of organisms an ortholog is present in, was the most predictive of essentiality. Furthermore, during construction of our phyletic retention feature we for the first time explored the evolutionary relationship among the set of organisms in which the presence of a gene is most predictive of essentiality. We found that in both E. coli and S. cerevisiae the optimal sets always contain host-associated organisms with small genomes which are closely related to the reference. Using five optimally selected organisms, we were able to improve predictive accuracy as compared to using all available sequenced organisms. We hypothesize the predictive power of these genomes is a consequence of the process of reductive evolution, by which many parasites and symbionts evolved their gene content. In addition, essentiality is measured in rich media, a condition which resembles the environments of these organisms in their hosts where many nutrients are provided. Finally, we demonstrate that integration of our most highly predictive features using a probabilistic classifier resulted in accuracies surpassing any individual feature.
Using features obtainable directly from sequence data, we were able to construct a classifier which can predict essential genes with high accuracy. Furthermore, our analysis of the set of genomes in which the presence of a gene is most predictive of essentiality may suggest ways in which targeted sequencing can be used in the identification of essential genes. In summary, the methods presented here can aid in the reduction of time and money invested in essential gene identification by targeting those genes for experimentation which are predicted as being essential with a high probability.
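Phyletic retention, described above as the number of organisms in which an ortholog of a gene is present, can be sketched as a simple count over a chosen set of genomes. The input layout, the function name, and the gene and organism labels below are hypothetical, and ortholog detection itself is assumed to have been done upstream.

```python
from typing import Dict, Iterable, Set


def phyletic_retention(
    ortholog_presence: Dict[str, Set[str]],
    genomes: Iterable[str],
) -> Dict[str, int]:
    """Count, for each gene, how many of the chosen genomes carry an ortholog.

    ortholog_presence maps a reference gene to the set of organisms in which
    an ortholog was detected; genomes is the comparison set, which may be all
    sequenced organisms or a small selected subset.
    """
    genome_set = set(genomes)
    return {
        gene: len(organisms & genome_set)
        for gene, organisms in ortholog_presence.items()
    }


# Hypothetical presence data, for illustration only.
toy_presence = {
    "geneA": {"org1", "org2", "org3"},
    "geneB": {"org2"},
    "geneC": set(),
}
toy_retention = phyletic_retention(toy_presence, ["org1", "org2", "org4"])
```

Restricting the genome set, as in the five-organism result reported above, only changes the second argument; the count itself is computed the same way.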
PMCID: PMC1624830  PMID: 17052348
8.  Comparative assessment of performance and genome dependence among phylogenetic profiling methods 
BMC Bioinformatics  2006;7:420.
The rapidly increasing speed with which genome sequence data can be generated will be accompanied by an exponential increase in the number of sequenced eukaryotes. With the increasing number of sequenced eukaryotic genomes comes a need for bioinformatic techniques to aid in functional annotation. Ideally, genome context based techniques such as proximity, fusion, and phylogenetic profiling, which have been so successful in prokaryotes, could be utilized in eukaryotes. Here we explore the application of phylogenetic profiling, a method that exploits the evolutionary co-occurrence of genes in the assignment of functional linkages, to eukaryotic genomes.
In order to evaluate the performance of phylogenetic profiling in eukaryotes, we assessed the relative performance of commonly used profile construction techniques and genome compositions in predicting functional linkages in both prokaryotic and eukaryotic organisms. When predicting linkages in E. coli with a prokaryotic profile, the use of continuous values constructed from transformed BLAST bit-scores performed better than profiles composed of discretized E-values; the use of discretized E-values resulted in more accurate linkages when using S. cerevisiae as the query organism. Extending this analysis by incorporating several eukaryotic genomes in profiles containing a majority of prokaryotes resulted in similar overall accuracy, but with a surprising reduction in pathway diversity among the most significant linkages. Furthermore, the application of phylogenetic profiling using profiles composed of only eukaryotes resulted in the loss of the strong correlation between common KEGG pathway membership and profile similarity score. Profile construction methods, orthology definitions, ontology and domain complexity were explored as possible sources of the poor performance of eukaryotic profiles, but with no improvement in results.
Given the current set of completely sequenced eukaryotic organisms, phylogenetic profiling using profiles generated from any of the commonly used techniques was found to yield extremely poor results. These findings imply genome-specific requirements for constructing functionally relevant phylogenetic profiles, and suggest that differences in the evolutionary history between different kingdoms might generally limit the usefulness of phylogenetic profiling in eukaryotes.
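For readers unfamiliar with the technique, a minimal sketch of phylogenetic profiling with discretized E-values follows. Each gene gets a presence/absence vector across reference genomes, and genes with similar vectors are proposed as functionally linked. The E-value cutoff of 1e-5 and the use of Jaccard similarity are assumptions made for illustration; the abstract does not state which cutoff or similarity score the study used.

```python
from typing import Sequence, Tuple


def binary_profile(best_hit_evalues: Sequence[float],
                   cutoff: float = 1e-5) -> Tuple[int, ...]:
    """Discretize best-hit E-values (one per reference genome) into 0/1.

    A genome is scored 1 when the best hit is at least as significant as the
    cutoff. The default cutoff is an assumed value, not one from the study.
    """
    return tuple(1 if e <= cutoff else 0 for e in best_hit_evalues)


def jaccard_similarity(p: Sequence[int], q: Sequence[int]) -> float:
    """Shared presences divided by genomes where either gene is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


# Hypothetical E-values across four reference genomes, for illustration only.
profile_x = binary_profile([1e-30, 0.5, 1e-8, 2.0])
profile_y = binary_profile([1e-12, 1e-6, 1e-9, 0.3])
similarity_xy = jaccard_similarity(profile_x, profile_y)
```

The continuous alternative mentioned above would replace `binary_profile` with a transformation of BLAST bit-scores and would need a similarity measure suited to real-valued vectors.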
PMCID: PMC1592128  PMID: 17005048
9.  Genetic and Functional Diversification of Small RNA Pathways in Plants 
PLoS Biology  2004;2(5):e104.
Multicellular eukaryotes produce small RNA molecules (approximately 21–24 nucleotides) of two general types, microRNA (miRNA) and short interfering RNA (siRNA). They collectively function as sequence-specific guides to silence or regulate genes, transposons, and viruses and to modify chromatin and genome structure. Formation or activity of small RNAs requires factors belonging to gene families that encode DICER (or DICER-LIKE [DCL]) and ARGONAUTE proteins and, in the case of some siRNAs, RNA-dependent RNA polymerase (RDR) proteins. Unlike many animals, plants encode multiple DCL and RDR proteins. Using a series of insertion mutants of Arabidopsis thaliana, unique functions for three DCL proteins in miRNA (DCL1), endogenous siRNA (DCL3), and viral siRNA (DCL2) biogenesis were identified. One RDR protein (RDR2) was required for all endogenous siRNAs analyzed. The loss of endogenous siRNA in dcl3 and rdr2 mutants was associated with loss of heterochromatic marks and increased transcript accumulation at some loci. Defects in siRNA-generation activity in response to turnip crinkle virus in dcl2 mutant plants correlated with increased virus susceptibility. We conclude that proliferation and diversification of DCL and RDR genes during evolution of plants contributed to specialization of small RNA-directed pathways for development, chromatin structure, and defense.
In plants, RNA-mediated silencing pathways have diversified in unique ways. This study elucidates the specific functions of some of the key regulators in development, chromatin structure, and pathogen defense.
PMCID: PMC350667  PMID: 15024409

