Next-generation sequencing has enabled examination of variation at the DNA sequence level, and this can be further enhanced by evaluating the variants at the protein level. Visualizing these data is a powerful approach that often reveals patterns not immediately apparent in a text version of the same data. Many investigators are interested in knowing where their amino acid changes reside within a protein. Whether or not variation clusters within a protein can reveal interesting aspects of the biological changes happening in disease.
We describe a freely available tool, Plot Protein, executable from the command line or utilized as a graphical interface through a web browser, to enable visualization of amino acid changes at the protein level. This allows researchers to plot variation from their sequencing studies in a quick and uniform way. The features available include plotting amino acid changes, domains, post-translational modifications, reference sequence, conservation, conservation score, and also zoom capabilities. Herein we provide a case example using this tool to examine the RET protein and we demonstrate how clustering of mutations within the protein in Multiple Endocrine Neoplasia 2A (MEN2A) reveals important information about disease mechanism.
Plot Protein is a useful tool for investigating amino acid changes and their localization within proteins. Command line and web server versions of this software are described that enable users to derive visual knowledge about their mutations.
Protein; Mutation; Disease; Cluster; Visualization; Plot
Quantification and normalization of RT-qPCR data critically depend on the expression of so-called reference genes. Our goal was twofold: to develop a strategy for the selection of reference genes that utilizes microarray data analysis and combines known approaches for gene stability evaluation, and to use this strategy to select a set of appropriate reference genes for research and clinical analysis of breast samples with different receptor and cancer status.
A preliminary search for reference genes was based on high-throughput analysis of microarray datasets. The final selection and validation of the candidate genes were based on RT-qPCR data analysis using several known methods for expression stability evaluation: the comparative ∆Ct method, geNorm, NormFinder and the Haller equivalence test.
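The comparative ∆Ct method named above can be sketched as follows. This is a minimal illustration, assuming the commonly described form of the method: for every pair of candidate genes, the standard deviation of their Ct difference across samples is computed, and each gene is scored by the mean of those standard deviations (lower means more stable). The gene labels and Ct values are hypothetical, and the sketch is not the authors' pipeline.

```python
from itertools import combinations
from statistics import mean, stdev


def delta_ct_stability(ct):
    """Score genes by the mean SD of pairwise Ct differences (lower = more stable).

    ct maps a gene label to its Ct values, one per sample, in the same sample order.
    """
    pair_sd = {}
    for gene_a, gene_b in combinations(ct, 2):
        diffs = [a - b for a, b in zip(ct[gene_a], ct[gene_b])]
        pair_sd[(gene_a, gene_b)] = stdev(diffs)

    scores = {}
    for gene in ct:
        # Average the SDs of every pairing that involves this gene.
        scores[gene] = mean(sd for pair, sd in pair_sd.items() if gene in pair)
    return scores


# Hypothetical Ct values for three candidate genes in four samples.
ct_values = {
    "GENE_A": [20.0, 21.0, 22.0, 23.0],
    "GENE_B": [18.0, 19.0, 20.0, 21.0],
    "GENE_C": [25.0, 24.0, 27.0, 23.0],
}
scores = delta_ct_stability(ct_values)
ranking = sorted(scores, key=scores.get)  # most stable first
print(ranking)
```

In this toy input GENE_A and GENE_B track each other exactly, so GENE_C ends up ranked as the least stable candidate.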
A set of five reference genes was identified: ACTB, RPS23, HUWE1, EEF1A1 and SF3A1. The initial selection was based on the analysis of publicly available, well-annotated microarray datasets containing different breast cancers, normal breast epithelium from breast cancer patients and epithelium from cancer-free patients. The final selection and validation were performed using RT-qPCR data from 39 breast cancer biopsy samples. Three genes from the final set were identified by means of microarray analysis and were novel in the context of breast cancer assays. We showed that the selected set of reference genes is more stable not only than individual genes, but also than the system of reference genes used in the commercial OncotypeDX test.
Reference genes for RT-qPCR can be selected efficiently by combining a preliminary search based on the high-throughput analysis of microarray datasets with a final selection and validation based on the analysis of RT-qPCR data, examining several expression stability measures simultaneously. The identified set of reference genes proved to be less variable, and thus potentially more efficient for research and clinical analysis of breast samples, than individual genes and the set of reference genes used in the OncotypeDX assay.
Reference genes; Microarrays; Reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR); Gene expression; Breast cancer
DNA copy number variations (CNV) constitute an important source of genetic variability. The standard method used for CNV detection is array comparative genomic hybridization (aCGH).
We propose a novel multiple-sample aCGH analysis methodology aimed at detecting rare CNVs. In contrast to the majority of previous approaches, which deal with cancer datasets, we focus on constitutional genomic abnormalities identified in a diverse spectrum of human diseases. Our method is tested on exon-targeted aCGH array data from 366 patients affected with developmental delay/intellectual disability, epilepsy, or autism. The proposed algorithms can be applied as a post-processing filter to any given segmentation method.
Thanks to the additional information obtained from multiple samples, we could efficiently detect significant segments corresponding to rare CNVs responsible for pathogenic changes. The robust statistical framework applied in our method makes it possible to eliminate the influence of a widespread technical artifact termed ‘waves’.
Biological networks are important for elucidating disease etiology due to their ability to model complex high dimensional data and biological systems. Proteomics provides a critical data source for such models, but currently lacks robust de novo methods for network construction, which could bring important insights in systems biology.
We have evaluated the construction of network models using methods derived from weighted gene co-expression network analysis (WGCNA). We show that approximately scale-free peptide networks, composed of statistically significant modules, are feasible and biologically meaningful using two mouse lung experiments and one human plasma experiment. Within each network, peptides derived from the same protein are shown to have a statistically higher topological overlap and concordance in abundance, which is potentially important for inferring protein abundance. The module representatives, called eigenpeptides, correlate significantly with biological phenotypes. Furthermore, within modules, we find significant enrichment for biological function and known interactions (gene ontology and protein-protein interactions).
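The two WGCNA building blocks mentioned above, a soft-thresholded correlation adjacency and the topological overlap between nodes, can be sketched in a few lines. This is an illustrative sketch assuming the standard unsigned WGCNA definitions; the soft-threshold power `beta=6` and the peptide abundance profiles are assumptions chosen for the example, not values from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def adjacency(profiles, beta=6):
    """Unsigned soft-threshold adjacency: a_ij = |cor(i, j)| ** beta, zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(profiles[i], profiles[j])) ** beta
    return adj


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # diagonal is zero
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical abundance profiles for four peptides across five samples.
peptides = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.0, 3.2, 3.9, 5.1],  # closely tracks the first peptide
    [3.0, 1.0, 4.0, 1.0, 5.0],
    [2.0, 7.0, 1.0, 8.0, 2.0],
]
tom = topological_overlap(adjacency(peptides))
print(round(tom[0][1], 3), round(tom[0][2], 3))
```

Peptides from the same protein would be expected to behave like the first two profiles here, giving a topological overlap close to 1, while unrelated peptides score near 0.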
Biological networks are important tools in the analysis of complex systems. In this paper we evaluate the application of weighted co-expression network analysis to quantitative proteomics data. Protein co-expression networks allow novel approaches for biological interpretation, quality control, inference of protein abundance and biomarker signature discovery, and provide a framework for potentially resolving degenerate peptide-protein mappings.
Biomarkers; Biological networks; Networks; Systems biology; Virology; Sarcopenia; LC-MS; Proteomics
Whole genome microarray gene expression profiling is the ‘gold standard’ for the discovery of prognostic and predictive genetic markers for human cancers. However, suitable research material is lacking as most diagnostic samples are preserved as formalin-fixed, paraffin-embedded tissue (FFPET). We tested a new workflow and data analysis method optimized for use with FFPET samples.
Sixteen breast tumor samples were split into matched pairs and preserved as FFPET or fresh-frozen (FF). Total RNA was extracted and tested for yield and purity. RNA from FFPET samples was amplified using three different commercially available kits in parallel, and hybridized to Affymetrix GeneChip® Human Genome U133 Plus 2.0 Arrays. The array probe set was optimized in silico to exclude misdesigned and misannotated probes.
FFPET samples processed using the WT-Ovation™ FFPE System V2 (NuGEN) provided 80% specificity and 97% sensitivity compared with FF samples (assuming values of 100%). In addition, in silico probe set redesign improved sequence detection sensitivity and, thus, may rescue potentially significant small-magnitude gene expression changes that could otherwise be diluted by the overall probe set background.
In conclusion, our FFPET-optimized workflow enables the detection of more genes than previous, nonoptimized approaches, opening new possibilities for the discovery, validation, and clinical application of mRNA biomarkers in human diseases.
Biomarker; Breast cancer; Gene; HER2; Microarray
Like all other neurodegenerative diseases, Alzheimer’s disease (AD) remains a very challenging and difficult problem for diagnosis and therapy. For many years, only historical, behavioral and psychiatric measures have been available to AD cases. Recently, a definitive diagnostic framework, using biomarkers and imaging, has been proposed. In this paper, we propose a promising diagnostic methodology for the framework.
In a previous paper, we developed an efficient SVM (Support Vector Machine) based method, which we have now applied to discover important biomarkers and target networks which provide strategies for AD therapy.
The methodology selects a number of blood-based biomarkers (fewer than 10% of initial numbers on three AD datasets from NCBI), and the results are statistically verified by cross-validation. The resulting SVM is a classifier of AD vs. normal subjects. We construct target networks of AD based on MI (mutual information). In addition, a hierarchical clustering is applied on the initial data and clustered genes are visualized in a heatmap. The proposed method also performs gender analysis by classifying subjects based on gender.
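The mutual information (MI) used to build the target networks can be illustrated with a small estimator for discrete variables. This is a sketch under the assumption that expression values have already been discretized (for example into low/high bins); the binning and the toy vectors are hypothetical, and the source does not specify which MI estimator was used.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in MI estimate (in nats) between two discrete sequences of equal length."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_xy = c / n
        # Ratio of the joint probability to the product of the marginals.
        mi += p_xy * log(p_xy / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Hypothetical discretized expression (0 = low, 1 = high) in four subjects.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]  # identical pattern to gene_a
gene_c = [0, 1, 0, 1]  # independent of gene_a in this toy sample

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
print(mi_ab, mi_ac)
```

An edge would then be drawn between two biomarkers when their MI exceeds a chosen cutoff; how that cutoff is set is left open here.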
Unlike other traditional statistical analyses, our method uses a machine learning-based algorithm. Our method selects a small set of important biomarkers for AD, differentiates noisy (irrelevant) from relevant biomarkers and also provides the target networks of the selected biomarkers, which will be useful for diagnosis and therapeutic design. Finally, based on the gender analysis, we observe that gender could play a role in AD diagnosis.
Feature selection; Biomarkers; Target networks; Alzheimer’s disease; Support vector machine
Exploring stromal changes associated with tumor growth and development is a growing area of oncologic research. In order to study molecular changes in the stroma it is recommended to separate tumor tissue from stromal tissue. This is relevant to xenograft models where tumors can be small and difficult to separate from host tissue. We introduce a novel definition of cross-alignment/cross-hybridization to compare qualitatively the ability of high-throughput mRNA sequencing, RNA-Seq, and microarrays to detect tumor and stromal expression from mixed ‘pseudo-xenograft’ samples vis-à-vis genes and pathways in cross-alignment (RNA-Seq) and cross-hybridization (microarrays). Samples consisted of normal mouse lung and human breast cancer cells; these were combined in fixed proportions to create a titration series of 25% steps. Our definition identifies genes in a given species (human or mouse) with undetectable expression in same-species RNA but detectable expression in cross-species RNA. We demonstrate the comparative value of this method and discuss its potential contribution in cancer research.
Our method can identify genes from either species that demonstrate cross-hybridization and/or cross-alignment properties. Surprisingly, the set of genes identified using a simpler and more common approach (using a ‘pure’ cross-species sample and calling all detected genes as ‘crossers’) is not a superset of the genes identified using our technique. The observed levels of cross-hybridization are relatively low: 5.3% of human genes detected in mouse, and 3.5% of mouse genes detected in human. Observed levels of cross-alignment are practically comparable to the levels of cross-hybridization: 6.5% of human genes detected in mouse, and 2.3% of mouse genes detected in human. We also observed a relatively high percentage of orthologs: 40.3% of cross-hybridizing genes, and 32.2% of cross-aligning genes.
After normalizing the gene catalog to Consensus Coding Sequence (CCDS) IDs (Genome Res 19:1316–1323, 2009), the observed levels of cross-hybridization are low: 2.7% of human CCDS IDs are detected in mouse, and 2.4% of mouse CCDS IDs are detected in human. Levels of cross-alignment in the RNA-Seq data are comparable for mouse (2.2% of mouse CCDS IDs detected in human) but higher for human (9.9% of human CCDS IDs detected in mouse). However, the lists of cross-aligning/cross-hybridizing genes contain many that are of specific interest to oncologic researchers.
The conservative definition that we propose identifies genes in mouse whose expression can be attributed to human RNA, and vice versa, as well as revealing genes with cross-alignment/cross-hybridization properties which could not be identified using a simpler but more established approach. The overall percentage of genes affected by cross-hybridization/cross-alignment is small, but includes genes that are of interest to oncologic researchers. Which platform to use with mixed xenograft samples, microarrays or RNA-Seq, appears to be primarily a question of cost and whether the detection and measurement of expression of specific genes of interest are likely to be affected by cross-hybridization or cross-alignment.
Microarray; RNA-Seq; Cross-hybridization; Cross-alignment; Tumor microenvironment; Xenograft; Pathway analysis
The editors of Journal of Clinical Bioinformatics would like to thank all our reviewers who have contributed to the journal in Volume 2 (2012).
MicroRNAs (miRNAs) are remarkable molecules that appear to have a fundamental role in the biology of the cell. They constitute a class of non-protein-encoding RNA molecules which have now emerged as key players in regulating the activity of mRNA. miRNAs are small RNA molecules around 22 nucleotides in length, which affect the activity of specific mRNAs by directly degrading them and/or preventing their translation into protein. miRNAs are regarded as candidate biomarkers for the early detection and management of cancer. There is also considerable excitement about the use of miRNAs as a novel class of therapeutic targets and as a new class of therapeutic agents for the treatment of cancers. From a clinical perspective, miRNAs can induce a number of effects and may have diverse applications in biomedical research. This review highlights the general mode of action of miRNAs, their biogenesis, the effect of diet on miRNA expression and the impact of miRNAs on cancer epigenetics and drug resistance in various cancers. We also highlight bioinformatics software that can be used to determine potential targets of miRNAs.
miRNA; Biogenesis; Diet; Cancer epigenetics; Bioinformatics software
Single nucleotide polymorphisms (SNPs) in genes derived from distinct pathways are associated with breast cancer risk. Identifying possible SNP-SNP interactions in genome-wide case–control studies is an important task when investigating genetic factors that influence common complex traits; the effects of SNP-SNP interaction need to be characterized. Furthermore, observing the complex interplay (interactions) between SNPs for high-dimensional combinations is still computationally and methodologically challenging. An improved branch and bound algorithm with feature selection (IBBFS) is introduced to identify SNP combinations with a maximal difference of allele frequencies between the case and control groups in breast cancer, i.e., the high/low risk combinations of SNPs.
Data from 220 breast cancer cases and 334 controls were used to test IBBFS and identify significant SNP combinations. We used the odds ratio (OR) as a quantitative measure to estimate the associated cancer risk of multiple SNP combinations and so identify the complex biological relationships underlying the progression of breast cancer, i.e., the most likely SNP combinations. Experimental results show that, in the low risk groups, the estimated odds ratio of the best SNP combination with genotypes is significantly smaller than 1 (between 0.165 and 0.657) for specific SNP combinations of the tested SNPs. In the high risk groups, the estimated odds ratios of the predicted SNP combinations with genotypes are significantly greater than 1 (between 2.384 and 6.167) for specific SNP combinations of the tested SNPs.
This study proposes an effective high-speed method to analyze SNP-SNP interactions in breast cancer association studies. A number of important SNPs are found to be significant for the high/low risk groups. They can thus be considered potential predictors for breast cancer association.
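The odds ratio used above as the risk measure can be computed from a 2×2 table of carriers and non-carriers of a given SNP combination. The sketch below uses the standard cross-product formula with a Woolf (log-based) 95% confidence interval; the interval method is an assumption, since the source does not say how significance was assessed, and the counts are hypothetical numbers chosen only to add up to 220 cases and 334 controls.

```python
from math import exp, log, sqrt


def odds_ratio(case_with, case_without, ctrl_with, ctrl_without, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table using the Woolf log method."""
    estimate = (case_with * ctrl_without) / (case_without * ctrl_with)
    # Standard error of log(OR) from the four cell counts.
    se = sqrt(1 / case_with + 1 / case_without + 1 / ctrl_with + 1 / ctrl_without)
    lower = exp(log(estimate) - z * se)
    upper = exp(log(estimate) + z * se)
    return estimate, lower, upper


# Hypothetical counts: carriers / non-carriers of one SNP combination.
or_value, ci_low, ci_high = odds_ratio(
    case_with=30, case_without=190,   # 220 cases in total
    ctrl_with=100, ctrl_without=234,  # 334 controls in total
)
print(round(or_value, 3), round(ci_low, 3), round(ci_high, 3))
```

A combination whose whole interval lies below 1, as in this toy table, would fall in the low risk group; one whose interval lies above 1 would fall in the high risk group.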
The recent introduction of high-throughput sequencing technologies into clinical genetics has made it practical to simultaneously sequence many genes. In contrast, previous technologies limited sequencing-based tests to only a handful of genes. While the ability to more accurately diagnose inherited diseases is a great benefit, it introduces specific challenges. Interpretation of missense mutations continues to be challenging and the number of variants of uncertain significance continues to grow.
We leveraged the data available at ARUP Laboratories, a major reference laboratory, for the CFTR gene to explore specific challenges related to variant interpretation, including a focus on understanding ethnic-specific variants and an evaluation of existing databases for clinical interpretation of variants. In this study we analyzed 555 patients representing eight different ethnic groups. We observed 184 different variants, most of which were ethnic group specific. Eighty-five percent of these variants were present in the Cystic Fibrosis Mutation Database, whereas the Human Mutation Database and dbSNP/1000 Genomes had far fewer of the observed variants. Finally, 21 of the variants were novel and we report these variants and their clinical classifications.
Our analyses of data from six years of CFTR testing at ARUP Laboratories indicate that a more comprehensive, clinical-grade database is needed for the accurate interpretation of observed variants. Furthermore, there is a particular need for more and better information regarding variants from individuals of non-Caucasian ethnicity.
Cystic fibrosis; CFTR; Novel variants; Next-generation sequencing; Interpretation of variants
Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying a support vector machine (SVM) to experimental data obtained by the MudPIT approach. In particular, we compared the performance of SVM using two independent collections of complex samples and different data types, such as mass spectra (m/z), peptides and proteins.
Globally, protein and peptide data provided better discriminant information than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features for the effective classification of samples. In addition, the protein and peptide features selected by SVM showed an 80% match with the differentially expressed proteins identified by the MAProMa software.
These findings confirm the availability of the most label-free quantitative methods based on processing of spectral count and SEQUEST-based SCORE values. On the other hand, they stress the usefulness of MudPIT data for a correct grouping of sample phenotypes, by applying both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and is a good starting point for translating proteomic methodology to clinical application.
Sample classification; MudPIT; SVM; Clinical proteomics; Label-free quantification
Hutchinson-Gilford progeria syndrome is a rare dominant human disease of genetic origin. The average life expectancy is about 20 years, patients’ quality of life is still very poor, and no efficient therapy has yet been developed. It is caused by mutation of the LMNA gene, which results in accumulation in the nuclear membrane of a particular splicing form of Lamin-A called progerin. The mechanism by which progerin perturbs cellular homeostasis and leads to the symptoms is still under debate.
MicroRNAs (miRNAs) negatively regulate gene expression by pairing with the 3’ untranslated region (3’ UTR) of messenger RNAs. Several miRNAs can recognize the same 3’ UTR, and each miRNA can recognize the 3’ UTRs of multiple different messenger RNAs. When different messenger RNAs are co-regulated via a similar panel of miRNAs, these messengers are called competing endogenous RNAs, or ceRNAs.
The 3’ untranslated region of the longest LMNA transcript was analysed to identify its ceRNAs. The aim of this study was to search for candidate genes and gene ontology functions that are possibly influenced by LMNA mutations and may play a role in progeria development.
Eleven miRNAs were identified as potential LMNA regulators. By computational analysis, the miRNAs pointed to 17 putative LMNA ceRNAs. Gene ontology analysis of the identified ceRNAs showed an enrichment in RNA interference and control of cell cycle functions.
This study identified novel genes and functions potentially involved in the LMNA regulatory network that could be involved in laminopathies such as Hutchinson-Gilford progeria syndrome.
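The ceRNA idea described above, transcripts that share a panel of regulating miRNAs, can be sketched as a set-overlap search. This is only an illustration: the miRNA labels, gene labels and the `min_shared` cutoff are hypothetical, and the study's actual target-prediction tools and thresholds are not given in the text.

```python
def rank_cernas(targets_by_mirna, query, min_shared=2):
    """List genes sharing at least min_shared miRNAs with the query transcript.

    targets_by_mirna maps a miRNA label to the set of genes whose 3' UTR it is
    predicted to bind. Results are sorted by number of shared miRNAs, then name.
    """
    query_mirnas = {m for m, genes in targets_by_mirna.items() if query in genes}
    shared = {}
    for mirna in query_mirnas:
        for gene in targets_by_mirna[mirna]:
            if gene != query:
                shared.setdefault(gene, set()).add(mirna)
    hits = [(g, sorted(ms)) for g, ms in shared.items() if len(ms) >= min_shared]
    return sorted(hits, key=lambda hit: (-len(hit[1]), hit[0]))


# Hypothetical predicted targets; only the query name LMNA comes from the text.
predicted_targets = {
    "miR-a": {"LMNA", "GENE_1", "GENE_2"},
    "miR-b": {"LMNA", "GENE_1"},
    "miR-c": {"LMNA", "GENE_1", "GENE_3"},
    "miR-d": {"GENE_2", "GENE_3"},
}
candidates = rank_cernas(predicted_targets, "LMNA")
print(candidates)
```

Here only GENE_1 shares enough miRNAs with the query to be reported as a putative ceRNA; GENE_2 and GENE_3 each share a single miRNA and are dropped.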
CeRNA; Hutchinson-Gilford; Progeria; LMNA; Lamin-A; 3’ UTR; MiRNA
Identification of prognostic biomarkers is a hallmark of cancer genomics. Since miRNAs regulate expression of multiple genes, they act as potent biomarkers in several cancers. Identification of miRNAs that are prognostically important has been done sporadically, but to date no resource is available that allows users to study the prognostics of miRNAs of interest in major cancer types, utilizing the wealth of available data.
In this paper, we present a web-based tool that allows users to study prognostic properties of miRNAs in several cancer types, using publicly available data. We have compiled data from Gene Expression Omnibus (GEO) and the recently developed Cancer Genome Atlas (TCGA) to create this tool. The tool is called “PROGmiR” and it is available at http://www.compbio.iupui.edu/progmir. Currently, our tool can be used to study overall survival implications for approximately 1050 human miRNAs in 16 major cancer types.
We believe this resource, as a hypothesis generation tool, will be helpful for researchers to link miRNA expression with cancer outcome and to design mechanistic studies. We studied the performance of our tool using identified miRNA biomarkers from published studies. The prognostic plots created using our tool for specific miRNAs in specific cancer types corroborated the findings of those studies.
miRNA; Prognostics; Cancer; Pan-cancer; Database; Signature; Biomarker
Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset.
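For orientation, the basic COPA-style transformation can be sketched as follows: each gene is median-centred and scaled by its median absolute deviation (MAD), and a high percentile of the transformed values serves as the over-expression outlier score. Reading a low percentile for under-expressed outliers is shown here as an illustrative assumption only; the specific refinements in mCOPA are not described in the text, and the percentile choices (90 and 10) and expression values are hypothetical.

```python
from statistics import median


def copa_transform(values):
    """Median-centre a gene's expression values and scale them by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the gene has no usable spread")
    return [(v - med) / mad for v in values]


def percentile(sorted_values, q):
    """Percentile q (0-100) of pre-sorted values, with linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def outlier_scores(values, upper_q=90, lower_q=10):
    """Return (over-expression score, under-expression score) for one gene."""
    transformed = sorted(copa_transform(values))
    return percentile(transformed, upper_q), percentile(transformed, lower_q)


# Hypothetical expression for two genes across ten tumours.
over_gene = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 6.0, 7.0]
under_gene = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, -4.0, -5.0]

over_up, over_down = outlier_scores(over_gene)
under_up, under_down = outlier_scores(under_gene)
print(over_up, over_down, under_up, under_down)
```

Because the median and MAD are robust to a minority of extreme samples, a gene that is aberrant in only a subset of tumours still receives a large score, which is the property that distinguishes outlier analysis from mean-based differential expression.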
We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours.
We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
Cancer; Outliers; Expression data; Expression profile; Cluster; Subtype; Heterogeneous; Bioinformatics; Percentile; Feature selection
Second-generation RNA sequencing technology (RNA-seq) offers the potential to interrogate genome-wide differential RNA splicing in cancer. However, since short RNA reads spanning spliced junctions cannot be mapped contiguously onto the chromosomes, there is a need for methods to profile splicing from RNA-seq data. Before the advent of RNA-seq technologies, microarrays containing probe sequences representing exon-exon junctions of known genes were hybridized with cellular RNAs to measure context-specific differential splicing. Here, we extend this approach to detect tumor-specific splicing in prostate cancer from an RNA-seq dataset.
A database, SPEventH, representing probe sequences of just under a million non-redundant splice events in human, was created with exon-exon junctions of optimized length for use as a virtual microarray. SPEventH was used to map tens of millions of reads from matched tumor-normal samples from ten individuals with prostate cancer. Differential counts of reads mapped to each event from tumor and matched normal samples were used to identify statistically significant tumor-specific splice events in prostate.
We find sixty-one (61) splice events that are differentially expressed with a p-value of less than 0.0001 and a fold change of greater than 1.5 in prostate tumor compared to the respective matched normal samples. Interestingly, the only evidence in the public database for one of the tumor-specific splice events, which joins one of the introns in the KLK3 gene to an intron in KLK2, is an EST (BF372485) that is also derived from prostate tumor tissue. Also, the 765 events with a p-value of less than 0.001 are shown to cluster all twenty samples in a context-specific fashion, with a few exceptions stemming from low coverage of samples.
We demonstrate that virtual microarray experiments using a non-redundant database of splice events in human are both an efficient and a sensitive way to profile genome-wide splicing in biological samples and to detect tumor-specific splicing signatures in datasets from RNA-seq technologies. The signature from the large number of splice events that could cluster tumor and matched-normal samples into two tight, separate clusters suggests that differential splicing is yet another RNA phenotype, alongside gene expression and SNPs, that can be exploited for tumor stratification.
The inflammatory bowel diseases, ulcerative colitis and Crohn’s disease, are considered to be of autoimmune origin, but the etiology of irritable bowel syndrome remains elusive. Furthermore, classifying patients into irritable bowel syndrome and inflammatory bowel diseases can be difficult without invasive testing and holds important treatment implications. Our aim was to assess the ability of gene expression profiling in blood to differentiate among these subject groups.
Transcript levels of a total of 45 genes in blood were determined by quantitative real-time polymerase chain reaction (RT-PCR). We applied three separate analytic approaches: one utilized a scoring system derived from combinations of ratios of expression levels of two genes, and the other two used different support vector machines.
All methods discriminated different subject cohorts, irritable bowel syndrome from control, inflammatory bowel disease from control, irritable bowel syndrome from inflammatory bowel disease, and ulcerative colitis from Crohn’s disease, with high degrees of sensitivity and specificity.
These results suggest these approaches may provide clinically useful prediction of the presence of these gastro-intestinal diseases and syndromes.
Next-generation sequencing provides clinical research scientists with a direct readout of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false positive variations selected, this number can be reduced further with the use of high-quality whole genome reference data to minimize false positive variants prior to candidate gene selection. In addition, the use of platform-related sequencing error models can help in the recovery of ambiguous genotypes from lower-coverage data.
We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes), Huvariome Core which comprises 31 healthy individuals from the Benelux region, and Diversity Panel consisting of 46 healthy individuals representing 10 different populations and 21 samples in three pedigrees. Users can query the database by gene or position via a web interface and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments. Huvariome has been used to support the selection of candidate cardiomyopathy-related genes which have a homozygous genotype in the reference cohorts. This database allows the users to see which selected variants are common variants (> 5% minor allele frequency) in the Huvariome Core samples, thus aiding in the selection of potentially pathogenic variants by filtering out common variants that are not listed in one of the other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provide the user with the possibility of identifying platform-dependent errors associated with specific regions of the human genome.
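The common-variant filter described above (> 5% minor allele frequency in a reference cohort) can be sketched as a small function. This is a hypothetical illustration, not the Huvariome interface: the variant keys, the genotype-count layout `(hom_ref, het, hom_alt)` and the choice to keep variants absent from the cohort are all assumptions, and the counts are invented to sum to a 31-person cohort.

```python
def minor_allele_frequency(hom_ref, het, hom_alt):
    """Minor allele frequency from diploid genotype counts in a cohort."""
    total_alleles = 2 * (hom_ref + het + hom_alt)
    if total_alleles == 0:
        raise ValueError("no called genotypes for this variant")
    alt_frequency = (het + 2 * hom_alt) / total_alleles
    return min(alt_frequency, 1.0 - alt_frequency)


def keep_rare(candidates, cohort_counts, threshold=0.05):
    """Drop candidates that are common (MAF > threshold) in the reference cohort.

    Variants not observed in the cohort are kept (an assumption of this sketch).
    """
    rare = []
    for variant in candidates:
        counts = cohort_counts.get(variant)
        if counts is None or minor_allele_frequency(*counts) <= threshold:
            rare.append(variant)
    return rare


# Hypothetical genotype counts (hom_ref, het, hom_alt) in 31 healthy controls.
reference_counts = {
    "chr1:1000:A>G": (20, 9, 2),  # MAF = 13/62, common
    "chr2:2000:C>T": (30, 1, 0),  # MAF = 1/62, rare
}
candidates = ["chr1:1000:A>G", "chr2:2000:C>T", "chr3:3000:G>A"]
rare_candidates = keep_rare(candidates, reference_counts)
print(rare_candidates)
```

With a cohort this small, a single heterozygous carrier already gives a frequency of about 1.6%, so the 5% threshold corresponds to roughly three or more alternate alleles among the 62 chromosomes.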
Huvariome is a simple to use resource for validation of resequencing results obtained by NGS experiments. The high sequence coverage and low error rates provide scientists with the ability to remove false positive results from pedigree studies. Results are returned via a web interface that displays location-based genetic variation frequency, impact on protein function, association with known genetic variations and a quality score of the variation base derived from Huvariome Core and the Diversity Panel data. These results may be used to identify and prioritize rare variants that, for example, might be disease relevant. In testing the accuracy of the Huvariome database, alleles of a selection of ambiguously called coding single nucleotide variants were successfully predicted in all cases. Data protection of individuals is ensured by restricted access to patient derived genomes from the host institution which is relevant for future molecular diagnostics.
Medical genetics; Medical genomics; Whole genome sequencing; Allele frequency; Cardiomyopathy
Drug discovery typically starts with the identification of a potential target that is then tested and validated either through high-throughput screening against a library of drug compounds or by rational drug design. When the putative target is a protein, the latter approach requires knowledge of its structure. Finding the structure of a protein is, however, a difficult task. Significant progress has come from high-resolution techniques such as X-ray crystallography and NMR; there are, however, many proteins whose structures have not yet been solved. Computational techniques for structure prediction are viable alternatives to experimental techniques for these cases. However, the proper validation of the structural models they generate remains an issue.
In this report, we focus on homology modeling techniques and introduce the H-factor, a new indicator for assessing the quality of protein structure models generated with these techniques. The H-factor is meant to mimic the R-factor used in X-ray crystallography. The method for computing the H-factor is fully described with a demonstration of its effectiveness on a test set of target proteins.
We have developed a web service for computing the H-factor for models of a protein structure. This service is freely accessible at http://koehllab.genomecenter.ucdavis.edu/toolkit/h-factor.
Autism is the fastest growing developmental disorder in the world today. The prevalence of autism in the US has risen from 1 in 2500 in 1970 to 1 in 88 children today. People with autism present with repetitive movements and with social and communication impairments. These impairments can range from mild to profound. The estimated total lifetime societal cost of caring for one individual with autism is US $3.2 million. With the rapid growth in this disorder and the great expense of caring for those with autism, it is imperative for both individuals and society that techniques be developed to model and understand autism. There is increasing evidence that individuals diagnosed with autism present with a highly diverse set of abnormalities affecting multiple systems of the body. To date, little to no work has been done using a whole body systems biology approach to model the characteristics of this disorder. Identifying and modeling these systems might lead to new and improved treatment protocols and to better diagnosis and treatment of the affected systems. These advances might improve quality of life in themselves and, given the potential interconnections between the brain and nervous system and the other systems being modeled, might also help the core symptoms of autism. This paper first reviews research which shows that autism impacts many systems in the body, including the metabolic, mitochondrial, immunological, gastrointestinal, and neurological systems. These systems interact in complex and highly interdependent ways. Many of these disturbances have effects in most of the systems of the body. In particular, clinical evidence exists for increased oxidative stress, inflammation, and immune and mitochondrial dysfunction, which can affect almost every cell in the body. Three promising research areas are discussed: hierarchical and subgroup analysis, and modeling over time.
This paper reviews some of the systems disturbed in autism and suggests several systems biology research areas. Autism poses a rich test bed for systems biology modeling techniques.
Autism; Mitochondrial dysfunction; Oxidative stress; Immune dysfunction; Gastrointestinal disease
Cancer therapy is a challenging research area because side effects often occur in chemotherapy and radiation therapy. We intend to study a multi-target, multi-component design that will provide synergistic results to improve the efficiency of cancer therapy.
We have developed a general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important cancer biomarkers based on SVM (Support Vector Machine) classification. In particular, we exemplify this method with three datasets: a prostate cancer dataset (three stages), a breast cancer dataset (four subtypes), and another prostate cancer dataset (normal vs. cancerous). Moreover, we have computed the target networks of these biomarkers as the signatures of the cancers, with additional information (mutual information between biomarkers of the network). We then propose a robust framework for a synergistic therapy design approach which includes various existing mechanisms.
These methodologies were applied to three GEO datasets: GSE18655 (three prostate stages), GSE19536 (four breast cancer subtypes) and GSE21036 (prostate cancer cells and normal cells). We selected 96 biomarkers for the first prostate cancer dataset (three prostate stages), 72 for breast cancer (luminal A vs. luminal B), 68 for breast cancer (basal-like vs. normal-like), and 22 for the other prostate cancer dataset (cancerous vs. normal). In addition, we obtained statistically significant results of mutual information, which demonstrate that the dependencies among these biomarkers can be positive or negative.
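As a rough illustration of the mutual information used to annotate the target networks, the following Python sketch computes a plug-in estimate between two biomarkers after equal-width binning of their expression values. This is an assumption-laden sketch, not the AMFES code: the `discretize` and `mutual_information` helpers, the bin count, and the toy vectors are hypothetical. Standard mutual information is non-negative, so the signed (positive or negative) dependencies reported above are not reproduced here.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Equal-width binning of expression values into labels 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two discrete label sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Toy expression vectors for two hypothetical biomarkers across six samples.
gene_a = [0.1, 0.2, 0.9, 1.0, 0.15, 0.95]
gene_b = [5.0, 5.2, 9.8, 10.0, 5.1, 9.9]
mi_ab = mutual_information(discretize(gene_a, 2), discretize(gene_b, 2))
print("MI(gene_a, gene_b) in bits:", mi_ab)
```

With small sample sizes the plug-in estimate is biased upward, so in practice a significance assessment (for example by shuffling sample labels) would be needed before interpreting an edge in the network.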
We proposed an efficient feature ranking and selection scheme, AMFES, to select an important subset from a large number of features for any cancer dataset. We then obtained the signatures of these cancers by building their target networks. Finally, we proposed a robust framework of synergistic therapy for cancer patients. Our framework is not only supported by real GEO datasets but also aims to serve as a multi-target/multi-component drug design tool, which improves on traditional single-target/single-component analysis methods. This framework builds a computational foundation which can provide a clear classification of cancers and lead to an efficient cancer therapy.
Feature selection; Biomarkers; Microarray; Therapy design; Target network
Ovarian cancer is the most deadly gynecological cancer because of late diagnosis, frequently with diffuse peritoneal metastases. Recent findings have shown that serous epithelial ovarian cancer has a narrow mutational spectrum with TP53 being the most frequently targeted when single genes are considered. It is, however, important to understand which pathways as a whole may be targeted for mutation.
Previously published mutational data from The Cancer Genome Atlas Network's findings on ovarian cancer were searched for statistically significant enrichment of genes in pathways. These pathways were then searched in all patients to identify the spectrum of mutations. Statistical significance was further shown through in-silico permutations of exome sequences using empirically observed mutation rates. We detected mutations in the cell adhesion pathway genes in more than 89% of serous epithelial ovarian cancer patients. This level of near universal mutational targeting of the cell adhesion pathway, including the extracellular matrix pathway, is previously unreported in epithelial ovarian cancer.
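The idea of testing pathway-level mutation prevalence against a permutation null can be sketched in Python as follows. This is a deliberately simplified stand-in, not the procedure used in the study: instead of permuting exome sequences with empirically observed mutation rates, it draws random gene sets of the same size as the pathway from a gene universe. The function names, the seed, and the toy data are hypothetical.

```python
import random


def fraction_hit(patient_mutations, gene_set):
    """Fraction of patients carrying at least one mutated gene from gene_set."""
    hits = sum(1 for genes in patient_mutations if genes & gene_set)
    return hits / len(patient_mutations)


def permutation_p_value(patient_mutations, pathway, universe, n_perm=1000, seed=0):
    """Empirical p-value for the observed pathway hit fraction.

    Null model (assumption): random gene sets of the same size as the pathway,
    sampled without replacement from the gene universe.
    """
    rng = random.Random(seed)
    observed = fraction_hit(patient_mutations, pathway)
    genes = sorted(universe)
    k = len(pathway)
    at_least = 0
    for _ in range(n_perm):
        null_set = set(rng.sample(genes, k))
        if fraction_hit(patient_mutations, null_set) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_perm + 1)


# Toy data: 200 genes, a 3-gene pathway, 10 patients each with one pathway
# mutation and one unrelated private mutation.
universe = {f"G{i}" for i in range(200)}
pathway = {"G0", "G1", "G2"}
patients = [{f"G{i % 3}", f"G{10 + i}"} for i in range(10)]
observed, p_value = permutation_p_value(patients, pathway, universe)
print("observed fraction:", observed, "empirical p:", p_value)
```

A null model that ignores gene length and background mutation rate will overstate significance for long or hypermutable genes, which is why the study's rate-aware permutation is the more appropriate design.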
Taken together with previous studies on the role of cell adhesion and extracellular matrix gene expression in ovarian cancer and metastasis, our results identify pathways for which the mutational prevalence has previously been overlooked using single gene approaches. Analysis of mutations at the pathway level will be critical in studying heterogeneous diseases such as ovarian cancer.
In early-stage cancer, primary treatment can be considered effective at eliminating the tumor for a non-negligible proportion of patients, whereas for the others it leads to a lower tumor burden and thereby potentially prolonged survival. In this mixed population of patients, it is of great interest to detect complex differences in survival distributions associated with molecular markers that potentially activate latent downstream pathways implicated in tumor progression.
We propose a novel model-based score test designed for identifying molecular markers with complex effects on survival in early-stage cancer. From a biological point of view, the proposed score test allows the detection of complex changes in the survival distributions linked to either the tumor burden or its dynamic growth.
Simulation results show that the proposed statistic is powerful at identifying departure from the null hypothesis of no survival difference. The practical use of the proposed statistic is exemplified by analyzing the prognostic impact of Kras mutation in early-stage lung adenocarcinomas. This analysis leads to the conclusion that Kras mutation has a significant negative prognostic impact on survival. Moreover, it emphasizes that the complex role of Kras mutation in survival would have been overlooked by considering only the results of the classical logrank test.
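For reference, the classical two-group logrank test that serves as the baseline comparison can be sketched in a few lines of Python. This sketch covers only the standard logrank statistic; the proposed model-based score test is not reproduced. The function name and the toy survival times are hypothetical, and the p-value uses the chi-square approximation with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi-square statistic, p-value).

    times_*  : follow-up times
    events_* : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        n = n_a + n_b
        d = sum(1 for u, e, _ in data if u == t and e)          # events at t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        raise ValueError("variance is zero; the test is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy example: group A fails early, group B fails late, no censoring.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print("chi-square:", chi2, "p-value:", p)
```

The logrank test is most sensitive to proportional-hazards differences, which is consistent with the point above that more complex survival patterns can escape it.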
With the growing number of biological markers to be tested in early-stage cancer, the proposed score test statistic is a powerful tool for detecting molecular markers associated with complex survival patterns.
Clinical genomic; Survival analysis; Early-stage cancer; Cure rate model; Long-term survivors; Score test
A simple and fast computational model to describe the dynamics of tumour growth and metastasis formation is presented. The model is based on the calculation of successive generations of tumour cells and enables one to describe biologically important entities such as tumour volume, the time point of first metastatic growth or the number of metastatic colonies at a given time. The model relies entirely on the chronology of these successive events of the metastatic cascade. The simulation calculations were performed for two embedded growth models to describe the Gompertzian-like growth behaviour of tumours. The initial training of the models was carried out using an analytical solution for the size distribution of metastases of a hepatocellular carcinoma. We then show the applicability of our models to clinical data from the Munich Cancer Registry. Growth and dissemination characteristics of metastatic cells originating from cells in the primary breast cancer can be modelled, showing the model's ability to perform systematic analyses relevant for clinical breast cancer research and treatment. In particular, our calculations show that metastasis formation has generally already been initiated before the primary tumour can be detected clinically.
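The Gompertzian growth behaviour mentioned above can be illustrated with a small Python sketch based on the standard Gompertz function, N(t) = N_max * exp(ln(N0 / N_max) * exp(-rate * t)). This is not the generation-based model of the paper. The parameter values, the assumed size at which dissemination starts, and the assumed clinical detection size of 10^9 cells are hypothetical and chosen only for illustration.

```python
import math


def gompertz_size(t, n0, n_max, rate):
    """Cell count at time t for Gompertz growth from n0 toward the plateau n_max."""
    return n_max * math.exp(math.log(n0 / n_max) * math.exp(-rate * t))


def time_to_reach(size, n0, n_max, rate):
    """Invert gompertz_size: time at which the tumour first reaches `size` cells."""
    if not (n0 <= size < n_max):
        raise ValueError("size must lie in [n0, n_max)")
    ratio = math.log(size / n_max) / math.log(n0 / n_max)
    return -math.log(ratio) / rate


# Hypothetical parameters: one founder cell, plateau of 1e12 cells, rate per day.
N0, N_MAX, RATE = 1.0, 1e12, 0.01
t_seeding = time_to_reach(1e6, N0, N_MAX, RATE)    # assumed start of dissemination
t_detection = time_to_reach(1e9, N0, N_MAX, RATE)  # assumed clinical detection size
print("days until assumed seeding size:", t_seeding)
print("days until assumed detection size:", t_detection)
print("assumed lead time of metastasis over detection (days):", t_detection - t_seeding)
```

Under these assumptions the interval between `t_seeding` and `t_detection` is the window in which metastases could already be growing while the primary tumour remains clinically undetectable.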
Breast cancer; Computational calculations; Gompertzian growth function; Tumour growth models; Metastasis formation
The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. In order to discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve the prediction accuracy while considering sparsity.
The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%–91.7%). The SRF50 sets outperformed other methods by identifying compact gene sets needed for distinguishing between tested cancer subtypes (10- to 200-fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies when compared with single random forest (RF) and the Top 50 ANOVA results (80.1% vs 77.8% for GB; 84.0% vs 74.1% for BC; 89.8% vs 88.9% for OC in SRF50 vs single RF comparison; 80.1% vs 77.2% for GB; 84.0% vs 82.7% for BC; 89.8% vs 87.0% for OC in SRF50 vs Top 50 ANOVA comparison). There was significant overlap between SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. Through Ingenuity Pathway Analysis (IPA), the “hub” genes shared between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC.
The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high dimensional “omic” data and provides novel insights into the cellular mechanisms that define cancer subtypes.
Tree based models; High dimensional data; Cancer subtypes