The recent introduction of high-throughput sequencing technologies into clinical genetics has made it practical to sequence many genes simultaneously. In contrast, previous technologies limited sequencing-based tests to only a handful of genes. While the ability to diagnose inherited diseases more accurately is a great benefit, it introduces specific challenges. Interpretation of missense mutations continues to be challenging, and the number of variants of uncertain significance continues to grow.
We leveraged the data available at ARUP Laboratories, a major reference laboratory, for the CFTR gene to explore specific challenges related to variant interpretation, including a focus on understanding ethnic-specific variants and an evaluation of existing databases for clinical interpretation of variants. In this study we analyzed 555 patients representing eight different ethnic groups. We observed 184 different variants, most of which were ethnic group specific. Eighty-five percent of these variants were present in the Cystic Fibrosis Mutation Database, whereas the Human Mutation Database and dbSNP/1000 Genomes had far fewer of the observed variants. Finally, 21 of the variants were novel and we report these variants and their clinical classifications.
Based on our analyses of data from six years of CFTR testing at ARUP Laboratories, a more comprehensive, clinical-grade database is needed for the accurate interpretation of observed variants. Furthermore, there is a particular need for more and better information regarding variants from individuals of non-Caucasian ethnicity.
Cystic fibrosis; CFTR; Novel variants; Next-generation sequencing; Interpretation of variants
Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used to develop methods that may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying a support vector machine (SVM) to experimental data obtained by the MudPIT approach. In particular, we compared the performance of the SVM using two independent collections of complex samples and different data types, such as mass spectra (m/z), peptides and proteins.
Globally, protein and peptide data provided better discriminating information than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and collection 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra and allows the extraction of more informative features for the effective classification of samples. In addition, 80% of the protein and peptide features selected by the SVM matched the differentially expressed proteins identified by the MAProMa software.
These findings confirm the availability of most label-free quantitative methods based on the processing of spectral counts and SEQUEST-based SCORE values. They also stress the usefulness of MudPIT data for the correct grouping of sample phenotypes using both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and is a good starting point for translating proteomic methodology into clinical application.
Sample classification; MudPIT; SVM; Clinical proteomics; Label-free quantification
Hutchinson-Gilford progeria syndrome is a rare dominant human disease of genetic origin. The average life expectancy is about 20 years, patients’ quality of life is still very poor, and no efficient therapy has yet been developed. It is caused by mutation of the LMNA gene, which results in accumulation in the nuclear membrane of a particular splicing form of Lamin-A called progerin. The mechanism by which progerin perturbs cellular homeostasis and leads to the symptoms is still under debate.
Micro-RNAs are able to negatively regulate transcription by coupling with the 3’ UnTranslated Region of messenger RNAs. Several Micro-RNAs recognize the same 3’ UnTranslated Region and each Micro-RNA can recognize multiple 3’ UnTranslated Regions of different messenger RNAs. When different messenger RNAs are co-regulated via a similar panel of micro-RNAs, these messengers are called Competing Endogenous RNAs, or ceRNAs.
The 3’ UnTranslated Region of the longest LMNA transcript was analysed looking for its ceRNAs. The aim of this study was to search for candidate genes and gene ontology functions possibly influenced by LMNA mutations that may exert a role in progeria development.
11 miRNAs were isolated as potential LMNA regulators. By computational analysis, the miRNAs pointed to 17 putative LMNA ceRNAs. Gene ontology analysis of isolated ceRNAs showed an enrichment in RNA interference and control of cell cycle functions.
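The idea of ranking candidate ceRNAs by shared miRNA regulators can be sketched as follows. This is a minimal illustration only, not the pipeline used in the study: the function name `candidate_cernas`, the `min_shared` cutoff, and the toy miRNA and gene names are all hypothetical, and real miRNA-target predictions would come from an external prediction resource.

```python
from collections import defaultdict


def candidate_cernas(mirna_targets, query_gene, min_shared=2):
    """Rank genes by how many of the query gene's miRNAs also target them.

    mirna_targets maps a miRNA name to the set of genes whose 3' UTR it is
    predicted to bind (hypothetical input format).
    """
    # miRNAs predicted to bind the query gene's 3' UTR
    query_mirnas = {m for m, genes in mirna_targets.items() if query_gene in genes}

    # For every other gene, collect the query miRNAs that also target it
    shared = defaultdict(set)
    for mirna in query_mirnas:
        for gene in mirna_targets[mirna]:
            if gene != query_gene:
                shared[gene].add(mirna)

    hits = [(gene, sorted(ms)) for gene, ms in shared.items() if len(ms) >= min_shared]
    # Most shared miRNAs first, then alphabetical for a stable order
    return sorted(hits, key=lambda item: (-len(item[1]), item[0]))


# Toy, made-up predictions: miRNA -> predicted target genes
toy_targets = {
    "miR-A": {"LMNA", "GENE1", "GENE2"},
    "miR-B": {"LMNA", "GENE1"},
    "miR-C": {"GENE2", "GENE3"},
}
print(candidate_cernas(toy_targets, "LMNA"))
```

Requiring several shared miRNAs rather than one is a design choice that favours transcripts co-regulated by a similar panel of miRNAs, which is the defining property of ceRNAs described above.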
This study isolated novel genes and functions potentially involved in LMNA network of regulation that could be involved in laminopathies such as the Hutchinson-Gilford progeria syndrome.
CeRNA; Hutchinson-Gilford; Progeria; LMNA; Lamin-A; 3’ UTR; MiRNA
Identification of prognostic biomarkers is a hallmark of cancer genomics. Since miRNAs regulate the expression of multiple genes, they act as potent biomarkers in several cancers. Prognostically important miRNAs have been identified sporadically, but no resource available to date allows users to study the prognostics of miRNAs of interest in major cancer types using the wealth of available data.
In this paper, we present a web-based tool that allows users to study prognostic properties of miRNAs in several cancer types, using publicly available data. We compiled data from the Gene Expression Omnibus (GEO) and the more recently developed The Cancer Genome Atlas (TCGA) to create this tool. The tool is called “PROGmiR” and it is available at http://www.compbio.iupui.edu/progmir. Currently, our tool can be used to study overall survival implications for approximately 1050 human miRNAs in 16 major cancer types.
We believe this resource, as a hypothesis generation tool, will help researchers to link miRNA expression with cancer outcome and to design mechanistic studies. We studied the performance of our tool using miRNA biomarkers identified in published studies. The prognostic plots created using our tool for specific miRNAs in specific cancer types corroborated the findings of those studies.
miRNA; Prognostics; Cancer; Pan-cancer; Database; Signature; Biomarker
Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset.
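A rough sketch of a COPA-style outlier score that looks at both tails is given below. It follows the general COPA recipe (centre each gene on its median, scale by the median absolute deviation, then read off a percentile), with a lower percentile added for under-expressed outliers. It is not the mCOPA implementation; the function names, the choice of the 90th percentile, and the toy expression values are assumptions made for illustration.

```python
from statistics import median


def copa_transform(values):
    """Centre on the median and scale by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: treat every sample as a non-outlier
        return [0.0 for _ in values]
    return [(v - med) / mad for v in values]


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def outlier_scores(values, q=90):
    """Return upper-tail and lower-tail scores for one gene across samples."""
    z = sorted(copa_transform(values))
    return {"over": percentile(z, q), "under": percentile(z, 100 - q)}


# Toy expression values for one gene: one sample is strongly over-expressed
gene_expression = [0, 1, 2, 3, 4, 5, 6, 7, 8, 50]
print(outlier_scores(gene_expression))
```

A large positive "over" score flags a gene that is over-expressed in a subset of samples, and a large negative "under" score flags the down-regulated case that the original COPA formulation did not report.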
We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours.
We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.
Cancer; Outliers; Expression data; Expression profile; Cluster; Subtype; Heterogeneous; Bioinformatics; Percentile; Feature selection
Second generation RNA sequencing technology (RNA-seq) offers the potential to interrogate genome-wide differential RNA splicing in cancer. However, since short RNA reads spanning spliced junctions cannot be mapped contiguously onto the chromosomes, there is a need for methods to profile splicing from RNA-seq data. Before the advent of RNA-seq technologies, microarrays containing probe sequences representing exon-exon junctions of known genes were used to hybridize cellular RNAs for measuring context-specific differential splicing. Here, we extend this approach to detect tumor-specific splicing in prostate cancer from an RNA-seq dataset.
A database, SPEventH, representing probe sequences of under a million non-redundant splice events in human, is created with exon-exon junctions of optimized length for use as a virtual microarray. SPEventH is used to map tens of millions of reads from matched tumor-normal samples from ten individuals with prostate cancer. Differential counts of reads mapped to each event from tumor and matched normal samples are used to identify statistically significant tumor-specific splice events in prostate.
We find sixty-one (61) splice events that are differentially expressed, with a p-value of less than 0.0001 and a fold change of greater than 1.5, in prostate tumor compared to the respective matched normal samples. Interestingly, the only evidence in the public database, an EST (BF372485), for one of the tumor-specific splice events, which joins one of the introns in the KLK3 gene to an intron in KLK2, is also derived from prostate tumor tissue. Also, the 765 events with a p-value of less than 0.001 are shown to cluster all twenty samples in a context-specific fashion, with few exceptions stemming from low coverage of samples.
We demonstrate that virtual microarray experiments using a non-redundant database of splice events in human are an efficient and sensitive way to profile genome-wide splicing in biological samples and to detect tumor-specific splicing signatures in datasets from RNA-seq technologies. The signature from the large number of splice events that could cluster tumor and matched-normal samples into two tight separate clusters suggests that differential splicing is yet another RNA phenotype, alongside gene expression and SNPs, that can be exploited for tumor stratification.
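The virtual-microarray idea of counting reads against junction probes can be sketched as below. This is an illustrative toy, not the SPEventH mapping procedure: exact substring matching stands in for a real read aligner, and the probe layout (left and right exon flanks), the `min_overhang` parameter, the event identifier, and the sequences are all hypothetical.

```python
from collections import Counter


def spans_junction(read, left, right, min_overhang=3):
    """True if the read matches the probe and covers the exon-exon junction
    with at least min_overhang bases on each side."""
    probe = left + right
    junction = len(left)
    start = probe.find(read)
    while start != -1:
        end = start + len(read)
        if start <= junction - min_overhang and end >= junction + min_overhang:
            return True
        start = probe.find(read, start + 1)
    return False


def count_spanning(reads, probes, min_overhang=3):
    """Count junction-spanning reads per splice event.

    probes maps an event id to a (left_flank, right_flank) pair.
    """
    counts = Counter()
    for read in reads:
        for event_id, (left, right) in probes.items():
            if spans_junction(read, left, right, min_overhang):
                counts[event_id] += 1
    return counts


# Toy probe and reads (made-up sequences)
probes = {"event1": ("AAAACCCC", "GGGGTTTT")}
tumor_reads = ["CCCGGG", "CCCGGG", "AAAACC"]
normal_reads = ["CCCGGG"]

tumor_counts = count_spanning(tumor_reads, probes)
normal_counts = count_spanning(normal_reads, probes)
print(tumor_counts["event1"], normal_counts["event1"])
```

Requiring an overhang on both sides of the junction keeps reads that lie entirely within one exon from being counted as evidence for the splice event; the per-event tumor and normal counts would then feed whatever statistical test is chosen.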
Inflammatory bowel diseases, ulcerative colitis and Crohn’s disease are considered to be of autoimmune origin, but the etiology of irritable bowel syndrome remains elusive. Furthermore, classifying patients into irritable bowel syndrome and inflammatory bowel diseases can be difficult without invasive testing and holds important treatment implications. Our aim was to assess the ability of gene expression profiling in blood to differentiate among these subject groups.
Transcript levels of a total of 45 genes in blood were determined by quantitative real-time polymerase chain reaction (RT-PCR). We applied three separate analytic approaches: one used a scoring system derived from combinations of ratios of expression levels of two genes, and the other two used different support vector machines.
All methods discriminated different subject cohorts, irritable bowel syndrome from control, inflammatory bowel disease from control, irritable bowel syndrome from inflammatory bowel disease, and ulcerative colitis from Crohn’s disease, with high degrees of sensitivity and specificity.
These results suggest these approaches may provide clinically useful prediction of the presence of these gastro-intestinal diseases and syndromes.
Next generation sequencing provides clinical research scientists with a direct readout of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false positive variations selected, this number can be reduced further with the use of high quality whole genome reference data to minimize false positive variants prior to candidate gene selection. In addition, the use of platform-related sequencing error models can help in the recovery of ambiguous genotypes from lower coverage data.
We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes), Huvariome Core which comprises 31 healthy individuals from the Benelux region, and Diversity Panel consisting of 46 healthy individuals representing 10 different populations and 21 samples in three Pedigrees. Users can query the database by gene or position via a web interface and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments. Huvariome has been used to support the selection of candidate cardiomyopathy related genes which have a homozygous genotype in the reference cohorts. This database allows the users to see which selected variants are common variants (> 5% minor allele frequency) in the Huvariome core samples, thus aiding in the selection of potentially pathogenic variants by filtering out common variants that are not listed in one of the other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provides the user with the possibility of identifying platform dependent errors associated with specific regions of the human genome.
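The filtering step described above, removing candidates that are common in a reference cohort, can be sketched as follows. This is a simplified illustration and not the Huvariome interface: the variant tuple layout, the function name `filter_rare`, and the toy frequencies are hypothetical; only the 5% minor allele frequency cutoff comes from the text.

```python
def filter_rare(candidates, reference_maf, threshold=0.05):
    """Keep candidate variants that are not common in the reference cohort.

    candidates    -- iterable of (chrom, pos, ref, alt) tuples (hypothetical layout)
    reference_maf -- dict mapping the same tuples to minor allele frequency
    threshold     -- variants with MAF above this value are treated as common
    """
    kept = []
    for variant in candidates:
        # A variant never seen in the reference cohort is treated as rare
        maf = reference_maf.get(variant, 0.0)
        if maf <= threshold:
            kept.append(variant)
    return kept


# Toy data: made-up positions and frequencies
reference_maf = {
    ("chr1", 1000, "A", "G"): 0.32,   # common in the reference cohort
    ("chr2", 2000, "C", "T"): 0.01,   # rare in the reference cohort
}
candidates = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "G", "A"),         # absent from the reference cohort
]
print(filter_rare(candidates, reference_maf))
```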
Huvariome is a simple to use resource for validation of resequencing results obtained by NGS experiments. The high sequence coverage and low error rates provide scientists with the ability to remove false positive results from pedigree studies. Results are returned via a web interface that displays location-based genetic variation frequency, impact on protein function, association with known genetic variations and a quality score of the variation base derived from Huvariome Core and the Diversity Panel data. These results may be used to identify and prioritize rare variants that, for example, might be disease relevant. In testing the accuracy of the Huvariome database, alleles of a selection of ambiguously called coding single nucleotide variants were successfully predicted in all cases. Data protection of individuals is ensured by restricted access to patient derived genomes from the host institution which is relevant for future molecular diagnostics.
Medical genetics; Medical genomics; Whole genome sequencing; Allele frequency; Cardiomyopathy
Drug discovery typically starts with the identification of a potential target that is then tested and validated either through high-throughput screening against a library of drug compounds or by rational drug design. When the putative target is a protein, the latter approach requires knowledge of its structure. Finding the structure of a protein is, however, a difficult task. Significant progress has come from high-resolution techniques such as X-ray crystallography and NMR; there are, however, many proteins whose structures have not yet been solved. Computational techniques for structure prediction are viable alternatives to experimental techniques for these cases. However, the proper validation of the structural models they generate remains an issue.
In this report, we focus on homology modeling techniques and introduce the H-factor, a new indicator for assessing the quality of protein structure models generated with these techniques. The H-factor is meant to mimic the R-factor used in X-ray crystallography. The method for computing the H-factor is fully described with a demonstration of its effectiveness on a test set of target proteins.
We have developed a web service for computing the H-factor for models of a protein structure. This service is freely accessible at http://koehllab.genomecenter.ucdavis.edu/toolkit/h-factor.
Autism is the fastest growing developmental disorder in the world today. The prevalence of autism in the US has risen from 1 in 2500 in 1970 to 1 in 88 children today. People with autism present with repetitive movements and with social and communication impairments. These impairments can range from mild to profound. The estimated total lifetime societal cost of caring for one individual with autism is US$3.2 million. With the rapid growth in this disorder and the great expense of caring for those with autism, it is imperative for both individuals and society that techniques be developed to model and understand autism. There is increasing evidence that individuals diagnosed with autism present with a highly diverse set of abnormalities affecting multiple systems of the body. To date, little to no work has been done using a whole-body systems biology approach to model the characteristics of this disorder. Identifying and modelling these systems might lead to new and improved treatment protocols and to better diagnosis and treatment of the affected systems. These advances might improve quality of life by themselves and might also help the core symptoms of autism, given the potential interconnections between the brain and nervous system and the other systems being modeled. This paper first reviews research showing that autism impacts many systems in the body, including the metabolic, mitochondrial, immunological, gastrointestinal and neurological systems. These systems interact in complex and highly interdependent ways. Many of these disturbances have effects in most of the systems of the body. In particular, clinical evidence exists for increased oxidative stress, inflammation, and immune and mitochondrial dysfunction, which can affect almost every cell in the body. Three promising research areas are discussed: hierarchical, subgroup analysis and modeling over time.
This paper reviews some of the systems disturbed in autism and suggests several systems biology research areas. Autism poses a rich test bed for systems biology modeling techniques.
Autism; Mitochondrial dysfunction; Oxidative stress; Immune dysfunction; Gastrointestinal disease
Cancer therapy is a challenging research area because side effects often occur in chemotherapy and radiation therapy. We intend to study a multi-target, multi-component design that will provide synergistic results to improve the efficiency of cancer therapy.
We have developed a general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important cancer biomarkers based on SVM (Support Vector Machine) classification. In particular, we exemplify this method with three datasets: a prostate cancer dataset (three stages), a breast cancer dataset (four subtypes), and another prostate cancer dataset (normal vs. cancerous). Moreover, we have computed the target networks of these biomarkers as the signatures of the cancers, with additional information (mutual information between biomarkers of the network). Then, we proposed a robust framework for a synergistic therapy design approach that includes various existing mechanisms.
These methodologies were applied to three GEO datasets: GSE18655 (three prostate stages), GSE19536 (four breast cancer subtypes) and GSE21036 (prostate cancer cells and normal cells). We selected 96 biomarkers for the first prostate cancer dataset (three prostate stages), 72 for breast cancer (luminal A vs. luminal B), 68 for breast cancer (basal-like vs. normal-like), and 22 for the second prostate cancer dataset (cancerous vs. normal). In addition, we obtained statistically significant results of mutual information, which demonstrate that the dependencies among these biomarkers can be positive or negative.
We proposed an efficient feature ranking and selection scheme, AMFES, to select an important subset from a large number of features for any cancer dataset. We then obtained the signatures of these cancers by building their target networks. Finally, we proposed a robust framework of synergistic therapy for cancer patients. Our framework is not only supported by real GEO datasets but also aims toward a multi-target/multi-component drug design tool, which improves on the traditional single-target/single-component analysis methods. This framework builds a computational foundation that can provide a clear classification of cancers and lead to an efficient cancer therapy.
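The mutual information between two biomarkers, mentioned above as the extra information attached to the target network, can be sketched as follows. This is a generic plug-in estimate on discretized expression values, not the AMFES code; the median-split discretization, the function names, and the toy expression profiles are assumptions for illustration.

```python
from collections import Counter
from math import log2
from statistics import median


def discretize(values):
    """Binarize expression values at the median (one simple, assumed choice)."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    total = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        total += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return total


# Toy expression profiles for three hypothetical biomarkers across four samples
gene_a = [0.1, 0.2, 5.0, 6.0]
gene_b = [1.0, 1.1, 9.0, 9.5]   # rises and falls with gene_a
gene_c = [1.0, 9.0, 1.1, 9.5]   # unrelated to gene_a after discretization

print(mutual_information(discretize(gene_a), discretize(gene_b)))
print(mutual_information(discretize(gene_a), discretize(gene_c)))
```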
Feature selection; Biomarkers; Microarray; Therapy design; Target network
Ovarian cancer is the most deadly gynecological cancer because of late diagnosis, frequently with diffuse peritoneal metastases. Recent findings have shown that serous epithelial ovarian cancer has a narrow mutational spectrum with TP53 being the most frequently targeted when single genes are considered. It is, however, important to understand which pathways as a whole may be targeted for mutation.
Previously published mutational data from The Cancer Genome Atlas Network's findings on ovarian cancer were searched for statistically significant enrichment of genes in pathways. These pathways were then searched in all patients to identify the spectrum of mutations. Statistical significance was further shown through in-silico permutations of exome sequences using empirically observed mutation rates. We detected mutations in the cell adhesion pathway genes in more than 89% of serous epithelial ovarian cancer patients. This level of near-universal mutational targeting of the cell adhesion pathway, including the extracellular matrix pathway, is previously unreported in epithelial ovarian cancer.
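One common way to test whether mutated genes are enriched in a pathway is a one-sided hypergeometric test, sketched below. The text does not state which enrichment statistic was used, so this is a swapped-in, generic technique shown for illustration only; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, mutated_genes, overlap):
    """One-sided hypergeometric p-value: probability of seeing at least `overlap`
    pathway genes among `mutated_genes` drawn from `total_genes` without replacement."""
    upper = min(pathway_genes, mutated_genes)
    denominator = comb(total_genes, mutated_genes)
    numerator = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, mutated_genes - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / denominator


# Toy numbers: 10 genes in total, 5 in the pathway, 5 mutated, all 5 mutated genes in the pathway
print(hypergeom_enrichment_p(10, 5, 5, 5))
```

A permutation scheme such as the one mentioned above would instead build the null distribution empirically, by repeatedly redistributing mutations according to observed mutation rates and recounting pathway hits.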
Taken together with previous studies on the role of cell adhesion and extracellular matrix gene expression in ovarian cancer and metastasis, our results identify pathways for which the mutational prevalence has previously been overlooked using single gene approaches. Analysis of mutations at the pathway level will be critical in studying heterogeneous diseases such as ovarian cancer.
In early-stage cancer, primary treatment can be considered effective at eliminating the tumor for a non-negligible proportion of patients, whereas for the others it leads to a lower tumor burden and thereby potentially prolonged survival. In this mixed population of patients, it is of great interest to detect complex differences in survival distributions associated with molecular markers that potentially activate latent downstream pathways implicated in tumor progression.
We propose a novel model-based score test designed for identifying molecular markers with complex effects on survival in early-stage cancer. From a biological point of view, the proposed score test allows the detection of complex changes in the survival distributions linked to either the tumor burden or its dynamic growth.
Simulation results show that the proposed statistic is powerful at identifying departures from the null hypothesis of no survival difference. The practical use of the proposed statistic is exemplified by analyzing the prognostic impact of Kras mutation in early-stage lung adenocarcinomas. This analysis leads to the conclusion that Kras mutation has a significant negative prognostic impact on survival. Moreover, it emphasizes that the complex role of Kras mutation in survival would have been overlooked by considering results from the classical logrank test.
With the growing number of biological markers to be tested in early-stage cancer, the proposed score test statistic is a powerful tool for detecting molecular markers associated with complex survival patterns.
Clinical genomics; Survival analysis; Early-stage cancer; Cure rate model; Long-term survivors; Score test
A simple and fast computational model to describe the dynamics of tumour growth and metastasis formation is presented. The model is based on the calculation of successive generations of tumour cells and enables one to describe biologically important entities such as tumour volume, the time point of first metastatic growth or the number of metastatic colonies at a given time. The model relies entirely on the chronology of these successive events of the metastatic cascade. The simulation calculations were performed for two embedded growth models to describe the Gompertzian-like growth behaviour of tumours. The initial training of the models was carried out using an analytical solution for the size distribution of metastases of a hepatocellular carcinoma. We then show the applicability of our models to clinical data from the Munich Cancer Registry. Growth and dissemination characteristics of metastatic cells originating from cells in the primary breast cancer can be modelled, showing the models' ability to support systematic analyses relevant for clinical breast cancer research and treatment. In particular, our calculations show that metastasis formation has generally already been initiated before the primary tumour can be detected clinically.
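Gompertzian growth, the behaviour the embedded growth models are meant to reproduce, can be sketched with the standard closed-form Gompertz function. This is not the generation-based model of the paper; the parameter names `n0`, `a` and `b` and the numerical values are purely illustrative assumptions.

```python
from math import exp, log


def gompertz_size(t, n0, a, b):
    """Tumour size at time t under Gompertz growth.

    n0 -- initial size (for example, number of cells)
    a  -- initial growth rate
    b  -- rate at which growth decelerates; the size saturates at n0 * exp(a / b)
    """
    return n0 * exp((a / b) * (1.0 - exp(-b * t)))


def time_to_size(size, n0, a, b):
    """Invert the Gompertz function: time at which the tumour reaches `size`.

    Only defined for n0 <= size < n0 * exp(a / b).
    """
    return -log(1.0 - (b / a) * log(size / n0)) / b


# Illustrative parameters only (not fitted to any data)
n0, a, b = 1.0, 0.3, 0.01
detectable = 1e9  # an assumed detection threshold in cells
t_detect = time_to_size(detectable, n0, a, b)
print(t_detect, gompertz_size(t_detect, n0, a, b))
```

Given a seeding rule for metastases, the same inversion could be used to ask whether the first metastatic colony is founded before or after the primary reaches a detectable size.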
Breast cancer; Computational calculations; Gompertzian growth function; Tumour growth models; Metastasis formation
The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. In order to discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve the prediction accuracy while considering sparsity.
The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast cancer (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods by identifying compact gene sets needed for distinguishing between tested cancer subtypes (10–200 fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies when compared with single random forest (RF) and the Top 50 ANOVA results (80.1% vs 77.8% for GB; 84.0% vs 74.1% for BC; 89.8% vs 88.9% for OC in SRF50 vs single RF comparison; 80.1% vs 77.2% for GB; 84.0% vs 82.7% for BC; 89.8% vs 87.0% for OC in SRF50 vs Top 50 ANOVA comparison). There was significant overlap between SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. Through Ingenuity Pathway Analysis (IPA), the “hub” genes shared between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC.
The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high dimensional “omic” data and provides novel insights into the cellular mechanisms that define cancer subtypes.
Tree based models; High dimensional data; Cancer subtypes
Hazelnut is reported as a causative agent of allergic reactions. However, it is also an edible nut with health benefits. The allergenic characteristics of hazelnut samples after autoclaving (AC) and high-pressure (HHP) processing have been studied and are presented here. Previous studies demonstrated that AC treatments were responsible for structural transformation of protein structure motifs. Thus, structural analyses of allergen proteins from hazelnut were carried out to observe what occurs in relation to the specific-IgE recognition of the related allergenic proteins. The aims of this work are to evaluate the effect of AC and HHP processing on hazelnut in vitro allergenicity using human sera and to analyse the complexity of hazelnut allergen-protein structures.
Hazelnut samples were subjected to AC and HHP processing. The specific IgE-reactivity was studied in 15 allergic clinic patients via western blotting analyses. A series of homology-based bioinformatics 3D models (Cor a 1, Cor a 8, Cor a 9 and Cor a 11) were generated for the antigens included in the study to analyse the complexity of their protein structure. This study is supported by the Declaration of Helsinki and subsequent ethical guidelines.
A severe reduction in in vitro allergenicity to hazelnut after AC processing was observed in the allergic clinic patients studied. The specific-IgE binding of some of the described immunoreactive hazelnut protein bands (Cor a 1 ~18 kDa, Cor a 8 ~9 kDa, Cor a 9 ~35-40 kDa and Cor a 11 ~47-48 kDa) decreased. Furthermore, a relevant glycosylation was assigned and visualized via structural analysis of proteins (3D modelling) for the first time in the protein allergen Cor a 11, showing a new role that could open new avenues for unravelling allergenicity.
In vivo hazelnut allergenicity studies via Prick-Prick and other means using AC processing are crucial to verify the data we observed via in vitro analyses. Glycosylation studies provided us with clues to elucidate, in the near future, the mechanisms of the structures that contribute to hazelnut allergenicity, which may in turn help alleviate food allergies.
Structural analysis of allergen-proteins and Glycosylation
The mouse is widely used in animal testing for cardiovascular disease. However, a large number of cardiovascular drugs that were experimentally shown to work well in mice were withdrawn because they caused adverse side effects in humans.
In this study, we investigated whether binding patterns of withdrawn cardiovascular drugs are conserved between mouse and human through computational docking and molecular dynamics simulations. In addition, we measured the level of conservation of gene expression patterns of the drug targets and their interacting partners by analyzing microarray data.
The results show that target proteins of withdrawn cardiovascular drugs are functionally conserved between human and mouse. However, all the binding patterns of withdrawn drugs we retrieved show striking differences due to sequence divergence in the drug-binding pocket, mainly through loss or gain of hydrogen bond donors and distinct drug-binding pockets. The binding affinities of withdrawn drugs to their receptors tend to be reduced from mouse to human. In contrast, the FDA-approved and best-selling drugs are little affected.
Our analysis suggests that sequence divergence in the drug-binding pocket may be a reasonable explanation for the discrepancy in drug effects between animal models and humans.
Withdrawn cardiovascular drugs; Animal modeling; Sequence divergence; Side effects; Drug-binding pocket
Numerous biomedical software applications access databases maintained by the US National Center for Biotechnology Information (NCBI). To ease software automation, NCBI provides a powerful but complex Web-service-based programming interface, eUtils. This paper describes a toolset that simplifies eUtils use through a graphical front-end that can be used by non-programmers to construct data-extraction pipelines. The front-end relies on a code library that provides high-level wrappers around eUtils functions, and which is distributed as open-source, allowing customization and enhancement by individuals with programming skills.
We initially created an application that queried eUtils to retrieve nephrology-specific biomedical literature citations for a user-definable set of genes. We later augmented the application code to create a general-purpose library that accesses eUtils capability as individual functions that could be combined into user-defined pipelines.
The toolset’s use is illustrated with an application that serves as a front-end to the library and can be used by non-programmers to construct user-defined pipelines. The operation of the library is illustrated for the literature-surveillance application, which serves as a case-study. An overview of the library is also provided.
The library simplifies use of the eUtils service by operating at a higher level, and also transparently addresses robustness issues that would need to be individually implemented otherwise, such as error recovery and prevention of overloading of the eUtils service.
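The two robustness concerns named above, error recovery and prevention of overloading, can be illustrated with a minimal sketch. This is not the library described in this abstract: `ThrottledClient`, `build_esearch_url`, the retry count and the back-off schedule are all hypothetical names and values chosen for illustration. The sketch assumes the standard ESearch endpoint with its `db`, `term` and `retmax` parameters, and a limit of roughly three requests per second for clients without an API key.

```python
import time
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL from the standard db/term/retmax parameters."""
    query = urllib.parse.urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


class ThrottledClient:
    """Hypothetical wrapper: spaces out requests and retries failed ones."""

    def __init__(self, fetch=None, min_interval=0.34, max_retries=3,
                 backoff=1.0, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch or self._http_get
        self.min_interval = min_interval   # assumed ~3 requests per second
        self.max_retries = max_retries     # illustrative value
        self.backoff = backoff             # base delay in seconds, illustrative
        self._sleep = sleep
        self._clock = clock
        self._last_call = float("-inf")

    @staticmethod
    def _http_get(url):
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8")

    def get(self, url):
        last_error = None
        for attempt in range(self.max_retries):
            # Overload prevention: wait until min_interval has elapsed.
            wait = self.min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
            self._last_call = self._clock()
            try:
                return self._fetch(url)
            except OSError as error:  # urllib's URLError subclasses OSError
                # Error recovery: back off exponentially, then try again.
                last_error = error
                self._sleep(self.backoff * 2 ** attempt)
        raise RuntimeError(
            f"request failed after {self.max_retries} attempts"
        ) from last_error
```

A caller would write something like `ThrottledClient().get(build_esearch_url("pubmed", "some gene symbol"))`. The `fetch`, `sleep` and `clock` arguments are injectable so that the throttling and retry logic can be exercised without network access.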
Entrez Programming Utilities; Proteomics Analysis; Pubmed filters
3D domain swapping is a novel structural phenomenon observed in a diverse set of protein structures in oligomeric conformations. It is characterized by a distinct structural feature: structural segments in a protein dimer or higher oligomer are shared between two or more chains of the protein structure. 3D domain swapping has been observed as a key mediator of numerous functional mechanisms and plays a pathogenic role in various diseases, including conformational diseases such as amyloidosis, Alzheimer's disease, Parkinson's disease and prion diseases. We report the first study with a focus on identifying functional classes, pathways and diseases mediated by 3D domain swapping in the human proteome.
We used a panel of four enrichment tools with two different ontologies and two annotation databases to derive biologically and clinically relevant information associated with 3D domain swapping. Protein domain enrichment analysis followed by Gene Ontology (GO) term enrichment analysis revealed the functional repertoire of proteins involved in swapping. Pathway analysis using KEGG annotations revealed diverse pathway associations of human proteins involved in 3D domain swapping. Disease Ontology was used to find statistically significant associations between proteins in swapped conformation and various disease categories (P-value < 0.05).
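The abstract does not state which statistical test the four enrichment tools apply. A common choice for term enrichment is the one-sided hypergeometric test, and the following minimal sketch assumes that test purely for illustration; the function name and its arguments are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of proteins in the background set
    annotated:  background proteins carrying the term (GO, KEGG, disease)
    sample:     number of proteins in the study set (e.g. swapped proteins)
    overlap:    study-set proteins carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of observing `overlap` or more annotated proteins.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

A term would then be reported as enriched when this p-value falls below the chosen threshold, here P-value < 0.05, ideally after correction for the number of terms tested.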
We report meta-analysis results of a literature-curated dataset of human gene products involved in 3D domain swapping and discuss new insights about the functional repertoire, pathway associations and disease implications of proteins involved in 3D domain swapping.
Our integrated bioinformatics pipeline, comprising four different enrichment tools, two ontologies and two annotation databases, revealed new insights into the functional and disease correlations of 3D domain swapping. GO term enrichment was used to infer terms associated with three different GO categories. Protein domain enrichment was used to identify conserved domains enriched in swapped proteins. Pathway enrichment analysis using KEGG annotations revealed that proteins with swapped conformations are present in all six classes of the KEGG BRITE hierarchy, and significantly enriched KEGG pathways were observed in five classes. Functional disease ontology based enrichment analysis found five major classes of human disease to be significantly associated with 3D domain swapping: cancer, diseases of the respiratory or pulmonary system, degenerative diseases of the central nervous system, vascular disease and encephalitis. In conclusion, our study shows that bioinformatics based analytical approaches using curated data can enhance the understanding of the functional and disease implications of 3D domain swapping.
Protein aggregation; Human disease; Deposition disease; Human proteome; Data integration; Biological data mining
The 6th Benelux Bioinformatics Conference (BBC11), held in Luxembourg on 12 and 13 December 2011, attracted around 200 participants, including internationally renowned guest speakers, and more than 100 peer-reviewed submissions from 3 continents. Researchers from the public and private sectors convened at BBC11 to discuss advances and challenges in a wide spectrum of application areas. A key theme of the conference was the contribution of bioinformatics to enabling and accelerating translational and clinical research. BBC11 stressed the need for stronger collaborative efforts across disciplines and institutions. Demonstrating the clinical relevance of systems approaches and of next-generation sequencing-based measurement technologies is among the existing opportunities for increasing impact in translational research. Translational bioinformatics will benefit from research models that strike a balance between the importance of protecting intellectual property and the need to openly access scientific and technological advances. The full conference proceedings are freely available at http://www.bbc11.lu.
Translational bioinformatics; Clinical bioinformatics; Translational research; Systems biology; Next-generation sequencing; Bioinformatic infrastructure
Peripheral arterial disease (PAD) is a relatively common manifestation of systemic atherosclerosis that leads to progressive narrowing of the lumen of leg arteries. Circulating monocytes are in contact with the arterial wall and can serve as reporters of vascular pathology in the setting of PAD. We performed gene expression analysis of peripheral blood mononuclear cells (PBMC) in patients with PAD and controls without PAD to identify differentially regulated genes.
PAD was defined as an ankle brachial index (ABI) ≤0.9 (n = 19), while age and gender matched controls had an ABI > 1.0 (n = 18). Microarray analysis was performed using Affymetrix HG-U133 Plus 2.0 gene chips and analyzed using GeneSpring GX 11.0. Gene expression data were normalized using the Robust Multichip Analysis (RMA) normalization method. Differential expression was defined as a fold change ≥1.5, followed by an unpaired Mann-Whitney test (P < 0.05) and correction for multiple testing by the Benjamini and Hochberg False Discovery Rate. Meta-analysis of differentially expressed genes was performed using an integrated bioinformatics pipeline with tools for enrichment analysis using Gene Ontology (GO) terms, pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG), molecular event enrichment using Reactome annotations and network analysis using the Ingenuity Pathway Analysis suite. Extensive biocuration was also performed to understand the functional context of the genes.
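The Benjamini and Hochberg correction mentioned above can be sketched in a few lines. This is a generic illustration of the step-up adjustment, not the GeneSpring implementation used in the study; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value falls below the chosen false discovery rate would be retained; in the workflow described here that step follows the fold-change filter and the Mann-Whitney test.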
We identified 87 genes differentially expressed in the setting of PAD; 40 genes were upregulated and 47 genes were downregulated. We employed an integrated bioinformatics pipeline coupled with literature curation to characterize the functional coherence of differentially regulated genes.
Notably, upregulated genes mediate immune response, inflammation, apoptosis, stress response, phosphorylation, hemostasis, platelet activation and platelet aggregation. Downregulated genes included several genes from the zinc finger family that are involved in transcriptional regulation. These results provide insights into molecular mechanisms relevant to the pathophysiology of PAD.
Peripheral arterial disease; Gene expression; Microarray analysis; Vascular disease; Biomarkers
In the biological sciences the TCID50 (median tissue culture infective dose) assay is often used to determine the strength of a virus.
When the so-called Spearman-Kaerber calculation is used, the ratio between the pfu (the number of plaque forming units, the effective number of virus particles) and the TCID50 theoretically approaches a simple function of Euler's constant. Further, the standard deviation of the logarithm of the TCID50 approaches a simple function of the dilution factor and the number of wells used for determining the ratios in the assay. However, these theoretical calculations assume that the dilutions of the assay are independent, and in practice this is not completely correct. We therefore simulated the assay using Monte Carlo techniques.
Our simulation studies show that the theoretical results hold true for practical implementations of the assay. Furthermore, the simulation studies show that the distribution of the log of the TCID50, although discrete in nature, has a close relationship to the normal distribution.
The pfu is proportional to the TCID50 titre with a factor of about 0.56 when using the Spearman-Kaerber calculation method. The normal distribution can be used for statistical inferences, and ANOVA on the log of the TCID50 values is meaningful with group sizes of 5 and above.
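The factor of about 0.56 can be reproduced with a small Monte Carlo sketch. The code below is a simplified illustration, not the simulation used in the study: it assumes that each well is an independent Poisson draw (so it ignores the dependence between serial dilutions discussed above), that the "simple function of Euler's constant" is exp(-γ) ≈ 0.5615, and that the dilution series runs from all wells infected to no wells infected. All function names and default parameters are hypothetical.

```python
import math
import random


def spearman_kaerber_titre(fractions_infected, dilution_factor):
    """TCID50 per well volume of the undiluted sample.

    fractions_infected[i] is the fraction of infected wells at dilution
    dilution_factor ** -i, with i = 0 being the undiluted sample.
    """
    d = math.log10(dilution_factor)
    log10_titre = d * (sum(fractions_infected) - 0.5)
    return 10 ** log10_titre


def simulate_assay(pfu_per_well, dilution_factor, n_dilutions, wells, rng):
    """Simulate one assay; return the fraction of infected wells per step."""
    fractions = []
    for i in range(n_dilutions):
        dose = pfu_per_well / dilution_factor ** i
        # A well is infected when it receives at least one pfu (Poisson).
        p_infected = 1.0 - math.exp(-dose)
        hits = sum(rng.random() < p_infected for _ in range(wells))
        fractions.append(hits / wells)
    return fractions


def mean_pfu_to_tcid50_ratio(replicates=2000, pfu_per_well=1000.0,
                             dilution_factor=10.0, n_dilutions=8,
                             wells=8, seed=1):
    """Geometric mean of pfu / TCID50 over many simulated assays."""
    rng = random.Random(seed)
    log_ratios = []
    for _ in range(replicates):
        fractions = simulate_assay(pfu_per_well, dilution_factor,
                                   n_dilutions, wells, rng)
        titre = spearman_kaerber_titre(fractions, dilution_factor)
        log_ratios.append(math.log(pfu_per_well / titre))
    return math.exp(sum(log_ratios) / replicates)
```

Calling `mean_pfu_to_tcid50_ratio()` should give a value in the neighbourhood of 0.56; the exact figure depends on the seed and on where the dilution grid falls relative to the true titre.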
TCID50; Spearman-Kaerber; pfu; Euler's constant; ANOVA; Monte Carlo simulation
The salivary microbiota is a potential diagnostic indicator of several diseases. Culture-independent techniques are required to study the salivary microbial community since many of its members have not been cultivated.
We explored the bacterial community composition of a saliva sample using metagenomic whole genome shotgun (WGS) sequencing, the extraction of 16S rRNA gene fragments from metagenomic sequences (16S-WGS) and high-throughput sequencing of PCR-amplified regions V1 and V3 of the bacterial 16S rRNA gene (16S-HTS).
The hierarchical clustering of data based on the relative abundance of bacterial genera revealed that distances between 16S-HTS datasets for V1 and V3 regions were greater than those obtained for the same V region with different numbers of PCR cycles. Datasets generated by 16S-HTS and 16S-WGS were even more distant. Finally, comparison of WGS and 16S-based datasets revealed the highest dissimilarity.
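The abstract does not name the distance measure behind the clustering. As a minimal sketch, assuming the Bray-Curtis dissimilarity (a common choice for abundance profiles), the distance between two datasets could be computed from genus-level counts as follows; the function names are hypothetical.

```python
def relative_abundance(counts):
    """Convert genus -> read count into genus -> fraction of all reads."""
    total = sum(counts.values())
    return {genus: n / total for genus, n in counts.items()}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 to 1)."""
    genera = set(a) | set(b)
    # Abundance shared by both profiles, genus by genus.
    shared = sum(min(a.get(g, 0.0), b.get(g, 0.0)) for g in genera)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total
```

A matrix of such pairwise distances between the 16S-HTS, 16S-WGS and WGS datasets would then be the input to a hierarchical clustering routine.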
The analysis of the 16S-HTS, WGS and 16S-WGS datasets revealed 206, 56 and 39 bacterial genera, respectively, 124 of which have not been previously identified in salivary microbiomes. A large fraction of DNA extracted from saliva corresponded to human DNA. Based on sequence similarity search against completely sequenced genomes, bacterial and viral sequences represented 0.73% and 0.0036% of the salivary metagenome, respectively. Several sequence reads were identified as parts of the human herpesvirus 7.
Analysis of the salivary metagenome may have implications for diagnostics, for example in the detection of microorganisms and viruses without the need to design specific tests for each pathogen.
High blood glucose and diabetes are amongst the conditions causing the greatest losses in years of healthy life worldwide. Therefore, numerous studies aim to identify reliable risk markers for the development of impaired glucose metabolism and type 2 diabetes. However, the molecular basis of impaired glucose metabolism is so far insufficiently understood. The development of so-called 'omics' approaches in recent years promises to identify molecular markers and to further our understanding of the molecular basis of impaired glucose metabolism and type 2 diabetes. Although univariate statistical approaches are often applied, we demonstrate here that multivariate statistical approaches are highly recommended to fully capture the complexity of data gained using high-throughput methods.
We took blood plasma samples from 172 subjects who participated in the prospective Metabolic Syndrome Berlin Potsdam follow-up study (MESY-BEPO Follow-up). We analysed these samples using Gas Chromatography coupled with Mass Spectrometry (GC-MS) and measured 286 metabolites. Furthermore, fasting glucose levels were measured using standard methods at baseline and after an average of six years. We performed correlation analysis and built linear regression models as well as Random Forest regression models to identify metabolites that predict the development of fasting glucose in our cohort.
We found a metabolic pattern consisting of nine metabolites that predicted fasting glucose development with an accuracy of 0.47 in tenfold cross-validation using Random Forest regression. We also showed that adding established risk markers did not improve the model accuracy. However, external validation remains desirable. Although not all metabolites belonging to the final pattern have been identified yet, the pattern directs attention to amino acid metabolism, energy metabolism and redox homeostasis.
We demonstrate that metabolites identified using a high-throughput method (GC-MS) perform well in predicting the development of fasting plasma glucose over several years. Notably, it is not a single metabolite but a complex pattern of metabolites that drives the prediction, which reflects the complexity of the underlying molecular mechanisms. This result could only be captured by applying multivariate statistical approaches. Therefore, we highly recommend the use of statistical methods that capture the complexity of the information provided by high-throughput methods.
prediction; fasting glucose; type 2 diabetes; metabolomics; plasma; random forest; metabolite; regression; biomarker
Translational and evidence-based medicine can take advantage of biotechnology advances that offer a fast-growing variety of high-throughput data for screening molecular activities at the genomic, transcriptional, post-transcriptional and translational levels. The clinical information hidden in these data can be clarified with clinical bioinformatics approaches. We have recently proposed a method to analyze different layers of high-throughput (omic) data so as to preserve the emergent properties that appear in the cellular system when all molecular levels are interacting. We show here that this method, applied to brain cancer data, can uncover properties (i.e. molecules related to protective versus risky features in different types of brain cancers) that have been independently validated as survival markers, with potentially important applications in clinical practice.
glioblastoma; survival; system; emergent property; high-throughput biology