Understanding the perception of patients on research ethics issues related to biobanking is important to enrich ethical discourse and help inform policy.
We examined the views of leukemia patients undergoing treatment in clinics located in the Princess Margaret Hospital in Toronto, Ontario, Canada. An initial written survey was provided to 100 patients (64.1% response rate) followed by a follow-up survey (62.5% response rate) covering the topics of informed consent, withdrawal, anonymity, incidental findings and the return of results, ownership, and trust.
The majority (59.6%) preferred one-time consent, 30.3% desired a tiered consent approach that provides multiple options, and 10.1% preferred re-consent for future research. When asked different questions on re-consent, most (58%) reported that re-consent was a waste of time and money, but 51.7% indicated they would feel respected and involved if asked to re-consent. The majority of patients (62.2%) stated they had a right to withdraw their consent, but many changed their mind in the follow-up survey explaining that they should not have the right to withdraw consent. Nearly all of the patients (98%) desired being informed of incidental health findings and explained that the information was useful. Of these, 67.3% of patients preferred that researchers inform them and their doctors of the results. The majority of patients (62.2%) stated that the research institution owns the samples whereas 19.4% stated that the participants owned their samples. Patients had a great deal of trust in doctors, hospitals and government-funded university researchers, moderate levels of trust for provincial governments and industry-funded university researchers, and low levels of trust towards industry and insurance companies.
Many cancer patients surveyed preferred a one-time consent although others desired some form of control. The majority of participants wanted a continuing right to withdraw consent and nearly all wanted to be informed of incidental findings related to their health. Patients had a great deal of trust in their medical professionals and publicly funded researchers as opposed to profit-based industries and insurance companies.
Biobank; Tissue repository; Cancer patient perspectives; Consent; Withdrawal; Anonymity; Incidental findings; Return of results; Ownership; Trust
The editors of BMC Medical Genomics would like to thank all our reviewers who have contributed to the journal in Volume 5 (2012).
Consumption of high-fat diets has negative impacts on health and well-being, some of which may be epigenetically regulated. Selenium and folate are two compounds which influence epigenetic mechanisms. We investigated the hypothesis that post-weaning supplementation with adequate levels of selenium and folate in offspring of female mice fed a high-fat, low selenium and folate diet during gestation and lactation will lead to epigenetic changes of potential importance for long-term health.
Female offspring of mothers fed the experimental diet were either maintained on this diet (HF-low-low), or weaned onto a high-fat diet with sufficient levels of selenium and folate (HF-low-suf), for 8 weeks. Gene and protein expression, DNA methylation, and histone modifications were measured in colon and liver of female offspring.
Adequate levels of selenium and folate post-weaning affected gene expression in colon and liver of offspring, including decreasing Slc2a4 gene expression. Protein expression was only altered in the liver. There was no effect of adequate levels of selenium and folate on global histone modifications in the liver. Global liver DNA methylation was decreased in mice switched to adequate levels of selenium and folate, but there was no effect on methylation of specific CpG sites within the Slc2a4 gene in liver.
Post-weaning supplementation with adequate levels of selenium and folate in female offspring of mice fed high-fat diets inadequate in selenium and folate during gestation and lactation can alter global DNA methylation in liver. This may be one factor through which the negative effects of a poor diet during early life can be ameliorated. Further research is required to establish what role epigenetic changes play in mediating observed changes in gene and protein expression, and the relevance of these changes to health.
Epigenetic; Microarray analysis; 2D-DIGE; Proteomics; Folate; Selenium; High fat
Ring chromosome 6 is a rare constitutional abnormality that generally occurs de novo. The related phenotype may be highly variable, ranging from an almost normal phenotype to severe malformations and mental retardation. These features are mainly present when genetic material at the end of the chromosome is lost. The severity of the phenotype seems to be related to the size of the deletion. About 25 cases have been described to date, but the vast majority report only conventional cytogenetic investigations.
Here we present an accurate cyto-molecular characterization of a ring chromosome 6 in a 16-month-old Caucasian girl with mild motor developmental delay, cardiac defect, and facial anomalies. The cytogenetic investigations showed a karyotype 46,XX,r(6)(p25q27) and FISH analysis revealed the absence of the signals on both arms of chromosome 6. These results were confirmed by means of array-CGH showing terminal deletions on 6p25.3 (1.3 Mb) and 6q26.27 (6.7 Mb). Our data were compared to current literature.
Our report describes the case of a patient with a ring chromosome 6 abnormality completely characterized by array CGH which provided additional information for genotype-phenotype studies.
Array-CGH; Heart defects; Ring chromosome 6
Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles.
We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared the results to those from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals.
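The significance threshold above relies on false discovery rate correction. The abstract does not say which procedure was used; the sketch below assumes the Benjamini-Hochberg step-up method, and the gene set names and p-values are hypothetical illustration data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four gene sets (not taken from the study).
gene_set_p = {"set_A": 0.001, "set_B": 0.02, "set_C": 0.03, "set_D": 0.5}
names = list(gene_set_p)
adj = benjamini_hochberg([gene_set_p[n] for n in names])
significant = [n for n, q in zip(names, adj) if q < 0.05]
print(significant)
```

With thousands of gene sets tested per data set, a correction of this kind keeps the expected share of false discoveries among the reported sets below the chosen level.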
Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a similar toxicity pattern as the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals.
Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
Text mining; Toxicogenomics; Gene set analysis
A number of neurodevelopmental syndromes are caused by mutations in genes encoding proteins that normally function in epigenetic regulation. Identification of epigenetic alterations occurring in these disorders could shed light on molecular pathways relevant to neurodevelopment.
Using a genome-wide approach, we identified genes with significant loss of DNA methylation in blood of males with intellectual disability and mutations in the X-linked KDM5C gene, encoding a histone H3 lysine 4 demethylase, in comparison to age/sex matched controls. Loss of DNA methylation in such individuals is consistent with known interactions between DNA methylation and H3 lysine 4 methylation. Further, loss of DNA methylation at the promoters of the three top candidate genes FBXL5, SCMH1, CACYBP was not observed in more than 900 population controls. We also found that DNA methylation at these three genes in blood correlated with dosage of KDM5C and its Y-linked homologue KDM5D. In addition, parallel sex-specific DNA methylation profiles in brain samples from control males and females were observed at FBXL5 and CACYBP.
We have, for the first time, identified epigenetic alterations in patient samples carrying a mutation in a gene involved in the regulation of histone modifications. These data support the concept that DNA methylation and H3 lysine 4 methylation are functionally interdependent. The data provide new insights into the molecular pathogenesis of intellectual disability. Further, our data suggest that some DNA methylation marks identified in blood can serve as biomarkers of epigenetic status in the brain.
KDM5C; DNA methylation; H3K4 methylation; Intellectual disability
DNA methylation is an inheritable chemical modification of cytosine, and represents one of the most important epigenetic events. Computational prediction of the DNA methylation status can be employed to speed up genome-wide methylation profiling, and to identify the key features that are correlated with various methylation patterns. Here, we develop CpGIMethPred, a set of support vector machine-based models to predict the methylation status of the CpG islands in the human genome under normal conditions. The features for prediction include those that have been previously demonstrated to be effective (CpG island specific attributes, DNA sequence composition patterns, DNA structure patterns, distribution patterns of conserved transcription factor binding sites and conserved elements, and histone methylation status) as well as those that have not been extensively explored but are likely to contribute additional information from a biological point of view (nucleosome positioning propensities, gene functions, and histone acetylation status). Statistical tests are performed to identify the features that are significantly correlated with the methylation status of the CpG islands, and principal component analysis is then performed to decorrelate the selected features. Data from the Human Epigenome Project (HEP) are used to train, validate and test the predictive models. Specifically, the models are trained and validated by using the DNA methylation data obtained in the CD4 lymphocytes, and are then tested for generalizability using the DNA methylation data obtained in the other 11 normal tissues and cell types.
Our experiments have shown that (1) an eight-dimensional feature space that is selected via the principal component analysis and that combines all categories of information is effective for predicting the CpG island methylation status, (2) by incorporating the information regarding the nucleosome positioning, gene functions, and histone acetylation, the models can achieve higher specificity and accuracy than the existing models while maintaining a comparable sensitivity measure, (3) the histone modification (methylation and acetylation) information contributes significantly to the prediction, without which the performance of the models deteriorates, and (4) the predictive models generalize well to different tissues and cell types. The developed program CpGIMethPred is freely available at http://users.ece.gatech.edu/~hzheng7/CGIMetPred.zip.
This is an introduction to the supplement to BMC Medical Genomics that includes 16 papers selected from the 2011 World Congress in Computer Science, Computer Engineering, and Applied Computing as well as other sources, with a focus on genomics studies of human diseases.
Bidirectional promoters are promoter sequences shared between divergent gene pairs (genes proximal to each other on opposite strands), and can regulate the genes in both directions. In the human genome, > 10% of protein-coding genes are arranged head-to-head on opposite strands, with transcription start sites that are separated by < 1,000 base pairs. Many transcription factor binding sites that influence the expression of the 2 opposite genes occur in bidirectional promoters. Recently, RNA polymerase II (RPol II) ChIP-seq data have been used to identify the promoters of coding genes and non-coding RNAs. However, bidirectional promoters have not yet been identified with RPol II ChIP-seq data.
In some bidirectional promoter regions, the RPol II signal forms a bi-peak shape, which indicates that 2 promoters are located in the bidirectional region. We have developed a computational approach to identify the regulatory regions of all divergent gene pairs using genome-wide RPol II binding patterns derived from ChIP-seq data, based upon the assumption that the distribution of RPol II binding around a bidirectional promoter is the accumulation of RPol II binding at 2 promoters. In HeLa S3 cells, 249 promoter pairs and 1094 single promoters were identified, of which 76 promoters cover only positive genes, 86 promoters cover only negative genes, and 932 promoters cover 2 genes. Gene expression levels and STAT1 binding sites for the different promoter categories were then examined.
Identifying the regulatory regions of bidirectional promoters from RPol II binding patterns provides important temporal and spatial measurements regarding the initiation of transcription. Gene expression and transcription factor binding site analyses suggest that the promoters in bidirectional regions may regulate the closest gene, and that STAT1 is involved in the primary promoter.
Cadmium (Cd2+) is a known nephrotoxin causing tubular necrosis during acute exposure and potentially contributing to renal failure in chronic long-term exposure. To investigate changes in global gene expression elicited by cadmium, an in vitro exposure system was developed from cultures of human renal epithelial cells derived from cortical tissue obtained from nephrectomies. These cultures exhibit many of the qualities of proximal tubule cells. Using these cells, a study was performed to determine the cadmium-induced global gene expression changes after short-term (1 day, 9, 27, and 45 μM) and long-term cadmium exposure (13 days, 4.5, 9, and 27 μM). These studies revealed fundamental differences in the types of genes expressed during each of these time points. The obtained data were further analyzed using regression to identify cadmium toxicity responsive genes. Regression analysis showed that 403 genes were induced and 522 genes were repressed by Cd2+ within 1 day, and that 366 and 517 genes were induced and repressed, respectively, after 13 days. We developed a gene set enrichment analysis method to identify the cadmium-induced pathways that are unique in comparison to traditional approaches. The perturbation of global gene expression by various Cd2+ concentrations and multiple time points enabled us to study the transcriptional dynamics and gene interaction using a mutual information-based network model. The most prominent network module consisted of INHBA, KIF20A, DNAJA4, AKAP12, ZFAND2A, AKR1B10, SCL7A11, and AKR1C1.
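A mutual information-based network scores gene pairs by how much one gene's expression state tells about the other's. The abstract does not describe the estimator, so the following is only a minimal sketch, assuming that log expression ratios are first discretized into three states with a hypothetical threshold; the profiles are made-up illustration values.

```python
from collections import Counter
from math import log2


def discretize(log_ratios, threshold=1.0):
    """Map log ratios to -1 (repressed), 0 (unchanged) or 1 (induced).

    The threshold is an assumption chosen for illustration only.
    """
    return [1 if v >= threshold else -1 if v <= -threshold else 0
            for v in log_ratios]


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Hypothetical log ratios for two genes across six exposure conditions.
gene_a = [0.1, 1.5, 2.0, -1.2, -1.8, 0.0]
gene_b = [0.2, 1.1, 1.9, -1.5, -1.1, -0.3]
mi = mutual_information(discretize(gene_a), discretize(gene_b))
print(round(mi, 3))
```

In a network model of this kind, an edge would be drawn between two genes when their mutual information exceeds a chosen cut-off; how that cut-off was set in the study is not stated in the abstract.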
Computational genomics of Alzheimer disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order which generates higher quality gene clustering patterns than most other clustering methods. However, there are few available gene order computing methods such as Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set forth to evaluate the performances of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation.
Using different distance formulas (Pearson distance, Euclidean distance, and the squared Euclidean distance) and other conditions, gene orders were calculated by the ACO and GA (including standard GA and improved GA) methods. The quality of the gene orders was compared, and new features from the calculated gene orders were identified.
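The three distance formulas can be written out as follows. This is a sketch under stated assumptions: Pearson distance is taken as one minus the Pearson correlation, and the quality of a gene order is taken as the summed distance between adjacent genes (lower is better), which the abstract does not define. The gene names and profiles are hypothetical.

```python
from math import sqrt


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return sqrt(squared_euclidean(a, b))


def pearson_distance(a, b):
    """1 - Pearson correlation; undefined for constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)


def order_cost(order, profiles, distance):
    """Sum of distances between neighbouring genes in a linear order."""
    return sum(distance(profiles[g], profiles[h])
               for g, h in zip(order, order[1:]))


# Hypothetical expression profiles for three genes.
profiles = {"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]}
print(order_cost(["g1", "g2", "g3"], profiles, squared_euclidean))
print(order_cost(["g2", "g1", "g3"], profiles, squared_euclidean))
```

An optimizer such as GA or ACO would search over permutations of the genes for the order that minimizes a cost of this form; the choice of distance changes which order is considered best.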
Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data.
Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods.
Schizophrenia (SCZ) and type 2 diabetes mellitus (T2D) are both complex diseases. Accumulated studies indicate that schizophrenia patients are prone to present the type 2 diabetes symptoms, but the potential mechanisms behind their association remain unknown. Here we explored the pathogenetic association between SCZ and T2D based on pathway analysis and protein-protein interaction.
With sets of prioritized susceptibility genes for SCZ and T2D, we identified significant pathways (with adjusted p-value < 0.05) specific for SCZ or T2D and for both diseases based on pathway enrichment analysis. We also constructed a network to explore the crosstalk among those significant pathways. Our results revealed that some pathways are shared by both SCZ and T2D through a number of susceptibility genes. With 382 unique susceptibility proteins for SCZ and T2D, we further built a protein-protein interaction network by extracting their nearest interacting neighbours. Among 2,104 retrieved proteins, 364 were found to interact simultaneously with susceptibility proteins of both SCZ and T2D, and were proposed as new candidate risk factors for both diseases. Literature mining supported the potential association of some of the new candidate proteins with both SCZ and T2D. Moreover, some proteins were hub proteins with high connectivity and interacted with multiple proteins involved in both diseases, implying their pleiotropic effects for the pathogenic association. Some of these hub proteins are components of our identified enriched pathways, including calcium signaling, γ-secretase mediated ErbB4 signaling, adipocytokine signaling, insulin signaling, AKT signaling and type II diabetes mellitus pathways. Through the integration of multiple lines of information, we proposed that those signaling pathways, which contain susceptibility genes for both diseases, could be the key pathways to bridge SCZ and T2D. AKT could be one of the important shared components and may play a pivotal role in linking the two pathogenetic processes.
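The candidate selection step (proteins that interact with susceptibility proteins of both diseases) can be sketched as a set operation on an interaction list. The protein labels below are placeholders, and excluding the susceptibility proteins themselves from the candidates is an assumption based on the word "new" in the text.

```python
def shared_neighbours(interactions, scz_proteins, t2d_proteins):
    """Proteins that interact with at least one SCZ and one T2D protein."""
    neighbours = {}
    for a, b in interactions:
        # Treat every interaction as undirected.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seeds = scz_proteins | t2d_proteins
    candidates = set()
    for protein, partners in neighbours.items():
        if protein in seeds:
            continue  # assumption: known susceptibility proteins are not "new"
        if partners & scz_proteins and partners & t2d_proteins:
            candidates.add(protein)
    return candidates


# Placeholder labels: S* are SCZ proteins, T* are T2D proteins.
interactions = [("S1", "X"), ("T1", "X"), ("S1", "Y"),
                ("S2", "Y"), ("T1", "Z"), ("S1", "T1")]
print(shared_neighbours(interactions, {"S1", "S2"}, {"T1"}))
```

Here only X is retained, because Y touches SCZ proteins alone and Z touches a T2D protein alone.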
Our study is the first network and pathway-based systematic analysis for SCZ and T2D, and provides the general pathway-based view of pathogenetic association between two diseases. Moreover, we identified a set of candidate genes potentially contributing to the linkage between these two diseases. This research offers new insights into the potential mechanisms underlying the co-occurrence of SCZ and T2D, and thus, could facilitate the inference of novel hypotheses for the co-morbidity of the two diseases. Some etiological factors that exert pleiotropic effects shared by the significant pathways of two diseases may have important implications for the diseases and could be therapeutic intervention targets.
Insulin resistance is a key element in the pathogenesis of type 2 diabetes mellitus. Plasma free fatty acids were assumed to mediate the insulin resistance, while the relationship between lipid and glucose disposal remains to be demonstrated across liver, skeletal muscle and blood.
We profiled both lipidomics and gene expression of 144 total peripheral blood samples, 84 from patients with T2D and 60 from healthy controls. Then, factor and partial least squares models were used to perform a combined analysis of lipidomics and gene expression profiles to uncover the bioprocesses that are associated with lipidomic profiles in type 2 diabetes.
According to factor analysis of the lipidomic profile, several species of lipids were found to be correlated with different phenotypes, including diabetes-related C23:2CE, C23:3CE, C23:4CE, ePE36:4, ePE36:5, ePE36:6; race-related (African-American) PI36:1; and sex-related PE34:1 and LPC18:2. The major variance of the gene expression profile was not caused by known factors, and no significant difference could be derived directly from the differential gene expression profile. However, the combination of lipidomic and gene expression analyses allowed us to reveal the correlation between the altered lipid profile and significantly enriched pathways, such as one carbon pool by folate, arachidonic acid metabolism, insulin signaling pathway, amino sugar and nucleotide sugar metabolism, propanoate metabolism, and starch and sucrose metabolism. The genes in these pathways showed a good capability to classify diabetes samples.
Combined analysis of gene expression and lipidomic profiling reveals type 2 diabetes-associated lipid species and enriched biological pathways in peripheral blood, while gene expression profile does not show direct correlation. Our findings provide a new clue to better understand the mechanism of disordered lipid metabolism in association with type 2 diabetes.
It is well known that DNA methylation, as an epigenetic factor, has an important effect on gene expression and disease development. Detecting differentially methylated loci under different conditions, such as cancer types or treatments, is of great interest in current research as it is important in cancer diagnosis and classification. However, inappropriate testing approaches can result in large numbers of false positives and/or false negatives. Appropriate and powerful statistical methods are desirable but very limited in the literature.
In this paper, we propose a nonparametric method to detect differentially methylated loci under multiple conditions for Illumina Array Methylation data. We compare the new method with other methods using simulated and real data. Our study shows that the proposed one outperforms other methods considered in this paper.
Due to the unique feature of the Illumina Array Methylation data, commonly used statistical tests will lose power or give misleading results. Therefore, appropriate statistical methods are crucial for this type of data. Powerful statistical approaches remain to be developed.
R code is available upon request.
In HBV-infected patients, different genotypes of the hepatitis B virus influence liver disease progression and response to antiviral therapy. Moreover, long-term antiviral therapy will eventually select for drug-resistant mutants. Detection of mutations associated with antiviral therapy and HBV genotyping are essential for monitoring treatment of chronic hepatitis B patients.
In this study, we established a simple method of partial S gene sequencing, based on a common PCR amplification, that sensitively genotypes clinical HBV isolates and detects drug-resistant mutations at the same time.
The partial S gene sequencing assay developed in this study has potential for application in HBV genotyping and drug-resistant mutation detection. It is simpler and more convenient than traditional S gene sequencing, but has nearly the same sensitivity and specificity.
Breast cancer is the second most common type of cancer worldwide after lung cancer. Traditional mammography and tissue microarrays have been studied for early cancer detection and cancer prediction. However, there is a need for more reliable diagnostic tools for early detection of breast cancer. This can be a challenge due to a number of factors and logistics. First, obtaining tissue biopsies can be difficult. Second, mammography may not detect small tumors, and is often unsatisfactory for younger women who typically have dense breast tissue. Lastly, breast cancer is not a single homogeneous disease but consists of multiple disease states, each arising from a distinct molecular mechanism and having a distinct clinical progression path, which makes the disease difficult to detect and predict in early stages.
In the paper, we present a Support Vector Machine based on Recursive Feature Elimination and Cross Validation (SVM-RFE-CV) algorithm for early detection of breast cancer in peripheral blood and show how to use SVM-RFE-CV to model the classification and prediction problem of early detection of breast cancer in peripheral blood.
The training set, which consists of 32 healthy and 33 cancer samples, and the testing set, which consists of 31 healthy and 34 cancer samples, were randomly separated from a dataset of peripheral blood of breast cancer that was downloaded from Gene Expression Omnibus. First, we identified the 42 differentially expressed biomarkers between "normal" and "cancer". Then, with the SVM-RFE-CV we extracted 15 biomarkers that yield zero cross validation score. Lastly, we compared the classification and prediction performance of SVM-RFE-CV with that of SVM and SVM Recursive Feature Elimination (SVM-RFE).
We found that 1) the SVM-RFE-CV is suitable for analyzing noisy high-throughput microarray data, 2) it outperforms SVM-RFE in the robustness to noise and in the ability to recover informative features, and 3) it can improve the prediction performance (Area Under Curve) in the testing data set from 0.5826 to 0.7879. Further pathway analysis showed that the biomarkers are associated with Signaling, Hemostasis, Hormones, and Immune System, which are consistent with previous findings. Our prediction model can serve as a general model for biomarker discovery in early detection of other cancers. In the future, Polymerase Chain Reaction (PCR) is planned for validation of the ability of these potential biomarkers for early detection of breast cancer.
Over 10,000 long intergenic non-coding RNAs (lincRNAs) have been identified in the human genome. Some have been well characterized and are known to participate in various stages of gene regulation. In the post-transcriptional process, another class of well-known small non-coding RNA, microRNA (miRNA), is very active in inhibiting mRNA. Similar features between mRNA and lincRNA have been revealed in several recent studies, and a few isolated miRNA-lincRNA relationships have been observed. Despite these advances, the comprehensive miRNA regulation pattern of lincRNA has not been clarified.
In this study, we investigated the possible interaction between the two classes of non-coding RNAs. Instead of using the existing long non-coding database, we employed an ab initio method to annotate lincRNAs expressed in a group of normal breast tissues and breast tumors.
Approximately 90 lincRNAs that carry at least one predicted miRNA target site show a strong inverse expression correlation with the corresponding miRNAs. These target sites are statistically more conserved than their neighboring genetic regions and other predicted target sites. Several miRNAs that target these lincRNAs are known to play an essential role in breast cancer.
Similar to their inhibition of mRNAs, miRNAs show potential in promoting the degradation of lincRNAs. Breast-cancer-related miRNAs may influence their target lincRNAs, resulting in differential expression in normal and malignant breast tissues. This implies that the miRNA regulation of lincRNAs may be involved in the regulatory process in tumor cells.
MicroRNAs (miRNAs) are short non-coding RNA molecules that regulate mRNA transcript levels and translation. Deregulation of microRNAs is indicated in a number of diseases and microRNAs are seen as a promising target for biomarker identification and drug development. miRNA expression is commonly measured by microarray or real-time polymerase chain reaction (RT-PCR). The findings of RT-PCR data are highly dependent on the normalization techniques used during preprocessing of the Cycle Threshold readings from RT-PCR. Some of the commonly used endogenous controls themselves have been discovered to be differentially expressed in various conditions such as cancer, making them inappropriate internal controls.
We demonstrate that RT-PCR data contains a systematic bias resulting in large variations in the Cycle Threshold (CT) values of the low-abundant miRNA samples. We propose a new data normalization method that considers all available microRNAs as endogenous controls. A weighted normalization approach is utilized to allow contribution from all microRNAs, weighted by their empirical stability.
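One way to picture a weighted normalization with a single control parameter is sketched below. The abstract gives no formula, so everything here is an assumption: stability is taken as the inverse variance of a miRNA's CT values across samples, the hypothetical exponent `alpha` tunes how strongly stability is rewarded, and each sample is normalized by subtracting its weighted reference CT.

```python
from statistics import pvariance


def stability_weights(ct, alpha):
    """Weights proportional to (1 / variance) ** alpha, summing to 1.

    ct is a list of samples, each a list of CT values (one per miRNA).
    """
    n_mirnas = len(ct[0])
    raw = []
    for j in range(n_mirnas):
        var = pvariance([row[j] for row in ct])
        raw.append((1.0 / (var + 1e-9)) ** alpha)  # small term avoids 1/0
    total = sum(raw)
    return [w / total for w in raw]


def normalize(ct, alpha=1.0):
    """Subtract each sample's stability-weighted reference CT."""
    weights = stability_weights(ct, alpha)
    normalized = []
    for row in ct:
        reference = sum(w * v for w, v in zip(weights, row))
        normalized.append([v - reference for v in row])
    return normalized


# Hypothetical CT values: two samples, three miRNAs.
ct_values = [[20.0, 25.0, 30.0], [21.0, 27.0, 30.0]]
print(normalize(ct_values, alpha=0.0))  # equal weights: global mean
print(normalize(ct_values, alpha=1.0))  # dominated by the most stable miRNA
```

Under these assumptions, `alpha = 0` reproduces global mean normalization, while a large `alpha` approaches normalization against the single most stable miRNA, which illustrates how one parameter could span several familiar schemes.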
The systematic bias in RT-PCR data is illustrated on a microRNA dataset obtained from primary cutaneous melanocytic neoplasms. We show that through a single control parameter, this method is able to emulate other commonly used normalization methods and thus provides a more general approach. We explore the consistency of RT-PCR expression data with microarray expression by utilizing a dataset where both RT-PCR and microarray profiling data are available for the same miRNA samples.
A weighted normalization method allows the contribution of all of the miRNAs, whether they are highly abundant or have low expression levels. Our findings further suggest that the normalization of a particular miRNA should rely on only miRNAs that have comparable expression levels.
microRNA; RT-PCR; Normalization; Microarray
Early detection of breast cancer in blood is both appealing clinically and challenging technically due to the disease's elusive nature and heterogeneity. Today, even though major breast cancer subtypes have been characterized, i.e., luminal A, luminal B, HER2+, and basal-like, little is known about the heterogeneity of breast cancer in blood, knowledge of which could help to discover minimally invasive protein biomarkers with which clinical researchers can detect, classify, and monitor different breast cancer subtypes.
In this study, we performed an integrative pathway-assisted clustering analysis of breast cancer subtypes from plasma proteome samples collected from 80 patients diagnosed with breast cancer and 80 healthy women. First, four breast cancer subtypes and an additional unknown subtype (according to existing annotation) were determined based on pathology lab test results in primary tumors of enrolled patients. Next, we developed and applied four distance metrics, i.e., Protein Intensity, Q-Value, Pathway Profile, and Distance Score Function, to measure and characterize these cancer subtypes. Then, we developed a permutation test to evaluate the significant protein level changes in each biological pathway for each breast cancer subtype, using q-value. Lastly, we developed a pathway-protein matrix for each of the four distance methods to estimate the distance between breast cancer subtypes, for which further Pathway Association Network analysis was performed.
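The permutation test mentioned above can be illustrated generically. This is a minimal sketch, not the study's procedure: it assumes the test statistic is the absolute difference in mean protein intensity between a cancer subtype and healthy controls, uses a fixed random seed for repeatability, and runs on made-up intensities. Converting the resulting p-values to q-values is a separate step not shown here.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if mean_diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical intensities for one protein in one pathway.
cancer = [10.0, 11.0, 12.0, 13.0, 14.0]
healthy = [1.0, 2.0, 3.0, 4.0, 5.0]
p = permutation_p_value(cancer, healthy)
print(p < 0.05)
```

Because the null distribution is built from the data themselves, a test of this kind makes no assumption about how protein intensities are distributed.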
We found that 1) the luminal group (luminal A and luminal B) is clustered together, as is the basal group (basal-like and HER2+), and 2) luminal A and luminal B are closer to each other than basal-like and HER2+ are to each other. Our results were consistent with a recent independent breast cancer study from the Cancer Genome Atlas Network using genomic DNA copy number arrays, DNA methylation, exome sequencing, messenger RNA arrays, microRNA sequencing and reverse-phase protein arrays. Our results showed that changes of different breast cancer subtypes at the pathway level are more profound and less variable than those at the molecular level. Similar subtypes share distinct yet similar pathway activation networks, while dissimilar subtypes are different also at the level of pathway activation networks. The results also showed that distance or similarity of cancer subtypes based on pathway analysis might be able to provide further insight into the intrinsic relationship of breast cancer subtypes. We believe the integrative pathway-assisted proteomics analysis described here can become a model for reliable clustering or classification of other cancer subtypes.
Next generation sequencing (NGS) technologies have greatly facilitated the rapid and economical detection of pathogenic mutations in human disorders. However, mutation descriptions are difficult to compare and integrate because different articles adopt different reference sequences, annotation tools, and disease/trait nomenclatures.
The Human Disease Associated Mutation (HDAM) database is dedicated to collecting, standardizing and re-annotating mutations for human diseases discovered by NGS studies. In the current release, HDAM contains 1,114 mutations, located in 669 genes and associated with 125 human diseases through literature mining. All mutation records have uniform and unequivocal descriptions of sequence changes according to the Human Genome Variation Society (HGVS) nomenclature recommendations. Each entry displays comprehensive information, including the mutation location in the genome (hg18/hg19), gene functional annotation, protein domain annotation, susceptible diseases, and the first literature report of the mutation. Moreover, new mutation-disease relationships predicted by a Bayesian network are also presented under each mutation.
HDAM contains hundreds of rigorously curated human mutations from NGS studies and was created to provide a comprehensive view of the mutations that confer susceptibility to common disorders. HDAM can be freely accessed at http://www.megabionet.org/HDAM.
One of the challenges in classifying cancer tissue samples based on gene expression data is to establish an effective method that can select a parsimonious set of informative genes. The Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and prediction analysis of microarrays (PAM) are four popular classifiers that have comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes, while TSP and k-TSP always use an even number of genes. In addition, the selection of distinct gene pairs in k-TSP simply combines the pairs of top-ranking genes without considering that the gene set with the best discrimination power may not be a combination of such pairs. The k-TSP algorithm also requires the user to specify an upper bound for the number of gene pairs. Here we introduce a computational algorithm to address these problems. The algorithm is named the Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier, abbreviated as TSG.
The TSG classifier starts with the top two genes and sequentially adds further genes to the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes selected with cross validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms TSP family classifiers by a large margin in most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers: it is easy to interpret, is invariant to monotone transformations, often selects a small number of informative genes that allow follow-up studies, and is resistant to sampling variations because it relies on within-sample operations.
Redefining the scores for gene sets and the classification rules in TSP family classifiers by incorporating the sample size information can lead to better selection of informative genes and better classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.
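For context, the baseline TSP score that the TSP family builds on can be sketched as follows. This illustrates the standard pairwise score only; it is not the chi-square-based TSG score, whose exact definition the abstract does not give. The function names and the samples-by-genes layout of `expr` are assumptions for illustration.

```python
from itertools import combinations

def tsp_score(expr, labels, i, j):
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for genes i and j.

    expr is a list of samples, each a list of expression values per gene;
    labels holds the class (0 or 1) of each sample.
    """
    freq = {}
    for cls in (0, 1):
        samples = [row for row, y in zip(expr, labels) if y == cls]
        # Only the within-sample ordering of the two genes is used,
        # which makes the score invariant to monotone transformations.
        freq[cls] = sum(row[i] < row[j] for row in samples) / len(samples)
    return abs(freq[0] - freq[1])

def top_scoring_pair(expr, labels):
    """Return the gene index pair with the highest TSP score."""
    n_genes = len(expr[0])
    return max(combinations(range(n_genes), 2),
               key=lambda pair: tsp_score(expr, labels, *pair))
```

Because the score depends only on which of two genes is higher within a sample, it always consumes genes in pairs, which is the even-number restriction that the TSG approach aims to remove.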
Understanding how genes are expressed specifically in particular tissues is a fundamental question in developmental biology. Many tissue-specific genes are involved in the pathogenesis of complex human diseases. However, experimental identification of tissue-specific genes is time consuming and difficult. The accurate predictions of tissue-specific gene targets could provide useful information for biomarker development and drug target identification.
In this study, we have developed a machine learning approach for predicting human tissue-specific genes using microarray expression data. The lists of known tissue-specific genes for different tissues were collected from the UniProt database, and the expression data retrieved from the previously compiled dataset according to these lists were used for input vector encoding. Random Forests (RFs) and Support Vector Machines (SVMs) were used to construct accurate classifiers. The RF classifiers were found to outperform SVM models for tissue-specific gene prediction. The results suggest that the candidate genes for brain- or liver-specific expression can provide valuable information for further experimental studies. Our approach was also applied to identify tissue-selective gene targets for different types of tissues.
A machine learning approach has been developed for accurately identifying candidate genes with tissue-specific or tissue-selective expression. The approach provides an efficient way to select genes of interest for developing new biomedical markers and for improving our knowledge of tissue-specific expression.
One of the most common causes of premature cancer death worldwide is non-small cell lung carcinoma (NSCLC), which has a very low survival rate of 8%-15%. Since patients diagnosed at an early stage can have up to four times the survival rate, discovering cost-effective biological markers that can be used to improve the diagnosis and prognosis of the disease is an important clinical challenge.
In the last few years, significant progress has been made to address this challenge, with identified biomarkers ranging from 5-gene signatures to 133-gene signatures. However, a typical molecular sub-classification method for lung carcinomas would have a low predictive accuracy of 68%-71% because datasets of gene-expression profiles typically have tens of thousands of genes for only a few hundred patients. This type of dataset creates many technical challenges that affect the accuracy of the diagnostic prediction.
We discovered that a small set of nine gene-signatures (JAG1, MET, CDH5, ABCC3, DSP, ABCD3, PECAM1, MAPRE2 and PDF5) from the dataset of 12,600 gene-expression profiles of NSCLC acts like an inference basis for NSCLC and hence can be used as genetic markers. This very small and previously unknown set of biological markers gives an almost perfect predictive accuracy (99.75%) for the diagnosis of the disease and the sub-type of cancer. Furthermore, we present a novel method that finds genetic markers for sub-classification of NSCLC. We use generalized Lorenz curves and Gini ratios to overcome many challenges arising from datasets of gene-expression profiles. Our method discovers novel genetic changes that occur in lung tumors using gene-expression profiles.
While proteins encoded by some of these gene-signatures (e.g., JAG1 and MAPRE2) have been shown to be involved in the signal transduction of cells and proliferation control of normal cells, specific functions of proteins encoded by other gene-signatures have not yet been determined. Hence, this work opens new questions for structural and molecular biologists about the role of these gene-signatures in the disease.
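As background for the Lorenz curve and Gini ratio terminology, the following minimal sketch computes the ordinary Lorenz curve and Gini coefficient of a list of non-negative values, such as one gene's expression across samples. The abstract does not define its generalized Lorenz curves or explain how the ratios drive marker selection, so this shows only the textbook quantities; the function names are illustrative.

```python
def lorenz_curve(values):
    """Points (cumulative share of items, cumulative share of total)."""
    xs = sorted(values)
    total = sum(xs)
    points = [(0.0, 0.0)]
    running = 0.0
    for k, x in enumerate(xs, start=1):
        running += x
        points.append((k / len(xs), running / total))
    return points

def gini(values):
    """Gini coefficient of non-negative values with a positive sum.

    Uses G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    on values sorted in ascending order (ranks start at 1).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A coefficient near 0 indicates values spread evenly across samples, while a value near (n - 1) / n indicates that a few samples carry almost all of the total.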
Colorectal cancer (CRC) is a heterogeneous and biologically poorly understood disease. To tailor CRC treatment, it is essential to first model this heterogeneity by defining subtypes of patients with homogeneous biological and clinical characteristics and second match these subtypes to cell lines for which extensive pharmacological data is available, thus linking targeted therapies to patients most likely to respond to treatment.
We applied a new unsupervised, iterative approach to stratify CRC tumor samples into subtypes based on genome-wide mRNA expression data. By applying this stratification to several CRC cell line panels and integrating pharmacological response data, we generated hypotheses regarding the targeted treatment of different subtypes.
In agreement with earlier studies, the two dominant CRC subtypes are highly correlated with a gene expression signature of epithelial-mesenchymal-transition (EMT). Notably, further dividing these two subtypes using iNMF (iterative Non-negative Matrix Factorization) revealed five subtypes that exhibit activation of specific signaling pathways and show significant differences in clinical and molecular characteristics. Importantly, we were able to validate the stratification on independent, published datasets comprising over 1600 samples. Application of this stratification to four CRC cell line panels comprising 74 different cell lines showed that the tumor subtypes are well represented in available CRC cell line panels. Pharmacological response data for targeted inhibitors of SRC, WNT, GSK3b, aurora kinase, PI3 kinase, and mTOR showed significant differences in sensitivity across cell lines assigned to different subtypes. Importantly, some of these differences in sensitivity were in concordance with high expression of the targets or activation of the corresponding pathways in primary tumor samples of the same subtype.
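The building block behind iNMF, plain non-negative matrix factorization, can be sketched with the widely used multiplicative update rules. This is not the authors' iterative procedure, which the abstract does not describe; the function names, the genes-by-samples layout of `v`, and all parameters are assumptions for illustration.

```python
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (genes x samples) as W (genes x rank) times H (rank x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h
```

In a subtyping setting, each sample (column of H) could then be assigned to the factor with the largest coefficient, and the factorization repeated within each resulting group to split it further.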
The stratification presented here is robust, captures important features of CRC, and offers valuable insight into functional differences between CRC subtypes. By matching the identified subtypes to cell line panels that have been pharmacologically characterized, it opens up new possibilities for the development and application of targeted therapies for defined CRC patient sub-populations.
Colorectal cancer; Tumor subtyping; Cell lines; Targeted therapy
The SH2B1 gene (Src-homology 2B adaptor protein 1 gene) is a solid candidate gene for obesity. Large-scale GWAS identified markers in the vicinity of the gene, and animal models suggest a potential relevance for human body weight regulation.
We performed a mutation screen for variants in the SH2B1 coding sequence in 95 extremely obese children and adolescents. Detected variants were genotyped in independent childhood and adult study groups (up to 11,406 obese or overweight individuals and 4,568 controls). The functional implications of the detected variants for STAT3-mediated leptin signalling were analyzed in vitro.
We identified two new rare mutations and five known SNPs (rs147094247, rs7498665, rs60604881, rs62037368 and rs62037369) in SH2B1. Mutation g.9483C/T leads to a non-synonymous, non-conservative exchange in the beta (βThr656Ile) and gamma (γPro674Ser) splice variants of SH2B1. It was additionally detected in two of 11,206 (extremely) obese or overweight children, adolescents and adults, but not in 4,506 population-based normal-weight or lean controls. The non-coding mutation g.10182C/A at the 3’ end of SH2B1 was only detected in three obese individuals. For the non-synonymous SNP rs7498665 (Thr484Ala) we observed nominal over-transmission of the previously described risk allele in 705 obesity trios (nominal p = 0.009, OR = 1.23) and an increased frequency of the same allele in 359 cases compared to 429 controls (nominal p = 0.042, OR = 1.23). The obesity risk-alleles at Thr484Ala and βThr656Ile/γPro674Ser had no effect on STAT3 mediated leptin receptor signalling in splice variants β and γ.
The rare coding mutation βThr656Ile/γPro674Ser (g.9483C/T) in SH2B1 was exclusively detected in overweight or obese individuals. Functional analyses did not reveal impairments in leptin signalling for the mutated SH2B1.
SH2B1; Obesity; BMI; rs7498665; Mutation screen