Preterm birth (PTB), or birth before 37 weeks of gestation, is the leading cause of newborn death worldwide. PTB is a critical area of scientific study not only due to its worldwide toll on human lives and economies, but also due to our limited understanding of its pathogenesis and, therefore, its prevention. This systematic review and meta-analysis synthesizes the landscape of PTB transcriptomics research to further our understanding of the genes and pathways involved in PTB subtypes.
We evaluated published genome-wide pregnancy studies across gestational tissues and pathologies, including those that focus on PTB, by performing a targeted PubMed MeSH search and systematically reviewing all relevant studies.
Our search yielded 2,361 studies on gestational tissues including placenta, decidua, myometrium, maternal blood, cervix, fetal membranes (chorion and amnion), umbilical cord, fetal blood, and basal plate. Selecting only those original research studies that measured transcription on a genome-wide scale and reported lists of expressed genetic elements identified 93 gene expression, 21 microRNA, and 20 methylation studies. Although 30 % of all PTB cases are due to medical indications, 76 % of the preterm studies focused on them. In contrast, only 18 % of the preterm studies focused on spontaneous onset of labor, which is responsible for 45 % of all PTB cases. Furthermore, only 23 of the 10,993 unique genetic elements reported to be transcriptionally active were recovered 10 or more times in these 134 studies. Meta-analysis of the 93 gene expression studies across 9 distinct gestational tissues and 29 clinical phenotypes showed limited overlap of genes identified as differentially expressed across studies.
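The recurrence count described above (how often a genetic element is reported across studies) can be sketched as follows. This is a minimal illustration only: the function name, the placeholder element names, and the toy input are hypothetical and are not taken from the review's data or code.

```python
from collections import Counter

def recurrent_elements(study_lists, min_studies=10):
    """Map each element reported by at least `min_studies` studies to its study count."""
    counts = Counter()
    for elements in study_lists:
        counts.update(set(elements))  # count each element at most once per study
    return {name: n for name, n in counts.items() if n >= min_studies}

# Hypothetical toy input: three studies with placeholder element names.
toy_studies = [
    ["GENE_A", "GENE_B", "GENE_B"],
    ["GENE_A", "GENE_C"],
    ["GENE_A", "GENE_B"],
]
print(recurrent_elements(toy_studies, min_studies=2))
```

With the review's threshold, `min_studies=10` would be applied to the reported lists from all 134 studies.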
Overall, profiles of differentially expressed genes were highly heterogeneous, both between and within clinical subtypes and tissues and between studies of the same clinical subtype and tissue. These results suggest that large gaps still exist in the transcriptomic study of specific clinical subtypes as well as in the generation of the transcriptional profile of well-studied clinical subtypes; understanding the complex landscape of prematurity will require large-scale, systematic genome-wide analyses of human gestational tissues for understudied and well-studied subtypes alike.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0099-8) contains supplementary material, which is available to authorized users.
Preterm birth; Gestational tissues; Transcriptomics; Gene expression; microRNA; Methylation; Preeclampsia; Idiopathic preterm birth; Meta-analysis
With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signature detection driven by massive omics data. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem despite its importance in the coming era of personalized medicine.
In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines for different model selection methods. We further categorize the diagnostic biases into different types by conducting rigorous kernel matrix analysis and provide effective machine learning methods to conquer the diagnostic biases.
We found that diagnostic biases occur for data with different distributions and for SVMs with different kernels. Moreover, we identify three types of diagnostic biases in SVM diagnostics in total: overfitting bias, label skewness bias, and underfitting bias, and present the corresponding reasons through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and overcome because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose a derivative component analysis based support vector machine (DCA-SVM) that overcomes the label skewness bias while achieving diagnostic results that rival clinical ones.
Our studies demonstrate that the diagnostic biases are mainly caused by three major factors, i.e., kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also benefits machine learning by adding new results to kernel-based learning for omics data.
Translational bioinformatics; Omics; Diagnostic biases; Machine learning
Loss of heterozygosity (LOH) is a common genetic event in cancer development, and is known to be involved in the somatic loss of wild-type alleles in many inherited cancer syndromes. The wider involvement of LOH in cancer is assumed to relate to unmasking a somatically mutated tumour suppressor gene through loss of the wild type allele.
We analysed 86 ovarian carcinomas for mutations in 980 genes selected on the basis of their location in common regions of LOH.
We identified 36 significantly mutated genes, but these could only partly account for the quanta of LOH in the samples. Using our own and TCGA data we then evaluated five possible models to explain the selection for non-random accumulation of LOH in ovarian cancer genomes: 1. Classic two-hit hypothesis: high frequency biallelic genetic inactivation of tumour suppressor genes. 2. Epigenetic two-hit hypothesis: biallelic inactivation through methylation and LOH. 3. Multiple alternate-gene biallelic inactivation: low frequency gene disruption. 4. Haplo-insufficiency: Single copy gene disruption. 5. Modified two-hit hypothesis: reduction to homozygosity of low penetrance germline predisposition alleles. We determined that while high-frequency biallelic gene inactivation under model 1 is rare, regions of LOH (particularly copy-number neutral LOH) are enriched for deleterious mutations and increased promoter methylation, while copy-number loss LOH regions are likely to contain under-expressed genes suggestive of haploinsufficiency. Reduction to homozygosity of cancer predisposition SNPs may also play a minor role.
It is likely that selection for regions of LOH depends on its effect on multiple genes. Selection for copy number neutral LOH may better fit the classic two-hit model whereas selection for copy number loss may be attributed to its effect on multi-gene haploinsufficiency. LOH mapping alone is unlikely to be successful in identifying novel tumour suppressor genes; a combined approach may be more effective.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0123-z) contains supplementary material, which is available to authorized users.
Tumour suppressor gene; SNP; Ovarian cancer; Mutation; Haploinsufficiency
Sickle cell disease and β thalassemia are common severe diseases with little effective pathophysiologically-based treatment. Their phenotypic heterogeneity prompted genomic approaches to identify modifiers that ultimately might be exploited therapeutically. Fetal hemoglobin (HbF) is the major modulator of the phenotype of the β hemoglobinopathies. HbF inhibits deoxyHbS polymerization and in β thalassemia compensates for the reduction of HbA. The major success of genomics has been a better understanding of the genetic regulation of HbF, achieved by identifying the major quantitative trait loci for this trait. If the targets identified can lead to means of increasing HbF to therapeutic levels in sufficient numbers of sickle or β thalassemia erythrocytes, the pathophysiology of these diseases would be reversed. The availability of new target loci, high-throughput drug screening, and recent advances in genome editing provide the opportunity for new approaches to therapeutically increasing HbF production.
Genetic variation can alter transcriptional regulatory activity contributing to variation in complex traits and risk of disease, but identifying individual variants that affect regulatory activity has been challenging. Quantitative sequence-based experiments such as ChIP-seq and DNase-seq can detect sites of allelic imbalance where alleles contribute disproportionately to the overall signal suggesting allelic differences in regulatory activity.
We created an allelic imbalance detection pipeline, AA-ALIGNER, to remove reference mapping biases influencing allelic imbalance detection and evaluate accuracy of allelic imbalance predictions in the absence of complete genotype data. Using the sequence aligner, GSNAP, and varying amounts of genotype information to remove mapping biases we investigated the accuracy of allelic imbalance detection (binomial test) in CREB1 ChIP-seq reads from the GM12878 cell line. Additionally we thoroughly evaluated the influence of experimental and analytical parameters on imbalance detection.
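As a rough illustration of the binomial test mentioned above, the sketch below computes an exact two-sided p-value for the reference and alternate read counts at one heterozygous site. It is not AA-ALIGNER's implementation: the function name, the null expectation of 0.5, and the example counts are assumptions made for illustration.

```python
from math import comb

def binomial_two_sided_p(ref_reads: int, alt_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic read counts at one site.

    Assumes a null reference-allele fraction `p` (0.5 if there is no imbalance).
    Intended for modest read depths; very deep sites would need a
    log-space or library implementation to avoid float overflow.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome no more likely than the observed one.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: 30 reference reads versus 10 alternate reads.
print(binomial_two_sided_p(30, 10))
```

A site would be called imbalanced when this p-value falls below a chosen significance threshold, after correcting for the number of sites tested.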
Using imputed partial sample genotypes, AA-ALIGNER detected >95 % of the imbalances identified using complete genotypes, with >90 % accuracy. AA-ALIGNER performed nearly as well using common variants when genotypes were unknown. In contrast, predicting additional heterozygous sites and imbalances using the sequence data led to >50 % false positive rates. We evaluated effects of experimental data characteristics and key analytical parameter settings on imbalance detection. Overall, total base coverage and signal dispersion across the genome most affected our ability to detect imbalances, while parameters such as imbalance significance, imputation quality thresholds, and alignment mismatches had little effect. To assess the biological relevance of imbalance predictions, we used electrophoretic mobility shift assays to functionally test for predicted allelic differences in CREB1 binding in the GM12878 lymphoblast cell line. Six of nine tested variants exhibited allelic differences in binding. Two of these variants, rs2382818 and rs713875, are located within inflammatory bowel disease-associated loci.
AA-ALIGNER accurately detects allelic imbalance in quantitative sequence data using partial genotypes or common variants filling a critical methodological gap in these analyses, as full genotypes are rarely available. Importantly, we demonstrate how experimental and analytical features impact imbalance detection providing guidance for similar future studies.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0117-x) contains supplementary material, which is available to authorized users.
Allelic imbalance; Genome mapping bias; Transcription factor binding; CREB1; Inflammatory bowel disease; Alleles; GWAS; ChIP-seq
The presence of loss-of-heterozygosity (LOH) mutations in cancer cell genomes is commonly encountered. Moreover, the occurrences of LOHs in tumor suppressor genes play important roles in oncogenesis. However, because the causative mechanisms underlying LOH mutations in cancer cells yet remain to be elucidated, enquiry into the nature of these mechanisms based on a comprehensive examination of the characteristics of LOHs in multiple types of cancers has become a necessity.
We performed next-generation sequencing on inter-Alu sequences of five different types of solid tumors and acute myeloid leukemias, employing the AluScan platform which entailed amplification of such sequences using multiple PCR primers based on the consensus sequences of Alu elements; as well as the whole genome sequences of a lung-to-liver metastatic cancer and a primary liver cancer. Paired-end sequencing reads were aligned to the reference human genome to identify major and minor alleles so that the partition of LOH products between homozygous-major vs. homozygous-minor alleles could be determined at single-base resolution. Strict filtering conditions were employed to avoid false positives. Measurements of LOH occurrences in copy number variation (CNV)-neutral regions were obtained through removal of CNV-associated LOHs.
We found: (a) average occurrence of copy-neutral LOHs amounting to 6.9 % of heterologous loci in the various cancers; (b) the mainly interstitial nature of the LOHs; and (c) preference for formation of homozygous-major over homozygous-minor, and transitional over transversional, LOHs.
The characteristics of the cancer LOHs, observed in both AluScan and whole genome sequencings, point to the formation of LOHs through repair of double-strand breaks by interhomolog recombination, or gene conversion, as the consequence of a defective DNA-damage response, leading to a unified mechanism for generating the mutations required for oncogenesis as well as the progression of cancer cells.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0104-2) contains supplementary material, which is available to authorized users.
Copy number variation; Double strand break repair; Gain-of-heterozygosity; Gene conversion; Inter-homologous recombination; Loss-of-heterozygosity
Adipose tissue-derived stromal stem cells (ASCs) represent a promising regenerative resource for soft tissue reconstruction. Although autologous grafting of whole fat has long been practiced, a major clinical limitation of this technique is inconsistent long-term graft retention. To understand the changes in cell function during the transition of ASCs into fully mature fat cells, we compared the transcriptome profiles of cultured undifferentiated human primary ASCs under conditions leading to acquisition of a mature adipocyte phenotype.
Microarray analysis was performed on total RNA extracted from separate ASC isolates of six human adult females before and after 7 days (7 days: early stage) and 21 days (21 days: late stage) of adipocyte differentiation in vitro. Differential gene expression profiles were determined using Partek Genomics Suite Version 6.4 for analysis of variance (ANOVA) based on time in culture. We also performed unsupervised hierarchical clustering to test for gene expression patterns among the three cell populations. Ingenuity Pathway Analysis was used to determine biologically significant networks and canonical pathways relevant to adipogenesis.
Cells at each stage showed remarkable intra-group consistency of expression profiles while abundant differences were detected across stages and groups. More than 14,000 transcripts were significantly altered during differentiation while ~6000 transcripts were affected between 7 days and 21 days cultures. Setting a cutoff of +/-two-fold change, 1350 transcripts were elevated while 2929 genes were significantly decreased by 7 days. Comparison of early and late stage cultures revealed increased expression of 1107 transcripts while 606 genes showed significantly reduced expression. In addition to confirming differential expression of known markers of adipogenesis (e.g., FABP4, ADIPOQ, PLIN4), multiple genes and signaling pathways not previously known to be involved in regulating adipogenesis were identified (e.g. POSTN, PPP1R1A, FGF11) as potential novel mediators of adipogenesis. Quantitative RT-PCR validated the microarray results.
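The +/- two-fold cutoff used above can be illustrated with a small sketch. It covers only the fold-change filter, not the ANOVA significance testing that the study also applied; the function name and the expression values are hypothetical.

```python
from math import log2

def classify_fold_change(before: float, after: float, cutoff: float = 2.0) -> str:
    """Label a transcript 'up', 'down', or 'unchanged' for a +/- cutoff-fold change."""
    if before <= 0 or after <= 0:
        raise ValueError("expression values must be positive")
    log_fc = log2(after / before)   # symmetric around 0 on the log2 scale
    threshold = log2(cutoff)        # a two-fold change corresponds to 1.0
    if log_fc >= threshold:
        return "up"
    if log_fc <= -threshold:
        return "down"
    return "unchanged"

# Hypothetical linear-scale intensities (undifferentiated, day 7).
for before, after in [(100.0, 250.0), (100.0, 40.0), (100.0, 150.0)]:
    print(before, after, classify_fold_change(before, after))
```

Working on the log2 scale treats a doubling and a halving symmetrically, which is why elevated and decreased transcripts can share one cutoff.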
ASC maturation into an adipocyte phenotype proceeds from a gene expression program that involves thousands of genes. This is the first study to compare mRNA expression profiles during early and late stage adipogenesis using cultured human primary ASCs from multiple patients.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0119-8) contains supplementary material, which is available to authorized users.
Microarray; Adipose-derived stem cells; Transcriptome; Gene expression; Stromal vascular fraction; Adipogenesis
Genomic rearrangements or structural variants (SVs) are one of the most common classes of mutations in cancer.
An integrated DNA sequencing and transcriptional profiling (RNA sequence and microarray gene expression data) analysis was performed on six ovarian cancer patient samples. Matched sets of control (whole blood) samples from these same patients were used to distinguish cancer SVs of germline origin from those arising somatically in the cancer cell lineage.
We detected 10,034 ovarian cancer SVs (5518 germline derived; 4516 somatically derived) at base-pair level resolution. Only 11 % of these variants were shown to have the potential to form gene fusions and, of these, less than 20 % were detected at the transcriptional level.
Collectively our results are consistent with the view that gene fusions and other SVs can be significant factors in the onset and progression of ovarian cancer. The results further indicate that it may not only be the occurrence of these variants in cancer but their regulation that contributes to their biological and clinical significance.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0118-9) contains supplementary material, which is available to authorized users.
Oral squamous cell carcinoma (OSCC) is associated with substantial mortality and morbidity, but it can be difficult to detect at its earliest stage due to its molecular complexity and clinical behavior. Therefore, identifying key gene signatures at an early stage would be highly valuable.
The aim of this study was to identify key genes associated with progression of OSCC stages. Gene expression profiles were classified into cancer stage-related modules, i.e., groups of genes that are significantly related to a clinical stage. For prioritizing the candidate genes, analysis was further restricted to genes with high connectivity and a significant association with a stage. To assess predictive power of these genes, a classification model was also developed and tested by 5-fold cross validation and on an independent dataset.
The identified genes were enriched for significant processes and functional pathways, and various genes were found to be directly implicated in OSCC. Forward and stepwise, multivariate logistic regression analyses identified 13 key genes whose expression discriminated early- and late-stage OSCC with predictive accuracy (area under curve; AUC) of ~0.81 in a 5-fold cross-validation strategy.
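For readers unfamiliar with the AUC metric reported above, the sketch below computes it directly from classifier scores as the fraction of (late-stage, early-stage) sample pairs that are ranked correctly. It is not the study's model or data: the function name and the scores are hypothetical.

```python
def auc(early_scores, late_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen late-stage sample
    receives a higher score than a randomly chosen early-stage sample
    (ties count as one half).
    """
    if not early_scores or not late_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for late in late_scores:
        for early in early_scores:
            if late > early:
                wins += 1.0
            elif late == early:
                wins += 0.5
    return wins / (len(early_scores) * len(late_scores))

# Hypothetical predicted probabilities of "late stage" from a held-out fold.
print(auc(early_scores=[0.10, 0.40], late_scores=[0.35, 0.80]))  # 0.75
```

In a 5-fold cross-validation, this value would be computed on each held-out fold and then averaged.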
The proposed network-driven integrative analytical approach can identify multiple genes significantly related to an OSCC stage; the classification model that is developed with these genes may help to distinguish cancer stages. The proposed genes and model hold promise for monitoring of OSCC stage progression, and our findings may facilitate cancer detection at an earlier stage, resulting in improved treatment outcomes.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0114-0) contains supplementary material, which is available to authorized users.
Coexpression network analysis; Gene module; Hub gene; Microarray; Oral squamous cell carcinoma; Logistic regression modeling
The growing advances in DNA sequencing tools have made analyzing the human genome cheaper and faster. While such analyses are intended to identify complex variants, related to disease susceptibility and efficacy of drug responses, they have blurred the definitions of mutation and polymorphism.
In the era of personal genomics, it is critical to establish clear guidelines regarding the use of a reference genome. Nowadays DNA variants are called as differences in comparison to a reference. In a sequencing project Single Nucleotide Polymorphisms (SNPs) and DNA mutations are defined as DNA variants detectable in >1 % or <1 % of the population, respectively. The alternative use of the two terms mutation or polymorphism for the same event (a difference as compared with a reference) can lead to problems of classification. These problems can impact the accuracy of the interpretation and the functional relationship between a disease state and a genomic sequence.
We propose to solve this nomenclature dilemma by defining mutations as DNA variants obtained in a paired sequencing project including the germline DNA of the same individual as a reference. Moreover, the term mutation should be accompanied by a qualifying prefix indicating whether the mutation occurs only in somatic cells (somatic mutation) or also in the germline (germline mutation). We believe this distinction in definition will help avoid confusion among researchers and support the practice of sequencing the germline and somatic tissues in parallel to classify the DNA variants thus defined as mutations.
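The proposed naming scheme can be sketched as a simple comparison of variant calls from paired samples. This is an illustrative reading of the proposal, not a published tool: the function name and the variant identifiers are hypothetical, and real pipelines would also need to handle coverage and calling uncertainty.

```python
def classify_variants(germline_calls: set, somatic_tissue_calls: set) -> dict:
    """Label variants from a paired sequencing project of one individual.

    A variant seen in the germline sample is a 'germline mutation';
    a variant seen only in the somatic tissue is a 'somatic mutation'.
    """
    labels = {}
    for variant in sorted(germline_calls | somatic_tissue_calls):
        if variant in germline_calls:
            labels[variant] = "germline mutation"
        else:
            labels[variant] = "somatic mutation"
    return labels

# Hypothetical variant identifiers, for illustration only.
germline = {"chr1:1000A>G", "chr2:2000C>T"}
tumour = {"chr1:1000A>G", "chr3:3000G>A"}
for variant, label in classify_variants(germline, tumour).items():
    print(variant, label)
```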
Personal genomics; Precision medicine; DNA sequencing; DNA variants; Human genome
Chromosome 6pter-p24 deletion syndrome (OMIM #612582) is a recognized chromosomal disorder. Most of the individuals with this syndrome carry a terminal deletion of the short arm of chromosome 6 (6p) with a breakpoint within the 6p25.3p23 region. An approximately 2.1 Mb terminal region has been reported to be responsible for some major features of the syndrome. The phenotypic contributions of other deleted regions are unknown. Interstitial deletions of the region are uncommon, and reciprocal interstitial duplication in this region is extremely rare.
We present a family carrying an interstitial deletion and its reciprocal duplication within the 6p25.1p24.3 region. The deletion is 5.6 Mb in size and was detected by array comparative genomic hybridization (aCGH) in a 26-month-old female proband who presented with speech delay, mild growth delay, bilateral conductive hearing loss and dysmorphic features. Array CGH studies of her family members detected an apparently mosaic deletion of the same region in the proband’s mildly affected mother, but a reciprocal interstitial duplication in her phenotypically normal brother. Further chromosomal and fluorescence in situ hybridization (FISH) analyses revealed that instead of a simple mosaic deletion of 6p25.1p24.3, the mother actually carries three cell populations in her peripheral blood: a deletion population (~70 %), a duplication population (~8 %) and a normal population (~22 %). Therefore, both the deletion and duplication seen in the siblings were apparently inherited from the mother.
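A short calculation shows why such a three-population mixture can look like a simple mosaic deletion on an array. The sketch assumes that deletion cells carry one copy of the region, duplication cells three copies and normal cells two copies, and that the array signal reflects the average copy number of the mixture; these assumptions and the function name are illustrative, not taken from the report.

```python
from math import log2

def expected_log2_ratio(copy_fractions: dict) -> float:
    """Expected array log2 ratio for a mixture of cell populations.

    `copy_fractions` maps the copy number of the region in a population
    to that population's fraction of cells; fractions must sum to 1.
    """
    if abs(sum(copy_fractions.values()) - 1.0) > 1e-6:
        raise ValueError("population fractions must sum to 1")
    mean_copies = sum(copies * frac for copies, frac in copy_fractions.items())
    return log2(mean_copies / 2)  # relative to a diploid reference

# Approximate maternal fractions from the report, with the assumed copy numbers:
# 1 copy (deletion, ~70 %), 3 copies (duplication, ~8 %), 2 copies (normal, ~22 %).
mother = {1: 0.70, 3: 0.08, 2: 0.22}
print(round(expected_log2_ratio(mother), 2))    # about -0.54
print(expected_log2_ratio({1: 1.0}))            # -1.0 for a non-mosaic deletion
```

Under these assumptions the mixture averages about 1.38 copies, giving a net loss that is weaker than a full heterozygous deletion, so the minor duplication population is hidden in the averaged signal.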
Interstitial deletion within the 6p25.1p24.3 region and its reciprocal duplication may co-exist in the same individual and/or family due to mitotic unequal sister chromatid exchange. While the deletion causes phenotypes reportedly associated with the chromosome 6pter-p24 deletion syndrome, the reciprocal duplication may have no or minimal phenotypic effect, suggesting possible triploinsensitivity of the same region. In addition, the cells with the duplication may compensate for the phenotypic effect of the cells with the deletion in the same individual, as implied by the maternal karyotype and her mild phenotype. Chromosomal and FISH analyses are essential to verify abnormal cytogenomic array findings.
Deletion of 6p25.1p24.3; Haploinsufficiency; Duplication of 6p25.1p24.3; Triploinsensitivity; Unequal sister chromatid exchange; Mosaicism
Gulf War Illness (GWI) is a complex multi-symptom disorder that affects up to one in three veterans of this 1991 conflict and for which no effective treatment has been found. Discovering novel treatment strategies for such a complex chronic illness is extremely expensive, carries a high probability of failure and a lengthy cycle time. Repurposing Food and Drug Administration approved drugs offers a cost-effective solution with a significantly abbreviated timeline.
Here, we explore drug re-purposing opportunities in GWI by combining systems biology and bioinformatics techniques with pharmacogenomic information to find overlapping elements in gene expression linking GWI to successfully treated diseases. Gene modules were defined based on cellular function and their activation estimated from the differential expression of each module’s constituent genes. These gene modules were then cross-referenced with drug atlas and pharmacogenomic databases to identify agents currently used successfully for treatment in other diseases. To explore the clinical use of these drugs in illnesses similar to GWI we compared gene expression patterns in modules that were significantly expressed in GWI with expression patterns in those same modules in other illnesses.
We found 19 functional modules with significantly altered gene expression patterns in GWI. Within these modules, 45 genes were documented drug targets. Illnesses with highly correlated gene expression patterns overlapping considerably with GWI were found in 18 of the disease conditions studied. Brain, muscular and autoimmune disorders composed the bulk of these.
Of the associated drugs, immunosuppressants currently used in treating rheumatoid arthritis, and hormone based therapies were identified as the best available candidates for treating GWI symptoms.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0111-3) contains supplementary material, which is available to authorized users.
Gulf war illness; Systems biology; Bioinformatics; Drug repurposing; Pharmacogenomics; Complex chronic illness
Small ncRNAs (sncRNAs) offer great hope as biomarkers of disease and response to treatment. This has been highlighted in the context of several medical conditions such as cancer, liver disease, cardiovascular disease, and central nervous system disorders, among many others. Here we assessed several steps involved in the development of an ncRNA biomarker discovery pipeline, ranging from sample preparation to bioinformatic processing of small RNA sequencing data.
A total of 45 biological samples were included in the present study. All libraries were prepared using the Illumina TruSeq Small RNA protocol and sequenced using the HiSeq2500 or MiSeq Illumina sequencers. Small RNA sequencing data was validated using qRT-PCR. At each stage, we evaluated the pros and cons of different techniques that may be suitable for different experimental designs. Evaluation methods included quality of data output in relation to hands-on laboratory time, cost, and efficiency of processing.
Our results show that good quality sequencing libraries can be prepared from small amounts of total RNA and that varying degradation levels in the samples do not have a significant effect on the overall quantification of sncRNAs via NGS. In addition, we describe the strengths and limitations of three commercially available library preparation methods: (1) Novex TBE PAGE gel; (2) Pippin Prep automated gel system; and (3) AMPure XP beads. We describe our bioinformatics pipeline, provide recommendations for sequencing coverage, and describe in detail the expression and distribution of all sncRNAs in four human tissues: whole-blood, brain, heart and liver.
Ultimately this study provides tools and outcome metrics that will aid researchers and clinicians in choosing an appropriate and effective high-throughput sequencing quantification method for various study designs, and overall generating valuable information that can contribute to our understanding of small ncRNAs as potential biomarkers and mediators of biological functions and disease.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0109-x) contains supplementary material, which is available to authorized users.
Biomarker; microRNA; Small non-coding RNA; Next-generation sequencing; Small RNA sequencing; Whole-blood; Brain; Heart; Liver; Clinical samples
microRNAs (miRs) are small non-coding RNAs involved in the fine regulation of several cellular processes by inhibiting their target genes at post-transcriptional level. Osteosarcoma (OS) is a tumor thought to be related to a molecular blockade of the normal process of osteoblast differentiation. The current paper explores temporal transcriptional modifications comparing an osteosarcoma cell line, Saos-2, and clones stably transfected with CD99, a molecule which was found to drive OS cells to terminally differentiate.
Parental cell line and CD99 transfectants were cultured up to 14 days in differentiating medium. In this setting, OS cells were profiled by gene and miRNA expression arrays. Integration of gene and miRNA profiling was performed by both sequence complementarity and expression correlation. Further enrichment and network analyses were carried out to focus on the modulated pathways and on the interactions between transcriptome and miRNome. To track the temporal transcriptional modification, a PCA analysis with differentiated human MSC was performed.
We identified a strong (about 80 %) gene down-modulation where reversion towards the osteoblast-like phenotype matches significant enrichment in TGFbeta signaling players like AKT1 and SMADs. In parallel, we observed the modulation of several cancer-related microRNAs like miR-34a, miR-26b or miR-378. To decipher their impact on the modified transcriptional program in CD99 cells, we correlated gene and microRNA time-series data. miR-34a, in particular, was found to regulate a subnetwork of genes distinct from those of the other differentially expressed miRs, and it appeared to be the main mediator of several TGFbeta signaling genes at the initial and middle phases of differentiation. Integration studies further highlighted the involvement of the TGFbeta pathway in the differentiation of OS cells towards osteoblasts and its regulation by microRNAs.
These data underline that the expression of miR-34a and down-modulation of TGFbeta signaling emerge as pivotal events to drive CD99-mediated reversal of malignancy and activation of differentiation in OS cells. Our results describe crucial and specific interacting actors providing and supporting their relevance as potential targets for therapeutic differentiative strategies.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0106-0) contains supplementary material, which is available to authorized users.
microRNA; microRNA target; Array integrations; Osteosarcoma; miR-34a; TGFbeta signaling
Recent advances in high-throughput technologies have led to the emergence of systems biology as a holistic science to achieve more precise modeling of complex diseases. Many predict the emergence of personalized medicine in the near future. We are, however, moving from two-tiered health systems to a two-tiered personalized medicine. Omics facilities are restricted to affluent regions, and personalized medicine is likely to widen the growing gap in health systems between high- and low-income countries. This is mirrored by an increasing lag between our ability to generate big data and our ability to analyze it. Several bottlenecks slow down the transition from conventional to personalized medicine: generation of cost-effective high-throughput data; hybrid education and multidisciplinary teams; data storage and processing; data integration and interpretation; and individual and global economic relevance. This review provides an update on important developments in the analysis of big data and on forward strategies to accelerate the global transition to personalized medicine.
Big data; Omics; Personalized medicine; High-throughput technologies; Cloud computing; Integrative methods; High-dimensionality
Pediatric embryonal brain tumors (PEBTs), which encompass medulloblastoma (MB), primitive neuroectodermal tumor (PNET) and atypical teratoid/rhabdoid tumor (AT/RT), are the second most prevalent pediatric brain tumor type. AT/RT is highly malignant and is often misdiagnosed as MB or PNET. The distinction of AT/RT from PNET/MB is of clinical significance because the survival rate of patients with AT/RT is substantially lower. The diagnosis of AT/RT relies primarily on morphologic assessment and immunohistochemical (IHC) staining for a few known markers such as the lack of INI1 protein expression. However, in our clinical practice we have observed several AT/RT-like tumors that fulfilled histopathological and all other biomarker criteria for a diagnosis of AT/RT, yet retained INI1 immunoreactivity. Recent studies have also reported preserved INI1 immunoreactivity among certain diagnosed AT/RTs. It is therefore necessary to re-evaluate INI1(+), AT/RT-like cases.
Sanger sequencing, array CGH and mRNA microarray analyses were performed on PEBT samples to investigate their genomic landscapes.
Patients with AT/RT and those with INI1(+) AT/RT-like tumors showed a similar survival rate, and global array CGH analysis and INI1 gene sequencing showed no differential chromosomal aberration markers between INI1(−) AT/RT and INI1(+) AT/RT-like cases. We did not misdiagnose MBs or PNETs as AT/RT-like tumors: transcriptome profiling revealed that AT/RT and INI1(+) AT/RT-like cases not only expressed distinct mRNA and microRNA profiles, but also showed gene expression patterns different from those of MBs and PNETs. The transcriptome profile most similar to that of AT/RTs was the profile of embryonic stem cells. However, the transcriptome profile of INI1(+) AT/RT-like tumors was more similar to that of somatic neural stem cells, while the profile of MBs was closer to that of fetal brain tissue. Novel biomarkers were identified that can be used to distinguish INI1(−) AT/RTs, INI1(+) AT/RT-like cases and MBs.
Our studies revealed a novel INI1(+) AT/RT-like subtype among Taiwanese pediatric patients. New diagnostic biomarkers, as well as new therapeutic tactics, can be developed according to the transcriptome data that were unveiled in this work.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0103-3) contains supplementary material, which is available to authorized users.
Atypical teratoid/rhabdoid tumor; INI1; Pediatric embryonal brain tumor; Transcriptome; Stem cell
Small non-coding regulatory RNAs control cellular functions at the transcriptional and post-transcriptional levels. Oral squamous cell carcinoma is among the leading cancers in the world and the presence of cervical lymph node metastases is currently its strongest prognostic factor. In this work we aimed at finding small RNAs expressed in oral squamous cell carcinoma that could be associated with the presence of lymph node metastasis.
Small RNA libraries from metastatic and non-metastatic oral squamous cell carcinomas were sequenced for the identification and quantification of known small RNAs. Selected markers were validated in plasma samples. Additionally, we used in silico analysis to investigate possible new molecules, not previously described, involved in the metastatic process.
Global expression patterns were not associated with cervical metastases. MiR-21, miR-203 and miR-205 were highly expressed throughout samples, in agreement with their role in epithelial cell biology, but disagreeing with studies correlating these molecules with cancer invasion. Eighteen microRNAs, but no other small RNA class, varied consistently between metastatic and non-metastatic samples. Nine of these microRNAs had been previously detected in human plasma, eight of which presented consistent results between tissue and plasma samples. MiR-31 and miR-130b, known to inhibit several steps in the metastatic process, were over-expressed in non-metastatic samples and the expression of miR-130b was confirmed in plasma of patients showing no metastasis. MiR-181 and miR-296 were detected in metastatic tumors and the expression of miR-296 was confirmed in plasma of patients presenting metastasis. A novel microRNA-like molecule was also associated with non-metastatic samples, potentially targeting cell-signaling mechanisms.
We corroborate literature data on the role of small RNAs in cancer metastasis and suggest the detection of microRNAs as a tool that may assist in the evaluation of oral squamous cell carcinoma metastatic potential.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0102-4) contains supplementary material, which is available to authorized users.
Recent advances in next-generation sequencing (NGS) have provided new methods for preimplantation genetic screening (PGS) of human embryos from in vitro fertilization (IVF) cycles. However, there is still limited information about clinical applications of NGS in IVF and PGS (IVF-PGS) treatments. The present study aimed to investigate the effects of NGS screening on clinical pregnancy and implantation outcomes for PGS patients in comparison to array comparative genomic hybridization (aCGH) screening.
This study was performed in two phases. Phase I study evaluated the accuracy of NGS for aneuploidy screening in comparison to aCGH. Whole-genome amplification (WGA) products (n = 164) derived from previous IVF-PGS cycles (n = 38) were retrospectively analyzed with NGS. The NGS results were then compared with those of aCGH. Phase II study further compared clinical pregnancy and implantation outcomes between NGS and aCGH for IVF-PGS patients. A total of 172 patients at mean age 35.2 ± 3.5 years were randomized into two groups: 1) NGS (Group A): patients (n = 86) had embryos screened with NGS and 2) aCGH (Group B): patients (n = 86) had embryos screened with aCGH. For both groups, blastocysts were vitrified after trophectoderm biopsy. One to two euploid blastocysts were thawed and transferred to individual patients primarily based on the PGS results. Ongoing pregnancy and implantation rates were compared between the two study groups.
NGS detected all types of aneuploidies of human blastocysts accurately and was 100 % consistent with the highly validated aCGH method in 24-chromosome diagnosis. Moreover, NGS screening identified euploid blastocysts for transfer and resulted in similarly high ongoing pregnancy rates for PGS patients compared to aCGH screening (74.7 % vs. 69.2 %, respectively, p > 0.05). The observed implantation rates were also comparable between the NGS and aCGH groups (70.5 % vs. 66.2 %, respectively, p > 0.05).
While NGS screening has been recently introduced to assist IVF patients, this is the first randomized clinical study on the efficiency of NGS for preimplantation genetic screening in comparison to aCGH. With the observed high accuracy of 24-chromosome diagnosis and the resulting high ongoing pregnancy and implantation rates, NGS has proven to be an efficient, robust, high-throughput technology for PGS.
NGS; aCGH; PGS; Aneuploidy screening; Ongoing pregnancy; Implantation
High-throughput sequencing of cell-free DNA fragments found in human plasma has been used to non-invasively detect fetal aneuploidy, monitor organ transplants and investigate tumor DNA. However, many biological properties of this extracellular genetic material remain unknown. Research that further characterizes circulating DNA could substantially increase its diagnostic value by allowing the application of more sophisticated bioinformatics tools that lead to an improved signal to noise ratio in the sequencing data.
In this study, we investigate various features of cell-free DNA in plasma using deep-sequencing data from two pregnant women (>70X, >50X) and compare them with matched cellular DNA. We utilize a descriptive approach to examine how the biological cleavage of cell-free DNA affects different sequence signatures such as fragment lengths, sequence motifs at fragment ends and the distribution of cleavage sites along the genome.
We show that the size distributions of these cell-free DNA molecules depend on their autosomal or mitochondrial origin as well as on their genomic location within chromosomes. DNA mapping to particular microsatellites and alpha repeat elements displays unique size signatures. By correlating the mapping locations of cell-free fragments with ENCODE annotation of chromatin organization, we show that these fragments occur in clusters along the genome, localize to nucleosomal arrays and are preferentially cleaved at linker regions. Our work further demonstrates that cell-free autosomal DNA cleavage is sequence dependent. The region spanning up to 10 positions on either side of the DNA cleavage site shows a consistent pattern of preference for specific nucleotides. This sequence motif is present in cleavage sites localized to nucleosomal cores and linker regions but is absent in nucleosome-free mitochondrial DNA.
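As an illustration of the kind of descriptive measure discussed here, the following minimal Python sketch tabulates fragment lengths separately for autosomal and mitochondrial fragments. It is not the authors' pipeline: the function name `length_distributions`, the `chr1` to `chr22` and `chrM` naming scheme, the half-open (BED-style) coordinates and the toy fragments are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import median

# Assumed naming scheme: autosomes are chr1..chr22, mitochondrial DNA is chrM.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def length_distributions(fragments, mito_name="chrM"):
    """Count fragment lengths per origin.

    `fragments` is an iterable of (chrom, start, end) tuples with
    half-open coordinates, so the length is end - start.
    """
    dists = defaultdict(Counter)
    for chrom, start, end in fragments:
        if chrom in AUTOSOMES:
            origin = "autosomal"
        elif chrom == mito_name:
            origin = "mitochondrial"
        else:
            origin = "other"  # e.g. sex chromosomes, unplaced contigs
        dists[origin][end - start] += 1
    return dists


# Hypothetical fragments, for illustration only (not data from the study).
toy_fragments = [
    ("chr1", 100, 267),
    ("chr1", 500, 666),
    ("chr2", 1000, 1167),
    ("chrM", 10, 95),
    ("chrM", 200, 290),
]

dists = length_distributions(toy_fragments)
# Counter.elements() expands the histogram back into individual lengths.
median_by_origin = {
    origin: median(counts.elements()) for origin, counts in dists.items()
}
print(median_by_origin)
```

Keeping the histogram as a `Counter` per origin makes it straightforward to add further strata (for example, repeat class or chromatin state) by changing only how `origin` is assigned.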
These background signals in cell-free DNA sequencing data stem from the non-random biological cleavage of these fragments. This sequence structure can be harnessed to improve bioinformatics algorithms, in particular for CNV and structural variant detection. Descriptive measures for cell-free DNA features developed here could also be used in biomarker analysis to monitor the changes that occur during different pathological conditions.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0107-z) contains supplementary material, which is available to authorized users.
Cell-free DNA; extracellular DNA; biomarkers; fragment lengths; fragmentation motifs; nucleosomes; higher-order chromatin packaging; apoptosis; necrosis
Epigenome-wide studies in hepatocellular carcinoma (HCC) have identified numerous genes with aberrant DNA methylation. However, methods for triaging functional candidate genes as useful biomarkers for epidemiological study have not yet been developed.
We conducted targeted next-generation bisulfite sequencing (bis-seq) to investigate associations of DNA methylation and mRNA expression in HCC. Integrative analyses of epigenetic profiles with DNA copy number analysis were used to pinpoint functional genes regulated mainly by altered DNA methylation.
Significant differences between HCC tumor and adjacent non-tumor tissue were observed for 28 bis-seq amplicons, with methylation differences varying from 12% to 43%. Available mRNA expression data in Oncomine were evaluated. Two candidate genes (GRASP and TSPYL5) were significantly under-expressed in HCC tumors in comparison with precursor and normal liver tissues. The expression levels in tumor tissues were, respectively, 1.828 and −0.148, significantly lower than those in both precursor and normal liver tissue. Validations in an additional 42 paired tissues showed consistent under-expression in tumor tissue for GRASP (−7.49) and TSPYL5 (−9.71). A highly consistent DNA hypermethylation and mRNA repression pattern was obtained for both GRASP (69%) and TSPYL5 (73%), suggesting that their biological function is regulated by DNA methylation. Another two genes (RGS17 and NR2E1) at Chr6q showed significantly decreased DNA methylation in tumors with loss of DNA copy number compared to those without, suggesting alternative roles of DNA copy number losses and hypermethylation in the regulation of RGS17 and NR2E1.
These results suggest that integrative analyses of epigenomic and genomic data provide an efficient way to filter functional biomarkers for future epidemiological studies in human cancers.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0105-1) contains supplementary material, which is available to authorized users.
Faced with an increasing number of choices for biologic therapies, rheumatologists have a critical need for better tools to inform rheumatoid arthritis (RA) disease management. The ability to identify patients who are unlikely to respond to first-line biologic anti-TNF therapies prior to their treatment would allow these patients to seek alternative therapies, providing faster relief and avoiding complications of disease.
We identified a gene expression classifier to predict, pre-treatment, which RA patients are unlikely to respond to the anti-TNF infliximab. The classifier was trained and independently evaluated using four published whole blood gene expression data sets, in which RA patients (n = 116 in total; 44, 15, 30 and 27 per data set) were treated with infliximab, and their response assessed 14–16 months post treatment according to the European League Against Rheumatism (EULAR) response criteria. For each patient, prior knowledge was used to group gene expression measurements into disease-relevant biological signaling mechanisms that were used as the input features for regularized logistic regression.
The classifier produced a substantial enrichment of non-responders (59 %, given by the cross validated test precision) compared to the full population (27 % non-responders), while identifying nearly a third of non-responders. Given this classifier performance, treatment of predicted non-responders with alternative biologics would decrease their chance of non-response by between a third and a half, substantially improving their odds of effective treatment and stemming further disease progression. The classifier consisted of 18 signaling mechanisms, which together indicated that higher inflammatory signaling mediated by TNF and other cytokines was present pre-treatment in the blood of patients who responded to infliximab treatment. In contrast, non-responders were classified by relatively higher levels of specific metabolic activities in the blood prior to treatment.
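To make the enrichment figures concrete, the sketch below computes the base rate, precision, recall and enrichment from a confusion matrix in which "positive" means non-responder. The four counts are hypothetical: they were chosen only so that a cohort of 116 reproduces roughly the reported 27 % base rate, 59 % precision and recall of about a third, and they are not the study's actual confusion matrix.

```python
def classifier_summary(tp, fp, fn, tn):
    """Summarize a binary classifier where 'positive' = non-responder."""
    total = tp + fp + fn + tn
    base_rate = (tp + fn) / total   # fraction of non-responders overall
    precision = tp / (tp + fp)      # non-responders among predicted non-responders
    recall = tp / (tp + fn)         # fraction of non-responders identified
    return {
        "base_rate": base_rate,
        "precision": precision,
        "recall": recall,
        "enrichment": precision / base_rate,  # fold enrichment over base rate
    }


# Hypothetical counts (assumption for illustration): 116 patients,
# 31 true non-responders, 17 patients flagged by the classifier.
summary = classifier_summary(tp=10, fp=7, fn=21, tn=78)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the flagged group contains non-responders at roughly twice the rate of the full cohort, which is the sense in which the classifier "enriches" for non-response.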
We were able to successfully produce a classifier to identify a population of RA patients significantly enriched in anti-TNF non-responders across four different patient cohorts. Additional prospective studies are needed to validate and refine the classifier for clinical use.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0100-6) contains supplementary material, which is available to authorized users.
Rheumatoid arthritis; Infliximab; Anti-TNF therapy; Classifier
The rapid advances in genome sequencing technologies have resulted in an unprecedented number of genome variations being discovered in humans. However, there has been very limited coverage of interpretation of the personal genome sequencing data in terms of diseases.
In this paper we present the first computational analysis scheme for interpreting personal genome data by simultaneously considering the functional impact of damaging variants and curated disease-gene association data. This method is based on mutual information as a measure of the relative closeness between the personal genome and diseases. We hypothesize that a higher mutual information score implies that the personal genome is more susceptible to a particular disease than other diseases.
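The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea, using the textbook definition of mutual information between two discrete variables. Here each gene contributes one observation: whether it carries a damaging variant in the personal genome, and whether it is annotated to a given disease. The function name, the binary encoding and the toy vectors are assumptions for illustration, not the authors' scoring scheme.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    # Sum over observed (x, y) pairs only; unobserved pairs contribute zero.
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical per-gene indicators over eight genes (not real data).
damaged = [1, 1, 1, 1, 0, 0, 0, 0]    # gene carries a damaging variant
disease_a = [1, 1, 1, 1, 0, 0, 0, 0]  # gene annotated to disease A
disease_b = [1, 1, 0, 0, 1, 1, 0, 0]  # gene annotated to disease B

mi_a = mutual_information(damaged, disease_a)
mi_b = mutual_information(damaged, disease_b)

# Rank diseases by score, highest first.
ranking = sorted({"A": mi_a, "B": mi_b}.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

In this toy case disease A, whose gene set coincides with the damaged genes, scores higher than disease B, whose annotation is statistically independent of the damage pattern. Note that mutual information is symmetric in sign, so a real scoring scheme would need to distinguish positive from negative association.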
The method was applied to the sequencing data of 50 acute myeloid leukemia (AML) patients in The Cancer Genome Atlas. The utility of associations between a disease and the personal genome was explored using data of healthy (control) people obtained from the 1000 Genomes Project. The ranks of the disease terms in the AML patient group were compared with those in the healthy control group using "Leukemia, Myeloid, Acute" (C04.557.337.539.550) as the corresponding MeSH disease term.
The mutual information rank of the disease term was substantially higher in the AML patient group than in the healthy control group, which demonstrates that the proposed methodology can be successfully applied to infer associations between the personal genome and diseases.
Overall, the area under the receiver operating characteristic curve was significantly larger for the AML patient data than for the healthy controls. This methodology could contribute to consequential discoveries and explanations when mining personal genome sequencing data in terms of diseases, and it is versatile with respect to genomic-based knowledge such as drug-gene and environmental-factor-gene interactions.
Next-generation sequencing; MeSH tree structure; disease risk; personal genome