Here we argue that microarray gene expression data are a valuable source of information for discovering outlier genes with potential functional gene rearrangements that affect the expression level of downstream genes. Because gene rearrangements are rare genetic translocations that affect only a small subset of cancer patients rather than all of them, it is feasible to discover genes that are overexpressed or underexpressed in a subset of cancer samples. Genes that are overexpressed in a subset of samples are anticipated to be amplified or fused, and genes that are underexpressed in a subset of samples are anticipated to be deleted. Unfortunately, methods such as SAM and the t-test, which were developed to extract genes that are differentially expressed between whole groups, are not suitable for detecting outlier genes. Previous work that aimed to identify gene rearrangements using bioinformatics approaches was limited to identifying potential fused genes overexpressed in a subset of samples and to assessing performance on synthetic data with embedded test genes. Herein, we followed the same approach by testing our EigFusion method on synthetic data with embedded test genes. One might argue that real expression data do not follow fixed distributions as synthetic data do. To address this point, we also used real prostate cancer data with embedded synthetic test genes to test and compare methods. Unfortunately, no benchmark data set is available that could be used for performance evaluation in this study.
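The point about group-level tests can be illustrated with a small sketch. The expression values below are hypothetical and chosen only for illustration, and Welch's t-statistic stands in for the family of mean-difference tests; it is not the SAM procedure itself. A gene that is moderately shifted in every cancer sample receives a large statistic, whereas a gene that is strongly overexpressed in only two of ten cancer samples does not.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical log-expression values for one gene in ten normal samples.
normal = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.7, 5.1, 4.9]

# Biomarker-like gene: every cancer sample is shifted upward by the same amount.
biomarker = [x + 1.0 for x in normal]

# Outlier-like gene: only 2 of 10 cancer samples are strongly overexpressed.
outlier = normal[:8] + [9.0, 9.5]

t_biomarker = welch_t(biomarker, normal)
t_outlier = welch_t(outlier, normal)

print(f"biomarker-like gene: t = {t_biomarker:.2f}")
print(f"outlier-like gene:   t = {t_outlier:.2f}")
```

In this toy setting the two extreme samples inflate the variance of the cancer group more than they shift its mean, so the outlier-like gene would be ranked well below the biomarker-like gene by a mean-difference test.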
We compared the performance of EigFusion with all the outlier gene detection methods in the literature that we are aware of. One key factor that we considered, and that had not been considered before, is the number of cancer samples relative to the number of normal samples. In this work, we showed that the ratio of cancer samples to normal samples significantly impacts the FDR. Existing methods such as COPA suffer from several drawbacks. First, COPA depends on a user-defined rth percentile. Second, COPA is an individual-gene-based method. Third, and most importantly, it fails to distinguish between biomarkers and genes with potential rearrangements, especially when S2 is greater than or equal to S1, because the median is then biased toward the normal samples. ORT, OS, and GTI suffer from the same drawbacks. ORT favored a high cancer proportion, unlike COPA, whose performance decreased as the proportion of cancer samples increased (). Based on Figure (), ORT has zero FPR but high FNR. The FPR is zero because ORT gives a low rank to all genes that are highly expressed in all cancer samples; the FNR is high because ORT was unable to detect fusion genes as the number of cancer samples increased. GTI and OS performed equally and are the closest to EigFusion; however, they are unable to discriminate between rearranged genes and biomarker genes when there are fewer cancer samples than normal samples (), and they are unable to detect rearranged genes as the number of cancer samples increases (). Both OS and GTI showed high FPR when there were fewer than 100 cancer samples and high FNR when there were more than 100; they perform best when the samples are split equally between normal and cancer. Their rankings were not affected by variation in the number of cancer samples: they ranked the same test genes in the same order regardless of that variation.
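The median bias described above can be made concrete with a minimal sketch. It assumes the commonly described COPA formulation (center each gene by the median over all samples, scale by the median absolute deviation, and score the gene by the rth percentile of the transformed cancer values); the function names, the default r = 90, and all expression values are hypothetical and are not taken from this study.

```python
from statistics import median


def percentile(values, r):
    """r-th percentile using linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def copa_like_score(normal, cancer, r=90):
    """COPA-style score: r-th percentile of median/MAD-scaled cancer values."""
    pooled = list(normal) + list(cancer)
    med = median(pooled)                          # median over ALL samples
    mad = median([abs(x - med) for x in pooled])  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; the gene cannot be scaled")
    return percentile([(x - med) / mad for x in cancer], r)


# Hypothetical log-expression values for ten normal samples.
normal = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.7, 5.1, 4.9]

# Few cancer samples (5 versus 10 normal samples).
biomarker_few = [8.0, 8.2, 7.9, 8.1, 8.0]  # high in every cancer sample
outlier_few = [5.0, 5.1, 4.9, 5.0, 9.0]    # high in one cancer sample only

# Many cancer samples (20 versus 10 normal samples), same biomarker pattern.
biomarker_many = biomarker_few * 4

score_biomarker_few = copa_like_score(normal, biomarker_few)
score_outlier_few = copa_like_score(normal, outlier_few)
score_biomarker_many = copa_like_score(normal, biomarker_many)

print(score_biomarker_few, score_outlier_few, score_biomarker_many)
```

With more normal than cancer samples, the pooled median sits at the normal level, so in this toy example the biomarker-like gene and the outlier-like gene both receive large scores. Once cancer samples dominate, the median moves to the cancer level and the biomarker-like score drops sharply, which is one way to see why the cancer-to-normal ratio matters for such a statistic.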
EigFusion, the new method for detecting rearranged genes that we proposed in this work, performed better than the other existing methods. EigFusion overcomes one of their main drawbacks, namely the inability to distinguish between rearranged genes and biomarker genes. It also identifies both overexpressed and underexpressed genes in the same run. We therefore think that EigFusion is generic enough to identify genetic rearrangements in general that result in a change in gene expression. We also stress that the number of cancer samples relative to the number of normal samples should be considered in any gene rearrangement prediction problem.
In our study, we aimed to characterize outlier genes and their potential functional gene rearrangements in several tumor types: prostate, leukemia, and ovarian. We first focused on functional gene rearrangements in prostate cancer patients (primary and metastatic) compared with normal samples (). We found that a large portion of these gene rearrangements occur in metastatic samples; only CCDC141 was overexpressed in primary cancer. The FABP5 gene is overexpressed in both primary and metastatic cancer. FABP5 is associated with psoriasis (a chronic immune-mediated disease that appears on the skin), breast cancer, and metastasis. Examination of the clinical implications of FABP5 rearrangements revealed that samples with FABP5 rearrangements are at higher risk of death (P value = 0.0000001) than samples with ERG rearrangements (P value = 0.18). Furthermore, FABP5 is overexpressed in samples in which TARP is underexpressed, which indicates that FABP5 might be fused to TARP. The TARP gene is embedded within an intron of the T-cell receptor-gamma (TCRG) locus, which encodes an alternative T-cell receptor that is always coexpressed with T-cell receptor delta [25]. TARP was identified as being expressed in a prostate-specific form of TCRG mRNA in human prostate and was demonstrated to originate from epithelial cells [25]. This points to a specific rearrangement or alternative splicing mechanism that leads toward aggressive cancer. Further characterization of FABP5 and TARP showed that they are rearranged in ERG0 samples, which means that these two genes could be used to define a distinct group of prostate cancers. Several studies showed that C-FABP or E-FABP is a metastasis-inducing gene overexpressed in human prostate carcinomas [26]. KCNH8, another significant gene identified in this work, harbours in its promoter a binding site for the ELK-1 transcription factor, a member of the ETS transcription factor family to which ERG belongs. This might explain the association between KCNH8 and ERG.
One of the problems bioinformaticians face is validating a proposed computational algorithm. In this work, we validated the identified potentially rearranged genes using CNA data sets for the same samples from which the microarray gene expression data were generated. A large portion of the genes were copy number altered, either amplified or deleted, in both prostate and ovarian cancer. Validating the prostate genes on the ovarian CNA data gave an interesting result: genes altered in prostate cancer are also altered in ovarian cancer, but not the reverse. We also found that ovarian samples have a higher alteration rate than prostate samples. Most of the ovarian genes are altered in more than 8% of the ovarian cancer samples, whereas the prostate genes are altered in only around 2–4% of the prostate samples. This suggests that ovarian cancer is more heterogeneous than prostate cancer.
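The per-gene alteration rates quoted above amount to a simple frequency count over discretized copy number calls. The following is a minimal sketch of that calculation, assuming calls are coded as 0 for no change and non-zero for a gain or loss; the gene labels, the call vectors, and the use of 8% as a cut-off are hypothetical illustrations, not the data or the pipeline used in this study.

```python
def alteration_frequency(calls):
    """Fraction of samples whose copy number call is non-zero (gain or loss)."""
    if not calls:
        raise ValueError("no samples provided")
    return sum(1 for c in calls if c != 0) / len(calls)


# Hypothetical discretized CNA calls for two placeholder genes in 50 samples:
# +1 = amplified, -1 = deleted, 0 = unchanged.
cna_calls = {
    "GENE_A": [1, -1] + [0] * 48,            # altered in 2 of 50 samples
    "GENE_B": [1] * 3 + [-1] * 2 + [0] * 45,  # altered in 5 of 50 samples
}

freqs = {gene: alteration_frequency(calls) for gene, calls in cna_calls.items()}

# Genes altered in at least 8% of samples (illustrative threshold only).
frequently_altered = sorted(gene for gene, f in freqs.items() if f >= 0.08)

for gene, f in freqs.items():
    print(f"{gene}: altered in {f:.0%} of samples")
```

Counting gains and losses together keeps the sketch short; reporting amplification and deletion frequencies separately would be a natural refinement when the direction of the expression change is being validated.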
Several other findings have emerged from our analysis, largely based on the opportunity provided by integrated analysis of functional protein networks. First, putative rearranged genes are functionally related and form modules that are enriched in biological pathways, mainly the RAS/RAF and cadherin signaling pathways. Second, integrating functional protein networks with CNA data provides insights into the dysregulated pathways. EigFusion was able to identify elements (rearranged genes) of a dysregulated pathway, but integrating CNA data and functional networks gave more insight, because other altered genes that EigFusion was not able to retrieve were also identified. Thus, we believe that integrating EigFusion with functional protein networks and CNA data would give detailed insights into the dysregulated pathways. One of the findings we were able to retrieve using the integrative approach is the nuclear receptor coactivator NCOA2, which was previously shown to alter the AR pathway in primary prostate tumors, providing a mechanism for its potential role as a prostate cancer oncogene [19].
Survival analysis revealed that patients with rearrangements in the identified set of genes are at higher risk of cancer-specific death. Using rearranged genes in prostate cancer helped to identify three subgroups with distinct outcomes and different rearrangement profiles. The expression of the ovarian rearranged genes did not show significant prognostic value. Overall, these discoveries set the stage for approaches to the treatment of prostate cancer, ovarian cancer, and leukemia in which rearranged genes or networks are detected and targeted with therapies selected to be effective against these specific aberrations.