PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosonePLoS OneView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS One. 2017; 12(8): e0183933.
Published online 2017 August 25. doi:  10.1371/journal.pone.0183933
PMCID: PMC5571984

Tensor decomposition-based unsupervised feature extraction applied to matrix products for multi-view data processing

Y-h. Taguchi, Conceptualization, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing*
Yudong Zhang, Editor

Abstract

In the current era of big data, the amount of data available is continuously increasing. Both the number and types of samples, or features, are on the rise. The mixing of distinct features often makes interpretation more difficult. However, separate analysis of individual types requires subsequent integration. A tensor is a useful framework to deal with distinct types of features in an integrated manner without mixing them. On the other hand, tensor data is not easy to obtain since it requires the measurements of huge numbers of combinations of distinct features; if there are m kinds of features, each of which has N dimensions, the number of measurements needed are as many as Nm, which is often too large to measure. In this paper, I propose a new method where a tensor is generated from individual features without combinatorial measurements, and the generated tensor was decomposed back to matrices, by which unsupervised feature extraction was performed. In order to demonstrate the usefulness of the proposed strategy, it was applied to synthetic data, as well as three omics datasets. It outperformed other matrix-based methodologies.

Introduction

In the current era of big data, it is often that massive datasets are obtained, including samples with many features. For example, a video dataset can be regarded as time points (samples) vs pixels (features). Audio files consist of time points (samples) vs amplitude (features), and sets of DNA sequences consist of individuals (samples) vs nuclear acid sequences (features). All of these are provided in the form of a matrix, whose rows and columns are features and samples, respectively (of course, rows and columns are exchangeable). Although processing massive datasets is itself problematic, integrating distinct types of datasets is even more difficult. This problem is often annotated as multi-view data processing. For example, an audio visual file can be regarded as time points (samples) vs pixels and amplitudes(both features). In this paper, I consider two types of specific multi-view data processing: one is sharing samples (hereinafter called Case I) and another is sharing features (hereinafter, called Case II). It is formally possible to deal with these two cases as unified; multiple (m > 0) views of data, i.e., X(k), k = 1, … m, each of which is Nk features times M samples shared with multiple views (Case I), can be regarded as a (∑k Nk) × M matrix

(X(1)T,X(2)T,,X(m)T)T,
(1)

where XT is the transposed matrix of X, while, if X(k), k = 1, … m, are Mk samples times N features shared with multiple views (Case II), can be regarded as a N × (∑k Mk) matrix

(X(1),X(2),,X(m)).
(2)

For both cases, however, we are not sure what will happen by simply merging distinct features as one matrix.

In order to address this problem, a variety of multi-view data processes have been proposed [1, 2]. Independent of the strategy to integrate multi-view datasets, there must be some weights attributed to each view. Since there are no a priori criteria to optimize these weights, some kind of artificial criteria are required. For example, if samples are classified, weights can be optimized so as to discriminate samples coincidentally from classes. Alternatively, if feature extraction is a task, weights can be optimized so as to generate the “best” features regardless of which features are considered good.

The reason weights are required for individual views is that we do not know whether the same weights are acceptable when simply creating new variables by merging or linearly combining them. Suppose xij(k) are the observed values attributed to the ith feature of the jth sample in the kth view. Generating a merged matrix is to have a matrix where xij(k) is placed at the (i + ∑k′ < k−1 Nk)th row and the jth column, as introduced in Eq (1) (Case I). Alternatively, a merged matrix can be generated where xij(k) is placed at the ith row and the (j + ∑k′ < k−1 Mk)th column, as introduced in Eq (2) (Case II).

This is not necessarily as simple as it may appear. For example, in Case I, if Nk varies drastically from view to view, the results may be dominated by the views with the maximum number of features. However, it is not clear if views with more features are more important. Alternatively, if the new feature xij=i,(k)Ci(k)xij(k) is generated with the linear combination where Ci(k)s are coefficient, there are similar problems. If Ci(k)s do not vary dependent upon (k), views with more features may dominate the outcome. Reverting row and column, I will discuss Case II as well. In order to compensate for this discrepancy, each view must be weighted based on criteria which are not naturally unique. No previously proposed strategies were free from this problem.

In this paper, I propose a brand new strategy that is free from weighting views; generating tensors whose number of modes is the same as, or one greater than the number of views, and applying tensor decomposition (TD) to them. Using this implementation, I performed feature extraction (FE), which I name TD based unsupervised FE, which is extended from the recently proposed principal component analysis (PCA) based unsupervised FE [322].

Materials and methods

Converting multi-view matrices into a tensor with multiplication

If we generate a new feature with neither summation nor merging, the product for Case I—xi1,i2,...,im,j=k=1mxik,j(k)—can be regarded as an (m + 1) mode tensor. As each newly generated feature is composed of one feature from individual views, no weight is needed. Similarly, in Case II, xj1,j2,...,jm,i=k=1mxi,jk(k), can be regarded as an (m + 1) mode tensor. These tensors are hereinafter called Type 1.

Alternatively, instead of simply multiplying matrix components with the shared columns or rows, they can be summed up as follows: x˜i1,i2,...,im=jk=1mxik,j(k) (Case I) and x˜j1,j2,...,jm=ik=1mxi,jk(k) (Case II). These can be regarded as m-mode tensors and are hereinafter called Type II. All variables associated with Type II tensors are written with tildes.

These newly generated m-mode (type II) or (m + 1)-mode (type I) tensors can be processed using any kind of tensor manipulation. For example, for a reduced number of features whose combination can express tensors, TD can be used to gain such features.

In the following subsection, I consider four combinations of types and cases, i.e., type I or II tensors for Case I or II multi-view data, case by case.

Definition and terminology of TD

Since TD is not a popular methodology and the usage of TD for FE is rare, I will briefly introduce TD in this subsection.

TD is the expansion of tensor xn1,n2,…,nm, nk = 1, …, Nk, 1 ≤ km in the form

xn1,n2,,nm=1=1N1m=1NmG(n1,n2,,nm)k=1mxnk,k,

where xnk,[ell]k, 1 ≤ km, are orthogonal matrices. Since xn1,n2,…,nm is as large as G(n1, n2, …, nm), this formula is clearly overcomplete and does not give unique expansion. In this study, in order to decide G(n1, n2, …, nm), xnk,[ell]k, 1 ≤ km uniquely, I employ the higher order singular value decomposition (HOSVD) algorithm [23], which has successfully used to analyse microarrays [24] previously. G(n1, n2, …, nm) is a core matrix. xnk,[ell]k, 1 ≤ km, are singular value matrices and their column vectors are singular value vectors. G(n1, n2, …, nm), having larger absolute values, has more contribution to xn1,n2,…,nm. Since the combination of xnk,[ell]k, 1 ≤ km, associated with G(n1, n2, …, nm) to which larger absolute values were attributed contributes more collectively to xn1,n2,…,nm, they are more likely to be associated with one another.

For type I tensors this expression is straightforward. (m + 1) modes correspond to m + 1 components, i1, i2, …, im, j (Case I) or j1, j2, …, jm, i (Case II), respectively. On the other hand, for type II tensors, m modes correspond to m components, i1, i2, …, im (Case I) or j1, j2, …, jm (Case II), respectively. Thus, singular value vectors for jth sample (Case I) or ith feature (Case II) are missing. These can be computed via x˜m+1=k,j(k)=ikx˜k,ikxik,j (Case I) or x˜m+1=k,i(k)=jkx˜k,jkxi,jk (Case II), k = 1, …, m. Thus, for type II tensor, there are m kinds of sample (Case I) or feature (Case II) singular value vectors in contrast to the type I tensors that have unique sample (Case I) or feature (Case II) singular vale vectors.

The relation to HO GSVD

Higher order generalized singular value decomposition [25] (HO GSVD) is the method that corresponds to singular value vectors when TD is applied to type I tensors. As HO GSVD converts X(k) = U(k)ΛVT where U(k), Λ and V are the Nk × M left singular value matrix, M × M eigenvalue matrix, and the M × M right singular value matrix, U(k) are regarded as feature singular value matrices and V is regarded as a unique (common) sample singular value matrix in the present implementation.

Synthetic dataset

The synthetic dataset used for demonstrating the usefulness of TD based unsupervised FE is defined as:

xij(1)=c2(jM+sinπjM)+(1-c)εij(1)xij(2)=c2(M-jM+sinπjM)+(1-c)εij(2)

for 1 ≤ iN0. xij(k)=εij(k) for N0 < iN. 1 ≤ jM. εij(k) obeys uniform distribution [set membership] [0, 1]. Specifically, c = 0.8, N = 1000, N0 = M = 50.

mRNA and miRNA expression profile

mRNA and microRNA(miRNA) expression profiles of multi-omics data were downloaded from gene expression omnibus (GEO) using GEO ID GSE28884. At first, GSE28884_RAW.tar was downloaded and expanded. For mRNA, 161 files whose names ended by the string “c.txt.gz” were used. Each file was loaded into R by read.csv command and the second column named “M” was employed as mRNA expression values. Probes not associated with Human Genome Organisation (HUGO) gene names were discarded and 13393 probes were remained. For miRNA, 161 files whose names ended by the string “geo.txt.gz” and the corresponding samples of mRNA expression which were measured were used. Each file was loaded into R by read.csv command and the second column (“Count”) was summed using the same third column (“Annotation”) values. Sum totals of less than 10 were discarded. As a result, 755 features remained. Finally, the miRNA expression profile matrix is xi2,jmiRNA,1i2755,1j161, and the mRNA expression profile matrix is xi1,jmRNA,1i113393,1j161.

mRNA expressions of epidermal growth factor (EGF) treated breast cancers were downloaded from GEO using GEO ID GSE84096. The file named GSE84096_series_matrix.txt.gz included in “Series Matrix File(s)” was downloaded. Gene expression was divided into 14 control samples and 14 EGF treated samples named xi,t1control and xi,t1EGF, respectively.

mRNA expressions of vaccination experiments were downloaded from GEO using GEO ID GSE18323. Files named GSE18323-GPL570_series_matrix.txt.gz and GSE18323-GPL571_series_matrix.txt.gz included in “Series Matrix File(s)” were downloaded. As these included two distinct platforms, 22277 commonly included probes were used. Fifty eight samples annotated as “Protected group” (P) at time points T1 to T5, 52 samples annotated as “Delay group” (D) at time points T1 to T5, and 72 samples annotated as “Non-protected group” (NP) at time points T1 to T5, were used. They were named as xi,t1Pl, xi,t2D and xi,t3NP respectively.

All expression profiles were standardized as ∑i xij = 0, and ixij2=N.

PCA-based unsupervised FE

PCA

In contrast to the usual use of PCA, where samples are embedded, the genes were embedded in this implementation.

Suppose xijs satisfy ixij=0,ixij2/N=1 and X is a matrix whose elements are xij. The gram matrix G is defined as G [equivalent] XXT. Eigenvectors uk = (uk1and…,ukN)Ts (1 ≤ k ≤ min(M, N)) are then obtained as Gukλkuk, where uki is the kth PC score attributed to gene i and λks are the eigenvalues ordered as λkλk+1. The kth PC loadings attributed to the jth sample vkj are defined as vkXTuk, where vk = (vk1,…,vkM)T because vk is the eigenvector of the covariance matrix XT X, XTGukXTXXTukXTXvkλkXTukλkvk.

PCA-based unsupervised FE applied to mRNA/miRNA expression

First, the five initial PC loadings v[ell]1, js (for mRNA) and v2,js (for miRNA), 1 ≤ [ell]1, [ell]2 ≤ 5, were confirmed to have significant sample dependence (P-values < 0.05) with categorical regression,

v1,j=C10+SC1,S1δSj,1j161,v2,j=C20+SC2,S1δSj,1j161,

where Ck0 and Ck,S1 (k = 1, 2) are regression coefficients. δSj takes 1 when jth sample belongs to category S, and isotherwise 0.

Then, assuming that PC scores u[ell]1,i1s (for mRNA) and u2,i2s (for miRNA) are normally distributed, the P-values were attributed to the i1th mRNA and i2th miRNA using a χ squared distribution as

Pi1=Pχ2[>15(u1,i1σu1)2]Pi2=Pχ2[>25(u2,i2σu2)2],

where σu[ell]1 and σu2 is the standard deviation of {u[ell]1,i1|1 ≤ i1 ≤ 13393} and {u2,i2|1i2755}, respectively. Pχ2[>x] is the probability that the argument is larger than x under the assumption that the arguments obey a χ squared distribution. The P-values were further adjusted by the Benjamini-Hochberg (BH) criterion [26], and those genes associated with adjusted P-values less than 0.01 were selected as the genes associated with the difference between multiple classes.

PCA-based unsupervised FE applied to vaccination experiments

Although similar to the previous section, some differences are: (i) Only the second PC scores attributed to mRNAs were used for FE. (ii) Instead of mRNA and miRNA, patient category groups, P, D, and NP were used. (iii) mRNAs commonly included in three independent FEs performed considering P, D, and NP, were considered.

TD-based unsupervised FE

For type I tensor generated from multi-omics dataset, in order to identify miRNAs and mRNAs associated with identified sample singular value vectors, it was assumed that x1,i1mRNA and x2,i2miRNA follow multiple normal distributions, and P-values were attributed to the i1th mRNA and the i2th miRNA using χ2 distribution.

Pi1=Pχ2[>15(x1,i1mRNAσ1)2],Pi2=Pχ2[>k2(x2,i2miRNAσ2)2]

where σ[ell]1(σ[ell]2) are standard deviations of x1,i1mRNA(x2,i2miRNA). Piks were adjusted by using the BH criterion, and mRNAs and miRNAs associated with the adjusted P-value lower than 0.01 for mRNA and 0.05 for miRNA were selected as those associated with identified sample singular value vectors when TD was applied to type I tensor.

When TD was applied to type II tensor, the computations were similar, excluding that adjusted P-value lower than 0.01 and 1 ≤ [ell]2 ≤ 5 were used for miRNA, where x1,i1mRNA and x2,i2miRNA were replaced with x˜1,i1mRNA and x˜2,i2miRNA, respectively.

For EGF treated cell lines and vaccination experiments, similar procedures were repeated by replacing gene singular value vectors with x[ell]3 = 2,i (EGF, type I) or x˜3=2,icontrol and x˜3=2,iEGF (EGF, type II) or x˜3=2,iD, x˜3=2,iP and x˜3=2,iNP, (vaccination, type II), respectively.

Where HO GSVD was applied to multi-omics datasets, the feature singular value matrices were replaced with the first five column vectors of U(k) (k = 1 for mRNA and k = 2 for miRNA).

Adjusted P values used as thresholds are always 0.01 for EGF treated cell lines, vaccination, and HO GSVD.

Conversion prob ID to HUGO gene name/ensembl gene ID/GENBANK accession ID

Coincidences between prob ID (mRNA), HUGO gene names, Ensembl gene IDs were downloaded from GEO using GEO ID GPL3676 for multi-omics datasets, and GPL571/570 for vaccination experiments. GENBANK accession ID attributed to probes identified in the EGF treatment were extracted from GPL16686.

Enrichment analysis of g:profiler

Ensembl gene IDs were uploaded to g:profiler [27]. 13393 Ensembl gene IDs were used as background.

Enrichment analysis of genes identified as outliers using each of the first five mRNA singular value vectors obtained by applying TD based unsupervised FE to type I tensor generated from multi-omics datasets

Five distinct P-values were attributed to the i1th mRNA using χ2 distribution:

Pi11=Pχ2[>(x1,i1mRNAσ1)2],115

P-values were adjusted using the BH criterion and mRNAs associated with adjusted P-values less than 0.01 were identified as outliers (see S2 Table). S2 Table was uploaded to g:Cocoa in g:profiler with “Gene Ontology/ Biological Process” specified as the targeted ontology.

Enrichment analysis of MSigDB

HUGO gene IDs or GENBANK accession IDs associated with identified mRNAs were uploaded to http://software.broadinstitute.org/gsea/msigdb/annotate.jsp (registration and login are needed). “CGP: chemical and genetic perturbations” was selected for multi-omics data and EGF treated cell lines, while “C7: immunologic signatures” was selected for vaccination experiments.

Categorical regression towards data shown in Fig 1

Fig 1
Boxplots of sample singular value vectors x[ell]3,j (a) when TD was applied to the type I tensor and x˜3,jmRNA (b), x˜3,jmiRNA (c), 1 ≤ [ell]3 ≤ 5, when TD was applied to the type II tensor, generated ...

For type I tensor (Fig 1(a)),

x3,j=C30+SC3,S1δSj,1j161

where C30 and C3,S1 are regression coefficients. δSj takes 1 when jth sample belongs to category S, otherwise 0. The summation is taken over all categories. For type II tensor (Fig 1(b) and 1(c)), x[ell]3,j is replaced with x˜3,jmRNA or x˜3,jmiRNA. For HO GSVD (Fig 1(d)), x[ell]3,j is replaced with column vectors of V.

Correlations in Fig 2(c) and 2(g)

Fig 2
The results of TD applied to type I tensor generated from EGF treatment experiments.

The correlations were computed between two vectors of the length Tcontrol + TEGF, (x1=2,t1=1control,,x1=2,t1=Tcontrolcontrol,x2=2t2=1EGF,,x2=2,t2=TEGFEGF) and (xi,t1=1control,,xi,t1=Tcontrolcontrol,xi,t2=1EGF,,xi,t2=TEGFEGF) (Fig 2(c)) or between (x˜1=2,t1=1control,,x˜1=2,t1=Tcontrolcontrol,x˜2=2,t2=1EGF,,x˜2=2,t2=TEGFEGF) and (xi,t1=1control,,xi,t1=Tcontrolcontrol,xi,t2=1EGF,,xi,t2=TEGFEGF) (Fig 2(g)), where Tcontrol and TEGF are total number of samples in each treatment, respectively. Adjusted P-values attributed to the correlation coefficients were computed via the fdrtool [28] function in the fdrtool package in R [29].

Scaling and shifting prior to plotting Fig 2(d) and 2(h)

As each individual gene expression has its own base line and amplitude, they must be scaled and shifted before being overdrawn. To this end, the linear regression analysis

x1=2,t1control=aixi,t1control+bi,t1=1,,Tcontrolx2=2,t2EGF=aixi,t2EGF+bi,t2=1,,TEGF

was employed (Fig 2(c)) where ai and bi are regression coefficients commonly used for control and EGF treated samples. For Fig 2(g), x1=2,t1control and x2=2,t2EGF are replaced with x˜1=2,t1control and x˜2=2,t2EGF, respectively. Then, fitted values are used for plots. P-values that exhibit distinction between control and EGF treated sample at time point t were computed by two-sided t test between {aixi,t1=tcontrol+bi1iN} and {aixi,t2=tEGF+bi1iN} within N = 558 (Fig 2(d)) or N = 398 (Fig 2(h)) selected mRNA probes.

Correlations in Fig 3(c)

Fig 3
The results of TD applied to type II tensor generated from vaccination.

Similar to Fig 2, the correlations were computed between two vectors of length TP + TD + TNP, (x˜4=2,t1=1P,,x˜4=2,t1=TPP,x˜4=2,t2=1D,,x˜4=2,t2=TDD,x4=2,t3=1NP,,x4=2,t3=TNPNP) and (xi,t1=1P,,xi,t1=TPP,xi,t2=1D,,xi,t2=TDD,xi,t3=1NP,,x˜i2,t3=TNPNP), where TP = 58, TD = 52, and TNP = 72 are the total number of samples in each patient category, respectively. Adjusted P-values were computed via the fdrtool function in the fdrtool package in R [29].

Scaling and shifting prior to plotting Fig 3(d)

The linear regression analysis

x˜1=2,t1P=aixi,t1P+bi,t1=1,,TPx˜2=2,t2D=aixi,t2D+bi,t2=1,,TDx˜3=2,t3NP=aixi,t3NP+bi,t3=1,,TNP

was employed, where ai and bi are regression coefficients commonly used for three patient categories (these values differ from those used in Fig 2). Fitted values are used for plots. P-values that exhibit distinction among three patient groups at time point t = 1, 2, 3, 4, 5 days were computed by categorical regression

aixi,tS+bi=Ct0+S(P,D,NP)CtS1δSS

for 104 commonly selected mRNA probes in the three categories. Ct0 and CtS1 are regression coefficients fitted for {xi,tS1i104,S(P,D,NP)} with fixed t.

Statistical analysis

All statistical analyses were performed in R [29]. HOSVD was performed using the HOSVD function in the rTensor package. PCA was performed using the prcomp function in R. SAM was performed using SAM function in siggenes package. limma was performed in limma function in the limma package. Adjusted P-values were computed by p.adjust function with “BH” options. P-values by χ2 distribution was computed by pchisq function in R. Categorical regression was performed using the lm function in R. RF was performed using randomForest function in randomForest package. KCCA was performed by KCCA function in the kernlab package.

Results

A Work flow chart and list of the variables introduced are in S1 File.

Synthetic dataset

In order to demonstrate the efficacy of our strategy, I applied TD based unsupervised FE to synthetic data. In the following interpretation, I assumed two views of Case I for synthetic dataset, however interpreting it as Case II is straightforward, thus, I do not consider Case II specifically. The method applied to the synthetic dataset, TD based unsupervised FE, is the extension from the recently proposed PCA based unsupervised FE, which has been successfully applied to various bioinformatics problems [522].

First, two matrices are generated, each of which is composed of N features times M samples. They are notated as X(k), and k = 1, 2, respectively, whose components are denoted as xik,j(k), and k = 1, 2, respectively. The first N0 row vectors (features), xik,j(k), and 1 ≤ ikN0, are the noise added linear combination (Fig 4(d) and 4(e)) of constant, linear, and half period sinusoidal function (Fig 4(a) and 4(c)). However, since the coefficients of linear combinations were selected such that the correlation between X(1) and X(2) is negligible, identifying a correlation between two matrix row vectors is usually impossible (Fig 4(f)). Remaining row vectors xik,j(k) and N0 < ikN are simply random number [set membership] [0, 1]. The tasks are (i) identify N0 ordered features, and (ii) identify latent correspondence between two views.

Fig 4
The results of TD applied to the type I tensor generated from a synthetic dataset (M = 50).

The three mode tensor (type I) xi1,i2,j=xi1,j(1)xi2,j(2),1i1,i2N,1jM was derived from xik,j(k),k=1,2. Fig 4(g)–4(i) shows the first three sample mode singular value vectors, x[ell]3,j, [ell]3 = 1, 2, 3, obtained with HOSVD,

xi1,i2,j=xi1,j(1)xi2,j(2)=1=1N2=1N3=1MG(1,2,3)x1,i1(1)x2,i2(2)x3,j
(3)

It is obvious that Fig 4(g)–4(i) corresponds to Fig 4(a)–4(c), respectively. To my knowledge, it is the only method to decompose linear combinations back into parts in a fully unsupervised manner.

Moreover, as can be seen in Fig 5(a) and 5(b), N0 features are placed as outliers. Thus, TD based unsupervised FE applied to type I tensors generated from matrices’ products can not only decompose linear combinations, but can also identify the limited number of features, 1 ≤ i1, i2, ≤ N0, that contribute to correlations between two matrices, X(k), k = 1, 2.

Fig 5
Feature singular value vectors when TD was applied to type I tensor generated from synthetic data.

Although I could successfully demonstrate that my strategy works well, there is one drawback; it is the computationally extensive method, since its memory as well as computational time are proportional to MN2. In order to reduce computational resources as much as possible, I summed xi1,j(1)xi2,j(2) and generated the m(= 2) mode tensor (type II), x˜i1,i2=jxi1,j(1)xi2,j(2). Then, x˜i1,i2 is decomposed as

x˜i1,i2=jxi1,j(1)xi2,j(2)=1N2NG˜(1,2)x˜1,i1(1)x˜2,i2(2)
(4)

Two sample singular value vectors can be computed as

x˜3=k,j(k)=ikx˜k,ik(k)xik,j(k),k=1,2.

Since I employed the two views problem, although I occasionally got the two mode tensor, i.e., the matrix, if I consider m(>2) views, I generally get a m mode tensor and the above procedures can be easily extended to m mode tensors.

Fig 6 shows the first three sample singular value vectors, x˜3,j(k),3=1,2,3, and (k = 1, 2), obtained with the application of HOSVD to type II tensors. Among these, x˜3,j(k) and k = 1, 2, shown in Fig 6(b) and 6(e) ([ell]3 = 1) correctly reproduce Fig 4(e) while those shown in Fig 6(c) and 6(f) ([ell]3 = 2) does Fig 4(d), respectively. Thus, TD applied to type II tensors also successfully identified latent correlations between X(k), k = 1, 2 (Fig 6(h) and 6(i)). However, it could not depict orthogonal base functions (Fig 4(a)–4(c)) that can be detected by TD applied to type I tensors (Fig 4(g)–4(i)). Additionally, N0 features were successfully identified as outliers (Fig 5(c) and 5(d)). Thus, at the expense of recognition of orthogonal base functions, TD applied to type II tensors successfully reduced the computational resources needed by 1/M and fulfilled tasks (i) and (ii) as defined above.

Fig 6
The results of TD applied to type II tensor generated from synthetic dataset (M = 50).

In order to see if these strategies are also useful in practice, integrated analyses of multi-omics datasets were performed with this strategy and are described in the next subsection.

Multi-omics dataset

In the previous subsection, I demonstrated that applying TD based unsupervised FE to tensors generated from matrices’ products could determine latent structures behind pairs of matrices. However, this may only be feasible using synthetic data, as described in the previous subsection. Thus, in order to see if it also works in the situation not prepared specifically fitted to it, I need to show that it works in real situation.

The analysed dataset is composed of two omics profiles. These are mRNA and miRNA profiles which were measured for multi-class breast cancer samples including normal breast tissues [30]. As the samples are shared, the multi-omics data corresponds to Case I data. TD based unsupervised FE was applied to the dataset in order to identify disease critical genes and latent relations between miRNA and mRNA.

At first, TD was applied to the type I tensor generated from the mRNA and miRNA profiles as follows:

xi1,i2,j=xi1,jmRNAxi2,jmiRNA=123G(1,2,3)x1,i1mRNAx2,i2miRNAx3,j

where xi1,jmRNA and xi2,jmiRNA are expressions of the i1th mRNA and the i2th miRNA from the jth sample. In order to determine whether TD can identify disease critical features, categorical regression analysis was applied to sample singular value vector x[ell]3,j, in order to identify coincidences with defined sample classes. If the obtained sample singular value vectors are coincident with sample class labels, it is evidence that TD can process omics profiles properly, as this approach does not employ class labels explicitly. Fig 1 shows the first five sample singular value vectors, x[ell]3,j, 1 ≤ [ell]3 ≤ 5, that show significant sample class dependence. Thus, TD could successfully generate disease associated features. This is not a trivial outcome, as sample classification was not used.

Next, I attempted to extract features using the mRNA and miRNA singular value vectors x1,i1mRNA and x2,i2miRNA. To accomplish this, it was necessary to first identify mRNA and miRNA singular value vectors x1,i1mRNA and x2,i2miRNA, associated with sample singular value vectors, x[ell]3,j, 1 ≤ [ell]3 ≤ 5 identified above. This can be done by investigating G([ell]1, [ell]2, [ell]3) since the combinations ([ell]1, [ell]2, [ell]3) associated with larger absolute G values are regarded as more coincident with one another. Table 1 shows the ranking of G for 1 ≤ [ell]1, [ell]2, [ell]3 ≤ 5 (ranked from 1 to 10) based upon their absolute values. The first five sample singular value vectors x[ell]3,j, 1 ≤ [ell]3 ≤ 5, are associated with the first two miRNA singular value vectors x2,i2miRNA,2=1,2 as well as the first five mRNA singular value vectors, x1,i1mRNA,115, as only these combinations appear within top ten ranked G([ell]1, [ell]2, [ell]3)s that represent the amount of coincidence among mRNA and miRNA and samples.

Table 1
Top ranked 10 G([ell]1, [ell]2, [ell]3)s with larger absolute values among 1 ≤ [ell]1, [ell]2, [ell]3 ≤ 10 when TD was applied to type I tensor generated from xi1,jmRNA and xi2,jmiRNA (left) and xi,t1control and ...

Four hundred twenty-seven mRNA probes and 7 miRNAs, identified as outliers, using x1,i1mRNA,115 and x2,i2miRNA,122 were selected. The seven selected miRNAs were: hsa-let-7b, hsa-miR-125b, hsa-miR-143, hsa-miR-145, hsa-miR-21, hsa-miR-22, hsa-miR-99a. Remarkably, 143, 145, 21, 22, 99a as well as let-7a, and 125a, which belongs to the same families as 125b and let-7b, were also reported by the original study ([30], Table 1). The mRNAs associated with selected 427 mRNA probes are in S1 Table because of too many numbers.

In order to evaluate obtained mRNAs associated with the 427 selected mRNA probes and 7 miRNAs biologically, the mRNAs were uploaded for enrichment analysis using MSigDB [31] and the miRNAs to DIANA-mirpath [32]. The top 10 enriched gene sets in MSigDB are chiefly related to breast cancer (eight out of ten, see Table 2), while the top ranked pathways identified DIANA-mirpath are “MicroRNAs in cancer” (Table 3). Thus, I concluded that our strategy is successful despite fully unsupervised nature.

Table 2
Overlap between mRNAs identified (S1 Table) and MSigDB.
Table 3
Results of DIANA-mirath using seven miRNAs identified.

Next, type II tensor generated from mRNA and miRNA profiles was considered. Applying TD to type II tensor,

x˜i1,i2=jxi1,i2,j=12G˜(1,2)x˜1,i1mRNAx˜2,i2miRNA,

gives us two sample singular value vectors

x˜3=1,jmRNA=i1x˜1,i1mRNAxi1,j,x˜3=2,jmiRNA=i2x˜2,i2miRNAxi2,j,

The first five are shown separately in (Fig 1(b) and 1(c)). It is obvious that all of the ten sample singular value vectors are significantly coincident with sample classifications.

To determine whether TD applied to type II tensors could depict latent correlation between miRNA and mRNA, hierarchical clustering was performed between x˜3,jmRNA and x˜3,jmiRNA (1 ≤ [ell]3 ≤ 10, Fig 7(a)). Here, x˜3,jmRNA,3310 are always paired with one of x˜3,jmiRNA,3310 (Fig 7(a)). Thus, TD applied to type II tensor could successfully identify latent correlation between two views, i.e., mRNA and miRNA.

Fig 7
Hierarchical clustering of x˜3,jmRNA(x_mRNA) and x˜3,jmiRNA (x_miRNA).

Additionally, I attempted to extract mRNAs and miRNAs using the first five mRNA and miRNA singular value vectors, x˜1,i1mRNA,x˜2,i2miRNA, and 1 ≤ [ell]1, [ell]2 ≤ 5, respectively. In this example two views were emplyed, as the type II tensor is occasionally the matrix x˜i1,i2. Thus, the core tensor, G˜(1,2), is the diagonal matrix. Therefore, the first five feature singular value vectors are automatically associated with the first five corresponding sample singular value vectors. 21 miRNAs (let-7a/b/f, miR-103/125b/141/142-3p/143/145/148a/199a/b-3p/19b/205/21/22/23a/24/26a/30a/451/99a), identified as outliers using the first five miRNA singular value vectors, x˜2,i2,125, were selected. Eight miRNAs (let-7a, miR-21/22/451/142-3p/143/145/99a) were also reported in the original study ([30], Table 1). Three hundred seventy-four mRNA probes, identified as outliers using the first five mRNA singular value vectors x˜1,i1mRNA,115, were selected (associated mRNAs are in S1 Table) and were substantially overlapped with those selected when type I tensors were considered (Table 4). In order to evaluate obtained mRNAs and miRNAs biologically, mRNAs associated with 374 probes were uploaded to MSigDB, and miRNAs to DIANA-mirpath. DIANA-mirpath identified “MiRNAs in cancer” as the most significant KEGG pathway (Table 3). Table 2 also shows the overlap with MSigDB. Eight out of ten were breast cancer related, and one of the remaining two is related to metastasis. Thus, these identified mRNAs and miRNAs are biologically reasonable. In conclusion, TD applied to type II tensor works well.

Table 4
Comparison between 426 mRNA probes identified by TD based unsupervised FE applied to type I tensor and 374 mRNA probes identified by TD based unsupervised FE applied to type II tensor, or 427 probes identified by PCA based unsupervised FE separately applied ...

Temporally differentially expressed genes

Although the application of TD based unsupervised FE to multi-omics data was successful, one may wonder if the application to muti-omics data is reasonable, as synthetic datasets were more difficult to deal with due to a lack of class labelling. Have I intentionally tried the easiest case? In order to address this possible query, I considered two examples of a more difficult problem: identification of temporally differentially expressed genes. The task is as follows: Given more than one temporal gene expression, identify genes expressed differently among multiple expressions at specific time points. For example, expression of a particular gene obeys f(t) as a function of t under one set of conditions, while it obeys f(t) + C with a constant C under another. Conventionally, this is differentially expressed between the two conditions; however this kind of differential expression is often not of interest when temporal gene expression is considered. One distinction of note would be, for example, differences between expressions at certain time points and not at others (a distinct time dependency between two conditions). However, there are no de facto standard methods that automatically achieve this. In order to address it, TD based unsupervised FE was applied to this problem.

The first example of this application is the comparison of non-small lung cancer cell (NSCLC) line H1975, with and without EGF treatment [33]. Gene expression matrices were divided into two groups (with and without EGF treatment). The type I tensor xt1,t2,i is then generated as

xt1,t2,i=xi,t1controlxi,t2EGF

where xi,t1control and xi,t2EGF are ith gene expressions of cell lines with and without EGF treatment, at time points t1 and t2 after the EGF or control treatments. As they share features (though not samples) in contrast to the previous application which was Case I data, this example uses Case II data. As the samples in this example are divided into two groups based on EGF treatment, it is not fully unsupervised. It is, however, unsupervised in the sense that the type of temporal difference sought is not defined. The tensor was expanded by HOSVD as

xt1,t2,i=123G(1,2,3)x1,t1controlx2,t2EGFx3,i

Fig 2(a) and 2(b) shows the x1,t1control and x2,t2EGF for [ell]1, [ell]2 = 1, 2, respectively. Obviously, those for [ell]1 = [ell]2 = 1 do not have any time dependence while those for [ell]1 = [ell]2 = 2 do. Some temporal difference was observed in the latter, however its significance is unclear. In order to determine said significance, genes identified as outliers had to be selected. This selection process began with identifying the gene singular value vectors associated with x1=2,t1control and x2=2,t2EGF. Table 1 shows the top ranked G([ell]1, [ell]2, [ell]3)s with larger absolute values. It is obvious that x[ell]3 = 2,i is associated with x1=2,t1control and x2=2,t2EGF, as the absolute values of G(2, 2, 1) and G(2, 1, 2) are the second and the third largest in the table. Next, 558 mRNA probes (associated mRNAs are in S1 Table) identified as outliers based on x[ell]3 = 2,i were selected. Fig 2(c) shows the histogram of correlation coefficients between the vectors generated by connecting x1,t1control and x2,t2EGF vs the 558 selected mRNA probes. These are highly correlated (adjusted P-values are less than 0.01). It is remarkable since G(2, 2, 1) and G(2, 1, 2) are smaller than one tenth of G(1, 1, 1) whose absolute value is the largest. This suggests that the amount of contributions of G(2, 2, 1) and G(2, 1, 2) is too little to govern individual gene expression. The high correlation despite this fact speaks to the soundness of our methodology.

The next step was to determine whether the 558 mRNA probes selected exhibit temporal differences. The 558 mRNA probes selected are scaled, shifted, and over drawn as boxplot (Fig 2(d)). Though it is difficult to observe, P-values were computed by two-sided t tests between expressions, with and without EGF treatments at time points, 0.5, 1, 2, 4, 6 and 48 hours (Fig 2). These values are significant only at limited time points with and without EFG treatment. These were merely temporally differently expressed genes. Thus, TD based unsupervised FE applied to type I tensors is effective.

To determine the biological reliability of the selected genes, genes associated with selected 558 mRNA probes were uploaded to MSigDB. The second significant gene set was found to be KOBAYASHI_EGFR_SIGNALING_24HR_DN (the adjusted P value is 1.37 × 10−96), which is reasonable as the genes sought were expressed differently with and without EGF treatments.

The next task was to determine whether type II tensors produce similar results. Type II tensor x˜t1,t2, which in this example is the matrix since two views are considered, is defined as a summation of a type I tensor over gene index,

x˜t1,t2=ixt1,t2,i

that is expanded by HOSVD, which is a simple SVD in the present case

x˜t1,t2=12G˜(1,2)x˜1,t1controlx˜2,t2EGF

Fig 2(e) and 2(f) shows the x˜1,t1control and x˜2,t2EGF for [ell]1, [ell]2 = 1, 2. Obviously, x˜1=1,t1control and x˜2=1,t2EGF do not have any time dependence while x˜1=2,t1control and x˜2=2,t2EGF do, as in the case where TD was applied to a type I tensor. Some temporal difference was observed between x˜1=2,t1control and x˜2=2,t2EGF. Again, the significance of this temporal difference was unclear. Genes identified as outliers had to be selected to determine the significance of this temporal difference. While there are two gene singular value vectors,

x˜3=1,icontrol=t1x˜1,t1controlxi,t1,x˜3=2,iEGF=t2x˜2,t2EGFxi,t2,

485 and 471 mRNA probes identified as outliers were selected using x˜3=1,icontrol and x˜3=2,iEGF, respectively, among which 398 mRNA probes (associated mRNAs are shown in S1 Table) were commonly selected. Fig 2(g) shows the histogram of correlation coefficients between the vector generated by connecting x˜1,t1control and x˜2,t2EGF vs 398 commonly selected mRNA probes. These were highly correlated (adjusted P-values are less than 0.01). This is noteworhty as the smallest contribution from the second singular value vector was 1 × 10−5. This suggests that the amount of contributions of the second singular value vector were too small to govern individual gene expressions. As before, the high correlation despite this fact speaks to the soundness of our methodology.

The next task was to determine whether the genes selected exhibit temporal difference. The genes selected are scaled, shifted, and over drawn as boxplot (Fig 2(h)). Though it is difficult to observe, P-values computed by a two-sided t test between expression at time points, 0.5, 1, 2, 4, 6 and 48 hours (Fig 2) are significant only at limited time points with and without EFG treatment. Although this is of less significance than TD applied to a type I tensor, these are merely temporally differently expressed genes. Thus, TD based unsupervised FE applied to type II tensor is effective.

In order to see the biological reliability of selected genes, the mRNAs associated with commonly selected 398 mRNA probes were uploaded to MSigDB. The second most significant gene set was determined to be KOBAYASHI_EGFR_SIGNALING_24HR_DN (the adjusted P value is 1.37 × 10−128), which, again, was reasonable as the genes sought expressed differently with and without EGF treatments. Although the number of genes selected was less than that by TD applied to type a I tensor, since the P-value is smaller, the significance was greater than that of TD applied to a type I tensor.

As a whole, both TD applied to type I type II tensors is effective.

The next temporally differentially expressed gene detection example is vaccine infection experiment [34]. Patients were divided into three groups, P, D and NP. As sample classification was used, it is also not fully unsupervised. It is, however, unsupervised in the sense that no temporal functional forms are assumed. xi,t1P, xi,t2D, and xi,t3NP are the ith gene expressions of protected, delayed, and non-protected, patients at time points t1, t2 and t3 after vaccine treatments, respectively. Type I tensor xt1,t2,t3,i, was defined as

xt1,t2,t3,i=xi,t1Pxi,t2Dxi,t3NP

xt1,t2,t3,i would be expanded by HOSVD as

xt1,t2,t3,i=1,2,3,4G(1,2,3,4)x1,t1Px2,t2Dx3,t3NPx4,i,

however, the total memory required to store all of this expansion is too large to be prepared. Fortunately, as can be seen in the application to the first temporally differentially expressed gene identification, TD applied to a type II tensor that requires much smaller (1/N) memory can be as effective as TD applied to type I tensor for temporally differentially expressed gene identification. Thus, for this example, I employ only type II tensor x˜t1,t2,t3, defined as

x˜t1,t2,t3=ixt1,t2,t3,i

that can be expanded as

x˜t1,t2,t3=123G˜(1,2,3)x˜1,t1Px˜2,t2Dx˜3,t3NP

by HOSVD.

Fig 3(a) and 3(b) shows the x˜1,t1P, x˜2,t2D and x˜3,t3NP for [ell]1,[ell]2 = 1, 2. Obviously, x˜1=1,t1P, x˜2=1,t2D and x˜3=1,t3NP do not have any time dependence while those for [ell]1 = [ell]2 = [ell]3 = 2 do, as in the EGF treated cell line cases. Though x˜1=2,t1P, x˜2=2,t2D and x˜3=2,t3NP also seem to have some temporal difference, its significance is again unclear. To determine the significance of this difference, genes identified as outliers had to be selected. There are three gene singular value vectors:

x˜4=1,iP=t1x˜1,t1Pxi,t1P,x˜4=2,iD=t2x˜2,t2Dxi,t2Dx˜4=3,iNP=t3x˜3,t3NPxi,t3NP

Using these three gene singular value vectors with [ell]1 = [ell]2 = [ell]3 = 2, 104 mRNA probes identified commonly as outliers were selected (S1 Table).

Fig 3(c) shows the histogram of correlation coefficients between the vector generated by connecting x˜1,t1P, x˜2,t2D and x˜3,t3D vs selected 104 mRNA probes. These are highly correlated (adjusted P-values are less than 0.01). This is noteworthy as G˜(1,2,2),G˜(2,1,2),G˜(2,2,1), being the three largest core tensors associated with these three gene singular value vectors, had the contributions as small as 1 × 10−3 of G˜(1,1,1), the largest one. This suggests that the amount of contributions of the second gene singular value vector is too small to govern individual gene expression. The high correlation despite this fact suggests the soundness of our methodology.

The next task was to determine whether the mRNA probes selected exhibit temporal difference. The mRNA probes selected are scaled, shifted, and overdrawn as boxplot (Fig 3(d)). Though it is difficult to observe, P-values computed by categorical regression assuming three classes (P, D, NP) at time points, 1, 2, 3, 4 and 5 days (Fig 3) are significant at all time points between the three classes. This was clearly not due to simple baseline shifts, as can be seen in Fig 3(d). This is due to as temporally differently expressed genes. Thus, TD based unsupervised FE applied to type II tensor is also effective.

In order to determine the biological reliability of selected genes, the mRNAs associated with the selected 104 mRNA probes were uploaded to MSigDB. The top six significant gene sets were determined to be: GSE13485_X_VS_Y_YF17D_VACCINE_PBMC_DN, GSE10325_MYELOID_VS_LUPUS_MYELOID_DN, GSE13485_X_VS_Y_YF17D_VACCINE_PBMC_DN, GSE13485_X_VS_Y_YF17D_VACCINE_PBMC_DN, GSE13485_X_VS_Y_YF17D_VACCINE_PBMC_DN, and GSE13485_X_VS_Y_YF17D_VACCINATION_PBMC_DN, where (X, Y) = (CTRL, DAY7), (DAY1, DAY7), (CTRL, DAY3), (DAY3, YF17D), and (PRE, POST) in this order, which are associated with adjusted P-values, 2.97 × 10−66, 2.98 × 10−57, 3.69 × 10−55, 4.86 × 10−53, 5.43 × 10−51, and 6.36 × 10−49, respectively. As five out of six are related to vaccination, TD based unsupervised FE selected biologically feasible sets of genes.

In conclusion, TD based unsupervised FE is also effective also for the identification of temporally differentially expressed genes.

Discussion

In the following section I discuss the strategy for applying TD based unsupervised FE to tensors built from matrix products, methodological points of view, and outcomes obtained by applying this strategy to multi-omics datasets from the biological point of views.

Comparisons with various methods applicable to synthetic data

The synthetic datasets that presented in the above sections are very difficult to analyse using standard supervised statistical analysis methods. In the supervised methodology, all background knowledge of given datasets is required, e.g., classification labels or assumed functional forms (for example, monotonic increase/decrease or periodicity). Alternatively, TD applied to type I tensors is Eq (3) and applied to type II tensor is Eq (4), which can be performed in the synthetic dataset prepared without any information in advance. To my knowledge, there are no applicable supervised methodologies for the synthetic datasets presented above. Thus, in the following I discuss only unsupervised methods.

As the first N0 features are derived from the common bases shown in Fig 4(a)–4(c), it is possible to detect them by computing correlations between them. However, as can be seen in Fig 4(d)–4(f), the correlations are highly non-linear, and it is therefore impossible to detect them (in actuality, the Pearson correlation between two variables shown is Fig 4(f) is as small as -0.01).

One may wonder if correlation analysis considering linear combinations, e.g., canonical correlation analysis (CCA), can depict latent correlation. However, in CCA, as M dimensional vectors as numerous as N must be compared, it is an overcomplete problem when M < N (and this is the present case). Thus, canonical correlation coefficients generated from linear combinations are always 1.0. This means that there is no way to detect latent correlation between N0(<N) features.

Similar problems stand for nonlinear correlation analysis like kernel CCA (kCCA). KCCA was applied to matrices X(k), k = 1, 2 in the above synthetic examples. The ten components (this is the kCCA default) generated are classified into two types, each of which are distinct from the first N0 features and remaining (Fig 8(a)–8(d)). Thus, the correlation is apparently successful, however, both results in the correlation coefficients were as large as 1.0 meaning that kCCA evaluated the correlation of two views between the first N0 features and that between remaining features the latter composed of simple random numbers as demonstrated in the above. Thus, kCCA cannot distinguish between latent correlation between the first N0 features and random numbers, and cannot be successfully applied.

Fig 8
Two alternative methods applied to synthetic data.

Finally, PCA based unsupervised FE, which was recently proposed and successfully applied to various integrated analyses of multi-omics datasets, was again applied here. As PCA is equivalent to singular value decomposition (SVD),

xi1,j(1)=1u1,i1λ11/2v1,j,xi2,j(2)=2u2,i2λ21/2v2,j

Fig 8(e) and 8(f) are v[ell]1 = 1, j and v2=1,j, respectively. u[ell]1,i1,[ell]1 = 1, 2 is Fig 8(g) and u[ell]2,i2, [ell]2 = 1, 2 is Fig 8(h). λ[ell]2 and λ[ell]3 are the eigen values computed with PCA. Thus

xi1,j(1)xi2,j(2)=12v1,jλ11/2λ21/2v2,ju1,i1u2,i2

Compared with Eq (3), if

3G(1,2,3)x3,j=v1,jλ11/2λ21/2v2,jx1,i1=u1,i1,x2,i2=u2,i2

and PCA is equivalent to TD applied to a a type I tensor. Alternatively, compared with Eq (4), if

G˜(1,2)=λ11/2λ21/2jv1,jv2,j
(5)

x˜1,i1(1)=u1,i1x˜2,i2(2)=u2,i2

then PCA is equivalent to TD applied to a type II tensor. However, HOSVD does not always produce the solution satisfying the above. For example, since Eq (4) computed with HOSVD a standard SVD, G˜(1,2) is diagonal, while Eq (5) cannot be diagonal since v[ell]1,j is not orthogonal to v2,j.

Although PCA based unsupervised FE successfully identified the first N0 features associated with latent correlations as outliers (Fig 8(g) and 8(h)), PC loadings attributed to samples (Fig 8(e) and 8(f)) are almost identical to Fig 4(d) and 4(e), which are not correlated (Fig 4(f)). Therefore PCA based unsupervised FE cannot identify latent correlation. In the previous applications of PCA based unsupervised FE aimed at integrated analysis of multi-omics data [14, 22], it was critical to identify pairs of highly correlated PC loadings, otherwise it was not possible to identify which PCs should be used for FE. In this context, PCA based unsupervised FE failed to detect latent correlations among multi-views.

While PCA was unsuccessful, HOSVD was applied to type I tensors in an attempt to decompose v[ell]1,j and v2,j (Fig 8(e) and 8(f)), which are not orthogonal, into orthogonal bases x[ell]3,j as shown in Fig 4(g)–4(i). For this reason, TD applied to type I tensors is superior to PCA when applied to the matrix products of synthetic datasets and can depict latent correlation that PCA failed to identify. As for HOSVD applied to type II tensors, Fig 6(b) and 6(e) correspond to Fig 4(e) while Fig 6(c) and 6(f) correspond to Fig 4(d) shows HOSVD applied to type II tensors can depict the latent correlation that PCA based unsupervised FE failed to detect.

Methodological discussion of TD based unsupervised FE applied to multi-omics data

In contrast to synthetic data (to which no supervised methods apply), the supervised method can be applied to the multi-omics data used in this study can be treated with supervised method as the data uses class labels. Strictly speaking, there are no unsupervised methods applicable to multi-view data processing other than TD based unsupervised FE. We can, therefore, even conclude that is the above strategy is sound. That said, it would be beneficial to demonstrate that the unsupervised methodology is superior to the supervised methodology.

As mentioned in above, most multi-view data processing methodologies require optimization of weights. I do not consider such methodologies, since optimizing weights is a complicated unnecessary process. Therefore, they are not comparable with the unsupervised strategy, which involves no parameter optimization processes.

Thus, the multi-view data processing methodology is considered applicable to merged matrices shown in Eqs (1) or (2), which does not include any weight optimization. Here, I consider three alternative methodologies, Significance Analysis of Microarrays (SAM), [35] Limma [36] and randomforest [37] (RF). As SAM and Limma evaluate features independently, weights attributed to views are not required. Although RF evaluates features in more collective ways, the evaluations are tree based, therefore the absolute values of each feature are not important. Table 5 shows the pertinent results. All three methods are inferior to the methodology presented above, as they failed to identify a significantly small number of features. Limma selected all of 755 miRNAs as significant. Roughly speaking, at least half of mRNAs were identified by these three methods.

Table 5
The numbers of identified mRNAs and miRNAs (multi-omics) and mRNAs (vaccination) using various methodologies.

Although these might be enough to demonstrate the superiority of the methodology presented above, measures were also undertaken to reduce the number of mRNAs identified by reducing threshold P-values ensuring that these three methods identify 426 mRNA probes, which is the same number identified by TD applied to type I tensors. As threshold P-values which are too small without any statistical justifications may not be acceptable, in order to evaluate these three methods, unnatural threshold P-values were intentionally used. Top ranked mRNAs selected by using intentionally reduced threshold P-values were uploaded to MSigDB server. However, breast cancer was rarely identified (Table 6). Breast cancer was identified only once by SAM, and was not detected in either limma or RF. These outcomes are in contrast to Table 2, where eight out of ten significant gene sets are breast cancer-related when TD was applied to either type I or type II tensor. Thus, it is obvious that these methodologies are inferior to the methodology presented above in the present study, i.e., TD based unsupervised FE.

Table 6
Top 10 significant overlap gene set in MSigDB with top ranked 400 (approx) mRNA probes identified by alternative methods SAM, limma, and RF, as well as 374 mRNA probes identified by HO GSVD.

Finally, PCA based unsupervised FE was applied to mRNAs and miRNAs separately (Table 5). PCA based unsupervised FE identified smaller numbers of mRNAs and miRNAs than the above three methodologies. Especially, since mRNAs selected are almost identical to those identified by TD based unsupervised FE applied to type I tensor (Table 4), PCA based unsupervised FE is as effective for the identification of biologically reliable mRNAs. However, it cannot identify latent correlation between miRNAs and mRNAs, as hierarchical clustering of PC loading attributed to samples identified no pairs of miRNAss and mRNA (Fig 7(b)), which were identified when TD applied to type II tensors was considered and without which no integrated analysis of multi-omics datasets by PCA based unsupervised FE were successful. This indicates that the present datasets were more complex and cannot be dealt with using PCA based unsupervised FE in an integrated manner. Thus, I conclude that TD based unsupervised FE applied to type I or type II tensors is the only method for achieving two tasks: (i) identifying sufficiently small numbers of biologically important features, and (ii) identify latent correspondence between multi-omics profiles.

Here, PCA based unsupervised FE is shown to be the only strategy that can compute with the TD based unsupervised FE applied to matrix products. A more detailed comparison of these two strategies may enable us to understand the functionality of TD based unsupervised FE. As can be seen in the application to synthetic data, TD applied to type I tensors attempted to decompose v[ell]1,j and v[ell]2,j into orthogonal bases x[ell]3,j. This also occurred in the application to multi-omics datasets. First, I identified that the first miRNA PC loading, v2=1,j, dominates subsequent PC loadings (more than 80%). Next, I also identified that the first mRNA PC loadings attributed to samples v[ell]1,j and 1 ≤ [ell]1 ≤ 5, are identical to the first five sample singular value vectors x[ell]3,j, 1 ≤ [ell]3 ≤ 5. The Pearson correlations between these five loadings and singular value vectors are −0.94, −0.91, 0.88, −0.97 and −0.97 (Here the signs do not mean anything), respectively. It is also shown the regression analysis of the first miRNA PC loading v2=1,j with the top five mRNA PC loadings, v[ell]1,j, 1 ≤ [ell]1 ≤ 5,

v2=1,j=C0+1=15C1v1,j,j=1,,M

covers more than 40% of the first miRNA PC loading v2=1,j. Since the dimension of vector is M = 161, only five components that can cover this amount is highly significant. This suggests that TD can easily decompose single miRNA PC loadings v2=1,j into five mRNA PC loading v[ell]1,j, 1 ≤ [ell]1 ≤ 5. The result is that the first five sample singular value vectors (x[ell]3,j, 1 ≤ [ell]3 ≤ 5), are almost identical to the first five mRNA PC loadings (v[ell]1,j, 1 ≤ [ell]3 ≤ 5). Thus, TD based unsupervised FE can decompose the first dominant miRNA PC loading into five basic (orthogonal) mRNA PC loading. This result is analogous to that seen in the application to synthetic dataset (Fig 4).

It is also important to show that the five mRNA PC loadings v[ell]1,j, 1 ≤ [ell]1 ≤ 5 (or the first five sample singular value vectors x3,jmRNA,135) are distinct also from biological point of view. To demonstrate this, five sets of outliner mRNA probes (associated mRNAs are in S2 Table) were selected using each of the first five mRNA singular value vectors (x1,i1mRNA,115) each of which is coincident with the first five sample singular value vectors, (x3,jmRNA,135)as G([ell]1, [ell]2, [ell]3)s with 1 ≤ [ell]1 = [ell]3 ≤ 5 have larger absolute values (Table 1). GO (biological process) BP term enrichments were tested by uploading S2 Table to g:Cocoa in g:profiler [27] (S2 File). It is obvious that five sets of mRNAs identified as outliers using each of the first five mRNA singular value vectors are biologically distinct from one another.

Methodological discussion of TD based unsupervised FE applied to identification of temporally differentially expressed genes

At certain time points in EGF treatment experiments, there is only one measurement, which prevents the application of some statistical methods. Therefore, only vaccination samples were considered. Again, SAM, limma, RF and PCA based unsupervised FE were considered. Here, the samples are assumed to be classified into five time points times three treatments (P, D, NP) equalling 15 classes. Results are shown in Table 5. As with multi-omics data, SAM, limma, and RF failed to identify sufficiently small numbers. This is possibly due to the fact that there are 15 classes. Since even the detection of differences between pairs of any two of the 15 classes can effect results, there are many genes identified as having significant differences. On the other hand, 18 mRNA probes were identified by PCA based unsupervised FE with considering common set when separately applied to three gene expression profiles, xi,t1P xi,t2D and xi,t3NP. Thus, relatively successful.

In order to determine biological significance, a reduced number of gene sets was uploaded to MSigDB. Similar to the multi-omics case, SAM indicated too many mRNAs with adjusted P-value = 0, therefore no reduced sets could be generated. Two 300 top ranked mRNA probes sets were generated from limma and RF and associated mRNAs were uploaded to MSigDB together with the genes associated with 18 mRNA probes obtained by PCA based unsupervised FE. Within top 10 ranked significant genes set, no vaccination related genes sets were identified for limma or RF. PCA based supervised FE has only two vaccination related genes sets. GSE13485_X_VS_Y_YF17D_VACCINE_PBMC_DN, (X, Y) = (DAY3, DAY7) and (DAY1, DAY7), were identified as the second (adjusted P = 1.86 × 10−6) and the fourth (adjusted P = 4.30 × 10−5) significant genes sets, which were smaller than those identified when TD was applied to type II tensors, which identified five gene sets associated with vaccination out of six top ranked gene sets.

Thus, one of four tested methods, PCA based unsupervised FE, could produce some significant results which were inferior to those produced by TD based unsupervised FE applied to type II tensors. As a result, TD based unsupervised FE proved more effective than the other methodologies analysed above.

Comparison with HO GSVD

To my knowledge, although there are no methods that comprise a tensor from multiple matrices and applies TD to it, similar trials aiming integration of multiple matrices exist. For example, higher order generalized singular value decomposition (HO GSVD) [25] is one such method. Although HO GSVD does not generate tensor, the outcome is quite similar; a set of feature singular value vectors and a unique (common) sample singular value vector which is equivalent to what TD applied to type I tensors generated from Case I data produces. Although Ponnapalli et al [25] employed the distinct terminology from the present study, I continue to use my own terminology in this subsection to avoid confusion.

First, HO GSVD was applied to synthetic data. Fig 9(a)–9(d) shows the results. Its outcome is close to that when PCA based unsupervised FE was applied to dataset (Fig 8(e)–8(h)). It is in some sense reasonable, since HO GSVD is essentially PCA excluding the fact multiple views share the unique sample singular value matrix, V, where the first the second column vectors correspond to the first PC loading of the first and the second views, respectively.

Fig 9
The results of HO GSVD applied to the synthetic data.

Next, HO GSVD was applied to multi-omics data. Coincidence between sample singular value vectors and class labeling is shown in Fig 1(d). Although four out of five vectors are significantly coincident with class labeling, significance was substantially less than when TD was applied to type I and II tensors, as the P-values were larger. Thus, HO GSVD can perform well but is less effective than TD applied to type I or II tensors.

Next, 374 mRNA probes identified as outliers using the first five mRNA feature singular value vectors were selected (associated mRNAs are shown in S1 Table). Uploading the mRNAs associated with 374 mRNA probes to MSigDB, I found that only five out of ten top ranked significant genes sets were related to breast cancer (Table 6), while eight out of ten were related to breast cancer when TD was applied to type I or II tensors (Table 2).

Fifteen miRNAs(miR-127-5p/128/181a/190a/301a/30e*/339-5p/340/361-5p/365 /452/454/455-5p/874/135a) identified as outliers using the first and the second miRNA feature singular value vectors were selected. None were reported in the original study ([30], Table 1). Uploading 15 miRNAs to DIANA-mirpath, I found that “MiRNAs in cancer”, which was the top ranked when TD was applied to type I or type II tensors (Table 3), was not included in the top ten ranked KEGG pathways.

HO GSVD cannot be applied to identification of temporally differentially expressed genes, since HO GSVD can be applied only to Case I data where samples are shared between multiple views.

These slightly poorer outcomes of HO GSVD than TD applied to type I or II tensors suggest the usefulness of tensors when analysing multi-view datasets.

Biological validations of mRNAs identified in multi-omics data analysis

In the previous subsection, TD based unsupervised FE applied to product of multi-omica profile matrices was validated chiefly from the methodological perspective, and validated partially from the biological perspective.

In this subsection, I try to validate outcomes biologically in more detail, mainly based upon the consideration from oncology.

The samples analysed are essentially proposing the comparison between tumors with and without metastasis. Thus, it is expected that selected genes are mainly related to cancer oncogenesis related to metastasis.

Since Farazi et al [30] who produced the original study, mainly discuss the aberrant expression of miRNAs among samples, there is no in depth discussion about the role of miRNA/mRNA in metastasis. However, as can be seen in the below, much can be discussed from their dataset.

In order to biologically investigate a set of mRNAs identified when type I tensors were considered, to the mRNAs were uploaded to g:profiler (see S3 Table). A large number of enrichments of biological terms were identified.

For example, in GO BP terms, “leukocyte activation” (GO:0045321) was enriched. It was reported to be related to metastasis. Ihnen et al [38] reported a tumor biological context of activated leukocyte cell adhesion molecules (ALCAM) for the development of metastases in breast cancer. Strell et al [39] concluded that the first two steps of the extravasation of tumor cells and leukocytes, rolling and adhesion, seem to have similarities with regards to the mechanisms and receptors involved. King et al [40] identified ALCAM in metastasis of breast cancer cells to the lung. These suggested that metastasis was mediated by the extravasation similar to that of leukocytes. In relaton to this, “positive regulation of leukocyte chemotaxis” (GO:0002690) was also enriched. Wu [41] reported the role of chemotaxis in cell migration. Gradient of chemotaxis mediates cell migration, and possibly metastasis, too.

In GO (cellular component) CC terms, “extracellular region” (GO:0005576) was enriched. Cho et al [42] reported that Herceptin binds to the juxtamembrane region of HER2, identifying this site as a target for anticancer therapies, while overexpression of HER2 is found in 20-30% of human breast cancers, and correlates with more aggressive tumours and a poorer prognosis. It is also primary biomarker of breast cancer in the original study [30]. More generally, Versteeg et al [43] suggested the importance of extracellular signaling pathway in cancer metastasis. It mediates blood vessel wall damage, which may allow tumours to migrate through blood vessels.

In GO (molecular function) MF terms, “CXCR3 chemokine receptor binding” (GO:0048248) was enriched. CXCR3 was reported as a molecular target in breast cancer metastasis [44]; it inhibits tumor cell migration and promotes of host anti-tumour immunity. As suggested in the above, chemotaxis mediates cell migration and chemokine receptor CXCR3 agonist prevents human T-cell migration [45]. Other than in relation to metastasis, inhibition of CXCR3 was also known to mediate tumor growth [46]. “RAGE receptor binding” (GO:0050786) was also enriched. RAGE was reported to mediate tumor progression and metastasis through binding to S100A7 by modulating the tumor microenvironment [47]. It recruits MMP9-positive tumor-associated macrophages and mediates cell migrations.

Other than in GO terms enrichment, transcription factor (TF) SOX9 target genes were enriched. The relation between SOX9 and metastasis was pointed out by many papers. Got et al [48] reported that co-expression of Slug and Sox9 promotes the tumorigenic and metastasis-seeding abilities of human breast cancer cells. SOX9 protein, which is normally nuclear, was instead localized in the cytoplasm of 25-30% invasive ductal carcinomas (IDCs) and lymph node metastases [49]. Lei et al [50] also suggested that Sox9 expression is related to breast cancer metastasis. Although the above are all related to breast cancer, Sox9 was frequently reported to be related to metastasis in various other cancers.

KEGG pathway “Primary immunodeficiency” (KEGG:05340) was also enriched. Development of cancer in patients with primary immunodeficiencies was reported [51]. Monozygotic twin brothers with primary immunodeficiency presented with metastatic adenocarcinoma of unknown primary [52].

In conclusion, our method and strategy correctly identified many cancer related biological terms/concept enrichments, especially metastasis in breast cancer, which is coincident with the purpose of the original study that did not produce results produced here.

Biological validations of miRNAs identified in multi-omics data analysis

As the relation between mRNAs identified and breast cancer metastasis can be shown, it is necessary to demonstrate the relationship between the miRNAs identified and breast cancer metastasis. Research of the litertature showsthat all seven miRNAs identified when type I tensors were considered(let-7b [53], miR-125b [5456], miR-143 [5764], miR-145 [61, 62, 6568], miR-21 [6973], miR-22 [7478] and miR-99a [32, 79]), were reported to be related to metastasis.

Although not all are strictly related to breast cancer, all seven miRNAs identified are frequently reported to be related to metastasis.

Conclusion

In this paper, a new strategy aiming at multi-view data processing that makes use of tensors generated from multi-view matrices products was proposed. As tensors can be generated from individual measurements, observation under combined conditions, which is generally required to produce tensors from datasets, is not necessary. FEs were performed using singular value vectors generated from TD and biological feasibility was confirmed via comparisons with previously generated annotated gene expression profiles. As this strategy is not restricted to gene expression, its application to other datasets is feasible.

Supporting information

S1 File

Flow chart and variables.

A Work flow chart and list of the variables introduced are in S1 File.

(PDF)

S2 File

BP term enrichment.

BP term enrichments by uploading S2 Table to g:Cocoa in g:profiler.

(PDF)

S1 Table

List of genes.

Full lists of genes selected by various methods.

(XLSX)

S2 Table

List of genes.

Lists of genes associated with five gene singular value vectors.

(CSV)

S3 Table

Output from g:profiler.

Enriched terms when list of genes were uploaded to g profiler.

(XLSX)

Funding Statement

This work was supported by Japan Society for the Promotion of Science-17K00417-Prof. Y-h. Taguchi.

Data Availability

Data Availability

All data sets from GEO GEO ID: GSE28884, GSE84096, GSE18323.

References

1. Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinformatics. 2016;. [PubMed]
2. Xu C, Tao D, Xu C. A Survey on Multi-view Learning. CoRR. 2013;abs/1304.5634.
3. Taguchi YH, Iwadate M, Umeyama H, Murakami Y. Principal component analysis based unsupervised feature extraction applied to bioinformatics analysis. In: Tsai JJP, Ng K-l, editors. Computational Methods with Applications in Bioinformatics Analysis. World Scientific; 2017. p. 153–182. Available from: http://www.worldscientific.com/doi/abs/10.1142/9789813207981_0008
4. Taguchi YH. Principal Components Analysis Based Unsupervised Feature Extraction Applied to Gene Expression Analysis of Blood from Dengue Haemorrhagic Fever Patients. Sci Rep. 2017;7:44016 doi: 10.1038/srep44016 [PMC free article] [PubMed]
5. Taguchi YH. Principal component analysis based unsupervised feature extraction applied to publicly available gene expression profiles provides new insights into the mechanisms of action of histone deacetylase inhibitors. Neuroepigenetics. 2016;8:1–18. doi: 10.1016/j.nepig.2016.10.001
6. Taguchi YH, Iwadate M, Umeyama H. Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease. BMC Bioinformatics. 2015;16:139 doi: 10.1186/s12859-015-0574-4 [PMC free article] [PubMed]
7. Taguchi YH, Okamoto A. Principal Component Analysis for Bacterial Proteomic Analysis In: Shibuya T, Kashima H, Sese J, Ahmad S, editors. Pattern Recognition in Bioinformatics. vol. 7632 of LNCS. Heidelberg: Springer International Publishing; 2012. p. 141–152.
8. Ishida S, Umeyama H, Iwadate M, Taguchi YH. Bioinformatic Screening of Autoimmune Disease Genes and Protein Structure Prediction with FAMS for Drug Discovery. Protein Pept Lett. 2014;21(8):828–39. doi: 10.2174/09298665113209990052 [PMC free article] [PubMed]
9. Kinoshita R, Iwadate M, Umeyama H, Taguchi YH. Genes associated with genotype-specific DNA methylation in squamous cell carcinoma as candidate drug targets. BMC Syst Biol. 2014;8 Suppl 1:S4 doi: 10.1186/1752-0509-8-S1-S4 [PMC free article] [PubMed]
10. Taguchi YH, Murakami Y. Principal component analysis based feature extraction approach to identify circulating microRNA biomarkers. PLoS ONE. 2013;8(6):e66714 doi: 10.1371/journal.pone.0066714 [PMC free article] [PubMed]
11. Taguchi YH, Murakami Y. Universal disease biomarker: can a fixed set of blood microRNAs diagnose multiple diseases? BMC Res Notes. 2014;7:581 doi: 10.1186/1756-0500-7-581 [PMC free article] [PubMed]
12. Murakami Y, Toyoda H, Tanahashi T, Tanaka J, Kumada T, Yoshioka Y, et al. Comprehensive miRNA expression analysis in peripheral blood can diagnose liver disease. PLoS ONE. 2012;7(10):e48366 doi: 10.1371/journal.pone.0048366 [PMC free article] [PubMed]
13. Murakami Y, Tanahashi T, Okada R, Toyoda H, Kumada T, Enomoto M, et al. Comparison of Hepatocellular Carcinoma miRNA Expression Profiling as Evaluated by Next Generation Sequencing and Microarray. PLoS ONE. 2014;9(9):e106314 doi: 10.1371/journal.pone.0106314 [PMC free article] [PubMed]
14. Murakami Y, Kubo S, Tamori A, Itami S, Kawamura E, Iwaisako K, et al. Comprehensive analysis of transcriptome and metabolome analysis in Intrahepatic Cholangiocarcinoma and Hepatocellular Carcinoma. Sci Rep. 2015;5:16294 doi: 10.1038/srep16294 [PMC free article] [PubMed]
15. Umeyama H, Iwadate M, Taguchi YH. TINAGL1 and B3GALNT1 are potential therapy target genes to suppress metastasis in non-small cell lung cancer. BMC Genomics. 2014;15 Suppl 9:S2 doi: 10.1186/1471-2164-15-S9-S2 [PMC free article] [PubMed]
16. Taguchi YH, Iwadate M, Umeyama H. Heuristic principal component analysis-based unsupervised feature extraction and its application to gene expression analysis of amyotrophic lateral sclerosis data sets. In: Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015 IEEE Conference on; 2015. p. 1–10.
17. Taguchi YH, Iwadate M, Umeyama H, Murakami Y, Okamoto A. Heuristic principal component analysis-aased unsupervised feature extraction and its application to bioinformatics In: Wang B, Li R, Perrizo W, editors. Big Data Analytics in Bioinformatics and Healthcare; 2015. p. 138–162.
18. Taguchi YH. Integrative Analysis of Gene Expression and Promoter Methylation during Reprogramming of a Non-Small-Cell Lung Cancer Cell Line Using Principal Component Analysis-Based Unsupervised Feature Extraction In: Huang DS, Han K, Gromiha M, editors. Intelligent Computing in Bioinformatics. vol. 8590 of LNCS. Heidelberg: Springer International Publishing; 2014. p. 445–455.
19. Taguchi YH. Identification of aberrant gene expression associated with aberrant promoter methylation in primordial germ cells between E13 and E16 rat F3 generation vinclozolin lineage. BMC Bioinformatics. 2015;16 Suppl 18:S16 doi: 10.1186/1471-2105-16-S18-S16 [PMC free article] [PubMed]
20. Taguchi YH. Identification of More Feasible MicroRNA-mRNA Interactions within Multiple Cancers Using Principal Component Analysis Based Unsupervised Feature Extraction. Int J Mol Sci. 2016;17(5):E696 doi: 10.3390/ijms17050696 [PMC free article] [PubMed]
21. Taguchi YH. Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression. BioData Min. 2016;9:22 doi: 10.1186/s13040-016-0101-9 [PMC free article] [PubMed]
22. Taguchi YH, Iwadate M, Umeyama H. SFRP1 is a possible candidate for epigenetic therapy in non-small cell lung cancer. BMC Med Genomics. 2016;9 Suppl 1:28 doi: 10.1186/s12920-016-0196-3 [PMC free article] [PubMed]
23. Lathauwer LD, Moor BD, Vandewalle J. A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications. 2000;21(4):1253–1278. doi: 10.1137/S0895479896305696
24. Omberg L, Golub GH, Alter O. A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies. Proc Natl Acad Sci USA. 2007;104(47):18371–18376. doi: 10.1073/pnas.0709146104 [PubMed]
25. Ponnapalli SP, Saunders MA, Van Loan CF, Alter O. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS ONE. 2011;6(12):e28072 doi: 10.1371/journal.pone.0028072 [PMC free article] [PubMed]
26. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;57(1):289–300.
27. Reimand J, Arak T, Adler P, Kolberg L, Reisberg S, Peterson H, et al. g:Profiler-a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 2016;44(W1):W83–89. doi: 10.1093/nar/gkw199 [PMC free article] [PubMed]
28. Strimmer K. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics. 2008;24(12):1461–1462. doi: 10.1093/bioinformatics/btn209 [PubMed]
29. R Core Team. R: A Language and Environment for Statistical Computing; 2015. Available from: https://www.R-project.org/
30. Farazi TA, Horlings HM, Ten Hoeve JJ, Mihailovic A, Halfwerk H, Morozov P, et al. MicroRNA sequence and expression analysis in breast tumors by deep sequencing. Cancer Res. 2011;71(13):4443–4453. doi: 10.1158/0008-5472.CAN-11-0608 [PMC free article] [PubMed]
31. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102 [PubMed]
32. Yu SH, Zhang CL, Dong FS, Zhang YM. miR-99a suppresses the metastasis of human non-small cell lung cancer cells by targeting AKT1 signaling pathway. J Cell Biochem. 2015;116(2):268–276. doi: 10.1002/jcb.24965 [PubMed]
33. Albrecht M, Stichel D, Muller B, Merkle R, Sticht C, Gretz N, et al. TTCA: an R package for the identification of differentially expressed genes in time course microarray data. BMC Bioinformatics. 2017;18(1):33 doi: 10.1186/s12859-016-1440-8 [PMC free article] [PubMed]
34. Vahey MT, Wang Z, Kester KE, Cummings J, Heppner DG, Nau ME, et al. Expression of genes associated with immunoproteasome processing of major histocompatibility complex peptides is indicative of protection with adjuvanted RTS,S malaria vaccine. J Infect Dis. 2010;201(4):580–589. doi: 10.1086/650310 [PubMed]
35. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001;98(9):5116–5121. doi: 10.1073/pnas.091062498 [PubMed]
36. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47 doi: 10.1093/nar/gkv007 [PMC free article] [PubMed]
37. Fortino V, Kinaret P, Fyhrquist N, Alenius H, Greco D. A robust and accurate method for feature selection and prioritization from multi-class OMICs data. PLoS ONE. 2014;9(9):e107801 doi: 10.1371/journal.pone.0107801 [PMC free article] [PubMed]
38. Ihnen M, Kohler N, Kersten JF, Milde-Langosch K, Beck K, Holler S, et al. Expression levels of Activated Leukocyte Cell Adhesion Molecule (ALCAM/CD166) in primary breast carcinoma and distant breast cancer metastases. Dis Markers. 2010;28(2):71–78. doi: 10.3233/DMA-2010-0685 [PMC free article] [PubMed]
39. Strell C, Entschladen F. Extravasation of leukocytes in comparison to tumor cells. Cell Commun Signal. 2008;6:10 doi: 10.1186/1478-811X-6-10 [PMC free article] [PubMed]
40. King JA, Chambers Z, Kleinfeld S, Chen H, Stevens T, Shevde LA, et al. Potential role of activated leukocyte cell adhesion molecule (ALCAM/CD166) in metastasis of breast cancer cells to the lung. Cancer Research. 2014;66(8 Supplement):654–654.
41. Wu D. Signaling mechanisms for regulation of chemotaxis. Cell Res. 2005;15(1):52–56. doi: 10.1038/sj.cr.7290265 [PubMed]
42. Cho HS, Mason K, Ramyar KX, Stanley AM, Gabelli SB, Denney DW, et al. Structure of the extracellular region of HER2 alone and in complex with the Herceptin Fab. Nature. 2003;421(6924):756–760. doi: 10.1038/nature01392 [PubMed]
43. Versteeg HH, Spek CA, Peppelenbosch MP, Richel DJ. Tissue factor and cancer metastasis: the role of intracellular and extracellular signaling pathways. Mol Med. 2004;10(1-6):6–11. [PMC free article] [PubMed]
44. Zhu G, Yan HH, Pang Y, Jian J, Achyut BR, Liang X, et al. CXCR3 as a molecular target in breast cancer metastasis: inhibition of tumor cell migration and promotion of host anti-tumor immunity. Oncotarget. 2015;6(41):43408–43419. doi: 10.18632/oncotarget.6125 [PMC free article] [PubMed]
45. O’Boyle G, Fox CRJ, Walden HR, Willet JDP, Mavin ER, Hine DW, et al. Chemokine receptor CXCR3 agonist prevents human T-cell migration in a humanized model of arthritic inflammation. Proceedings of the National Academy of Sciences. 2012;109(12):4598–4603. doi: 10.1073/pnas.1118104109 [PubMed]
46. Oghumu S, Varikuti S, Terrazas C, Kotov D, Nasser MW, Powell CA, et al. CXCR3 deficiency enhances tumor progression by promoting macrophage M2 polarization in a murine breast cancer model. Immunology. 2014;143(1):109–119. doi: 10.1111/imm.12293 [PubMed]
47. Nasser MW, Ahirwar DK, Ganju RK. RAGE: A novel target for breast cancer growth and metastasis. Oncoscience. 2016;3(2):52–53. doi: 10.18632/oncoscience.294 [PMC free article] [PubMed]
48. Guo W, Keckesova Z, Donaher JL, Shibue T, Tischler V, Reinhardt F, et al. Slug and Sox9 cooperatively determine the mammary stem cell state. Cell. 2012;148(5):1015–1028. doi: 10.1016/j.cell.2012.02.008 [PMC free article] [PubMed]
49. Chakravarty G, Moroz K, Makridakis NM, Lloyd SA, Galvez SE, Canavello PR, et al. Prognostic significance of cytoplasmic SOX9 in invasive ductal carcinoma and metastatic breast cancer. Exp Biol Med (Maywood). 2011;236(2):145–155. doi: 10.1258/ebm.2010.010086 [PubMed]
50. Sox9 upregulation in breast cancer is correlated with poor prognosis and the CD44+/CD24-/low phenotype; 2016.
51. Salavoura K, Kolialexi A, Tsangaris G, Mavrou A. Development of cancer in patients with primary immunodeficiencies. Anticancer Res. 2008;28(2B):1263–1269. [PubMed]
52. Wood LA, Venner PM, Pabst HF. Monozygotic twin brothers with primary immunodeficiency presenting with metastatic adenocarcinoma of unknown primary. Acta Oncol. 1998;37(7-8):771–772. doi: 10.1080/028418698430197 [PubMed]
53. Han X, Chen Y, Yao N, Liu H, Wang Z. MicroRNA let-7b suppresses human gastric cancer malignancy by targeting ING1. Cancer Gene Ther. 2015;22(3):122–129. doi: 10.1038/cgt.2014.75 [PubMed]
54. Zhou HC, Fang JH, Shang LR, Zhang ZJ, Sang Y, Xu L, et al. MicroRNAs miR-125b and miR-100 suppress metastasis of hepatocellular carcinoma by disrupting the formation of vessels that encapsulate tumour clusters. J Pathol. 2016;240(4):450–460. doi: 10.1002/path.4804 [PubMed]
55. Glud M, Rossing M, Hother C, Holst L, Hastrup N, Nielsen FC, et al. Downregulation of miR-125b in metastatic cutaneous malignant melanoma. Melanoma Res. 2010;20(6):479–484. doi: 10.1097/CMR.0b013e32833e32a1 [PubMed]
56. Tang F, Zhang R, He Y, Zou M, Guo L, Xi T. MicroRNA-125b induces metastasis by targeting STARD13 in MCF-7 and MDA-MB-231 breast cancer cells. PLoS ONE. 2012;7(5):e35435 doi: 10.1371/journal.pone.0035435 [PMC free article] [PubMed]
57. He Z, Yi J, Liu X, Chen J, Han S, Jin L, et al. MiR-143-3p functions as a tumor suppressor by regulating cell proliferation, invasion and epithelial-mesenchymal transition by targeting QKI-5 in esophageal squamous cell carcinoma. Mol Cancer. 2016;15(1):51 doi: 10.1186/s12943-016-0533-3 [PMC free article] [PubMed]
58. Hu Y, Ou Y, Wu K, Chen Y, Sun W. miR-143 inhibits the metastasis of pancreatic cancer and an associated signaling pathway. Tumour Biol. 2012;33(6):1863–1870. doi: 10.1007/s13277-012-0446-8 [PubMed]
59. Hirahata M, Osaki M, Kanda Y, Sugimoto Y, Yoshioka Y, Kosaka N, et al. PAI-1, a target gene of miR-143, regulates invasion and metastasis by upregulating MMP-13 expression of human osteosarcoma. Cancer Med. 2016;5(5):892–902. doi: 10.1002/cam4.651 [PMC free article] [PubMed]
60. Xia H, Sun S, Wang B, Wang T, Liang C, Li G, et al. miR-143 inhibits NSCLC cell growth and metastasis by targeting Limk1. Int J Mol Sci. 2014;15(7):11973–11983. doi: 10.3390/ijms150711973 [PMC free article] [PubMed]
61. Peng X, Guo W, Liu T, Wang X, Tu X, Xiong D, et al. Identification of miRs-143 and -145 that is associated with bone metastasis of prostate cancer and involved in the regulation of EMT. PLoS ONE. 2011;6(5):e20341 doi: 10.1371/journal.pone.0020341 [PMC free article] [PubMed]
62. Huang S, Guo W, Tang Y, Ren D, Zou X, Peng X. miR-143 and miR-145 inhibit stem cell characteristics of PC-3 prostate cancer cells. Oncol Rep. 2012;28(5):1831–1837. [PubMed]
63. Osaki M, Takeshita F, Sugimoto Y, Kosaka N, Yamamoto Y, Yoshioka Y, et al. MicroRNA-143 regulates human osteosarcoma metastasis by regulating matrix metalloprotease-13 expression. Mol Ther. 2011;19(6):1123–1130. doi: 10.1038/mt.2011.53 [PubMed]
64. Ma Q, Jiang Q, Pu Q, Zhang X, Yang W, Wang Y, et al. MicroRNA-143 inhibits migration and invasion of human non-small-cell lung cancer and its relative mechanism. Int J Biol Sci. 2013;9(7):680–692. doi: 10.7150/ijbs.6623 [PMC free article] [PubMed]
65. Wang M, Wang J, Deng J, Li X, Long W, Chang Y. MiR-145 acts as a metastasis suppressor by targeting metadherin in lung cancer. Med Oncol. 2015;32(1):344 doi: 10.1007/s12032-014-0344-6 [PubMed]
66. Donzelli S, Mori F, Bellissimo T, Sacconi A, Casini B, Frixa T, et al. Epigenetic silencing of miR-145-5p contributes to brain metastasis. Oncotarget. 2015;6(34):35183–35201. doi: 10.18632/oncotarget.5930 [PMC free article] [PubMed]
67. Li YQ, He QM, Ren XY, Tang XR, Xu YF, Wen X, et al. MiR-145 inhibits metastasis by targeting fascin actin-bundling protein 1 in nasopharyngeal carcinoma. PLoS ONE. 2015;10(3):e0122228 doi: 10.1371/journal.pone.0122228 [PMC free article] [PubMed]
68. Dong R, Liu X, Zhang Q, Jiang Z, Li Y, Wei Y, et al. miR-145 inhibits tumor growth and metastasis by targeting metadherin in high-grade serous ovarian carcinoma. Oncotarget. 2014;5(21):10816–10829. doi: 10.18632/oncotarget.2522 [PMC free article] [PubMed]
69. Liu ZL, Wang H, Liu J, Wang ZX. MicroRNA-21 (miR-21) expression promotes growth, metastasis, and chemo- or radioresistance in non-small cell lung cancer cells by targeting PTEN. Mol Cell Biochem. 2013;372(1-2):35–45. doi: 10.1007/s11010-012-1443-3 [PubMed]
70. Mudduluru G, George-William JN, Muppala S, Asangani IA, Kumarswamy R, Nelson LD, et al. Curcumin regulates miR-21 expression and inhibits invasion and metastasis in colorectal cancer. Biosci Rep. 2011;31(3):185–197. doi: 10.1042/BSR20100065 [PubMed]
71. Yan LX, Huang XF, Shao Q, Huang MY, Deng L, Wu QL, et al. MicroRNA miR-21 overexpression in human breast cancer is associated with advanced clinical stage, lymph node metastasis and patient poor prognosis. RNA. 2008;14(11):2348–2360. doi: 10.1261/rna.1034808 [PubMed]
72. Bornachea O, Santos M, Martinez-Cruz AB, Garcia-Escudero R, Duenas M, Costa C, et al. EMT and induction of miR-21 mediate metastasis development in Trp53-deficient tumours. Sci Rep. 2012;2:434 doi: 10.1038/srep00434 [PMC free article] [PubMed]
73. Yang CH, Yue J, Pfeffer SR, Handorf CR, Pfeffer LM. MicroRNA miR-21 regulates the metastatic behavior of B16 melanoma cells. J Biol Chem. 2011;286(45):39172–39178. doi: 10.1074/jbc.M111.285098 [PMC free article] [PubMed]
74. Xin M, Qiao Z, Li J, Liu J, Song S, Zhao X, et al. miR-22 inhibits tumor growth and metastasis by targeting ATP citrate lyase: evidence in osteosarcoma, prostate cancer, cervical cancer and lung cancer. Oncotarget. 2016;7(28):44252–44265. doi: 10.18632/oncotarget.10020 [PMC free article] [PubMed]
75. Tang Y, Liu X, Su B, Zhang Z, Zeng X, Lei Y, et al. microRNA-22 acts as a metastasis suppressor by targeting metadherin in gastric cancer. Mol Med Rep. 2015;11(1):454–460. [PubMed]
76. Chen M, Hu W, Xiong CL, Qu Z, Yin CQ, Wang YH, et al. miR-22 targets YWHAZ to inhibit metastasis of hepatocellular carcinoma and its down-regulation predicts a poor survival. Oncotarget. 2016;. [PMC free article] [PubMed]
77. Song SJ, Poliseno L, Song MS, Ala U, Webster K, Ng C, et al. MicroRNA-antagonism regulates breast cancer stemness and metastasis via TET-family-dependent chromatin remodeling. Cell. 2013;154(2):311–324. doi: 10.1016/j.cell.2013.06.026 [PMC free article] [PubMed]
78. Wan WN, Zhang YQ, Wang XM, Liu YJ, Zhang YX, Que YH, et al. Down-regulated miR-22 as predictive biomarkers for prognosis of epithelial ovarian cancer. Diagn Pathol. 2014;9:178 doi: 10.1186/s13000-014-0178-8 [PMC free article] [PubMed]
79. Kuo YZ, Tai YH, Lo HI, Chen YL, Cheng HC, Fang WY, et al. MiR-99a exerts anti-metastasis through inhibiting myotubularin-related protein 3 expression in oral cancer. Oral Dis. 2014;20(3):65–75. doi: 10.1111/odi.12133 [PubMed]

Articles from PLoS ONE are provided here courtesy of Public Library of Science