shows the workflow of the *MetSign* software, which has two major components: project management and data analysis. The project management part manages samples and data meta-information. For data analysis, *MetSign* first reduces the raw instrument data into a peak list via spectrum deconvolution. It then performs initial metabolite assignment. Peak alignment then recognizes peaks of the same type of metabolite in different samples. Four normalization algorithms were implemented for the user to choose from. *MetSign* also enables both presence/absence tests and abundance tests for statistical significance analysis. *K*-means, agglomerative hierarchical clustering, and fuzzy C-means clustering algorithms are available for clustering all sample features. Temporal analysis can automatically generate the metabolite time course plot and cluster the metabolites based on their time course trajectories.

Spectrum Deconvolution

*MetSign* supports importing the mzXML raw data format for spectrum deconvolution. Let {*pl*_{1}, *pl*_{2},..., *pl*_{n}} be the profile list of all scans in each mzXML raw data file, where *pl*_{i} is the profile data of scan *i* and *n* is the total number of scans. For a direct infusion experiment, *MetSign* sums *pl*_{1}, *pl*_{2},..., *pl*_{n} to get a summed profile *P*. A wavelet transformation (WT) is then employed for noise removal. After removing the noise, many isolated peak profiles {*p*_{1}, *p*_{2},..., *p*_{n}} are left in *P*, where *p*_{i} = {(*x*_{(i,1)}, *y*_{(i,1)}), (*x*_{(i,2)}, *y*_{(i,2)}),..., (*x*_{(i,p)}, *y*_{(i,p)})} is the *i*^{th} peak profile, and *x*_{(i,j)} and *y*_{(i,j)} represent the *m/z* value and the peak area of the *j*^{th} isotopic peak ion in the peak profile *p*_{i}, respectively. A second-order polynomial fitting (SPF) function is then used to centralize each peak profile in {*p*_{1}, *p*_{2},..., *p*_{n}} to obtain its peak area and *m/z* value. When multiple peaks overlap each other, a Gaussian mixture model (GMM) is employed to deconvolute the overlapping peaks. Peaks are considered overlapping if at least one side of a peak is higher than twice the baseline, which is deduced from the noise signals removed by WT.
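The centralization step can be illustrated with a minimal sketch. It assumes that the second-order polynomial is fitted through the apex point of a peak profile and its two neighbours; the function name `parabolic_centroid` and this three-point choice are illustrative assumptions, not details taken from *MetSign*, and the sketch returns the apex height rather than a peak area.

```python
def parabolic_centroid(mz, intensity):
    """Estimate the centroid m/z and apex height of one peak profile.

    Assumption (not from the paper): a parabola y = a*x^2 + b*x + c is fitted
    exactly through the highest point and its two neighbours.
    """
    apex = max(range(len(intensity)), key=lambda i: intensity[i])
    if apex == 0 or apex == len(intensity) - 1:
        # No neighbour on one side: fall back to the raw apex point.
        return mz[apex], intensity[apex]
    x0, x1, x2 = mz[apex - 1], mz[apex], mz[apex + 1]
    y0, y1, y2 = intensity[apex - 1], intensity[apex], intensity[apex + 1]
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    b = (x2**2 * (y0 - y1) + x1**2 * (y2 - y0) + x0**2 * (y1 - y2)) / denom
    c = (x1 * x2 * (x1 - x2) * y0 + x2 * x0 * (x2 - x0) * y1
         + x0 * x1 * (x0 - x1) * y2) / denom
    if a >= 0:
        # Not a concave peak shape: keep the raw apex.
        return x1, y1
    centroid = -b / (2 * a)           # vertex of the parabola
    height = c - b * b / (4 * a)      # parabola value at the vertex
    return centroid, height


# Usage: three samples of a peak whose true apex lies between two data points.
centroid_mz, apex_height = parabolic_centroid([1.0, 2.0, 3.0],
                                              [3.4375, 4.9375, 4.4375])
```

Fitting through only three points keeps the sketch short; a least-squares fit over more points of the profile would be less sensitive to noise.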

Metabolite Putative Assignment

*MetSign* performs metabolite putative assignment by matching the experimentally measured metabolite ion *m/z* value and the profile of the isotopic peaks with the theoretical data of metabolites recorded in the *MetSign* database, which is composed of all metabolites recorded in the Kyoto Encyclopedia of Genes and Genomes (KEGG),^{15} LIPID MAPS,^{16} and the Human Metabolome Database (HMDB).^{17} After incorporating all user-defined possible stable isotopes and adduct ions, *MetSign* first matches all the molecular ion *m/z* values of the *MetSign* database metabolites to each experimental *m/z* value of the centralized peaks. Because of limited mass accuracy, an experimental peak may be assigned to multiple isotopic peaks of different elemental compositions of database metabolites. In other words, the isotopic peaks of these putatively assigned metabolites overlap. Therefore, an iterative mean square error (MSE) algorithm is used to deconvolute the overlapped isotopic peaks.

Let *P*_{0} = {*P*_{1}, *P*_{2},..., *P*_{n}} be a group of experimental peak clusters that were assigned to the overlapped isotopic peaks of multiple metabolite elemental compositions, where *n* is the number of metabolite elemental compositions, *P*_{i} = {(*x*_{(i,1)}, *y*_{(i,1)}), (*x*_{(i,2)}, *y*_{(i,2)}),..., (*x*_{(i,mi)}, *y*_{(i,mi)})} is the collection of isotopic peaks of the *i*^{th} overlapped metabolite elemental composition, and *m*_{i} is the number of isotopic peaks in *P*_{i}. The intensity MSE aims to find *a*_{i} that satisfies

where *Y*_{(i,j)} is the theoretical abundance corresponding to *y*_{(i,j)}. Through MSE deconvolution, the isotopic distribution of each overlapped metabolite elemental composition is approximated to its theoretical distribution by minimizing the overall fitting error.

After initial isotopic peak deconvolution, Pearson's correlation coefficient is used to measure the similarity between the deconvoluted isotopic peaks and the theoretical isotopic peaks of each putatively assigned metabolite. A large similarity value, *i.e.*, close to 1, indicates a high probability that this metabolite is present in the experimental data. Recognizing a metabolite via its elemental composition alone is not reliable: it increases the chance of identifying false-positive metabolites, which in turn decreases the peak intensity attributed to true metabolites during MSE fitting. For this reason, *MetSign* employs an iterative MSE fitting procedure with an empirical, user-defined threshold on Pearson's correlation coefficient, such as 0.7. Any metabolite elemental composition with a correlation coefficient below this threshold is discarded during the iterative MSE fitting. For example, assume there are 10 metabolite elemental compositions, each with a mono-isotopic peak that matches one of the experimental peaks in a peak cluster. The metabolites with these 10 elemental compositions are considered potential overlapping metabolites. *MetSign* first performs the MSE fitting using these 10 metabolite elemental compositions. After fitting, Pearson's correlation coefficient between the fitted isotopic peak envelope and the corresponding theoretical isotopic peak envelope is calculated for each elemental composition. Any elemental composition with a correlation coefficient less than the predefined threshold of 0.7 is removed. *MetSign* repeats this process until all remaining metabolite elemental compositions have a correlation coefficient larger than 0.7. Finally, *MetSign* outputs all fitted metabolites ranked from high to low by Pearson's correlation coefficient.
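The fit-score-discard loop described above can be sketched as follows. Several details are assumptions made only for illustration: all envelopes are laid out on one shared peak grid, the scaling factors come from an unconstrained least-squares fit, and the deconvoluted envelope of a composition is taken as the observed signal minus the fitted contributions of the other candidates. The function names are hypothetical and the objective function actually used by *MetSign* may differ.

```python
import math


def pearson(u, v):
    """Pearson's r; returns 0.0 when undefined (too few points or zero variance)."""
    n = len(u)
    if n < 2:
        return 0.0
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def least_squares(columns, target):
    """Minimise sum_j (target_j - sum_i a_i * columns[i][j])^2 (normal equations).

    Simplification: no non-negativity constraint, and collinear candidate
    envelopes (a singular system) are not handled.
    """
    k = len(columns)
    # Augmented matrix [A^T A | A^T b], solved by Gauss-Jordan elimination.
    m = [[sum(p * q for p, q in zip(columns[r], columns[c])) for c in range(k)]
         + [sum(p * q for p, q in zip(columns[r], target))] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def iterative_mse_assign(observed, candidates, threshold=0.7):
    """Refit until every remaining candidate reaches the correlation threshold.

    observed:   peak areas on a shared grid of experimental peaks
    candidates: {name: theoretical abundances on the same grid, 0 where absent}
    Returns (name, correlation) pairs ranked from high to low.
    """
    active = dict(candidates)
    while active:
        names = list(active)
        coeffs = least_squares([active[n] for n in names], observed)
        scores = {}
        for name in names:
            theo = active[name]
            support = [j for j, t in enumerate(theo) if t > 0]
            # Assumed deconvolution: subtract the other candidates' fitted signal.
            deconv = [observed[j] - sum(c * active[o][j]
                                        for o, c in zip(names, coeffs) if o != name)
                      for j in support]
            scores[name] = pearson(deconv, [theo[j] for j in support])
        rejected = [n for n in names if scores[n] < threshold]
        if not rejected:
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for n in rejected:
            del active[n]
    return []


# Usage: A and B share experimental peaks; C matches peaks with the wrong shape.
candidates = {
    "A": [1.0, 0.6, 0.2, 0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0],
    "C": [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.1],
}
observed = [100.0, 110.0, 45.0, 5.0, 1.0, 5.0, 10.0]
ranking = iterative_mse_assign(observed, candidates, threshold=0.7)
```

In this toy input the observed areas on the first four peaks equal 100 × A + 50 × B, while the peaks matched by C rise where its theoretical envelope falls, so C is discarded in the first pass and the fit is repeated with A and B only.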

The iterative MSE optimization is for putative metabolite assignment only. The fitted isotopic peak envelope of each metabolite is only used for the calculation of Pearson's correlation coefficient to estimate the reliability of the metabolite putative assignment. For each peak matched to a mono-isotopic peak of one or more metabolite elemental compositions, the original peak area of the matched peak is carried forward for quantification analysis. Therefore, potential false-positive putative assignments of metabolites via the iterative MSE optimization will not affect the metabolite quantification in the downstream statistical analysis.

Peak List Alignment

The purpose of peak alignment is to recognize the metabolite peaks generated by the same type of metabolite in different samples. Alignment uses the results generated from the metabolite putative assignment as its input. For direct infusion experiments, *MetSign* performs peak alignment based on peak *m/z* values and the peak intensity profile of isotopic peaks. For LC-MS data, an additional feature, a user-defined retention time window, is employed to restrict the alignment search space. If multiple matches are detected in a target sample during the alignment of a peak in the reference sample, discrete convolution is used to find the peak in the target sample that correlates best with the peak in the reference sample.^{18}
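A minimal sketch of the matching logic for direct infusion data is given below. Note that it swaps in a plain cosine similarity between isotopic intensity profiles in place of the discrete convolution cited above, and that the function names, the `(mz, profile)` tuple layout, and the default tolerance `mz_tol` are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length intensity profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def align_peak(ref_peak, target_peaks, mz_tol=0.005):
    """Pick the target peak that best matches a reference peak.

    ref_peak and each entry of target_peaks are (mz, isotopic_profile) tuples.
    Candidates are first restricted to an m/z window; ties between several
    candidates are resolved by isotopic profile similarity.
    Returns None when no target peak falls inside the window.
    """
    ref_mz, ref_profile = ref_peak
    matches = [p for p in target_peaks if abs(p[0] - ref_mz) <= mz_tol]
    if not matches:
        return None
    return max(matches, key=lambda p: cosine_similarity(ref_profile, p[1]))


# Usage: two target peaks fall inside the m/z window; the one whose isotopic
# profile has the same shape as the reference profile is selected.
reference = (200.000, [100.0, 20.0, 5.0])
targets = [
    (200.002, [50.0, 40.0, 30.0]),
    (200.003, [210.0, 40.0, 11.0]),
    (201.000, [100.0, 20.0, 5.0]),
]
best = align_peak(reference, targets)
```

For LC-MS data, the same candidate filter could additionally require the retention time difference to fall within the user-defined window.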

Normalization

Three literature-reported normalization algorithms were implemented in *MetSign* for the user to select from: quantile normalization, cyclic loess normalization, and contrast-based normalization.^{19, 20} The well-known quantile normalization makes the distributions of two samples identical in statistical properties, an assumption that may not hold when comparing samples acquired from different biological conditions. Therefore, *MetSign* also implements a novel normalization method, entitled sample group-based quantile (SGQ) normalization. The hypothesis of SGQ is that the distributions of metabolites within the same sample group are identical. SGQ first performs quantile normalization for the samples that belong to the same sample group. After the quantile normalization, it employs a trimmed constant mean method to normalize all samples across the sample groups.
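The two SGQ stages can be sketched as below. The sketch assumes complete data (no missing values), positive intensities, and that the trimmed constant mean step rescales every sample so that its trimmed mean equals the average trimmed mean over all samples; the trimming fraction and all function names are illustrative assumptions.

```python
def quantile_normalize(samples):
    """Quantile-normalize samples given as equal-length lists of peak areas."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the r-th smallest value across samples defines the shared distribution.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices, smallest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    v = sorted(values)
    k = int(len(v) * trim)
    core = v[k:len(v) - k] if len(v) > 2 * k else v
    return sum(core) / len(core)


def sgq_normalize(groups, trim=0.1):
    """groups: {group name: list of samples}. Returns the same structure.

    Step 1: quantile normalization within each sample group.
    Step 2 (assumed form): scale each sample to a common trimmed mean.
    """
    within = {g: quantile_normalize(samples) for g, samples in groups.items()}
    means = [trimmed_mean(s, trim) for samples in within.values() for s in samples]
    target = sum(means) / len(means)
    return {g: [[x * target / trimmed_mean(s, trim) for x in s] for s in samples]
            for g, samples in within.items()}


# Usage: two groups of two samples, three metabolites each.
groups = {
    "control": [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]],
    "treated": [[50.0, 20.0, 30.0], [40.0, 10.0, 60.0]],
}
normalized = sgq_normalize(groups)
```

Because quantile normalization is applied only within a group, differences in distribution shape between biological conditions are preserved, and only the overall scale is equalized across groups.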

Statistical Significance Tests

The purpose of statistical analysis is to find metabolites that have significantly different expression levels in different sample groups. Due to the limitations of the analytical platform, some low-level metabolites may not be detected, and such metabolites are represented as missing values in the normalization table. *MetSign* first employs Fisher's exact test to study the presence and absence of each metabolite between sample groups. It then employs the Grubbs' test^{21} for outlier detection to find the responses of a metabolite that are not consistent with the responses of the same metabolite in the remaining samples of the same sample group. After removing the outliers, an abundance test such as the pairwise two-tailed *t*-test is performed to detect the abundance changes of each metabolite between two sample groups, and the false discovery rate (FDR) is used to correct for multiple comparisons.^{22}
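Two of these steps lend themselves to a short sketch: the presence/absence test on a 2×2 table of detected versus missing values, and the multiple-comparison correction. The FDR procedure is assumed here to be the Benjamini-Hochberg step-up method, which the text does not name explicitly, and the function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Example layout: a/b = samples in group 1 where the metabolite was
    detected/missing, c/d = the same counts for group 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x detections in group 1.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):      # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Usage: a metabolite detected in 3 of 3 samples in one group and 0 of 3 in the other.
p_presence = fisher_exact_two_sided(3, 0, 0, 3)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```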

Unsupervised Clustering

Pattern recognition aims to study the differences in the metabolite expression profiles acquired under different physiological conditions. Samples that have similar features are grouped into the same cluster. *MetSign* first filters data based on a user-defined frequency threshold *f*_{t}, where the frequency of a metabolite is defined as the number of samples in which it was detected divided by the number of all samples. The *k*-nearest neighbor (KNN) imputation algorithm is then used to estimate the missing data.^{23} Because the data contain a large number of metabolites and a small number of samples, a dimensionality reduction method can be applied before clustering to eliminate redundant information in the original data and to improve computing efficiency. *MetSign* provides two dimensionality reduction methods, principal component analysis (PCA)^{24} and partial least squares (PLS),^{25} as options if the user decides to reduce the data dimensionality before clustering. Three clustering methods were implemented in *MetSign*: *k*-means clustering, agglomerative hierarchical clustering, and fuzzy C-means clustering.^{26} The clustering accuracy (CA) is further calculated as the number of correctly clustered samples divided by the number of all samples.
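The frequency filter and the clustering accuracy are simple ratios and can be sketched directly. Two points are assumptions: metabolites whose frequency equals *f*_{t} are kept, and a sample counts as correctly clustered when its known sample group is the majority group of its cluster. Missing values are represented as `None`, and the function names are illustrative.

```python
from collections import Counter


def filter_by_frequency(table, f_t):
    """Keep metabolites detected in at least a fraction f_t of all samples.

    table: {metabolite: list of responses, one per sample, None if missing}
    """
    kept = {}
    for metabolite, values in table.items():
        frequency = sum(v is not None for v in values) / len(values)
        if frequency >= f_t:
            kept[metabolite] = values
    return kept


def clustering_accuracy(cluster_ids, true_groups):
    """Fraction of samples whose true group is the majority group of their cluster."""
    members = {}
    for cluster, group in zip(cluster_ids, true_groups):
        members.setdefault(cluster, []).append(group)
    correct = sum(Counter(groups).most_common(1)[0][1] for groups in members.values())
    return correct / len(true_groups)


# Usage
table = {"m1": [1.0, None, 3.0, 4.0], "m2": [None, None, None, 2.0]}
filtered = filter_by_frequency(table, f_t=0.5)
accuracy = clustering_accuracy([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"])
```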

Temporal Analysis

Temporal analysis can generate the time course plots of all metabolites and cluster the metabolites based on their time course trajectories (response *vs.* time). *MetSign* automatically displays the time course trajectories of a metabolite generated from different sample groups in the same plot, and then employs the correlation and distance to characterize the relation between the time course trajectories. The correlation between the time course trajectories of the same metabolite is calculated using Spearman's rank-order correlation coefficient. For the calculation of the distance between the time course trajectories, *MetSign* first calculates the difference of the metabolite response (*i.e.*, peak area) between sample groups at each time point, which is represented as the probability of one-way analysis of variance (ANOVA).^{27} The Fisher value *p*_{F} is then computed to show the degree of difference (*i.e.*, the distance) between the time course trajectories. *p*_{F} is defined as follows:

where *p*_{i} is the probability of the ANOVA results of a metabolite at the *i*^{th} time point, and *n* is the number of time points. A higher Fisher value *p*_{F} indicates a larger distance between the time course trajectories of the metabolite of interest in the sample groups being compared.
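As a hedged illustration only, the sketch below assumes that the Fisher value follows Fisher's method for combining independent probabilities, *i.e.*, minus two times the sum of the natural logarithms of the per-time-point ANOVA probabilities. This form is consistent with the statement that a higher value indicates a larger distance, but it is an assumption rather than the formula as printed in the original, and the function name is hypothetical.

```python
import math


def fisher_value(p_values):
    """Combine per-time-point ANOVA probabilities into a single distance score.

    Assumed form (Fisher's combined probability statistic): -2 * sum(ln p_i).
    Each p_i must be greater than 0; small probabilities give a large score.
    """
    return -2.0 * sum(math.log(p) for p in p_values)


# Usage: a metabolite that differs between groups at most time points scores
# higher than one whose responses are similar at every time point.
distance_different = fisher_value([0.01, 0.02, 0.05, 0.30])
distance_similar = fisher_value([0.80, 0.65, 0.90, 0.70])
```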