This in silico collection of human transcriptomes was constructed by collecting 9,783 publicly available Affymetrix microarray experiments in the form of CEL files as source material. The uniqueness of the collected files was tested with the cyclic redundancy check algorithm (cksum). For a complete listing of the original source data from 157 separate studies, please see Additional data file 3. We combined data from the following Affymetrix generations (HG-U95A, HG-U95Av2, HG-U133A, HG-U133B, HG-U133 Plus 2). Even though HG-U133A and HG-U133B are not different generations, they do have 2,074 common genes, and we considered them as such for the practical purposes of our normalization.
Data from all CEL files were pre-processed with the MAS5.0 algorithm [18
] with default parameters. Although different opinions exist about optimal preprocessing methods [52
], recent comparison studies indicate that MAS5.0 provides the most faithful cellular network construction [53
] and optimal identification of differentially expressed genes [54
]. In addition, other preprocessing methods may create false positive results [53
]. We used version 10 of the alternative CDF files [20
] summarizing the probe level intensities directly to the Ensemble [57
] gene IDs (Ensembl build 46). Probes mapping to multiple genes and other problems associated with old generations of Affymetrix probe designs were thereby excluded. Within our normalization process the term pre-processing refers only to steps performed by the MAS5.0 algorithm, and subsequent normalization steps are described below.
Sample-wise normalization with equalization transformation
We utilized equalization transformation (Q) [21
], a method similar to widely used quantile normalization [22
], to normalize the pre-processed data. After Q normalization, the dataset had the desired distribution that has been determined prior to transformation. The normal distribution with mean of 8 and standard deviation 2 (N
)) was selected as the desired distribution since the distribution of logarithmic, preprocessed values of all samples (N = 9,783) with median 7.92 and standard deviation 2.3 was near to this distribution (Additional data file 1). EQ values were brought to exponential scale to maintain the scale of the original values.
The quantile normalization [22
] would be another choice to perform normalization but has considerable drawbacks in this particular setting. First, it does not perform well when there is variation in the number of genes between samples. This problem is magnified when merging thousands of samples from different array generations. Also, the means of the quantiles may vary substantially when new samples are added to the dataset, whereas the change caused by the equalization transformation is smaller. Quantile normalization is also resource-intensive to compute for thousands of samples with different numbers of measured genes. Thus, equalization transformation (Q) [21
] was the method of choice in this study.
Array-generation-based gene centering (AGC)
To be able to compare the samples of in silico transcriptomics also between the array generations, we developed a novel method for gene-wise normalization of the data. In this AGC method we assume that the mean of the expression values for any particular gene in each array generation is the same. If the mean value of some of the array generations differs substantially from the others, the shift is assumed to be caused by the array generation based variation, and the AGC method aims to correct this variation. The AGC method requires that the collection of samples to be analyzed is large enough so that one can assume the distribution of values of each gene k to represent the total distribution of all potential expression values across all tissues for each array generation i. Therefore, the AGC method normalizes the data to have mean values μi, k= μall, k for all array generations i, where μall, k is the mean of all values of the gene k. Further, it is assumed that the minimum and the maximum estimates for the gene value are reached and the range of the gene k should approximately be [ak, bk], where ak is the lowest 2% value and bk is the largest 2% value of gene k. AGC values should not go over this range. However, if the new centered value exceeds the range, the difference is diminished towards the range limits with coefficient c, 0 ≤ c ≤ 1. Here, the coefficient is set to c = 1/5. Coefficient c is necessary to prevent some extremely tissue-specific genes from having arbitrarily large correction factors, which is possible if the specific tissue is absent from one or more array generation. The coefficient c affects 2.9% of all correction factors. Of those cases, the proportion of the correction factor modified by coefficient c was, on the average, 7.6%. Thus, the coefficient c affected an extreme minority of the corrections in a significant manner, but nevertheless, it was found to be crucial for the AGC method. The centered values can now be obtained with:
where xi, j, k is the value of gene k in sample j from array generation i, μi, k is the mean of the values of gene k across array generation I, and μall, k is the mean of the values of gene k across all array generations. Further, the adjusted values are computed based on the equation:
The resulting AGC values are now AGCvalue = 2y.
Some other methods [58
] are useful to combine different datasets. However, these are computationally very demanding and probably impractical for datasets comprising almost 10,000 samples. Additionally, the performance of these methods is not validated for integration of multiple datasets.
Sample annotation and manual curation
Annotation of the samples is important to make biological and medical sense of the data. Since not all sources of CEL files come with annotations following the MIAME standards [60
], we performed manual annotation of all the data in the database. Annotation terms linked to each sample were defined by a team of seven biologists and medical doctors. The content of the database in terms of healthy, malignant and other disease samples can be seen in Additional data file 4
Gene annotation is based on Ensembl. The database has data for each Ensembl gene, even those not featured on any arrays. Gene data include transcript and protein product information, chromosome name and position (band and nucleotide count), biotype (protein coding, miRNA, ribosomal, and so on), and Hugo and Entrez IDs for each gene. These data were downloaded from the Ensembl web site, using the same Ensembl genome build version (release 46) as that used for the construction of the used alternative CDF files [20
Multidimensional scaling and clustering accuracy
We utilized classic MDS in order to diminish the number of the dimensions within the data [24
]. With MDS, 1,137 samples with 7,390 dimensions (that is, genes) were brought to low-dimensional space so that the distance between each sample pair with these new dimensions is very close to the distance between the original values of the samples. As a distance metric, we used Manhattan distance.
K-means clustering and rand index analysis
K-means clustering was performed with default parameters in R. The initial centroids were given as the median value of each gene in array generations or tissues. The algorithm was allowed to run for a maximum of 100,000 iterations for each clustering. The corrected rand index [23
] was calculated in R with fpc library.
Replicate analysis was performed by comparing the correlation coefficients of the logarithmic values of two or three hybridizations from a single biological sample using standard methods of computing the Pearson correlation coefficient. This was done for all samples described in [9
Body-wide expression profiles of genes
We visualize the expression profile of a single gene across all human tissues with boxplots and with custom designed body-wide expression plots. In the boxplots, the expression profiles of a single gene are displayed and grouped into healthy samples (green boxes) and malignant samples (red boxes). Both types are in anatomically meaningful order, allowing easy comparison of related tissue types. Numbers of samples in each tissue type are in parentheses.
Custom designed body-wide expression profiles show the expression pattern of a single gene at the level of individual samples, while its layout allows easy analysis of the biological or medical significance of the profile. The y-axis provides the expression level of the gene and the x-axis contains all samples arranged into a fixed order by the type of the sample (healthy, malignant) and subsequently by the tissue type. Thus, each dot describes the expression level of a particular gene in one sample. The anatomical origin of each sample can be seen from the color bar at the bottom of the image. Tissues expressing the gene at a high level (more than one standard deviation higher than the baseline for that gene or having a group of outlier data points) are colored.
Body-wide gene expression heatmaps for human cancer genes
Bodywide expression maps of genes are done with hierarchical clustering (Euclidean distance with Ward linkage) of mean expression profile for 342 genes across 110 in vivo tissues. The number of samples per tissue type is given in parentheses. Values for each gene are mean-centered at 0 with a standard deviation of 1.
Availability of data
As the in silico
transcriptomics data of this project are composed of custom integration of already public microarray data we provide a table describing the origins of the data used to construct GeneSapiens (Additional data file 3). We have set up a website [16
] to allow browsing of expression profiles of these genes and associated information as well as generation of correlations/scatterplots between any pairs of genes across any tissues.