mRNA from human tissues was purchased from commercial vendors, including Clontech, Ambion, and Biochain. Most samples were pooled from multiple donors, typically twelve. Forty-two normal tissues were tested, including adipose, adrenal gland, bladder, activated CD4-positive T-lymphocyte, activated CD8-positive T-lymphocyte, bone marrow, brain, fetal brain, cerebellum, cerebral cortex, hippocampus, thalamus, pituitary gland, cervix uteri, colon, epididymis, heart, kidney, fetal kidney, liver, fetal liver, lung, fetal lung, trachea, lymph node, mammary gland, skeletal muscle, ovary, placenta, prostate, retina, salivary gland, skin, duodenum, ileum, jejunum, spinal cord, spleen, stomach, testis, thymus, and thyroid gland. These tissues cover most major organs and normal tissue types. Four fetal tissues (brain, kidney, liver, and lung) were included.
Microarray expression profiling
Human tissue microarray expression profiling was performed as described previously [53]. In brief, purchased mRNA pooled from multiple normal individuals was amplified and labeled using a full-length amplification protocol and hybridized in duplicate against a common reference pool in a two-color dye-swap experiment [54]. Each gene is represented by 3 microarray probes placed at exon-exon junctions or in exons. Gene expression was calculated as the median probe intensity, after normalization by the pool of all data. The dataset is available at the National Center for Biotechnology Information's Gene Expression Omnibus database [GEO accession: GSE16546].
Selection of HKGs and TEGs
We used fairly conservative criteria to identify HKGs: the intensity of the gene must be greater than the median intensity of all genes in the microarray in at least 41 out of 42 tissues, and the coefficient of variation (CV, standard deviation divided by the mean) of the gene intensity across tissues must be less than 1. The intensity and CV of the 18,149 genes monitored in the microarray are distributed over a wide range, with an average intensity of all genes of 1.04 ± 1.94 (SD) and an average CV of all genes of 0.83 ± 0.77 (SD). A recent study shows that a gene's breadth of expression across tissues is positively correlated with its expression level [24]. It is therefore reasonable to select HKGs from among genes with higher intensity. While the CVs of most genes (76% of all genes) are below 1, the expression of some genes is very volatile across tissues, with CVs as high as 6. Our criteria ensure that the HKGs are highly expressed in the vast majority of tissues, with limited fluctuation in intensity across tissues.
More stringent criteria were used to identify a reference HKG list for laboratory experimental controls. We required that the intensity of each HKG be greater than the median of all genes in each of the 42 tissues and that the CV of its intensity be less than 0.35. A total of 362 HKGs meet these criteria. The top 20 genes ranked by their average intensity across all 42 tissues were selected as the reference set of experimental housekeeping genes.
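The two sets of HKG criteria differ only in their thresholds, so they can be illustrated with one small filter. The following is a minimal sketch, not the code used in this study. It assumes that expression is held as a mapping from gene to a list of per-tissue intensities, that "the median intensity of all genes" is taken within each tissue, and that the CV uses the sample standard deviation; the function names and the toy data are hypothetical.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample SD is an assumption)."""
    m = mean(values)
    return stdev(values) / m if m else float("inf")


def select_hkgs(expr, min_tissues, max_cv):
    """Return genes above the per-tissue median in at least `min_tissues`
    tissues and with a cross-tissue CV below `max_cv`.

    expr: dict mapping gene -> list of intensities, one per tissue,
          with the same tissue order for every gene.
    """
    genes = list(expr)
    n_tissues = len(expr[genes[0]])
    # Median intensity of all genes, computed separately for each tissue.
    tissue_medians = [median(expr[g][t] for g in genes) for t in range(n_tissues)]
    selected = []
    for g in genes:
        n_above = sum(v > tissue_medians[t] for t, v in enumerate(expr[g]))
        if n_above >= min_tissues and coefficient_of_variation(expr[g]) < max_cv:
            selected.append(g)
    return selected


def top_by_mean(expr, genes, n):
    """Rank the given genes by mean intensity across tissues; keep the top n."""
    return sorted(genes, key=lambda g: mean(expr[g]), reverse=True)[:n]


# Toy data: five hypothetical genes measured in three tissues.
toy = {
    "A": [10, 11, 12],
    "B": [8, 9, 8],
    "C": [1, 1, 1],
    "D": [2, 30, 2],
    "E": [3, 2, 3],
}
relaxed = select_hkgs(toy, min_tissues=2, max_cv=1.0)
strict = select_hkgs(toy, min_tissues=3, max_cv=0.35)
reference = top_by_mean(toy, strict, n=1)
```

With the 42-tissue data, the conservative criteria would correspond to `min_tissues=41, max_cv=1.0`, and the reference list to `min_tissues=42, max_cv=0.35` followed by `top_by_mean(..., n=20)`.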
To identify TEGs, we selected 29 representative tissue types, removing fetal and redundant tissues from the set of 42 tissues described above. The resulting set was as follows: adipose, adrenal gland, bladder, bone marrow, brain, cervix uteri, colon, heart, kidney, liver, lung, trachea, mammary gland, ovary, skeletal muscle, lymph node, placenta, prostate, retina, salivary gland, skin, spinal cord, spleen, stomach, testis, thymus, thyroid gland, jejunum, and CD4-positive T-lymphocyte. To be identified as a TEG, the intensity of the gene in the relevant tissue was required to meet three criteria: 1) it ranks among the top 25% of all genes in that tissue; 2) it is greater than 50% of the sum of the intensities of that gene in all other tissues in the set of 29; and 3) it is greater than three times the intensity of that gene in any other tissue.
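The three TEG criteria can be checked gene by gene and tissue by tissue. The sketch below is illustrative only and rests on stated assumptions: expression is a nested mapping from gene to tissue to intensity, intensities are positive, and "top 25%" is read as a rank cutoff at the top quarter of genes within the tissue. The function name, gene labels, and values are hypothetical.

```python
def select_tegs(expr, tissues):
    """Return a dict gene -> tissue for genes meeting all three criteria.

    expr: dict mapping gene -> dict mapping tissue -> intensity.
    """
    genes = list(expr)

    # Criterion 1 cutoff: the lowest intensity still inside the top quarter
    # of genes in each tissue (rank-based reading of "top 25%").
    cutoff = {}
    for t in tissues:
        ranked = sorted((expr[g][t] for g in genes), reverse=True)
        k = max(1, len(ranked) // 4)
        cutoff[t] = ranked[k - 1]

    tegs = {}
    for g in genes:
        for t in tissues:
            value = expr[g][t]
            others = [expr[g][u] for u in tissues if u != t]
            if (
                value >= cutoff[t]                # 1) top 25% in this tissue
                and value > 0.5 * sum(others)     # 2) > 50% of the sum elsewhere
                and value > 3 * max(others)       # 3) > 3x any other tissue
            ):
                tegs[g] = t
                break  # with positive intensities, criterion 3 allows one tissue at most
    return tegs


# Toy data: four hypothetical genes in three tissues.
toy_tissues = ["liver", "heart", "brain"]
toy_expr = {
    "G1": {"liver": 100, "heart": 5, "brain": 4},
    "G2": {"liver": 10, "heart": 12, "brain": 11},
    "G3": {"liver": 2, "heart": 40, "brain": 3},
    "G4": {"liver": 1, "heart": 1, "brain": 1},
}
toy_tegs = select_tegs(toy_expr, toy_tissues)
```

In this toy set, G2 is broadly expressed and fails criteria 2 and 3 in every tissue, whereas G1 and G3 are each dominated by a single tissue.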
Conservation of functions
We used the number of orthologs of human genes in other eukaryotic species, as identified by NCBI HomoloGene [23], as an indication of functional conservation across species. We mapped human HKGs, TEGs, and all genes represented in the microarray to orthologs in mouse, rat, dog, fly (D. melanogaster), worm (C. elegans), and budding yeast (S. cerevisiae). The numbers of human genes that map to genes of other species through HomoloGene were counted. Student's t-tests were applied between orthologs of HKGs and all genes and between orthologs of TEGs and all genes.
Distribution of SD and CNV in genes
A gene was counted as overlapping a segmental duplication (SD) or copy number variation (CNV) region only if at least a quarter of its total genomic length fell within that region (Table ). The p-values, indicating the statistical significance of the overlap for HKGs and TEGs relative to all RefSeq genes, were calculated according to the hypergeometric distribution with a Bonferroni correction.
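The overlap rule and the significance test can be sketched as follows. This is a hedged illustration rather than the pipeline used here: it assumes half-open coordinates on a single chromosome, tests only for enrichment (the upper tail; depletion would use the lower tail), and uses hypothetical function names and toy numbers.

```python
from math import comb


def overlap_fraction(gene_start, gene_end, regions):
    """Fraction of the gene [gene_start, gene_end) covered by the regions.

    regions: iterable of (start, end) half-open intervals on the same
    chromosome; overlapping regions are not double counted.
    """
    covered = 0
    last_end = gene_start
    for start, end in sorted(regions):
        start = max(start, last_end)  # clip to the gene and skip counted bases
        end = min(end, gene_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered / (gene_end - gene_start)


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw without replacement.

    population: all genes (e.g. RefSeq genes); successes: genes overlapping
    the feature; draws: size of the gene set; k: overlapping genes in the set.
    """
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, min(successes, draws) + 1)
    )
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy example: a gene at 100-200 with three nearby regions.
frac = overlap_fraction(100, 200, [(50, 120), (110, 130), (190, 300)])
counts_as_overlapping = frac >= 0.25

# Toy example: 10 genes in total, 4 overlapping, set of 3 with 2 overlapping.
p_raw = hypergeom_upper_tail(2, population=10, successes=4, draws=3)
p_adj = bonferroni(p_raw, n_tests=2)
```

Exact binomial coefficients keep the sketch dependency-free; for genome-scale counts a statistics library would be faster.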
CpG island coordinates were obtained from the human CpG island track of the UCSC Genome Browser (http://genome.ucsc.edu). The number and length of CpG islands located within 500 bp upstream and downstream of transcription start sites and end sites were calculated for HKGs, TEGs, and RefSeq genes. CpG density is indicated by the fraction of base pairs occupied by CpG islands. The hypergeometric distribution with Bonferroni correction was applied to determine the significance of the enrichment or depletion of CpG islands relative to the density seen for RefSeq genes.
Chromatin structure and epigenetic modifications
Data on DNase I hypersensitive (HS) sites, histone acetylation and methylation, transcription factor binding sites, and DNA methylation were obtained from recent publications [35]. The density of each feature was calculated in a 500 bp sliding window, advancing 100 bp at a time, near transcription start sites for HKGs, TEGs, and RefSeq genes. The average intensity of all genes in each group is plotted as a function of the distance to the transcription start site.
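The sliding-window calculation can be sketched as below. This is only an illustration under several assumptions that the text does not specify: features are treated as point positions, density is the number of sites per window, genes are on the plus strand, and the flank examined around the transcription start site (here the `flank` parameter) is a hypothetical choice. The function names are also hypothetical.

```python
def window_profile(site_positions, tss, flank=2000, window=500, step=100):
    """Count feature sites in windows sliding across [tss - flank, tss + flank].

    Returns a list of (window-centre offset from the TSS, site count).
    """
    profile = []
    start = -flank
    while start + window <= flank:
        lo = tss + start
        hi = lo + window  # half-open window [lo, hi)
        count = sum(lo <= p < hi for p in site_positions)
        profile.append((start + window // 2, count))
        start += step
    return profile


def average_profiles(profiles):
    """Average per-gene profiles (same window layout) into a group profile."""
    n_genes = len(profiles)
    n_windows = len(profiles[0])
    return [
        (profiles[0][i][0], sum(p[i][1] for p in profiles) / n_genes)
        for i in range(n_windows)
    ]


# Toy example: two sites around a TSS at position 1000, with a 500 bp flank.
gene_a = window_profile([900, 1100], tss=1000, flank=500)
gene_b = window_profile([], tss=5000, flank=500)
group = average_profiles([gene_a, gene_b])
```

Plotting the second element of each pair in `group` against the first gives a curve of average density versus distance to the transcription start site.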