We collected data from the NIH NCBI Gene Expression Omnibus for 16 different studies of immunodeficiency (). Importantly, these gene expression measurements are from many different forms of immune dysfunction and span multiple species and different tissue types. The gene expression samples in each study were hand annotated and further divided into 37 different experimental comparison subgroups (such as different mouse backgrounds used in the same study). Gene expression levels were compared between immune deficient samples and controls (normal immune function samples) in each subgroup using the modified t test proposed by Tusher et al37
and incorporated into the significance analysis of microarrays.38
The annotations of all samples are available upon request.
Publicly-available gene expression experimental data used in multiplex meta-analysis
We calculate the mean log fold change38
of each oligonucleotide in each array (m
) in each experimental class (k
, either i
for immunodeficiency or c
for control) and each subgroup (g
), with nk,g
indicating the total number of arrays of class k
in subgroup g
These mean log fold changes are then used in the modified t test (equation 2
) introduced by Tusher et al37
and explained in detail in Witten and Tibshirani,38
using the standard deviation (σg
) and a value so,g
(scaling factor) selected to minimize the coefficient of variation of Tg
across all oligonucleotides.
The modified t statistic Tg
is then bootstrapped to calculate a p value (pg
) for each gene. The oligonucleotides on the array are mapped to genes using AILUN39
and across species using HomoloGene groups.40
Using a method proposed by Fisher41
and based on the fact that p values selected at random should be uniformly distributed, we can look for deviations by the χ2
test (equation 3
), with pg
the p value for that gene in the subgroup g
the number of subgroups measuring that gene.
Importantly, this method will freely mix different directions of variation across studies, up or down, using only the significance of the test. The final meta p values may then be used to rank the significance of expression differences between immune deficient and normal controls across studies. We predict that genes that are significantly differentially expressed across studies identified through our method are more likely to be involved in immunodeficiency; the lower the p value, the greater the priority given.
To establish a baseline and identify genes involved with the immune system in general, we also took lists of genes annotated for involvement in acquired immunity, innate immunity, and inflammation taken from the Molecular Signatures Database.42
The genes with an annotation in each respective category were taken as predictions for having variants associated with disease risk, and those genes lacking annotations were taken to be predictions for those genes not being closely coupled to variants associated with increased disease risk. For example, one strategy to prioritize CVID related genes would be to first investigate all those genes annotated for being involved in the biological process of ‘acquired immunity’, using the presence of that annotation on a gene as the predictor for involvement with CVID.
To evaluate our ability to prioritize genes with variants associated with immunodeficiency, we compiled a manually expertly curated list of 131 genes known to have variations associated with immunodeficiency in general. However, as that list might be biased toward genes discovered through differential expression, we also used the results of a recent genome-wide association study of 363 patients with CVID and 3031 healthy controls. We took the results of this study and identified 31 genes linked to variations associated with CVID,43
prominent among them the genes of the major histocompatibility complex (MHC).
We evaluated the ability to prioritize the genes with variants implicated in immunodeficiency using the receiver operating characteristic (ROC) curve. The area under the ROC curve summarizes the power of a prediction method into a single numerical value, in this case predicting the association of genes likely to contain disease-associated variants. An area of 0.5, or a curve along the diagonal indicates no predictive power. A larger area under the curve (AUC) implies better discrimination ability. We obtain a p value by comparing the observed AUC against 1000 randomizations.