The concept behind the proposed integrative analysis of SNP and GE markers is also applicable to predicting disease status in biomedical studies and drug response in pharmacogenomics studies. Genome-wide association studies that identify disease susceptibility genes using a large number of SNPs suffer from the problem of missing heritability and are limited in explaining the etiology of complex diseases [40]. However, with the aid of GE, it is possible to increase the proportion of explained genetic variation, which in turn improves prediction accuracy. In view of the potential importance of integrative analysis of SNP and GE markers in population genetics, forensic sciences, and medical genetics, we developed the BIASLESS software. BIASLESS is a free, publicly available, and user-friendly analysis tool for selecting important predictive marker sets from large numbers of biomarkers for the inference of ethnic groups, disease groups, and drug response groups.
The method and software introduced in this paper can be used to construct high-accuracy and cost-beneficial AIM panels. Nevertheless, the main focus of this paper is not the construction of AIM panels but the introduction of an integrative analysis of SNP and GE markers for discriminating samples from various populations, especially closely related ancestral lineages. We do not intend the AIMs identified in this study to replace the AIMs found earlier for the CEU, CHB, JPT, and YRI populations. Some of the AIMs identified in this study may be limited by the small to moderate number of samples in the HapMap II Project; therefore, the generality of the identified AIMs should be further examined in more independent samples and confirmed by biological verification, such as real-time reverse-transcription polymerase chain reaction, before the AIMs are applied in practical studies.
Although GE markers, which are more variable than SNPs, may be altered by population-specific food preferences or environmental exposures, previous studies have provided evidence of a genetic basis for global GE [28]. Moreover, this study analyzed GE data from total RNA samples extracted from Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines of the study individuals [35]. The GE variation of lymphoblastoid cell lines, which are important materials for dissecting the genetic basis of GE variation in human populations [23], reflects a substantially higher proportion of genetic effect than of the effect of food preferences or environmental exposures [46]. The genetic basis of global GE is also supported by previous studies. An important genomic study of global GE variation validated that the discrepancy in GE between Asian and Caucasian samples has a genetic contribution and is not an artifact of lifestyle. That study showed that 24 Han Chinese residing in Los Angeles had GE profiles much more similar to those of the 82 HapMap CHB + JPT samples than to those of the 60 HapMap CEU samples [44]. Another important genomic study of GE also uncovered a genetic contribution to global patterns of GE after adjusting for potential confounding factors that may influence GE. That study analyzed GE data of 270 individuals from four HapMap II populations and found that GE variation differed in population comparisons, in agreement with earlier studies [36].
GE variation may also be influenced by the type of biological specimen, the time and other circumstances of sample collection, and the GE microarray platform. This study provides a proof-of-concept method for constructing AIM panels by integrating SNP and GE markers, but the current results are still limited by the use of a single cell type (lymphoblastoid cell lines), fixed time and circumstances of sample collection, and a single microarray platform (Illumina's Sentrix Human-6 Expression BeadChip). More investigations should be carried out to determine what proportion of the identified AIMs are specific to the conditions used here and what proportion are transferable to more general conditions. For practical applications, we also plan to integrate SNP and GE variation from global genomic studies and construct a larger reference database for normalizing GE data. SNP and GE markers will be integrated to identify AIMs and establish robust discriminant models using the BIASLESS software. Biological specimens from a tested individual will then be collected and used to genotype or measure the identified and confirmed AIMs. Finally, the SNP genotypes and GE levels of the tested individual will be plugged into the discriminant models to determine the ethnic group.
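The final two steps of the planned workflow (normalizing a measured GE level against a reference database and plugging the markers into discriminant models) can be sketched as follows. This is a minimal illustration under stated assumptions, not the BIASLESS implementation: the marker names, group labels, reference statistics, and model coefficients are all hypothetical, the SNP is assumed to be coded as a minor-allele count (0, 1, or 2), and a plain z-score stands in for whatever normalization the reference database would support.

```python
def zscore(value, ref_mean, ref_sd):
    """Normalize a GE level against reference statistics (assumed z-score)."""
    return (value - ref_mean) / ref_sd


def classify(features, models):
    """Return the group with the highest linear discriminant score.

    `features` maps marker name -> value for the tested individual.
    `models` maps group label -> {"intercept": float, "weights": {marker: float}}.
    """
    scores = {}
    for group, model in models.items():
        linear_part = sum(model["weights"][name] * x for name, x in features.items())
        scores[group] = model["intercept"] + linear_part
    best_group = max(scores, key=scores.get)
    return best_group, scores


# Hypothetical tested individual: one SNP genotype and one GE measurement.
features = {
    "snp_1": 2,                                       # minor-allele count
    "ge_1": zscore(9.0, ref_mean=8.0, ref_sd=0.5),    # normalized GE level
}

# Hypothetical discriminant models, one linear score per group.
models = {
    "group_A": {"intercept": -1.0, "weights": {"snp_1": 0.5, "ge_1": 1.0}},
    "group_B": {"intercept": 0.0, "weights": {"snp_1": -0.5, "ge_1": 0.25}},
}

predicted_group, scores = classify(features, models)
print(predicted_group, scores)
```

In practice the weights would come from discriminant models fitted on the reference samples, and the normalization would need to account for specimen type and platform, as discussed above.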
Regarding the supervised classification method, two points are important to discuss. First, we modified the efficient and broadly used FDA algorithm and integrated forward variable selection and cross-validation procedures with FDA to select key predictive markers from enormous numbers of SNP and GE markers; we then built accurate classification models for sample subdivision. Our supervised classification procedure provides multiple candidate models (e.g., 10 in a 10-fold cross-validation). Choosing the model with the highest testing accuracy is recommended, but testing accuracy should not be the only criterion for model selection. Other optimality criteria and domain knowledge may need to be considered to determine the best model, one that satisfies both statistical properties and biological relevance. For example, the cross-validation consistency of a model among all candidate models may be considered at the same time, or genetic knowledge, biological relevance, and quality evaluation of genetic markers may be integrated to assist in selecting the final classification model. Second, there is a very rich body of literature in the field of supervised classification, including support vector machines [47] and classification trees [48]. Different algorithms have pros and cons in different study scenarios and data types. We are adding various classification algorithms to further enrich the BIASLESS software.
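The idea of weighing testing accuracy together with cross-validation consistency can be sketched as below. This is an illustrative sketch only, not the selection rule implemented in BIASLESS: cross-validation consistency is assumed here to mean the number of folds that selected exactly the same marker set, ties in mean testing accuracy are broken by that count, and the marker names and accuracies are hypothetical.

```python
from collections import Counter


def rank_candidate_models(candidates):
    """Rank candidate models from cross-validation.

    `candidates` is a list of (marker_list, test_accuracy) pairs, one per fold.
    Models are ranked by mean testing accuracy, then by consistency, i.e. the
    number of folds that selected the same marker set (an assumed definition).
    Returns a list of (sorted_markers, mean_accuracy, consistency) tuples.
    """
    consistency = Counter(frozenset(markers) for markers, _ in candidates)

    # Collect the testing accuracies observed for each distinct marker set.
    accuracies = {}
    for markers, acc in candidates:
        accuracies.setdefault(frozenset(markers), []).append(acc)
    mean_acc = {key: sum(vals) / len(vals) for key, vals in accuracies.items()}

    ranked = sorted(
        mean_acc,
        key=lambda key: (mean_acc[key], consistency[key]),
        reverse=True,
    )
    return [(sorted(key), mean_acc[key], consistency[key]) for key in ranked]


# Hypothetical candidate models from a 4-fold cross-validation.
candidates = [
    (["snp_1", "ge_1"], 0.95),
    (["snp_1", "ge_1"], 0.95),
    (["snp_2"], 0.95),
    (["snp_1", "ge_2"], 0.90),
]

for markers, acc, folds in rank_candidate_models(candidates):
    print(markers, acc, folds)
```

Here two marker sets tie on mean testing accuracy, and the set selected in more folds is ranked first. Domain knowledge such as marker quality or biological relevance would still need to be applied to the ranked list by the analyst.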
This study analyzed data from the HapMap II Project, which contains only four populations, rather than the HapMap III Project, which contains 11 populations, because GE data are not available for the majority of samples in the HapMap III Project. However, the proposed method and software can be applied in general to construct AIM panels for additional populations. The SNP data in this study came from two genotyping platforms: the Affymetrix 500K and Array 6.0 SNP chips. The results of the sample classification were similar, although the number of SNPs interrogated on Affymetrix 500K (approximately 400,000 to 490,000 SNPs after quality control) was only about half the number on Array 6.0 (approximately 700,000 to 870,000 SNPs after quality control). This suggests that, with regard to the classification of samples in the HapMap II Project, the SNPs on Affymetrix Array 6.0 carry no more ancestral information than the SNPs on Affymetrix 500K. Recently, whole-genome sequencing technology has become more common than SNP microarrays and has promoted the identification of new common SNPs and rare variants. Novel population-specific or ancestry-informative variants may be identified, and more eQTLs that contribute to the genetic variation of ancestry-informative GE may become available. It will be interesting to investigate whether the bottleneck in a SNP-only analysis for discerning samples from closely related populations can be overcome using highly dense common SNPs and rare variants from massively parallel sequencing in the 1000 Genomes Project [49