Genome-wide SNP data were combined from 8 different datasets. Patients with sickle cell disease (SCD) were pooled from the Multicenter Study of Hydroxyurea (MSH) [4], the NIH Pulmonary Hypertension Study (NIH-PH) [5], the Boston University Pulmonary Hypertension Study (BU-PH) [6], and the Cooperative Study of Sickle Cell Disease (CSSCD) [7]. For comparison, we included African Americans without sickle cell disease by pooling subjects from the Multi-Institutional Research in Alzheimer's Genetic Epidemiology (MIRAGE) study [8], randomly selected subjects from the Illumina repository, and the HapMap [9] panel of subjects with African ancestry in Southwest USA (ASW). As reference populations, we included 4 additional HapMap populations and 14 African and European populations from the Human Genome Diversity Project (HGDP) [10]: the HapMap Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Maasai in Kinyawa, Kenya (MKK), and CEPH Utah residents with Northwestern European ancestry (CEU), and the HGDP Bantu, Biaka, Mandenka, Mbuti pygmy, San, Yoruba, Adygei, Basque, French, Italian, Orcadian, Russian, Sardinian and Tuscan.
MSH, NIH-PH, BU-PH and MIRAGE samples were genotyped using the Illumina 370K array; CSSCD samples were genotyped on the Illumina 610K array; African Americans from the Illumina repository were genotyped on the Illumina 550K array; European and African populations from the Illumina repository were genotyped on the Illumina 650K array; and HapMap populations had genotypes available for approximately 1.5 million SNPs. Autosomal SNPs present on all arrays were used for the ancestry analysis, yielding 253,880 SNPs with a call rate greater than 95% and a minor allele frequency greater than 5%. All subjects had a call rate greater than 93%, and all individuals with an identity-by-descent estimate greater than 0.30, indicating first-degree relatives, were excluded from the analysis.
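The SNP-level and subject-level thresholds above can be illustrated with a minimal Python sketch. It assumes genotypes are stored as a subjects-by-SNPs list of lists coded 0, 1 or 2 copies of one allele, with `None` for a missing call. The function names and this encoding are hypothetical and are not taken from the study's pipeline. The identity-by-descent filter is not covered here.

```python
from typing import List, Optional, Sequence

# Assumed encoding: 0, 1 or 2 copies of the counted allele; None marks a missing call.
Genotype = Optional[int]


def snp_call_rate(column: Sequence[Genotype]) -> float:
    """Fraction of subjects with a non-missing genotype at one SNP."""
    return sum(g is not None for g in column) / len(column)


def minor_allele_frequency(column: Sequence[Genotype]) -> float:
    """Minor allele frequency among the non-missing genotypes at one SNP."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1.0 - freq)


def passing_snps(genotypes: Sequence[Sequence[Genotype]],
                 min_call_rate: float = 0.95,
                 min_maf: float = 0.05) -> List[int]:
    """Indices of SNP columns whose call rate and MAF both exceed the thresholds."""
    keep = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        if (snp_call_rate(column) > min_call_rate
                and minor_allele_frequency(column) > min_maf):
            keep.append(j)
    return keep


def passing_subjects(genotypes: Sequence[Sequence[Genotype]],
                     min_call_rate: float = 0.93) -> List[int]:
    """Indices of subject rows whose call rate exceeds the threshold."""
    keep = []
    for i, row in enumerate(genotypes):
        if sum(g is not None for g in row) / len(row) > min_call_rate:
            keep.append(i)
    return keep
```

With real array data these filters would normally be run in a dedicated genetics toolkit. The sketch only makes the definitions of call rate and minor allele frequency explicit.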
To examine the ancestry of these subjects, we performed a principal component analysis (PCA) [11] using the genome-wide set of SNPs that passed quality control for all subjects, and then applied a clustering algorithm to group individuals with similar ancestry [12]. PCA is the standard approach to detecting underlying population substructure in genome-wide SNP data; it reduces the dimension of the data by capturing the largest amount of variability in a small number of top axes of variation, called principal components (PCs). The top PCs often correspond to ancestry or geographic origin. To understand the joint pattern formed by the PCs, we applied our clustering algorithm to the top PCs, forming clusters that group individuals with similar ancestry and separate individuals with differing ancestry. We also estimated the genetic distance between the clusters using the Fst statistic to further understand the relationships among the populations. Fst is defined as the proportion of genetic diversity due to allele frequency differences among populations and can be interpreted as the distance between populations [2].
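To make the definition of Fst concrete, the following Python sketch computes a simple heterozygosity-based version for two populations: the total expected heterozygosity minus the mean within-population heterozygosity, divided by the total, summed over SNPs. This is an illustrative estimator with equal population weights and no sample-size correction. It is an assumption for exposition and not necessarily the estimator implemented in EIGENSOFT. The function name is hypothetical.

```python
from typing import Sequence


def fst_two_populations(freqs_a: Sequence[float], freqs_b: Sequence[float]) -> float:
    """Ratio-of-sums Fst = (H_T - H_S) / H_T across SNPs for two populations.

    freqs_a and freqs_b hold the frequency of the same allele at each SNP
    in population A and population B (assumed inputs, equally weighted).
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need one frequency per SNP")
    total_het = 0.0   # H_T: expected heterozygosity of the pooled population
    within_het = 0.0  # H_S: mean expected heterozygosity within populations
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        total_het += 2 * p_bar * (1 - p_bar)
        within_het += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
    if total_het == 0.0:
        return 0.0  # no variation at any SNP, so no differentiation to measure
    return (total_het - within_het) / total_het
```

Under this definition, two populations with identical allele frequencies give a value of 0, and populations fixed for different alleles at every SNP give a value of 1.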
The clustering algorithm is described in detail in [12]. Briefly, it first identifies the top ancestry-informative PCs by inspecting a heatmap and a scree plot of the top 20 PCs for patterns. Next, subjects are clustered with respect to the top PCs using k-means clustering for a range of numbers of clusters (2–30). The algorithm provides a scoring index for each number of clusters to automatically identify the optimal number. The scoring index averages measures of the accuracy of the subjects’ cluster assignments, the stability of k-means clustering, and the ability of k-means to maximize the distance between clusters. The PCA and Fst estimates were computed using the software EIGENSOFT [11]. We used Fisher’s exact test to determine whether a higher percentage of African Americans without SCD than of African Americans with SCD was more similar to Caucasians.
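As an illustration of the final comparison, the sketch below computes a one-sided Fisher's exact test for a 2x2 table directly from the hypergeometric distribution. The table layout is an assumption: rows are African Americans without and with SCD, and columns are subjects who were and were not more similar to Caucasians. The text does not state whether a one-sided or two-sided test was used, so the one-sided form is a choice made here for illustration only. The function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of observing a count
    in the top-left cell that is at least as large as a.
    """
    row1 = a + b          # e.g. all subjects without SCD (assumed layout)
    col1 = a + c          # e.g. all subjects more similar to Caucasians
    n = a + b + c + d
    denom = comb(n, row1)
    p_value = 0.0
    for x in range(a, min(row1, col1) + 1):
        # Hypergeometric probability of exactly x in the top-left cell;
        # comb returns 0 when the remaining cells would be impossible.
        p_value += comb(col1, x) * comb(n - col1, row1 - x) / denom
    return p_value
```

A small p-value from `fisher_exact_greater(a, b, c, d)` would indicate that the proportion in the first row is higher than expected if the two groups had the same proportion.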