Genome sequencing and alignment to the human reference
We generated 113.16 Gb of sequence data (1,131.56 million paired-end reads of length 100

bp) that was aligned to the human reference sequence (GRCh37/hg19; 2,861,343,702 non-N bases) using BWA
[
23]. We aligned 96.27% of the reads (99.97 Gb) to the reference sequence resulting in an average coverage of 34.94 fold across the genome (Table

). The coverage depth distribution of the sequenced genome is shown in Figure

along with a Poisson distribution with the same mean value. Compared to the Poisson distribution, which has been used to model sequence data in several earlier studies, we observe that the coverage distribution has more weight on both tails. A decrease in the average coverage with increasing GC content in 50 Kb non-overlapping windows across the whole genome was also observed (Figure

). While 98.89% of the reference genome was covered by at least five reads (required for variant calls), 99.17% of it was covered by at least one read.
| Table 1Sequencing and analysis statistics |
We performed a
de novo assembly of reads that did not align to the chromosomes in GRCh37, using SOAPdenovo
[
24]. This generated 57,426 contigs comprising 23,683,357 bases with an average contig length of 412

bp. Of these, 42.69% sequences aligned to the unanchored contigs and chromosomes in GRCh37 and another 9.25% of the sequences aligned to the alternative human assemblies. About 33.05% of the assembled sequences aligned to other human sequences in the NT database, while another 3.64% of the sequences aligned to non-human primates with an E

<

10
-5.
SNPs and indels
We identified a total of 3,459,784 single nucleotide variants (2,087,876 heterozygous) in this genome, by comparing it against human reference genome assembly (GRCh37, also known as hg19), using methods previously described
[
14]. The SNP calls were further validated using Illumina HumanOmni1-Quad BeadchipSNP array data. We observed a 98.7% concordance between the SNP calls made using the sequencing data and the SNP array, confirming the validity of the sequencing derived SNP calls. The single nucleotide variants identified in the SAIF genome are referred to as SNPs (relative to the human reference genome) in the results below and this does not include single base insertions/deletions. Of the total SNPs identified, 1,679,111 (48.5%) mapped to gene (intragenic) coding regions of the genome. Given that exons form a small part of the gene coding region, of the total intragenic SNPs identified, only 5.6% (94,247; 29,724 in coding exons, 25,354 in non-coding exons, 8,651 in 5’UTR and 30,518 in 3’UTR) mapped to them (Additional file
1: Table S1A). This is consistent with the lower mutation rates typically observed with coding regions of the genomes
[
25]. Among the SNPs in coding exons, 11,155 are synonymous (syn) substitution that are distributed among 6,631 genes and 11,216 are non-synonymous (non-syn) changes that map to 6,279 genes. This is consistent with a non-syn:syn (dN/dS) ratio of ~1 expected of a normal genome
[
26]. Of the SNPs identified, 1,832,801 (53%) mapped to repeat containing regions of the genome
[
27,
28]. Further, we found that about two-thirds of the SNPs identified in the repeat regions were found in long interspersed elements (LINE; 41%; majority occurring in L1 elements) or short interspersed elements (SINE; 30%; majority occurring in Alu elements; Figure

, Additional file
1: Table S1B).
In addition to SNPs, insertions and deletions (indels) are a class of variations that shape evolution of genomes
[
29,
30]. In the SAIF genome, of the total 384,926 indels identified, 190,533 (49.5%) were found in gene coding regions. As observed with SNPs, only 7,871 (4.1%) of indels (1,591 in coding exons, 1,769 in non-coding exons, 620 in 5’UTR and 3,891 in 3’UTR) occurred within exons. Of the total indels, 248,309 (64.5%) were found in repetitive regions, proportionally higher than SNPs that occurred in this region. This very likely reflects the slippage that occurs during replication leading to increased occurrence of indels in repeat regions
[
31]. Further, it is interesting to note that while indels were predominant (34%; 85,193) in simple repeats (Figure

, Additional file
1: Table S1B), only 2% of the SNPs were found in the simple repeat regions.
The presence of SNPs and indels can affect the gene regulatory regions such as transcription factor binding region (TFBR)/promoters and enhancer sites. We first looked at the average enrichment of SNPs and indels around transcriptional start sites (TSS) of known genes. Interestingly, we observed an increased SNPs density around TSS, suggesting an important role for variations in modulating expression across individuals
[
32]. In contrast to SNPs, there were fewer indels downstream of TSS (around 50

bp downstream; Figure

). This likely suggests a need for preserving the promoter length/architecture around TSS. We further investigated all conserved TFBR and enhancer sites catalogued by UCSC genome browser
[
33] and VISTA enhancer browser
[
34], respectively in the SAIF genome to understand the extent of variations in these regions. For TFBR, we focused on SNPs and indels that are present within 5

kb upstream of the genes. We found 1,328 SNPs and 66 indels within the TFBR and 1,732 of SNPs and 203 indels in the enhancer sites. The top TFBR SNP containing sites included regions adjacent to Forkhead box J2 transcriptional activator protein, myocyte enhancer factor-2 involved in cellular differentiation, Brachyury protein involved in mesoderm formation and differentiation, CHX10 protein involved in progenitor cell proliferation and bipolar cell determination in developing retina, and the peroxisome proliferator-activated receptor-gamma (PPAR-gamma) protein that regulates adipocyte differentiation. In general, we found genes involved in cancer pathways to be enriched for SNPs in their promoter regions (FDR

<

= 0.05).
Coding SNPs are predominant in olfactory genes
Coding SNPs (cSNPs) can lead to amino-acid substitutions in proteins encoded by the genome. In the SAIF genome, of the 22,371 cSNPs that occur in the protein coding sequence 11,216 lead to non-synonymous (nsSNP) substitutions (Table

). While the synonymous SNPs (sSNPs) map to 6,631 genes, the nsSNPs are distributed among 6,279 protein-coding genes in the genome. To understand the significance of the cSNPs, we assessed their enrichment in KEGG pathways using DAVID
[
35]. While distribution of both sSNPs and nsSNPs was significantly enriched (FDR

<

= 0.05) in ECM-receptor interaction pathway genes, only the olfactory transduction pathway genes showed a statistically significant enrichment for nsSNPs (Figure

, Additional file
2: Figure S1). This is consistent with the higher levels of polymorphism observed in human olfactory gene family
[
36]. The nsSNPs included 11,107 missense substitutions and 109 non-sense mutations that lead to premature stop (Additional file
1: Table S2). Genes with premature stop included
CASP12, a cysteine protease involved in inflammation and innate immune response, and
OR1B1, an olfactory receptor interacting with odorant molecules in the nose. The CASP12 protein contains an N-terminal caspase activation and recruitment domain (CARD) and a C-terminal catalytic cysteine protease domain characteristic of caspase family of proteins (Figure

). The non-sense mutation observed in the SAIF genome codes for a truncated CASP12 protein that contains just the N-terminal CARD domain. Recent re-sequencing studies have shown that this truncated form of CASP12 confers resistance to sepsis and is predominant across many populations
[
37-
39]. Another non-sense mutation in the SAIF occurred in the
OR1B1, which encodes a G-protein coupled olfactory receptor. The OR1B1 SNP leads to truncation of the 7-transmembrane receptor domain present in OR1B1 (Figure

). This truncated variant of OR1B1 protein has been observed in a recent study
[
39] and is thought to affect metabolism of serum cholinesterase
[
40]. Overall, our assessment of the effect of the nsSNPs using SIFT
[
41] indicates that 1,460 are likely damaging (Additional file
1: Table S3).
| Table 2Variants with Gene coding regions |
Unlike SNPs, indels in coding regions, in addition to non-sense mutations, can lead to frame shift changes. Of the 372 coding region indels, 172 are in-frame and 200 lead to frame-shift change (Additional file
1: Table S4A, Additional file
1: Table S4B). Genes where the indel leads to a frame-shift includes HIF3A, hypoxia inducible factor 3 alpha subunit, thought to be a negative regulator of hypoxia-inducible gene expression; MMP28, a matrix metallopeptidase involved in the breakdown of extracellular matrix for both normal physiological and disease processes; and HNF1A, a transcription factor required for the expression of several liver-specific genes. The frame-shift at position 147 in MMP28 protein introduces a premature stop codon at 179. This results in loss of zinc-dependent metalloprotease and hemopexin-like repeat domain, leading to a truncated MMP28 protein that lacks a catalytic domain (Additional file
2: Figure S2). SIFT analysis of the indels indicated 126 indels to be deleterious (Additional file
1: Table S5).
Comparison and novel variants
We compared SAIF SNPs against those from other published personal genomes, the variations from the 1000 Genomes Project and dbSNP database (dbSNP132). The personal genomes used to perform the comparison had a sequencing coverage of at least 10X. Shared SNP sites, where both the SAIF genome and the genome it is compared to carry a SNP, provide a measure of the degree of similarity between the genomes. We also compared the indels found in the SAIF genome with those reported by the 1000 Genomes Project.
SNPs level comparison of the SAIF genome found that this individual shared 48.77% of the SNP sites with NA12891 (Caucasian) genome, 48.82% with the NA12892 (Caucasian) genome, 52.5% with the Venter (Caucasian) genome, 50.68% with the NA18507 (YRI) genome, 44.29% with the NA19238 (YRI) genome, 44.33% with NA19239 (YRI), 53.75% with YH (Han Chinese) genome, 59.24% with SJK (Korean) genome, 46.5% with ABT (South Africa) genome, 51.1% with Irish (Caucasian) genome, 49.86% with KB1 (Southern Kalahari, Africa), 59.41% with the recently published Indian male genome
[
20], 95.18% with dbSNP 132, and 92.44% with 1000 Genomes Project variation collection.
Overall, we found that 2.6% (89,873) of the SNPs and 83.83% (322,295) of the indels to be unique to SAIF genome (Figure

, Additional file
1: Table S6). The complete list of novel SNPs and indels is provided in Additional file
1: Table S7. Of the novel SNPs, 22,412 (24.94%) mapped uniquely to genes, 28,313 (31.5%) mapped specifically to repeat regions, 21,826 (24.29%) mapped to both gene coding and repetitive regions and 17,322 (19.27%) mapped to other regions of the genome. Further, of the total novel SNPs that mapped to coding regions, 543 led to protein level alterations (533 missense SNPs and 10 non-sense substitutions). Genes having novel non-sense substitutions include TSG101, a phosphoprotein implicated in tumorigenesis and the CD164 gene that plays a role in hematopoiesis. Further, assessment of the protein altering novel SNPs using SIFT predicted 154 of these to be likely damaging.
SNPs with health and medical relevance
We assessed cSNPs identified in the SAIF genome using annotations in SNPedia and OMIM for their health and disease relevance. This analysis identified 59 and 63 cSNPs with implications in health and disease from SNPedia and OMIM databases
[
45,
46], respectively (Additional file
1: Table S8 and Additional file
1: Table S9). Interestingly, this analysis revealed several SNPs with implications for susceptibility to cancer and cardiovascular diseases. The cancer susceptibility SNPs included the variation in SDHB gene (S163P, OMIM_ID #185470.0015), responsible for Cowden-like syndrome, resulting in enrichment of carcinomas of human breast due to downstream inactivation of PTEN
[
47]. We also found an exon 10 BRCA2 variant (N372H; OMIM_ID # 600185.0133),
[
48] and an EPCAM variant identified in Chinese population (M143T; rs1126497;
[
49]) that are associated with increased risk for breast cancer. Further, a SNP in CENPF gene (R2943G; rs438034) that occurs in the SAIF genome is associated with a poor breast cancer survival
[
50]. Other SNPs with increased cancer susceptibility include FCGR2A H166R (rs1801274) associated with increased risk for non-Hodgkin’s lymphoma
[
51], ANKK1 E713K (rs1800497;
[
52]) involved in advanced adenoma recurrence, HNF1A S487N (rs2464196;
[
53]), MMP9 Q166R (rs17576-rs2250889;
[
54]), and XPC Q939K (rs2228001;
[
55]) variants associated with lung cancer, ATG16L1 T137A (rs2241880;
[
56,
57]) with Crohn’s disease, and OGG1 P332A (rs1052133;
[
58-
60]) associated with bladder and gall-bladder cancer in Japanese, Chinese and Indian populations. An ATR (M211T; rs2227928) variant found in the genome has been associated with a poorer response to gemcitabine and radiation therapy in pancreatic cancer
[
61]. We also found a protective SNP that occurs in the PON1 gene (Q192R; rs662) that is known to lower (0.65x) risk for ovarian cancer
[
62]. Two common missense variations in ELAC2 gene (A541T; OMIM_ID # 605367.0002 and S217L; OMIM_ID # 605367.0001) implicated in genetic susceptibility to heredity prostate cancer were found in the SAIF genome. This while not of direct significance to SAIF individual, could be of relevance to the male children, if any
[
63-
65].
The cardiovascular disease associated SNPs found in this individual include those in LRP8 (R952Q; rs5174/OMIM_ID # 602600.0001;
[
66]) and MMP9 (Q166R; rs17576;
[
67]) both of which increase risk for myocardial infarction, ROS1 (S2229C; rs619203;
[
68]) variation associated with increased coronary heart disease, AKAP10 SNP (I646V; OMIM_ID # 604694.0001;
[
69]) associated with cardiac conductivity defect susceptibility and ADRB3 variant (W64R; rs4994;
[
70]) implicated in higher risk of cardiac events. Also, two SNPs in the PON1 (Q192R; rs662 and L55M; rs854560) show a high risk of cardiovascular disease
[
71] and a higher risk of coronary artery disease
[
72,
73]. A SNP in SNX19 (L878R; rs2298566) is linked to elevated risk of coronary heart disease but has also been shown to be associated with better response to statins and may be of clinical significance
[
74]. Other SNPs affecting cholesterol levels (EDN1 K198N; OMIM_ID # 131240), familial obesity (FAM71F1 E143K; rs6971091) and hypertension susceptibility (PPARFC1A, G482S; rs8192678 and CYP4A11, F434S; rs1126742) were also found in the genome.
In addition to this, several other SNPs associated with Alzheimer’s disease, diabetes, tuberculosis susceptibility and macular degeneration were also detected. A SNP in ICAM1 (K469E; rs5498), associated with increased resistance to malarial infection, originally identified in a study of over 552 Indian individuals
[
75], was also observed in the SAIF genome. It must be noted that a majority of the SNPs of health relevance used to annotate the coding SNPs were derived from studies involving western populations. Hence, validating the relevance of these in the context of Asian Indian population will require controlled studies in a cohort representative of the Indian subcontinent.
Besides assessing the cSNPs using SNPedia and OMIM, we performed a comprehensive assessment of predicted genetic risk of the SAIF genome for 49 diseases using Varimed
[
76]. As described recently
[
77], we first estimated the pre-test probability using the prevalence of each disease according to the age, gender, and ethnicity of SAIF. Using this analysis of the SAIF genome we found 17 diseases that had post-test probability >5% (Figure

).
We further assessed the relative genetic risk of SAIF against Gujarati Indians in Houston (GIH) population represented in HapMap III. We used the set of disease SNPs measured in both SAIF and GIH, and recalculated the likelihood ratio (LR) for SAIF and each of 101 GIH individuals. We found that the SAIF individual had a higher genetic risk than 80% of GIH for eight diseases (Additional file
2: Figure S5). Intersecting both results, we found that SAIF had a high genetic risk for four diseases, including multiple sclerosis (post-test probability

=

5%, relative risk

>

100% GIH), uterine leiomyoma (post-test probability

=

47%, relative risk

>

97% GIH), asthma (post-test probability

=

17%, relative risk

>

90% GIH), and obesity (post-test probability

=

34%, relative risk

>

82% GIH).
SAIF’s high genetic risk on multiple sclerosis is based on a rare heterozygous AG variant at rs3135388 in HLA-DRA (Figure

), which appears in 4% GIH individuals. Multiple studies have linked the presence of the “A” allele with an increased risk for multiple sclerosis in American, Australian, British, Canadian, Dutch, mixed European, and Serbian populations
[
78-
87]. International Multiple Sclerosis Genetics Consortium used this allele as a proxy for the DRB1*1501 allele, which had been demonstrated as a causal variant for multiple sclerosis and validated in animal models in OMIM (OMIM_ID #126200). This variant has also been validated as a functional regulatory variant, with evidence from transcription factor binding site, eQTL, and DNase peak from Regulome DB (
http://www.regulomedb.org/) with data from ENCODE.
In addition to multiple sclerosis, SAIF had a high genetic risk of uterine leiomyoma, driven by a rare heterozygous CT variant at rs7913069 (Additional file
2: Figure S6). The T allele had been validated to increase the risk of uterine leiomyoma with an odds ratio of 1.47 and p-value

=

8.65

×

10
-14 in Japanese women
[
88]. A high genetic risk for asthma and obesity were also identified in the SAIF individual
[
89,
90] (Figure

, Additional file
2: Figures S7 and Additional file
2: Figures S8).
SNPs of pharmacogenomic relevance
An individual’s SNPs can be used to predict adverse drug reactions and also manage the dose of drugs prescribed. In order to understand SNPs of pharmacogenomic relevance, we evaluated SAIF variants using annotations present in the PharmGKB database
[
91] and a set of published SNPs relevant for drug interactions. We identified 109 SNPs with pharmacogenomic relevance based on PharmGKB (Additional file
1: Table S10). We identified 10 actionable SNPs from literature as it pertains to treatment with various drugs, some of which are also represented in the PharmGKB (Table

). As India has a high burden of diabetes, we looked at the SAIF genome for a SNP in ATM previously associated with metformin response
[
92]. Metformin is a drug widely prescribed to manage blood sugar levels in diabetics. We found a GG (rs4585) variant in ATM in the SAIF genome and this is associated with positive response to metformin
[
92]. FDA in the US has recommended testing for SNPs that help in deciding the dosage of warfarin, used as a blood thinner. Based on the three SNPs (Table

) found in the SAIF genome, we conclude that a typical dose of warfarin is not likely to cause bleeding. Given that the SAIF genome SNPs predicted an increased risk for multiple sclerosis, we looked at genes with SNPs that predict response to therapy in the context of multiple sclerosis. We found a CT variant (rs7987675) in GPC5, which is associated with typical response to beta interferon therapy in relapsed multiple sclerosis patients. Besides these expected drug responses, we have identified additional ones based on the SAIF genome and they are detailed in Table

.