Overview of Personal Omics Profiling
Our overall iPOP strategy was to: 1) determine the genome sequence at high accuracy and evaluate disease risks, 2) monitor omics components over time and integrate the relevant omics information to assess the variation of physiological states, and 3) examine in detail the expression of personal variants at the level of RNA protein to study molecular complexity and dynamic changes in disease states.
We performed iPOP on blood components [Peripheral Blood Mononuclear Cells (PBMCs), plasma and sera which are highly accessible] from a 54 year-old male volunteer over the course of 14 months. Samples used for iPOP were taken over an interval of 401 days (Days 0–400). In addition, a complete medical exam plus laboratory and additional tests were performed before the study officially launched (Day -123) and blood glucose was sampled multiple times after the comprehensive omics profiling (Days 401–602) (). Extensive sampling was performed during two viral infections that occurred during this period: a human rhinovirus (HRV) infection beginning on Day 0 and a respiratory syncytial virus (RSV) infection starting on Day 289. A total of 20 time points were extensively analyzed and a summary of the time course is indicated in . The different types of analyses performed are summarized in . These analyses, performed on PBMCs and/or serum components, included WGS, complete transcriptome analysis (providing information about the abundance of alternative spliced isoforms, heteroalleleic expression and RNA edits, as well as expression of miRNAs at selected time points), proteomic and metabolomic analyses, and autoantibody profiles. An integrative analysis of these data highlights dynamic omics changes and provides rich information about healthy and diseased phenotypes.
Whole Genome Sequencing
We first generated a high quality genome sequence of this individual using a variety of different technologies. Genomic DNA was subjected to deep WGS using technologies from Complete Genomics (CG, 35nt paired end) and Illumina (100nt paired end) at 150 and 120 fold total coverage, respectively, exome sequencing using three different technologies to 80–100-fold average coverage (see Extended Experimental Procedures) and analysis using genotyping arrays.
The vast majority of human sequences (91%) mapped to the hg19 (GRCh37) genome. However, because of the depth of our sequencing, we were able to identify sequences not present in the reference sequence. Assembly of the unmapped Illumina sequencing reads (60,434,531, 9% of the total) resulted in 1,425 (of 29,751) contigs (spanning 26Mb) overlapping with RefSeq gene sequences that were not annotated in the hg19 reference genome. The remaining sequences appeared unique, including 2,919 exons expressed in the RNA-Seq data (eg. Figure S1A
). These results confirm that a large number of undocumented genetic regions exist in individual human genome sequences and can be identified by very deep sequencing and de novo
assembly (Li et al., 2010
Our analysis detected single nucleotide variants (SNVs), small insertions and deletions (indels) and structural variants (SVs; large insertions, deletions and inversions relative to hg19), (summarized in and Experimental Procedures). 134,341 (4.1%) high confidence SNVs, are not present in dbSNP or the 1000G, (1000 Genomes Project Consortium, 2010
) indicating that they are very rare or private to the subject. Only 302 high confidence indels reside within RefSeq protein coding exons and exhibit enrichments in multiples of 3 nucleotides (p<0.0001). In addition to indels, 2,566 high confidence SVs were identified (Experimental Procedures and Table S1
) and 8,646 mobile element insertions were detected (Stewart et al., 2011
Summary and breakdown of DNA variants.
Analysis of the subject's mother's genome and imputation allowed a maternal/paternal chromosomal phasing of 92.5% of the subject's SNVs and indels (see Figure S2 and Extended Experimental Procedures
). 139 phased genes contain predicted compound heterozygous deleterious and/or nonsense mutations. See Table S2
for details, Data S1. Phasing enabled the assembly of a personal genome sequence of very high confidence [cf. (Rozowsky et al., 2011
WGS-based Disease Risk Evaluation
We identified variants likely to be associated with increased susceptibility to disease (Dewey et al., 2011). The list of high confidence SNVs and indels was analyzed for rare alleles (<5% of the major allele frequency in Europeans) and for changes in genes with known Mendelian disease phenotypes (data summarized in ), revealing that 51 and 4 of the rare coding SNV and indels, respectively, in genes present in OMIM are predicted to lead to loss-of-function (Table S3A
). This list of genes were further examined for medical relevance (Table S3A
; example alleles are summarized in ), and 11 were validated by Sanger sequencing. High interest genes include a) a mutation (E366K) in the SERPINA1
gene previously known in the subject, b) a damaging mutation in TERT
, associated with acquired aplastic anemia (Yamaguchi et al., 2005
), c) variants associated with hypertriglyceridemia and diabetes, such as GCKR
[homozygous] (Vaxillaire et al., 2008
), and KCNJ11
[homozygous] (Hani et al., 1998
) and TCF7
[heterozygous] (Erlich et al., 2009
Summary of disease-related rare variants.
Genetic disease risks were also assessed by the RiskOGram algorithm, which integrates information from multiple alleles associated with disease risk (Ashley et al., 2010
) (). This analysis revealed a modest elevated risk for coronary artery disease and significantly elevated risk levels of basal cell carcinoma (), hypertriglyceridemia, and Type 2 Diabetes (T2D) ().
In addition to coding region variants we also analyzed genomic variants that may affect regulatory elements (Transcription Factors, TF), which had not been attempted previously (Data S2). 14,922 (of 234,980) SNVs lie in the motifs of 36 TFs known to be associated with the binding data (see Experimental Procedures), indicating that these are likely having a direct effect on TF binding. Comparison of SNPs that alter binding patterns of NFκB and PolII sites (Kasowski et al., 2010
), also revealed a number of other interesting regulatory variants, some of which are associated with human disease [eg. EDIL
, (Sun et al., 2010
), Figure S1B
Dynamic Omics Analysis: Integrative Omics Profiling of Molecular Responses
We profiled the levels of transcripts, protein, and metabolites across the HSV and RSV infections and healthy states using a variety of approaches. RNA-Seq of 20 time points generated over 2.67 billion uniquely mapped 101b paired-end reads (123 million reads average per time point) and allowed for an analysis of the molecular complexity of the transcriptome in normal cells (PBMCs) at an unprecedented level. The relative levels of 6,280 proteins were also measured at 14 time points through differentially labeling samples using isobaric tandem mass tags (TMT), followed by liquid chromatography and mass spectrometry (LC-MS/MS) (Cox and Mann, 2010
; Theodoridis et al., 2011
). 3,731 PBMC proteins could be consistently monitored across most of the 14 time points (see Figure S4A
, Data S4). In addition, 6,862 and 4,228 metabolite peaks were identified for the HRV and RSV infection and a total of 1,020 metabolites were tracked for both infections (see Figure S5
and Data S5(3)). Finally, as described below we also analyzed miRNAs during the HRV infection.
This wealth of omics information allowed us to examine detailed dynamic trends related directly to the physiological states of the individual and revealed enormous changes in biological processes that occurred during healthy and disease states. For each profile (transcriptome, proteome, metabolome), we systematically searched for two types of non-random patterns: 1) correlated patterns over time, and 2) single unusual events i.e. spikes that may occur at any given time-point defined as statistically significantly high or low signal instances compared to what would be expected by chance. To perform this analysis, we developed a novel general scheme for integrated analysis of data (see Figure S6 & Extended Experimental Procedures
for further details). We used a Fourier spectral analysis approach that both normalizes the various omics data on equal basis for identifying the common trends and features, and, also accounts for dataset variability, uneven sampling and data gaps, in order to detect real-time changes in any kind of omics activity at the differential time points (See Supplementary Information). Autocorrelations were calculated to assess non-randomness of the time-series (p<0.05 one-tailed based on simulated bootstrap non-parametric distribution by sampling with replacement of the original data, n>100,000), with significant signals classified as autocorrelated
(I). The remaining data was searched for spike events, which were classified as spike maxima
(II) or spike minima
(III) (p<0.05 one-tailed based on differences from simulated, n>100,000 random distribution of the time-series). After classification, the data were agglomerated into hierarchical clusters (using correlation distance and average linkage) of common patterns and biological relevance was assessed through GO (Ashburner et al., 2000
) Analysis [Cytoscape (Smoot et al., 2011
), BiNGO (Maere et al., 2005
) p<0.05, Benjamini-Hochberg (Benjamini and Hochberg, 1995
) adjusted p<0.05] and pathway analysis [Reactome (Croft et al., 2011
) Functional Interaction, FI, networks including KEGG (Kanehisa and Goto, 2000
; Smoot et al., 2011
), p<0.05, FDR < 0.05]. The unified framework approach was implemented on all the different datasets both individually and in combination, and our results revealed a number of differential changes that occurred both during infectious states and the varying glucose states.
We first analyzed the different individual transcriptome, proteome (serum and PBMC) and metabolome datasets; the proteome and metabolome results are presented in the supplement (Figures S4, S5, S7
and Data S4–S7). 19,714 distinct transcript isoforms (Wang et al., 2008
) corresponding to 12,659 genes (Figure S1C
) were tracked for the entire time course, and their dynamic expression response was classified into either autocorrelated (I) and spike sets, further subdivided as displaying maxima (II) or minima (III) (). The clustering and enrichment analysis displayed a number of interesting pathways in each class: In the autocorrelated group [ (I), see also Figure S7A
and Data S7 (1–2)], we found two main trends: an upward trend (2,023 genes), following the onset of the RSV infection, and a similar coincidental downward trend (2,207 genes). The upward autocorrelated trend revealed a number of pathways as enriched (p<0.002, FDR<0.05), including Protein Metabolism and Influenza Life Cycle. Additionally, the downward autocorrelation cluster showed a multitude of enriched pathways (p<0.008, FDR<0.05), such as TCR Signaling in Naïve CD4+ T cells, Lysosome, B Cell signaling, Androgen regulation and of particular interest, Insulin signaling/response pathways. These different pathways, which are activated as a response to an immune infection, often share common genes and additionally we observe many genes hitherto unknown to be involved in these pathways but displaying the same trend. Furthermore, we observed that the downward trend, that began with the onset of the RSV infection and appeared to accelerate after Day 307, coincided with the beginning of the observed elevated glucose levels in the subject.
Transcriptome time course analysis
In the dynamic spike class we again saw patterns that were concordant with phenotypes [, (II) and (III), see also Figure S7A
, Data S7(3–14)]. A set of expression spikes displaying maxima (547 genes), that are common to the onset of both the RSV and HRV infections are associated with Phagosome, Immune processes and Phagocytosis, (p<1×10−4
). Furthermore, a cluster that exhibits an elevated spike at the onset of the RSV infection involves the Major HistoCompatibility genes (p<7×10−4
, Benjamini-Hochberg adjusted p<0.03). A large number of genes with a coexpression pattern common to both infections in the time course have yet to be implicated in known pathways and provide possible novel connections related to immune response. Finally, our spike class displaying minima showed a distinct cluster (1,535 genes) singular to day 307 (Day 14 of the RSV infection), associated with TCR signaling again, TGF receptors and T Cell, and Insulin Signaling pathways (p<0.02, FDR<0.03). Overall, the transcriptome analysis captures the dynamic response of the body responding to infection as also evidenced by our cytokine measurements, and also can monitor health changes over long periods of time, with various trends.
To further leverage the transcriptome and genome data, we performed an integrated analysis of transcriptome, proteomic and metabolomics data for each time point, observing how this corresponded to the varying physiological states monitored as described in the above sections. Because of the availability of many time points through the course of infection, we examined in detail the onset of the RSV infection, as well as extended our complete dynamics omics profile during the times that our subject began exhibiting high glucose levels. shows an integrated interpretation of omics data (see also Figure S7B
, Data S8), where all trends are combined for each omics dataset and the common patterns emerge providing complementary information. In addition to the common patterns observed in our transcriptome analysis, new patterns emerged, some unique to protein data, some to metabolite, and some common to all. In particular we found the following interesting results: for autocorrelated clusters we found the same trends as observed in the transcriptome, additionally augmented with concordant protein expressions. Pathways such as the Phagosome, Lysosome, Protein Processing in Endoplasmic Reticulum and Insulin pathways emerged as significantly enriched (p<0.002, FDR<0.0075), and showed a downward trend post-infection, and further accelerated after about 3 weeks post the infection (this cluster comprised of 1,452 transcriptomic and 69 proteomic components, corresponding to 1,444 genes). The elevated spike class showed a maxima cluster on Day 18 post RSV infection (one time point after the cytokine maximum), with enrichment in pathways such as the Spliceosome, Glucose Regulation of Insulin Secretion and various pathways related to a stress response (p<1×10−4
, FDR <0.02) - this cluster included 1,956 transcriptomic, 571 proteomic and 23 metabolomic components, corresponding to 2,344 genes. Even though current proteomic information is more limited than the full transcriptome because it follows fewer components, as evidenced in (II) several pathways, including the Glucose Regulation of Insulin Secretion pathway, clearly emerge from the proteomic information and would not have been observed by monitoring the transcriptome only. Additionally, in this cluster we find significant GO enrichment in splicing and metabolic processes (p<6×10−47
, Benjamini-Hochberg adjusted p<10−45
). Furthermore, inspection of metabolites reveals 23 that show the same exact trend (i.e. spikes at Day 18 post RSV infection); at least one, lauric acid has been implicated in fatty acid metabolism and insulin regulatory pathways (Kusunoki et al., 2007
). Finally, we observe minima spikes as well, with yet another interesting group on Day 18, which showed down-regulation in several pathways (p<0.003, FDR<0.05), such as the Formation of Platelet Plug. This cluster displayed a high degree of synergy between the various omics data, comprised of 3,237 transcriptomic and 761 proteomic components corresponding 3,400 genes and 83 metabolomic components.
In summary our integrated approach revealed a clear systemic response to the RSV infection following its onset and post infection response, including a pronounced response evident at Day 18 post RSV infection. A variety of infection/stress response relative pathways were activated along with those related to the high glucose levels in the later time points. Insulin response pathways exhibit decreased signaling post infection.
Dynamic Omics Analysis: Extensive Heteroallelic Variation and RNA Editing
The considerable amount of transcriptome and proteome data allowed us to analyze and follow changes in allelic specific expression (ASE) splicing and editing at the RNA and protein levels, during healthy and disease states.
Of the 49,017 genomic variants associated with coding or UTR regions (), 12,785 (26%) were expressed in PBMCs (≥40 read coverage; Table S4
). 8,509 of the variants are heterozygous (1,113 missense) and the remainder (4,686; 684 missense) are homozygous. 8 of the 83 nonsense mutations were expressed indicating that not all nonsense mutations result in transcript loss.
The numerous heterozygous variants allowed an analysis of the dynamics of differential ASE, (Experimental Procedures; , Figure S2B
) in PBMCs during healthy and disease states. We found 497 and 1,047 genes that exhibited differential ASE during HRV and RSV infection, respectively (posterior probability ≥ 0.75, beta-binomial model; ≥40 reads, ≥7 time points); many of these are immune response genes, eg. PADI4
(). Amongst the differential ASE sites 100 and 218 were specific to HRV and RSV infected states, respectively (). Differential ASE genes in the HRV compared to healthy phase were enriched for those encoding SNARE vesicular transport proteins (DAVID analysis; Benjamini p<0.05). Summing over all computed ASE alternative to total ratios revealed that non-reference heteroallelic variants were expressed at 98% of reference variants. The expression of over 50 heterozygous variants, including rare/private SNVs (0.72% of the genomic total), and differentially expressed variants (SVIL
), was confirmed by Sanger cDNA sequencing and/or digital PCR (Hindson et al., 2011
) of cDNA (, S2
). Overall, these results demonstrate that differential ASE is pervasive in humans, particularly distinct during healthy and infected states, with many of these changes residing in immune response genes.
Heteroallelic expression study of PBMCs
The depth of our RNA-Seq data enabled us to re-evaluate the extent of RNA editing, typically an adenosine to inosine (A-to-I) conversion (Li et al., 2009b
) or cytidine to uridine (C-to-U), in normal human cells. We found 2,376 high confidence RNA edits, including 795 A-to-I (A-to-G) and 277 C-to-U deamination-like edits (, S8
). 587 edits in 175 genes were predicted to cause amino acid substitutions [Polyphen-2 (Adzhubei et al., 2010
)]; the remainder were nonsense (11), synonymous (435) or located in 5'/3' UTRs (103/1,240). 10 edited bases causing amino acid substitutions were validated by Sanger cDNA sequencing and/or digital droplet PCR, as well as by identification of their peptide counterparts by mass spectrometry (). Interestingly, we identified A-to-G edits (), eg. IGFBP7, BLCAP
in PBMCs that were known to occur in other tissues (Gommans et al., 2008; Levanon et al., 2005), indicating that the same RNA can be edited in other cell types. BLCAP
exhibited two edited changes, , with edited/total ratios of 0.12–0.2 and 0.18–0.31, respectively, comparable to the 0.21 ratio previously observed in the brain (Galeano et al., 2010).
RNA editing and miRNA expression of PBMCs
Furthermore, we found and validated two new missense causing edits, U-to-C in SCFD2
and G-to-A in FBXO25
(), indicating an amination-like RNA-editing mechanism, previously not observed in human cells. Our results reveal that a large number of edits occur and exhibit dynamic changes in populations of PBMCs. The total number of edited RNAs, while extensive, is significantly lower than that reported in human lymphoblastoid lines and very different in its distribution (Li et al., 2011
). We believe that in addition to tissue-specific variation, the observed differences are also likely due to overcalling of false-positive SNVs, a problem we corrected with deep exome sequencing, removal of repeat regions and strings of close-proximity variants ().
Finally, to determine whether the non-reference allele and edited RNAs serve as templates for protein synthesis, we generated proteome databases for 4,586 missense SNVs and all 30,385 edits and used them to search our mass spectra from the untargeted protein profiling experiments as well as in a targeted approach to directly search for 500 edited proteins (see Extended Experimental Procedures). Peptides for 48 SNVs and 51 edits were identified (FDR <0.01 and requiring one unique peptide per protein; Data S9). 17/17 selected SNVs (100%) validated by Sanger sequencing. 7 and 6 of peptides derived from the SNV and edited transcripts, respectively, were unique to a single protein in the IPI database (Kersey et al., 2004
) and classified as high confidence. These results indicate that a large fraction of personal variants are expressed as transcripts and a number of these are also translated as proteins.