Motivation: The sequencing of personal genomes enabled analysis of variation in transcription factor (TF) binding, chromatin structure and gene expression and indicated how they contribute to phenotypic variation. It is hypothesized that using the reference genome for mapping ChIP-seq or RNA-seq reads may introduce errors, especially at polymorphic genomic regions.
Results: We developed a Personal Genome Editor (perEditor) that changes the reference human genome (NCBI36/hg18) into an individual genome, taking into account single nucleotide polymorphisms (SNPs), insertions and deletions, copy number variation, and chromosomal rearrangements. perEditor outputs two alleles (maternal, paternal) of the individual genome that are ready for mapping ChIP-seq and RNA-seq reads, enabling the analysis of allele-specific binding, chromatin structure and gene expression.
Availability: perEditor is available at http://biocomp.bioen.uiuc.edu/perEditor.
Personal genomics does not stop at obtaining genetic variation. One of the next steps is to analyze the functional consequences of the genetic variation. To enable researchers at large to analyze the functions of individual genomes, large-scale personal genome projects including the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010), the Personal Genome Project (Lunshof et al., 2010) and the cancer genome projects (Pleasance et al., 2009, 2010) all provided cell lines from the sequenced individuals. Based on these personal cell lines and their genome sequences, researchers have started to analyze the variation in transcription factor (TF) binding (Kasowski et al., 2010), chromatin structure (McDaniell et al., 2010) and gene expression (Li et al., 2011), and to study the association between these molecular-level functional variations and phenotypic variations. In addition, these cell lines and genomic sequences have made it possible to analyze allele-specific epigenetic modifications and gene expression (Turan et al., 2010). These resources and growing research areas require dedicated analysis tools.
One way to analyze individual differences is to map personal data, such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) data, onto the human reference genome and then compare the TF binding intensities across individuals. It is hypothesized that using the reference genome for mapping ChIP-seq or RNA-seq reads may introduce errors (McDaniell et al., 2010), especially at polymorphic genomic regions. After all, the polymorphic regions are most likely to exhibit functional variation.
We developed a Personal Genome Editor (perEditor: http://biocomp.bioen.uiuc.edu/perEditor) that changes the reference human genome (autosomal, sex and mitochondrial chromosomes, build NCBI36/hg18) into an individual genome, taking into account single nucleotide polymorphisms (SNPs), insertions and deletions (indels), copy number variation and chromosomal rearrangement. perEditor takes the reference genome in Fasta format and the individual's differences from the reference genome in Variant Call Format (VCF) as inputs. For each difference described in the VCF file, perEditor makes a corresponding change to the reference genome. When the allele information is present in the VCF file, perEditor will keep two genome sequences, representing the two alleles, and make allele-specific changes. After all the data in the VCF file are processed, perEditor outputs the maternal and paternal alleles of the individual genome as Fasta files, ready for mapping ChIP-seq, RNA-seq and other sequence reads.
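The allele-specific editing step described above can be illustrated with a minimal sketch. This is not perEditor's code: the function name `apply_phased_snps`, the tuple layout of `variants`, and the convention that a phased genotype such as `0|1` lists the maternal allele first are all assumptions made for illustration. The sketch handles SNPs only, because indels and rearrangements shift downstream coordinates and need additional bookkeeping.

```python
def apply_phased_snps(reference, variants):
    """Return (maternal, paternal) sequences after applying phased SNPs.

    reference: reference sequence of one chromosome as a string.
    variants:  iterable of (pos, ref, alts, genotype) where pos is 1-based
               (as in VCF), alts is a list of alternate bases, and genotype
               is a phased string such as "0|1".
    Assumption: the first genotype index is maternal, the second paternal.
    """
    maternal = list(reference)
    paternal = list(reference)
    for pos, ref, alts, genotype in variants:
        idx = pos - 1  # convert the 1-based VCF position to a 0-based index
        if len(ref) != 1 or any(len(alt) != 1 for alt in alts):
            raise ValueError(f"only SNPs are handled in this sketch (pos {pos})")
        if reference[idx] != ref:
            raise ValueError(f"REF base does not match the reference at pos {pos}")
        alleles = [ref] + list(alts)  # index 0 is the reference base
        first, second = genotype.split("|")
        maternal[idx] = alleles[int(first)]
        paternal[idx] = alleles[int(second)]
    return "".join(maternal), "".join(paternal)


# Hypothetical toy input: one heterozygous and one homozygous SNP.
toy_variants = [(2, "C", ["T"], "0|1"), (5, "A", ["G"], "1|1")]
maternal_seq, paternal_seq = apply_phased_snps("ACGTACGT", toy_variants)
```

Keeping two full copies of the sequence mirrors the description of maintaining one genome per allele, at the cost of doubling memory use.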
We quantified the difference in mapping ChIP-seq reads against the reference genome and the individual genome and compared this difference to reported inter-individual differences. Using perEditor and data from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010; April 2009 data release, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2009_04/), we constructed the two alleles of a European individual (GM10847, accession number NA10847) and an African individual (GM18505, accession number NA18505). We re-analyzed ChIP-seq reads of NFκB generated from these individuals (Kasowski et al., 2010) by mapping them to each allele of their individual genome (output of perEditor) as well as to the reference genome (hg18, including autosomal, sex and mitochondrial chromosomes). A total of 55 232 610 raw ChIP-seq reads for the European individual and 64 291 100 raw reads for the African individual were mapped, using the Bowtie program (Langmead et al., 2009) with default parameters and allowing for up to 1 mismatch. When the individual genome was used, more reads became alignable. Taking the maternal allele of GM10847 as an example, a total of 161 150 reads that could not be uniquely aligned to the reference genome could be uniquely aligned to the individual genome; 84.9% of these newly alignable reads overlap with maternal or homozygous SNPs of GM10847. The other 15.1% of the newly alignable reads were gained because a SNP in GM10847 helped to resolve the uniqueness of a read alignment elsewhere. Importantly, 47 825 of the newly alignable reads are located in putative NFκB binding sites (defined as 200 bp windows with 10 or more alignable reads). In contrast, a much smaller number of reads that aligned to the reference genome became not uniquely alignable to the individual genome (Lost alignments, Table 1). Compared with the new alignments, smaller fractions (40–72%) of the lost alignments overlapped with SNPs.
These SNP-overlapping reads became unalignable primarily because they had 1 mismatch to the reference genome but 2 or more mismatches (beyond the threshold) to the particular allele of the personal genome. The remaining 28–60% of lost alignments arose because the reads were no longer uniquely alignable to the particular allele of the individual's genome (a polymorphism elsewhere produced an identical sequence). These data indicate that mapping to the individual genome may increase the precision of quantifying binding intensities from ChIP-seq reads, which is essential for explaining individual variation (Figs 1 and 2).
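The mismatch-threshold effect behind the SNP-overlapping lost alignments can be shown with a toy example. The helper names `mismatches` and `alignable` and the sequences are hypothetical; the sketch only counts mismatches at a fixed position and does not model the genome-wide search or the uniqueness test that a real aligner such as Bowtie performs.

```python
def mismatches(read, genome, start):
    """Count mismatching bases between a read and the genome at a 0-based start."""
    window = genome[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the sequence")
    return sum(a != b for a, b in zip(read, window))


def alignable(read, genome, start, max_mismatches=1):
    """True if the read fits at this position within the mismatch threshold."""
    return mismatches(read, genome, start) <= max_mismatches


# Toy sequences: the allele differs from the reference by one SNP (index 4, A>G).
reference = "ACGTACGTAC"
allele = "ACGTGCGTAC"

# The read carries the reference base at the SNP plus one unrelated mismatch.
read = "GTACGAAC"

kept_on_reference = alignable(read, reference, 2)  # 1 mismatch, within threshold
kept_on_allele = alignable(read, allele, 2)        # 2 mismatches, beyond threshold
```

Such a read can come from the other allele at a heterozygous site or from a sequencing error, which is consistent with lost alignments being much rarer than new ones.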
Next, we asked how strongly the genome used for mapping would affect our understanding of individual variation. The difference between the number of reads alignable to an individual genome and to the reference genome was calculated for each 200 bp window across the whole genome. We focused on the windows with 10 or more ChIP-seq reads for further analysis, because these regions are more likely to be binding sites (Table 2). Taking the maternal allele of GM10847 as an example, a total of 1356 windows (putative binding sites) showed a difference of 5 or more alignable reads. In terms of relative changes between individual and reference genomes (absolute difference of alignable reads divided by the maximum number of alignable reads), a total of 3794 windows showed 10% or larger changes, and 852 windows showed very strong (50% or larger) changes (Fig. 1). These data indicate that the precision of inferred binding intensity can be increased by 10% or more at thousands of binding sites. This improvement is on the same scale as the reported individual variation of NFκB binding (obtained by comparing ChIP-seq data of two individuals, using the human reference genome hg18 for mapping) (Kasowski et al., 2010).
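The per-window comparison can be sketched as follows. The function names and the input layout (0-based start positions of uniquely aligned reads on one chromosome) are assumptions for illustration. The text does not say which mapping the 10-read filter is applied to, so the sketch assumes a window qualifies when the larger of its two counts reaches the threshold.

```python
from collections import Counter


def count_reads_per_window(read_starts, window=200):
    """Bin 0-based read start positions into fixed-size windows.

    Returns a Counter keyed by the start coordinate of each window.
    """
    return Counter((start // window) * window for start in read_starts)


def window_changes(ref_counts, ind_counts, min_reads=10):
    """Compare per-window read counts from the two mappings.

    Returns a list of (window_start, absolute_difference, relative_change),
    where relative_change is the absolute difference divided by the larger
    of the two counts. Windows whose larger count is below min_reads are
    skipped (assumed filter, see the lead-in).
    """
    results = []
    for win in sorted(set(ref_counts) | set(ind_counts)):
        ref_n = ref_counts.get(win, 0)
        ind_n = ind_counts.get(win, 0)
        largest = max(ref_n, ind_n)
        if largest < min_reads:
            continue
        diff = abs(ind_n - ref_n)
        results.append((win, diff, diff / largest))
    return results


# Hypothetical counts for three windows on one chromosome.
ref_counts = {0: 10, 200: 4, 400: 20}
ind_counts = {0: 15, 200: 5, 400: 21}
changes = window_changes(ref_counts, ind_counts)
strong = [c for c in changes if c[2] >= 0.10]  # windows with >= 10% change
```

Dividing by the larger count keeps the relative change between 0 and 1 regardless of which genome yields more alignable reads.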
Funding: NIH DP2-OD007417; NSF DBI 08-45823; NSF DBI 09-60583.
Conflict of Interest: none declared.