We compared the DNA and RNA sequences from B cells of 27 unrelated CEPH individuals (table S1). We chose these samples because much information is available on them, including dense DNA genotypes obtained using different technologies (20). The genomes of B cells from the CEPH collection are stable, as evidenced by Mendelian inheritance of genetic loci that allowed the construction of microsatellite- to single-nucleotide polymorphism (SNP)–based human genetic maps (20). More recently, the International HapMap Consortium (17) obtained millions of SNP genotypes, and the 1000 Genomes Project (19) sequenced the DNA of these individuals. Comparison of sequence data from these two projects showed high concordance (~99%). Here, we used the DNA genotypes and sequences from the two projects for our analyses. First, we considered sites that are monomorphic in the human genome. A monomorphic site is one where there is no evidence for sequence variation at that locus in dbSNP, the HapMap, and the 1000 Genomes Project. Different studies have analyzed these 27 and hundreds of additional individuals for DNA variants; thus, if a site has not been identified as polymorphic, most likely all individuals have the same sequence at that site. But to be certain, for these sites in the 27 individuals, we compared their DNA sequences from the 1000 Genomes Project with the sequences of the human reference genome and carried out traditional Sanger sequencing (22). To be included in our analysis, each site had to be covered by at least four reads in the 1000 Genomes Project, and its sequence from 1000 Genomes had to be the same as that of the reference genome. To ensure the integrity of the aliquots of B cells that we used for analyses, we carried out Sanger sequencing of their DNA and found perfect concordance of sequences with data from the 1000 Genomes Project (and thus also with the reference genome sequences) (table S2). Second, we considered SNPs. For each individual, a SNP locus was included only if it was homozygous and the HapMap, as well as the 1000 Genomes Project, reported the same sequence. We have high confidence in those sequences because, despite the use of different technologies (microarray-based genotyping in HapMap and high-throughput sequencing in 1000 Genomes), the two projects reported identical sequences.
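The two DNA site-selection rules described above can be sketched in code. This is a minimal illustration, not the pipeline used in the study: the record types, field names, and helper functions are hypothetical, and only the thresholds (at least four 1000 Genomes reads, agreement with the reference, homozygosity, HapMap and 1000 Genomes agreement) come from the text.

```python
from dataclasses import dataclass

MIN_DNA_READS = 4  # minimum 1000 Genomes coverage stated in the text


@dataclass
class MonomorphicCandidate:
    """Hypothetical per-individual record for a putatively monomorphic site."""
    reference_base: str      # base in the human reference genome
    kg_base: str             # base observed in the 1000 Genomes data
    kg_read_depth: int       # number of 1000 Genomes reads covering the site
    known_polymorphic: bool  # variation reported in dbSNP, HapMap, or 1000 Genomes


@dataclass
class SnpCandidate:
    """Hypothetical per-individual record for a SNP locus."""
    hapmap_genotype: str  # two-letter genotype, e.g. "AA"
    kg_genotype: str      # two-letter genotype from 1000 Genomes


def keep_monomorphic_site(site: MonomorphicCandidate) -> bool:
    """Keep a site with no known variation, enough reads, and a reference match."""
    return (
        not site.known_polymorphic
        and site.kg_read_depth >= MIN_DNA_READS
        and site.kg_base.upper() == site.reference_base.upper()
    )


def keep_snp_site(site: SnpCandidate) -> bool:
    """Keep a SNP only if it is homozygous and both projects agree."""
    hapmap = site.hapmap_genotype.upper()
    kg = site.kg_genotype.upper()
    is_homozygous = len(hapmap) == 2 and hapmap[0] == hapmap[1]
    return is_homozygous and hapmap == kg
```

Because heterozygous genotypes are rejected outright, the comparison never has to decide whether "AG" and "GA" denote the same call.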
We sequenced the RNA of B cells from the same 27 individuals using high-throughput sequencing technology from Illumina (23). The resulting RNA sequence reads were mapped to the Gencode genes (24) in the reference human genome. In total, we generated ~1.1 billion reads of 50 base pairs (bp) (~41 million reads and 2 Gb of sequence per individual), of which ~69% of the reads mapped uniquely to the transcriptome [see Methods in (25)]. To be confident of the base calls, for each individual, we focused our analysis on high-quality reads (quality score ≥25) and sites that were covered by at least 10 uniquely mapped reads. Another study (26) had carried out RNA sequencing of the same individuals but at a lower coverage; at these sites we compared our sequences with those from their study, and found that the concordance rate of the sequences is >99.5%. This is reassuring given that the samples were prepared and sequenced in different laboratories.
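The RNA coverage filter and the cross-study comparison can be illustrated in the same way. The sketch below is hedged: the tuple layout, function names, and the choice to apply the quality threshold to each read's call at the site are assumptions made for illustration, while the two thresholds (quality score ≥25, at least 10 uniquely mapped reads) are taken from the text.

```python
MIN_QUALITY = 25    # quality-score threshold stated in the text
MIN_RNA_READS = 10  # minimum number of uniquely mapped reads per site


def usable_depth(read_calls):
    """Count reads at a site that are uniquely mapped and of high quality.

    read_calls is a hypothetical list of (base, quality, uniquely_mapped)
    tuples, one per read covering the site.
    """
    return sum(
        1
        for _base, quality, uniquely_mapped in read_calls
        if uniquely_mapped and quality >= MIN_QUALITY
    )


def site_passes_rna_filter(read_calls):
    """Apply the coverage rule: at least 10 usable reads at the site."""
    return usable_depth(read_calls) >= MIN_RNA_READS


def concordance_rate(calls_a, calls_b):
    """Fraction of sites called in both studies where the calls agree.

    calls_a and calls_b map a (chromosome, position) key to a called sequence.
    """
    shared_sites = calls_a.keys() & calls_b.keys()
    if not shared_sites:
        raise ValueError("no sites are shared between the two call sets")
    matches = sum(1 for site in shared_sites if calls_a[site] == calls_b[site])
    return matches / len(shared_sites)
```

Restricting the comparison to sites present in both call sets mirrors the text, where concordance is assessed only at the sites that were also covered in the lower-coverage study.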