We have presented an accurate personal CNV map of NA10851, a sample that has been widely used as a reference in CGH array experiments (10). The early phases of CNV discovery focused on determining CN-variable genomic regions across human populations, for which detection of absolute CNVs was not critical. As high-resolution CGH array platforms have become available, more precise CNV maps of human populations have been generated (4–6). To employ CNVs in personalized medicine, it is imperative to identify personal CNVs accurately. The accuracy of CGH arrays has often been compromised because the effects of CNVs in the reference sample were not removed, which biased the final results. Thus, it is critical to identify reference CNVs and to develop a streamlined approach for removing their influence from CGH array data. We used a systematic approach to identify the CNVs of the common reference sample, combining its personal genome sequence, obtained by massively parallel sequencing, with ultra-high-resolution CGH array data from 73 individuals hybridized against this sample as the reference. Compared with personal genomes ascertained by a single technology, such as massively parallel sequencing or CGH array alone, the NA10851 genome revealed by combining these methods allowed us to obtain the most accurate estimates of personal CNVs to date.
The predominance of CN losses over gains is in good agreement with previous reports (8). Although we believe that this is a genuine characteristic feature of human CNVs, some technical issues are worth considering. Generally, CN losses are easier to identify than CN gains with both the hybridization and the RD methods. In particular, a high proportion of CN gains lie in duplicated genomic regions, where it is difficult to design unique, good-quality probes for CGH arrays or to align short reads in resequencing methods. In addition, the copy numbers of DNA segments in repetitive regions, such as microsatellites, vary almost continuously among human populations, making detection of integer CN difficult. Moreover, insertions of DNA sequences that are absent from the human reference genome cannot be detected using standard CGH arrays. Approaches that do not depend on the human reference genome, such as de novo assembly, are therefore needed to identify all CN gains and their exact integer CNs in a personal genome.
NA10851 is the most widely employed individual genome in CGH arrays. Therefore, information on its genomic variants, together with CARA, will contribute toward accurate estimation of CNVs and their use in personalized medicine. Most human genomic variations have been analyzed, cataloged and annotated in public databases based on the ‘Human Reference Genome’, which was sequenced and assembled by the Human Genome Project (24). CARA enables the determination of personal CNVs relative to the human reference genome rather than to an arbitrary sample, NA10851. The high concordance between CGH array and RD after CARA shows the utility of CARA for accurately identifying personal CNVs. Using CARA, absolute CNVs from a variety of DNA samples, including cancer cells and mosaic samples, can be assessed, provided that NA10851 is used as the reference for the CGH arrays. A more accurate determination of the genomic variants of NA10851 can increase the accuracy of the CARA adjustment. Therefore, it is critical to collect and release information on the genomic variants of NA10851, such as newly detected CNVs or more precise CNV breakpoints. In addition, deeper sequencing of NA10851, which will provide much higher RD and a more accurate RD ratio, will also be valuable for finer adjustment. We have opened a database of NA10851 genomic variants on our website (http://cara.gmi.ac.kr) and have released all relevant information, such as the CNVs, SNPs, short indels and RD of NA10851. We hope to receive new information from the community. As the information is updated, new versions of CARA will be released.
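The general idea of removing a reference sample's CNVs from an array measurement can be illustrated with a minimal sketch. This is not the CARA implementation: the function name `adjust_log2_ratio`, the assumption that the reference sample's CN in each segment is already known (for example, estimated from RD), and the two-copy baseline are all illustrative assumptions. A CGH array reports log2(test/reference), so adding log2(reference CN / 2) re-expresses the ratio relative to a two-copy baseline.

```python
import math

def adjust_log2_ratio(array_log2_ratio, reference_cn, baseline_cn=2.0):
    """Re-express a test-vs-reference log2 ratio relative to a two-copy baseline.

    array_log2_ratio: log2(test CN / reference CN) as measured on the array.
    reference_cn: copy number of the reference sample in this segment
                  (assumed known here, e.g. estimated from read depth).
    Returns None when the reference has zero copies, because the array
    ratio is then undefined and cannot be corrected this way.
    """
    if reference_cn <= 0:
        return None
    return array_log2_ratio + math.log2(reference_cn / baseline_cn)

# Hypothetical segment: the reference carries a one-copy loss and the
# test sample is normal. The array shows an apparent gain, log2(2/1) = 1,
# and the adjusted ratio returns to 0 (no change from two copies).
apparent = math.log2(2 / 1)
print(adjust_log2_ratio(apparent, reference_cn=1))
```

In this sketch a segment where the reference has two copies is left unchanged, so the correction only affects regions that overlap CNVs of the reference sample.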
Along with CGH arrays, massively parallel sequencing is a powerful tool for identifying CNVs. However, systematic differences in CNV calls, caused by the use of an arbitrary reference sample in CGH arrays, have interfered with complete sample-matched CNV comparisons between the two technologies (12). By correcting this limitation of CGH arrays using CARA, we were able to obtain the highest sample-matched concordance between the technologies.
To assess the impact of human CNVs, the integer CN (e.g. 0, 1, 2, 3) of each segment should be genotyped. Although CARA enables the detection of normal CN as well as CN gains and losses, it cannot assess integer CN, especially for CN gains. The methodology for accurately identifying personal structural variations will improve continuously as new algorithms are developed for ultra-high-resolution CGH arrays and massively parallel sequencing. Ultimately, a rapid algorithm for detecting the integer CN of genes will be developed by combining all available CNV data. These efforts will enable the identification of disease-related CNVs and an understanding of their role in the pathophysiology of complex human diseases.
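One way to see why integer CN is harder to resolve for gains than for losses is to look at how ideal log2 ratios are spaced. The sketch below is an illustration based only on the arithmetic of ratios against a two-copy baseline. It is not an explanation given in the text above, and the function name `log2_ratio_for_cn` is hypothetical.

```python
import math

def log2_ratio_for_cn(cn, baseline_cn=2):
    """Ideal log2 ratio of a segment with integer copy number cn (cn >= 1)."""
    return math.log2(cn / baseline_cn)

# Gap in ideal log2 ratio between consecutive integer copy numbers.
for cn in range(1, 6):
    gap = log2_ratio_for_cn(cn + 1) - log2_ratio_for_cn(cn)
    print(f"CN {cn} -> {cn + 1}: gap = {gap:.3f}")
```

The gap shrinks from 1.0 between CN 1 and 2 to about 0.26 between CN 5 and 6, so the same amount of measurement noise blurs neighboring integer states more at higher copy numbers.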