Few evidence-based best practice bioinformatics guidelines exist for genotyping using next-generation sequencing data, especially colorspace data produced by Life Technologies sequencers. Dozens of software packages can perform the various steps required, and genome features such as pseudogenes or large paralogous gene families are problematic. High false positive and negative rates can compound the difficulty of cohort analysis.
Using a Sanger-validated set of 32 BRCA gene regions from 16 patients, high-throughput colorspace (Life Technologies) sequencing performance was optimized by comparing various combinations of sequence aligners, re-aligners, de-duplicators, quality re-calibrators and genotype callers. Independently, six exomes were captured using the Agilent SureSelect v3 kit. The optimized pipeline was applied, and results were compared to microarray genotyping to characterize false positives and negatives. A further four exomes were paired-end sequenced on both the Life Technologies 5500xl and Illumina HiSeq sequencers to check platform concordance. Variant metrics for each exome were compared to the literature.
In the clinic, individual exomes are manually triaged by a medical geneticist, and salient variants are confirmed by Sanger sequencing. For disease cohorts, software was developed to isolate variants possibly causing monogenic rare diseases, taking likely false positives into account.
Using results from Life Technologies' reference genome aligner, the intersection of single nucleotide polymorphism (SNP) calls from FreeBayes (with SamTools de-duplication) and Life Technologies' diBayes (with Picard de-duplication) was optimal. Using reads realigned by the Broad Institute Genome Analysis Toolkit (GATK), the intersection of insertion and deletion calls from FreeBayes and Atlas2 was optimal. A threshold of 14% variant reads for true heterozygous calls was observed.
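The two rules above (keep only calls made by both callers, and require a minimum fraction of variant reads for a heterozygous call) can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the `Call` record, the function names, the toy coordinates, and the choice to treat the 14% threshold as inclusive are all assumptions made here.

```python
from typing import Iterable, NamedTuple, Set


class Call(NamedTuple):
    """Hypothetical minimal variant record used only for this sketch."""
    chrom: str
    pos: int
    ref: str
    alt: str


# Threshold reported in the text; treating it as inclusive is an assumption.
MIN_HET_FRACTION = 0.14


def concordant_calls(calls_a: Iterable[Call], calls_b: Iterable[Call]) -> Set[Call]:
    """Return only the variants reported by both callers (set intersection)."""
    return set(calls_a) & set(calls_b)


def passes_het_threshold(variant_reads: int, total_reads: int,
                         min_fraction: float = MIN_HET_FRACTION) -> bool:
    """True if the fraction of variant-supporting reads reaches the threshold."""
    if total_reads <= 0:
        return False
    return variant_reads / total_reads >= min_fraction


# Toy coordinates, not real BRCA positions.
freebayes_snps = [Call("17", 100, "G", "A"), Call("13", 200, "A", "C")]
dibayes_snps = [Call("17", 100, "G", "A"), Call("13", 300, "A", "G")]

shared = concordant_calls(freebayes_snps, dibayes_snps)
```

Here `shared` holds only the chromosome 17 call, since the other two calls are each supported by a single caller. The same intersection applies to insertion and deletion calls, although real indel comparison usually needs left-normalized representations first, which this sketch does not attempt.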
For bases with 10× coverage, variant calls are on average 98.9% concordant with SNP microarrays (versus 99.2% microarray technical reproducibility). False positive and negative variant rates are each approximately 0.5%, with all false positives called heterozygous. Concordance with Illumina variant calls from a standard GATK pipeline was 95.2%. GATK produced more novel variants, especially in non-unique genomic regions: such variants are flagged with caveats in the colorspace pipeline. In a dominant heterozygous model analysis of five Nager syndrome patients, our cohort analysis software excluded 15 of 19 candidate genes, based mainly on a preponderance of genotype caveats.
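As a rough illustration of how concordance and error rates of this kind might be tallied against microarray genotypes, the sketch below compares per-site genotypes after a coverage filter. The genotype labels (`"ref"`, `"het"`, `"hom"`), the function name, the reading of "10× coverage" as at least 10 reads, and the use of all compared sites as the denominator are assumptions; the abstract does not give the exact definitions used.

```python
def compare_to_array(seq_calls, array_calls, coverage, min_coverage=10):
    """Tally agreement between sequencing and microarray genotypes.

    seq_calls, array_calls: dict mapping site -> "ref", "het" or "hom".
    coverage: dict mapping site -> read depth at that site.
    Sites below min_coverage, or absent from seq_calls, are skipped.
    """
    compared = concordant = false_pos = false_neg = 0
    for site, array_gt in array_calls.items():
        if site not in seq_calls or coverage.get(site, 0) < min_coverage:
            continue
        seq_gt = seq_calls[site]
        compared += 1
        if seq_gt == array_gt:
            concordant += 1
        elif array_gt == "ref":
            false_pos += 1   # variant called where the array sees reference
        elif seq_gt == "ref":
            false_neg += 1   # reference called where the array sees a variant
        # het/hom disagreements count as discordant but as neither FP nor FN
    if compared == 0:
        raise ValueError("no sites passed the coverage filter")
    return {
        "compared": compared,
        "concordance": concordant / compared,
        "false_positive_rate": false_pos / compared,
        "false_negative_rate": false_neg / compared,
    }


# Toy data only; these are not results from the study.
array = {"a": "het", "b": "ref", "c": "hom", "d": "het", "e": "ref"}
seq = {"a": "het", "b": "het", "c": "ref", "d": "het", "e": "ref"}
depth = {"a": 20, "b": 15, "c": 12, "d": 5, "e": 30}

stats = compare_to_array(seq, array, depth)
```

In the toy data, site `d` is dropped by the coverage filter, leaving four compared sites: two concordant, one false positive, and one false negative.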
Many published metrics for SNP quality control are based on a small number of genomes elucidated using other technologies, but Table 1 shows overall agreement with the optimized colorspace pipeline results.
Low false positive and negative rates using colorspace data can be achieved by: first, reporting only concurrent variants from multiple methods; and second, reporting caveats where the reference sequence is not unique. Accurate calls and caveats enable major cohort gene triage when modeling diseases caused by monogenic rare variants.
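One way caveat-aware gene triage under a dominant heterozygous model could work is sketched below. The abstract does not describe the cohort software's actual rules, so everything here is hypothetical: the tuple layout, the requirement that every patient carry a heterozygous variant in the gene, and the 50% cutoff standing in for a "preponderance" of caveats.

```python
from collections import defaultdict


def triage_genes(variants, n_patients, max_caveat_fraction=0.5):
    """Split candidate genes into kept and excluded lists.

    variants: iterable of (patient, gene, zygosity, has_caveat) tuples.
    A gene is kept only if every patient has a heterozygous variant in it
    and at most max_caveat_fraction of those calls carry a caveat.
    """
    het_calls = defaultdict(list)
    for patient, gene, zygosity, has_caveat in variants:
        if zygosity == "het":  # dominant heterozygous model
            het_calls[gene].append((patient, has_caveat))

    kept, excluded = [], []
    for gene, calls in sorted(het_calls.items()):
        patients = {patient for patient, _ in calls}
        caveat_fraction = sum(has_caveat for _, has_caveat in calls) / len(calls)
        if len(patients) == n_patients and caveat_fraction <= max_caveat_fraction:
            kept.append(gene)
        else:
            excluded.append(gene)
    return kept, excluded


# Toy cohort of two patients with placeholder gene names.
cohort = [
    ("p1", "GENE_A", "het", False),
    ("p2", "GENE_A", "het", False),
    ("p1", "GENE_B", "het", True),
    ("p2", "GENE_B", "het", True),
    ("p1", "GENE_C", "het", False),
    ("p1", "GENE_D", "hom", False),
]

kept, excluded = triage_genes(cohort, n_patients=2)
```

In this toy cohort only `GENE_A` survives: `GENE_B` is dominated by caveated calls, `GENE_C` is seen in one patient only, and `GENE_D` has no heterozygous call and so is never a candidate.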
We thank Dr Richard Pon's laboratory for producing the high-quality colorspace data. We also thank the FORGE Consortium for the HiSeq-derived genotypes.