The recent successful applications of next-generation sequencing (NGS) technologies to identify disease-associated variants have revolutionized biomedical and biological research, especially in human disease studies 
. Rapid advances in NGS technologies, along with the dramatic decrease of cost, have propelled them to become a major approach in research. As of September 2011, more than 40 Mendelian diseases have been analyzed using whole exome sequencing (WES) 
, and more than 10 complex diseases have been studied using whole genome sequencing (WGS) and/or WES, including, but not limited to, renal cancer 
, melanoma 
, hepatocellular carcinoma 
, acute monocytic leukemia 
, and head and neck squamous cell carcinoma 
. While most of the early NGS studies were conducted by large sequencing centers or prominent research groups 
, the tremendous improvement in technologies during the past two to three years has dramatically reduced cost and hastened the speed of sequencing a genome within a short period of time. Subsequently, NGS technologies are now affordable and accessible to small or moderately sized laboratories and are expected to quickly evolve as a routine experimental technique in a similar fashion as the now common use of microarray.
NGS generates massive amounts of data for genetic variant detection. Thus, currently, a major bottleneck of NGS applications is downstream bioinformatics analysis. This problem is especially challenging for bench scientists. To meet this strong demand, many investigators have been redirected to this new field and are in the early stages of acquiring this technological knowledge, especially pertaining to data analysis. Meanwhile, a great number of computational tools dedicated to almost all aspects of NGS data analyses have been developed during the past few years. However, these tools are complicated and have project-specific features. Firsthand experience in using these tools is important, especially because investigators need to be able to readily identify true variants for validation from the typically millions of variants called by NGS computational tools. In this paper, we discuss four major parameters that affect variant calling and the validation process, aiming to provide some general guidelines to the NGS community, especially for those new to NGS applications.
Among the various types of mutations that cause diseases, single nucleotide variants (SNVs) and small insertions and deletions (indels) are the most abundant. The detection of these variants is critical in both WGS and WES studies. In the NGS data analysis pipeline, SNV/indel calling is performed after mapping reads to a reference genome, typically generating an initial set of SNVs/indels. Based on several recently sequenced individual genomes 
, a pattern has been recognized that, in general, approximately 3–4 million SNVs are expected to be found in a human genome by WGS when compared to the reference genome 
, and ~20,000 SNVs are to be found in a human exome by WES 
. Some of these SNVs might, however, be false positives. An open question is how to identify a set of SNVs with high enough quality for follow up validation or further analysis. In an early study by Ley et al. 
, the authors generated an initial calling list of SNVs (~3.8 million SNVs) using the software Maq 
and then selected a subset of mutations from the list for experimental validation. After they used their experimental data as a training dataset for the Decision Tree C4.5 algorithm, the authors successfully learned a set of critical rules (e.g., based on read counts, base quality and SNP quality scores), which were then used to predict a small yet well supported set of ~2.6 million SNVs with high accuracy 
. In another study, Wei et al. 
called SNVs using the software bam2mpg 
and developed a ratio score to evaluate the quality of the initially called genotypes. Based on their experimental data, the authors estimated a threshold of this ratio score and used it to filter their initial list of SNVs.
These two studies, as well as several large-scale sequencing projects 
, have shown that post-filtering of SNVs is essential to identify variants that are more likely to be true while effectively filtering false positives. Although not formulated, a few consistent filtering rules have been recognized by the community, including base quality, mapping quality, and coverage of supporting reads. Additionally, there are several other factors that may affect the accuracy of SNV/indel calling, such as sequence complexity and the fitness of the algorithms used in a specific case. In this study, we discuss four parameters that affect SNV and small indel calling, which are critical in NGS applications. These parameters are (1) variant quality and coverage, (2) refinement and improvement of initial mapping, (3) allele/strand balance, and (4) examination of spurious genes. Although some of them have been discussed in previous studies in various forms 
, here, we systematically examined and demonstrated these rules using our own experimental data, so that they may be generally applied to different NGS data analyses.