Massively-parallel DNA sequencing technologies are rapidly maturing, and are now being applied to a wide variety of projects of different scales and aims. A frequent goal among these projects is to examine inter-individual genomic variation, and to correlate this variation with phenotype. This approach has proven useful in medical sequencing by identifying causative variants in disease (Rios et al., 2010; Sobreira et al., 2010; Wei et al., 2011; for further examples, see Mardis and Wilson, 2009; Ng et al., 2010; Teer and Mullikin, 2010). Although genome sequencing studies are becoming increasingly practical, the use of DNA-capture technologies (e.g., exome capture) increases the number of samples that can be investigated. The large number of sequence variants typically identified in such projects can be daunting. For example, a single exome sequencing experiment can lead to the detection of >200 000 variants. While these variant lists can be reduced using custom filtering scripts, or command-line filtering programs such as GATK (McKenna et al., 2010), doing so requires significant bioinformatics knowledge. These large datasets therefore present a challenge for those with limited bioinformatics expertise and resources.
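To illustrate what a "custom filtering script" involves, the following is a minimal sketch in Python, not a method used in this work. It assumes a tab-delimited variant table; the column names (`chrom`, `pos`, `gene`, `type`, `quality`), the example rows, and the thresholds are hypothetical and would differ for real annotation output.

```python
import csv
import io

# Hypothetical variant table; real annotation output will use different columns.
SAMPLE = (
    "chrom\tpos\tref\talt\tgene\ttype\tquality\n"
    "chr1\t1000\tA\tG\tGENE1\tmissense\t50\n"
    "chr1\t2000\tC\tT\tGENE1\tsynonymous\t60\n"
    "chr2\t3000\tG\tA\tGENE2\tnonsense\t12\n"
    "chr3\t4000\tT\tC\tGENE3\tnonsense\t45\n"
)


def filter_variants(handle, keep_types, min_quality):
    """Yield rows whose type is in keep_types and whose quality meets the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["type"] in keep_types and float(row["quality"]) >= min_quality:
            yield row


# Keep only protein-altering variants with quality of at least 20 (assumed cutoff).
kept = list(filter_variants(io.StringIO(SAMPLE), {"missense", "nonsense"}, 20.0))
for row in kept:
    print(row["chrom"], row["pos"], row["gene"], row["type"])
```

Even a filter this simple requires the investigator to know the file layout, choose thresholds in code, and rerun the script for every change in criteria, which is the barrier described above.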
Various programs have been written to visualize massively-parallel DNA sequence reads and their alignments (for review, see Nielsen et al., 2010), but these are not designed to view or analyze the identified variants. Some options for viewing annotated sequence variants are available (e.g., SVA (Ge et al., 2011) or Galaxy (Goecks et al., 2010)), but they are designed more as analysis packages with limited viewing options, and require dedicated high-performance workstations or servers. Recently, a probabilistic search tool designed to identify causative variants and genes, VAAST, was described (Yandell et al., 2011). Although powerful, this analysis tool is designed to identify the genes most likely to be involved in disease, not to browse variant data.
There is a need for analysis systems that allow investigators to view and manipulate variants identified using next-generation sequencing technologies, but that do not require high-end computational infrastructure or end-user bioinformatics expertise. Such systems would be designed with genetic variation data in mind, and would allow users to apply their expert knowledge in many different ways when deciding how to filter or prioritize the variants. This flexibility would open data mining possibilities beyond those offered by tools with static filtering strategies. Towards that end, we have developed a program (VarSifter) that allows researchers to view exome-scale sequence variation and to perform the sorting, filtering, and searching required to analyze these data for biological relevance.