Inversions have long been known to play an important role in chromosomal evolution [5]. Indeed, large inversions are thought to contribute to speciation through reproductive isolation caused by reduced recombination between normal and inverted chromosomes [6]. Inversions of a wide range of sizes are abundant in mammalian lineages [7]. Additionally, extensive study of inversions in Drosophila revealed that inversions can leave genetic signatures, such as reduced nucleotide variation, within the inverted region [8]. More recently, many polymorphic inversions have also been found in humans [3]. A number of these have functional consequences; polymorphic inversions have been associated with genetic disorders [13], complex disorders such as asthma [14] and even positive selection [15].
Recent resequencing efforts of many human genomes continue to reveal the prevalence of structural variation in humans [16]. However, despite decreases in the cost of sequencing, genotyping microarrays remain the most cost-effective technology for analyzing entire genomes in thousands of individuals. Moreover, inversions have traditionally been difficult to study using experimental techniques. The typical presence of large inverted repeats at the breakpoints is a major challenge for their detection, even with current next-generation sequencing techniques. While only a few inversions (~15) have been experimentally validated [11], a few studies have scanned the whole genome, providing experimental evidence for a number of candidate regions. For example, Levy and colleagues determined inverted regions from the whole-genome assembly of one subject [3], and Kidd et al. used fosmid paired-end mapping in nine individuals [4]. As such, the ability to accurately predict new inversions and infer their status in large numbers of subjects would provide a valuable tool for clinical and evolutionary studies of human populations.
Previous studies have used haplotype tagging to indirectly infer which chromosomes are most likely to carry an inversion, assuming recombination suppression in inversion heterozygotes. For instance, Stefansson et al. [15] showed that an inversion within 17q21 in the European population can be tagged with two different haplotype groups (H1/H2), each related to a polymorphic variant of the MAPT gene. Although haplotype tagging can be performed on large groups, it is suitable only for regions known to have inversions and known to exhibit divergence between the two arrangements. Three studies [13], for instance, identified two haplotype groups within regions containing the known 17q21 and 8p23 inversions, and then experimentally validated the tagging on selected samples of the HapMap population. Taken together, this small group of subjects can be used to validate newly developed methods that determine the inversion status of individuals.
Using a different approach, based on differences in linkage between groups of SNPs, two other methods have been developed to discover the presence of inversions across the genome [1]. Bansal et al. [25] used differences in linkage disequilibrium (LD) to determine regions likely to be inverted. However, their method requires the human reference to contain the minor allele, and does not predict which chromosomes in the population are most likely to have the inversion, a factor that is essential for association studies. More recently, Sindi and Raphael [1] developed a probabilistic method that models the population as a mixture of normal and inverted haplotypes, and thus has increased power to detect inversions of lower frequency. Although their method accurately predicts inversion frequencies, it does not yield an accurate classification of individuals into normal and inverted subpopulations. In particular, the computationally intensive searches of both methods have failed to identify inversions like the one within 17q21 in the CEU population, for which a clear extended LD has been shown [13]. In addition, both methods were developed for haplotype (phased) data only.
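As a point of reference for the LD-based approaches above, the following minimal sketch computes the standard pairwise LD statistic r² between two biallelic SNPs from phased haplotypes. It is not the procedure of either cited method; the function name `ld_r2` and the 0/1 allele coding are assumptions made purely for illustration.

```python
def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between SNPs i and j.

    `haplotypes` is a list of phased haplotypes, each a sequence of
    alleles coded 0/1 (illustrative coding, not taken from the paper).
    """
    n = len(haplotypes)
    # Allele frequencies of the "1" allele at each SNP.
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    # Frequency of haplotypes carrying the "1" allele at both SNPs.
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    # D measures the departure from independence between the two SNPs.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # At least one SNP is monomorphic; r^2 is undefined, report 0.0 here.
        return 0.0
    return d * d / denom
```

Methods of this family compare such LD measures between groups of SNPs at candidate breakpoints; because the statistic is defined on haplotypes, genotype data must first be phased before it can be used.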
One way to analyze genotype data with these methods is to phase the entire genome and then apply them to the resulting haplotypes. This procedure is computationally demanding. For instance, it has been reported that compiled software such as fastPHASE [26] can take up to 9 h to analyze 60 subjects on a chromosome with 41,018 SNPs (3-GHz Xeon processor with 1 GB). Therefore, a method that directly analyzes genotypes, incorporating only the limited phasing required for inversion detection, can substantially reduce this computational load and allow the complete methodology to be implemented in a single software tool that runs on standard up-to-date machines.
In this work, we propose a new methodology to (1) efficiently detect inversions across the genome by directly using genotype data and (2) accurately classify individuals in the population according to inversion status. In addition, our generalization of the inversion model for haplotypes [1], as a computational technique, allows us to treat the different problem of phasing haplotype blocks separated by any distance. Our new application of the inversion model to the analysis of polymorphic inversions from genotypes increases the applicability of the method to current GWAS and enhances its usability through higher computational efficiency. We provide an efficient implementation of the novel analysis of genotypes, and of the new classification and search methods, in the R package inveRsion, freely available through Bioconductor [27]. We also include a computationally improved version of the previous haplotype model, and the use of the Bayesian Information Criterion (BIC) to gather statistical evidence from neighboring regions.
Both prior LD studies [1] tested their methods by constructing "artificial" inversions, reversing the order of SNPs in phased haplotypes from HapMap. We provide a more rigorous test of our method by employing a recent software tool, invertFREGENE [2], that uses coalescent theory and suppression of recombination between inverted and normal chromosomes to produce artificial haplotype and genotype data. Lastly, we apply our method to HapMap Phase III data, where we compare the analysis of genotypes with that of haplotypes, assess our classification accuracy for two validated inversions and search the whole genome for inversion signals.
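To make the simpler "artificial" inversion construction concrete, the sketch below reverses the SNP order between two breakpoints in a chosen fraction of phased haplotypes. It illustrates only the reversal idea described above, not the simulation performed by invertFREGENE; the function names, the inclusive breakpoint indices and the `frequency` parameter are assumptions introduced for this example.

```python
import random


def invert_segment(haplotype, left, right):
    """Return a copy of `haplotype` with SNPs left..right (inclusive) reversed."""
    h = list(haplotype)
    return h[:left] + h[left:right + 1][::-1] + h[right + 1:]


def make_artificial_inversion(haplotypes, left, right, frequency, seed=0):
    """Reverse the segment in a random subset of haplotypes.

    Returns the new haplotypes and the sorted indices of the inverted ones.
    `frequency` is the assumed fraction of chromosomes carrying the inversion.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    n_inverted = round(frequency * len(haplotypes))
    carriers = set(rng.sample(range(len(haplotypes)), n_inverted))
    result = [
        invert_segment(h, left, right) if k in carriers else list(h)
        for k, h in enumerate(haplotypes)
    ]
    return result, sorted(carriers)
```

Because the carrier indices are returned alongside the data, the true inversion status of every chromosome is known, which is what allows classification accuracy to be measured on such artificial data. A plain reversal does not reproduce the effect of suppressed recombination between arrangements over time, which is the motivation for the simulation-based test.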