Raw signal intensity and genotype calling data were obtained from the WTCCC in an anonymized form, and the analysis of the data was approved by a Columbia Institutional Review Board. Each disease cohort contained approximately 2,000 individuals, while the two control cohorts each contained approximately 1,500 individuals. The Affymetrix platform supports 500,568 SNP loci, of which 459,653 passed the WTCCC quality control procedures [11].
For a SNP locus with an A/B polymorphism, the microarray generates a pair of intensity values IA and IB. Each intensity value is the average intensity over a small number of oligonucleotide probes containing the allele together with some flanking sequence. The (IA, IB) point typically falls within one of three clusters corresponding to the three genotypes AA, AB, and BB.
Consider now an individual with an AA genotype. Suppose that 20% of the sampled cells of this individual have undergone gene conversion in which one of the A alleles has been converted into a B allele by a homologous sequence, while the flanking sequence has remained unchanged. The left example of Figure shows this kind of conversion. (Conversion of both A alleles would be rare, and is ignored.) This individual will display an overall (IA, IB) intensity pair that is 20% of the way from the AA cluster to the AB cluster. In another individual with a heterozygous AB genotype, a 20% conversion rate at the same locus would yield an overall (IA, IB) intensity pair that is 10% of the way from the AB cluster to the BB cluster, since only the conversion of the A allele will cause a change in probe intensities. In an individual with a BB genotype, no change would be observed.
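The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: it assumes that the measured intensity is a linear mixture of the intensities of converted and unconverted cells, and the cluster centres in `CENTRES` are hypothetical values chosen for readability, not values taken from the WTCCC data.

```python
def interpolate(start, end, fraction):
    """Point lying `fraction` of the way from `start` to `end`."""
    return tuple(s + fraction * (e - s) for s, e in zip(start, end))


def expected_intensity(genotype, conversion_rate, centres):
    """Expected (IA, IB) when `conversion_rate` of the sampled cells have had
    one allele converted from A to B (assumes linear mixing of cell signals)."""
    if genotype == "AA":
        # Every converted cell looks like an AB cell.
        return interpolate(centres["AA"], centres["AB"], conversion_rate)
    if genotype == "AB":
        # Only conversions that hit the A allele change the probe signal,
        # so the effective shift towards BB is half the conversion rate.
        return interpolate(centres["AB"], centres["BB"], conversion_rate / 2)
    # BB: converting a B allele to B changes nothing.
    return centres["BB"]


# Hypothetical cluster centres, for illustration only.
CENTRES = {"AA": (2.0, 0.0), "AB": (1.0, 1.0), "BB": (0.0, 2.0)}

for genotype in ("AA", "AB", "BB"):
    print(genotype, expected_intensity(genotype, 0.20, CENTRES))
```

With these assumed centres, a 20% conversion rate moves an AA individual one fifth of the way towards the AB centre and an AB individual one tenth of the way towards the BB centre, as described above.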
Because there is experimental variation in intensity measurements, it may be difficult to determine whether a small perturbation in a single measurement represents gene conversion or merely noise. However, it is possible to study the distribution of perturbations for a population at a locus. If a population has a significant spread of intensities between clusters while the control populations do not, then one can hypothesize that gene conversion at that locus is happening in a population-specific manner. See the cluster plot for RA in Figure for an example. If the population is a disease cohort, then the locus may be associated with the disease phenotype.
Returning to the example above, consider the complementary situation in which the flanking sequence near an SNP probe has been converted. Whether or not the SNP locus is changed, the converted sequence will no longer match either probe sequence. The right example of Figure shows this kind of conversion. If many cells in an individual are converted in this fashion, a reduced signal from this sequence will be measured by both probes of the microarray. For a locus at which this effect is associated with the disease phenotype, all clusters will shift radially towards the origin in the cluster plots for the disease population.
Calling algorithms attempt to identify the boundaries of clusters corresponding to the AA, AB and BB genotypes. For example, the Chiamo algorithm [11] considers all populations simultaneously and estimates cluster boundaries in a way that allows for some population-dependent differences. The intensity distributions vary from SNP to SNP, and so clustering is performed separately for each SNP.
Based on the analysis above, gene conversion for a particular population should be accompanied by either (a) an increase in the spread of the two-dimensional intensity distribution relative to the control population, or (b) a translation of the clusters towards the origin, relative to the control population. In case (a), there should be an increase in the number of points that are either between clusters, or on the fringe of a cluster. In case (b), there should be a decrease in the distance between clusters, leading to an increase in the number of points whose cluster assignment is ambiguous. Either way, there will be an increase in the number of no-calls generated by the calling algorithm, relative to the control populations. This is one 'signature' of gene conversion that I will try to identify.
The Chiamo calling algorithm has been applied to the WTCCC data, and it is possible to use those calls to help recognize the signature of gene conversion. Chiamo generates a confidence score for a call; the authors of the Chiamo algorithm recommend that when this score is below 0.9, the genotype should be considered a 'no-call.' When clusters are more dispersed, their peripheries can begin to overlap with each other. In such a situation, the Chiamo algorithm will have less certainty about points falling in the intermediate regions. Chiamo will define cluster boundaries more tightly, resulting in an increase in the no-call rate for intermediate points [11]. An example of this phenomenon is given in Figure , where the orange points (that are particularly frequent in RA at this locus) are no-calls.
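As a minimal sketch of how the 0.9 confidence threshold turns calls into no-calls, the following Python fragment may be helpful. The input format (a list of genotype and confidence pairs) and the function names are assumptions made for illustration; they do not reflect the actual Chiamo output format.

```python
NO_CALL_THRESHOLD = 0.9  # recommended cut-off quoted in the text


def apply_threshold(calls, threshold=NO_CALL_THRESHOLD):
    """Replace low-confidence genotypes by None (a no-call).

    `calls` is a list of (genotype, confidence) pairs; this layout is
    hypothetical and chosen only for the example.
    """
    return [g if conf >= threshold else None for g, conf in calls]


def no_call_rate(genotypes):
    """Fraction of individuals without a confident genotype."""
    return sum(g is None for g in genotypes) / len(genotypes)


example = [("AA", 0.99), ("AB", 0.85), ("BB", 0.90), ("AB", 0.50)]
called = apply_threshold(example)
print(called, no_call_rate(called))
```

A score of exactly 0.9 is kept as a call here, since the recommendation applies to scores below 0.9.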
An increase in no-calls between two clusters can lead to a biased allele distribution in the called genotypes. For example, if there are many no-calls between the AA and AB clusters, then the A allele will be underrepresented among the subpopulation whose genotypes are called with high confidence. This bias is another possible signature for gene conversion. (See Additional file 1 for an extended discussion of no-calls.) Note that there may be cases of gene conversion that do not show this signature because the non-called points do not change the observed allele frequencies.
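The bias can be illustrated with a small calculation. The genotype counts below are invented for the example and are not drawn from any cohort; the sketch simply compares the A allele frequency before and after some AA and AB individuals become no-calls.

```python
def allele_a_frequency(n_aa, n_ab, n_bb):
    """Frequency of the A allele among individuals with called genotypes."""
    return (2 * n_aa + n_ab) / (2 * (n_aa + n_ab + n_bb))


# Hypothetical cohort in which every genotype is called.
all_called = allele_a_frequency(500, 400, 100)

# Same cohort after 100 AA and 100 AB individuals, lying between the
# AA and AB clusters, are reported as no-calls.
after_no_calls = allele_a_frequency(400, 300, 100)

print(all_called, after_no_calls)
```

In this invented case the A allele frequency among called genotypes drops from 0.70 to about 0.69, which is the direction of bias described above.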
To identify gene conversion events, I take three complementary approaches. The first approach, which I call the 'stringent' filter, is designed to optimize precision, that is, to minimize the number of false positives while possibly missing some true positives. The second approach, called the 'relaxed' filter, is designed to provide better recall, that is, to include more true positives at the risk of also including false positives. The third approach, termed the 'no-call-only' filter, looks only for extreme no-call rates, since some gene conversion loci may not exhibit changes in called allele frequencies.
For the stringent filter, SNPs with high no-call rates in a disease population relative to the union of the two control populations are initially selected. A chi-squared statistic is calculated for each SNP based on a 2 × 2 chi-squared test comparing the numbers of calls and no-calls in the disease population with those in the control population. Only SNPs with an increase in the no-call rate in the disease population and a chi-squared statistic corresponding to P < 5 × 10⁻⁵ in a one-sided test are retained by this initial selection.
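The initial selection can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used: it computes the Pearson statistic without a continuity correction, and it obtains the one-sided P value by halving the two-sided P value when the no-call rate is higher in the disease population. The function names are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the table [[a, b], [c, d]]
    (no continuity correction; an assumption of this sketch)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def one_sided_p(stat, increased):
    """One-sided P value for a 1-df chi-squared statistic."""
    two_sided = math.erfc(math.sqrt(stat / 2))  # survival function, 1 df
    return two_sided / 2 if increased else 1 - two_sided / 2


def passes_no_call_test(d_nocall, d_total, c_nocall, c_total, alpha=5e-5):
    """True if the disease cohort has a significantly higher no-call rate."""
    increased = d_nocall / d_total > c_nocall / c_total
    stat = chi2_2x2(d_nocall, d_total - d_nocall,
                    c_nocall, c_total - c_nocall)
    return increased and one_sided_p(stat, increased) < alpha


# Invented counts: 100 no-calls among 2,000 cases, 30 among 3,000 controls.
print(passes_no_call_test(100, 2000, 30, 3000))
```

The cohort sizes in the usage line mirror the approximate sizes given earlier (one disease cohort against the union of the two control cohorts); the no-call counts are invented.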
A further selection is applied to test for a bias in the genotype distribution in the disease population relative to controls. Bias is assessed in one of two ways; an SNP that displays bias according to either of these tests is retained. Only SNPs in which the control population has at least ten individuals for each of the AA, AB and BB genotypes are considered. First, the three genotype frequencies in the disease population are compared with the corresponding frequencies in the control population using a 3 × 2 chi-squared test to determine the likelihood that they have a common distribution. Only SNPs with a chi-squared statistic corresponding to P < 5 × 10⁻⁴ in a two-sided test are retained. Second, the three genotype frequencies in the disease population and control population are separately assessed for departure from Hardy-Weinberg Equilibrium using a conventional 3 × 2 chi-squared test. Only SNPs with a chi-squared statistic corresponding to P < 5 × 10⁻⁴ in a two-sided test in the disease population and a chi-squared statistic corresponding to P > 0.01 in the control population are retained.
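The two bias tests can be sketched in Python as shown below. Several details are assumptions of this sketch and may differ from the analysis actually performed: the genotype comparison is treated as a Pearson test with two degrees of freedom, and the Hardy-Weinberg test is implemented as the usual goodness-of-fit test with one degree of freedom. All function names are hypothetical.

```python
import math


def chi2_3x2(disease, control):
    """Pearson statistic comparing (AA, AB, BB) counts in two cohorts."""
    n_d, n_c = sum(disease), sum(control)
    n = n_d + n_c
    stat = 0.0
    for d, c in zip(disease, control):
        for observed, column_total in ((d, n_d), (c, n_c)):
            expected = (d + c) * column_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def p_chi2_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(stat / 2))


def p_chi2_df2(stat):
    """Upper-tail probability of a chi-squared variable with 2 df."""
    return math.exp(-stat / 2)


def hwe_p(counts):
    """Hardy-Weinberg goodness-of-fit P value (1 df assumed)."""
    n_aa, n_ab, _ = counts
    n = sum(counts)
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected) if e > 0)
    return p_chi2_df1(stat)


def shows_bias(disease, control, alpha=5e-4, control_hwe_min=0.01,
               min_count=10):
    """Retain an SNP if either bias test is satisfied."""
    if min(control) < min_count:
        return False  # too few controls in some genotype class
    differs = p_chi2_df2(chi2_3x2(disease, control)) < alpha
    hwe_departure = (hwe_p(disease) < alpha
                     and hwe_p(control) > control_hwe_min)
    return differs or hwe_departure


# Invented counts: controls in equilibrium, cases with a heterozygote deficit.
print(shows_bias((800, 700, 500), (900, 1200, 400)))
```

The closed forms used for the upper-tail probabilities are exact for one and two degrees of freedom, which avoids any dependence on a statistics library.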
Gene conversion appears to require at least 300 base pairs of homology in humans [1]. Among known gene conversion loci, the smallest degree of identity between the homologous regions is 88% [1]. One should therefore not expect newly discovered loci to have identity much below 88%. I will thus use 85% identity as a lower bound for the stringent filter.
The candidate SNPs were evaluated for homologous flanking sequence elsewhere in the genome. The UCSC database of segmental duplications [25] was used to identify genomewide duplications with at least 1,000 base pairs of homology (after elimination of low-complexity repeats) and at least 90% identity. Additionally, each SNP that met the other stringent filter conditions was subjected to manual analysis using the BLAST network service at NCBI to identify duplications that may not meet the thresholds of the segmental duplication database, but that may still be relevant for gene conversion. (I used the Megablast algorithm with default parameters. When a duplicon contains several almost-contiguous segments, the identity of the duplicon is the identity reported by BLAST for the segment containing the region that maps to the SNP under consideration.) The three filters are summarized in Table 1. The relaxed and no-call-only filters use different homology criteria from the stringent filter so that the segmental duplication database can be used to automate the analysis. Because the segmental duplication database excludes regions with low-complexity repeats, some SNPs in regions with more than 90% homology (for example, rs9378249) are not in the segmental duplication database.
Table 1. Summary of the three data filters.
The analysis does not consider SNPs on the Y chromosome. For the X chromosome, the analysis is limited to the female subpopulation within each cohort. As a result, some statistical power is lost, particularly for cohorts such as CAD that have a relatively small number of female members.
Cluster plots for all SNPs mentioned in the text can be found in Additional file 2.
Sources of variation
Copy number variations at an SNP locus mean that in addition to the conventional AA, AB, and BB genotypes, there may be additional genotypes such as AAB and B. Each of these alternative genotypes would have its own cluster in the cluster plot, which can be examined for signs of more than three clusters. Each SNP was also assessed for known copy-number variation using the Database of Genomic Variants [26], since copy-number variants could also cause changes in no-call frequencies and genotype distributions that may be related to disease. (See Additional file 1 for further discussion of copy number variation.) Note that somatic deletion would generate genotypes like B in some cells, but since most cells retain the normal copy number, the effect will be a small perturbation in the cluster plot rather than a separate cluster. Germ-line mutations would not give the same perturbation patterns as somatic conversion. For a germ-line mutation that changed one allele to another, the individual would appear as part of another cluster in the corresponding cluster plot. If a germ-line mutation deleted or duplicated an allele, then the individual would appear as part of a cluster with a nonstandard copy number. If this deletion/duplication was common, then the cluster plot would show features typical of CNV loci, such as the presence of more than three clusters.
A paralogous sequence variant occurs when the homologous sequence to the mapped SNP sequence possesses a polymorphism. Suppose an SNP has probes for alleles A and B. If the paralogous sequence also has an A/B polymorphism, then the cluster plot will have five clusters, corresponding to AAAA, AAAB, AABB, ABBB, and BBBB. If the paralogous sequence has an A/C polymorphism, then the probes will not detect the signal from the C allele, and there will be clusters for AA, AB, BB, AAA, AAB, ABB, AAAA, AAAB, AABB. In either case, the cluster plot will differ significantly from what is expected under a gene conversion hypothesis.
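The cluster lists above can be reproduced by enumeration. The sketch below is illustrative only: it assumes two diploid copies (the mapped SNP and one paralogous site) and assumes that the probes detect only the A and B alleles, so any other allele contributes no signal. The function name is hypothetical.

```python
from itertools import combinations_with_replacement


def visible_clusters(snp_alleles, paralog_alleles, probed=("A", "B")):
    """Distinct probe-visible genotypes when a diploid SNP locus and a
    diploid paralogous locus both hybridize to the same probes."""
    clusters = set()
    for snp in combinations_with_replacement(snp_alleles, 2):
        for paralog in combinations_with_replacement(paralog_alleles, 2):
            # Alleles that no probe matches (for example C) are invisible.
            seen = sorted(a for a in snp + paralog if a in probed)
            clusters.add("".join(seen))
    return sorted(clusters, key=lambda s: (len(s), s))


print(visible_clusters("AB", "AB"))  # paralog with an A/B polymorphism
print(visible_clusters("AB", "AC"))  # paralog with an A/C polymorphism
```

The first call yields the five clusters from AAAA to BBBB, and the second yields the nine clusters listed for the A/C case.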
Some polymorphisms on the microarray platform may have been misidentified, with the true polymorphism being in paralogous sequence with no polymorphism at the mapped SNP locus. As long as the paralogous sequence is part of a larger region of homology with the mapped SNP locus, the outcome of the gene conversion analysis will be unchanged by such phenomena because both duplicons are examined.
A foundational somatic mutation could occur during early development, leading to a lineage of cells within the individual carrying the mutation. This kind of mutation will not be identified by the present analysis unless the blood cells being genotyped come from more than one such lineage. Even then, the relevance of a foundational mutation to disease would be unclear because the mutation would also have to have been in a lineage ancestral to the diseased tissue.