PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
 
J Comput Biol. Apr 2009; 16(4): 565–577.
PMCID: PMC3148127
Statistical Comparison Framework and Visualization Scheme for Ranking-Based Algorithms in High-Throughput Genome-Wide Studies
Waibhav D. Tembe,corresponding author John V. Pearson, Nils Homer, James Lowey, Edward Suh, and David W. Craig
Translational Genomics Research Institute (TGen), Phoenix, Arizona.
corresponding authorCorresponding author.
Address reprint requests to: Dr. Waibhav D. Tembe, Translational Genomics Research Institute (TGen), 445 North 5th Street, Suite 600, Phoenix, AZ 85004. E-mail:wtembe/at/tgen.org
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidates lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomics hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
Key words: genome-wide association studies, candidate lists
The principal goal of analysis of differential expression of genes on expression microarrays, measurement of allelic frequency differences using genotyping arrays, or quantification of copy number changes using comparative genomic hybridization (CGH) arrays is to generate a prioritized or ranked list of genes, single nucleotide polymorphisms (SNPs), or aberrations. The priority score assigned by the analysis method reflects the likelihood of a candidate being associated with the disease/phenotype being investigated. To this end, there is a plethora of methods to analyze high-throughput data generated in such studies, and newer methods based on mathematical, image processing, and statistical techniques continue to be an active area of research in computational biology and statistical genetics. However, experimental data frequently has some biases, such as unknown biases in data acquisition process, and different analysis methods have varying sensitivity to these artifacts.
The identification of ranked list of candidates is often the first step in generating hypotheses for future investigation, typically obtained by testing thousands to millions of hypotheses in order to identify the most significant ones. For example, differential gene expression analyses output a list of genes based on the statistical significance of the observed difference in their expression at thousands of genes (>10,000 genes are typical) between sample groups, such as cases and controls. In genome-wide association (GWA) studies, genotyping data is analyzed at hundreds of thousands SNPs to generate a list of candidate SNPs that are most likely to have significant association with the disease. Indeed, one is often only validating the top few hundred SNPs from the large number of SNPs tested. Many variations and types of analyses are continually being developed that leverage specific study-design frameworks, and each framework may come with its own experimental framework. For example, DNA pooling-based GWA approaches (Bansal et al., 2002; Craig et al., 2005; Sham et al., 2002), a low-cost alternative to individual genotyping, use statistical analysis to select a few hundred SNPs for fine mapping from hundreds of thousands of SNPs. Similarly, CGH and differential gene expression studies identify ranked list of candidate genetic loci, and a combination of these two independently generated lists could be used to select consensus genetic loci for fine mapping. The question, then, is which analysis method, filter, or specified constants within a method should be used for a particular study. Often these are arbitrary choices based on anecdotal previous experience.
1.1. Motivation for comparison and visualization of ranked lists
As an example, we consider a case-control GWA study carried out on a high-density microarray genotyping >500,000 SNPs per microarray. The specific analysis methods chosen at various stages, such as processing of raw images from scanners, genotype calling to generate allelic frequency data, and statistical analysis to identify the most significantly associated genetic loci, will generate different lists of prioritized candidate SNPs. The candidates are further screened by experts, possibly with inputs from study specific literature survey, for selecting fine mapping regions with the ultimate goal of identifying the causal genetic loci. With appropriate wet-lab and technical details, an analogous description would apply to a differential gene expression or CGH study. Typically, a GWA study may utilize a few hundred to a few thousand arrays to be appropriately powered, resulting in a cost often exceeding a million dollars. Such a cost has led to the development of alternative designs using pooling-based approaches whereby only a few arrays are needed. This design, termed pooling-based GWA study, represents an alternative approach that requires development of novel analysis algorithms. One relevant question is “what are the best approaches for analyzing data whereby samples have been physically pooled and assayed on just a few arrays?”
Little help is available to theoretically select a method from several available analysis methods. To our knowledge, there are no direct approaches available for comparing candidate lists generated from different algorithms. Given the high-throughput nature of genotyping arrays that examine hundreds of thousands of SNPs (or, more broadly, hypotheses), or expression arrays simultaneously studying tens of thousands of genes, a manual inspection of such results is clearly impractical. In addition, our experience in genome-wide association studies indicates that customizable visualization capabilities for genome-wide data are lagging behind their analysis counterparts. Advances in miniaturization and automation coupled with cost reductions have made high-density microarrays a preferred choice for many genome-wide studies, and it is important to give some consideration to development of intuitive visualization tools that highlight the information in high-throughput data.
1.2. Goals of this study
In this paper we present:
  • I. A theoretical framework that could be used in the following scenarios:
    •  If the true ranked candidate list is known from prior knowledge, how to compare results from different methods with the given true list and possibly assign a theoretical p-value? This will help in picking the best method from several methods, or to validate a new method, although study-specific factors may limit the general applicability of the conclusion.
    •  How to quantify the consensus between ranked lists from independent algorithms? Is it possible to assign a p-value to a set of common candidates?
    •  The reproducibility of GWA studies is adversely affected by factors such as the study design variability and instrumentation errors. If two identical experimental designs generate overlapping but non-identical candidate sets, to what extent did the replicated experiments generate the same results?
    •  In many applications, once prioritized lists are obtained, only the “top k” candidates are used further for downstream analysis. Instead of comparing the entire lists, how can one compare two or more methods just based on the “top k” candidates? A key feature of our analysis framework is that it not only takes into account the number of common candidates in the “top k,” but also scores based on the consistency in the order in which candidates are placed.
  • II. A novel customizable visualization method for genome-wide data: We propose a new graphical visualization scheme that represents genomic location as a circular rather than linear map, can display scores or ranks, and has a number of display modes, such as dot, lines, and filled polygon. The implementation is in object-oriented R programming language.
Although the formal mathematical framework proposed in this paper is very general, we present our results in the context of an example case-control association study using pooled DNA of HapMap CEPH population samples (HapMap, 2005) genotyped on high-density (>500,000 SNPs per array) Affymetrix microarrays. In contrast to individual genotyping, pooling physically mixes DNA from case samples into one cohort and controls samples in another cohort. Under the assumption that the abundance of alleles, generally referred to as A and B, is directly proportional to the allele-specific image intensity values obtained from genotyping microarrays, the proportion of alleles could be estimated. Several different analysis methods are available to prioritize SNPs based on the likelihood of discriminating between cases and controls computed from their allelic frequency differences. These lists are the input prioritized candidate lists for our analysis. For the CEPH samples used in our study, the true list was computed using the test of two proportions on known genotyping data. The z-score for the test of proportions is computed using the following formula:
equation M1
The subscripts 1 and 2, respectively, correspond to the two pools, n denotes the number of chromosome pairs, and A, B generically refer to the frequency of the respective alleles. SNPs with minor allele frequency (MAF) of <0.05 were omitted. Since z-scores are used for ranking, the actual significance values were not computed as it will not affect the order.
For clarity, extensive mathematical description of various data transformation techniques and analysis methods has been described in Appendix A. Briefly, cases and controls genotyping data is considered as a two-class data and the separation between the two classes is measured using various statistical and cluster analysis methods (Brun et al., 2007) operating on mathematically transformed microarray image data measuring the abundance of alleles for each SNP. Results were generated using GenePool software (GenePool, 2006; Pearson et al., 2007; Tembe et al., 2007), and we note here that similar approach could be used on candidate lists generated by other pooling-based analysis tools (Yang et al., 2008). The methods in GenePool include both well-known and application specific approaches such as the Silhouette (Lovmar et al., 2005), t-test, centroid distance, and Dunn index.
This specific example study highlights several issues in comparison of ranked lists: High-throughput genotyping data from hundreds of thousands of SNPs on a single array generates very large ranked lists that truly represent the full scale of the problem. It also demonstrates challenges in effective genome-wide visualization. In addition, the availability of known genotypes from the HapMap allows a non-disease and non-platform specific case-controls association study wherein the true list of candidate SNPs can be easily computed. Thus, comparison of different methods against each other and comparison with true results is possible.
Mathematically, a ranked list of N candidates is a permutation of numbers {1, 2, 3,…, N}, and comparing rankings from different algorithms corresponds to measuring the degree of agreement between a pair of permutations. We adopt Kendall's tau and Spearman's Footrule coefficient (Kendall, 1938, 1948), which are two of the well-known approaches to compare two ranked lists. For clarity, the formal definitions are described in Appendix B which also describes how these two metrics have the following desirable features:
  •  When the lists are identical, coefficients are close to zero.
  •  When the lists have few common elements, or they are disjoint, the coefficients take higher values.
  •  When ordering of elements is different, the coefficients take higher values.
Intuitively, this is similar to measuring the distance between two lists based on the number of common members and their relative order in the respective lists.2 Several variations of these two formulae have been proposed in the literature, and we follow the general approach detailed in Fagin et al. (2003). These methods have been used in information retrieval applications, such as the comparison of results generated by web-search engines, comparison of data lists that get modified frequently, etc., and details could be found in Dwork et al. (2001) and Fagin et al. (2003). The intent here is to adapt Kendall's and Spearman's coefficients to genome-wide studies.
The classical Kendall's Tau and Spearman's Footrule coefficients are defined over two permutations of the same list. In the context GWA studies on high-density genotyping microarrays, this amounts to processing lists containing >500,000 SNPs. However, GWA studies are often designed for identification of the “top k” best SNPs. Literature (Pearson et al., 2007) and our prior experience in this field indicate that the typical measurement errors in the estimation of allelic frequencies are around 1–2% and that the allelic frequency differences are negligible for SNPs beyond the top few thousand SNPs. Thus, the reliability of rankings and ordering diminishes as we move down the list. Since only “top k” candidates are retained, we employ a more practical approach that compares two ranked lists based on “top k” elements. In other words, instead of comparing complete lists, we compare partial lists which is a more appropriate and general approach.3 To that end, we follow the modified Kendall's Tau and Spearman's Footrule proposed in Fagin et al. (2003). Given two partial ranked lists, we take into account the number of elements common to both the lists, those in only one of the lists, and the order in which they appear. For the sake of completeness, we also include in our analysis the intersection coefficient between the lists, which is just the number of SNPs common in two lists divided by the size of the list (see Appendix B). Thus, we have three different methods for comparing lists and, in general, we recommend using Kendall's and Spearman's than the intersection method because the intersection method does not take into account the ordering. We leave the comparison between the Kendall's and Spearman's method as an open question that is outside the scope of this study.
3.1. Comparing a method with the true results
For six representative algorithms (described in Appendix A) available for prioritizing SNPs using GenePool software, Figure 1 shows the Kendall's, Spearman's, and Intersection coefficient compared with the true rankings for “top k” SNPs. The true ranks were computed using the test of two proportions on the known genotyping data from HapMap. k is varied from 1000 to 25,000, in steps of 1000. More precisely, for each plot, the Y coordinate shows the value of the coefficient for the corresponding “top k” SNPs on the X axis. We restrict the analysis to the top 25,000 SNPs because it represents the top 5%, a typical number selected in GWA studies. (In fact, due to practical limitations and cost issues, many researchers can examine only top few hundred SNPs.) As explained in Appendix B, values near one for Kendall's and Spearman's coefficient indicate that the lists are highly different while those near zero indicate that the lists are similar. In Figure 1, it is observed that the Kendall's and Spearman's coefficients decrease monotonically with increasing k. Increased similarity with higher values of k implies that there are more common SNPs and the relative ordering of the SNPs between the two lists has a higher degree of agreement.
Fig. 1.
Fig. 1.
Kendall, Spearman, and intersection coefficients between GenePool-generated top candidate SNPs lists and the true list for CEPH population in HapMap (a–f ). The x-axis is the number “top k” SNPs considered. For Kendall (more ...)
As an example, consider comparing the methods in Figure 1d,e based on Kendall's coefficient, with the constraint of selecting only the top 5,000 candidate SNPs. Under this scenario, the method in Figure 1e should be preferred because it has higher similarity with the true results (lower Kendall's coefficient) than does the method in Figure 1d. For k > 20,000, all six plots show similar Kendall's coefficients indicating that all methods are equally accurate as compared to the case k = 5,000 where significant differences exist.
Figure 1 itself does not indicate how statistically significant these results are. To assign a statistical significance to these coefficients, Monte Carlo simulations were carried out because generating a true distribution from every possible permutation of 500,000 numbers is not practicable. We generated a very large number (>107) of pairs of random permutations of the set {1, 2,… , N}, N = number of SNPs ≈500,000. For each pair of permutation, the Kendall, Spearman, and intersection coefficient were computed for various top k SNPs, with k from 1000 to 25,000, in steps of 1000, to be consistent with the top k SNPs shown on the X axis. The p-values were extremely small (<10−5), practically equal to zero, for k values greater than 5,000, which implies that there is a statistically significant degree of agreement between the true ranks and those from methods used in GenePool. It should be noted that no individual genotyping information was available to GenePool; it only analyzes SNPs using pooled DNA. This leads us to the conclusion that the rankings provided by GenePool closely reflect our prior knowledge based on allelic frequency differences of known genotypes, and the methods could be used generally on any pooled DNA with no individual genotyping information available whatsoever.
3.2. Comparing two different methods
Generalizing the comparison between a true list and a candidate list described in the previous section, pair-wise comparison of different candidates could be useful in the following ways:
  •  When true results are unknown, independent methods could be used to analyze the same data in different ways. If a majority of the methods point to the same best-performing candidates, we have high confidence in the results.
  •  Even if the same exact same set of SNPs does not come out on top in different lists, the general agreement/disagreement between the methods can be used to identify high confidence SNPs.
  •  If the comparison of results from two methods shows that they are effectively generating the same set of results, we can discontinue the use of the more computationally expensive method.
The procedure of comparing candidate ranked lists from two different methods is similar to the comparison between a true list and a candidate list generated by a single method. Kendall, Spearman, and Intersection coefficients are computed analytically, and Monte Carlo simulations are carried out. Figure 2 shows pair-wise comparisons of five methods (hence five choose 2 = 10 plots in each sub-figure) for each of the coefficients with increasing values of k. For smaller values of k (<1000), the intersection coefficient is very small and the other two are very high. This indicates that the methods did not have a huge degree of agreement for extremely smaller values of k. However, with increasing k, the degree of agreement increased significantly. From the Monte Carlo simulations, the results were found to be statistically significant with p-values almost zero for k  20,000.
Fig. 2.
Fig. 2.
Pair-wise exhaustive comparison of five methods. The x-axis shows selected number of “top k” SNPs. The y-axis shows the respective coefficient values. Legend indicates the clustering algorithm and data transformation. For example, Silh-AkB (more ...)
Figure 2 could be used to draw various inferences. For example, pair-wise comparison of three methods (Ttest-Silho, Silho-Dunn, Dunn-Ttest) shows that the intersection coefficient is quite high and Kendall/Spearman coefficients are quite low. Very low p-value indicates that this is statistically significant and not just a random event. Thus, these three methods generate a similar result, which is also clear from the high similarity between the plots in Figure 1c, e, f. To avoid redundancy in the analysis, just one of these three methods could be employed. Similarly, the methods CoDir and MdfdT generate significantly different lists. Thus, if the same data had to be analyzed using independent methods, CoDir and MdfdT should be included. If the same set of SNPs is observed in the “top k” of these independent methods, we have much more confidence in the results than the case when the SNPs are selected by similar methods, such as the Silhouette, Dunn, and Ttest.
We have developed a novel visualization method for genome-wide association data that we are calling an Inferno plot. Figure 3 shows Inferno plots for two methods, centroid distance and T-test, used in the CEPH population study. Inferno plots are circular rather than linear. This has the effect of increasing detail—since the genome or genomic region is represented by a circle, it is π (pi) times longer because the circle's diameter is effectively the X-axis of an equivalent linear plot. The default display mode is to plot each score as a line from the origin of the plot circle towards the outer edge of the plot circle. The length of the line is taken from the score itself. Thus, larger scores have longer lines and the location of the line is determined by the score locations. Line lengths (scores) are scaled based on the largest score in the dataset or a supplied maximum score. Locations are scaled by considering that the largest location will be shown on the plot circle at 360 degrees and that all other locations will be converted into degrees accordingly.
Fig. 3.
Fig. 3.
Four inferno plots are shown to visualize the results of the genome-wide study carried out on Affymetrix 500 K genotyping microarrays. (Top left) Plot shows the locations of the top 5,000 SNPs using the centroid method. The ring of alternating (more ...)
An Inferno plot is agnostic as to what is displayed, and it can just as easily display a whole genome, a chromosome, or a genomic region. Multiple sets of scores can easily be shown on the same plot by using different colors or by orienting some scores outwards from the plot circle origin and others inwards from the plot periphery towards the plot origin. The scaling of each set of score and the overall plot scale can be manipulated so that scores do not obscure each other. In addition to plotting alternative dark and light regions to represent chromosomes on a whole genome plot, there are many other potential uses, such as plotting regions implicated in a disease by linkage or association, the location of known causative disease genes, and regions of repetitive sequence.
With the advances in high-throughput data generation capabilities in the field of genomics, an exponentially growing number of analysis algorithms are being proposed to extract biologically significant information using analytical techniques. Therefore, computational biology is presented with new challenges of comparing different methods with each other and theoretically selecting the most appropriate study specific method. To that end, this study proposes within the context of genome-wide studies a formal mathematical framework for comparing and quantifying the consensus between the candidate solutions obtained from different algorithms operating on the same data. The mathematical principles used in the analysis of DNA-pooling-based genome-wide study presented as an example could be extended to a wide variety of problems, such as the candidate genes from expression studies. Moreover, the framework readily lends statistical significance calculations for each method and also allows analysis under the practical constraints of selecting only a fraction of the high-scoring candidates. The proposed framework validates the cluster analysis approaches employed in DNA-pooling based association studies, and it also provides guidance about choosing a specific method or a group of methods to extract maximum information from different analyses of the same data. The novel visualization scheme provides for a convenient graphical rendering of different candidate lists to visually identify significant candidate genetic loci obtained from competing analysis approaches. The statistical and visualization schemed proposed in this paper will be valuable to life scientists interested in theoretically selecting the most appropriate analysis approach for specific studies.
 
6. Appendix A
Description of cluster analysis algorithms for pooling-based association studies
From the standpoint of data analysis, DNA pooling-based GWA studies consider cases and control cohorts as two distinct groups. SNPs that have statistically significant allelic frequency differences between cases and controls are marked as candidate SNPs to select genomic loci for fine-mapping. GenePool (GenePool, 2006), a software tool specifically developed to facilitate the downstream analysis of pooling-based studies, takes the cluster analysis approach in that cases and controls are considered two separate clusters and SNPs whose allelic frequency differences maximize the separation between the clusters are selected. Figure 4 shows this concept for Affymetrix genotyping arrays. We have omitted here a discussion of platform-specific details, such as the concept of quartets in Affymetrix genotyping arrays and Beads in the Illumina arrays. An in-depth discussion can be found in Tembe et al. (2007).
Fig. 4.
Fig. 4.
Relative allele strength (RAS) values for a case-control association study on Affymetrix genotyping microarray using three arrays each for cases and controls. Squares correspond to case cohort and diamonds correspond to control cohort. The x-axis corresponds (more ...)
In the case of individual genotyping, calling algorithms are used to ascertain homozygous (AA or BB) or heterozygous (AB or BA) SNPs. In case of pooling, it is also possible to approximate the relative allele frequency of alleles for a SNP using the proportional abundance of A and B alleles. This is the key concept in pooling; the relative ratio of the A and B alleles correlates to the percent distribution of an allele within a pool. The question, then, is how best to transform the data in a manner that has an intuitive meaning. Various formulae are employed for this data transformation, but the concept of relative allele signal (RAS) is perhaps the most intuitive one that still retains correlation to the allele frequency within the pool. Here, RAS = A/(A + B), i.e., the ratio of signal arising from A allele to the total signal. RAS values close to 1 indicate A allele homozygosity (AA case), values close to 0 indicate B allele homozygosity (BB case), and intermediate values between 0 and 1 are obtained depending on the relative abundance of A and B alleles for the SNP. Alternate approaches to compute RAS values include using a k-correction factor to account for SNP-specific uneven amplification and/or hybridization. In that case, the modified formula will be kRASi = A/(A + kiB), where RASi is the predicted allelic frequency and ki is a SNP-dependent correction factor for the ith SNP. Similarly, arctan(B/A) could also be used. In the results presented in the study, we have used these three RAS value transformations along with several different algorithms for cluster analysis as described next.
Mathematically, given a data set consisting of N features (SNPs) and their values (quantified allelic frequency differences) for two classes C0 (controls) and C1 (cases), the goal is to analytically identify features that provide the best discrimination between cases and controls. Let X(i,j), i = 1,… n0,…,(n0 + n1 = N);j = 1, …,P be a two-dimensional data matrix representing the allelic frequencies of a single SNP genotyped on n0 controls and n1 cases replicates. For Affymetrix, the dimensions 1,…, P correspond to quartets. For Illumina, P = 1, i.e., each data point represents the allelic frequency for an individual bead. Let Ci, i[set membership] {0, 1}, denote, respectively, control and case classes. Let d(i, j) denote the pair-wise difference between two data points. Let D(C0, C1) denote the separation between the two classes computed from d(i, j) using some analysis method. Then, D(C0, C1) effectively provides a basis to rank SNPs based on their allelic frequency differences between cases and controls.
The separation between two classes can be quantified using several methods, such as the Silhouette distance (Lovmar et al., 2005), and SNPs are ranked in descending order of the separation. The higher the separation, the more likely a SNPs is associated with the phenotype. In addition, several different mathematical formulae, such as the Euclidean Distance and the Manhattan Distance, could be used to compute pair-wise difference between data points. Within the GenePool framework, we have implemented several cluster analysis methods and mathematical details can be found in the appendices. Thus, a combination of various approaches to compute RAS values and cluster analysis methods provides several possibilities to analyze the data. Each combination will examine the data in a different way and would potentially generate different lists for prioritized SNPs.
Pair-wise distance between two vectors
Let X and Y be two vectors each with n features. Let Xi and Yi denote the value of the ith feature. Then, the Euclidean distance (Euc), the Manhattan Distance (Man), and the Modified Manhattan Distance (Mod) are defined, respectively, as follows:
equation M2
In general, we denote the pair-wise distance by d(X,Y), where d could be substituted by the appropriate method above.
Relative allele strength
Let A and B denote the image intensity values obtained by interrogating for two possible alleles for a bi-allelic SNP. The relative allele strength calculated using different formulae indicates the homozygous or heterozygous nature of the SNP as shown in Table 1.
Table 1.
Various Formulae for Computing the Relative Allele Strength (RAS) Values
FormulaDenoted byValue for AA homozygousValue for AB heterozygousValue for BB homozygous
equation M3AAB10.50
equation M4AkB, where k = correction factor (Craig et al., 2005; Hoogendoorn et al., 2000)1Depends on kDepends on k
equation M5ATn0equation M6equation M7
The letters A and B generically refer to the two alleles, while A and B in mathematical formulae refer to the microarray image intensity values corresponding to the alleles.
Cluster analysis methods
Let Ci, i[set membership]{0, 1}, denote, respectively, control (C0) and case classes (C1) containing n0 and n1 data points. Let d(i, j) denote the pair-wise difference between two n-dimensional data points Xi and Xj. Let D(C0, C1) denote the separation between the two classes. D(C0, C1) can be computed from d(i, j) using various analysis methods described next.
Silhouette statistic (Rousseeuw, 1987)
The Silhouette Index (Sil) for a point is a measure of how similar that point is to points in its own class compared to points in other classes, and ranges from −1 to +1. It is defined as
equation M8
where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to points in another cluster k. In the case-control study, there are only two clusters, and therefore the formula reduces to
equation M9
where a(i) is the average distance from the ith point to the other points in its cluster, and b(i) is the average distance from the ith point to points in the other cluster.
For a given SNP, the silhouette value is the average of all silhouette values in cases and controls value. Values closer to +1 indicate higher separation between cases and controls and the SNPs is more likely to be associated with the phenotype being studied.
Centroid distance
Centroid (Centr) distance is pair-wise distance between the centroids of two classes.
equation M10
Dunn index (Azuaje, 2002)
The Dunn Index (DunIn) is defined as the ratio between the minimum distance between two clusters and the size of the largest cluster.
equation M11
Consistency score
Consistency score measures the extent to which the difference in the magnitude between corresponding dimensions of two vectors is in the same direction (positive or negative).
equation M12
If direction of the consistency score is to be maintained, the following formula is used.
equation M13
 
7. Appendix B
Kendall, Spearman, and intersection coefficients
Let τi denote a permutation of k > 0 distinct elements from the set R = {1, 2,…, N}, such that τi(j) denotes the rank of an element j[set membership]R. When k < N, τi is called a partial list. For simplicity, we assume that each j[set membership]R has a unique rank, i.e., τi(x)≠τi(y)∀ x,y[set membership] R. Let i}[subset, dbl equals]R denote the set containing all elements in τi, such that |i }| = k. For two such lists τ1 and τ2 defined over R, let Z = 1}∩ {τ2}, S = 1}\{τ2}, and T = 2}\{τ1}, where \ is the set difference operator. Let |Z| = z, |S| = s, and |T| = t.
Let σm,n1, τ2) = 0 if the elements m, n[set membership]R have the same order in both τ1 and τ2, i.e., if the differences τ1(m)−τ1(n) and τ2(m)−τ2(n) have the same sign. Otherwise, σm,n1, τ2) = 1. If σm,n1, τ2) = 1, the pair (m, n) denotes an inversion with respect to the lists τ1 and τ2. The maximum number of inversion between τ1 and τ2 will be z(z − 1)/2. Adopting the approach proposed in Fagin et al. (2003), we define the following modified coefficients to compare two ranked lists containing top k candidates.
Kendall's modified rank coefficient
Kendall equation M14
  •  When the lists are identical, k = z, S = T = {ϕ}, and there are no inversions.
    Thus, equation M15 and Kendall(τ1, τ2) = 0.
  •  When the lists are exact reverses of each other. k = z, S = T = {ϕ}, and
    equation M16. Therefore, Kendall equation M17
  •  When the lists are disjoint, z = 0,
equation M18
Therefore, Kendall equation M19
Spearman's modified footrule coefficient
Spearman equation M20.
  •  When the lists are identical, k = z, S = T = {ϕ}, and equation M21.
    Therefore, Spearman (τ1, τ2) = 0.
  •  When the lists are exact reverses of each other, k = z, S = T = {ϕ}, and
    equation M22
    (Proof omitted.)
    Therefore, equation M23
  •  When the lists are disjoint, z = 0,
    equation M24
    Therefore, Spearman equation M25.
For convenience, we normalize the Kendall's and Spearman's modified coefficients by factors k2 and k(k + 1) respectively to map then into [0, 1]. Thus, values close to zero indicate a high degree of agreement and values close to one indicate that the top-k lists are significantly different.
Intersection coefficient
The intersection coefficient for two equal sized lists τ1 and τ2 is simply the number of common elements normalized by their size.
equation M26
The intersection coefficient ignores the pair-wise ordering. So it will be 1, 1, and 0 for lists that are respectively identical, reversed, and disjoint.
Footnotes
1A natural extension of comparing more than two lists to generate a consensus candidates list is known as Rank Aggregation Problem (Fagin et al., 2003), and it is beyond the scope of this paper.
2Kendall's and Spearman's coefficients in this paper are different from correlation coefficient. We are comparing ranked lists and not measuring correlation between two variables.
3Comparison of two complete lists is a special case with k = N.
Acknowledgments
We thank the High-Performance Bio-Computing Center of TGen for providing the high-performance cluster computing resources for this study. Funding for DWC and JVP was provided by the National Institutes of Health (ENDGAME grant U01-HL086528-01). WDT and DWC were also funded by the Stardust Foundation.
Disclosure Statement
No competing financial interests exist.
  • Azuaje F. A cluster validity framework for genome expression data. Bioinformatics. 2002;18:319–320. [PubMed]
  • Bansal A. van den Boom D. Kammerer S., et al. Association testing by DNA pooling: an effective initial screen. Proc. Natl. Acad. Sci. USA. 2002;99:16871–16874. [PubMed]
  • Brun M. Sima C. Hua J., et al. Model-based evaluation of clustering validation measures. Pattern Recogn. 2007;40:807–824.
  • Craig D.W. Huentelman M.J. Hu-Lince D., et al. Identification of disease causing loci using an array-based genotyping approach on pooled DNA. BMC Genom. 2005;6:138. [PMC free article] [PubMed]
  • Dwork C. Kumar R. Naor M., et al. Rank aggregation methods for the web. WWW. 2001:613–622.
  • Fagin R. Kumar R. Sivakumar D. Comparing top k lists. SIAM J. Discrete Math. 2003. pp. 134–160.
  • GenePool. GenePool software. 2006. http://genepool.tgen.org. [Feb 1;2009 ]. http://genepool.tgen.org
  • HapMap. A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
  • Hoogendoorn B. Norton N. Kirov G., et al. Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum. Genet. 2000;107:488–493. [PubMed]
  • Kendall M. A new measure of rank correlation. Biometrica. 1938;30:81–89.
  • Kendall M. Rank Correlation Methods. Charles Griffin & Co. Ltd.; London: 1948.
  • Lovmar L. Ahlford A. Jonsson M., et al. Silhouette scores for assessment of SNP genotype clusters. BMC Genom. 2005;6:35. [PMC free article] [PubMed]
  • Pearson J.V. Huentelman M.J. Halperin R.F., et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am. J. Hum. Genet. 2007;80:126–139. [PubMed]
  • R Programming Language. 2008. www.r-project.org. [Feb 1;2009 ]. www.r-project.org
  • Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20:53–65.
  • Sham P. Bader J.S. Craig I., et al. DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. 2002;3:862–871. [PubMed]
  • Tembe W.D. Homer N. Mitropanopoulos S., et al. Analysis software for high-density pooled genotyping data. Proc. Int. Conf. Bioinform. Comput. Biol. 2007:188–194.
  • Yang H.C. Huang M.C. Li L.H., et al. MPDA: microarray pooled DNA analyzer. BMC Bioinform. 2008;9:196. [PMC free article] [PubMed]
Articles from Journal of Computational Biology are provided here courtesy of
Mary Ann Liebert, Inc.