|Home | About | Journals | Submit | Contact Us | Français|
Many statistical analyses of genetic data rely on the assumption of independence among samples. Consequently, relatedness is either modeled in the analysis or samples are removed to “clean” the data of any pairwise relatedness above a tolerated threshold. Current methods do not maximize the number of unrelated individuals retained for further analysis, and this is a needless loss of resources. We report a novel application of graph theory that identifies the maximum set of unrelated samples in any dataset given a user-defined threshold of relatedness as well as all networks of related samples. We have implemented this method into an open source program called Pedigree Reconstruction and Identification of a Maximum Unrelated Set, PRIMUS. We show that PRIMUS outperforms the three existing methods, allowing researchers to retain up to 50% more unrelated samples. A unique strength of PRIMUS is its ability to weight the maximum clique selection using additional criteria (e.g. affected status and data missingness). PRIMUS is a permanent solution to identifying the maximum number of unrelated samples for a genetic analysis.
Interrelatedness can be a confounding factor in many statistical analyses, including burden tests in sequence data, association studies [Devlin and Roeder, 1999; Voight and Pritchard, 2005], genome-wide estimates of identity by descent (IBD) [Sun and Dimitromanolakis, 2012], and principle component analyses [Patterson et al., 2006]. Unless modeled into the statistical analysis [Kang et al., 2010; Thornton and McPeek, 2010], interrelatedness must be removed from the data before proceeding with genetic analyses. Given the expense of DNA ascertainment, clinical phenotyping, sequencing and/or genotyping, and data analysis, maximizing the number of unrelated samples utilized in such analyses should be a priority.
Estimates of pairwise IBD, a quantitative measure of relatedness, can reliably detect relatives as distant as first cousins [Huff et al., 2011]. Over the years, multiple strategies to detect IBD have been developed [Browning and Browning, 2011; Browning and Browning, 2010; Han and Abney, 2011; Huff et al., 2011; Kong et al., 2008; Manichaikul et al., 2010; Purcell et al., 2007], and new methods are emerging that use IBD estimates to confidently detect more distant relatives (up to third cousins) [Browning and Browning, 2010; Huff et al., 2011]. With good IBD estimates, relatedness structures that violate the assumption of sample independence can be identified and removed from the dataset through sample pruning.
We have identified three publicly available methods to produce a set of unrelated individuals given a threshold of tolerated pairwise IBD. The documentation for PLINK [Purcell et al., 2007] (see Web Resources) suggests a method to remove pairwise relatedness by iteratively removing one member of each pair until no pairs remain (Figure 1A). Pemberton et al.  suggest generating networks of relatedness in which samples are nodes and pairwise relationships are edges. Relatedness networks are then broken by iteratively removing the most highly connected node, until no edges remain in the dataset (Figure 1B). Finally, the authors of KING [Manichaikul, et al., 2010] describe how they generate a set of unrelated individuals in a recent paper [Manichaikul et al., 2012]. They first add the person who is related to the fewest other people in the dataset and then proceed to add the individual who is related to the next fewest people in the dataset, as long as the individual to be added is not related to anyone already in the set of unrelated individuals (Figure 1C). However, none of these approaches maximize the number of retained unrelated samples or selectively retain the most informative samples.
In order to test software package called Pedigree Reconstruction and Identification of a Maximum Unrelated Set (PRIMUS) and compare it to other methods, we programed each of the three methods as described (Figure 1). This was required because neither method described in PLINK Pemberton is available in a software package, and the KING program does not allow for the input of user-defined IBD estimates. Rather, KING calculates its own IBD estimates from input genotype data.
We present a method adapted from graph theory that always identifies the maximum set of unrelated individuals in any dataset, and allows weighting parameters to be utilized in unrelated sample selection (Figure 1D). We implemented this method in a new software package PRIMUS, and it is available online (see Web Resources).
PRIMUS reads in user-generated IBD estimates and outputs the maximum possible set of unrelated individuals, given a user-defined threshold of relatedness. PRIMUS converts the IBD relationship file to an undirected graph in which nodes represent individuals and edges represent pairwise relationships; each connected component represents a “family network” or pedigree. PRIMUS writes out each family network to a .dot file to be viewed in graph visualization software such as GraphViz (see Web Resources) to generate images of the family networks (Figure 2).
All individuals within each family network of the data are unrelated to any individual in a different family network (at the user-specified threshold). Thus, the problem of identifying the maximally sized unrelated set is reduced to finding the maximum unrelated set within each family network and then combining the unrelated sets of each family network to get the maximum unrelated set of the entire graph/dataset.
In graph theory, the maximum unrelated set is referred to as the maximum independent set; the maximum independent set of a graph is the same as the maximum clique of the complement graph. In the complement of a graph, all missing edges are added and all existing edges of the graph are removed. Here, this is equivalent to forming edges when relationships fall below the user-defined relatedness threshold rather than above it. We then search this complement graph for a maximum clique. A clique is defined as a portion of the graph (subgraph) where each node is connected to every other node in the subgraph. A maximal clique is a clique that is not a subgraph of a larger clique. Finally, a maximum clique is the largest maximal clique.
PRIMUS uses the Bron–Kerbosch algorithm [Bron and Kerbosch, 1973] with improved pivot selection [Cazals and Karande, 2008] to enumerate all maximal cliques of each complement family network. For each family network, PRIMUS picks the maximum clique or the weighted maximum clique to add to the maximum unrelated set of individuals. Finally, it generates a file containing the maximum set of unrelated individuals.
A unique strength of our program is its ability to weight the maximum clique selection using additional criteria. The maximum clique is the clique containing the most samples; however, there are often two or more maximum cliques. Any one of these will produce a maximum unrelated set, and PRIMUS allows for preferential selection of the maximum clique based on additional weighting criteria. In case/control studies this function is particularly useful, because it allows for the retention of the maximum clique with the most affected individuals. Alternatively, the user may wish to select the maximum clique with the lowest missingness rate within the data, or perhaps to first select for affected status and then for lowest missingness. PRIMUS allows specification of as many of these weighting criteria as desired as well as ordering how they are applied in the selection. No other available method for selecting unrelated samples offers weighting functionality.
PRIMUS can also retain the maximum number of unrelated individuals with a desired binary characteristic (e.g. affected status), even if this unrelated set is smaller than the maximum set of unrelated individuals. For example, a study may contain a trio with an affected child and two unaffected parents. The maximum unrelated set would require removing the child and retaining both parents, because the parents are unrelated to each other. It is likely that one would wish to retain the single affected child for further analysis instead of both unaffected parents. As a result, the overall unrelated set size will be smaller, but the set will contain more of the affected samples.
Because none of the PLINK, Pemberton, and KING methods has a weighting algorithm, we implemented one for each. These implementations are available upon request. For PLINK method, we implemented weighting by selecting the individual with the desired trait. For example, to preferentially select affected individuals, the algorithm will keep the affected individual in a case/control related pair. For Pemberton method, we implemented a weighting scheme by choosing to remove the node with the less optimal criteria whenever two nodes are equally connected. For the KING method, we implemented weighting by retaining the more desirable individual whenever two individuals are related to the same number of other individuals.
The Bron–Kerbosch algorithm is impractical to run on large, sparse family networks due to the algorithm’s exponentially increasing computational cost (Figure 3). To remedy this, we implemented an approximation function for networks above a set cut-off size (Table I). PRIMUS’ approximation function takes a similar approach to the Pemberton method [Pemberton et al., 2010] by repeatedly removing the highest degree node from the family network until the network is smaller than the approximation function size cutoff or until it breaks into subnetworks smaller than the cutoff. Once the size of the network or subnetworks is below the approximation function size cutoff, PRIMUS uses the Bron–Kerbosch algorithm to obtain an independent set that is approximately the largest.
To compare the performance of PRIMUS to these methods on all types of family networks, we randomly generated 7,500 simulated family networks of varying sizes and network connectivity, which is a measure of how interconnected the network is. Connectivity can vary widely in family data (Figure 4); some family networks are highly connected (e.g. a father, mother, and 10 offspring), whereas other family networks may be sparsely connected (e.g. a “string” of cousins in which each is related through a unique parent). For each network size (five to 130 by increments of five), we randomly generated 30 simulated networks with the network connectivity proportion ranging from 0.1 (10% of all possible pairwise relationships exist in the network) to 1 (every individual is related to every other individual), and our simulation data are available upon request. For each simulation, we obtained an unrelated set from PRIMUS and the three other methods.
In all 6,540 simulations that did not require the use of the approximation function, PRIMUS produced an unrelated set of size equal to or greater than all other approaches. In our simulations, PRIMUS increased the unrelated set size by more than 50% relative to the PLINK method (Figure 5) and by similar amounts relative to the other selection methods (Supporting Information Figures S1A–C). Although PRIMUS provides the greatest improvement as the network size and connectivity increase, even for sparse; small networks PRIMUS typically provides 5–20% improvement compared to the other method (Supporting Information Figures S1A–C).
Only when PRIMUS’ approximation function was used (960 simulations), the other methods have the potential to outperform PRIMUS. The Pemberton method never outperforms PRIMUS because PRIMUS’ approximation function is very similar to the Pemberton method; the size of the network is above the approximation size thresholds shown in Table I. Table II shows that PRIMUS’ approximation function outperforms the other three methods in more than 98.75% of the simulations. To address the 1.25% of cases, we have incorporated each of the other methods into PRIMUS, such that when it recognizes the need to run the approximation function, it will also run each of the other methods and return the largest unrelated set derived from any of the four methods.
We also compared the performance of each method on weighting for a binary and a quantitative trait. Similar to the maximum unrelated set identification, PRIMUS always identifies the largest set of unrelated affected individuals when the approximation function is not needed. PRIMUS retained up to 75% more affected individuals in the weighted comparisons between PRIMUS and each of the other methods (Supporting Information Figures S2A–C).
Finally, we compared the performance of PRIMUS and the other three methods on data from phase 3 of the Haplotype Mapping Project [Altshuler et al., 2010] and the 1,000 Genomes Project [Consortium, 2010]. For each dataset, PRIMUS obtained the largest set of unrelated individuals (see Supporting Information Table S1). Given our IBD estimates for these reference datasets, the maximum sample set in which no pair of individuals have a coefficient of relatedness () > 0.1 are listed in Supporting Information Table S1 and a link to a list of the sample identification numbers can be found at the PRIMUS website (see Web Resources).
Although PRIMUS will identify the largest unrelated set of samples, as shown in Figure 5, the performance advantage of PRIMUS depends strongly on the amount of connectivity within the families, the size of the families, and clearly, the presence of family data among the samples (all methods do equally well when the samples are unrelated). PRIMUS provides the greatest benefit on large family networks with moderate to high interrelatedness; however, PRIMUS is useful on all varieties of genetic datasets.
We recommend using PRIMUS to obtain unrelated reference sample sets. For example, many researchers use HapMap3 and 1,000 Genomes datasets to impute genotypes, estimate population allele/haplotype frequencies, and run principle component analyses. However, both datasets contain related samples [Pemberton et al., 2010], and if the interrelatedness is not removed, then these imputations and estimates will be inaccurate.
Methods exist to account for pedigree structure or interrelatedness when doing association studies [Kang et al., 2010; Thornton and McPeek, 2010]. However, in the context of large cohort studies, the power gain would be modest because most samples are not related within the last few generations, and the minor power gain may not be worth the computational burden of accounting for the interrelatedness. In such a case, PRIMUS is the best option for removing relatedness.
We also recommend using PRIMUS’ weighted maximum set selection to optimize your unrelated set for a desired characteristic. Specific scenarios include selecting for affected status in a case/control study, selecting for the lowest missingness, or selecting samples in the tails of a distribution in a quantitative trait study.
PRIMUS allows users to specify the level of relatedness in their dataset. Because PRIMUS can take any quantitative measure of relatedness, the selected cutoff should be based on the sensitivity of the tool used to estimate the pairwise relatedness. For example, PLINK is relatively accurate at estimating relationships up to first cousins but less accurate for more distant relationships [Huff et al., 2011]. Therefore, a coefficient of relatedness () cutoff of 0.1 is appropriate. We have found that KING has similar sensitivity as PLINK when estimating pairwise relationships; however, KING uses the kinship coefficient, /2, therefore recommended cutoff for KING IBD estimates is 0.05. Other programs [Browning and Browning, 2010; Huff et al., 2011] are more powerful at accurately detecting more distant relationships, and the user specified cutoff should be adjusted accordingly.
When statistics assumes independence among samples, stripping datasets of relatedness observed in the genetic data are a necessary step in quality control and data cleaning. We have developed an efficient and optimal approach that uses user-generated IBD estimates to quickly provide a maximum set of unrelated samples to retain in further analyses. Despite the importance of retaining the largest sample size possible in genetic analyses, we have only found a single published analysis that utilizes this concept [Heath et al., 2008] to obtain a maximum unrelated set of samples. In addition, our approach provides the option to retain the most informative samples in the resulting dataset (i.e. based on phenotype or data missingness). Furthermore, as a byproduct, PRIMUS reports all connected family networks in the data; knowledge of these networks can then be leveraged to improve the power in some analyses by utilizing this familial information [Ott et ., 2011], or to select the most distantly related affected individuals within each family for exome or whole genome sequencing. Finally, PRIMUS is fast; capable of processing thousands of individuals distributed across hundreds of family networks in minutes or less, making it a practical tool for even the largest and most complex datasets.
We thank Nancy Cox, Josh Smith, Bruce Weir, Timothy Thorton, Qian Yi, Brian Browning, Walter L. Ruzzo, and Peggy D. Robertson for their helpful discussions during the development of this method. We also thank Janet Stanford and Elaine Ostrander for use of the Prostate Cancer Pedigree dataset. J. S. is supported by NHGRI (T32 HG00035), and support for D. N. and J. E. B. through the NHLBI (HL102926) and NHGRI (HG006493).
The authors have no conflict of interest.
PLINK 1.07, http://pngu.mgh.harvard.edu/purcell/plink/
Maximum unrelated set of individuals for the publically available datasets, http://sourceforge.net/projects/primus-beta/datasets/
Supporting Information is available in the online issue at wileyonlinelibrary.com.