When faced with the challenge of studying 496 hereditary prostate cancer families and a total of 5,247 individuals, we sought a publicly available database management system capable of handling the unique challenges that accompany a large-scale, multi-center genetic linkage study of a complex trait. Although data management systems have been developed [8], none could securely and efficiently handle a very large amount of data while also providing additional features to facilitate quality control and analysis of the data generated. Therefore, we developed GeneLink, a database with unique features, to address these needs.
We designed GeneLink to use a Sybase database backend to take advantage of Sybase's ability to process large amounts of data. Currently, GeneLink is the only publicly available freeware database capable of efficiently storing millions of genotypes. The need for efficient data management will grow in importance as researchers explore genome-wide SNP association studies that may generate close to one billion genotypes (500 cases, 500 controls and 500,000 to 1,000,000 SNPs) [13]. We are currently updating GeneLink so it can run using either Sybase or Oracle. Furthermore, GeneLink was designed to avoid database-specific code and therefore should be portable to open-source database engines, such as PostgreSQL, without too much difficulty.
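The genotype count quoted above follows from simple multiplication: every individual is typed at every SNP. A minimal sketch of that arithmetic (the function name is ours, for illustration only):

```python
def genotype_count(n_cases: int, n_controls: int, n_snps: int) -> int:
    """Total genotypes = number of individuals typed x SNPs per individual."""
    return (n_cases + n_controls) * n_snps


low = genotype_count(500, 500, 500_000)
high = genotype_count(500, 500, 1_000_000)
print(f"{low:,} to {high:,} genotypes")  # 500,000,000 to 1,000,000,000 genotypes
```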
To collect the number of DNA samples needed to provide sufficient power to detect linkage or association, collaborative efforts are almost always required. The Web-based interface of GeneLink facilitates multi-center collaborations, as data can easily be accessed via the Internet. GeneLink's Web-based interface also makes it platform-independent, a feature that was essential given the number of researchers who would be accessing it using various hardware-browser combinations. Other publicly available databases described in the literature do not have this advantage. In this paper, we have presented GeneLink in the context of a collaborative effort in which multiple sites need access to data generated in a single laboratory. However, GeneLink would also be valuable in the context of a meta-analysis of data generated in more than one laboratory. Making data access easier for our collaborators translated into the need for a sophisticated security system. Specifically, in our study of hereditary prostate cancer, researchers are permitted access to only their own set of data. This is important because, in some cases, a site's institutional review board (IRB) protocol may not allow raw data to be shared with other analysts.
GeneLink provides several other advantages for investigators performing linkage or linkage disequilibrium studies of complex traits. For example, the process by which genotypes are imported into GeneLink was designed to be flexible enough to handle data from laboratories like our own, which employ duplicated samples and double-scoring methods for quality control purposes. Using duplicated samples and double scoring helps us keep our genotyping error rate low (<1%). In our HPC study, we included 92 duplicated samples (~4% of total samples) in order to evaluate our genotyping error rate. The entire import process is outlined in Figures 7A and 7B. After each of the initial steps (the import (step 1), duplicates check (step 2), and check for differences (step 3)), GeneLink produces a summary report (Figures 8A, 8B, and 8C). The "Import Report" summarizes the details of the import, including the date, user ID, number of records imported, and file name of the uploaded flat file containing the genotype data (Figure 8A). Examples of the duplicates and differences reports are shown in Figures 8B and 8C.
Figure 7 (A and B). Import process. The outline of the import process illustrates GeneLink's ability to be used in laboratories that include duplicated samples and double scoring for quality control purposes. The import process includes a within-table duplicate check and across ...
Figure 8 (A, B, and C). Example of import (A), duplicates (B), and differences (C) reports. The import report stores all information regarding genotypes imported into GeneLink. This information includes the number of records imported (how many unique individuals, how many ...
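The duplicates and differences checks described above can be illustrated with a short sketch. This is not GeneLink's implementation: the record layout (`Call`), the function names, and the marker name `D1S123` are hypothetical, and we assume that a "duplicate" is any individual-marker pair with more than one genotype call (from a duplicated sample or a second scorer) and a "difference" is a duplicate whose calls disagree.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One genotype call (hypothetical record layout)."""
    individual: str
    marker: str
    alleles: tuple  # allele pair stored sorted, so (150, 154) == (154, 150)
    scorer: str


def make_call(individual, marker, a1, a2, scorer):
    return Call(individual, marker, tuple(sorted((a1, a2))), scorer)


def find_duplicates(calls):
    """Return {(individual, marker): [calls]} for keys with more than one call."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[(call.individual, call.marker)].append(call)
    return {key: group for key, group in grouped.items() if len(group) > 1}


def find_differences(calls):
    """Return the duplicated keys whose calls do not all agree."""
    return {
        key: group
        for key, group in find_duplicates(calls).items()
        if len({call.alleles for call in group}) > 1
    }


def discordance_rate(calls):
    """Fraction of duplicated individual-marker pairs with disagreeing calls."""
    duplicates = find_duplicates(calls)
    if not duplicates:
        return 0.0
    return len(find_differences(calls)) / len(duplicates)


calls = [
    make_call("1001", "D1S123", 150, 154, "scorer_A"),
    make_call("1001", "D1S123", 154, 150, "scorer_B"),  # same genotype, other order
    make_call("1002", "D1S123", 150, 150, "scorer_A"),
    make_call("1002", "D1S123", 150, 152, "scorer_B"),  # discordant call
    make_call("1003", "D1S123", 148, 150, "scorer_A"),  # scored once only
]
print(sorted(find_differences(calls)))  # [('1002', 'D1S123')]
print(discordance_rate(calls))          # 0.5
```

Storing each allele pair in sorted order means that two scorers who record the same alleles in a different order are not reported as a difference.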
Another challenge of complex trait linkage or association studies is formatting data appropriately for analysis by existing software packages. Chromosome-specific LINKAGE, GAS, and RelCheck format files can easily be exported by GeneLink. By design, GeneLink's exporting capabilities also provide several additional advantages. First, GeneLink is capable of exporting multiple traits at the same time, thus facilitating analyses in which covariate information will be included. Second, by taking advantage of GeneLink's ability to generate liability classes defined by age, sex, and affection status, researchers can maximize power in the investigation of complex traits, which often exhibit reduced penetrance and phenocopies. Third, GeneLink's Allele Translation table allows comparison of alleles across families or across analyses, as each allele for each polymorphic marker will only be recoded once. This is particularly important as linkage disequilibrium or association studies become more common. Fourth, GeneLink's ability to export only a subset of families is critical, as genetic heterogeneity is a significant factor contributing to the difficulty of mapping genes involved in many complex traits. Multiple genes (RNASEL, ELAC2, and MSR1, among others; [14]) have been implicated in hereditary prostate cancer susceptibility, suggesting that genetic heterogeneity is likely to be a complicating factor in the gene mapping of HPC risk alleles regardless of the analysis method. Finally, GeneLink maintains a list of previously exported files, which eliminates redundant generation of data files by collaborators and functions as an archive of data files used for analyses.
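The idea behind recoding each allele only once can be sketched as a persistent lookup table. The class below is an illustration under our own assumptions, not GeneLink's schema: the class and method names are hypothetical, raw alleles are assumed to be fragment sizes, and codes are assigned as consecutive integers in order of first appearance.

```python
class AlleleTranslation:
    """Per-marker map from raw allele to a stable integer code (hypothetical)."""

    def __init__(self):
        self._codes = {}  # marker -> {raw allele: integer code}

    def code_for(self, marker, raw_allele):
        """Return the existing code for an allele, assigning the next one if new."""
        marker_codes = self._codes.setdefault(marker, {})
        if raw_allele not in marker_codes:
            marker_codes[raw_allele] = len(marker_codes) + 1
        return marker_codes[raw_allele]

    def recode(self, marker, genotype):
        """Translate a pair of raw alleles into their integer codes."""
        return tuple(self.code_for(marker, allele) for allele in genotype)


table = AlleleTranslation()
# First export: a family carrying alleles 150 and 154 at a hypothetical marker.
print(table.recode("D1S123", (150, 154)))  # (1, 2)
# Later export of a different family: allele 154 keeps code 2, 158 is new.
print(table.recode("D1S123", (154, 158)))  # (2, 3)
```

Because the table is shared across exports, allele 154 carries the same code in both files, which is what makes alleles comparable across families and analyses.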
Additional quality control measures were included in GeneLink's design. First, all changes to the database are recorded. As genetic studies of complex traits can be spread over many years, it is important to keep a detailed log of any changes made to the data. For example, an individual's affection status may change during the course of a study; therefore it is critical to track when this information was updated in the database. Second, in order to monitor data quality, GeneLink was also designed to perform several built-in checks, as described above.
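A change log of this kind can be sketched as a second table that is written in the same transaction as the update it describes. The example below is only an illustration: it uses Python's built-in `sqlite3` module rather than Sybase, and the table names, column names, and `update_affection` function are assumptions, not GeneLink's actual schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_demo_db():
    """Create an in-memory database with a data table and a change log."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE individual (id TEXT PRIMARY KEY, affection_status TEXT)"
    )
    conn.execute(
        "CREATE TABLE change_log (changed_at TEXT, user_id TEXT, "
        "individual_id TEXT, field TEXT, old_value TEXT, new_value TEXT)"
    )
    return conn


def update_affection(conn, user_id, individual_id, new_status):
    """Update affection status and log who changed what, and when."""
    row = conn.execute(
        "SELECT affection_status FROM individual WHERE id = ?", (individual_id,)
    ).fetchone()
    if row is None:
        raise KeyError(individual_id)
    with conn:  # commit both statements together, or roll both back on error
        conn.execute(
            "UPDATE individual SET affection_status = ? WHERE id = ?",
            (new_status, individual_id),
        )
        conn.execute(
            "INSERT INTO change_log VALUES (?, ?, ?, ?, ?, ?)",
            (
                datetime.now(timezone.utc).isoformat(),
                user_id,
                individual_id,
                "affection_status",
                row[0],
                new_status,
            ),
        )


conn = open_demo_db()
conn.execute("INSERT INTO individual VALUES ('1001', 'unaffected')")
update_affection(conn, "analyst1", "1001", "affected")
print(conn.execute("SELECT user_id, old_value, new_value FROM change_log").fetchall())
# [('analyst1', 'unaffected', 'affected')]
```

Writing the log entry inside the same transaction as the update means the data and its history cannot drift apart if one of the two statements fails.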
Given that genetic studies of complex traits will generate millions (or even billions) of genotypes, it is essential to have appropriate mechanisms in place to ensure data integrity. In our study of hereditary prostate cancer, these checks immediately discovered a typographical error, which, if left undiscovered, would have resulted in data from an affected individual never being exported or analyzed. Finally, GeneLink generates detailed reports storing pertinent information regarding all imports and exports (Figures and ), the status of projects (Figure 9), statistical information about markers (success rates and heterozygosity; Figure 10A), and DNA samples tested (Figure 10B). These reports are helpful in maintaining data quality. For example, in our HPC study with over 2,500 DNA samples, it would have been very easy to miss that a single individual was failing for more than 95% of markers if we were not using GeneLink. We were able to request new DNA samples for such individuals, as well as flag the stored data as potentially problematic.
Figure 9. Status report. GeneLink's status reports allow collaborators to easily track the project's progress by site. Reports show markers by chromosome (in map order) and the status of each marker for each site. By site, markers can be Not started, In Lab, Genotypes ...
Figure 10 (A and B). Marker (A) and individual (B) summaries. GeneLink's Marker Summary provides success rates and heterozygosity for individual markers typed in the study. The Marker Summary also provides information regarding when the genotype records for this ...
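The marker and individual summaries described above rest on simple proportions, which can be sketched as follows. This is an illustration under stated assumptions rather than GeneLink's reporting code: the function names and data layout are hypothetical, a failed genotype is represented as `None`, and heterozygosity is taken to mean the observed proportion of heterozygous genotypes among successful calls.

```python
from collections import defaultdict


def marker_summary(genotypes, marker):
    """Return (success rate, observed heterozygosity) for one marker.

    genotypes maps (individual, marker) to an allele pair, or to None
    when genotyping failed for that individual at that marker.
    """
    calls = [g for (_, m), g in genotypes.items() if m == marker]
    typed = [g for g in calls if g is not None]
    success = len(typed) / len(calls) if calls else 0.0
    het = sum(1 for a, b in typed if a != b) / len(typed) if typed else 0.0
    return success, het


def failing_individuals(genotypes, threshold=0.95):
    """List individuals whose failure rate across markers exceeds threshold."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for (individual, _), genotype in genotypes.items():
        attempts[individual] += 1
        if genotype is None:
            failures[individual] += 1
    return sorted(
        ind for ind in attempts if failures[ind] / attempts[ind] > threshold
    )


genotypes = {
    ("A", "M1"): (1, 2),
    ("A", "M2"): (1, 1),
    ("B", "M1"): None,  # individual B fails at every marker
    ("B", "M2"): None,
}
print(marker_summary(genotypes, "M1"))  # (0.5, 1.0)
print(failing_individuals(genotypes))   # ['B']
```

The default threshold of 0.95 mirrors the example in the text, where an individual failing for more than 95% of markers prompted a request for a new DNA sample.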
GeneLink was designed primarily in the context of family-based studies of complex traits. It is capable of handling both linkage and association data, and can be used for both whole-genome scans and candidate-gene studies. Further development of GeneLink will focus on extending its capabilities in regard to the case-based design. We recognize that both the family-based and case-based study designs have unique advantages, so we see it as critical to make GeneLink flexible enough to accommodate a case-based design. Currently, there is no limitation on storing case-based data; however, GeneLink's exporting mechanisms will need to be changed. Finally, in the same way that GeneLink is capable of storing "exported" data input files, future work will center on the storage of analysis results. Again, this would be helpful for multi-center collaborative studies, which will continue to be critical to successful efforts to identify genes important in complex trait etiology.