A number of groups have introduced stand-alone or web-based software tools that are either designed for eQTL analysis or can be adapted to it. Below we detail several software tools that have been used for eQTL analysis and briefly describe the advantages and limitations of each package, with a focus on ease of use for the wider scientific community. Several of the packages were not specifically developed for eQTL analysis, and thus our speed comparisons should be viewed in that context. However, such packages may already be in use in individual laboratories, and we therefore include those that, in our opinion, may be scaled for eQTL analysis on data from genome-wide array platforms.
R/qtl is a comprehensive R package for QTL mapping, initially configured for individual traits (in eQTL studies, one trait = one transcript) [33]. It can perform single- and two-QTL genome scans, interval mapping (typically most useful in a linkage context) and multiple QTL mapping. It can also perform imputation and genotype error-correction for experimental crosses using a hidden Markov model. Although R/qtl was originally designed for standard QTL analysis, it scales well for the massive task of eQTL analysis. In our tests, R/qtl showed generally fast performance. The R package ‘eqtl’ is a ‘wrapper’ for R/qtl that simplifies the process of analyzing eQTL data. R/qtl can estimate false-discovery rates to correct for multiple comparisons, but this option is available for multiple QTL mapping only.
eMap is an R package for eQTL mapping based on linear models. Its main features are model selection using backward selection, visualization and detection of eQTL hotspots. eMap was designed for eQTL analysis and shows average performance on large datasets. The package is partly written in C and requires the GNU Scientific Library to compile, and it does not readily support installation on Windows.
SNPster is an online tool for eQTL analysis in mouse populations, and is designed for datasets with only relatively sparse genotyping density [34]. It tests for association using an analysis of variance model over k-SNP genotype windows. The significance of the findings can be estimated using parametric models or permutations. SNPster can estimate the generalized family-wise error rate to correct for multiple comparisons. Although SNPster is no longer in development and cannot handle modern large datasets, it remains potentially useful for smaller datasets.
snpMatrix is part of an R/Bioconductor toolset designed for genotype management and association analysis [35]. It supports several common input file formats, including Plink, PED and HapMap files. In our tests, snpMatrix showed generally fast performance.
Plink is a very popular command-line tool for human genome-wide association studies (GWAS), designed to perform a range of basic, large-scale analyses [36]. Many researchers find Plink extremely useful for data manipulation tasks, including data filtering, preprocessing and merging. Plink was designed for association analysis, including for quantitative traits, but it is not optimized for analysis of multiple traits. In our eQTL experiments with no covariates (--assoc), Plink demonstrated good performance; however, inclusion of even a single covariate (--linear) made it approximately ten times slower. Plink supports a variety of adjustments for multiple testing, including the false-discovery rate, Bonferroni correction and Sidak’s adjusted p-values.
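To make the three adjustments named above concrete, the following is a minimal Python sketch of Bonferroni, single-step Sidak and Benjamini-Hochberg false-discovery rate adjusted p-values. It illustrates the textbook definitions only and is not Plink's implementation; the function names are hypothetical and chosen here for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Single-step Sidak adjustment: 1 - (1 - p)^m for m tests."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values (false-discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In an eQTL setting the number of tests is the number of gene–SNP pairs, so the Bonferroni and Sidak thresholds become very stringent, which is one reason false-discovery rate control is commonly preferred.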
Merlin is a command-line tool for pedigree analysis [37]. Merlin can perform a range of tasks, including QTL analysis, linkage, error detection and haplotyping. QTL analysis in datasets with unrelated individuals can be performed by identifying each sample as belonging to a separate family. Merlin employs likelihood ratio and score-test statistics. Although Merlin was designed for pedigree analysis, it is reasonably well suited for eQTL analysis. Even when run in its ‘fast’ association-testing mode (--FastAssoc), however, Merlin was slower than other packages. In addition, Merlin can simulate datasets with characteristics of the original data, but with no associations, to estimate the number of false findings.
Qxpak.5 is a command-line tool for various statistical genetic analyses using mixed models [38]. Its main features are linkage and association testing with fixed and mixed effects models using maximum likelihood, and QTL analysis with single or multiple QTLs. Qxpak.5 has specific input-file format specifications not shared by other packages. In addition, Qxpak.5 does not scale well for the task of eQTL analysis. When multiple phenotypes are specified, Qxpak.5 creates a temporary file for each SNP, which amounted to 80 GB for just the 57,000 SNPs in our test dataset. The tool is available only in 64-bit Linux and 32-bit Windows versions.
FastMap is a Java-based eQTL tool that has a graphical user interface [22]. FastMap visualizes findings for any given transcript using Manhattan plots with a zoom function, and provides a one-click connection to the University of California Santa Cruz (CA, USA) Genome Browser. Because of the optimizations it employs for discrete genotypes, FastMap cannot fully handle covariates in the analysis, except via preadjustment of the transcripts. FastMap is designed for fast eQTL analysis and was noticeably faster than all other tools in our tests, except Matrix eQTL. FastMap uses a two-step procedure to correct for multiple comparisons. First, for each gene, it estimates the significance of the most strongly associated SNP using permutations. Next, false-discovery rates are estimated from the resulting gene-level p-values.
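The two-step correction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than FastMap's implementation: absolute Pearson correlation is used here as a stand-in test statistic, the Benjamini-Hochberg procedure is assumed for the false-discovery rate step, and all function and parameter names are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def gene_level_pvalue(expression, snps, n_perm=200, seed=0):
    """Step 1: permutation p-value for the best SNP of a single gene."""
    rng = random.Random(seed)
    observed = max(abs_corr(expression, g) for g in snps)
    shuffled = list(expression)
    exceed = 0
    for _ in range(n_perm):
        # Permuting expression breaks any gene-SNP association
        # while preserving the correlation among SNPs.
        rng.shuffle(shuffled)
        if max(abs_corr(shuffled, g) for g in snps) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Step 2: Benjamini-Hochberg adjusted p-values across genes."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Calling `gene_level_pvalue` once per gene and passing the collected p-values to `benjamini_hochberg` yields one adjusted value per gene. Because the maximum over SNPs is recomputed in every permutation, the gene-level p-value already accounts for the number of SNPs tested against that gene.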
Matrix eQTL [109] is a new R and Matlab package for fast eQTL analysis. Matrix eQTL can account for correlated and/or heteroskedastic errors if the error-covariance matrix is provided. Matrix eQTL accounts for multiple comparisons by estimation of the false-discovery rate, with the option of separate false-discovery rate calculations for cis- and trans-eQTLs. In our experiments, Matrix eQTL was two to three orders of magnitude faster than all other tools.
Scalability of the computational tools
The availability of faster and cheaper genotyping tools, coupled with the considerable accuracy of imputation routines, has led to denser genotyping on more individuals. The resulting dramatic increase in the number of gene–SNP pairs in eQTL analysis imposes serious computational and memory burdens, and it is infeasible to store the test statistics for all gene–SNP pairs. The scalability of eQTL tools to bigger datasets thus becomes an important consideration. The R/qtl and snpMatrix packages store all test statistics in memory. Therefore, to analyze a large dataset, the user must apply these packages to one portion of the data at a time, manually filtering and aggregating the results. As we detailed above, Qxpak.5 and SNPster are not suitable for large datasets. By contrast, we judge that Plink, Merlin, eMap, FastMap and Matrix eQTL can be more readily applied to large datasets, as they can filter the associations based on a user-defined significance threshold.
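The strategy of processing one portion of the data at a time and keeping only associations that pass a threshold can be sketched as follows. This is a minimal illustration, not the approach of any specific package: it assumes a simple linear-regression (correlation) test without covariates, and it filters on the absolute t-statistic, which for a fixed sample size is equivalent to filtering on a p-value threshold. The names `scan_in_blocks`, `t_threshold` and `block_size` are hypothetical.

```python
import math


def standardize(row):
    """Center a row and scale it to unit length; constant rows become zeros."""
    n = len(row)
    mean = sum(row) / n
    centered = [v - mean for v in row]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else [0.0] * n


def scan_in_blocks(snps, genes, t_threshold, block_size=1000):
    """Yield (snp_index, gene_index, t) for pairs with |t| >= t_threshold.

    Only one block of SNPs is standardized at a time, and statistics
    below the threshold are discarded immediately instead of being stored.
    """
    n = len(genes[0])  # number of samples
    std_genes = [standardize(g) for g in genes]
    for start in range(0, len(snps), block_size):
        block = [standardize(s) for s in snps[start:start + block_size]]
        for i, s in enumerate(block):
            for j, g in enumerate(std_genes):
                # For unit-length centered vectors the dot product is Pearson's r.
                r = sum(a * b for a, b in zip(s, g))
                r = max(-1.0, min(1.0, r))
                denom = 1.0 - r * r
                if denom <= 0:
                    t = math.copysign(math.inf, r)
                else:
                    t = r * math.sqrt((n - 2) / denom)
                if abs(t) >= t_threshold:
                    yield start + i, j, t
```

Because the function is a generator, the caller can write each retained association directly to disk, so memory use depends on the block size rather than on the total number of gene–SNP pairs.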
Input data formats
Although there are some commonalities in the data formats, for the most part each tool has its own specific file format requirements. Plink takes three files as input: .MAP, with the SNP locations; .PED, with pedigree information and genotype; and a phenotype file with expression data. Merlin requires three different input files: .MAP, with SNP locations; .PED, with pedigree information, genotype and phenotype; and a .DAT file with annotation for the .PED file. Matrix eQTL, FastMap, R/qtl and eMap can accept data in simple-matrix format with expression and genotype data in separate files. The files must have one SNP/gene per line. FastMap can also accept HapMap and transposed Plink format. R/qtl also supports transposed input files with one sample per line. Qxpak.5 input files include: .PAR, with model parameters; .MKR, with genotypes; .DAT, with expression; and .PED, with pedigree information. While most other tools accept tab-delimited files, Qxpak.5 requires the values to be space-delimited, and also treats zeros in input files as missing values, so that the user must replace zero values with a specific small value (0.000001).
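As an illustration of the kind of preprocessing this implies, the sketch below converts tab-delimited rows to space-delimited rows and replaces exact zeros with 0.000001. It does not reproduce the full Qxpak.5 file layout; the function name, the `n_label_columns` parameter and the assumption that each row starts with label columns containing no spaces are hypothetical and made here for illustration only.

```python
def to_space_delimited(lines, n_label_columns=1, zero_substitute="0.000001"):
    """Convert tab-delimited rows to space-delimited rows, replacing exact zeros.

    The first n_label_columns fields (e.g. a gene or sample identifier)
    are copied unchanged; labels are assumed to contain no spaces.
    """
    converted = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out = fields[:n_label_columns]
        for field in fields[n_label_columns:]:
            try:
                is_zero = float(field) == 0.0
            except ValueError:
                is_zero = False  # non-numeric entries pass through unchanged
            out.append(zero_substitute if is_zero else field)
        converted.append(" ".join(out))
    return converted
```

The substitution should be applied only to columns in which zero is a genuine measurement, such as expression values, and not to fields in which zero is itself a code.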