In this section, we broadly brush the features offered by EggLib and offer a comparison with widely-used software packages available to the scientific community and offering population genetics utilities. We also compare their performances for file importing and parsing, polymorphism analysis, coalescent simulations and estimation of demographic parameters through ABC. Finally we provide two short examples of code: i) showing a very simple example of polymorphism analysis on a number of loci and ii) explaining how to customize a model for ABC inference using available functions in EggLib.
An overview of the different type of services provided by Egglib is given in Table and shows whether those features are implemented in other frequently used software. Whereas no general class of features is exclusive to EggLib, the point of EggLib is to bring together most tasks routinely performed in population genetics analyses within a single framework, whenever possible as built-in features (which are efficient and convenient to use). EggLib also brings specific features, such as missing data management and several coalescent simulation options (mutation bias, explicit position of markers, diploid model with selfing). Missing data (and alignment gaps) are a recurrent concern of empirical studies. EggLib can perform nucleotide diversity analyses allowing a given proportion of missing data (the statistics are computed on the remaining data). The power of this approach to detect polymorphic sites that would be otherwise ignored is depicted in Figure .
Features available in EggLib and alternative population genetics software packages
Figure 2 Effect of missing data and quality threshold on the detection of polymorphic sites. Estimates of the number of polymorphic sites as a function of the proportion of missing data for different quality thresholds (red = 100%, magenta = 90%, green = 50%, (more ...)
Usage of egglib-py
The programming interface of the Python package EggLib was designed to be intuitive, simple to use, and to allow fast development of scripts automating population genetics analyses. This was done by providing high-level interface layers above components implemented in C++ and internalizing much of the complexity. We present a simple example to demonstrate how a data set comprising an arbitrary number of loci can be analyzed in a compact and readable fashion by combining Python and EggLib simple syntax (Figure ). The example's comments describe what each block achieves (a full documentation of EggLib's class Align and simul module is available in the online reference manual). Here, we will point out the parts of the code exploiting EggLib's potentialities. Line 16 (align = egglib.Align(locus)) creates an alignment instance. The user is only required to specify the name of the FASTA file containing the alignment (locus). Line 19 (pol = align.polymorphism()) performs a polymorphism analysis with default settings, that correspond to the standard approach. One of the options (not shown here, see the reference manual) allows to support missing data (see above and Figure ). The returned value, pol, is a dictionary (associative array), that allows straightforward access to computed statistics. Whenever several populations and/or outgroup sequences are present in the alignment, between-population and outgroup-based statistics will be automatically computed. Finally, lines 44-46 demonstrate the usage of the coalescent simulator. In this example, the simplest possible model is used: a single constant-sized population with an infinite-site model of mutation. Three steps are performed: creation of a CoalesceParamSet instance (specifying the number of samples; line 44), creation of a FiniteAlleleMutator instance (specifying the mutator type and the rate of mutation; line 45), and, finally, call to the coalesce function that returns a list of Align instances (line 46). The advantage of this three-step syntax for configuring coalescent simulation is that it can accommodate both simple models (as the one used here) and more complex scenarios exploiting all potentialities of the coalescent simulator.
Figure 3 Example of diversity analysis implemented in Python using egglib-py. This script imports 100 FASTA-formatted alignments, performs a basic diversity analysis and finally compares the average Tajima's D statistic to a number of neutral coalescent simulations (more ...)
User-defined ABC model
Several commands accessible from the command line utilities of EggLib allow one to perform ABC analysis using command-line tools. However, the set of pre-defined models cannot be exhaustive and one of our aims is to allow using all EggLib functionalities to design any possible demographic model.
The model presented in Figure is an arbitrary example of model that is not available in the fitmodel module. This model is depicted at the top of the figure. It has five different parameters: THETA (θ in the picture), DATE, SIZE, MIGR1 and MIGR2. This model can be viewed as a double, simultaneous domestication from two partially isolated stocks (time runs from top to bottom). DATE is the age of the domestication event, the migration parameters specify exchange rates MIGR1 and MIGR2 between pairs of populations and SIZE gives the relative size of cultivated populations.
Figure 4 Code example: User-defined ABC model. Example of user-defined demographic model extending EggLib's pre-implemented ABC models. A graphical representation of the model is showed at the top of the picture, and the code to implement it is showed at the bottom. (more ...)
The code at the bottom of Figure shows the implementation of this model within the EggLib framework. Note that ABC models should conform to a few requirements: they must be formalized as a class; they must define their name and the names of all parameters; their constructor must accept at least one argument specifying whether recombination must be implemented (and deal with it appropriately), but it can accept more arguments; and they must contain a generate method which specifies the body of the model implementation. This very piece of code can be used in conjunction to fitmodel (within a Python script), but it can also be used to add this model to the list of models available through the interactive command abc_sample. Custom priors and sets of summary statistics can be incorporated using a similar system, although currently (as of version 2.1.2) abc_sample does not currently support run-time addition of priors and summary statistics (such support is planned).
The generate method is the hook that connects the model to the rest of the ABC framework. It must take two arguments: a sample configuration and a set of parameter values drawn from the prior (the fitmodel documentation provides details of the exact format of these data). The generate method must return simulated data sets (using a type defined in fitmodel). Apart from these constraints, the user has full freedom with regard to what is actually done for generating the data set. Obviously, all potentialities of the coalescent simulator incorporated within EggLib are allowed. Furthermore, all forms of post-processing operations are not only possible, but easy to implement using egglib-py. For example, one can readily include error rates or sampling or ascertainment biases and set them as model parameters.
The running time and maximum memory usage of programs performing common population genetics operations using EggLib compared with alternatives (whenever available) is shown on Tables , , and . All tests were run on a laptop computer and (except for coalescent simulations) were repeated 10 times. Tests were performed using EggLib 2.1.2, Biopython 1.58, libsequence 1.7.4 [26
], analysis 0.8.1 (containing compute, polydNdS
], the version of ms updated December 11, 2009 [20
], coasim-python 1.3 [21
], msABC 20111219 [24
] and ABCreg 2009-07-30 [25
] (which were all the latest available versions at time of testing). For coalescent simulations (including in ABC), EggLib was set to use at most 4 processor cores.
Running time and memory use while importing FASTA files
Running time and memory use while performing diversity analyses
Running time and memory use while performing coalescent simulations
Running time and memory use while performing ABC
EggLib is comparatively more efficient than Biopython for importing large FASTA files (Table ). The Align class of EggLib is slightly more efficient than AlignIO of BioPython for importing a large alignment. For importing data files representing the whole Oryza sativa genome, the Container class of EggLib is much more efficient than SeqIO of Biopython (EggLib is able to import these two files fully in memory in a few seconds and with a limited memory overhead: the memory use is hardly larger than the file size). However, the difference between EggLib and Biopython reflects a difference in paradigm (import all file at once for EggLib, and read sequence one at a time for Biopython).
The comparison of an EggLib script for analyzing polymorphism with programs developed using the C++ libsequence library (compute for standard statistics, polydNdS for coding sequence statistics and rsq for linkage disequilibrium) shows that skipping unneeded statistics can significantly fasten the analysis. For analyzing a single alignment, libsequence programs are better, but for processing many alignments in a row a single loop using EggLib is more efficient. In EggLib, the linkage disequilibrium analysis is comparatively more efficient, and the coding sequence analysis (based on the wrapping of Bio++) is comparatively less efficient.
The comparison of the eggcoal, ms and CoaSim simulators shows that ms is consistently and significantly the fastest and the least memory-demanding (Table ). eggcoal lies between ms and CoaSim. EggLib has a generic design that makes it difficult to maximize performance, explaining part of the discrepancy. However, we believe that future versions will improve performance, especially thanks to improved implementation of recombination and multithreading scheme that are currently planned.
We compared the performances of EggLib commands for ABC to the very efficient programs msABC (for the simulation phase) and ABCreg (for the analysis phase). We used two different summary statistics sets: SDZ (number of polymorphic sites, Tajima's D and Fay and Wu's H) and SFS (site frequency spectrum with 8 categories). The SFS was available only with EggLib. EggLib was used through the interactive commands abc_sample and abc_fit. We found that EggLib abc_sample was slower than msABC and used more memory (chiefly because of Python-level multithreading). This is explained by the performance of the original ms program (see above), that was efficiently leveraged in msABC. However, the overhead tends to be decreased compared with the eggcoal/ms comparison presented before, showing that the EggLib integration does not worsen performance. We therefore expect that future improvements of the coalescent simulator will bring EggLib closer to the level of msABC. For the analysis step, a large data file of 5,000,000 samples was imported and analyzed. We observed that this step of the ABC procedure was not limiting in running time (compared to the simulation step) but could be limiting in memory use. Therefore we followed a strategy favoring data access from file, which is relatively slower but more memory efficient. EggLib and ABCreg are therefore complementary regarding the speed/memory balance.
EggLib is under active development and we expect new features to be added in the future. Our current routes for improving the package include: improving the performance of the coalescent simulator thanks to a new design of the recombination process and an improved parallelization scheme; easing the definition by users of custom ABC models, sets of summary statistics and priors using automated helpers; improving the performance of the ABC framework by internalizing replications within the C++ layer and removing unnecessary steps (such as Align conversion) without interfering with the general flexibility of the framework; putting a special effort in documentation, especially by providing tutorials besides the complete reference manual.