In this paper, we introduce pebl, a Python library and application for learning Bayesian network structure from data and prior knowledge that provides features unmatched by alternative software packages: the ability to use interventional data, flexible specification of structural priors, modeling with hidden variables and exploitation of parallel processing.
Bayesian networks (BNs) have become a popular methodology in many fields because they can model nonlinear, multimodal relationships using noisy, inconsistent data. Although learning the structure of BNs from data is now common, there is still a great need for high-quality open-source software that can meet the needs of various users. End users require software that is easy to use; supports learning with different data types; can accommodate missing values and hidden variables; and can take advantage of various computational clusters and grids. Researchers require a framework for developing and testing new algorithms and translating them into usable software. We have developed the Python Environment for Bayesian Learning (pebl) to meet these needs.
pebl provides many features for working with data and BNs; some of the more notable ones are listed below.
pebl can load data from tab-delimited text files with continuous, discrete and class variables and can perform maximum entropy discretization. Data collected following an intervention are important for determining causality but require an altered scoring procedure (Pe’er et al., 2001; Sachs et al., 2002). pebl uses the BDe metric for scoring networks and handles interventional data using the method described by Cooper and Yoo (2002).
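One way to read the treatment of interventional data, assuming ideal interventions, is that a sample in which a variable was set externally says nothing about how that variable depends on its parents. Such a sample is skipped when scoring that variable's family but is still used for every other family. The sketch below illustrates the idea with a BDeu-style family score. It is not pebl's implementation: the function name `family_log_score`, its arguments and the equivalent sample size `ess` are assumptions made for this example.

```python
from collections import defaultdict
from math import lgamma


def family_log_score(data, interventions, child, parents, arity, ess=1.0):
    """BDeu-style log marginal likelihood of `child` given `parents`.

    data          : list of samples, each a sequence of discrete values
                    in range(arity[variable])
    interventions : list of sets; interventions[i] holds the variables
                    that were externally set in sample i (hypothetical
                    representation chosen for this sketch)
    """
    q = 1  # number of joint parent configurations
    for p in parents:
        q *= arity[p]
    r = arity[child]
    alpha_ij = ess / q
    alpha_ijk = ess / (q * r)

    # Count child values per parent configuration, skipping samples in
    # which the child itself was intervened upon.
    counts = defaultdict(lambda: [0] * r)
    for row, intervened in zip(data, interventions):
        if child in intervened:
            continue
        key = tuple(row[p] for p in parents)
        counts[key][row[child]] += 1

    # Parent configurations that were never observed contribute zero.
    score = 0.0
    for n_ijk in counts.values():
        score += lgamma(alpha_ij) - lgamma(alpha_ij + sum(n_ijk))
        for n in n_ijk:
            score += lgamma(alpha_ijk + n) - lgamma(alpha_ijk)
    return score
```

Because the score of a network decomposes into a sum of family scores, the same intervened sample can be ignored for one node and counted for all the others.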
pebl can handle missing values and hidden variables using exact marginalization and Gibbs sampling (Heckerman, 1998). The Gibbs sampler can be resumed from a previously suspended state, allowing for interactive inspection of preliminary results or a manual strategy for determining satisfactory convergence.
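The resumable behaviour can be illustrated with a small sampler that keeps all of its state (the current completion of the data, the random number generator and the sweep count) on one object, so that calling `run` again continues where the previous call stopped. This is only a sketch of Gibbs sampling over completions of missing discrete values. The class `ResumableGibbs`, its arguments and the user-supplied `log_score` function are hypothetical, and the step that turns the samples into a marginal likelihood estimate is omitted.

```python
import math
import random


class ResumableGibbs:
    """Gibbs sampler over missing discrete cells (illustrative only)."""

    def __init__(self, data, missing, arity, log_score, seed=0):
        self.data = [list(row) for row in data]  # private, mutable copy
        self.missing = list(missing)             # (row, column) cells to fill
        self.arity = arity                       # number of values per column
        self.log_score = log_score               # log score of completed data
        self.rng = random.Random(seed)
        self.sweeps = 0
        self.scores = []
        # Start from a random completion of the missing cells.
        for i, j in self.missing:
            self.data[i][j] = self.rng.randrange(arity[j])

    def run(self, n_sweeps):
        """Run n_sweeps more sweeps, continuing from the current state."""
        for _ in range(n_sweeps):
            for i, j in self.missing:
                # Score every candidate value with all other cells fixed.
                logs = []
                for value in range(self.arity[j]):
                    self.data[i][j] = value
                    logs.append(self.log_score(self.data))
                # Sample proportionally to exp(log score), shifted by the
                # maximum for numerical stability.
                top = max(logs)
                weights = [math.exp(s - top) for s in logs]
                values = range(self.arity[j])
                self.data[i][j] = self.rng.choices(values, weights)[0]
            self.sweeps += 1
            self.scores.append(self.log_score(self.data))
        return self.scores
```

A user could call `run(100)`, inspect `scores` for signs of convergence and then call `run(100)` again; with the same seed, the result matches a single call to `run(200)`.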
A key strength of Bayesian analysis is the ability to use prior knowledge. pebl supports structural priors over edges specified as 'hard' constraints or 'soft' energy matrices (Imoto et al., 2003) and arbitrary constraints specified as Python functions or lambda expressions.
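To make the three kinds of prior concrete, the sketch below combines them in one unnormalized log prior. Hard constraints (required and prohibited edges) and arbitrary predicate functions rule a network out entirely, while the energy matrix contributes a soft penalty of the common form log P(G) = -weight * (sum of the energies of the edges in G) + constant. The function `log_structure_prior` and its parameters are assumptions made for illustration, not pebl's API.

```python
def log_structure_prior(edges, energy, required=(), prohibited=(),
                        constraints=(), weight=1.0):
    """Unnormalized log prior of a network given as (source, target) edges.

    energy      : square matrix; energy[i][j] is the cost of edge i -> j
                  (lower energy means the edge is more plausible a priori)
    required    : edges that must be present (hard constraint)
    prohibited  : edges that must be absent (hard constraint)
    constraints : callables taking the edge set and returning True/False
    """
    edges = set(edges)
    if not set(required) <= edges:
        return float("-inf")
    if edges & set(prohibited):
        return float("-inf")
    if not all(check(edges) for check in constraints):
        return float("-inf")
    return -weight * sum(energy[i][j] for i, j in edges)


# Usage: node 2 may have at most one parent, expressed as a lambda.
energy = [[0.5, 0.1, 0.9],
          [0.5, 0.5, 0.5],
          [0.5, 0.5, 0.5]]
at_most_one_parent = lambda es: sum(1 for _, dst in es if dst == 2) <= 1
soft_only = log_structure_prior({(0, 1)}, energy,
                                constraints=[at_most_one_parent])
ruled_out = log_structure_prior({(0, 2), (1, 2)}, energy,
                                constraints=[at_most_one_parent])
```

Returning negative infinity for a violated hard constraint lets a learner add this value to a log score and reject such networks without special cases.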
pebl includes greedy hill-climbing and simulated annealing learners and makes writing custom learners easy. Efficient implementation of learners requires careful programming to eliminate redundant computation. pebl provides components to alter, score and roll back changes to BNs in a simple, transactional manner; with these, efficient learners look remarkably similar to pseudocode.
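The alter, score and roll back pattern can be sketched as a greedy hill-climber over parent sets. The example assumes a decomposable score, so that toggling the edge src -> dst only requires rescoring dst's family. It does not use pebl's components: `greedy_learn`, `is_ancestor` and the `local_score(node, parent_set)` callback are hypothetical names chosen for this illustration.

```python
import random


def is_ancestor(parents, node, candidate):
    """Return True if `candidate` is `node` itself or one of its ancestors."""
    stack, seen = [node], set()
    while stack:
        v = stack.pop()
        if v == candidate:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False


def greedy_learn(n_nodes, local_score, max_iters=1000, seed=0):
    """Greedy hill-climbing; returns (list of parent sets, total score)."""
    rng = random.Random(seed)
    parents = [set() for _ in range(n_nodes)]
    scores = [local_score(v, frozenset()) for v in range(n_nodes)]
    for _ in range(max_iters):
        src, dst = rng.sample(range(n_nodes), 2)
        removing = src in parents[dst]
        if not removing and is_ancestor(parents, src, dst):
            continue  # adding src -> dst would close a directed cycle
        # Alter: toggle the edge src -> dst.
        if removing:
            parents[dst].discard(src)
        else:
            parents[dst].add(src)
        # Score: only dst's family changed, so only it is rescored.
        new_score = local_score(dst, frozenset(parents[dst]))
        if new_score > scores[dst]:
            scores[dst] = new_score  # commit the change
        elif removing:
            parents[dst].add(src)  # roll back the removal
        else:
            parents[dst].discard(src)  # roll back the addition
    return parents, sum(scores)
```

Keeping one cached score per node is what removes the redundant computation: a rejected move costs a single family evaluation rather than a rescoring of the whole network.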
pebl includes both a library and a command line application. It aims for a balance between ease of use, extensibility and performance. The majority of pebl is written in Python, a dynamically-typed programming language that runs on all major operating systems. Critical sections use the numpy library (Ascher et al., 2001) for high-performance matrix operations and custom extensions written in ANSI C for portability and speed.
pebl’s use of Python makes it suitable for both programmers and domain experts. Python provides interactive shells and notebook interfaces and includes an extensive standard library and many third-party packages. It has a strong presence in the scientific computing community (Oliphant, 2007). Figure 1 shows an example script and configuration file that showcase the ease of using pebl.
While many tasks related to Bayesian learning are embarrassingly parallel in theory, few software packages take advantage of this. pebl can execute learning tasks in parallel over multiple processors or CPU cores, an Apple Xgrid [1], an IPython cluster [2] or the Amazon EC2 platform [3]. The EC2 platform is especially attractive for scientists because it allows one to rent processing power on demand and execute pebl tasks on it.
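Independent learner runs (for example, restarts from different random seeds) share no state, which is what makes them embarrassingly parallel. The sketch below distributes such runs over local CPU cores with Python's standard `concurrent.futures` module and keeps the best result. It is a generic illustration rather than pebl's task controller interface; `learn_once` is a stand-in that returns a random score instead of running a real learner.

```python
import random
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def learn_once(seed):
    """Stand-in for one independent learner run; returns (score, seed)."""
    rng = random.Random(seed)
    return (rng.random(), seed)


def best_of_restarts(task, seeds, executor_cls=ProcessPoolExecutor,
                     max_workers=None):
    """Run task(seed) for every seed in parallel and return the best result.

    Results are (score, payload) tuples, so max() picks the highest score.
    """
    with executor_cls(max_workers=max_workers) as pool:
        results = list(pool.map(task, seeds))
    return max(results)
```

With the default process pool, the task must be a top-level (picklable) function and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Passing `executor_cls=ThreadPoolExecutor` is convenient for quick checks, although threads do not speed up CPU-bound pure Python code.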
With appropriate configuration settings and the use of parallel execution, pebl can be used for large learning tasks. Although pebl has been tested successfully with datasets with 10000 variables and samples, BN structure learning is known to be NP-hard (Chickering et al., 1994), and analysis of datasets with more than a few hundred variables is likely to give poor results because only a small fraction of the search space can be covered.
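The size of the search space can be made concrete by counting the directed acyclic graphs on n labeled nodes with Robinson's recurrence, a(n) = sum over k = 1..n of (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k), with a(0) = 1. The short function below is added for illustration and is not part of pebl.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k))
               * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(1, 6):
    print(n, count_dags(n))
# prints: 1 1, 2 3, 3 25, 4 543, 5 29281 (one pair per line)
```

The count grows super-exponentially in the number of variables, so any heuristic search visits only a vanishing fraction of candidate structures once the network has more than a few dozen nodes.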
The benefits of open source software derive not just from the freedoms afforded by the software license but also from the open and collaborative development model. pebl’s source code repository and issue tracker are hosted at Google Code and freely available to all. Additionally, pebl includes over 200 automated unit tests and mandates that every source code submission and resolved error be accompanied by tests.
While there are many software tools for working with BNs, most focus on parameter learning and inference rather than structure learning. Of the tools that do support structure learning, few are open-source and none provide the set of features included in pebl. As shown in Table 1, the abilities to handle interventional data, model with missing values and hidden variables, use soft and arbitrary priors and exploit parallel platforms are unique to pebl. pebl, however, does not currently provide any features for inference or for learning Dynamic Bayesian Networks (DBNs). Despite its use of optimized matrix libraries and custom C extension modules, pebl can be an order of magnitude or more slower than software written in Java or C/C++; its ability to use a wider range of data and priors, its parallel processing features and its ease of use, however, should make it an attractive option for many users.
We have developed a library and application for learning BNs from data and prior knowledge. The set of features found in pebl is unmatched by alternative packages and we hope that our open development model will convince others to use pebl as a platform for BN algorithms research.
We would like to acknowledge support for this project from the NIH grant #U54 DA021519.
Availability: pebl is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com.
[1] Grid computing solution by Apple, Inc. http://www.apple.com/server/macosx/technology/xgrid.html
[2] Cluster of Python interpreters. http://ipython.scipy.org
[3] A pay-per-use, on-demand computing platform by Amazon, Inc. http://aws.amazon.com