|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning.
Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license.
Python (www.python.org) and Biopython are freely available open source tools, available for all the major operating systems. Python is a very high-level programming language, in widespread commercial and academic use. It features an easy to learn syntax, object-oriented programming capabilities and a wide array of libraries. Python can interface to optimized code written in C, C++or even FORTRAN, and together with the Numerical Python project numpy (Oliphant, 2006), makes a good choice for scientific programming (Oliphant, 2007). Python has even been used in the numerically demanding field of molecular dynamics (Hinsen, 2000). There are also high-quality plotting libraries such as matplotlib (matplotlib.sourceforge.net) available.
Since its founding in 1999 (Chapman and Chang, 2000), Biopython has grown into a large collection of modules, described briefly below, intended for computational biology or bioinformatics programmers to use in scripts or incorporate into their own software. Our web site lists over 100 publications using or citing Biopython.
The Open Bioinformatics Foundation (OBF, www.open-bio.org) hosts our web site, source code repository, bug tracking database and email mailing lists, and also supports the related BioPerl (Stajich et al., 2002), BioJava (Holland et al., 2008), BioRuby (www.bioruby.org) and BioSQL (www.biosql.org) projects.
The Seq object is Biopython's core sequence representation. It behaves very much like a Python string but with the addition of an alphabet (allowing explicit declaration of a protein sequence for example) and some key biologically relevant methods. For example,
Sequence annotation is represented using SeqRecord objects which augment a Seq object with properties such as the record name, identifier and description and space for additional key/value terms. The SeqRecord can also hold a list of SeqFeature objects which describe sub-features of the sequence with their location and their own annotation.
The Bio.SeqIO module provides a simple interface for reading and writing biological sequence files in various formats (Table 1), where regardless of the file format, the information is held as SeqRecord objects. Bio.SeqIO interprets multiple sequence alignment file formats as collections of equal length (gapped) sequences. Alternatively, Bio.AlignIO works directly with alignments, including files holding more than one alignment (e.g. re-sampled alignments for bootstrapping, or multiple pairwise alignments). Related module Bio.Nexus, developed for Kauff et al. (2007), supports phylogenetic tools using the NEXUS interface (Maddison et al., 1997) or the Newick standard tree format.
Modules for a number of online databases are included, such as the NCBI Entrez Utilities, ExPASy, InterPro, KEGG and SCOP. Bio.Blast can call the NCBI's online Blast server or a local standalone installation, and includes a parser for their XML output. Biopython has wrapper code for other command line tools too, such as ClustalW and EMBOSS. Bio.PDB module provides a PDB file parser, and functionality related to macromolecular structure (Hamelryck and Manderick, 2003). Module Bio.Motif provides support for sequence motif analysis (searching, comparing and de novo learning). Biopython's graphical output capabilities were recently significantly extended by the inclusion of GenomeDiagram (Pritchard et al., 2006).
Biopython contains modules for supervised statistical learning, such as Bayesian methods and Markov models, as well as unsu pervised learning, such as clustering (De Hoon et al., 2004).
The population genetics module provides wrappers for GENEPOP (Rousset, 2007), coalescent simulation via SIMCOAL2 (Laval and Excoffier, 2004) and selection detection based on a well-evaluated Fst-outlier detection method (Beaumont and Nichols, 1996).
BioSQL (www.biosql.org) is another OBF supported initiative, a joint collaboration between BioPerl, Biopython, BioJava and BioRuby to support loading and retrieving annotated sequences to and from an SQL database using a standard schema. Each project provides an object-relational mapping (ORM) between the shared schema and its own object model (a SeqRecord in Biopython). As an example, xBASE (Chaudhuri and Pallen, 2006) uses BioSQL with both BioPerl and Biopython.
Biopython is a large open-source application programming interface (API) used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists. The features described herein are only a subset; potential users should refer to the tutorial and API documentation for further information.
The OBF hosts and supports the project. The many Biopython contributors over the years are warmly thanked, a list too long to be reproduced here.
Funding: Fundacao para a Ciencia e Tecnologia (Portugal) (grant SFRH/BD/30834/2006 to T.A.).
Conflict of Interest: none declared.