Bioinformatics. Jun 1, 2009; 25(11): 1422–1423.
Published online Mar 20, 2009. doi:  10.1093/bioinformatics/btp163
PMCID: PMC2682512
Biopython: freely available Python tools for computational molecular biology and bioinformatics
Peter J. A. Cock,1* Tiago Antao,2 Jeffrey T. Chang,3 Brad A. Chapman,4 Cymon J. Cox,5 Andrew Dalke,6 Iddo Friedberg,7 Thomas Hamelryck,8 Frank Kauff,9 Bartek Wilczynski,10,11 and Michiel J. L. de Hoon12
1Plant Pathology, SCRI, Invergowrie, Dundee, DD2 5DA, 2Liverpool School of Tropical Medicine, Liverpool, L3 5QA, UK, 3Institute for Genome Sciences and Policy, Duke University Medical Center, Durham, NC, 4Department of Molecular Biology, Simches Research Center, Massachusetts General Hospital, Boston, MA 02114, USA, 5Centro de Ciências do Mar, Universidade do Algarve, Faro, Portugal, 6Andrew Dalke Scientific, AB, Gothenburg, Sweden, 7California Institute for Telecommunications and Information Technology, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0446, USA, 8Bioinformatics Center, Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark, 9Molecular Phylogenetics, Department of Biology, TU Kaiserslautern, 67653 Kaiserslautern, Germany, 10EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany, 11Institute of Informatics, University of Warsaw, Poland and 12RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama-shi, Kanagawa-ken, 230-0045, Japan
*To whom correspondence should be addressed.
Associate Editor: Dmitrij Frishman
Received March 11, 2009; Accepted March 16, 2009.
Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macromolecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, and accessing key online databases, as well as providing numerical methods for statistical learning.
Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license.
Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/_Mailing_lists; peter.cock/at/scri.ac.uk.
1 INTRODUCTION
Python (www.python.org) and Biopython are freely available open source tools that run on all the major operating systems. Python is a very high-level programming language, in widespread commercial and academic use. It features an easy-to-learn syntax, object-oriented programming capabilities and a wide array of libraries. Python can interface to optimized code written in C, C++ or even FORTRAN, and together with the Numerical Python project numpy (Oliphant, 2006), makes a good choice for scientific programming (Oliphant, 2007). Python has even been used in the numerically demanding field of molecular dynamics (Hinsen, 2000). There are also high-quality plotting libraries such as matplotlib (matplotlib.sourceforge.net) available.
Since its founding in 1999 (Chapman and Chang, 2000), Biopython has grown into a large collection of modules, described briefly below, intended for computational biology or bioinformatics programmers to use in scripts or incorporate into their own software. Our web site lists over 100 publications using or citing Biopython.
The Open Bioinformatics Foundation (OBF, www.open-bio.org) hosts our web site, source code repository, bug tracking database and email mailing lists, and also supports the related BioPerl (Stajich et al., 2002), BioJava (Holland et al., 2008), BioRuby (www.bioruby.org) and BioSQL (www.biosql.org) projects.
2 BIOPYTHON FEATURES
The Seq object is Biopython's core sequence representation. It behaves very much like a Python string but with the addition of an alphabet (allowing explicit declaration of a protein sequence for example) and some key biologically relevant methods. For example,
[Example listing supplied as an image in the original: btp163i1.jpg]
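As a rough illustration of the idea, the following plain-Python sketch pairs a string with an alphabet label and adds a few biological methods. `ToySeq`, its alphabet strings and `CODON_TABLE` are hypothetical names introduced here; this is not Biopython's implementation, and it uses only the standard genetic code. In Biopython itself the corresponding `Seq` methods are `reverse_complement()`, `transcribe()` and `translate()`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class ToySeq:
    """Hypothetical stand-in for a sequence object: a string plus an alphabet."""

    def __init__(self, data, alphabet="DNA"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def __len__(self):
        return len(self.data)

    def _require_dna(self):
        # The alphabet lets us reject operations that make no biological sense.
        if self.alphabet != "DNA":
            raise TypeError(f"operation needs DNA, got {self.alphabet}")

    def reverse_complement(self):
        self._require_dna()
        return ToySeq(self.data.translate(COMPLEMENT)[::-1], "DNA")

    def transcribe(self):
        self._require_dna()
        return ToySeq(self.data.replace("T", "U"), "RNA")

    def translate(self):
        self._require_dna()
        # Read complete codons only; a trailing partial codon is ignored.
        codons = (self.data[i:i + 3] for i in range(0, len(self.data) - 2, 3))
        return ToySeq("".join(CODON_TABLE[c] for c in codons), "protein")


gene = ToySeq("ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTGA")
print(gene.transcribe())
print(gene.translate())   # MKAIFVLKGWWRT*
```

Because every method returns a new object carrying its own alphabet, calling `translate()` on the protein result raises a `TypeError` instead of silently producing nonsense, which is the kind of check an explicit alphabet makes possible.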
Sequence annotation is represented using SeqRecord objects which augment a Seq object with properties such as the record name, identifier and description and space for additional key/value terms. The SeqRecord can also hold a list of SeqFeature objects which describe sub-features of the sequence with their location and their own annotation.
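The relationship between a record, its annotation and its sub-features can be sketched with two small data classes. `ToyRecord` and `ToyFeature` are hypothetical names; the field names loosely follow the `SeqRecord` attributes (`seq`, `id`, `name`, `description`, `annotations`, `features`), but the code is only an illustration and the example data are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeature:
    """A sub-region of a sequence with its own annotation (hypothetical)."""
    kind: str                 # e.g. "CDS" or "gene"
    start: int                # 0-based, inclusive
    end: int                  # 0-based, exclusive (Python slice convention)
    qualifiers: dict = field(default_factory=dict)

    def extract(self, sequence):
        """Return the part of the parent sequence that this feature covers."""
        return sequence[self.start:self.end]


@dataclass
class ToyRecord:
    """A sequence plus identifier, name, description and free-form annotation."""
    seq: str
    id: str = "<unknown id>"
    name: str = "<unknown name>"
    description: str = ""
    annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


record = ToyRecord(
    seq="GGGATGAAATGACCC",
    id="example_1",                       # made-up identifier
    description="illustrative record",
    annotations={"organism": "synthetic construct"},
)
record.features.append(ToyFeature("CDS", 3, 12, {"product": "toy peptide"}))

for feature in record.features:
    print(feature.kind, feature.extract(record.seq))   # CDS ATGAAATGA
```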
The Bio.SeqIO module provides a simple interface for reading and writing biological sequence files in various formats (Table 1). Regardless of the file format, the information is held as SeqRecord objects. Bio.SeqIO interprets multiple sequence alignment file formats as collections of equal length (gapped) sequences. Alternatively, Bio.AlignIO works directly with alignments, including files holding more than one alignment (e.g. re-sampled alignments for bootstrapping, or multiple pairwise alignments). The related module Bio.Nexus, developed for Kauff et al. (2007), supports phylogenetic tools using the NEXUS interface (Maddison et al., 1997) or the Newick standard tree format.
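The benefit of a format-independent record can be sketched in a few lines: a parser for one format yields records, and a writer for another format consumes them, so neither needs to know about the other. The `Record` type and the `parse_fasta` and `write_tab` functions below are hypothetical stand-ins written for this illustration. In Biopython the equivalent calls are `Bio.SeqIO.parse(handle, format)`, which returns an iterator of SeqRecord objects, and `Bio.SeqIO.write(records, handle, format)`, which returns the number of records written.

```python
import io
from collections import namedtuple

# Hypothetical record type: every parser yields it, every writer consumes it.
Record = namedtuple("Record", "id description seq")


def _make_record(header, chunks):
    # The identifier is taken to be the first word of the header line.
    words = header.split(None, 1)
    identifier = words[0] if words else ""
    return Record(identifier, header, "".join(chunks))


def parse_fasta(handle):
    """Yield one Record per '>' entry in a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def write_tab(records, handle):
    """Write 'id<TAB>sequence' lines and return how many records were written."""
    count = 0
    for record in records:
        handle.write(f"{record.id}\t{record.seq}\n")
        count += 1
    return count


fasta_text = ">seq1 first toy entry\nACGT\nACGT\n>seq2\nTTGA\n"
output = io.StringIO()
written = write_tab(parse_fasta(io.StringIO(fasta_text)), output)
print(written)             # 2
print(output.getvalue())
```

Since the parser is a generator, records are processed one at a time, so a large file can be converted without being loaded into memory in full.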
Table 1. Selected Bio.SeqIO or Bio.AlignIO file formats
Modules for a number of online databases are included, such as the NCBI Entrez Utilities, ExPASy, InterPro, KEGG and SCOP. Bio.Blast can call the NCBI's online BLAST server or a local standalone installation, and includes a parser for the BLAST XML output. Biopython also has wrapper code for other command line tools, such as ClustalW and EMBOSS. The Bio.PDB module provides a PDB file parser and functionality related to macromolecular structure (Hamelryck and Manderick, 2003). The Bio.Motif module supports sequence motif analysis (searching, comparing and de novo learning). Biopython's graphical output capabilities were recently extended significantly by the inclusion of GenomeDiagram (Pritchard et al., 2006).
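To give a feel for what parsing a PDB file involves, the sketch below reads the fixed-width ATOM and HETATM lines of the PDB format. It is a simplified illustration, not the Bio.PDB implementation (whose entry point is `Bio.PDB.PDBParser`): the `Atom` type and `parse_atom_line` function are hypothetical, the coordinates are made up, and fields such as alternate locations, insertion codes, occupancy and B-factors are ignored.

```python
from collections import namedtuple

Atom = namedtuple("Atom", "serial name residue chain residue_number x y z")


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM line; return None for other records."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # PDB files are column-based, so fields are sliced by position, not split.
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        chain=line[21],
        residue_number=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Two illustrative atom lines with made-up coordinates.
SAMPLE = (
    "HEADER    TOY STRUCTURE\n"
    "ATOM      1  N   MET A   1      27.340  24.430   2.614\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842\n"
    "END\n"
)

atoms = [atom for atom in map(parse_atom_line, SAMPLE.splitlines()) if atom]
for atom in atoms:
    print(atom.chain, atom.residue, atom.residue_number, atom.name, atom.x)
```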
Biopython contains modules for supervised statistical learning, such as Bayesian methods and Markov models, as well as unsupervised learning, such as clustering (De Hoon et al., 2004).
The population genetics module provides wrappers for GENEPOP (Rousset, 2007), coalescent simulation via SIMCOAL2 (Laval and Excoffier, 2004) and selection detection based on a well-evaluated Fst-outlier detection method (Beaumont and Nichols, 1996).
BioSQL (www.biosql.org) is another OBF-supported initiative, a collaboration between BioPerl, Biopython, BioJava and BioRuby to support loading and retrieving annotated sequences to and from an SQL database using a standard schema. Each project provides an object-relational mapping (ORM) between the shared schema and its own object model (a SeqRecord in Biopython). As an example, xBASE (Chaudhuri and Pallen, 2006) uses BioSQL with both BioPerl and Biopython.
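The object-relational mapping idea can be sketched with Python's built-in sqlite3 module. The two-table layout below is a deliberately simplified, hypothetical schema and not the real BioSQL schema, which is considerably larger. The `Record` type and the `store` and `load` functions are likewise invented for this illustration.

```python
import sqlite3
from collections import namedtuple

Record = namedtuple("Record", "accession description seq")

# Simplified, hypothetical two-table layout; the real BioSQL schema is larger.
SCHEMA = """
CREATE TABLE entry (
    entry_id    INTEGER PRIMARY KEY,
    accession   TEXT UNIQUE NOT NULL,
    description TEXT
);
CREATE TABLE sequence (
    entry_id INTEGER PRIMARY KEY REFERENCES entry(entry_id),
    seq      TEXT NOT NULL
);
"""


def store(connection, record):
    """Map one in-memory record onto rows in the two tables."""
    cursor = connection.execute(
        "INSERT INTO entry (accession, description) VALUES (?, ?)",
        (record.accession, record.description),
    )
    connection.execute(
        "INSERT INTO sequence (entry_id, seq) VALUES (?, ?)",
        (cursor.lastrowid, record.seq),
    )
    connection.commit()


def load(connection, accession):
    """Rebuild a record from its rows, or return None if it is absent."""
    row = connection.execute(
        "SELECT e.accession, e.description, s.seq "
        "FROM entry e JOIN sequence s ON s.entry_id = e.entry_id "
        "WHERE e.accession = ?",
        (accession,),
    ).fetchone()
    return Record(*row) if row else None


connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)
store(connection, Record("TOY0001", "made-up example entry", "ATGAAATGA"))
print(load(connection, "TOY0001").seq)   # ATGAAATGA
```

Because the schema is shared, any project whose mapping writes the same rows could read this entry back into its own object model, which is how one database can serve tools written in several languages.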
3 CONCLUSIONS
Biopython is a large open-source application programming interface (API) used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists. The features described herein are only a subset; potential users should refer to the tutorial and API documentation for further information.
ACKNOWLEDGEMENTS
The OBF hosts and supports the project. We warmly thank the many Biopython contributors over the years; the list is too long to reproduce here.
Funding: Fundacao para a Ciencia e Tecnologia (Portugal) (grant SFRH/BD/30834/2006 to T.A.).
Conflict of Interest: none declared.
REFERENCES
  • Bateman A, et al. The Pfam protein families database. Nucleic Acids Res. 2004;32:D138–D141.
  • Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B. 1996;263:1619–1626.
  • Benson DA, et al. GenBank. Nucleic Acids Res. 2007;35:D21–D25.
  • Chapman B, Chang J. Biopython: Python tools for computational biology. ACM SIGBIO Newslett. 2000;20:15–19.
  • Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics. Nucleic Acids Res. 2006;34:D335–D337.
  • Felsenstein J. PHYLIP - phylogeny inference package (Version 3.2). Cladistics. 1989;5:164–166.
  • Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics. 2003;19:2308–2310.
  • Hinsen K. The molecular modeling toolkit: a new approach to molecular simulations. J. Comp. Chem. 2000;21:79–85.
  • Holland RCG, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097.
  • De Hoon MJL, et al. Open source clustering software. Bioinformatics. 2004;20:1453–1454.
  • Kauff F, et al. WASABI: an automated sequence processing system for multi-gene phylogenies. Syst. Biol. 2007;56:523–531.
  • Kulikova T, et al. EMBL nucleotide sequence database in 2006. Nucleic Acids Res. 2006;35:D16–D20.
  • Laval G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004;20:2485–2487.
  • Maddison DR, et al. NEXUS: an extensible file format for systematic information. Syst. Biol. 1997;46:590–621.
  • Oliphant TE. Guide to NumPy. USA: Trelgol Publishing; 2006.
  • Oliphant TE. Python for Scientific Computing. Comput. Sci. Eng. 2007;9:10–20.
  • Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. PNAS. 1988;85:2444–2448.
  • Pritchard L, et al. GenomeDiagram: a Python package for the visualisation of large-scale genomic data. Bioinformatics. 2006;22:616–617.
  • Rice P, et al. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277.
  • Rousset F. GENEPOP '007: a complete re-implementation of the GENEPOP software for Windows and Linux. Mol. Ecol. Res. 2007;8:103–106.
  • Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618.
  • The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2007;35:D193–D197.
  • Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.