As complete genome sequences have become available for large numbers of closely related organisms, interest has steadily grown in improved computational methods for comparative genomics. Of particular interest are statistical, phylogenetic methods for characterizing rates and patterns of molecular evolution and for identifying sequences under natural selection against a background of neutral evolution. Methods of this kind have been used to identify evolutionarily conserved elements [1], novel protein-coding genes [4], fast-evolving noncoding sequences [7], transcription factor binding sites [10], noncoding RNAs [11] and other types of functional elements.
Since 2002, we have been developing a software package, called PHylogenetic Analysis with Space/Time models (PHAST), that consists of a collection of programs and supporting libraries for statistical phylogenetic modeling and functional element identification. (The phrase ‘space/time models’, borrowed from Yang [12], reflects the prominent role of phylogenetic hidden Markov models in the package.) Our initial goal in developing PHAST was to support our own research in comparative genomics. Over time, however, as the package has expanded in functionality, it has gradually been adopted by a fairly large group of researchers from the broader comparative genomics community. PHAST is best known as the engine behind several popular tracks in the UCSC Genome Browser [13] (including, most notably, the Conservation track), but it can also be downloaded and installed for use in custom analyses not available through the browser. As of September 2010, the package has been downloaded more than 1000 times (counting unique IP addresses). More than two-thirds of those downloads have occurred since November 2008, when the PHAST web site became available (http://compgen.bscb.cornell.edu/phast).
PHAST has some overlap with the popular phylogenetic modeling package PAML [14] as well as with other packages for phylogenetics such as HYPHY [15] and MEGA [16], tools for conservation analysis such as GERP [2] and SCONE [17], and comparative gene finders such as N-SCAN [18]. However, PHAST is unique in that it combines phylogenetic modeling and functional element identification. In addition, it supports some phylogenetic modeling features not available in other packages, such as context-dependent substitution models and model fitting by expectation–maximization. PHAST also has a particularly rich collection of methods for detecting departures from neutrality in rates and patterns of molecular evolution, with the ability to detect both conservation and acceleration, either across the branches of a phylogeny or on individual branches or clades. Finally, PHAST is well suited for large-scale phylogenomics, with the ability to process entire mammalian genomes efficiently and native support for a variety of file formats used by the UCSC Genome Browser.
Here we describe the first ‘official’ release of PHAST, denoted v1.0. While most of the key algorithmic and modeling ideas behind PHAST have been published, this is the first article summarizing all components of the package and showing how they fit together and complement one another. We provide an overview of the programs and libraries in PHAST, and describe several recent improvements. In addition, we introduce a new interface to the PHAST libraries from the R statistical computing environment [19], called RPHAST. The combination of PHAST and R is particularly powerful, especially in applications requiring a mixture of comparative genomic and downstream statistical analyses, and for rapid prototyping of new phylogenomic methods. We expect the improved usability of PHAST v1.0 (with RPHAST) to increase interest in the package among comparative genomics researchers.