Pfam website development
Although Pfam data have always been centrally maintained and curated, historically each member of the Pfam consortium has run a separate website to serve the same data. The three primary mirror sites are based in the UK, Sweden and the USA, with a further two recognized mirror sites in France and South Korea. Each of the primary consortium sites has tended to adopt a different look and feel and, although all sites have provided the same set of core services, each has also provided some additional tools and services that are unique to that particular site. This has led to an entirely different user experience at each Pfam site and to confusion among users as to which site provides which services. The development of three main websites also caused a significant duplication of effort for the Pfam consortium.
A new Pfam website has been developed, with the goal of providing a single, unified website for Pfam data and services that combines the best features of the separate sites in a single, common interface. In redesigning the website, we have been able not only to improve the navigation and architecture of the website itself, but also to design a more easily extensible and maintainable code-base for the future. This new code-base will be common to and developed by all members of the Pfam consortium. Furthermore, the new website code has been written with portability in mind and has been made publicly available, so that users may install and run the website locally if desired.
We have improved the organization and presentation of Pfam data. Everything related to, for example, a Pfam-A family is collected into a single page, which is sub-divided into tab-panes that the user can easily switch between. The accompanying figure shows a typical page for a Pfam-A family. We have similar tab-layout pages for data related to protein sequences, Pfam-B families, Pfam clans, proteome data from completed genomes and 3D protein structures. Each type of page represents a different route into the Pfam data, and each tabbed page provides links that allow the user to navigate easily between these different sections of Pfam. Additionally, users can browse lists of Pfam families or clans and can jump quickly between any type of entry in the site via a ‘jump to’ box found on most pages.
A common feature of every type of page is a summary box, providing the salient details of every entry at a glance. The five summarized features of the entry are: the number of architectures associated with the entry; the number of protein sequences; the number of interactions [as determined by iPfam (9)]; the number of species; and the number of 3D structures. The exact meaning of each value is context-dependent, so that in the Pfam family page, for example, the structure icon shows the number of structures associated with that family, whilst in a protein sequence page the structure icon shows the number of structures that map to that sequence. The link for each icon is also context-dependent, taking the user to the most appropriate section of the page for the icon clicked.
Previously, it was difficult to search Pfam by species or taxonomic division. In addition to the species tree found on each family page, which provides a breakdown of the species found in that family, we have implemented a new taxonomy search tool. As with the taxonomy search tool in the old Pfam website, the new tool returns a list of Pfam domains that match a Boolean query expression. For example, the query ‘Caenorhabditis elegans AND NOT Homo sapiens’ will return all Pfam domains that are found in C. elegans but not in H. sapiens. As well as being less error prone and significantly quicker than the version in the old Pfam website, the new taxonomy search tool also provides a feedback mechanism that suggests organism names as the user enters them. This reduces the likelihood of typographical or spelling errors in queries, since incorrectly entered species terms are immediately highlighted in the interface, and also provides an insight into the organisms that are found in the database.
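The semantics of such a Boolean query can be illustrated with ordinary set operations. The following is a minimal sketch only, not the code used by the website: the species-to-domain index, the domain identifiers, and the function names `evaluate_query` and `suggest_species` are all hypothetical, and operators are assumed to be applied strictly from left to right.

```python
import re

# Toy index: species name -> set of domain identifiers.
# The identifiers are placeholders, not real Pfam assignments.
SPECIES_TO_DOMAINS = {
    "Caenorhabditis elegans": {"DomA", "DomB", "DomC"},
    "Homo sapiens": {"DomB", "DomC", "DomD"},
    "Drosophila melanogaster": {"DomA", "DomD"},
}

# "AND NOT" is listed first so that it is preferred over a bare "AND".
_OPERATOR = re.compile(r"\s+(AND NOT|AND|OR)\s+")


def evaluate_query(query, index=SPECIES_TO_DOMAINS):
    """Evaluate e.g. 'X AND NOT Y' left to right, without operator precedence."""
    # Splitting on a capturing group keeps the operators in the result:
    # [species, operator, species, operator, species, ...]
    parts = _OPERATOR.split(query.strip())
    result = set(index[parts[0]])
    for operator, species in zip(parts[1::2], parts[2::2]):
        domains = index[species]
        if operator == "AND":
            result &= domains
        elif operator == "OR":
            result |= domains
        else:  # "AND NOT"
            result -= domains
    return result


def suggest_species(partial, index=SPECIES_TO_DOMAINS):
    """Return known species names that start with the text typed so far."""
    prefix = partial.strip().lower()
    return sorted(name for name in index if name.lower().startswith(prefix))


print(evaluate_query("Caenorhabditis elegans AND NOT Homo sapiens"))
print(suggest_species("caeno"))
```

With this toy index, the example query leaves only the domain that occurs in the C. elegans set and not in the H. sapiens set, and a partially typed name is matched against the known species names before the query is run.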
A commonly requested capability for the Pfam site is the ability to find Pfam domains that are unique to a given taxonomic division or species. This feature is now available. For example, searching for unique ‘Metazoa’ families returns a list of domains that are found only in metazoans.
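One way to express "unique to a taxonomic division" is that every species in which a domain occurs has that division in its lineage. The sketch below illustrates this under stated assumptions: the lineages are abbreviated, the domain identifiers and their species assignments are invented, and `domains_unique_to` is a hypothetical helper rather than part of the Pfam code-base.

```python
# Abbreviated lineages for a few species (kingdom-level only).
LINEAGE = {
    "Homo sapiens": ("Eukaryota", "Metazoa"),
    "Caenorhabditis elegans": ("Eukaryota", "Metazoa"),
    "Saccharomyces cerevisiae": ("Eukaryota", "Fungi"),
    "Escherichia coli": ("Bacteria",),
}

# Invented domain identifiers and the species in which each one occurs.
DOMAIN_TO_SPECIES = {
    "DomA": {"Homo sapiens", "Caenorhabditis elegans"},
    "DomB": {"Homo sapiens", "Saccharomyces cerevisiae"},
    "DomC": {"Escherichia coli"},
}


def domains_unique_to(taxon):
    """Return domains whose every occurrence lies within the given taxon."""
    return sorted(
        domain
        for domain, species in DOMAIN_TO_SPECIES.items()
        if species and all(taxon in LINEAGE[name] for name in species)
    )


print(domains_unique_to("Metazoa"))
print(domains_unique_to("Eukaryota"))
```

In this toy data set, a domain that also occurs in a fungus is excluded from the ‘Metazoa’ result but is still reported as unique to ‘Eukaryota’.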
In addition to the standard features of the old Pfam websites, such as search tools for quickly finding Pfam domains on a protein sequence or for locating sequences with a specified domain architecture, we have also introduced several new features in the new site, many of which use the Distributed Annotation System (DAS) (10) to aggregate multiple data sources in a single display.
The Distributed Annotation System
We have improved access to Pfam by providing data through DAS. DAS is a system for disseminating annotations and alignments of DNA or protein sequences through a simple, web-based protocol. Three types of Pfam data are now available via DAS (11): domain annotations for both Pfam-A and Pfam-B families; sequence features such as active sites (12) and transmembrane region predictions; and seed and full alignments for Pfam-A families. The availability of Pfam data via DAS enables users to access specific parts of the database as a web service, without the need to download and install it in its entirety.
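To give a flavour of what access "as a web service" involves, the sketch below builds a DAS `features` request URL and parses a response in the DASGFF XML layout. It is illustrative only: the server address `https://das.example.org`, the source name `pfam_example`, the sequence identifier `EXAMPLE_SEQ` and the sample response are all hypothetical, and no network request is made.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET


def features_url(server, source, segment_id, start=None, stop=None):
    """Build a DAS 'features' request for a whole segment or a sub-range."""
    segment = segment_id if start is None else f"{segment_id}:{start},{stop}"
    # Keep ':' and ',' readable, since they delimit the segment range.
    return f"{server.rstrip('/')}/das/{source}/features?segment={quote(segment, safe=':,')}"


# Hand-written sample in the DASGFF layout; the values are invented.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="https://das.example.org/das/pfam_example/features">
    <SEGMENT id="EXAMPLE_SEQ" start="1" stop="200">
      <FEATURE id="f1" label="ExampleDomain">
        <TYPE id="Pfam-A">Pfam-A</TYPE>
        <METHOD id="hmm">hmm</METHOD>
        <START>10</START>
        <END>120</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def parse_features(xml_text):
    """Extract the label, type and coordinates of each FEATURE element."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "label": feature.get("label"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


print(features_url("https://das.example.org", "pfam_example", "EXAMPLE_SEQ", 1, 200))
print(parse_features(SAMPLE_RESPONSE))
```

A client of this kind needs only an HTTP library and an XML parser, which is what allows specific parts of the data to be retrieved without a local installation of the database.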
We have also been able to incorporate other data sources that are accessible through DAS, in order to enrich our own display of Pfam data. One feature of the new website is a DAS-based viewer for sequence annotations. This allows the user to view annotations of protein sequences from a wide range of third-party databases alongside information from Pfam itself. The viewer presents the standard Pfam domain structure image, showing the arrangement of Pfam domains on the sequence in question, and allows users to add or hide annotations from any of the available DAS sources. As the user moves their mouse over each feature, a tool-tip gives detailed information about it. If provided by the external DAS source, a link to further information is also given.
Another use of DAS within the new website is in the Pfam sequence alignment viewer. Pfam provides two alignments for every family: the seed alignment is a manually curated alignment of related sequences and generally contains a relatively small number of sequences; the full alignment is generated by searching the sequence database using the HMM for the family and may contain a very large number of sequences (the largest alignment, that of GP120, currently contains over 68 000 sequences). Historically, it has been difficult, if not impossible, to view the largest sequence alignments in a web browser, due simply to the size of the resulting web page. We have implemented a DAS-based sequence alignment viewer (shown in the accompanying figure) that is able to present even the largest alignments in manageable portions, by retrieving only the required section of the alignment and rendering it as HTML. This allows the user to scroll through wide alignments (those with long sequences) or to page through long alignments (those with a large number of sequences), without having to load the entire alignment into their browser. Alignments are coloured according to a pre-calculated consensus sequence, which is also retrieved via DAS, and in this way even alignment fragments can be marked up using the properties of the whole alignment.
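The idea of marking up a fragment with a consensus computed over the whole alignment can be sketched as follows. This is a simplified, local illustration under stated assumptions: the alignment is a small invented list held in memory rather than retrieved via DAS, the consensus is simply the most common character in each column, agreement is shown by letter case rather than colour, and the function names are hypothetical.

```python
from collections import Counter

# Invented toy alignment: (sequence name, aligned sequence) pairs.
ALIGNMENT = [
    ("seq1", "ACDEFG"),
    ("seq2", "ACDQFG"),
    ("seq3", "AC-EFG"),
    ("seq4", "TCDEFG"),
]


def consensus(alignment):
    """Most common character in each column, computed once over all rows."""
    sequences = [sequence for _, sequence in alignment]
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))


def window(alignment, row_start, n_rows, col_start, n_cols):
    """Return only the requested block of rows and columns."""
    rows = alignment[row_start:row_start + n_rows]
    return [(name, sequence[col_start:col_start + n_cols]) for name, sequence in rows]


def mark_up(block, whole_consensus, col_start):
    """Upper-case residues that match the whole-alignment consensus, lower-case the rest."""
    marked = []
    for name, fragment in block:
        text = "".join(
            residue.upper() if residue == whole_consensus[col_start + i] else residue.lower()
            for i, residue in enumerate(fragment)
        )
        marked.append((name, text))
    return marked


whole = consensus(ALIGNMENT)           # pre-calculated once for the family
block = window(ALIGNMENT, 1, 2, 2, 3)  # rows 1-2, columns 2-4 only
print(mark_up(block, whole, col_start=2))
```

Because the consensus is computed from every row, a fragment containing only two sequences is still marked up consistently with the rest of the alignment, and the cost of rendering depends on the size of the window rather than on the size of the family.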