We describe in this communication a set of functional perl script utilities for use in peptide mass spectral database searching and proteomics experiments, known as the Wildcat Toolbox. These are all freely available for download from our laboratory Web site (http://proteomics.arizona.edu/toolbox.html) as a combined zip file, and can also be accessed via the Proteome Commons Web site (www.proteomecommons.org) in the tools section. We make them available to other potential users in the spirit of open source software development; we do not have the resources to provide any significant technical support for them, but we hope users will share both bugs and improvements with the community at large.
Protein mass spectrometry is one of the cornerstone technologies of proteomics. One of the most important steps in such experiments is the use of computer software programs to match tandem mass spectral data to peptide sequences present in protein databases. There are a number of such database search engines available to perform these calculations, including Sequest,1,2 Mascot,3 Spectrum Mill,4 ProteinLynx,5 XTandem,6,7 Omssa,8 and numerous others. The proliferation of products from different vendors has led to a high degree of fragmentation among users in the field, depending on which products they mainly use. In recent years, there has been a noticeable and welcome movement towards open-source software in this area,9–11 but commercial products still appear to dominate most user forums.
In our laboratory, we routinely use three search engines: Sequest, XTandem, and Mascot. All of our Sequest output data are further filtered and organized using DTASelect and Contrast,12 and results are usually stored in Microsoft Excel format. For certain experiments, we validate protein identification results by repeating a search with a different search engine,4,13 and for many experiments, especially those involving multidimensional nanoflow LC-MS/MS (high-pressure liquid chromatography–tandem mass spectrometry—MudPIT), we establish confidence levels for protein identification using reversed database searching.14,15
We use a very wide range of fasta-format protein-sequence databases to search data against, especially for our core facility work, where we have hundreds of users all working on their own unique biological problems. These databases are available via download from a wide variety of sources, such as the nonredundant protein database of the National Center for Biotechnology Information (NCBI) [www.ncbi.nlm.nih.gov] or the PlantProtein database from The Arabidopsis Information Resource (TAIR) [www.arabidopsis.org], and can also be custom assembled by, for example, cutting and pasting from different sources, or translating EST sequence data. The level of annotation supplied in protein headers, which is necessary for interpretation of database search results, thus varies greatly.
In the context of developing some of the workflow outlined in the preceding paragraphs, we have developed a set of ten perl-script programs to aid in various steps of the process. Known as the Wildcat Toolbox (http://proteomics.arizona.edu/toolbox.html), these are explained in detail below, and can be divided into three main categories: fasta database-manipulation utilities (count.pl, reverse.pl, comment.pl, extract.pl, fasta_labeler.pl); searching and results utilities (organizer.pl, run_tandem.pl); and spectral sorting utilities (append.pl, sub_append.pl, DTA_sorter.pl).
All of these programs require perl build 5.8.1 or higher, available for free download from www.activestate.com. The perl scripts should be installed in the c:\perl\bin directory and run directly from a command prompt window. We have tested these extensively on machines running Windows 2000 Service Pack 2 or higher, Windows 2000 Server, Windows XP Pro, and Windows XP Home Edition, and found no major errors or conflicts.
Several of the tools also utilize other perl modules, such as SeqIO.pm (part of bioperl) and WriteExcel.pm. The complete bioperl 1.5 package can be downloaded from http://bioperl.org/Core/Latest/index.shtml and should be unarchived into the c:\perl\site\lib folder. The WriteExcel.pm module can be downloaded from the Comprehensive Perl Archive Network (CPAN) Web site at http://www.cpan.org/modules/01modules.index.html and should also be unarchived into the c:\perl\site\lib folder.
A detailed description of each of the current set of utilities in the toolbox is provided in the following sections.
This is a utility for counting the number of entries in a fasta format protein database that has been compiled or downloaded for use in MS/MS database searching. Including the number of protein entries in the database searched against is a requirement of the recently published guidelines for protein identification data.9
Usage: In the directory where the fasta protein file is located, type: count.pl [filename.fasta]
The program displays on screen the number of protein headers in the file, which can then be recorded elsewhere.
This is a utility for reversing the protein sequences in a fasta format protein database, allowing the user to re-search the same dataset against a reversed database and use that information to assess the rate of false-positive assignments.14,15
Usage: In the directory where the fasta protein file is located, type: reverse.pl [filename.fasta] [newfilename.fasta]
Example: reverse.pl yeast.fasta yeastREV.fasta
The original database file is untouched, and a new file is created with the protein headers untouched but all of the sequence information reassembled in reverse order.
This is a utility program for adding the same text string (i.e. a comment) to the descriptive header of all of the entries in a protein sequence database file. This is used in cases where a database has been assembled from various sources and later use requires knowing the original source. A user can add a comment to the headers in one fasta file before appending it to another one. One good example is to add a keyword such as artifact to the headers in a database of common laboratory contaminant proteins that is appended to databases assembled from other sources prior to use. Since all of the protein headers from the contaminant database contain the word artifact, a user can then employ a filtering program such as DTASelect to display the results with and without the contaminants by filtering out results containing that keyword. For correct use, it is essential to use a keyword that is not present in any other entries in the database.
Usage: In the directory where the fasta protein file is located, type: comment.pl -i [filename.fasta] -o [newfilename.fasta] -c [comment]
Example: comment.pl -i contaminants.fasta -o contaminants_artifacts.fasta -c ARTIFACT
The program will output a new fasta file called contaminants_artifacts.fasta that has the word ARTIFACT inserted at the start of each descriptive header.
This is a utility for retrieving a specified set of proteins from a fasta protein database file based on locus names. This is employed when a user needs to take proteins identified in MS/MS experiments and run them in a batch-wise mode through BLAST or another protein analysis program. The program requires an input file containing fasta format sequences and a text file with sequence names, one per line.
Usage: extract.pl -i fasta_file -o output_file -l list
Example: extract.pl -i nr071405.fasta -o phresults.fasta -l mylist.txt
where mylist.txt would contain entries such as:
The program would then output a new file called phresults.fasta, which contains the protein header and complete sequence for each of the entries included in the mylist.txt file. The number of entries present in the newly created fasta database file can then be determined using the count.pl utility (see above) and cross-checked against the expected number of entries from the mylist.txt file.
It is important to recognize that the program defines a locus name as only those characters from after the initial >, up to the first space in the header. The program also displays on screen how many sequences were added to the specified output file, providing an additional means of checking that all of the desired entries were successfully extracted for further processing.
This is a very useful utility when working with an organism for which there is relatively poor gene-function annotation, or little sequence information readily available in NCBI or elsewhere, but the user has access to a custom-assembled fasta database file containing protein sequence information from gene predictions and translations of raw DNA sequencing results. Examples of projects in which we have used this tool include analysis of species such as Aspergillus flavus, Coccidioides immitis, Drosophila mojavensis, and Bemisia tabaci.
After the user has performed MS/MS and database searching, typical results are multiple peptides assigned to a sequence that has a locus name, e.g., AN6772.1, and a descriptive header saying, for example, “protein translation,” or simply nothing at all. The next step is to do a BLAST search of the protein sequence from the database against the NCBI nonredundant database; in the case of AN6772.1, for example, the user would get a result indicating high homology to the sequence of a quercetinase enzyme from Aspergillus fumigatus. If this work is part of an ongoing project, the same entries may be identified repeatedly. The aim of this program is to avoid unnecessarily repeating BLAST searches.
The fasta_labeler program takes as input the protein sequence database search file that has been searched against initially, and a second text file that is manually created by the user, which contains two tab-separated columns: locus names and annotated descriptions that are to be added to the protein headers. It then creates a new version of the protein sequence database file with the annotated descriptions or other comments added to, or replacing, the descriptive headers, along with an optional date-modified stamp.
The usage syntax varies according to the functions that are to be performed. There is one mandatory option (–oldfasta) that specifies which fasta database file is to be operated on as a starting point. In addition there is a nested hierarchy of options, starting with a second-level selection of either merge, strip or extract (–m, –s or –e), and additional options specified under each of those.
There are three distinct functions of this script: merge, strip, and extract. Merge will start with a fasta file and a tab-delimited file of descriptions, and either append the descriptions at the end of the existing header line, or remove any previous descriptions for that sequence and replace them with those found in the descriptions file. The user can specify a text tag for the newly added comments by using the –tag option, which is useful for keeping track of annotation changes. The tag value defaults to “UA.” Strip will copy the old fasta file to the new fasta file without the descriptive headers. Extract will read the locus names and descriptions from a fasta file and output them into a tab-delimited text file. The –do option tells the script which file to put the extracted descriptions into. If no output file is specified, the program prints the descriptions to the screen. A file that is created in this way can then be used as a template for manual annotation to create a descriptions file as input in the merge option.
Example: Once the user has performed BLAST searches of the extracted proteins in the example above (see extract.pl), the fasta_labeler program can be used to annotate those results back onto the database file so that the annotations will be displayed in future search results.
The user would create a simple text file containing tab-delimited columns such as:
and then enter: fasta_labeler.pl -oldfasta NCBIricesep05.fasta -m -a -newfasta NCBIriceoct05.fasta -d blastresults.txt -tag PHoct05
The program will create a new database file called NCBIriceoct05.fasta, which contains everything in the original file, NCBIricesep05.fasta, plus the annotations shown above appended to the existing descriptive protein headers, plus the tag PHoct05 added to each of the entries that have been changed. Note that this is a more complicated program than most described in this report, which means that it is both applicable to more data manipulation situations, and prone to more usage errors.
This is a very useful utility for users running DTASelect on a large number of related samples. The program is designed to assemble individual DTASelect result pages for a set of experiments into a single Excel file, separated into individual worksheets. This greatly facilitates the dissemination of results. For example, when a user has run nanoLC-MS/MS on a large set of gel bands or spots, and searched all of them using Sequest, the current implementation of DTASelect requires the user to run DTASelect in each individual directory, save each of the individual result files, and then cut and paste them together. This is a tedious, repetitive process, and the same result can be achieved using a single line of input to the organizer.pl program.
There are three DTASelect criteria sets stored in the program:
Low: -1 1.5 -2 2.0 -3 3.0 -d .05 -y 1 -p 1
High: -1 1.8 -2 2.5 -3 3.5 -d .08 -y 1 -p 1
Vhigh: -1 1.8 -2 2.5 -3 3.5 -d .10 -y 2 -p 2
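The stored criteria sets translate naturally into an argument-list lookup. The sketch below shows, in Python, how the -type option might map to DTASelect arguments, with "none" deferring entirely to user-supplied flags; the real organizer.pl is a perl script and may construct the command line differently.

```python
# The three stored DTASelect criteria sets, keyed by the -type value.
CRITERIA = {
    "low":   ["-1", "1.5", "-2", "2.0", "-3", "3.0", "-d", ".05", "-y", "1", "-p", "1"],
    "high":  ["-1", "1.8", "-2", "2.5", "-3", "3.5", "-d", ".08", "-y", "1", "-p", "1"],
    "vhigh": ["-1", "1.8", "-2", "2.5", "-3", "3.5", "-d", ".10", "-y", "2", "-p", "2"],
}

def dtaselect_args(criteria_type, extra=None):
    """Build the DTASelect argument list for a -type value; 'none'
    uses only the user-supplied arguments given after the -- switch."""
    args = [] if criteria_type == "none" else list(CRITERIA[criteria_type])
    return args + list(extra or [])
```

In this sketch, arguments passed after the "--" switch are simply appended, so they apply in addition to (or, with "none", instead of) the stored set.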
Usage: Starting in the parent directory, which contains a Sequest.params file and a number of subdirectories, each of which contains multiple .dta and .out files, type:
organizer.pl -type [low|high|vhigh|none] -loc [full path to parent directory]
Other optional arguments are:
--[other DTASelect arguments to apply]
Example: For an experiment in which the user has performed nanoLC-MS/MS analysis on a set of 32 gel bands, and then run Sequest on each set and stored the results under the parent directory ph092505, enter:
organizer.pl -type high -loc c:\xcalibur\Sequest\ph092505 -excel ph092505hiresults.xls -- -l artifact
The program will run DTASelect in each of the subdirectories using the “high” cutoffs, exclude any proteins with “artifact” in the descriptive header, and output an Excel results file called ph092505hiresults.xls, which contains 32 individual worksheets with a DTASelect results page for each of the 32 subdirectories. We have stored in the program the three DTASelect criteria sets we routinely use in our laboratory, but a user has the option to set the type parameter to “none” and enter any set of criteria they wish to use after the “--” option switch.
This utility is designed to simplify the process of setting up and running multiple searches with the XTandem program running on a local processor at the command line. This requires the user to have already installed XTandem locally, and be familiar with the various input parameters that are necessary. The basic idea of this program is that it runs an XTandem search on all .dta format files found in a specified directory, when all of them are to be searched against the same fasta format database.
There are two modes in which the program operates. The user can either run searches by specifying a database file and the directory containing the .dta files, or the user can specify an input.xml for XTandem to use as a template. If no input file is specified, the program will run XTandem using an xml input file generated from a default file, called run_tandem_file.xml. If the user desires parameters in this file to be altered, a copy of run_tandem_file.xml can be created and modified accordingly, and then used as an input template file.
Requirements: (1) run_tandem.pl and run_tandem_file.xml must be in the \tandem\bin directory; (2) run_tandem_file.xml must be present to use the default options, or a different input.xml file must be created using it as a template, without altering the text tags that the program accesses, which are labeled tag1, tag2, and tag3.
Usage: In \tandem\bin, type: run_tandem.pl [database name] [directory path] -i [name of your xml input file]
Example: To run an XTandem search against a fasta format database file called “yeast” on all of the .dta files in C:\PH090705, using input parameters specified in PH_input_template.xml located in C:\PH090705, the user would go to C:\program files\tandem\bin at the command line and enter: run_tandem.pl yeast C:\PH090705 -i C:\PH090705\PH_input_template.xml
The program will run an XTandem search for each dta file against the yeast database and create a corresponding output xml file for each one.
This is a simple but essential utility for preparing concatenated dta files as input for database searching using XTandem or Mascot. The program starts with a directory containing a set of dta files produced from a nanoLC-MS/MS run. This requires first processing an Xcalibur raw file with the Sequest Bioworks browser to make a set of dta files, usually several thousand individual files.
The append program joins all these spectra together in a single file, with each spectrum separated by an empty line. The first line of each spectrum in the file lists the parent ion mass, charge state, and spectrum filename, and the rest of the lines are m/z vs. intensity pairs. The program needs to be run from the actual directory where all the dta files are located.
Usage: append.pl -i [files to be concatenated] -o [outputfilename]
Example: In the directory where all the dta files are located, enter: append.pl -i *.dta -o combine.dta
The program will create a single file called combine.dta that contains all of the spectra concatenated together. This file is then compatible for searching using either XTandem or Mascot.
This is an enhancement of the append.pl program described above. It is run from a parent directory containing a number of subdirectories, each of which contains a set of dta files, such as those produced from a nanoLC-MS/MS run.
No arguments or options are required. The output is a set of concatenated .dta files, one per subdirectory, each containing all of the spectra for that subdirectory; these are stored in the parent directory. This program also works on concatenated dta files as input, so it is possible to concatenate, for example, all of the spectra from all of the fractionation steps in a MudPIT run. This creates a single, very large .dta file that can be searched using XTandem or Mascot to create a single unified results output xml file for the entire MudPIT experiment.
This is a utility for parsing a set of dta files in a directory into three subdirectories based on Sequest results contained in a dtaselect-filter.txt file. The idea of this program is that the user has already done a Sequest search and has some DTASelect-filtered results, but then wants to do additional searching for various reasons. The spectral .dta files are parsed into three categories: (1) those found in the DTASelect-filtered results file with identification by multiple peptides; (2) those found in the DTASelect-filtered results file with identification based on only a single peptide; and (3) those not found in the DTASelect-filtered results file. This is useful, for example, in doing iterative database searching against larger databases with more allowable protein modifications. The spectra that have been successfully matched in the first round of searching can be removed, thus simplifying and speeding up the next stage of the process.
The program creates three new subdirectories: inexcel, notinexcel, and singlexcel. The program starts with a directory full of .dta and .out files that have already been sorted by DTASelect according to specified criteria, and then interrogates the dtaselect-filter.txt file and determines for each .dta file whether the corresponding .out file is found in the DTASelect results, either as part of a multiple-peptide protein identification hit or as a single peptide hit, or not at all. It then moves the spectra into the appropriate subdirectory. When this process is complete, the program creates a concatenated .dta file for each created subdirectory, using the sub_append program described above.
The program requires the user to start in a directory containing multiple .dta and .out files, and a dtaselect-filter.txt file.
The program output involves moving files into three newly created subdirectories depending on the status of the corresponding .out files in the specified dtaselect-filter.txt file or equivalent, and creating concatenated .dta files for each of the newly created subdirectories. It is important to note that files are moved rather than copied from their original location, so the user should create a backup copy of the entire dataset before running this program if any further processing of the original data set is planned.
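The three-way sorting decision at the heart of this utility can be sketched in Python as below. How the two sets of retained .out filenames are parsed out of the dtaselect-filter.txt file is not shown here; the function and directory names follow the description above.

```python
def classify_spectrum(out_name, multi_hits, single_hits):
    """Return the destination subdirectory for a spectrum, given the sets
    of .out filenames that DTASelect retained as part of multi-peptide
    and single-peptide protein identifications."""
    if out_name in multi_hits:
        return "inexcel"      # matched within a multi-peptide identification
    if out_name in single_hits:
        return "singlexcel"   # matched as a single-peptide identification
    return "notinexcel"       # not found in the filtered results
```

Each .dta/.out pair is then moved (not copied) into the subdirectory this classification names, after which the sub_append step concatenates each subdirectory's spectra.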
We wish to emphasize that this is very much an ongoing project. More tools are under development, as are refinements to existing ones. As our workflow changes and develops over time, we expect that more tools will be needed that we have not yet even considered. We plan to continue making this available as a resource to others, and we hope to get valuable feedback and assistance in programming development in return.
The authors would like to thank Martin Van Winkle and Patrick Degnan for coding contributions, Gavin Nelson for technical support, and the Bio5 Institute and National Science Foundation for funding. PH would like to thank Vicki Chandler, Vicki Wysocki, and Peter Lehmann for continued support and encouragement. The views expressed in this paper are those of the author and do not reflect the official policy or position of the Air Force, Department of Defense, or the United States Government.