Post-genomic research and systems biology have greatly expanded our knowledge and understanding of biological processes, fuelled by the growth in sequenced genomes and accompanying technological developments. The resulting techniques, such as microarray-based transcriptomics and proteomics, rely on the high-quality annotation of newly sequenced genomes. Indeed, this heavy dependency on a sequenced genome or cDNA library can often limit the scope of studies, particularly for non-model organisms [1]. However, functional genomics experiments on sequenced organisms can also play an important role in defining or re-evaluating the genome sequence on which they are based. Experimental data can be fed back into the genome annotation to help demonstrate the validity or otherwise of the original gene structure predictions, or to assist the annotation of new genomes.
Many genome sequencing projects use a range of in silico prediction methods to generate a large, and sometimes highly redundant, set of possible open reading frames (ORFs) and gene structure models. A good example is the pipeline employed by the widely used Ensembl genome browser [2]. Here, a combination of EST, cDNA, orthology and statistical data is used to derive gene sets, which are reconciled to produce a final set of high-quality predicted genes. A further example is provided by recent fungal genomes sequenced at the US DOE Joint Genome Institute (JGI), for which a large set of gene models is produced, typically with several candidates for each locus. Further analyses reduce this to a smaller filtered set of "best" gene predictions via a second layer of bioinformatic methods, manual annotation and the use of experimental data. It is one such example, that of Aspergillus niger, which forms the basis for this study.
A. niger is a common ascomycete fungus that can act as an opportunistic human pathogen; however, it is better known for its use in industrial biotechnological applications such as the production of citric acid [3]. We wished to apply mass spectrometry-based proteomics to A. niger as an exemplar system with which to test the utility of proteomics to refine and process a recently sequenced and annotated genome and produce an even higher-quality gene set. Now that several complete genome sequences are available, there have already been several proteomic studies of filamentous fungi, and the technique is being widely applied to understand fungal biology [4].
Although cDNA and oligonucleotide arrays can demonstrate that a predicted gene is expressed [5, 6] and tiling arrays can define exon-intron structure with exquisite accuracy [7], they still focus on the untranslated mRNA. Proteomics provides a higher-level confirmation of gene expression and is beginning to be used in genome annotation [8-10]. Mass spectrometry (MS) is an effective and fast method for identifying proteins from their constituent peptides, and recent developments support much higher coverage of the commonly expressed proteome [11-13]. For example, Aebersold and colleagues demonstrated how the PeptideAtlas database could be exploited to map many thousands of peptides back on to the human proteome [14, 15]. Similarly, cDNA/EST data and mass spectrometry experiments have been used to identify novel ORFs and splice variants. Peptide identifications in expressed sequence tags (ESTs) [16] or expressed peptide tags (ePSTs) [17, 18] were matched back to the genomic scaffolds, thereby identifying or validating real ORFs. Experimental proteomic data can therefore help with the prediction and validation of gene structure, and a growing number of studies have used such data to annotate translational start sites, exons and SNPs [8, 19, 20]. In parallel, informatic proteome pipelines are also becoming more "genome-centric". Examples include the genome annotating pipeline (GAPP) [21] and PeptideAtlas resources [14, 15], both of which support the mapping of identified peptides back onto genome viewers [22]. Some experiments have even found that published genomes are annotated incorrectly [23], clearly demonstrating the utility of proteome data. These conclusions have sparked interest throughout both proteomics and genomics as to the best ways in which to use this new source of experimental validation of genome annotations [9].
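To make the idea of matching peptides back to genomic scaffolds concrete, the following is a minimal Python sketch, not the method of any of the cited studies. It assumes an uppercase DNA scaffold, the standard genetic code, and exact string matching against a six-frame translation; the function names (`translate`, `map_peptide`) and the toy scaffold are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def map_peptide(peptide, scaffold):
    """Find exact peptide matches in all six reading frames of a scaffold.

    Returns a list of (strand, frame, start) tuples, where start is the
    0-based nucleotide offset on the strand as it was read (for '-' hits
    this is an offset on the reverse-complemented sequence).
    """
    hits = []
    for strand, seq in (("+", scaffold), ("-", reverse_complement(scaffold))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy scaffold (hypothetical): two leading bases, then ATG GCT AAA GAA TAA.
scaffold = "CCATGGCTAAAGAATAA"
print(map_peptide("MAKE", scaffold))  # [('+', 2, 2)]
```

A sketch of this kind only finds peptides encoded within a single exon; peptides spanning exon-exon junctions would need to be matched against spliced gene models instead, which is the approach taken when peptides are mapped via predicted protein sequences.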
In this project we collected tandem MS data from A. niger samples and searched these against predicted protein sequences derived from two independent genome sequences: ATCC1015 (http://www.jgi.doe.gov/aspergillus) by JGI and CBS 513.88 by DSM [24]. The JGI sequence in particular had 87,287 predicted gene models, containing 11,200 "best" models, which we clustered to 8,709 genomic loci (Table 1). To generate peptide identifications, tandem MS data were searched against forward and reversed protein sequence databases derived from the JGI and DSM model sets using Mascot [25]. As well as using standard Mascot scoring, we used a modified version of the Average Peptide Scoring (APS) technique, which iteratively calculates peptide filters and reverse database thresholds [26] at various false discovery rates (FDR). By filtering out low-scoring peptides with a score threshold, this method is reported to yield more confident protein identifications than the standard Mascot (v2.1) protocol. The APS-identified peptides were then mapped back to the genome via the gene models clustered at each locus. These data offer direct support for predicted open reading frames in two independent A. niger genome annotations. For the JGI annotation, the MS data provide a means to discriminate between conflicting gene model predictions and can potentially eliminate inconsistent ones from further consideration. For a significant number of clusters, gene models not hitherto considered the "best" were seen to be more consistent with the experimental data, suggesting that they are more likely to be correct or to be the principally expressed isoform. This pilot project further demonstrates the utility of proteome data for genome annotation, since it can be used to experimentally validate predicted gene model sets and offer an additional source of evidence that a gene is not only transcribed, but also translated.
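The two ideas in this workflow, choosing a score cutoff from reversed-database hits and comparing peptide support between competing gene models at a locus, can be illustrated with a short Python sketch. This is not the APS algorithm or Mascot's own scoring: it assumes a simple FDR estimate (decoy hits divided by target hits at or above the cutoff) and exact substring matching of peptides to model protein sequences. All function names, identifiers and data values below are hypothetical.

```python
from collections import defaultdict


def score_threshold_for_fdr(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms is an iterable of (score, is_decoy) pairs, where is_decoy marks a
    hit to the reversed database. FDR is estimated here as decoys / targets
    among hits scoring at or above the cutoff (an assumed, simplified
    estimator). Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Only evaluate a cutoff once every hit sharing this score is counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


def rank_models_by_locus(models, peptides):
    """Count supporting peptides for each gene model, grouped by locus.

    models maps model_id -> (locus_id, protein_sequence); peptides is a
    collection of accepted peptide strings. Returns a dict mapping each
    locus to a list of (model_id, n_peptides), best-supported model first.
    """
    by_locus = defaultdict(list)
    for model_id, (locus, protein) in models.items():
        support = sum(1 for pep in peptides if pep in protein)
        by_locus[locus].append((model_id, support))
    return {
        locus: sorted(entries, key=lambda e: (-e[1], e[0]))
        for locus, entries in by_locus.items()
    }


# Hypothetical peptide-spectrum matches: (score, hit_is_from_reversed_db).
psms = [(60, False), (55, False), (50, False), (45, True),
        (40, False), (35, False), (30, True), (25, True)]
cutoff = score_threshold_for_fdr(psms, max_fdr=0.25)

# Hypothetical locus with a "best" model and a longer alternative model.
models = {
    "best_1": ("locus1", "MAKEPEPTIDER"),
    "alt_1": ("locus1", "MAKEPEPTIDERLONGEREXONK"),
}
ranking = rank_models_by_locus(models, {"EPEPTIDER", "LONGEREXONK"})
print(cutoff, ranking)
```

In this toy case the alternative model is supported by two peptides and the "best" model by one, which mirrors the situation described above in which a model outside the filtered set is more consistent with the MS data.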
| Table 1. Overview of JGI and DSM A. niger genome data |