High-throughput (HTP) molecular technologies are at the core of new capabilities to derive genomic-level profiles of organisms [1]. One challenge often not addressed in the context of HTP technologies is the relationship of the analyses to the defined structural annotation of the genome. For example, the accuracy of global bottom-up proteomics depends directly on accurately defined open reading frames (ORFs), because spectra are matched directly to an in silico enzymatic digest of the predicted proteins. Although a well-annotated genome is typically needed to analyze HTP data, HTP data can in turn contribute to genome annotation. Specifically, both next-generation sequencing transcriptomic data (RNA-Seq) and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators [3] for locating features such as missed genes and intron/exon borders. While genome sequencing and assembly have become relatively inexpensive, full and accurate annotation of these genomes remains very labor-intensive, integrating HTP data types to improve structural genome annotation is not straightforward, and few computational tools have been developed to address this issue.
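To illustrate why peptide identification depends on the annotation, the following minimal Python sketch performs an in silico digest of predicted proteins. It assumes trypsin as the enzyme, using the common rule of cleaving after K or R unless the next residue is P; the source text only specifies an "enzymatic digest". The function name `tryptic_digest`, its parameters, and the example sequences are hypothetical and are not taken from any tool discussed here.

```python
def tryptic_digest(protein, missed_cleavages=0, min_length=1):
    """Split a protein after K or R, except when the next residue is P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal fragment without K/R

    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Hypothetical annotated proteome: only these proteins define the search space.
annotated = {"geneA": "MKRPAKGG", "geneB": "MSTLKDER"}
search_space = {p for seq in annotated.values() for p in tryptic_digest(seq)}

print("DER" in search_space)    # peptide from an annotated protein
print("LLNVK" in search_space)  # peptide from a gene absent from the annotation
```

Because the search space is built only from annotated proteins, a peptide originating from a missed gene has no candidate to match, which is the dependence on ORF accuracy described above.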
The development of RNA-Seq has been a major leap forward for transcriptomics, providing data to identify differentially expressed genes as well as to improve structural gene annotation. Common tools to process RNA-Seq data, such as IGV [10], SAMtools [11], Tablet [12], and Bambino [13], focus on the alignment of individual reads to the genome, because the number of reads aligned to a particular gene can be used as a metric to quantify differential gene expression within the context of an experiment. Although most RNA-Seq experiments are focused on differential expression, the expression pattern in the context of the genome can yield information about transcriptional units, such as operons, and annotation errors, such as missed genes. Similar observations can also be made from other transcriptomics platforms, such as tiled arrays. However, visualizing and analyzing these data in genomic context, in order to enhance the annotations or make inferences about mis-annotations, remains a challenge.
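The read-count metric and the genomic-context view described above can be sketched in a few lines of Python. This is an illustrative simplification and not how any of the cited tools work: reads and genes are reduced to 1-based inclusive (start, end) intervals, strand is ignored, and the names `count_reads`, `genes`, and `reads`, as well as all coordinates, are hypothetical.

```python
def count_reads(genes, reads):
    """Count reads overlapping each gene and collect reads overlapping none.

    genes: dict mapping gene name -> (start, end), 1-based inclusive
    reads: list of (start, end) alignment intervals, 1-based inclusive
    """
    counts = {name: 0 for name in genes}
    unassigned = []
    for r_start, r_end in reads:
        overlapped = False
        for name, (g_start, g_end) in genes.items():
            # Two closed intervals overlap when each starts before the other ends.
            if r_start <= g_end and r_end >= g_start:
                counts[name] += 1
                overlapped = True
        if not overlapped:
            unassigned.append((r_start, r_end))
    return counts, unassigned


# Hypothetical annotation and alignments.
genes = {"geneA": (100, 400), "geneB": (600, 900)}
reads = [(120, 170), (380, 430), (450, 500), (460, 510), (880, 930)]

counts, unassigned = count_reads(genes, reads)
print(counts)      # per-gene counts, usable for differential expression
print(unassigned)  # reads outside annotated genes, candidate missed features
```

Reads that overlap no annotated gene are exactly the signal that per-gene counting discards but that a genomic-context analysis can use to flag possible missed genes.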
While transcriptomics data can give valuable insight into genome annotation, transcription does not necessarily mean translation into protein. Mass spectrometry-based proteomics can fill this gap through global identification of the proteins expressed in a sample. However, like transcriptomics, proteomics usually focuses on comparative studies to identify differentially expressed proteins. In particular, in tandem mass spectrometry (MS/MS), spectra from proteolytic peptides are matched to theoretical spectra derived from candidate peptides from a defined genome annotation. In this traditional approach, only peptides from annotated genes can be identified. In theory, however, proteomics data include spectra from any gene translated into protein. Thus, an alternate strategy, known as proteogenomics, is to match spectra against candidate peptides from every potential open reading frame between two stop codons in any of the six reading frames of the DNA. Proteogenomics experiments have successfully corrected gene locations (start sites), located novel genes, and identified various other mis-annotations, such as frameshifts [7]; however, because mass spectrometry-based proteogenomics analyses must consider far more candidate peptides than the standard analysis, parsing and visualizing these data is challenging. Current software tools for proteomics data primarily focus on peptide identification, quantification, and statistical comparison [15], whereas for proteogenomics, prokaryotic genome browser tools such as ARTEMIS [19] or Gbrowse [21] have been used because of their ability to compare different gene annotation models. Using these genome browsers for proteomics requires significant data formatting by the user, because peptide identifications must first be converted to a standard format, such as the general feature format (GFF). Furthermore, there is no simple way to search for locations of interest in the genome, such as peptides located outside the defined gene annotations.
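As a rough illustration of the six-frame idea and of the formatting burden described above, the Python sketch below maps an already identified peptide sequence back to genome coordinates and writes it as a nine-column, tab-separated GFF-style line. It rests on several assumptions that are not from the source: the peptide is matched by sequence rather than by spectrum, the standard genetic code is used, the `example_source` and `peptide` column values are placeholders, and all function names and the toy genome are hypothetical. Because a peptide contains no stop codon, any match necessarily lies within a stop-to-stop region of its frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def locate_peptide(genome, peptide):
    """Return (start, end, strand, frame) hits in all six frames.

    Coordinates are 1-based, inclusive, and given on the forward strand.
    """
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            protein = translate(seq, offset)
            pos = protein.find(peptide)
            while pos != -1:
                s = offset + 3 * pos      # 0-based start on the scanned strand
                e = s + 3 * len(peptide)  # 0-based exclusive end
                if strand == "+":
                    hits.append((s + 1, e, strand, offset))
                else:
                    hits.append((n - e + 1, n - s, strand, offset))
                pos = protein.find(peptide, pos + 1)
    return hits


def gff_line(seqid, start, end, strand, peptide):
    """Format a hit as nine tab-separated GFF-style columns (placeholder values)."""
    return "\t".join([seqid, "example_source", "peptide", str(start), str(end),
                      ".", strand, ".", f"Name={peptide}"])


def outside_annotation(start, end, genes):
    """True if the interval overlaps none of the annotated (start, end) genes."""
    return all(end < g_start or start > g_end for g_start, g_end in genes)


# Toy genome in which ATGAAAGTT (MKV) sits between two stop codons.
genome = "CCTAAATGAAAGTTTAGG"
for start, end, strand, frame in locate_peptide(genome, "MKV"):
    print(gff_line("chr1", start, end, strand, "MKV"))
    print(outside_annotation(start, end, genes=[(20, 30)]))
```

A check such as `outside_annotation` corresponds to the kind of query that is hard to express in a general-purpose genome browser: finding peptides that fall outside every annotated gene.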
We present a novel software platform for Visual Exploration and Statistics to Promote Annotation (VESPA). VESPA was developed as a specialized tool within an overarching tool suite focused on the visualization and statistical integration of multiple data sources in a genomic context. VESPA 1.1.1 is a client-side Java application that assists scientists with the annotation of prokaryotic genomes by integrating proteomics (peptide-centric) and transcriptomics (probe or RNA-Seq) data with current genome location coordinates. VESPA visualizes all potential reading frames in a genome and lets users browse and query the data to quickly identify regions of interest with respect to structural annotation (e.g., novel genes, frameshifts). A basic proteotypic peptide statistic, the SVM Technique to Evaluate Proteotypic Peptides (STEPP) [23], can be computed within VESPA and used to filter the peptides displayed in the visualization and queries. In addition, sequences of interest can be sent directly to BLAST [24] to assess the homology of genes identified within VESPA to known genes in the public databases. Alternatively, information extracted from the data, based on user queries to locate regions of interest, can be exported in easy-to-use formats for continued exploration outside of VESPA. VESPA is freely available at https://www.biopilot.org/docs/Software/Vespa.php. Here, we demonstrate the capabilities of VESPA with several use-case scenarios.