Setup of the Workflow
For species whose peptide databases do not contain sufficient information, our workflow for identifying peptides from shotgun proteomics data comprises three steps: transcriptome identification, database creation, and peptide identification.
The protein sequence database was created from the transcriptome data and the homologous species database. To obtain a global view of the orange transcriptome, we performed high-throughput RNA-seq, using Illumina sequencing technology, on poly(A)-enriched RNAs from orange leaves. To minimize the likelihood of systematic biases in transcriptome sampling, multiple cDNA libraries were prepared, and data were generated from three paired-end libraries with insert sizes ranging from 100 to 500 base pairs (bp). We conducted in-depth sequencing by paired-end RNA-seq on the three samples.
The reads were then realigned to the contig sequences, and the paired-end relationships between reads were converted into linkages between contigs. We constructed scaffolds starting with the short-insert paired ends and then iterated the scaffolding process, step by step, using paired ends with progressively longer insert sizes. To fill the intra-scaffold gaps, we used the paired-end information to retrieve read pairs in which one read aligned well to a contig and the other fell in the gap region, and then performed a local assembly of the collected reads.
Based on the transcriptome scaffolds, a reference database was generated using getorf from EMBOSS (version 6.3.1), with the minimum nucleotide size of a reported ORF set to 500. The resulting database contains 70,134 entries. This transcriptome-based database was integrated with the homologous species database, a downloaded clementine database (http://phytozome.net/clementine; 32,473 entries), completing the proteome reference database for protein identification. The comparison of the two databases is discussed below.
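The integration step can be pictured with a minimal sketch, assuming that integration amounts to concatenating the two FASTA databases while tagging each entry with its source so that identifiers remain unique. The function names and the source tags (`orange_rnaseq`, `clementine`) are hypothetical and are not taken from the workflow itself.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def integrate(sources: Dict[str, List[Record]]) -> List[Record]:
    """Concatenate databases, prefixing each header with its source tag.

    Assumption: no redundancy removal is performed; entries are only tagged.
    """
    merged: List[Record] = []
    for tag, records in sources.items():
        for header, seq in records:
            merged.append((f"{tag}|{header}", seq))
    return merged


def to_fasta(records: List[Record]) -> str:
    """Serialize records back to FASTA text, one sequence line per entry."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

Tagging the source in the header makes it possible to tell later whether an identified protein came from the transcriptome-based entries or from the clementine entries.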
After integration, shotgun proteomics data can be searched against the database using a database search engine. The next important step is confidence evaluation of the peptide identifications, i.e., FDR estimation. An FDR of 0.01 for proteins and peptides and a minimum peptide length of 6 amino acids were required.
In the last step of the workflow, peptides were identified based on the refined separate FDR estimation, and an easily interpretable report was generated.
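To make the FDR criterion concrete, the following is a minimal sketch of a target-decoy style filter. It assumes that each peptide-spectrum match (PSM) carries a score and a flag marking decoy hits; the text does not describe how the search engine estimates FDR internally, so the `PSM` class and `filter_psms` function are purely illustrative and are not the search engine's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PSM:
    peptide: str
    score: float      # higher is better (assumed)
    is_decoy: bool    # True if the match is to a decoy sequence


def filter_psms(psms: List[PSM], fdr: float = 0.01, min_len: int = 6) -> List[PSM]:
    """Return target PSMs passing the length filter and the FDR threshold."""
    # Drop peptides shorter than the minimum length, then rank by score.
    eligible = sorted(
        (p for p in psms if len(p.peptide) >= min_len),
        key=lambda p: p.score,
        reverse=True,
    )
    # Running FDR estimate at each rank: decoys / targets seen so far.
    running = []
    targets = decoys = 0
    for p in eligible:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        running.append(decoys / max(targets, 1))
    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    qvals = running[:]
    for i in range(len(qvals) - 2, -1, -1):
        qvals[i] = min(qvals[i], qvals[i + 1])
    return [p for p, q in zip(eligible, qvals) if not p.is_decoy and q <= fdr]
```

With `fdr=0.01` and `min_len=6`, the defaults mirror the thresholds stated above; the scoring scheme itself is left to the search engine.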
Application on Orange Leaves Data Sets
With the procedure described above, we performed database search and peptide identification for data sets from orange leaves. An orange homologous database (clementine database; http://phytozome.net/clementine; 32,473 entries) was integrated with the transcriptome-based database (70,134 entries), and the integrated database was used for peptide identification.
We noticed that there were about twice as many entries in the orange leaf transcriptome-based database as in the clementine database. To better understand the similarity of the sequences in the two databases, we aligned the clementine database against the orange transcriptome-based database using the NCBI blastp algorithm [17] with the e-value threshold set to 1e-5. The blastp output was filtered by requiring that two sequences share an alignment of >20 amino acids with >90% identity.
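The filtering criteria can be sketched as follows, assuming the blastp results were written in the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The output format actually used is not stated in the text, and the function name `similar_ids` is hypothetical.

```python
from typing import Set, Tuple


def similar_ids(
    blast_tsv: str,
    min_len: int = 20,
    min_ident: float = 90.0,
    max_evalue: float = 1e-5,
) -> Tuple[Set[str], Set[str]]:
    """Collect query and subject IDs with at least one hit passing the filter.

    A hit passes if its alignment length is > min_len, its percent identity
    is > min_ident, and its e-value is <= max_evalue.
    """
    queries: Set[str] = set()
    subjects: Set[str] = set()
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qid, sid = fields[0], fields[1]
        pident = float(fields[2])    # percent identity
        length = int(fields[3])      # alignment length in residues
        evalue = float(fields[10])
        if length > min_len and pident > min_ident and evalue <= max_evalue:
            queries.add(qid)
            subjects.add(sid)
    return queries, subjects
```

Here the queries would be the clementine sequences and the subjects the orange transcriptome-based sequences; the sizes of the two returned sets correspond to the counts of sequences considered sufficiently similar.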
As a result, 19,177 of 32,473 (59.06%) clementine sequences and 57,268 of 70,134 (81.66%) orange transcriptome-based sequences can be considered sufficiently similar. The ratio of the two numbers, approximately 0.33:1, implies that roughly three orange sequences correspond to one clementine sequence.
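The percentages and the ratio quoted above follow directly from the reported counts. A short check, using only the numbers given in the text (the variable names are illustrative):

```python
# Counts reported in the text.
clem_similar, clem_total = 19177, 32473
orange_similar, orange_total = 57268, 70134

# Share of each database with a sufficiently similar counterpart.
pct_clem = 100 * clem_similar / clem_total
pct_orange = 100 * orange_similar / orange_total

# Ratio of similar clementine sequences to similar orange sequences,
# and its inverse (orange sequences per clementine sequence).
ratio = clem_similar / orange_similar
orange_per_clem = orange_similar / clem_similar

print(f"clementine: {pct_clem:.2f}%  orange: {pct_orange:.2f}%")
print(f"ratio: {ratio:.2f}:1  (about {orange_per_clem:.1f} orange per clementine)")
```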
By increasing the alignment length threshold from 20 amino acids up to 300 in steps of 10, we observed a generally decreasing number of sequences involved in alignments. The different rates of decrease in the number of aligned sequences reflect the underlying distribution of alignment lengths.
These results suggest that the high-throughput sequencing transcriptome data were more comprehensive and that the integrated database could increase the number of identified peptides.
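The threshold sweep can be sketched as below, assuming the alignments have already passed the identity and e-value filters and are available as (query ID, subject ID, alignment length) tuples. The function name `sweep_length_threshold` and the input layout are hypothetical.

```python
from typing import Iterable, List, Tuple

Alignment = Tuple[str, str, int]  # (query id, subject id, alignment length)


def sweep_length_threshold(
    alignments: Iterable[Alignment],
    start: int = 20,
    stop: int = 300,
    step: int = 10,
) -> List[Tuple[int, int, int]]:
    """For each length threshold, count distinct queries and subjects
    that still have at least one alignment longer than the threshold."""
    alignments = list(alignments)
    rows = []
    for threshold in range(start, stop + 1, step):
        kept = [(q, s) for q, s, length in alignments if length > threshold]
        n_queries = len({q for q, _ in kept})
        n_subjects = len({s for _, s in kept})
        rows.append((threshold, n_queries, n_subjects))
    return rows
```

Because a stricter threshold can only remove alignments, both counts are non-increasing along the sweep; how quickly they fall depends on how the alignment lengths are distributed.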
MaxQuant was used as the search engine, with the FDR threshold set to 0.01. In total, 2951 unique peptides were identified, which mapped to 955 indiscernible protein groups. When each reference database was used separately, the numbers of protein groups were 778 and 806, respectively, corresponding to 81.47% and 84.40% of all protein groups identified.
The results showed that the integrated database had a clear advantage over the homologous species database for orange shotgun proteomics data analysis, with an 18.5% increase in the number of proteins identified.
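The reported percentages can be checked from the protein group counts. The sketch below assumes that 806 is the count obtained with the homologous species database alone, since that is the value consistent with the stated 18.5% increase; the text does not say explicitly which count belongs to which database.

```python
# Protein group counts reported in the text.
integrated_groups = 955
single_db_groups = (778, 806)

# Share of all identified protein groups recovered by each single database.
shares = [100 * n / integrated_groups for n in single_db_groups]

# Relative gain of the integrated database over the homologous database
# (assumption: the homologous species database yielded 806 groups).
homologous_groups = 806
gain_pct = 100 * (integrated_groups - homologous_groups) / homologous_groups

print(f"shares: {shares[0]:.2f}% and {shares[1]:.2f}%")
print(f"gain over homologous database: {gain_pct:.1f}%")
```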
To determine whether the identified protein groups differ in terms of GO categorization, we compared the two sets using the WEGO [19] algorithm. All identified proteins were classified into 38 functional categories and subcategories. No significant difference was found between them, indicating that the additionally identified proteins were similar to the original ones in functional categorization.
In summary, we have presented a workflow with an integrated database for peptide identification, which will be useful for proteome research on species whose protein sequence databases are incomplete. Recently, more and more large next-generation sequencing projects have been launched, such as the 1,000 Plant and Animal Genome Project and the 1,000 Plant Transcriptome Project. The workflow will help scientists working on any species, even one without its own protein database. For some species, the number of proteins identified could be twice that of past studies. We believe that more proteome studies will benefit from our strategy.