Because a SAGE tag is located in the 3' part of a transcript [8], we used 3' ESTs for the comparison. We collected 3' ESTs representing low-abundance transcripts by searching for UniGene clusters that contained only a single 3' EST (ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.seq.all.gz, UniGene Build #161). We identified 42,500 such UniGene clusters and thus obtained 42,500 3' ESTs. For comparison with SAGE tags, we extracted virtual tags from these ESTs. Of the 42,500 3' ESTs, 32,587 contained at least one CATG site, a precondition for the release of a SAGE tag from a transcript, and from each of these sequences we extracted a virtual SAGE tag (the 10 bases downstream of the last CATG). We then removed virtual tags that were shared by more than one 3' EST, which left a final set of 22,243 virtual tags from 22,243 3' ESTs representing low-abundance transcripts.
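The virtual tag extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names and input layout are hypothetical, each EST is assumed to be given in the sense orientation of the transcript, and ESTs with fewer than 10 bases after the last CATG are simply skipped (the text does not say how such cases were handled).

```python
from collections import Counter

ANCHOR = "CATG"    # site required for release of a SAGE tag
TAG_LENGTH = 10    # bases taken downstream of the last CATG


def virtual_tag(est_sequence):
    """Return the 10 bases downstream of the last CATG, or None.

    None is returned when the sequence has no CATG site or when fewer
    than 10 bases follow the last site (an assumption of this sketch).
    """
    seq = est_sequence.upper()
    pos = seq.rfind(ANCHOR)          # position of the last CATG
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = seq[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


def unique_virtual_tags(ests):
    """Map each virtual tag produced by exactly one EST to that EST's id.

    `ests` is a dict of {est_id: sequence}. Tags shared by more than
    one EST are discarded, mirroring the filtering step in the text.
    """
    tags = {est_id: virtual_tag(seq) for est_id, seq in ests.items()}
    counts = Counter(tag for tag in tags.values() if tag is not None)
    return {
        tag: est_id
        for est_id, tag in tags.items()
        if tag is not None and counts[tag] == 1
    }
```

Applied to the 42,500 single-EST clusters, the first function corresponds to the step that yielded 32,587 virtual tags, and the second to the filtering that left 22,243.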
To obtain experimental SAGE tags for the comparison, we downloaded 477,261 SAGE tags, comprising 6,847,555 copies in total, collected from 154 SAGE libraries (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL4). Comparison of the 22,243 virtual SAGE tags with the experimental SAGE tag set identified 20,575 tags that were present in both sets. By matching these 20,575 tags against the SAGEmap database (http://www.ncbi.nlm.nih.gov/SAGE/), we identified 2,278 tags that represented the same 3' ESTs detected by both the EST method and the SAGE method. We used these 2,278 tags as the final set for quantitative comparison. Whereas each of the 2,278 virtual tags represents a transcript detected only once by the EST method, the copy number of each of the 2,278 experimental SAGE tags represents the frequency with which the transcript was detected by SAGE. The 2,278 experimental SAGE tags appeared a total of 59,754 times, and 1,424 (63%) of them appeared at least twice, with copy numbers ranging from two to more than 100. On average, SAGE was therefore 26 times more sensitive than the EST method in detecting these transcripts (Table ). These data show that the SAGE method is much more sensitive than the EST method for the detection of low-abundance transcripts.
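The quantitative comparison can be illustrated with a short sketch. It assumes the experimental data are available as a mapping from tag sequence to copy number; the function name and the returned field names are hypothetical, and the SAGEmap matching step that confirms a tag and an EST represent the same transcript is not modeled here.

```python
def compare_with_sage(virtual_tags, sage_counts):
    """Summarize SAGE copy numbers for tags detected once by EST.

    virtual_tags: iterable of tag sequences, each from a single 3' EST.
    sage_counts:  dict of {tag: copy number across SAGE libraries}.
    """
    shared = {tag: sage_counts[tag] for tag in virtual_tags if tag in sage_counts}
    total_copies = sum(shared.values())
    return {
        "shared_tags": len(shared),
        "total_copies": total_copies,
        "tags_seen_twice_or_more": sum(1 for c in shared.values() if c >= 2),
        # Each shared transcript was seen once by EST, so the mean SAGE
        # copy number is the average fold difference in detection.
        "mean_copies_per_tag": total_copies / len(shared) if shared else 0.0,
    }


# Check against the totals reported in the text.
reported_fold = 59_754 / 2_278       # about 26.2
reported_fraction = 1_424 / 2_278    # about 0.63
```

With the reported totals, 59,754 copies over 2,278 tags gives a mean of about 26 copies per tag, which is the basis for the 26-fold figure.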
Comparison between EST and SAGE methods for the detection of low-abundance transcripts
What could explain the difference between the EST and SAGE methods in detecting low-abundance transcripts?
It is unlikely that the difference is due to the depth of sequence collection. The current collection of human ESTs has reached about 4.5 million sequences, including 131,229 mRNAs and 1,470,982 3' ESTs, whereas the total number of human SAGE tags is about 8 million. Considering that more than 20 tags can be detected in a single SAGE sequence, the number of sequences collected by SAGE is far smaller than that collected for ESTs. In our previous studies [2], we observed a "loss" effect in EST collection caused by non-specific polydA/dT hybridization during the subtraction/normalization step widely used in EST library construction [6], as evidenced by the quantitative loss of a group of targeted transcripts, although it is difficult to give an absolute rate of loss at the whole-genome level because of the complexity of the transcriptome. This phenomenon can explain the difference in part, but other causes of loss may also exist, such as limited cloning efficiency when ligating cDNAs into the vector during cDNA library construction and clonal loss during library transformation. In the SAGE process there is no subtraction/normalization step, and the cDNA fragments at each step of SAGE library construction have nearly the same length and the same ends until they are cloned into the vector. The repertoire of total transcripts is therefore well preserved in SAGE libraries for detection.
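The sequencing-depth argument made at the start of this paragraph can be checked with rough arithmetic. The sketch below uses the approximate figures quoted in the text; because a single SAGE sequence yields more than 20 tags, dividing by 20 gives an upper bound on the number of SAGE sequences. The function name is illustrative only.

```python
def sage_sequence_estimate(total_tags, tags_per_sequence):
    """Upper-bound estimate of SAGE sequencing reads behind a tag collection."""
    return total_tags / tags_per_sequence


est_sequences = 4_500_000     # approximate number of human ESTs
sage_tags = 8_000_000         # approximate number of human SAGE tags
sage_sequences = sage_sequence_estimate(sage_tags, 20)

# ESTs outnumber SAGE sequencing reads by roughly this factor.
est_to_sage_ratio = est_sequences / sage_sequences
print(f"SAGE sequences: at most about {sage_sequences:,.0f}")
print(f"EST sequences per SAGE sequence: about {est_to_sage_ratio:.1f}")
```

Under these assumptions, about 8 million tags correspond to at most roughly 400,000 SAGE sequences, an order of magnitude fewer than the 4.5 million ESTs.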
It is true that the SAGE method has many limitations for transcript detection. For example, a 14-base SAGE tag contains less sequence information about the detected transcript than an EST of several hundred bases; the specificity of a SAGE tag in representing a unique transcript is also lower than that of an EST, particularly for SAGE tags with higher copy numbers [14]; and SAGE cannot detect CATG-negative transcripts, although these are few: only 151 (0.78%) of the 19,399 full-length human cDNAs in the RefSeq (NM) database are CATG-negative. Another issue concerns erroneous SAGE tags. A SAGE tag has 10 bases downstream of the CATG site. In theory, any base within a single tag could be a sequencing error, leading to the generation of 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 × 4 = 4^10 mutated tags. However, this does not happen in practice [7]. We have experimentally converted thousands of SAGE tags into their 3' cDNAs using the GLGI method. From these studies, we see clearly that over 70% of the low-copy SAGE tags represent real transcripts expressed at a low level (these were experimentally confirmed; the real rate may be higher given the limited sensitivity of the experiments). Although erroneous SAGE tags certainly exist, they cannot make up a significant portion of the total SAGE tag collection, particularly among low-copy SAGE tags. Despite these limitations, SAGE has unique features for transcriptome studies. Among them is that the presence of a SAGE tag largely implies the presence of a transcript.
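To put the theoretical figure for erroneous tags in perspective, the sketch below computes the size of the full 10-base tag space (4 multiplied by itself ten times) and, for contrast, enumerates the variants that a single substitution error can produce from one tag. The function names are illustrative, and the single-substitution view is an added simplification rather than an error model taken from the cited studies.

```python
BASES = "ACGT"
TAG_LENGTH = 10


def tag_space_size(length=TAG_LENGTH):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return len(BASES) ** length


def single_substitution_variants(tag):
    """All tags that differ from `tag` at exactly one position."""
    variants = []
    for i, original in enumerate(tag):
        for base in BASES:
            if base != original:
                variants.append(tag[:i] + base + tag[i + 1:])
    return variants


total_space = tag_space_size()                                # 4 ** 10
neighbors = single_substitution_variants("ACGTACGTAA")        # 10 positions x 3 bases
```

The full space contains 1,048,576 possible tags, whereas one substitution in a given tag can yield only 30 distinct variants.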
It is worth noting that our analysis focused only on known low-abundance transcripts. Many unknown low-abundance transcripts may not be present in EST libraries and are therefore not detectable as novel ESTs. These transcripts may nevertheless be well preserved in SAGE libraries and thus readily detectable as novel SAGE tags.