Table summarizes the basic tag information for each SAGE library. More than 70,000 tags were extracted from both LongSAGE and ShortSAGE libraries. The count per tag ranges from one to 2,202 for LongSAGE tags and from one to 1,098 for ShortSAGE tags. Interestingly, both the total tag counts and the numbers of distinct (unique) tags were higher in AD than in control samples for both the LongSAGE and ShortSAGE libraries. For instance, there are 34,475 unique tags in L_AD and 30,581 in L_Ctrl, indicating that more tags are expressed in the AD than in the control tissue. Because not all tags are expressed in both the AD and control libraries, the number of tags expressed in at least one of the two libraries increases to 55,093 for the LongSAGE, 43,937 for the tSAGE, and 37,900 for the ShortSAGE compared datasets. Furthermore, the overall frequency of SAGE tags mapped to UniGene build 182 is not very high for any library. For instance, 14,643 tags (42.5%) in L_AD and 11,646 tags (38.1%) in L_Ctrl map to the UniGene database, which leaves a large number of orphan tags (tags with no UniGene ID) in each library (Table ).
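The summary figures quoted above can be checked with simple arithmetic. The following is a minimal sketch, not the authors' code; the function names are hypothetical, and the count of shared tags is derived here by inclusion-exclusion from the reported numbers rather than taken from the paper.

```python
def percent_mapped(mapped_tags, unique_tags):
    """Percentage of unique tags that map to at least one UniGene cluster."""
    return 100.0 * mapped_tags / unique_tags


def shared_tags(unique_a, unique_b, in_either):
    """Tags present in both libraries, by inclusion-exclusion:
    |A and B| = |A| + |B| - |A or B|."""
    return unique_a + unique_b - in_either


# Values reported in the text for the LongSAGE libraries.
L_AD_UNIQUE, L_CTRL_UNIQUE, L_EITHER = 34_475, 30_581, 55_093

print(round(percent_mapped(14_643, L_AD_UNIQUE), 1))    # mapped fraction, L_AD
print(round(percent_mapped(11_646, L_CTRL_UNIQUE), 1))  # mapped fraction, L_Ctrl
print(shared_tags(L_AD_UNIQUE, L_CTRL_UNIQUE, L_EITHER))
```

Under these numbers, fewer than 10,000 unique LongSAGE tags would be shared between L_AD and L_Ctrl, which is why the combined dataset is much larger than either library alone.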
Summary of SAGE tags for four SAGE libraries
Applying the same strategy described in Lu et al. [5], we evaluated the tag-to-gene relationship using confident LongSAGE tags, defined as tags with counts > 1. Under this constraint, we still observed more LongSAGE tags in L_AD than in L_Ctrl. Interestingly, we observed similar frequencies of redundant short tags: only about 4.9–5.7% of tSAGE tags mapped to multiple LongSAGE tags (Table ). Further, more than 70% of confident tags can be mapped to one or more UniGene clusters, indicating that the overall low tag-to-gene mapping for each library comes mainly from tags with counts < 2 (non-confident tags).
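The confident-tag filter and the redundancy measure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: tags are assumed to be stored without the restriction-site bases, so a LongSAGE tag has 17 bases and its truncated (tSAGE) form keeps the first 10, and all function names are hypothetical.

```python
from collections import defaultdict

SHORT_TAG_LEN = 10  # assumption: tag bases only, restriction site excluded


def confident_tags(tag_counts, min_count=2):
    """Keep tags with counts > 1 (the 'confident' tags)."""
    return {tag: n for tag, n in tag_counts.items() if n >= min_count}


def truncate(long_tag, short_len=SHORT_TAG_LEN):
    """Convert a LongSAGE tag to its truncated (tSAGE) form."""
    return long_tag[:short_len]


def redundancy(long_tags):
    """Group long tags by their truncated form and return the fraction of
    truncated tags that correspond to more than one long tag."""
    groups = defaultdict(set)
    for tag in long_tags:
        groups[truncate(tag)].add(tag)
    redundant = sum(1 for members in groups.values() if len(members) > 1)
    return redundant / len(groups), dict(groups)


# Toy counts (made up for illustration only).
toy_counts = {
    "AAAAAAAAAACCCCCCC": 3,
    "AAAAAAAAAAGGGGGGG": 2,
    "TTTTTTTTTTAAAAAAA": 5,
    "GGGGGGGGGGTTTTTTT": 1,  # count of 1: dropped as non-confident
}
fraction, groups = redundancy(confident_tags(toy_counts))
print(fraction)
```

In the toy data two of the three confident long tags collapse onto the same truncated tag, so half of the truncated tags are redundant.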
Redundancy and tag-to-gene mapping for unique tags with tag counts > 1 (confident tags).
As expected, the tag-to-gene relationship is more specific for LongSAGE tags than for short SAGE tags. Figure depicts the distribution of tags by the number of their corresponding UniGene clusters for each compared dataset. The LongSAGE dataset shows a large percentage of orphan tags (65%), whereas tSAGE and ShortSAGE each have about 18% orphan tags. This is expected, as the probability of mapping to a UniGene cluster is much smaller for a LongSAGE tag because of the extra seven bps. The three compared datasets show a similar percentage of tags mapping to a single UniGene cluster: 32.3% for LongSAGE, 32.7% for tSAGE, and 33.1% for ShortSAGE. However, 97.3% of LongSAGE tags are either orphan tags or map to a single UniGene cluster, while in both the tSAGE and ShortSAGE datasets about 50% of tags still map to more than one UniGene cluster. The maximum number of UniGene clusters corresponding to a single tag was 15 for LongSAGE tags and 279 for both tSAGE and ShortSAGE tags. This may imply a higher chance of obtaining false matches for a ShortSAGE tag than for a LongSAGE tag. For instance, of the 17,793 LongSAGE tags that map to a single UniGene cluster, only 5,749 still map to a single UniGene cluster after conversion to tSAGE tags; the rest join the pool of tags that map to more than one cluster, which may represent false matches. As theorized, the increase in gene-mapping specificity offered by LongSAGE tags over ShortSAGE tags is substantial.
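The three categories used in this comparison (orphan, single cluster, multiple clusters) can be tallied as in the sketch below. It is a minimal illustration with hypothetical names and made-up cluster identifiers, not the authors' code. It also shows the size of the sequence space added by seven extra bases, assuming the four nucleotides are equally likely at each position.

```python
from collections import Counter


def mapping_class(n_clusters):
    """Classify a tag by the number of UniGene clusters it maps to."""
    if n_clusters == 0:
        return "orphan"
    if n_clusters == 1:
        return "single"
    return "multiple"


def class_percentages(tag_to_clusters):
    """Percentage of tags in each mapping class.

    tag_to_clusters: dict mapping a tag to a collection of cluster IDs.
    """
    tally = Counter(mapping_class(len(c)) for c in tag_to_clusters.values())
    total = len(tag_to_clusters)
    return {name: 100.0 * tally[name] / total
            for name in ("orphan", "single", "multiple")}


# Toy mapping (cluster IDs are made up for illustration only).
toy = {
    "tag1": [],
    "tag2": ["Hs.0001"],
    "tag3": ["Hs.0001", "Hs.0002"],
    "tag4": [],
}
print(class_percentages(toy))

# Seven extra bases enlarge the space of possible tags by a factor of 4**7.
print(4 ** 7)
```

Applied to the reported LongSAGE figures, the orphan and single-cluster classes together account for 65% + 32.3% = 97.3% of tags.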
Distribution of SAGE tags. The distribution of SAGE tags depicted by the number of corresponding clusters in the LongSAGE, truncated LongSAGE, and short SAGE datasets.
When we compared the expression patterns between AD and control for the three types of libraries (LongSAGE, tSAGE, and ShortSAGE), the LongSAGE and tSAGE libraries showed strong similarity (Figure ). This is reasonable, as they were based on the same samples. Unexpectedly, S_AD and S_Ctrl show very similar expression levels for the majority of genes, unlike the case and control samples used for the LongSAGE and tSAGE libraries. Our test results reflected the expression patterns in Figure . We detected 380 LongSAGE tags, 400 tSAGE tags, and 156 ShortSAGE tags with significant differential expression between AD and control (P < 0.05). Clearly, we detected fewer tags in the ShortSAGE dataset than in the other two. Although significant, this difference could be due to variation in gene expression between samples with the same disease status.
Tag frequency comparison. Comparisons of tag frequencies between AD and controls of LongSAGE, ShortSAGE, and tSAGE libraries.
Since both the LongSAGE and tSAGE libraries were derived from the same samples, we used these two datasets to measure the relative ability of long and short SAGE libraries to detect altered gene expression. We found that the 400 significantly differentially expressed tSAGE tags were derived from 336 significant and 1,425 non-significant LongSAGE tags. We assigned each tSAGE tag to one of three categories defined by the test results of its corresponding long tags: (1) Positive group, if all corresponding LongSAGE tags are significant; (2) Negative group, if none of the corresponding LongSAGE tags is significant; or (3) Either group, if the corresponding LongSAGE tags are a combination of significant and non-significant. Figure 3 depicts the relationship between the 400 significant tSAGE tags and their corresponding LongSAGE tags in these three groups. The 400 tSAGE tags were distributed as 156 in the Positive group, 79 in the Negative group, and 165 in the Either group. Interestingly, each tSAGE tag in the Positive group was derived from a single LongSAGE tag, whereas each tag in the Negative and Either groups was derived from at least two LongSAGE tags. The maximum number of corresponding LongSAGE tags for a tSAGE tag was 114 in the Negative group and 68 in the Either group. We also examined the number of UniGene clusters that mapped to each of the 400 significant tSAGE tags. The tSAGE tags in the Positive group mapped to up to seven UniGene clusters, while the tags in the Negative and Either groups mapped to up to 108 and 66 clusters, respectively. Overall, the significant tSAGE tags in the Negative and Either groups tend to map to more LongSAGE tags and more known genes.
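The three-group assignment described above can be written as a small function. This is a sketch under the assumption that each tSAGE tag comes with a list of booleans, one per corresponding LongSAGE tag, indicating whether that long tag tested as significant; the function name and the toy data are hypothetical.

```python
from collections import Counter


def assign_group(long_tag_significant):
    """Assign a tSAGE tag to the Positive, Negative, or Either group.

    long_tag_significant: non-empty list of booleans, one per LongSAGE tag
    that truncates to this tSAGE tag (True = significant).
    """
    if not long_tag_significant:
        raise ValueError("a tSAGE tag must derive from at least one LongSAGE tag")
    if all(long_tag_significant):
        return "Positive"
    if not any(long_tag_significant):
        return "Negative"
    return "Either"


# Toy example (made-up tags): significance flags of the corresponding long tags.
toy = {
    "tsage_a": [True],
    "tsage_b": [False, False, False],
    "tsage_c": [True, False],
}
print(Counter(assign_group(flags) for flags in toy.values()))
```

The reported group sizes are consistent with a partition of the significant tSAGE tags: 156 + 79 + 165 = 400.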
Figure 3. The property of significantly differentially expressed tSAGE tags. A diagram relating the LongSAGE tags to the 400 tSAGE tags that are significantly differentially expressed between AD and control. The distribution of the tSAGE tags is summarized based on ...
One of the most interesting findings comes from the analysis of orphan tags. The BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) analysis of 100 randomly selected orphan tags revealed 17 orphan tags in which at least 17 bps of the tag completely matched a human gene sequence [6]. This frequency (17%) is close to the probability (14%) of obtaining one gene sequence perfectly matched to 17 bps of a given tag, assuming a human genome size of 2.864 × 10^9 bps and an equal frequency of each nucleotide at every base. The number of matched gene sequences for an orphan tag increases as the number of matched bps decreases (Table ). A total of 39 gene sequences were identified through this approach. Since the tag sequence used in the BLAST analysis consists of four bps (nucleotide positions one to four) from the restriction site and 17 bps (nucleotide positions five to 21) from the SAGE tag, we also restricted our selection to tags in which all 17 bps of the tag region match a gene sequence. The reason is that sequencing errors are more likely in the restriction site than in the tag region. Under these criteria, the ending position of the matched segment in the tag sequence is always 21 and the starting position must be less than or equal to five. We found nine orphan tags that met these criteria (Table ). Four of the nine orphan tags matched a single human gene sequence, with 21, 20, and 18 matched bps; these are more likely to be the real transcripts for these four orphan tags.
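The text does not give the formula behind the 14% figure. One reading that reproduces it, offered here as an assumption rather than as the authors' method, treats the number of chance matches as Poisson-distributed with mean equal to the genome size divided by 4^17, and takes the probability of exactly one match. The sketch below also encodes the position criteria for the matched segment; both function names are hypothetical.

```python
import math


def prob_exactly_one_match(genome_size, match_len):
    """Poisson probability of exactly one perfect chance match.

    Assumes equal nucleotide frequencies, so a given sequence of match_len
    bases occurs at any position with probability 4**-match_len, and the
    expected number of matches is genome_size / 4**match_len.
    """
    expected = genome_size / 4 ** match_len
    return expected * math.exp(-expected)


def meets_position_criteria(start, end, tag_start=5, tag_end=21):
    """True if the matched segment covers the whole 17-bp tag region
    (positions 5 to 21 of the 21-bp sequence submitted to BLAST)."""
    return end == tag_end and start <= tag_start


print(round(prob_exactly_one_match(2.864e9, 17), 2))
print(meets_position_criteria(start=2, end=21))   # covers positions 5 to 21
print(meets_position_criteria(start=6, end=21))   # misses position 5
```

With these assumptions the expected number of chance matches is about 0.17, and the probability of exactly one match rounds to 0.14, in line with the value quoted in the text.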
Results of BLAST analysis for 100 orphan tags.
A list of genes mapping to nine orphan tags.