Serial Analysis of Gene Expression (SAGE) is a new technique that provides detailed quantitative and qualitative knowledge of a gene expression profile without prior knowledge of the sequences of the analyzed genes. We applied a modification of the SAGE methodology (microSAGE), useful for the analysis of limited quantities of tissue, to normal human cervical tissue obtained from a donor without histopathological lesions. Cervical epithelium consists mainly of cervical keratinocytes, which are the targets of human papillomavirus (HPV); persistent HPV infection of the cervical epithelium is associated with an increased risk of developing cervical carcinoma (CC).
We report here a transcriptome analysis of cervical tissue by SAGE, derived from 30,418 sequenced tags that provide a wealth of information about the gene products involved in normal cervical epithelium physiology, as well as genes not previously found in uterine cervix tissue involved in the process of epidermal differentiation.
This first comprehensive analysis of the uterine cervix transcriptome should be useful for the identification of genes involved in normal uterine cervix function and of candidate genes associated with cervical carcinoma.
To identify the genes expressed in normal human trabecular meshwork tissue, a tissue critical to the pathogenesis of glaucoma.
Total RNA was extracted from human trabecular meshwork (HTM) harvested from 3 different donors. Extracted RNA was used to synthesize individual SAGE (serial analysis of gene expression) libraries using the I-SAGE Long kit from Invitrogen. Libraries were analyzed using SAGE 2000 software to extract the 17 base pair sequence tags. The extracted sequence tags were mapped to the genome using SAGE Genie map.
A total of 298,834 SAGE tags were identified from all HTM libraries (96,842, 88,126, and 113,866 tags, respectively). Collectively, there were 107,325 unique tags. There were 10,329 unique tags with a minimum of 2 counts from a single library. These tags were mapped to known unique Unigene clusters. Approximately 29% of the tags (orphan tags) did not map to a known Unigene cluster. Thirteen percent of the tags mapped to at least 2 Unigene clusters. Sequence tags from many glaucoma-related genes, including myocilin, optineurin, and WD repeat domain 36, were identified.
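The tallies reported above (total tags, unique tags, and unique tags seen at least twice in a single library) can be illustrated with a minimal sketch. The function name `summarize_libraries` and the toy tag sequences are hypothetical; the per-library totals are the ones quoted in the text, and nothing here reproduces the SAGE 2000 software.

```python
from collections import Counter

def summarize_libraries(libraries, min_count=2):
    """Return (total tags, unique tags, unique tags with >= min_count in one library)."""
    total = sum(sum(lib.values()) for lib in libraries)
    unique = set()
    reliable = set()
    for lib in libraries:
        unique.update(lib)  # every distinct tag seen in any library
        # tags that reach the threshold within this single library
        reliable.update(tag for tag, n in lib.items() if n >= min_count)
    return total, len(unique), len(reliable)

# The three per-library totals quoted in the text add up to the stated grand total.
reported_totals = [96_842, 88_126, 113_866]
grand_total = sum(reported_totals)  # 298834

# Toy libraries with hypothetical 17-bp tags (not real data).
toy = [
    Counter({"A" * 17: 3, "C" * 17: 1}),
    Counter({"A" * 17: 1, "G" * 17: 2}),
    Counter({"T" * 17: 1}),
]
print(summarize_libraries(toy))  # (8, 4, 2)
```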
This is the first time SAGE analysis has been used to characterize the gene expression profile in normal HTM. SAGE analysis provides an unbiased sampling of gene expression of the target tissue. These data will provide new and valuable information to improve understanding of the biology of human aqueous outflow.
SAGE (serial analysis of gene expression) is a powerful method of analyzing gene expression for the entire transcriptome. There are currently many well-developed SAGE tools. However, the cross-comparison of different tissues is seldom addressed, thus limiting the identification of common- and tissue-specific tumor markers.
To improve SAGE mining methods, we propose a novel function for cross-tissue comparison of SAGE data that combines mathematical set theory and logic with a unique “multi-pool method” that analyzes multiple pools of pair-wise case controls individually. When all the settings are in “inclusion”, the common SAGE tag sequences are mined. When one tissue type is in “inclusion” and the other tissue types are not, the selected tissue-specific SAGE tag sequences are generated. They are displayed in tags-per-million (TPM) and fold values, and are also visualized in four kinds of scales in a color gradient pattern. In the fold visualization display, the top-scoring SAGE tag sequences are provided, along with cluster plots. A user-defined matrix file is designed for cross-tissue comparison by selecting libraries from publicly available databases or user-defined libraries.
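The inclusion logic described above can be sketched with ordinary set operations. This is a simplified illustration, not the hSAGEing implementation: it treats each tissue as a plain collection of tags (presence or absence) rather than pools of case-control pairs, and the names `tags_per_million` and `mine_tags`, as well as the toy tags, are hypothetical.

```python
def tags_per_million(counts):
    """Normalize raw tag counts to tags-per-million (TPM)."""
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}

def mine_tags(libraries, included):
    """Tags present in every 'included' tissue and absent from all other tissues.

    libraries: dict mapping tissue name -> dict of tag counts
    included:  set of tissue names set to "inclusion"
    """
    inc = [set(tags) for name, tags in libraries.items() if name in included]
    exc = [set(tags) for name, tags in libraries.items() if name not in included]
    if not inc:
        return set()
    result = set.intersection(*inc)   # common to all included tissues
    for other in exc:
        result -= other               # drop anything seen in an excluded tissue
    return result

# Toy data with hypothetical tags.
libs = {
    "brain": {"AAAA": 10, "CCCC": 5},
    "lung":  {"AAAA": 2, "GGGG": 8},
    "colon": {"AAAA": 1, "CCCC": 1},
}
common = mine_tags(libs, {"brain", "lung", "colon"})  # all in inclusion
lung_only = mine_tags(libs, {"lung"})                 # one tissue in inclusion
```

With every tissue in inclusion the result is the common tag set; with a single tissue in inclusion it is that tissue's specific tag set.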
The hSAGEing tool provides a combination of friendly cross-tissue analysis and an interface for comparing SAGE libraries for the first time. Some up- or down-regulated genes with tissue-specific or common tumor markers and suppressors are identified computationally. The tool is useful and convenient for in silico cancer transcriptomic studies and is freely available at http://bio.kuas.edu.tw/hSAGEing
The highest rates of cervical cancer are found in developing countries. Frontline monitoring has reduced these rates in developed countries, and present-day screening programs primarily identify precancerous lesions termed cervical intraepithelial neoplasias (CIN). CIN lesions described as mild dysplasia (CIN I) are likely to regress spontaneously, while CIN III lesions (severe dysplasia) are likely to progress if untreated. Thoughtful consideration of gene expression changes paralleling the progressive pre-invasive neoplastic development will yield insight into the key causal events involved in cervical cancer development.
In this study, we have identified gene expression changes across 16 cervical cases (CIN I, CIN II, CIN III and normal cervical epithelium) using the unbiased long serial analysis of gene expression (L-SAGE) method. The 16 L-SAGE libraries were sequenced to the level of 2,481,387 tags, creating the largest SAGE data collection for cervical tissue worldwide. We have identified 222 genes differentially expressed between normal cervical tissue and CIN III. Many of these genes influence biological functions characteristic of cancer, such as cell death, cell growth/proliferation and cellular movement. Evaluation of these genes through network interactions identified multiple candidates that influence regulation of cellular transcription through chromatin remodelling (SMARCC1, NCOR1, MRFAP1 and MORF4L2). Further, these expression events are focused at the critical junction in disease development of moderate dysplasia (CIN II) indicating a role for chromatin remodelling as part of cervical cancer development.
We have created a valuable publicly available resource for the study of gene expression in precancerous cervical lesions. Our results indicate that deregulation of the chromatin remodelling complex components and its influencing factors occurs in the development of CIN lesions. The increase in the SWI/SNF stabilizing molecule SMARCC1 and other novel genes has not previously been shown to occur in the early stages of dysplasia development, and thus provides not only novel candidate markers for screening but also a biological function for targeting treatment.
To develop large-scale, high-throughput annotation of the human macula transcriptome and to identify and prioritize candidate genes for inherited retinal dystrophies, based on ocular-expression profiles using serial analysis of gene expression (SAGE).
Two human retina and two retinal pigment epithelium (RPE)/choroid SAGE libraries made from matched macula or midperipheral retina and adjacent RPE/choroid of morphologically normal 28- to 66-year-old donors and a human central retina longSAGE library made from 41- to 66-year-old donors were generated. Their transcription profiles were entered into a relational database, EyeSAGE, including microarray expression profiles of retina and publicly available normal human tissue SAGE libraries. EyeSAGE was used to identify retina- and RPE-specific and -associated genes, and candidate genes for retina and RPE disease loci. Differential and/or cell-type specific expression was validated by quantitative and single-cell RT-PCR.
Cone photoreceptor-associated gene expression was elevated in the macula transcription profiles. Analysis of the longSAGE retina tags enhanced tag-to-gene mapping and revealed alternatively spliced genes. Analysis of candidate gene expression tables for the identified Bardet-Biedl syndrome disease gene (BBS5) in the BBS5 disease region table yielded BBS5 as the top candidate. Compelling candidates for inherited retina diseases were identified.
The EyeSAGE database, combining three different gene-profiling platforms including the authors’ multidonor-derived retina/RPE SAGE libraries and existing single-donor retina/RPE libraries, is a powerful resource for definition of the retina and RPE transcriptomes. It can be used to identify retina-specific genes, including alternatively spliced transcripts and to prioritize candidate genes within mapped retinal disease regions.
Serial Analysis of Gene Expression (SAGE) is becoming a widely
used gene expression profiling method for the study of development,
cancer and other human diseases. Investigators using SAGE rely heavily
on the quantitative aspect of this method for cataloging gene expression
and comparing multiple SAGE libraries. We have developed additional
computational and statistical tools to assess the quality and reproducibility
of a SAGE library. Using these methods, a critical variable in the
SAGE protocol was identified that has the potential to bias the
Tag distribution relative to the GC content of the 10 bp SAGE Tag
DNA sequence. We also detected this bias in a number of publicly
available SAGE libraries. It is important to note that the GC content bias
went undetected by quality control procedures in the current SAGE
protocol and was only identified with the use of these statistical
analyses on as few as 750 SAGE Tags. In addition to keeping any
solution of free DiTags on ice, an analysis of the GC content should
be performed before sequencing large numbers of SAGE Tags to be
confident that SAGE libraries are free from experimental bias.
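A GC-content check of the kind recommended above can be sketched as follows. This is only an illustration of tabulating GC content per tag, not the authors' statistical tools; the function names and the toy tags are hypothetical, and how to judge a distribution as biased is left open because the text does not specify the test used.

```python
from collections import Counter

def gc_count(tag):
    """Number of G or C bases in a tag."""
    return sum(base in "GC" for base in tag.upper())

def gc_distribution(tag_counts):
    """Fraction of all sequenced tags at each GC count, weighted by tag abundance."""
    dist = Counter()
    for tag, n in tag_counts.items():
        dist[gc_count(tag)] += n
    total = sum(dist.values())
    return {k: v / total for k, v in sorted(dist.items())}

def mean_gc_fraction(tag_counts):
    """Abundance-weighted mean GC fraction across the library."""
    total = sum(tag_counts.values())
    return sum(gc_count(t) / len(t) * n for t, n in tag_counts.items()) / total

# Toy library of hypothetical 10-bp tags.
library = {"GGGGCCCCAA": 2, "AAAAATTTTT": 2}
print(gc_distribution(library))   # {0: 0.5, 8: 0.5}
print(mean_gc_fraction(library))
```

Comparing such a distribution between an early sample of tags and a reference library is one way to spot a shift before committing to large-scale sequencing.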
Leishmaniases are widespread parasitic diseases with an urgent need for more active and less toxic drugs and for effective vaccines. Understanding the biology of the parasite, especially in the context of host-parasite interaction, is a crucial step towards such improvements in therapy and control. Several experimental approaches, including SAGE (Serial analysis of gene expression), have been developed to investigate the organisation and plasticity of the parasite transcriptome. The usual SAGE tag-to-gene mapping techniques are inadequate because almost all tags are normally located in the 3'-UTR outside the CDS, whereas most information available for Leishmania transcripts is restricted to the CDS predictions. The aim of this work is to optimize a tag-to-gene mapping technique for SAGE libraries and to show how this development improves the understanding of the Leishmania transcriptome.
The in silico method implemented herein was based on mapping the tags to the Leishmania genome using BLAST, then mapping the tags to their genes using a data-driven probability distribution. This optimized tag-to-gene mapping improved the knowledge of Leishmania genome structure and transcription. It allowed the analysis of the expression of a maximal number of Leishmania genes, the delimitation of the 3' UTR of 478 genes and the identification of biological processes that are differentially modulated during the promastigote to amastigote differentiation.
The developed method optimizes the assignment of SAGE tags in trypanosomatidae genomes as well as in any genome having polycistronic transcription and small intergenic regions.
The serial analysis of gene expression (SAGE) method is based on
the isolation of unique sequence tags from individual transcripts
and concatenation of tags serially into long DNA molecules. SAGE
is an innovative technique that offers the potential of
cataloging both the identity and relative frequencies of mRNA
transcripts in a given RNA preparation. It can quantify
low-abundance transcripts and reliably detect relatively small
differences in transcript abundance between cell populations.
SAGE data can be used to complement studies in cases where other
gene expression methods may be more convenient or
efficient. SAGE can be used in a wide variety of applications to
identify disease-related genes, to analyze the effect of drugs on
tissues, and to provide insights into the disease pathways. The
most important application of SAGE is the identification of
differentially expressed genes. In this review, we describe
various applications of this powerful technology in malarial
parasite, yeast, plant, and animal systems.
To facilitate the identification of gene products important in regulating renal glomerular structure and function, we have produced an annotated transcriptome database for normal human glomeruli using the SAGE approach.
The database contains 22,907 unique SAGE tag sequences, with a total tag count of 48,905. For each SAGE tag, the ratio of its frequency in glomeruli relative to that in 115 non-glomerular tissues or cells, a measure of transcript enrichment in glomeruli, was calculated. A total of 133 SAGE tags representing well-characterized transcripts were enriched 10-fold or more in glomeruli compared to other tissues. Comparison of data from this study with a previous human glomerular Sau3A-anchored SAGE library reveals that 47 of the highly enriched transcripts are common to both libraries. Among these are the SAGE tags representing many podocyte-predominant transcripts such as WT-1, podocin and synaptopodin. Enrichment of podocyte transcript tags in this SAGE library indicates that other SAGE tags observed at much higher frequencies in this glomerular library than in non-glomerular SAGE libraries are likely to be glomerulus-predominant. A higher level of mRNA expression for 19 transcripts represented by glomerulus-enriched SAGE tags was verified by RT-PCR comparing glomeruli to lung, liver and spleen.
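The enrichment measure described above, a tag's frequency in the target library divided by its frequency in the comparison libraries, can be sketched as below. The pooling of all comparison libraries into one count table and the pseudocount used to avoid division by zero are assumptions made for illustration; the text does not say how absent tags were handled. Function names and toy counts are hypothetical.

```python
def enrichment_ratio(tag, target, others, pseudocount=0.5):
    """Frequency of `tag` in `target` divided by its frequency in pooled `others`.

    target, others: dicts of tag -> raw count.
    pseudocount is added to the comparison count only (an assumption, so that
    tags absent from the comparison set do not cause a division by zero).
    """
    f_target = target.get(tag, 0) / sum(target.values())
    f_others = (others.get(tag, 0) + pseudocount) / sum(others.values())
    return f_target / f_others

def enriched_tags(target, others, fold=10.0, pseudocount=0.5):
    """Tags whose enrichment ratio reaches the chosen fold threshold."""
    return {t for t in target
            if enrichment_ratio(t, target, others, pseudocount) >= fold}

# Toy counts with hypothetical tags.
glomeruli = {"TAGX": 10, "TAGY": 90}
pooled_other = {"TAGX": 1, "TAGY": 999}
print(enriched_tags(glomeruli, pooled_other))  # {'TAGX'}
```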
The database can be retrieved from, or interrogated online at http://cgap.nci.nih.gov/SAGE. The annotated database is also provided as an additional file with gene identification for 9,022, and matches to the human genome or transcript homologs in other species for 1,433 tags. It should be a useful tool for in silico mining of glomerular gene expression.
Lung cancer is the most common cause of cancer-related deaths. Tobacco smoke exposure is the strongest aetiological factor associated with lung cancer. In this study, using serial analysis of gene expression (SAGE), we comprehensively examined the effect of active smoking by comparing the transcriptomes of clinical specimens obtained from current, former and never smokers, and identified genes showing both reversible and irreversible expression changes upon smoking cessation.
Twenty-four SAGE profiles of the bronchial epithelium of eight current, twelve former and four never smokers were generated and analyzed. In total, 3,111,471 SAGE tags representing over 110 thousand potentially unique transcripts were generated, comprising the largest human SAGE study to date. We identified 1,733 constitutively expressed genes in current, former and never smoker transcriptomes. We have also identified both reversible and irreversible gene expression changes upon cessation of smoking; reversible changes were frequently associated with xenobiotic metabolism, nucleotide metabolism or mucus secretion. Increased expression of TFF3, CABYR, and ENTPD8 was found to be reversible upon smoking cessation. Expression of GSK3B, which regulates COX2 expression, was irreversibly decreased. MUC5AC expression was only partially reversed. Validation of select genes was performed using quantitative RT-PCR on a secondary cohort of nine current smokers, seven former smokers and six never smokers.
Expression levels of some of the genes related to tobacco smoking return to levels similar to never smokers upon cessation of smoking, while expression of others appears to be permanently altered despite prolonged smoking cessation. These irreversible changes may account for the persistent lung cancer risk despite smoking cessation.
Serial Analysis of Gene Expression (SAGE) is a powerful expression profiling method, allowing the analysis of the expression of thousands of transcripts simultaneously. A disadvantage of the method, however, is the relatively high amount of input RNA required. Consequently, SAGE cannot be used for the generation of expression profiles when RNA is limited, i.e. in small biological samples such as tissue biopsies or microdissected material. Here we describe a modification of SAGE, named microSAGE, which requires 500- to 5000-fold less starting material. Compared with SAGE, microSAGE is simplified due to incorporation of a 'single-tube' procedure for all steps from RNA isolation to tag release. Furthermore, a limited number of additional PCR cycles are performed. Using microSAGE, gene expression profiles can be obtained from minute quantities of tissue such as a single hippocampal punch from a rat brain slice of 325 micrometers thickness, estimated to contain, at most, 10^5 cells. This method opens up a multitude of new possibilities for the application of SAGE, for example the characterization of expression profiles in tissue biopsies, tumor metastases or other cases where tissue is scarce, and the generation of region-specific expression profiles of complex heterogeneous tissues.
Neural tube defects (NTDs) are common human birth defects with a complex etiology. To develop a comprehensive knowledge of the genes expressed during normal neurulation, we established transcriptomes from human neural tube fragments during and after neurulation using long Serial Analysis of Gene Expression (long-SAGE).
Rostral and caudal neural tubes were dissected from normal human embryos aged between 26 and 32 days of gestation. Tissues from the same region and Carnegie stage were pooled (n>=4) and total RNA extracted to construct four long-SAGE libraries. Tags were mapped using the UniGene Homo sapiens 17 bp tag-to-gene best mapping set. Differentially expressed genes were identified by chi-square or Fisher’s exact test and validation was performed for a subset of those transcripts using in situ hybridization. In silico analyses were performed with BinGO and EXPANDER.
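As an illustration of how a tag's counts in two libraries can be compared with a chi-square test, the sketch below computes the Pearson statistic for a 2x2 table (tag versus all other tags, library A versus library B) using only the standard library. It applies no continuity correction, and it is not the authors' analysis pipeline; the function name and the example counts are hypothetical. For low counts, Fisher's exact test, also mentioned above, is the usual alternative.

```python
import math

def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square (1 degree of freedom) for one tag in two libraries.

    count_a, count_b: occurrences of the tag in library A and B
    total_a, total_b: total tags sequenced in library A and B
    Returns (statistic, p_value); no continuity correction is applied.
    """
    a, b = count_a, total_a - count_a   # library A: tag, other tags
    c, d = count_b, total_b - count_b   # library B: tag, other tags
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                      # tag absent everywhere, or an empty library
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical example: 30 vs 10 occurrences in two libraries of 10,000 tags each.
stat, p = chi_square_2x2(30, 10_000, 10, 10_000)
print(round(stat, 2), p < 0.01)  # 10.02 True
```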
We observed most genes to be similarly regulated in rostral and caudal regions, but expression profiles differed during and after closure. In silico analysis found similar enrichments in both regions for biological process terms, transcription factor binding and miRNA target motifs. Twelve genes potentially expressing alternate isoforms by region or developmental stage, and the miRNAs miR-339-5p, miR-141/200a, miR-23ab, and miR-129/129-5p, are among several potential candidates identified here for future research.
Time appears to influence gene expression in the developing central nervous system more than location. These data provide a novel complement to traditional strategies of identifying genes associated with human NTDs, and offer unique insight into the genes associated with normal human neurulation.
gene expression; Homo sapiens; long-SAGE; neurulation; neural tube defects
We generated the gene expression profile of the total testis from the adult C57BL/6J male mice using serial analysis of gene expression (SAGE). Two high-quality SAGE libraries containing a total of 76 854 tags were constructed. An extensive bioinformatic analysis and comparison of SAGE transcriptomes of the total testis, testicular somatic cells and other mouse tissues was performed and the theory of male-biased gene accumulation on the X chromosome was tested.
We sorted out 829 genes predominantly expressed from the germinal part and 944 genes from the somatic part of the testis. The genes preferentially and specifically expressed in total testis and testicular somatic cells were identified by comparing the testis SAGE transcriptomes to the available transcriptomes of seven non-testis tissues. We uncovered chromosomal clusters of adjacent genes with preferential expression in total testis and testicular somatic cells by a genome-wide search and found that the clusters encompassed a significantly higher number of genes than expected by chance. We observed a significant 3.2-fold enrichment of the proportion of X-linked genes specific for testicular somatic cells, while the proportions of X-linked genes specific for total testis and for other tissues were comparable. In contrast to the tissue-specific genes, an under-representation of X-linked genes in the total testis transcriptome but not in the transcriptomes of testicular somatic cells and other tissues was detected.
Our results provide new evidence in favor of the theory of male-biased gene accumulation on the X chromosome in testicular somatic cells and indicate the opposite action of meiotic X-inactivation in testicular germ cells.
Sixteen longSAGE libraries from four different clinical stages of cervical intraepithelial neoplasia have enabled us to identify novel cell-surface biomarkers indicative of CIN stage. By comparing gene expression profiles of cervical tissue at early and advanced stages of CIN, several genes were identified as novel genetic markers. We present fifty-six cell-surface gene products differentially expressed during progression of CIN. These cell-surface proteins are being examined to establish their capacity for optical contrast agent binding. Contrast agent visualization will allow real-time assessment of the physiological state of the disease process, bringing vast benefit to cancer care. The data discussed in this publication have been submitted to NCBI's Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) and are accessible through GEO Series accession number GSE6252.
longSAGE; cervical cancer; biomarker; optical imaging
Once thought to be a part of the ‘dark matter’ of the genome, long non-coding RNAs (lncRNAs) are emerging as an integral functional component of the mammalian transcriptome. LncRNAs are a novel class of mRNA-like transcripts which, despite no known protein-coding potential, demonstrate a wide range of structural and functional roles in cellular biology. However, the magnitude of the contribution of lncRNA expression to normal human tissues and cancers has not been investigated in a comprehensive manner. In this study, we compiled 272 human serial analysis of gene expression (SAGE) libraries to delineate lncRNA transcription patterns across a broad spectrum of normal human tissues and cancers. Using a novel lncRNA discovery pipeline we parsed over 24 million SAGE tags and report lncRNA expression profiles across a panel of 26 different normal human tissues and 19 human cancers. Our findings show extensive, tissue-specific lncRNA expression in normal tissues and highly aberrant lncRNA expression in human cancers. Here, we present a first generation atlas for lncRNA profiling in cancer.
Recent observations indicate a potential role of the transcription factor STAT3 in cervical cancer development, but its role specifically with respect to HPV infection is not known. The present study was designed to investigate the expression and activation of STAT3 in cervical precancer and cancer in relation to HPV infection during cervical carcinogenesis. Established cervical cancer cell lines and prospectively collected cervical precancer and cancer tissues were analyzed for HPV positivity and evaluated for STAT3 expression and phosphorylation by immunoblotting and immunohistochemistry, whereas STAT3-specific DNA binding activity was examined by gel-shift assays.
Analysis of 120 tissues from cervical precancer and cancer lesions or from normal cervix revealed differentially high levels of constitutively active STAT3 in cervical precancer and cancer lesions, whereas it was absent in normal controls. Similarly, a high level of constitutively active STAT3 expression was observed in HPV-positive cervical cancer cell lines when compared to that of HPV-negative cells. Expression and activity of STAT3 were found to change as a function of severity of cervical lesions from precancer to cancer. Expression of active pSTAT3 was specifically high in cervical precancer and cancer lesions found positive for HPV16. Interestingly, site-specific accumulation of STAT3 was observed in basal and suprabasal layers of HPV16-positive early precancer lesions which is indicative of possible involvement of STAT3 in establishment of HPV infection. In HPV16-positive cases, STAT3 expression and activity were distinctively higher in poorly-differentiated lesions with advanced histopathological grades.
We demonstrate that in the presence of HPV16, STAT3 is aberrantly-expressed and constitutively-activated in cervical cancer which increases as the lesion progresses thus indicating its potential role in progression of HPV16-mediated cervical carcinogenesis.
Serial analysis of gene expression (SAGE) was applied to the malarial parasite Plasmodium falciparum to characterize the comprehensive transcriptional profile of erythrocytic stages. A SAGE library of ∼8335 tags representing 4866 different genes was generated from 3D7 strain parasites. Basic local alignment search tool analysis of high abundance SAGE tags revealed that a majority (88%) corresponded to 3D7 sequence, and despite the low complexity of the genome, 70% of these highly abundant tags matched unique loci. Characterization of these suggested the major metabolic pathways that are used by the organism under normal culture conditions. Furthermore, several tags expressed at high abundance (30% of tags matching to unique loci of the 3D7 genome) were derived from previously uncharacterized open reading frames, demonstrating the use of SAGE in genome annotation. The open platform “profiling” nature of SAGE also led to the important discovery of a novel transcriptional phenomenon in the malarial pathogen: a significant number of highly abundant tags that were derived from annotated genes (17%) corresponded to antisense transcripts. These SAGE data were validated by two independent means, strand specific reverse transcription-polymerase chain reaction and Northern analysis, where antisense messages were detected in both asexual and sexual stages. This finding has implications for transcriptional regulation of Plasmodium gene expression.
Serial Analysis of Gene Expression (SAGE) is a powerful tool to determine gene expression profiles. Two types of SAGE libraries, ShortSAGE and LongSAGE, are classified based on the length of the SAGE tag (10 vs. 17 basepairs). LongSAGE libraries are thought to be more useful than ShortSAGE libraries, but their information content has not been widely compared. To dissect the differences between these two types of libraries, we utilized four libraries (two LongSAGE and two ShortSAGE libraries) generated from the hippocampus of Alzheimer and control samples. In addition, we generated two additional short SAGE libraries, the truncated long SAGE libraries (tSAGE), from LongSAGE libraries by deleting seven 5' basepairs from each LongSAGE tag.
One problem in SAGE studies is that an individual tag may match multiple different genes because of its short length. We found that a LongSAGE tag maps to up to 15 UniGene clusters, while ShortSAGE and tSAGE tags map to up to 279 UniGene clusters. Both long and short SAGE libraries exhibit a large number of orphan tags (no gene information in UniGene), implying a limitation of the UniGene database. Among 100 orphan LongSAGE tags, the complete sequences (17 basepairs) of nine orphan tags match 17 genomic sequences; four of the orphan tags match a single genomic sequence. Our data show the potential to resolve 4–9% of orphan LongSAGE tags. Finally, among 400 tSAGE tags showing significant differential expression between AD and control, 79 tags (19.8%) were derived from multiple non-significant LongSAGE tags, implying false-positive results.
Our data show that LongSAGE tags have higher specificity in gene mapping than ShortSAGE tags. LongSAGE tags show an advantage over ShortSAGE tags in identifying novel genes by BLAST analysis. Most importantly, the chances of obtaining false-positive results are higher for ShortSAGE than for LongSAGE libraries because of their lower specificity in gene mapping. Therefore, it is recommended that the number of UniGene clusters (genes or ESTs) corresponding to a tag be considered when prioritizing significant results.
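The construction of truncated (tSAGE) libraries, and the way several distinct LongSAGE tags can collapse into one short tag, can be sketched as follows. The sketch follows the description above literally (seven bases removed from the 5' end of each 17-bp tag); the offset is a parameter in case a different convention applies. Function names and tag sequences are hypothetical.

```python
from collections import Counter, defaultdict

def truncate_library(long_counts, drop_5prime=7):
    """Collapse long tags into short tags by removing `drop_5prime` leading bases.

    Counts of long tags that share the same truncated sequence are summed.
    """
    short = Counter()
    for tag, n in long_counts.items():
        short[tag[drop_5prime:]] += n
    return short

def parents_by_short_tag(long_counts, drop_5prime=7):
    """Map each truncated tag to the sorted list of long tags it came from."""
    parents = defaultdict(list)
    for tag in long_counts:
        parents[tag[drop_5prime:]].append(tag)
    return {short: sorted(tags) for short, tags in parents.items()}

# Toy LongSAGE library: two distinct 17-bp tags share the same last 10 bases.
long_library = {
    "AAAAAAA" + "CCCCCCCCCC": 3,
    "GGGGGGG" + "CCCCCCCCCC": 2,
    "TTTTTTT" + "ACGTACGTAC": 1,
}
short_library = truncate_library(long_library)
ambiguous = {s: p for s, p in parents_by_short_tag(long_library).items() if len(p) > 1}
```

Short tags listed in `ambiguous` pool counts from more than one long tag, which is the situation in which a short tag can appear differentially expressed although none of its parent long tags is.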
"Open" transcriptome analysis methods allow the study of gene expression without a priori knowledge of the transcript sequences. As of now, SAGE (Serial Analysis of Gene Expression), LongSAGE and MPSS (Massively Parallel Signature Sequencing) are the most widely used methods for "open" transcriptome analysis. Both LongSAGE and MPSS rely on the isolation of 21 bp tag sequences from each transcript. In contrast to LongSAGE, the high-throughput sequencing method used in MPSS enables the rapid sequencing of very large libraries containing several million tags, allowing deep transcriptome analysis. However, a bias in the complexity of the transcriptome representation obtained by MPSS was recently uncovered.
To perform a deep analysis of the mouse hypothalamus transcriptome while avoiding the limitation introduced by MPSS, we combined LongSAGE with the Solexa sequencing technology and obtained a library of more than 11 million tags. We then compared it to a LongSAGE library of mouse hypothalamus sequenced with the Sanger method.
We found that Solexa sequencing technology combined with LongSAGE is perfectly suited for deep transcriptome analysis. In contrast to MPSS, it gives a complex representation of transcriptome as reliable as a LongSAGE library sequenced by the Sanger method.
Serial analysis of gene expression (SAGE) is a widely used and powerful technique to characterize and compare transcriptomes. Although several modifications have been proposed to the initial protocol with the aim of reducing the amount of starting material, unless additional PCR steps are added, the technique is still limited by the need for at least 1 µg of total RNA. As extra PCR amplification might introduce representation biases, current SAGE protocols are not fully suitable for the study of small, microdissected tissue samples. We propose here an alternative method involving the linear amplification of small mRNA fragments containing the SAGE tags. The procedure allows preparation of libraries of over 100 000 tags from as few as 2500 cells. A satisfactory correlation was observed between a microSAGE library made from 5 µg of total thyroid RNA, and a library prepared from 50 ng of the same RNA preparation according to the present protocol.
Serial analysis of gene expression (SAGE) is a powerful quantification technique for gene expression data. The huge
amount of tag data in SAGE libraries of samples is difficult to analyze with current SAGE analysis tools. Data is often not
provided in a biologically significant way for cross‐analysis and ‐comparison, thus limiting its application.
Hence, an integrated software platform that can perform such a complex task is required. Here, we implement set theory for
cross‐analyzing gene expression data among different SAGE libraries of tissue sources; up‐ or down‐regulated
tissue‐specific tags can be identified computationally. Extract‐SAGE employs a genetic algorithm (GA) to reduce the
number of genes among the SAGE libraries. Its representative tag mining will facilitate the discovery of the candidate genes with
discriminating gene expression.
This software and user manual are freely available at
SAGE; genetic algorithm; set theory; software
Serial analysis of gene expression (SAGE) is a powerful tool
that provides a quantitative and comprehensive expression profile
of genes in a given cell population. It works by isolating short
fragments of genetic information from the expressed genes that are
present in the cell being studied. These short sequences, called
SAGE tags, are linked together for efficient sequencing. The
frequency of each SAGE tag in the cloned multimers directly
reflects the transcript abundance. Therefore, SAGE results in an
accurate picture of gene expression at both the qualitative and
the quantitative levels. It does not require a hybridization
probe for each transcript and allows new genes to be discovered.
This technique has been applied widely in human studies and
various SAGE tags/SAGE libraries have been generated from
different cells/tissues such as dendritic cells, lung fibroblast
cells, oocytes, thyroid tissue, B-cell lymphoma, cultured
keratinocytes, muscles, brain tissues, sciatic nerve, cultured
Schwann cells, cord blood-derived mast cells, retina, macula,
retinal pigment epithelial cells, skin cells, and so forth. In
this review we present updated information on the
applications of SAGE technology, mainly in human studies.
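The statement that tag frequency in the cloned multimers reflects transcript abundance can be illustrated with a small counting sketch. It assumes the layout of the original SAGE protocol, which this review does not spell out: ditags separated by the 4-base anchoring site CATG, with a 10-base tag at each end of a ditag and the second tag read on the opposite strand. All function names are hypothetical, and real pipelines also remove duplicate ditags and linker sequences, which is omitted here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer, anchor="CATG", tag_len=10):
    """Split a sequenced concatemer into ditags and return the tags.

    Simplification: a tag that itself contains the anchor sequence would be
    split incorrectly; this sketch does not handle that case.
    """
    tags = []
    for ditag in concatemer.split(anchor):
        if len(ditag) < 2 * tag_len:
            continue  # skip empty pieces and fragments too short for two tags
        tags.append(ditag[:tag_len])
        tags.append(reverse_complement(ditag[-tag_len:]))
    return tags


def tag_frequencies(concatemers):
    """Relative frequency of each tag across all sequenced concatemers."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(tags_from_concatemer(concatemer))
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```

The relative frequencies returned by `tag_frequencies` are the quantity that is read as an estimate of transcript abundance.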
The SAGE (serial analysis of gene expression) method is sensitive in detecting low-abundance transcripts. More than a third of the human SAGE tags identified are novel and represent unknown low-abundance transcripts. Using the GLGI method (generation of longer 3′ EST from SAGE tag for gene identification), we converted 1,009 low-copy, human X chromosome-specific SAGE tags into 3′ ESTs. We identified 3,418 unique 3′ ESTs, 46% of which are novel and originated from low-abundance transcripts. However, nearly all of the 3′ ESTs mapped to regions across the genome other than the X chromosome. Detailed analysis indicates that those 3′ ESTs were isolated by mis-priming of SAGE tags to non-parent transcripts. Replacing the SAGE tags with non-transcribed genomic DNA tags resulted in poor amplification, indicating that sequence similarity between different transcripts contributed to the amplification. Our study shows the prevalence of novel low-abundance transcripts that can be isolated efficiently through SAGE tag mis-priming.
transcript; low abundance; SAGE tag; 3′ EST
Serial Analysis of Gene Expression (SAGE) is a high-throughput method for inferring mRNA expression levels from experimentally generated, sequence-based tags. Standard analyses of SAGE data, however, ignore the fact that the probability of generating an observable tag varies across genes and between experiments. As a consequence, these analyses result in biased estimators and posterior probability intervals for gene expression levels in the transcriptome.
Using the yeast Saccharomyces cerevisiae as an example, we introduce a new Bayesian method of data analysis which is based on a model of SAGE tag formation. Our approach incorporates the variation in the probability of tag formation into the interpretation of SAGE data and allows us to derive exact joint and approximate marginal posterior distributions for the mRNA frequency of genes detectable using SAGE. Our analysis of these distributions indicates that the frequency of a gene in the tag pool is influenced by its mRNA frequency, the cleavage efficiency of the anchoring enzyme (AE), and the number of informative and uninformative AE cleavage sites within its mRNA.
With a mechanistic, model-based approach to SAGE data analysis, we find that inter-genic variation in SAGE tag formation is large. However, this variation can be estimated and, importantly, accounted for using the methods we develop here. As a result, SAGE-based estimates of mRNA frequencies can be adjusted to remove the bias introduced by the SAGE tag formation process.
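To make the dependence on cleavage efficiency and on informative versus uninformative AE sites concrete, here is a deliberately simplified sketch. It is not the Bayesian model of the abstract. It assumes that each AE site is cut independently with the same probability, that the tag comes from the 3′-most site that was actually cut, and that only informative sites yield a tag attributable to the gene. The function names are hypothetical.

```python
def tag_formation_probability(site_is_informative, cleavage_efficiency):
    """Probability that one mRNA molecule yields an informative tag.

    `site_is_informative` lists the AE sites from the 3'-most site to the
    5'-most site; True marks a site whose tag identifies the gene.
    Assumption: the tag comes from the first site (from the 3' end) that is
    cut, and each site is cut independently with `cleavage_efficiency`.
    """
    prob = 0.0
    all_downstream_uncut = 1.0
    for informative in site_is_informative:
        if informative:
            prob += all_downstream_uncut * cleavage_efficiency
        all_downstream_uncut *= 1.0 - cleavage_efficiency
    return prob


def adjusted_frequencies(tag_counts, formation_probs):
    """Rescale observed counts by 1 / P(tag formation) and renormalize.

    Genes with zero formation probability cannot be observed and are skipped.
    """
    weights = {
        gene: count / formation_probs[gene]
        for gene, count in tag_counts.items()
        if formation_probs[gene] > 0
    }
    total = sum(weights.values())
    return {gene: w / total for gene, w in weights.items()}
```

Under these assumptions, a gene whose only informative site lies behind an uninformative 3′-most site is under-represented in the tag pool, and dividing by the formation probability compensates for that. This is a point-estimate correction only; it does not produce the posterior distributions described above.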
Establishing more effective treatment of pancreatic cancer requires an understanding of the molecular events leading to the onset and progression of this disease. The biology of tumorigenesis may be better understood if cell type–specific genes in the pancreas are better characterized. Identifying such genes may be as important as discovering a disease-responsible gene. Identification of a ductal epithelium–specific gene can contribute to our knowledge of pancreatic tumorigenesis, tumor marker discovery, and effective drug targeting, and is also crucial for making a reliable animal model.
We used the x-Profiler engine online to compare the SAGE (Serial Analysis of Gene Expression) libraries derived from 2 short-term cultures of normal human ductal epithelial cells from the pancreas against 34 other SAGE libraries generated from other normal human tissues to identify the best candidate gene specific for the ductal epithelium of the pancreas.
We identified 3 genes, ribosomal protein L38 (RPL38), uridine phosphorylase (UPP1), and FOS-like antigen-1 (FOSL1), predominantly expressed in the pancreatic ductal epithelium. The expression patterns of these 3 genes were confirmed by virtual Northern analysis, semi-quantitative RT-PCR, and in situ hybridization.
Although the expression of these 3 genes is not completely restricted to the ductal epithelium of the pancreas, we showed that they have more specific expression patterns than CK19 and MUC1. We also demonstrated that all 3 genes are highly expressed in a panel of pancreatic cancer cell lines and can potentially be useful in tumor targeting or as tumor markers.
SAGE; pancreas-specific; RPL38; FOSL1; UPP1
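The comparison of the 2 ductal libraries against 34 other libraries can be sketched as a pooled enrichment ranking. This is only an illustration of the general idea and is not a description of how x-Profiler works. It assumes each library is a mapping from tag to count, and the function names and the pseudocount are hypothetical choices.

```python
def pooled_tpm(libraries):
    """Pool several {tag: count} libraries and normalize to tags per million."""
    pooled = {}
    for lib in libraries:
        for tag, n in lib.items():
            pooled[tag] = pooled.get(tag, 0) + n
    total = sum(pooled.values())
    return {tag: 1e6 * n / total for tag, n in pooled.items()}


def rank_enriched_tags(target_libs, background_libs, pseudocount=1.0):
    """Rank tags by (target TPM + pseudocount) / (background TPM + pseudocount).

    The pseudocount keeps the ratio finite for tags absent from the background.
    """
    target = pooled_tpm(target_libs)
    background = pooled_tpm(background_libs)
    scores = {
        tag: (tpm + pseudocount) / (background.get(tag, 0.0) + pseudocount)
        for tag, tpm in target.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Tags at the top of the ranking are only candidates; as in the abstract, they would still need confirmation by independent methods such as RT-PCR or in situ hybridization.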