This study was conducted according to the principles expressed in the Declaration of Helsinki. The study was approved by the Institutional Review Board of the Mayo Clinic. All patients provided written informed consent for the collection of samples and subsequent analysis.
Collection and Processing of Tissues
All tissues used in this study were collected from patients undergoing treatment at Mayo Clinic, Rochester, MN. Tumor samples were obtained from patients undergoing surgical treatment for oral squamous cell carcinoma (OSSC). Normal samples were collected from the negative surgical margins. Following surgical excision, a portion of the tissue was immediately processed and snap-frozen in liquid nitrogen for storage and future use. The remainder was processed for clinical examination and long-term storage in the tissue archives of Mayo Clinic, Rochester MN, according to standard clinical protocols. All patients consented to the use of their tissue for research and this study was approved by the Mayo Clinic Institutional Review Board. Each collected sample was frozen sectioned, changing the blade between samples, and mounted on positively charged glass slides that were stained with haematoxylin and eosin (H&E) by the Mayo Tissue and Cell Molecular Analysis Core facility. These slides were then evaluated by a qualified Mayo Pathologist (J. Lewis) to confirm the presence or absence of tumor in each sample. Appropriate tumor was circled and extreme care was taken to obtain only tumor-containing sections for subsequent isolation of DNA or RNA.
DNA and RNA Extraction
Frozen tissues were compared to corresponding H&E slides following evaluation by pathology to verify classification of tumor or normal tissue status. Portions of tissue were removed for nucleic acid extraction, using disposable scalpels, in quantities <30mg. DNA was extracted from frozen tissue using the Invitrogen PureLink Genomic DNA Mini kit (Carlsbad, CA) according to the manufacturer's protocol. Total RNA was extracted from portions of the frozen tissue samples using the Qiagen RNAeasy Plus Kit (Valencia, CA) according to the manufacturer's protocol. Isolated DNA and RNA were quantified by NanoDrop ND1000 (ThermoFisher Scientific, Waltham, MA). RNA samples were further assessed for quality using the Agilent 2100 Bioanalyzer (Santa Clara, CA) prior to library construction.
RNA Library Preparation
To construct libraries suitable for SOLiD™ System sequencing, 5 ug of total RNA was depleted of 18S and 28S rRNA using GLOBINclear™ (Ambion) buffers and reagents supplemented with biotinylated capture probes designed against these rRNAs and following the given protocol. The rRNA depleted samples (~1 ug) were then fragmented by incubation with 1 unit of RNase III (Ambion) for 10 minutes in a 10 ul reaction volume containing 1X RNase III buffer supplied with the enzyme. The samples were then mixed with formamide gel loading dye and denatured for 10 min at 95°C and then separated on a flashPAGE™ gel apparatus using a modified procedure. The flashPAGE™ gel was first run for 15 minutes as per the given procedure and conditions. The lower running buffer was then removed and the lower chamber rinsed with nuclease-free water 2 times. The lower chamber was then replenished with fresh buffer and the gel was run for an additional 45 minutes. The lower running buffer was then removed and the RNA was purified using the flashPAGE™ clean up kit, producing RNA fragments ranging from ~50–150 nt in size. This RNA was then used with the SOLiD Small RNA Expression Kit (Ambion) as per the given protocol, except the size range of products purified from the 6% native PAGE step was ~140–200 bp in size. The final purified products were quantitated using a nanodrop and the size range of the products was confirmed by bioanalyzer analysis. The samples were then diluted and used for emulsion PCR.
DNA Library Preparation
10 ug of each DNA sample was used to generate mate-paired libraries with a 2.5 kb insert size using standard manufacturer protocols. Briefly, DNA was sheared to a target size of 2.5 kb using a HydroShear® (Genomic Solutions). The resulting fragments were end repaired using the End-It™ (Epicentre) kit, methylated to protect EcoP151 sites, and ligated to CAP adapters. The DNA was then size selected by electrophoresis on a 1% agarose gel, and a band 2 kb to 3 kb was excised from the gel using a scalpel blade. DNA was recovered from the gel using QIAquick Gel Extraction Kit (Qiagen). The resulting DNA fragments were circularized by ligation to an internal adapter, and 25–27 bp ‘mates’ created by digestion with the type III restriction enzyme EcoP151. Double stranded P1 and P2 sequencing adapters were then ligated to the library and nick translated before final amplification using 14–15 cycles of PCR. Libraries were again purified by electrophoresis on a 3% agarose gel, excising the appropriate library band and recovering the DNA using the QIAquick Gel Extraction Kit (Qiagen). The size, quantity and quality of the resulting libraries were confirmed by analysis on a 2100 Bioanalyzer (Agilent) using a DNA 1000 chip before the library was diluted and used for emulsion PCR.
Templated beads were generated for sequencing using standard manufacturers' protocols. Briefly, an aqueous phase was prepared from the SOLiD ePCR kit containing AmpliTaq Gold DNA Polymerase UP, buffer, MgCl2, dNTP's, amplification primers and library template. The aqueous phase was then introduced to a whirling oil phase in an ULTRA-TURRAX® Turbo Drive (IKA) to create a water-in-oil emulsion. The emulsion was then transferred to a 96 well plate and thermocycled using the recommended PCR conditions. After PCR amplification, emulsions were broken using butanol, and the beads were washed, enriched, and terminal transferased before quantification and deposition onto a slide for sequencing.
Whole Transcriptome Sequencing
Templated beads were deposited onto one full slide per sample. Massively parallel ligation sequencing was carried out to 50 bases using Applied Biosystems SOLiD System (V3 chemistry) and following the manufacturer's instructions.
Mate-Pair Genome Sequencing
Templated beads for the normal and tumor samples were deposited across two sequencing slides, three quadrants per sample. Both forward and reverse tags from the mate-paired libraries were sequenced to 25 bases, using Applied Biosystems SOLiD System (V3 chemistry) and following the manufacturer's instructions.
Alignment of Transcriptome Fragment Reads
Whole transcriptome reads were aligned using AB's SOLiD Whole Transcriptome Pipeline 
. This software is open-source and freely available (http://solidsoftwaretools.com/gf/project/transcriptome/
). An overview of the alignment strategy is presented in Figure S2
. In all the analyses of gene expression presented here, only uniquely aligned reads were considered. A “uniquely aligned” read is defined as a read with a max scoring alignment to the genome scoring (1) at least 24 and (2) at least four higher than any of the other alignments of that read to the genome. Exon locations described in the text were taken from the alignments of RefSeq transcripts to version 18 of the human genome sequence (hg18), and are available at the UCSC Genome Browser website (http://genome.ucsc.edu/
). All sequence data has been deposited at the MIAME compliant Gene Expression Omnibus (GEO) database at National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/geo
) and is accessible through accession number GSE20116.
Alignment of Genomic Mate-Pair Reads
Mapping and pairing were performed with Applied Biosystems' alignment and pairing package (corona_lite, http://solidsoftwaretools.com/gf/project/corona/
). The AB SOLiD alignment tool, mapreads, translates the reference sequence to dibase encoding (color space) and aligns reads in color space. The program guarantees finding all alignments between a read and the reference sequence with up to M mismatches (a user specified parameter, which was set to two here). Mapreads uses multiple spaced seeds (discontinuous word patterns) to achieve a rapid running time (Zhang et al., in preparation). Reads that align to the color space reference in only one location with up to two mismatches are referred to as “uniquely aligned”. Mate-pair reads were aligned individually and subsequently a pairing process was conducted to pair the individual reads. Pairing rescue was also performed, which uses one aligned tag as an anchor and searches for the other tag in a nearby window (of 1.4–3.7 kb, either upstream or downstream, depending on orientation of the anchored tag) with more relaxed criteria.
Quantification of RefSeq Transcripts from RNA-Seq Data
For each of the 18,095 RefSeq transcripts, reads uniquely aligned within its genome-mapped exons were summed. One pseudo-count was added to this sum and the resulting modified raw transcript count was divided by the total number of uniquely aligned reads for the sample, yielding normalized transcript counts for each RefSeq transcript in each sample. Normalized transcript counts and TvN fold-changes can be found in Table S1
. Because it is not yet clear whether genes expressed at the very lowest levels in a tissue are accurately measured by these methods, we applied a conservative raw count filter throughout our analysis of the RNA-Seq data, requiring that a transcript have at least fifty uniquely aligned reads in at least one tissue (tumor or normal) in all three patients. We did not normalize for transcript length because all further analyses focused on fold-changes between two conditions.
Defining the Sets of Genes Mis-Regulated Across the Patients
To isolate the set of genes commonly mis-regulated in the development of OSCC, we rank ordered the TvN fold-changes for each patient. We then ranked transcripts by their median TvN rank across patients, considering further only the three hundred highest and three hundred lowest ranking genes as the sets of genes commonly up-regulated and down-regulated in OSCC development, respectively. Gene sets can be found in Table S3
. The three hundred most up-regulated and down-regulated genes have median fold-change cutoffs of at least 3-fold, though typically the median fold changes are much higher, with a mean of 6-fold for the up-regulated genes and 33-fold for down-regulated set. A simulation, in which the pairing of transcripts and expression levels is randomly shuffled, indicates that less than 10% of our down-regulated and 20% of our up-regulated genes would have as extreme a median rank if there was no similarity in TvN expression profiles between patients (not shown). Thus, we estimate that these sets contain relatively few false positive genes, which are not truly differentially expressed across these three patients.
Although rank ordering was our original approach, we have also employed a recently proposed likelihood ratio test combined with a fold-change cutoff to define sets of mis-regulated genes 
). Using this statistical test (specifically, an FDR cutoff of 10−4
and 4-fold change requirement) results in very similar lists of up- and down-regulated genes. We performed GO analysis on the sets of up- and down-regulated genes defined in this manner and found that the resulting enriched biological functions are also very similar to before, changing none of our conclusions.
Gene Expression Validation by Microarray
mRNA samples were reverse transcribed by priming off the polyA tail, producing cDNA pools. Each cDNA pool was amplified by in vitro
transcription, generating biotin-labeled cRNA that was then hybridized to an Illumina HumanHT-12 v3 BeadChip array. Arrays were subsequently stained with Streptavidin-Cy3 and scanned with a high resolution Illumina scanner to determine fluorescence intensities. The raw intensity files from the arrays were pre-processed using Illumina's BeadStudio software for background subtraction and quantile normalization, producing normalized intensity values for each gene. Differential gene expression values, between the tumor and normal tissues of a particular patient, were compared with the corresponding values from RNA-Seq data (Table S2
). The differential expression of a gene was calculated by applying a log2
transform to the ratio of modified, normalized intensities in the tumor and normal samples of a patient. The modified, normalized intensity for a gene was derived from its normalized intensity by subtracting the minimum normalized signal intensity across all genes in the sample and adding a value of one (to avoid dividing by zero). All microarray data is MIAME compliant and the raw data has been deposited at the MIAME compliant Gene Expression Omnibus (GEO) database at National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/geo
) and are accessible through accession number GSE19089.
Gene Expression Validation by Reverse Transcription Quantitative PCR (RT-qPCR)
cDNA was produced using 2 ug of isolated RNA and the High Capacity cDNA Reverse Transcription Kit with RNase Inhibitor (P/N 4374966, Applied Biosystems, Foster City, CA). A total of 250 ng of the cDNA product was preamplified using the Applied Biosystems TaqMan® PreAmp Master Mix Kit according to standard protocol (P/N 4391128). Briefly, 250 ng of cDNA was amplified for 14 cycles with a pool of 20X TaqMan Gene Expression Assays specific to the target genes (P/Ns 4331182 and 4351372, Applied Biosystems). Following preamplification, the product was diluted 1:20 in 1 X TE buffer. The diluted preamplified cDNA was used in individual quantitative PCR reactions including TaqMan gene expression assays to measure the expression levels of target genes. To normalize the expression levels of each target gene, the cycle threshold (CT) of an abundantly-expressed control gene (GADPH, GUSB, PGK1, TBP; P/Ns 4333764F, 4333767F, 4333765F and 4333769F) was subtracted from the CT for each target gene of interest. This value was then used to calculate log gene expression changes across samples/conditions (ddCT).
Allelic Imbalance Analysis and Validation
Allelic imbalance (AI) is a difference in the nucleotide frequencies at a given genomic position between two RNA samples. To quantify AI at each position in the genome, our method first tallies di-colors from reads aligned to that position that represent either a reference nucleotide or a single nucleotide substitution, filtering from further analysis invalid di-colors, which likely represent sequencing errors or more complex mutations. The nucleotide frequencies are then compared between samples and the significance of the AI is determined by applying a χ2 test of independence (on a 2×4 contingency table).
In our first attempt to investigate AI we only required that each genomic position have: (1) at least 15x coverage in the tumor and normal samples, (2) significant AI with χ2 P
-value less than 10−3
, and (3) more than a 10% change in absolute frequency of at least one allele. Using these criteria, our first round validation of 27 genomic positions by allele-specific PCR (as-PCR) yielded a 67% false positive rate (Figure S6
), 72% of which could be attributed to false heterozygous genotypes (i.e., as-PCR of genomic DNA showed only one allele was present, whereas RNA-Seq found two). Upon careful examination we realized that two characteristics were prevalent amongst the false positives and not amongst the true positives. The first characteristic of false positives is that most have more than 90% of the reads supporting at least one nucleotide aligned to identical positions in the genome. In many of these cases a non-reference nucleotide was observed in only one of the two samples (normal or tumor) and that nucleotide was not present in dbSNP. We think these cases are likely attributable to errors introduced during reverse transcription that are subsequently amplified and over-sampled in later steps of the experimental protocol. Thus, in our second attempt to investigate allelic imbalance we added another criterion, requiring that at each genomic position, (4) less than 90% of the reads supporting each nucleotide can be aligned to identical positions in the genome. The second characteristic of false positives is that some tend to occur towards the end of alignments where there is some uncertainty about the pairing of read and genomic sequence. This mis-pairing can arise at a splice junction boundary when there is sequence similarity between the 5′ end of the intron and the 5′ end of the downstream exon. Thus in our second attempt to investigate allelic imbalance we conservatively discarded the last five colors of each read alignment. Employing this added criterion and filter reduced the false positive rate to 7%, while generating only one false negative (false negative rate is 4%) on the first round data. A second round of validation was also undertaken on a new set of 34 genomic positions, chosen prospectively with these revised criteria. For this second validation round we saw the false positive rate drop to 43%, only 27% of which could be attributed to false heterozygous genotypes from RNA-Seq. Thus these revised criteria have considerably improved our specificity and reduced artifacts likely arising from errors in reverse transcription that are subsequently over-sampled during sequencing.
Allelic Imbalance Validation by Allele-Specific qPCR (as-qPCR) and Allele-Specific Reverse Transcription qPCR (as-RT-qPCR)
cDNA was produced by reverse transcription using the High Capacity Reverse Transcription Kit (P/N 4322171, Applied Biosystems) and manufacturer's instructions. as-qPCR and as-RT-qPCR were performed on an Applied Biosystems 7900HT Sequence Detection System. The 10 µl PCR mixture contains diluted RT products (for as-RT-qPCR) or 3 ng genomic DNA (for as-qPCR), 1x TaqMan® Genotyping Master Mixture (P/N 4371357, Applied Biosystems), 0.3 µM allele-specific forward primer, 0.2 µm TaqMan® probe, and 0.9 µm reverse primer. The reactions were incubated in 384-well plate at 95°C for 10 minutes, followed by 2 cycles of 95°C for 15 seconds and 58°C for 1 minute, and 48 cycles of 95°C for 15 seconds and 60°C for 1 minute. All allele-specific forward primers, TanMan® probes and reverse primers were manufactured at Applied Biosystems. For each genomic coordinate of interest, each allele was quantified (in each gDNA and cDNA sample) by obtaining CT values from at least two (and typically three) replicate reactions. The mean across reactions was computed and a dCT value calculated by subtracting the mean CT value for the second allele from the mean CT value for the first allele. Genomic coordinates with gDNA assays yielding absolute dCT values greater than 4.0 were deemed to be homozygous for the allele with lower CT value, while all others were assigned a heterozygous genotype (Figure S6a
). Allelic imbalance of RNA between tumor and normal samples was estimated by calculating a ddCT value for each genomic coordinate assayed, by simply subtracting the dCT value of the normal cDNA sample from the corresponding dCT of the tumor cDNA sample. ddCT values were then compared to allelic imbalance measurements made by RNA-Seq (Figure S6
Analysis of Copy Number Variation (CNV)
The ratio of the number of uniquely aligned reads in paired tumor and normal samples for any given genomic sequence window is an estimate of the copy number change in the window. We modified Segseq 
, a recently developed CNV calling and segmentation algorithm that is based on this principle, to handle SOLiD system sequencing reads. The SegSeq implementation can not effectively handle the large volumes of data produced here due to memory limitations, so we divided our original datasets (~50M reads per sample, ~0.8x sequence coverage) into random subsets (~12M reads each, ~0.2x sequence coverage). We then applied the modified algorithm to each subset of the mate-pair reads from the normal and tumor samples of Patient 8. Reassuringly, the algorithm produced very similar results on the random subsets of data (not shown). Default parameters for Segseq were used: the size of local windows (-W
; the size of alignable windows (-d, -e
100 kb. The P-
value cutoffs, pinit
, that control genome wide false positive CNV segments, were set such that we generated 1 false positive segment from about 20–30 false positive initial break points (see Chiang et al. 
for details). Segments are listed in Table S6
Calculation of Differential Gene Expression Across Copy-Number Segments
To compare the CN changes observed between normal and tumor tissue to changes in gene expression, we calculated the differential expression of each genomic segment simply by summing the number of reads uniquely aligned to this region in the tumor sample and dividing it by the number of uniquely aligned reads to this region in the normal tissue sample. The resulting ratio was normalized by the ratio of the total reads uniquely aligned for each sample.
Copy Number Validation by qPCR
TaqMan Copy Number (CN) Assays were performed according to the manufacturer's instructions (P/Ns 4400293 and 4400296, Applied Biosystems, Foster City, CA). In all, the copy numbers of 23 genes of interest were measured across 23 samples, derived from six normal tissues, 14 tumors, and two Coriell Institute for Medical Research (Camden, NJ) gDNA controls (see Table S7
for assay details). For each gene of interest, four replicate CT values were obtained per sample with FAM™-labeled and VIC®-labeled probes, assaying the gene of interest and a control gene, respectively. Both RNaseP and TERT were used as controls in separate assays (P/Ns 4403328 and 4403315, Applied Biosystems). Four dCT values were calculated by subtracting each VIC CT from the corresponding FAM CT. A single ddCT value, estimating the CN fold-change in a particular tumor sample, was calculated by subtracting the median of median dCTs of the six normal samples from the median dCT value for that particular tumor sample. The null hypothesis that the CN fold-change is zero was tested by a t-test in which variance was estimated by pooling the 24 dCT values from normal samples. This hypothesis was rejected and the CN fold-change deemed significant for p<0.05. Genes with significant CN fold-changes in two separate assays (one using the RNaseP locus as a control and the other using TERT) are highlighted in Figure S7
Copy Number Validation by Microarray (aCGH)
aCGH was performed using the Agilent Human Genome Microarray Kit 244K (Agilent Technologies, Santa Clara, CA) which contains ~244,000 60-mer oligonucleotide probes spanning coding and non-coding genomic sequences with median spacing of 7.4 and 16.5 kb respectively. Arrays were analyzed using the Genepix 4200A scanner (Axon Instruments, Union City, CA) and the Agilent Feature Extraction software (v9.1). Copy number segments were obtained with the Agilent CGH Analytics software (v.3.4), using the ADM-1 algorithm and default settings