This report describes our development of RNA-Seq methods suitable for biomarker discovery in fixed clinical tissue. The library chemistry described accommodates low amounts of archival FFPE tissue RNA, preserves strand-of-origin information, is compatible with sample indexing, and quantifies transcripts with sufficient sensitivity and precision for biomarker discovery in valuable clinical tissue specimens. Data analysis methods were selected from a number of tested options. Results generated using 5–12-year-old FFPE tumor tissue from 136 breast cancer patients are concordant with RT-PCR data. Study results also indicate that effective biomarker identification is possible with multiplexed samples.
We identify RNAs that are new putative markers of breast cancer recurrence risk; singly and in sets, they frame hypotheses to test in later studies of other breast cancer patient cohorts. While we associated ~1300 RefSeq RNAs (which are mostly mRNAs) with breast cancer prognosis (at FDR<10%), more than half of the total RNAs identified as prognostic lie in the ~98% of the genome that does not code for proteins. It is noteworthy that, for most of the intronic RNAs identified as prognostic, the cognate assembled exons were not also identified as prognostic, consistent with the possibility that these intronic sequences carry biomarker information not captured in gene coding sequence. Most of the identified non-coding RNAs are very long sequences that have low counts per kilobase, and the power for identifying longer sequences is expected to be higher because of their greater total counts. However, the effect size of each evaluated RNA (the hazard ratio for its association with recurrence) is estimated by comparing expression of that sequence among patients. To the extent that shorter RNAs are handicapped in signal strength, the handicap applies equally across patients for a given RNA species, so it does not bias the analysis of each individual sequence. Future studies of other breast cancer cohorts will reveal whether rare transcripts identified here prove to be robust biomarkers.
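The relationship between sequence length, total counts, and detection power can be illustrated with a minimal sketch. It assumes simple Poisson sampling noise, which is a simplification and not a description of the statistical model used in this study; the function names and the example densities and lengths are hypothetical.

```python
import math


def expected_counts(counts_per_kb: float, length_kb: float) -> float:
    """Total expected read count for a sequence at a given per-kilobase density."""
    return counts_per_kb * length_kb


def poisson_cv(total_counts: float) -> float:
    """Relative sampling error (coefficient of variation) of a Poisson count.

    For a Poisson variable with mean n, the standard deviation is sqrt(n),
    so the relative error is 1 / sqrt(n).
    """
    if total_counts <= 0:
        raise ValueError("total_counts must be positive")
    return 1.0 / math.sqrt(total_counts)


# Hypothetical example: two sequences with the same low density (2 counts/kb)
# but very different lengths.
short_total = expected_counts(2.0, 1.0)    # 1 kb sequence
long_total = expected_counts(2.0, 50.0)    # 50 kb sequence

print(f"short: {short_total:.0f} counts, relative error {poisson_cv(short_total):.3f}")
print(f"long:  {long_total:.0f} counts, relative error {poisson_cv(long_total):.3f}")
```

Under this assumption the longer sequence is measured with a smaller relative error in every patient, which is why more power is expected for long RNAs even when counts per kilobase are low. The length factor is the same for a given sequence in all patients, so it affects precision without shifting the between-patient comparison that the hazard ratio rests on.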
To analyze intergenic transcripts, we evaluated a set of lincRNAs described in the recent literature and also transcripts identified by a new algorithm that interrogates an entire population of transcriptomes to identify intergenic transcripts based on transcript abundance and density. Development of biostatistical and bioinformatic programs and databases for NGS data analysis is a very active area. Subsequent analysis by new biostatistical and bioinformatics methods will further test and validate study conclusions.
The 1307 RefSeq RNAs associated with prognosis were also examined for prognostic significance in tumors from an independent cohort of patients in which DNA microarray technology was used to profile gene expression (public NKI data set). About half of these 1307 transcripts could be found as features on the microarrays. Of these shared RefSeq transcripts, about 40% were found to be prognostic in the NKI data set (P<10⁻¹⁶). There is no significant inter-study concordance for the lower abundance quartile of RefSeq transcripts, plausibly because signal-to-noise ratios in both technologies decrease as transcript numbers decrease.
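One common way to ask whether an overlap between two prognostic gene lists exceeds chance is a hypergeometric tail probability. The sketch below is offered only as an illustration of that idea: the text does not state which test produced the reported P value, so the choice of test is an assumption, and the function name and the toy numbers are hypothetical and are not study data.

```python
from fractions import Fraction
from math import comb


def overlap_tail_probability(population: int, successes: int,
                             draws: int, observed: int) -> float:
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: features shared by both platforms
    successes:  shared features that are prognostic in the second data set
    draws:      shared features that are prognostic in the first data set
    observed:   features that are prognostic in both
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact rational arithmetic avoids overflow with large binomial coefficients.
    return float(Fraction(tail, total))


# Toy numbers for illustration only: 10 shared features, 4 prognostic in the
# second data set, 3 prognostic in the first, 2 prognostic in both.
p = overlap_tail_probability(population=10, successes=4, draws=3, observed=2)
print(f"P(overlap >= 2) = {p:.4f}")
```

A small tail probability indicates that the two lists share more members than random selection from the common feature set would be expected to produce.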
A number of DNA microarray and RT-PCR studies of early breast cancer have identified as markers of poor prognosis a network of co-expressed mRNAs that regulate the cell cycle. The original RT-PCR interrogation of this 136-patient cohort identified a number of these transcripts, and this co-expressed network emerges strongly in the RNA-Seq analysis of this cohort. A network of genes that co-express with ESR1, the estrogen receptor gene, has also been linked to decreased breast cancer recurrence risk in several published studies and also emerges in our RNA-Seq results. Several novel networks of RefSeq transcripts also track with increased breast cancer recurrence risk. The largest of these (containing 134 RNAs) is heavily populated with low abundance RNAs, and includes a number of pre-microRNAs and olfactory receptor mRNAs. It may mark decreased stringency of certain transcriptional controls.
We have not analyzed these RNA-Seq data for either mutations or alternatively spliced isoforms. While we plan to do these evaluations, the laboratory protocols used here are not optimal for these assessments. The chemistry of our libraries is compatible with paired-end sequencing (not performed in the present study), which is highly desirable for analysis of splice variants and gene fusions. Given its fragmented condition, FFPE tissue RNA could present a formidable challenge to assembly of differentially spliced isoforms. We do anticipate that the library preparation methods described here will yield high quality RNA-Seq data from non-fixed fresh or frozen tissue, based on unpublished preliminary data.
In conclusion, this work describes the application of RNA-Seq methods with sufficient sensitivity and reproducibility to enable biomarker discovery in archival FFPE tissue. Whole transcriptome RNA-Seq reveals hundreds of new coding and non-coding transcripts, as well as heretofore unappreciated gene networks that strongly associate with breast cancer recurrence in this study cohort. Recognizing the challenges for development of robust gene signatures associated with clinical variables, the transcripts identified here should be explored in future screens for biomarkers of breast cancer recurrence risk.