The development of transcriptomics, i.e., methods for directly measuring thousands of transcripts simultaneously, has been a major factor in the advancement of biological studies and the creation of new fields such as genomics and systems biology. The use of transcriptomics has spread to nearly every field of biological study, for example genetics, biochemistry, ecology, and evolution. This has allowed a better understanding of how an organism’s transcriptome is structured by regulatory and evolutionary pressures and, more fundamentally, has allowed the identification of the function of innumerable new genes. Recent technological advancements have led to a rapid shift from microarray-based transcriptomics to RNAseq-based transcriptomics, largely because of the increased breadth of organisms for which RNAseq is possible. RNAseq methodology also provides a new ability to study other aspects of the transcriptome, such as splicing and processing.
An emerging area of systems biology that requires full utilization of transcriptomics is factorial biology, i.e., the biological response to multiple treatments or conditions. Modern systems biology and genomics have done a great job of studying individual genetic variants or regulatory networks in isolation, but it is rapidly becoming obvious that this provides a limited view of an organism. Instead of using networks in isolation, organisms must integrate the signal inputs from all of these networks to measure and properly orchestrate a phenotype. Unfortunately, measuring this integration requires factorial experiments in which the organism is manipulated according to at least two separate treatments. For instance, there have been systems genetic studies of how knockouts in all pairwise combinations of S. cerevisiae genes combine to affect growth (Segre et al., 2005; Roguev et al., 2008). However, these studies are limited by the ability to robotically control the organism and measure a single phenotype within a 5000 × 5000 gene matrix of pairwise epistatic combinations; in this case, a single replicate of the entire genetic matrix would require 25,000,000 genotypes. An experiment of this size, with complete transcriptomics on 25,000,000 lines, is not considered technically or financially feasible.
Another approach to the same goal of systems genetics has been to utilize crosses between natural genotypes, allowing segregation to shuffle 100s to 1000s of polymorphisms, and then measure the transcriptome in the resulting progeny (Brem et al., 2002; Brem and Kruglyak, 2005; Kliebenstein et al., 2006a; West et al., 2007). However, these populations barely scratch the surface of the possible combinations of alleles, because they typically contain fewer than 500 individuals, whereas a population may harbor at least 1000 different causal polymorphisms with the ability to affect the transcriptome (Chan et al., 2011). This example would therefore require a population of 1,000,000 individuals to sample the 1000 × 1000 matrix of all possible pairwise combinations between the causal polymorphisms and fully interrogate the factorial nature of the natural variation network. In this case, the 500 original individuals would sample only 0.05% of the potential genetic matrix. Thus, there is a need to develop approaches that allow genomics of much larger genotype collections to fully understand how networks may vary in nature.
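The arithmetic behind the two examples above can be checked with a minimal sketch. The function names are hypothetical, and the calculation assumes, as the text does, that the full n × n matrix is counted and that each individual or line samples exactly one cell of that matrix.

```python
def pairwise_matrix_size(n_factors):
    """Number of cells in the full n x n matrix of pairwise combinations,
    counted the way the text counts them (ordered pairs, diagonal included)."""
    return n_factors * n_factors


def fraction_sampled(n_individuals, n_factors):
    """Fraction of the pairwise matrix covered, assuming each individual
    samples exactly one cell (a simplifying assumption)."""
    return n_individuals / pairwise_matrix_size(n_factors)


# Yeast knockout example: 5000 genes
print(pairwise_matrix_size(5000))            # 25000000

# Natural variation example: 1000 causal polymorphisms, 500 progeny
print(pairwise_matrix_size(1000))            # 1000000
print(f"{fraction_sampled(500, 1000):.2%}")  # 0.05%
```

Counting only unordered pairs of distinct factors, n(n-1)/2, would roughly halve these totals, which does not change the conclusion that existing populations cover a very small fraction of the matrix.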
Systems regulation is another area with an emerging need for factorial experiments with transcriptomics. A prime example is research into the transcriptional circadian clock, which is showing how regulatory networks are central to the function of an organism by integrating numerous inputs (light, heat, metabolism, etc.) to properly control the output of the clock (Harmer et al., 2000; Covington and Harmer, 2007; Covington et al., 2008; Harmer, 2009). Thus, a full transcriptomic understanding of the clock in an organism would require a multi-factorial survey of the environment and of how variation in all of the external cues combines to shape the organism’s phenotype. Similar observations of massive integration are present in the literature on interactions between biotic and abiotic stimuli and between development and the environment, suggesting that there may be no isolated regulatory networks and further emphasizing the need for massively factorial manipulations involving transcriptomics.
The daunting challenge facing the above factorial studies is the vast number of samples that need to be analyzed for transcriptomics. This large sample requirement creates a need to develop methods and approaches to quickly and cheaply conduct these factorial analyses. One solution may be the use of next generation sequencing technologies, which have been shown to have the capacity for high-throughput parallel sequencing of DNA for rapid large-scale mapping studies (Tarazona et al., 2011; Monson-Miller et al., 2012). However, the use of next generation sequencing for RNAseq has largely focused on the identification and measurement of more transcripts, i.e., deep sequencing, to capture the expression of all or most genes present in the transcriptome. The results from these experiments largely match those previously obtained using microarrays, with the two approaches leading to the same general observations in Arabidopsis thaliana (Kliebenstein et al., 2006b; Van Leeuwen et al., 2007; Zhang et al., 2008; Gan et al., 2011).
One difficulty with optimizing transcriptomics is that transcriptomes show significant co-expression that is largely driven by the shape of the underlying regulatory network (Velculescu et al., 1997; Ge et al., 2001; Hirai et al., 2005; Obayashi et al., 2007; Chan et al., 2011). This co-expression structure has often led to the goal of finding a specific subset of transcripts that measure key nodes of this network, so that the entire state of the transcriptome could theoretically be described by monitoring the expression of a small set of select genes. However, finding this set has been elusive, since the key nodes often change depending upon the biological question. In addition, even if a specific subset could be identified, measuring these specific genes still requires specialized technologies that typically do not allow for the enhanced throughput required to conduct massive factorial or quantitative genomics experiments (Heinrich et al., 2012). An alternative would be to take a randomized set of genes. Given the similarity in transcriptomics results between the platforms, I theorized that it may be possible to utilize shallow RNAseq analysis for factorial transcriptomic studies by measuring where the information lies in microarray transcriptome studies. This should help to optimize the approach to factorial analysis with transcriptomics.
One potential solution to this conundrum would be to utilize the parallel sequencing capacity of next generation sequencing technologies to sequence transcriptomes at a shallow depth for the factorial experiments (Kumar et al., 2012; Monson-Miller et al., 2012). The resulting data could then be input into a network architecture and analyzed as if they were physiological measurements (Kliebenstein et al., 2006a; Kliebenstein, 2009b; Kerwin et al., 2011). A frequent retort to this idea is that the approach would yield a sample biased towards the most highly expressed genes, which could not possibly provide the desired information about how the transcriptome behaves in factorial experiments. It is true that this would be a biased sample, but it is currently not known how much of the total information present in a transcriptomic study this sample would actually contain.
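The bias described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function names, the log-normal abundance distribution, the read depth, and the detection threshold are all hypothetical choices made for illustration, and none of them comes from the datasets analyzed in this study.

```python
import random
from collections import Counter


def simulate_shallow_reads(abundances, depth, seed=0):
    """Draw `depth` reads, assigning each read to a gene with probability
    proportional to that gene's abundance (toy sampling model)."""
    rng = random.Random(seed)
    gene_ids = list(range(len(abundances)))
    return Counter(rng.choices(gene_ids, weights=abundances, k=depth))


def detected_in_top(counts, abundances, top_fraction=0.10, min_reads=5):
    """Share of detected genes (at least `min_reads` reads) that rank in
    the top `top_fraction` of all genes by true abundance."""
    n_top = max(1, int(round(top_fraction * len(abundances))))
    ranked = sorted(range(len(abundances)),
                    key=lambda g: abundances[g], reverse=True)
    top_genes = set(ranked[:n_top])
    detected = [g for g, c in counts.items() if c >= min_reads]
    if not detected:
        return 0.0
    return sum(g in top_genes for g in detected) / len(detected)


# Synthetic, strongly skewed abundances; not real expression data.
rng = random.Random(1)
abundances = [rng.lognormvariate(0, 2) for _ in range(2000)]

counts = simulate_shallow_reads(abundances, depth=20000)
share = detected_in_top(counts, abundances)
print(f"share of detected genes in the top 10% by abundance: {share:.2f}")
```

Varying `depth` and `min_reads` in this toy model shows how the set of reliably measured genes shifts toward the highly expressed end as sequencing becomes shallower, which is the sampling bias at issue here.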
In this study, I conduct a quantitative analysis of the information contained in several transcriptomic experiments to test how much transcriptome information can be obtained using shallow sequencing of factorial experiments. To accomplish this, I apply a Shannon entropy analysis to existing transcriptomic datasets to measure the information content of expression-biased subsets in comparison to the total dataset (Shannon, 1948). This is applied to three different potential uses of transcriptomics in factorial studies: a general analysis of the co-expression network in the Arabidopsis transcriptome, an expression QTL (eQTL) analysis, and finally a temporal analysis of the circadian clock output network. In all instances, the data suggest that it should be possible to obtain at least 80% of the information present in a transcriptomic study by measuring only the top 10% of the transcripts within a sample.
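To make the idea of comparing the information in an expression-biased subset with that of the total dataset concrete, the following is a minimal sketch of one possible formulation. It is an assumption for illustration, not necessarily the procedure used in this study: expression values are normalized to proportions p_i, each transcript contributes a Shannon term -p_i log2(p_i), and the subset's share is the sum of its terms divided by the total entropy. The function names and the synthetic data are hypothetical.

```python
import math
import random


def entropy_terms(values):
    """Per-transcript Shannon terms -p * log2(p), where p is each
    transcript's share of the total expression signal."""
    total = sum(values)
    if total <= 0:
        raise ValueError("expression values must have a positive sum")
    terms = []
    for v in values:
        p = v / total
        terms.append(-p * math.log2(p) if p > 0 else 0.0)
    return terms


def top_fraction_information(values, fraction=0.10):
    """Share of the total Shannon entropy contributed by the top
    `fraction` of transcripts ranked by expression (illustrative
    definition, assumed for this sketch)."""
    terms = entropy_terms(values)
    total_entropy = sum(terms)
    if total_entropy == 0:
        return 1.0  # all signal sits in a single transcript
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    k = max(1, int(round(fraction * len(values))))
    return sum(terms[i] for i in order[:k]) / total_entropy


# Synthetic, skewed expression values; not data from this study.
rng = random.Random(0)
expression = [rng.lognormvariate(0, 2) for _ in range(10000)]
share = top_fraction_information(expression, fraction=0.10)
print(f"entropy share of the top 10% of transcripts: {share:.2f}")
```

With perfectly uniform expression the top 10% of transcripts would carry exactly 10% of the entropy under this definition, so any larger share reflects skew in the expression distribution.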