The efficacy of microarrays in examining gene expression, gene and genome structure, protein-DNA interactions, whole-genome similarities and differences, microRNA expression, methylation, and more is no longer in question. It is a fast-developing, cutting-edge technology that has grown up alongside massive sequence databases and is likely to become part of everyday patient care. Many advances have recently expanded the power and utility of microarrays; among them is our development of a new array tiling technique that dramatically increases the scope of coverage of an oligonucleotide tiling array without substantially increasing its cost.
DNA microarrays have become so popular that they are nearly ubiquitous; this is not surprising, as scientists can use them to examine large segments of a genome in a single experiment, a feat that has transformed genetics. Microarrays can be used to address basic scientific questions as well as in therapeutic and biomedical applications. They are used to investigate the molecular basis of genetic disease, to define a diagnosis or a group of marker genes for a particular condition, to ascertain specific drug response mechanisms, or to establish therapeutic strategies for a medical disorder.
Attaching known nucleic acid probes to a solid substrate is a simple yet brilliant concept. When a labeled extract binds the attached probes, the resulting signal, while not trivial to interpret, reflects the quantity of complementary molecules present in the extract, and the scope of the results is limited mainly by the number of features on the array. In recent years, great effort has been made to create microarrays with more features (both by reducing feature size and by increasing scanner resolution). With higher-density arrays, and thereby wider scope, more comprehensive results can be achieved. Indeed, microarray development parallels current manufacturing trends of miniaturization and increased throughput, and it has been suggested that the number of features on an oligonucleotide tiling array follows Moore's Law1. As in the world of computers and MP3 players, the price of the average product stays the same regardless of the number of bytes (or features), so microarray technology remains a very expensive research tool, prohibitively so for many labs, especially when a typical design involves multiple experimental replicates (a single microarray is suggestive but never conclusive, so even at a few hundred dollars apiece, microarray costs add up quickly as replicates are performed). Multiple technical replicates can be performed simultaneously on newer multiple-array formats, though these arrays have even fewer features and are identical to each other, and technical replicates are of arguable value. If current applications do not scale up much further, the tremendous increases in microarray density will eventually make this an affordable technique for most labs; however, it would be surprising if biologists stopped inventing more creative ways to take advantage of every advance in this technology.
Array technology is an obvious way to tap the recent explosion of large-scale genomic information, though the above cost considerations are limiting, particularly when a single experiment could require dozens of slides for each replicate. Using a simple strategy, we can effectively double (or more than double) the scope of coverage of a microarray without changing its cost. This increase in density should minimize the number of arrays required for a particular experiment while maximizing their content, making room for additional controls or for experimental probes covering more genomic sequence. In turn, this will reduce the overall cost per experiment, making the technology more accessible to the scientific community and, especially, to smaller laboratories.
In this review, we adhere to the MGED/MIAME approved terminology2-4 to describe the various parts of a microarray. An array consists of individual molecules of known sequence, the “probes,” immobilized in clumps of many identical probes in areas known as “features.” The labeled molecules whose identity/abundance is being detected on the array are the “targets;” they bind to the features and are part of the starting biological sample, the “labeled extract.”
The idea of performing chemical or biological reactions with one reagent spatially immobilized is not new. Southern's 1975 paper5 described a technique that needs no introduction, and forever altered molecular biology. A decade later, researchers began work on a reciprocal technique, in which the known molecules were immobilized on a membrane or slide, and the solution to be queried was labeled and hybridized to the surface. Several publications from the 1980s6-8 describe the use of such arrays in DNA mapping and sequencing. This work culminated in the 1991 publication by Fodor et al. of Affymax9 (later to become Affymetrix) describing protein and nucleotide arrays, their construction (on microscope slides using photolithography) and their uses. At first, the technique was somewhat limited, but as sequencing data began to accumulate, so grew the power of microarrays; the scope of questions to be answered was limited only by the amount of known genome sequence that could be placed on an array.
By 1999, one could find several publications, including the landmark papers from Brown's group and collaborators10-14, that introduced the scientific community to the staggering potential of microarrays. In fact, Schena et al.10 first introduced the term “microarray” in 1995. These papers described the use of microarrays as methods for sequencing and gene expression monitoring in the burgeoning field of high-throughput biology, and set off a trend that is still gaining momentum. Indeed, the 21st century has seen an explosion in microarray-based publications as researchers find more and more uses for the technology, and one would be challenged indeed to find an issue of a modern biology or genetics journal that does not mention microarrays (Figure 1).
A typical microarray consists of known biological molecules, probes, affixed to a solid support, which can be a glass slide, a custom surface, or a membrane. These probes bind their labeled targets, and the resulting signal is analyzed computationally. The key to microarray technology is that a probe is detected at a level that is proportional in a predictable way to the amount of its target present in the labeled extract.
At first, microarrays were custom made by individual researchers, by hand or using simple spotting robots; the usual form of DNA deposited on these slides was cDNA clone inserts hundreds of base pairs long. Now, relatively few people make their own arrays, as commercial oligonucleotide-based microarrays are less prone to variation, have become more affordable, and offer far greater feature densities15. We primarily discuss commercial oligonucleotide arrays, which are fabricated on glass slides or other solid supports. Illumina arrays are slightly different in that the oligos are bound to very densely spaced beads randomly positioned on the surface of a chip16.
Oligonucleotide fabrication techniques vary from manufacturer to manufacturer, and are described in a recent review15. Table 1 provides a brief synopsis of the physical characteristics of microarrays. The manufacturing process used for each array is what limits the number and quality of features achieved, determines the flexibility of the platform, and influences the cost. More techniques are being developed17, 18, and the landscape of microarray manufacturing is still evolving rapidly.
To many biologists and clinicians, asking why one would use microarrays is akin to asking a sommelier why one would drink wine rather than water. Compared to other molecular biology techniques, using an array can conserve precious samples, decrease the amount of human-mediated processing, increase reproducibility, and decrease the time needed to complete an experiment20. Microarrays, when well designed, can answer a multitude of questions in a single experiment, or can expand the reach of an experiment to previously unobtainable scales. There are problems, of course: microarrays are still expensive and labs cannot afford to waste them on poorly contrived experiments, and, more seriously, microarray experiments generate enormous amounts of labyrinthine data that do not easily yield to analysis.
Microarrays have many uses, and researchers are continually inventing ever more fascinating ways in which to employ them. The first microarray experiments were largely concerned with DNA mapping and sequencing. By the mid-90s, scientists had extended the realm of microarray capabilities by using the technique to examine gene expression levels of dozens to hundreds of genes simultaneously12, 21, thereby establishing microarrays as a viable and flexible molecular biology tool. A promising early use of microarrays was in drug discovery (pharmacogenomics): developing new drugs to correct or at least offset detected gene expression changes mediated by a specific gene defect22, 23. Another early application was in toxicology (toxicogenomics), finding correlations between changes in genetic profiles and effects caused by exposure to toxic agents24. Microarrays were soon utilized for genome-wide genotyping, SNP detection studies, and resequencing applications. They were later used to characterize the expression signatures of different types of cancer and to classify tumor types. Microarrays are already being used for diagnostics (AmpliChip CYP450), and, cost permitting, could be quick and reliable assays to screen for disease mutations or to identify genes with a key role in disease pathways.
Boguski's group saw arrays as potential aids in genome annotation25. Shoemaker and colleagues26 developed the first version of TAG arrays, designed to query the entire yeast genome in a single hybridization. Their work has been continued and expanded by Yuan et al.27 as well as Pierce et al.28 Lenoir and Giannela29 have written an excellent review of the early days of microarray technology.
Tiling arrays, in which the probe sequences are simply derived sequentially, moving along the genome, without regard to sequence features (except that repeats are usually skipped), were soon popularized. These arrays are particularly powerful, as they allow a researcher to discover the truly unexpected, which is exactly what happened when scientists began to realize that an unexpectedly large portion of the human genome is transcribed30, 31. In addition to elucidating gene regulatory pathways32, tiling arrays found another creative use in splice variant analysis and the in-depth examination of gene structure; this research so far has challenged our notion of what a gene is. Another application of tiling arrays is in array CGH, in which two genomes are compared in their entirety in a single experiment to see whether pieces of one genome or the other are amplified or have been lost. This type of analysis has proven especially helpful to human disease research, particularly to cancer research.
SNP arrays, designed to detect single nucleotide polymorphisms, have found many uses, including analysis of cancerous cells and tissues in search of copy number variants33.
Other applications of microarrays are appearing rapidly. Innovative sample preparation techniques soon allowed examination of DNA-protein interactions (for example, ChIP-on-chip1, 34 and transcription factor binding site definition35, 36). DNA methylation studies using tiling arrays are just beginning to appear37, and show great promise as a method for examining epigenetic regulation on a genome scale. Microarrays have also been a critical addition to the fast-paced field of microRNA research38. A recent publication describes a method for using microarrays to estimate exposure to ionizing radiation through expression profiling39, and the authors note that their method could be extended to detect exposure to other toxins as well.
Recently, we have used microarrays to uncover transposon insertions in the yeast genome40. Our Transposon Insertion site Profiling chips (TIP-chips) rely on vectorette PCR to amplify transposon flanks, which are hybridized to a tiling array, allowing one to locate transposons, both previously identified and new polymorphisms, in any given sample.
Currently, microarray-derived data must adhere to MIAME standards41, and data used in manuscripts should be deposited in a public repository such as GEO42, 43 or ArrayExpress44 as a condition of acceptance for publication, allowing the entire scientific community to benefit from these experiments. As a result, meta-analyses of such data can be done in silico to arrive at biologically meaningful conclusions; for example, a non-small-cell lung cancer survival study was performed by examining several published datasets45. The authors described a genetic signature of the condition that may be useful for patient management.
Microarrays are expensive. Depending on the platform, a single experiment can cost hundreds if not thousands of dollars, and there is currently no consensus about how many replicates need to be done, although the number, fortunately, appears to be low46, 47. The cost of the arrays is separate from the cost of the reagents: the Cy3 and Cy5 fluorescent dyes coupled to triphosphates are expensive, as are the kits for biotin and single-dye labeling, and the biotin labeling protocols are tedious and time-consuming. The array scanner and workstation (though a one-time purchase, and often shared via a core facility), the effort of a technician and an analyst, and of course the intangible cost of the time spent doing the experiment in the first place must all be added to the cost of microarray experimentation.
Table 1 does not take into consideration an extremely important and often overlooked point in microarray experiments: how many times an array can be stripped and reused. CombiMatrix and NimbleGen offer protocols and advice to those wanting to reuse their arrays, but as far as most companies and investigators are concerned, arrays are single-use reagents. Many home-brewed protocols describe stripping procedures for other arrays, but not everyone agrees that these techniques yield an array that is as unbiased as one that has never been used.
Probe length is still a controversial subject, with some claiming that shorter probes are more specific48 and others preferring the longer probes because of their sensitivity. It is clear, at least, that the probe length should be tailored to the application of the array49, 50. Depending on the manufacturer and the manufacturing protocol, longer probes may or may not cost more than shorter probes, and the manufacturing technique ultimately determines the maximum probe length possible, as most processes suffer from decreasing accuracy and low yield at the limits of probe length.
Good sequence information is critical to the success of a microarray; each probe needs to be as complementary as possible to its target sequence, though variations from perfect complementarity can be tolerated, with the tolerance depending on hybridization and washing conditions. In fact, some groups have tried to quantify acceptable deviation from perfect complementarity51, 52. It is certainly possible to detect single base changes; SNP chips, built just for this purpose, have been quite successful. Recently, a method for whole yeast genome mapping of sequence polymorphisms using arrays has been outlined53. While it is clear that perfect complementarity is important, nobody has established guidelines for planning and analyzing real-life experiments that will necessarily include imperfect hybridization (due to both polymorphism and error).
Still debated in the world of microarrays is what to do with intra- and inter-platform variability, and whether experiments can be compared across platforms. At the very least, the amount of data generated creates a formidable informatics challenge54. While spotting and in situ synthesis techniques and probe design continue to improve, there remain many significant sources of variability within the technology, ranging from the specific batch of manufactured arrays and reagents47, to feature shapes and binding affinities55, to the particular settings of the scanner and the laser detection systems themselves, to, perhaps most importantly, the operator or technician processing the samples, even within the same laboratory (with the same protocols and reagents).
There have been a few encouraging studies examining variability among microarrays done on different platforms and in different labs, but questions still remain, as the two largest such investigations56, 57 apparently used the same technicians in the multiple labs to perform the same experiments on multiple arrays. This overly stringent design leaves a major question open, as in reality, arrays are performed by many different people at many levels of training, and, like any other experimental technique, are subject to the expected variations in protocol that occur when weather, social or family schedules, and lack of sleep interfere.
Therefore, each array must be normalized, and arrays in a set must be analyzed together. A number of normalization strategies and numerous analytical techniques are in use today, and excellent synopses of the statistical techniques needed for microarray analysis are available58-61.
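To make one such strategy concrete: quantile normalization, a widely used approach, forces every array in a set to share the same empirical intensity distribution so that the arrays can be compared. Below is a minimal sketch in Python with numpy; the intensity values are toy data for illustration only, not from any real platform.

```python
import numpy as np

def quantile_normalize(intensities):
    """Quantile-normalize a features x arrays intensity matrix.

    Every array (column) is forced to share the same distribution:
    the value at each rank becomes the mean of that rank across arrays.
    """
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(intensities, axis=0), axis=0)
    # Target distribution: mean of the sorted columns, rank by rank.
    mean_by_rank = np.sort(intensities, axis=0).mean(axis=1)
    # Replace each value with the target value at its rank.
    return mean_by_rank[ranks]

# Three toy "arrays" (columns) with different overall brightness.
data = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 4.0, 6.0],
                 [4.0, 2.0, 8.0]])
norm = quantile_normalize(data)
```

After normalization, every column of `norm` contains the same set of values (only their order differs), so systematic brightness differences between arrays are removed while the within-array ranking of features is preserved.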
Finally, as more microarray data are published, more studies of data and experimental-design integrity are being performed, some of which challenge the validity and reproducibility of microarray-based clinical research. Recent articles describe common mistakes and basic flaws in the experimental design of studies published in peer-reviewed journals, calling the reliability of their outcomes into question62.
Rising to the challenge, biostatisticians are developing sophisticated analysis techniques to control artifacts and bias while extracting as much information as possible from array data. Making sense of the results from a biological or medical perspective is equally as challenging, as the quantity of data necessitates automated analyses, often introducing yet another layer of communication (though rewarded by a new perspective) when a programmer is added to the team.
An important characteristic of an array is the number of features that it contains. For gene expression arrays, this means more genes covered and/or more features per gene (increasing redundancy and reliability); for tiling arrays, this means denser tiling or tiling across an even more expansive genomic region than before. As array densities climb skyward while prices remain the same, experimentalists uncover ever more creative ways to take advantage of the new technology63, 64. One critical question is how much sample is required per experiment. This number ranges from 5 micrograms to a few nanograms (aiming to reach the single cell picogram level), depending on the platform and the type of experiment. Whole genome amplification techniques help increase sensitivity dramatically, although not without worries that amplification is not perfectly consistent genome wide. For precious clinical samples a denser array that conserves samples and provides a more comprehensive result could enable a scientist to perform a previously unreachable experiment.
Feature topology on an array has also evolved, enabling more features (even of the same size) to be packed into a smaller space. For example, Agilent's chips have gone through an evolution of at least three distinct phases: matrix, orange crate packing, and honeycomb, allowing progressively higher feature density even without reducing feature size (though feature size also shrank over time, as scanner resolution increased).
We have developed a simple method for effectively multiplying the number of features on an oligonucleotide tiling array without changing the cost of the array, the array handling, or the amount of sample needed.
Briefly, the technique involves creating probes that are not simply contiguous pieces of sequence from a given genome but rather consist of two or more disjoint genomic fragments (Figure 2).
Our first double-tiled array, a 44K 60mer Agilent array, spans the entire yeast genome at approximately 120 bp resolution. Following Lashkari et al.'s13 important expression array experiment, we hybridized RNA from galactose-grown yeast labeled with Cy5 against RNA from glucose-grown yeast labeled with Cy3. We had surprisingly little trouble with the unusual array configuration, and our results unambiguously confirmed the expected expression differences between yeast grown on the two carbon sources, demonstrating that this technique may be a useful stopgap measure as we await the next generations of high-density arrays.
Computational analysis of these data proved challenging but not overwhelming; in fact, our solution was quite simple. If the signal intensity of each feature is a linear combination of the intensities of its subprobes (where “subprobes” refers to the separate genome-coordinate-contiguous pieces of each probe), then we can deconvolute the signal with a linear model (Wheelan et al.65). Briefly, we calculated a vector of true relative gene expression levels (6606 yeast genes × 1 value per gene) as the solution to the matrix form of the multiple regression equations, where the design matrix of 44K probes × 6606 genes contains an entry for each gene a probe represents (so each probe appears zero, one, or two times) and the response is simply the 44K × 1 vector of log intensities versus probe IDs. The equations are detailed in Wheelan et al.65, but will vary depending on the application.
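At toy scale, this deconvolution reduces to an ordinary least-squares problem. The sketch below assumes a hypothetical four-probe, three-gene design, a stand-in for the real 44K-probe × 6606-gene system, with noise-free intensities so that the recovered values match the simulated ones exactly; real data would of course include noise, and the estimates would be least-squares fits.

```python
import numpy as np

# Toy design matrix A (probes x genes): A[i, j] = 1 when a subprobe of
# probe i comes from gene j.  A double-tiled probe has two 1's in its
# row; a conventional single-gene probe has one.
A = np.array([[1, 0, 0],    # probe 0: gene 0 only
              [1, 1, 0],    # probe 1: genes 0 and 1 (double-tiled)
              [0, 1, 1],    # probe 2: genes 1 and 2 (double-tiled)
              [0, 0, 1]],   # probe 3: gene 2 only
             dtype=float)

true_expr = np.array([2.0, 5.0, 1.0])  # hypothetical per-gene signals
y = A @ true_expr                      # observed probe intensities
                                       # (noise-free for illustration)

# Recover the per-gene signals by least squares: minimize ||A x - y||.
est, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Because each double-tiled probe contributes one equation linking two genes, the system as a whole remains solvable as long as the design matrix has full column rank, which is the constraint a real double-tiling layout must respect.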
Our technique can be extended intuitively to triple- or even quadruple-tiling, as long as the probes are sufficiently long that the subprobes hybridize with appropriate specificity. As the push for greater and greater density continues, this technique is a simple way to multiply the density of any oligonucleotide tiling array, bringing us ever closer to the goal of the entire human genome tiled at 1-bp resolution on a single chip.
Microarray technology has the potential to vault molecular biologists forward in their exploration, as we can now form queries on much larger scales; our ability is limited now only by our imagination and by the feature density of an array. Hoheisel66 describes many past, current, and future uses of microarrays to tickle the imagination. There are also many competing techniques67, and labs should keep these in mind, since they are often cheaper and can perform the same task as a microarray, albeit usually on a smaller scale.
One technique that promises to seriously challenge microarrays is high-throughput sequencing. While not quite mainstream at this point, large-scale sequencing methods can provide similar information to a microarray and may have different biases.
The top four sequencing platforms currently in serious competition are Solexa (Illumina), Helicos, 454 (Roche), and SOLiD (Applied Biosystems). These instruments run in the $500,000 to $1,000,000 range, but this estimate may not include the very significant costs of providing the required computational infrastructure, which may or may not be present at a site contemplating such a purchase. Importantly, the latter costs consist of both a capital outlay and ongoing informatics support, to an even greater extent than for microarrays. Each platform comes with strengths and weaknesses, but all except 454 can provide gigabases of sequence in days for several thousand dollars. The 454 instrument provides significantly fewer base pairs per dollar, but this disadvantage is offset by the distinct advantage of significantly longer read lengths (200−300 bp vs. tens of base pairs for the other systems), which is extremely helpful for certain applications. As these data have not been extensively analyzed and the full cost per experiment is not clearly defined, largely because of the uncertainties associated with the cost of the associated informatics, it is difficult to say which platform fits best for any given application, but so far the field is full of promise.
Sequencing is fundamentally different from microarray-based methods. The main difference is that nearly any piece of sequence isolated from any cell can be detected by sequencing, while this is not at all the case for microarray analysis. With full-coverage tiling arrays this is less of a problem, as such arrays can capture all known sequences; however, unexpected sequences (SNPs, copy number variants, viruses, bacteria) will still be missed, and sequencing will give better information with regard to repeated DNAs unless targeted microarray protocols are developed to deconvolute them. Sequencing will pick up small variations in repeat sequence that can in principle be traced back to individual repeat copies68; these are very difficult to detect on an array.
The statistical challenges inherent in any massive-scale experiment are very important when dealing with microarrays. A microarray experiment is not hundreds of thousands of independent experiments; it is hundreds of thousands of very, very dependent observations. Many statistical techniques are calibrated to assume independence and will give flawed results otherwise. An uncorrected p-value threshold of 0.05 is technically acceptable but in reality will only lead to overenthusiastic interpretations of random noise, as it allows 1/20 of hundreds of thousands of observations to be false positives; correction for multiple hypothesis testing is essential.
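To make the correction concrete, the sketch below applies two standard procedures, Bonferroni (family-wise error control) and Benjamini-Hochberg (false discovery rate control), to a handful of illustrative p-values; the p-values are invented for the example.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p < alpha / m (controls family-wise error)."""
    p = np.asarray(pvals)
    return p < alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject H0 by the BH step-up rule (controls FDR at alpha)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m   # i * alpha / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Largest i with p_(i) <= i * alpha / m; reject p_(1)..p_(i).
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.9]
```

With these six toy p-values, an uncorrected 0.05 threshold would call four tests significant, whereas both corrections retain only the two smallest, which illustrates how easily an uncorrected cutoff inflates the apparent number of discoveries.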
We are still in the “better, bigger, stronger, faster”69 stage of microarray development. Microarrays are unarguably powerful and fascinating, although our enthusiasm must be tempered by the considerable cost involved in using them as part of our molecular biology toolbox. Finally, microarray experiments require a statistical rigor that is unusual in biology, and by involving statisticians at the design stage as well as throughout the study, biologists can optimize their resources and perform fewer and more informative experiments.