Investigation design graphs and their representations
A key recommendation of the MIAME standard is the description of how biomaterials and data objects relate to each other within an experiment. Such relationships are most easily represented in graph form. We define an investigation design graph (IDG) as a directed acyclic graph (DAG) in which nodes represent biomaterials (e.g., samples, RNA extracts, arrays) or data objects, and in which edges represent the relationships between these objects. For instance, an IDG can show which samples are hybridized on which array, producing which data files, as shown in Figure .
An example investigation design graph. This graph depicts two samples hybridized on an array (design name SMD-10K) labeled by Cy3 and Cy5, generating the data file Data.txt.
Nodes and edges in this graph can be annotated with information about the respective objects, such as sample characteristics. Edges (the relationships between nodes) can be annotated by pointers to the respective experimental or data processing protocols, or by protocol parameters (e.g., dyes Cy3 or Cy5 for labeling protocols). More complex investigation design graphs are shown in Figures and . The IDG is a general concept applicable to any investigation description, and not restricted to microarray investigations. Effectively, the IDG represents the workflow of the investigation. The level of detail in this workflow description can vary; here we aim at the level of detail corresponding to the MIAME requirements. Two basic notions we use in defining the IDG are biomaterial and data object. The first intuitively represents a physical material such as a sample, RNA extract, array, or hybridized array. A protocol, when applied to a biomaterial, can generate a new biomaterial as its result. Biomaterials can also be split or pooled. For instance, one can take two samples, apply an RNA extraction/labeling protocol to each of them, labeling with Cy3 in the first case and with Cy5 in the second case, mix them and hybridize them on the array (as shown in Figure ). Data objects can be created from biomaterials by applying a 'measurement' protocol, for example, by scanning a hybridized array to obtain feature intensities. Data objects can be transformed into new data objects by applying a data transformation protocol; for precise definitions of these objects MAGE-TAB will refer to the Functional Genomics Experiment (FuGE, [17]) object model that provides a higher-level class model for extension by technology-specific models such as MAGEv2 [16].
Figure 3 An investigation design graph representing a two-channel experiment with extract pooling and reference RNA. This investigation is similar to the example in the Introduction (Figure 1), except that it uses a two-channel array and an RNA reference.
Each node in an IDG has an identifier and a list of labels. For instance, a node corresponding to a sample has the sample identifier and the sample properties, e.g., 'Organism' (genus and species) and 'OrganismPart' (organ). A label can be either a simple character string or a reference to an external object such as an ontology entry. For instance, 'Organism' will normally be described by an external ontology (e.g., NCBI taxonomy), 'OrganismPart' can be either a character string or an ontology entry obtained from an anatomy source of controlled terms. Edges in this graph can be labeled by protocols (or more usually by references to protocols) that have been used to derive one biomaterial from another. If protocols have parameters, these parameter values can be shown as labels on the respective edges (e.g., labeling protocols may have 'label' parameters, which can take values such as Cy3 or Cy5). Finally, each node in the graph has a type, e.g., 'sample', 'extract', 'hybridization', 'data'.
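The labeled graph described above can be sketched as a small data structure. The following is a minimal illustration only, not part of the MAGE-TAB specification: the class names, the protocol names, and the 'Homo sapiens' label value are assumptions chosen for the example, which encodes two samples labeled with Cy3 and Cy5 and hybridized on one array.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str                                   # identifier of the biomaterial or data object
    node_type: str                                 # e.g. 'sample', 'extract', 'hybridization', 'data'
    labels: dict = field(default_factory=dict)     # e.g. {'Organism': ...}


@dataclass
class Edge:
    source: str
    target: str
    protocol: str = ""                             # reference to the protocol applied
    parameters: dict = field(default_factory=dict) # protocol parameter values


@dataclass
class IDG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, node_type, **labels):
        self.nodes[node_id] = Node(node_id, node_type, labels)

    def add_edge(self, source, target, protocol="", **parameters):
        # Both endpoints must already exist, so edges never dangle.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, protocol, parameters))


# Hypothetical example values, for illustration only.
idg = IDG()
idg.add_node("Sample 1", "sample", Organism="Homo sapiens")
idg.add_node("Sample 2", "sample", Organism="Homo sapiens")
idg.add_node("Hyb 1", "hybridization", ArrayDesign="SMD-10K")
idg.add_node("Data.txt", "data")
idg.add_edge("Sample 1", "Hyb 1", protocol="labeling", label="Cy3")
idg.add_edge("Sample 2", "Hyb 1", protocol="labeling", label="Cy5")
idg.add_edge("Hyb 1", "Data.txt", protocol="scanning")
```

Keeping protocol parameters on the edges rather than on the nodes mirrors the description above, in which a parameter such as the dye belongs to the derivation step and not to the material itself.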
A question arises: How granular should the graph be? For instance, should one represent samples, extracts and labeled extracts within the same node, or using three different nodes? Note the differences between IDGs in Figures and – Figure illustrates a two-channel experiment comparing a series of RNA extracts with a common reference extract, while Figure represents a much simpler single-channel experiment. An additional layer of 'Extract' nodes has been used in Figure to better indicate the point at which pooling occurred. In practice, the degree of granularity used in the IDG largely does not matter, unless one of the 'intermediate' objects is being split or pooled. Nodes in the graph that have only one incoming and one outgoing edge can be contracted into their predecessor nodes by adding extra labels. Thus, unless extracts are pooled or split, it is sufficient to show which sample is hybridized to which array. Viewing a complex investigation design as a graph may be helpful, even if the graph is not drawn at the most granular scale possible. The graph representation makes the replicate structure in the investigation clear, and is even more valuable for developing software for data export/import from a database or tool. The ability to represent an investigation design graph at different levels of granularity may seem to introduce ambiguity. However, the investigation design graph is an informal concept, and it is neither possible nor desirable to prescribe exactly how a particular investigation should be represented. For our purposes, the general guideline is that the graph should reflect the level of granularity defined by MIAME. We will show in the next section that this flexibility in the representation of an IDG does not substantially affect the investigation design representations in the resulting spreadsheets, as all these different graphs will result in essentially the same spreadsheet and encode the same semantic information.
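The contraction rule mentioned above (a node with exactly one incoming and one outgoing edge is folded into its predecessor as extra labels) can be sketched as follows. This is an illustrative sketch under our own assumptions: the function name, the `contracted:` label prefix, and the node names are hypothetical, and the graph is given simply as a label dictionary plus a list of (source, target) pairs.

```python
def contract_chain_nodes(node_labels, edges):
    """Fold every node with exactly one incoming and one outgoing edge
    into its predecessor, keeping its labels as extra labels there."""
    node_labels = {node: dict(labels) for node, labels in node_labels.items()}
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        for node in list(node_labels):
            incoming = [e for e in edges if e[1] == node]
            outgoing = [e for e in edges if e[0] == node]
            if len(incoming) == 1 and len(outgoing) == 1:
                predecessor = incoming[0][0]
                successor = outgoing[0][1]
                # The contracted node survives only as a label on its predecessor.
                node_labels[predecessor][f"contracted:{node}"] = node_labels.pop(node)
                edges.remove(incoming[0])
                edges.remove(outgoing[0])
                edges.append((predecessor, successor))
                changed = True
                break  # restart the scan, since degrees have changed
    return node_labels, edges


# Two samples, each with its own extract, pooled on one hybridization.
labels = {"S1": {}, "E1": {"protocol": "extraction"}, "S2": {}, "E2": {}, "H1": {}, "D": {}}
edges = [("S1", "E1"), ("E1", "H1"), ("S2", "E2"), ("E2", "H1"), ("H1", "D")]
contracted_labels, contracted_edges = contract_chain_nodes(labels, edges)
```

In this example the two extract nodes disappear, while the hybridization node is kept because it has two incoming edges, so the result still shows which sample is hybridized to which array.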
A labeled graph can be encoded in various ways; in MAGE-TAB, we use a tabular format for the following four reasons:
1. The observation that large investigation designs typically have a regular structure, i.e., the same sub-graph is repeated many times (possibly with well defined modifications); moreover, the replicated structure is simple. This observation was supported by analysis of the structure of over 1,000 different investigations in the ArrayExpress database.
2. The degree of nodes in these graphs (i.e., the number of incoming and outgoing edges for a node) is small (most often 1 to 3), except for a few specific nodes which are related to 'reference' samples or extracts (e.g., 'Extract reference' in Figure ).
3. The observation that DAGs which correspond to commonly used investigation designs have a property that their nodes can be grouped in consecutive layers, i.e., the source nodes (the nodes in the DAG which do not have entering edges) are in layer 1, the nodes that are connected to source nodes by an edge are in layer 2, etc. Furthermore, the grouping can be done so that each layer only contains objects of the same type, e.g., for the graph in Figure , we have sample layer 1, extract layer 2, hybridization layer 3, raw data file layer 4, and processed data layer 5.
4. Similar tabular formats have been used successfully in the biosciences and are familiar to many practitioners. For examples, see [19] for a spreadsheet approach to microarray data management, or [20], which describes the application of spreadsheets to the problem of data acquisition in the field of biochemical network modeling. In addition, the PRIDE database [21] is also developing a spreadsheet-based system for the submission of mass spectrometry data.
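The layering property in the third point can be checked mechanically. The sketch below is one possible interpretation, not a prescribed algorithm: it assumes that a node's layer is one more than the deepest of its predecessors, and all function and node names are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def assign_layers(node_types, edges):
    """Place source nodes in layer 1 and every other node one layer
    below its deepest predecessor."""
    predecessors = {node: set() for node in node_types}
    for source, target in edges:
        predecessors[target].add(source)
    layers = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        layers[node] = 1 + max((layers[p] for p in predecessors[node]), default=0)
    return layers


def layers_are_homogeneous(node_types, layers):
    """Return True if every layer contains objects of a single type."""
    types_per_layer = {}
    for node, layer in layers.items():
        types_per_layer.setdefault(layer, set()).add(node_types[node])
    return all(len(types) == 1 for types in types_per_layer.values())


node_types = {
    "S1": "sample", "S2": "sample",
    "E1": "extract", "E2": "extract",
    "H1": "hybridization",
    "R1": "raw data file",
    "P1": "processed data",
}
edges = [("S1", "E1"), ("S2", "E2"), ("E1", "H1"), ("E2", "H1"), ("H1", "R1"), ("R1", "P1")]
layers = assign_layers(node_types, edges)
homogeneous = layers_are_homogeneous(node_types, layers)
```

A graph that fails the homogeneity check is not necessarily invalid; it only means that this particular layering rule does not separate the object types, and a different grouping or level of granularity may be needed.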
Once a DAG of a regular structure has been represented in such a layered fashion, it is natural to encode it as a tab-delimited file or 'spreadsheet'. Each node in a DAG is represented by entries in a contiguous set of columns within the spreadsheet. The first column within each set contains the ID of the node, with subsequent columns containing the labels attached to that node, followed by the labels of the edges leading from the node. Note that the labels in each list have a particular order. Objects of the same type (e.g., Sample, Hybridization, ArrayData) are all contained within the same column set, thereby capturing the layered DAG structure within the spreadsheet. Each row in the spreadsheet corresponds to a path in the graph from one of the source nodes to one of the 'sink' nodes. Thus if there are two or more edges leaving or entering a node, this node will appear in the spreadsheet once for each path passing through it. For instance, the DAG given in Figure is represented as a spreadsheet in Table .
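To make the row-per-path encoding concrete, the following sketch enumerates all source-to-sink paths of a small DAG and writes them as tab-delimited rows. It is a simplified illustration rather than an SDRF writer: no header row is produced, labels are given as ordered lists, every node in a layer is assumed to carry the same number of labels, and all names are hypothetical.

```python
import csv
import io


def source_to_sink_paths(edges):
    """Enumerate every path from a source node to a sink node."""
    successors, nodes, targets = {}, set(), set()
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        nodes.update((source, target))
        targets.add(target)
    paths = []

    def walk(path):
        following = successors.get(path[-1], [])
        if not following:          # reached a sink node
            paths.append(path)
        for node in following:
            walk(path + [node])

    for source in sorted(nodes - targets):  # source nodes have no entering edges
        walk([source])
    return paths


def paths_to_rows(paths, node_labels, edge_labels):
    """One row per path: node ID, node labels, then labels of the leaving edge."""
    rows = []
    for path in paths:
        row = []
        for i, node in enumerate(path):
            row.append(node)
            row.extend(node_labels.get(node, []))
            if i + 1 < len(path):
                row.extend(edge_labels.get((node, path[i + 1]), []))
        rows.append(row)
    return rows


edges = [("Sample 1", "Hyb 1"), ("Sample 2", "Hyb 1"), ("Hyb 1", "Data.txt")]
node_labels = {"Hyb 1": ["SMD-10K"]}
edge_labels = {("Sample 1", "Hyb 1"): ["Cy3"], ("Sample 2", "Hyb 1"): ["Cy5"]}

rows = paths_to_rows(source_to_sink_paths(edges), node_labels, edge_labels)
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv = buffer.getvalue()
```

The hybridization node and the data file appear in both rows, once for each path passing through them, which is exactly the repetition described above.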
SDRF representation of the investigation design graph in Figure 3.
Note that the use of IDGs provides a powerful mechanism to describe the pooling or replicate structure of the investigation precisely and unambiguously. One can easily distinguish between biological replicates (different source nodes, but all having the same experimental factor values; see below for experimental factor definition) and technical replicates on various levels, such as several samples from the same source, or dye swaps (Figure ).
An example of a more complex experimental design (data objects not shown). This is a real-world example, corresponding to the experiment with accession number E-MIMR-12 in ArrayExpress.
MAGE-TAB definition and examples
As described in the Introduction, a MAGE-TAB document includes four different types of files: (1) Investigation Description Format (IDF); (2) Array Design Format (ADF); (3) Sample and Data Relationship Format (SDRF); and (4) raw and processed data files. This section describes each of these in more detail, covering the main concepts and ideas upon which the format is based; the full MAGE-TAB specification is available online [22].
Investigation Description Format
An overall description of an investigation, including protocols and contact details, consists of a relatively small amount of information with few or no repetitious elements, and fits naturally into a single top-level document. Table shows an example of an IDF document. Values for certain fields such as "Replicate Types" and "Protocol Type" may be drawn from the MGED Ontology [23], providing for a shared vocabulary of terms across files. Where fields may contain more than one term, these terms are separated using a semicolon delimiter.
Array Design Format
The aim of the ADF component is to describe an array design in a spreadsheet or a set of spreadsheets. Conceptually, microarray designs are devised to measure presence and/or abundance of molecular (biosequence) entities in biological samples. Each sequence of interest is represented by one or more reporter sequences on the array, each of which in turn is present in one or more physical locations on the two-dimensional microarray surface. Thus three levels of hierarchy are required to describe the array design:
1. A feature on the array – a location (spot) on the array where nucleic acids are spotted or synthesized.
2. A reporter sequence – the sequence of the molecules present at a particular feature on the array. Note that the same reporter sequence can be present at different features, i.e., there is a one-to-many relationship between reporter sequences and features.
3. A composite element – a set of reporter sequences designed to measure the same biological entity, such as a gene or an exon.
In the simplest case there may be a one-to-one relationship between reporter sequences and the biological objects they are measuring. However, in a more general case, there may be a set of reporters measuring the biological entity. For instance, on short oligonucleotide arrays (such as those produced by Affymetrix), many reporters are used to measure the expression of the same gene. In the most general case there may be a many-to-many relationship between the reporters and the biological entities they are measuring (for instance, the same short oligonucleotide may be present in several different splice variants of a gene). These concepts are derived from the MAGE object model. To describe a microarray layout fully, information about composite elements, reporter sequences, and features on the array, and the relationships (mappings) between them, must be provided. The ADF has been designed to provide the means to do this. An example of an ADF document is shown in Table .
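The three-level hierarchy and its mappings can be pictured with two small lookup tables. The sketch below is purely illustrative: the identifiers (R1, geneA, and so on), the use of (row, column) tuples for features, and the function name are all assumptions, and it does not reproduce the actual ADF column layout.

```python
# Each feature (a physical location, here a hypothetical (row, column) pair)
# carries exactly one reporter; one reporter may occupy several features.
feature_to_reporter = {
    (1, 1): "R1",
    (1, 2): "R1",
    (2, 1): "R2",
    (2, 2): "R3",
}

# Reporters and composite elements are related many-to-many:
# R2 contributes to two composite elements, and geneA is measured by two reporters.
reporter_to_composites = {
    "R1": {"geneA"},
    "R2": {"geneA", "geneB"},
    "R3": {"geneB"},
}


def features_measuring(composite_element):
    """Return the features whose reporter maps to the given composite element."""
    return sorted(
        feature
        for feature, reporter in feature_to_reporter.items()
        if composite_element in reporter_to_composites[reporter]
    )


gene_a_features = features_measuring("geneA")
gene_b_features = features_measuring("geneB")
```

Resolving a composite element to its physical locations always passes through the reporter level, which is why all three levels and both mappings must be supplied to describe a layout fully.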
An example of an ADF document.
Sample and Data Relationship Format
The least trivial part of an investigation description is the relationship between sample and data objects, as represented in the SDRF file. As already mentioned, an investigation design can be described as a DAG, and the SDRF is a spreadsheet-based representation of such graphs. Tables and show SDRF examples representing the investigation design graphs shown in Figures and , respectively. Similarly, Figure shows a simplified experimental design graph for a replicated, dual-channel design with dye swap (the protocols and data files are omitted for simplicity), and its spreadsheet representation is shown in Table . In the next example (Figure ), Sources are split into Samples, which are then pooled into Extracts as shown. The IDG in Figure can be represented by the SDRF in Table .
Replicated design, dual channel with dye swap. Data objects are not shown as there is a simple one-to-one mapping between hybridizations and raw data files.
Replicated design, dual channel with dye swap. Data objects have been omitted for brevity.
Representation of the investigation design in Figure 5 as an SDRF.
There are several conventions that can be used to make the encoding of DAGs into spreadsheets more concise. First, not every path in a DAG has to be represented on the spreadsheet to encode the DAG unambiguously; it is sufficient to represent every edge only once. For instance, in the graph shown in Figure , there are four possible paths (a → c → d), (a → c → e), (b → c → d), and (b → c → e). However, it is enough to present only two full paths, e.g., (a → c → d) and (b → c → e), to represent all the relationships between the nodes in the graph, as shown in the spreadsheet in Table . The second 'compaction' rule allows an SDRF spreadsheet to be split vertically on any ID column. More precisely, it is permitted to end an SDRF table at any ID column, and then start a subsequent table with the same column. It is not necessary to duplicate lines for any ID in the second part. For instance, the SDRF in Table can be represented by the two spreadsheets in Tables and .
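The first compaction rule amounts to choosing a subset of paths that still covers every edge. One simple way to do this is a greedy selection, sketched below; this is our own illustrative choice rather than a procedure defined by MAGE-TAB, the function name is hypothetical, and a greedy choice is not guaranteed to yield the smallest possible set of paths in general.

```python
def paths_covering_all_edges(paths):
    """Greedily pick paths until every edge appears in at least one chosen path."""

    def edges_of(path):
        return set(zip(path, path[1:]))

    uncovered = set().union(*(edges_of(p) for p in paths))
    chosen = []
    while uncovered:
        # Take the path that covers the most edges not yet represented;
        # on ties, max() keeps the earliest path in the input order.
        best = max(paths, key=lambda p: len(edges_of(p) & uncovered))
        chosen.append(best)
        uncovered -= edges_of(best)
    return chosen


all_paths = [
    ("a", "c", "d"),
    ("a", "c", "e"),
    ("b", "c", "d"),
    ("b", "c", "e"),
]
compact = paths_covering_all_edges(all_paths)
```

For the four-path graph discussed above, this selects (a → c → d) and (b → c → e), the same two rows used in the text.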
Figure 6 Graph with four possible paths between nodes: (a → c → d), (a → c → e), (b → c → d), and (b → c → e).
SDRF representation of the DAG in Figure 6.
Representing SDRF from Table 2 by a set of two SDRF files: first spreadsheet.
Representing SDRF from Table 2 by a set of two SDRF files: second spreadsheet.
For a detailed description of rules for encoding an arbitrary investigation design graph as an SDRF file, see the MAGE-TAB online documentation [22].
The MAGE-TAB specification requires that raw data files are provided as binary or ASCII files in their native formats, such as Affymetrix CEL files, Agilent TXT files, or GenePix GPR files, whereas processed data files may be communicated in tab-delimited text format as data matrix files. Normally, a MAGE-TAB document will have one data matrix where rows typically represent genes (though they may also represent other biological entities, such as exons or genomic locations), and columns typically represent samples or experimental conditions. One can think of such a matrix as containing the data that are typically published as supplementary information for a given paper and on which the author would perform analyses such as clustering.
The main feature that distinguishes data matrices from arbitrary data files is that their columns carry references to ID objects in SDRF files, for instance to particular raw data files or particular samples. This enables mapping from biomaterials and their characteristics (especially experimental factor values) to individual processed data columns by following the edges in the investigation design graph. Syntactically, each data matrix file has two header rows, as shown in Table . The first header row contains references to ID objects in an SDRF file. All the IDs should come from one particular column in the SDRF; that is, each column in the data matrix is marked by IDs from a single column in the SDRF. The second row contains the names of the quantitation types, such as 'signal', 'p-value', or 'log_ratio(Cy3/Cy5)' (from the MAGE-TAB perspective, these are simply labels that do not have to have a particular meaning, but normally should be defined in the data processing protocol). An example is shown in Table .
Using this mapping, each column in the summary data matrix can be automatically and concisely annotated by the most important characteristics, such as experimental factor values (see next section).
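The two-header-row layout and the annotation step can be illustrated with a short parsing sketch. This is a sketch under stated assumptions: the header labels in the first cell of each header row ('Hyb REF', 'Gene'), the hybridization IDs, the numeric values, and the factor values are all hypothetical, and the factor lookup stands in for information that would in practice be read from the SDRF.

```python
import csv
import io


def read_data_matrix(text):
    """Split a tab-delimited data matrix into its column keys and its values.

    Row 1 holds references to SDRF IDs, row 2 holds quantitation types;
    the first cell of every row is treated as a row label.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    references, quantitation_types, body = rows[0], rows[1], rows[2:]
    columns = list(zip(references[1:], quantitation_types[1:]))
    data = {row[0]: dict(zip(columns, row[1:])) for row in body}
    return columns, data


# Hypothetical matrix: two hybridizations, two quantitation types each.
matrix_text = (
    "Hyb REF\tHyb 1\tHyb 1\tHyb 2\tHyb 2\n"
    "Gene\tsignal\tp-value\tsignal\tp-value\n"
    "gene_A\t10.5\t0.01\t11.2\t0.02\n"
)
columns, data = read_data_matrix(matrix_text)

# Hypothetical factor values, as they might be looked up via the SDRF.
factor_value_by_id = {"Hyb 1": "control", "Hyb 2": "treated"}
annotated_columns = [
    (reference, quantitation_type, factor_value_by_id[reference])
    for reference, quantitation_type in columns
]
```

Because every column is keyed by an SDRF ID, attaching experimental factor values is a plain dictionary lookup, with no need to inspect the data values themselves.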