Overview of Data Flow
Data from the tab-delimited output file are first filtered to remove unneeded information output by the SoftWorx Tracker software; replicate spot data are then converted into single data records containing the averaged spot ratio and its standard deviation. These filtered records and the experiment identifiers are then stored in the database. Information from the database is displayed by one of two routes: directly, for further annotation, or via a data converter, as positional and ratio data. In addition, chromosome-specific information, such as the base pair position of each chromosome band, is routed through the data converter for presentation (Fig. 1).
Figure 1 Overall View of SeeGH Data Flow: The user inputs data formatted as a tab delimited text file. The relevant data is then extracted from the text file via a filtering algorithm and replicate ratios and features are averaged before being stored in an SQL ...
To accommodate output from various scanner/analyzer software packages, the only input requirement of SeeGH is a tab-delimited text file with the following six fields for each array spot: a unique identifier, the base pair starting position of the clone on the chromosome, chromosome number, channel 1 signal-to-noise ratio (Ch1 SNR), channel 2 signal-to-noise ratio (Ch2 SNR), and log2 spot ratio (Fig. 2: buttons 1–6). Two additional fields, clone name and accession number, may contain further text information (Fig. 2: buttons 7–8). The file may also contain additional fields of miscellaneous data, because the user enters the total number of columns and the specific column number for each of the required data fields (Fig. 2: buttons 1–9). For example, the text file exported from SoftWorx Tracker contains a total of 72 fields for each spot imaged from the array.
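The column mapping described above can be sketched in Python as follows. This is only an illustration: the function name `read_spots`, the field keys, and the 1-based column convention are assumptions made for the example and are not taken from the SeeGH source.

```python
import csv
import io

# Hypothetical keys for the six required fields; the column numbers are
# supplied by the user and counted from 1, as in a data entry form.
REQUIRED = ("unique_id", "bp_start", "chromosome",
            "ch1_snr", "ch2_snr", "log2_ratio")

def read_spots(handle, total_columns, columns):
    """Yield one dict per spot row, keeping only the mapped columns."""
    missing = [name for name in REQUIRED if name not in columns]
    if missing:
        raise ValueError(f"missing column numbers for: {missing}")
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != total_columns:
            continue  # skip lines whose column count differs from the declared total
        yield {name: row[number - 1] for name, number in columns.items()}

# Toy input with 8 columns, two of which are miscellaneous and ignored.
text = "s1\tmisc\t115712602\t6\t12.5\t9.8\tmisc\t-0.0269\n"
mapping = {"unique_id": 1, "bp_start": 3, "chromosome": 4,
           "ch1_snr": 5, "ch2_snr": 6, "log2_ratio": 8}
spots = list(read_spots(io.StringIO(text), 8, mapping))
```

Because fields are addressed by column number, the same reader would accept a 72-column SoftWorx Tracker export or a minimal six-column file without modification.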
Figure 2 SeeGH "New Data" Window. Buttons correspond to descriptions in the text.
Input files can be located and opened by using the Browse button or by manually entering their file path (Fig. 2: button 12). Because array CGH experiments contain replicate spots to ensure high confidence in spot ratios, SeeGH was designed to accept up to five replicate spots (Fig. 2: button 10). Replicate spot records are identified by their shared unique identifier; their ratios are averaged and the standard deviation is calculated. In a mantle cell lymphoma versus normal male hybridization, our example clone RP11-6J2 gave triplicate spot ratios of -0.02690442, 0.009741764, and 0.04698608. Averaging these spots resulted in an average spot ratio of 0.00994114 and a standard deviation of 0.0369457. If replicate spots have been previously averaged, then the 'Number of Replicates' field must be set to one and the spot standard deviations must be included in the records of the input file (Fig. 2: buttons 10,11).
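The replicate calculation for RP11-6J2 can be checked with a short Python sketch. The function name `summarize_replicates` is hypothetical. The sample standard deviation (n - 1 denominator) is used because it reproduces the reported value; the text does not state which estimator SeeGH applies.

```python
from statistics import mean, stdev

def summarize_replicates(ratios):
    """Return (mean, sample standard deviation) for replicate log2 ratios."""
    if len(ratios) == 1:
        # A single value has no spread to estimate; per the text, the input
        # file must then supply a precomputed standard deviation.
        return ratios[0], None
    return mean(ratios), stdev(ratios)

# Triplicate spot ratios quoted for clone RP11-6J2.
avg, sd = summarize_replicates([-0.02690442, 0.009741764, 0.04698608])
```

With these inputs, `avg` and `sd` agree with the average ratio and standard deviation quoted for the example clone to the reported precision.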
SeeGH also requires the user to enter a basic description for each data file. The required fields are bar code/unique identifier, disease type, experimenter, and date (Fig. 2: buttons 13–16). Additional information may be entered into the "Comments" field but is not required (Fig. 2: button 17).
Data Filtering and Storage
Once all the required information has been entered, pressing the 'Load File' button creates a record in the 'Existing Data' table containing the five file description fields (BarCode, Disease_Type, Date, Experimenter, and Comments). The BarCode field is used as a key to generate 25 new tables: one filtered input data table and one table per uniquely identified chromosome (for human material, 1–22, X and Y). For our example experiment, BarCode 10300047 points to these 25 new tables, and the information for all three replicates of RP11-6J2 is located in the filtered input data table. The calculated average ratio and standard deviation, together with the lowest signal-to-noise ratio (SNR) of the three spots for each channel, are placed into the appropriate chromosome table along with the required annotation information, reducing the three replicate records to a single chromosome record. For example, the data for RP11-6J2 from our experiment, a clone derived from chromosome 6, would be stored in chromosome table 10300047_chr6.
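The reduction of replicate records to a single chromosome record can be sketched in Python as below. The function names, dictionary keys, and SNR values are invented for illustration; only the ratios, the BarCode, and the table naming pattern come from the text, and the actual database code may differ.

```python
from collections import defaultdict
from statistics import mean, stdev

def collapse_replicates(rows):
    """Group filtered rows by unique identifier into one record per clone."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["unique_id"]].append(row)
    records = []
    for uid, reps in groups.items():
        ratios = [r["ratio"] for r in reps]
        records.append({
            "unique_id": uid,
            "chromosome": reps[0]["chromosome"],
            "bp_start": reps[0]["bp_start"],
            "avg_ratio": mean(ratios),
            "std_dev": stdev(ratios) if len(ratios) > 1 else None,
            # Keep the lowest SNR seen across the replicates, per channel.
            "ch1_snr": min(r["ch1_snr"] for r in reps),
            "ch2_snr": min(r["ch2_snr"] for r in reps),
        })
    return records

def chromosome_table(barcode, chromosome):
    """Build a table name following the BarCode_chrN pattern in the text."""
    return f"{barcode}_chr{chromosome}"

# Ratios are those quoted for RP11-6J2; the SNR values are made up.
rows = [
    {"unique_id": "RP11-6J2", "chromosome": "6", "bp_start": 115712602,
     "ratio": -0.02690442, "ch1_snr": 12.0, "ch2_snr": 9.5},
    {"unique_id": "RP11-6J2", "chromosome": "6", "bp_start": 115712602,
     "ratio": 0.009741764, "ch1_snr": 8.0, "ch2_snr": 11.0},
    {"unique_id": "RP11-6J2", "chromosome": "6", "bp_start": 115712602,
     "ratio": 0.04698608, "ch1_snr": 10.0, "ch2_snr": 7.5},
]
records = collapse_replicates(rows)
table = chromosome_table(10300047, records[0]["chromosome"])
```

Keeping the minimum SNR per channel means a later signal-to-noise filter treats the clone according to its weakest replicate.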
The Genomic View window appears automatically after new data have been loaded into the database (Fig. 3). The Genomic View consists of 24 tiles (one for each unique chromosome), each measuring 100 by 150 pixels with the origin pixel position (0, 0) at the bottom left corner of the tile. To plot chromosomes and spot ratios, SeeGH takes the base pair information for each chromosome and spot ratio, converts it to pixel position coordinates, and draws the image of each chromosome and spot ratio into a tile using those coordinates.
Figure 3 SeeGH "Genomic View" Window. Reconstructed whole genomic array CGH profile from 97,299 array elements. Mantle cell lymphoma DNA (labeled with Cye5) was competitively hybridized with normal male (labeled with Cye3) to an array of 32,433 DNA segments spotted ...
The chromosomal information used to draw the chromosomes is contained in 49 text files. For each chromosome arm there is a corresponding file that contains band names and base pair positions. The p and q arms of the 22 autosomes and 2 sex chromosomes are represented in a total of 48 files. The 49th file contains information about total chromosome lengths and individual arm lengths for each chromosome. In the example presented in this paper we used information from the UCSC April 2003 assembly to create these files. These files are included with the software and can be updated with new chromosomal mapping information as it becomes available. Using this information, the total base pair length of each chromosome arm is converted into pixel position y-coordinates using a base pair to pixel conversion formula (pixel position y-coordinate = base pair position / 1,700,000). This same formula is used to calculate each chromosome band's start and end pixel position y-coordinate from the 48 band information files. Chromosomes are drawn in the Genomic View with the x-coordinate starting at pixel 10 and having a width of 20 pixels.
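The base pair to pixel conversion for chromosome bands can be illustrated with a brief Python sketch. The band tuples below are made-up values chosen only to show the arithmetic, the function names are hypothetical, and truncation to a whole pixel is an assumption, since the text does not say how fractional pixels are handled.

```python
BP_PER_PIXEL = 1_700_000  # divisor quoted in the text for the Genomic View

def bp_to_y(bp_position):
    """Convert a base pair position to a pixel y-coordinate (truncated)."""
    return bp_position // BP_PER_PIXEL

def band_pixels(bands):
    """Map (name, bp_start, bp_end) tuples to (name, y_start, y_end)."""
    return [(name, bp_to_y(start), bp_to_y(end)) for name, start, end in bands]

# Illustrative band boundaries, not real cytogenetic positions.
example = band_pixels([("bandA", 0, 3_400_000),
                       ("bandB", 3_400_000, 17_000_000)])
```

At this scale a 150-pixel-tall tile spans 255,000,000 base pairs, which suggests why a divisor of this size was chosen for whole-chromosome display.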
The base pair start information for spot ratios is retrieved from the 24 chromosome tables created in the database for each experiment and converted into pixel position y-coordinates using the same formula. The x-coordinate for each spot ratio is calculated using a similar pixel conversion formula (pixel position of x-coordinate = X_Axis + spot ratio * One_Ratio). One_Ratio is given a default value of 10 pixels and X_Axis is set to a constant of 50. Therefore the y- and x-coordinates of our example clone (RP11-6J2) are 68 and 60, respectively (y-coordinate = 115712602 / 1700000, x-coordinate = 60 + 0.00994114 * 10).
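A minimal Python sketch of the spot coordinate conversion follows. The function name `spot_xy` is hypothetical and truncation to whole pixels is an assumption. X_Axis is passed in explicitly because the text gives a constant of 50 while the worked example substitutes 60.

```python
def spot_xy(bp_start, ratio, x_axis, one_ratio=10, bp_per_pixel=1_700_000):
    """Return (x, y) pixel coordinates for a spot in a Genomic View tile."""
    y = bp_start // bp_per_pixel           # same divisor as for the chromosome
    x = int(x_axis + ratio * one_ratio)    # offset from the ratio axis
    return x, y

# Example clone RP11-6J2, using the X_Axis value from the worked example.
x, y = spot_xy(115712602, 0.00994114, x_axis=60)
```

With a ratio this close to zero, the spot falls on the axis pixel itself: a log2 ratio must reach 0.1 in magnitude before it moves one pixel at the default One_Ratio of 10.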
Chromosomes and corresponding spot ratios are plotted on each tile using the calculated x- and y-coordinates. The 24 resulting tiles are displayed in the Genomic View as an 8 by 3 grid (Fig. 3: button 1). The Genomic View allows manipulation of several display parameters: ratio lines, ratio width, standard deviation filters, and signal-to-noise filters.
Ratio lines can be displayed at +/- 0.5, 1.0, 1.5 and 2.0, with a default display of +/- 1.0 (Fig. 3: buttons 2–5). Ratio width can be increased or decreased by entering a numerical modifier that expands or contracts the x-coordinates of the spot ratios relative to the X_Axis (pixel position of x-coordinate = X_Axis + spot ratio * (One_Ratio + modifier)) (Fig. 3: button 6). SeeGH can also display only those spots that meet user-defined criteria. These criteria include a standard deviation cutoff and/or a minimum signal-to-noise ratio for either Ch1 SNR or Ch2 SNR (Fig. 3: buttons 7–9). The 8 by 3 tiled image can be saved as a bitmap, which can be viewed or printed using any image viewing software (Fig. 3: button 10).
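The ratio width modifier and the display filters can be sketched in Python as follows. The function and key names are hypothetical, and treating the cutoffs as inclusive (a spot exactly at the threshold is still shown) is an assumption not stated in the text.

```python
def scaled_x(ratio, x_axis, one_ratio=10, modifier=0):
    """x-coordinate after applying the ratio width modifier."""
    return x_axis + ratio * (one_ratio + modifier)

def passes_filters(spot, max_std_dev=None, min_ch1_snr=None, min_ch2_snr=None):
    """Return True if the spot meets every criterion that was set.

    Criteria left as None are not applied, so filters can be combined freely.
    """
    if max_std_dev is not None and spot["std_dev"] > max_std_dev:
        return False
    if min_ch1_snr is not None and spot["ch1_snr"] < min_ch1_snr:
        return False
    if min_ch2_snr is not None and spot["ch2_snr"] < min_ch2_snr:
        return False
    return True

# Made-up SNR values; the standard deviation is that of the example clone.
spot = {"std_dev": 0.0369457, "ch1_snr": 12.0, "ch2_snr": 9.0}
shown = passes_filters(spot, max_std_dev=0.05, min_ch1_snr=10)
hidden = not passes_filters(spot, min_ch2_snr=10)
```

A positive modifier spreads spots further from the axis and a negative one pulls them in, while ratio lines drawn with the same formula stay aligned with the data.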
While in the Genomic View, the user can also search for a specific spot by unique identifier, clone name, or accession number. An example search is shown in Figure 3: button 11 and Figure 4: buttons 1–2. Once the spot is located, the appropriate Chromosome View opens automatically, with a line drawn through the chromosome image at the spot's locus and the spot highlighted. A Chromosome View can also be opened without entering a search term by selecting a chromosome with the left mouse button and choosing a magnification from the pop-up menu (Fig. 3: button 12).
Figure 4 SeeGH "Search" Window. Buttons correspond to descriptions in the text.
The Chromosome View displays the selected chromosome tile as a 649 by 673 pixel image, with a zoom factor incorporated into the base pair to pixel conversion formula (pixel position y-coordinate = base pair position * zoom factor / 1,700,000) that increases or decreases the total pixel length of the chromosome image. The x-coordinates for displaying the chromosome now start at pixel 100 and have a width of 40 pixels. The x-coordinates for spot ratios are calculated using the same formula as in the Genomic View (X_Axis + spot ratio * One_Ratio), with One_Ratio equal to 50 pixels and X_Axis set to a constant of 375. For our demonstration clone, the y- and x-coordinates become 272 and 375 in the tile.
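The zoomed conversion can be sketched in Python as below. The function name is hypothetical and truncation to whole pixels is an assumption. The text does not state the zoom factor used for the demonstration clone; a value of 4 is assumed here only because it reproduces the quoted y-coordinate of 272.

```python
def chromosome_view_xy(bp_start, ratio, zoom,
                       x_axis=375, one_ratio=50, bp_per_pixel=1_700_000):
    """Return (x, y) pixel coordinates for a spot in the Chromosome View."""
    y = int(bp_start * zoom / bp_per_pixel)  # zoom stretches the y-axis only
    x = int(x_axis + ratio * one_ratio)      # wider ratio scale than Genomic View
    return x, y

# Example clone RP11-6J2 with an assumed zoom factor of 4.
x, y = chromosome_view_xy(115712602, 0.00994114, zoom=4)
```

Since the zoom factor multiplies only the base pair term, magnification lengthens the chromosome without changing the horizontal ratio scale, which is adjusted separately through One_Ratio.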
In the Chromosome View, the user is given many of the same features available in the Genomic View: hiding spots based on standard deviation criteria or signal-to-noise ratios, changing ratio widths of the spot image, adding or deleting ratio lines of 0.5, 1.0, 1.5 and 2.0, and saving the image as a bitmap (Fig. 5: buttons 1–5). However, the Chromosome View provides several additional features that are unavailable in the Genomic View: the display of standard deviations for replicate spots, flagging of high standard deviations, mouse-over activated spot information, continuous zoom, scrolling along the chromosome, display of UCSC regional information, and clearing of search results (Fig. 5: buttons 6–12).
Figure 5 SeeGH "Chromosome View" Window. 1,972 DNA segments are displayed for chromosome 6. The red line through the chromosome denotes the location of the search DNA segment which is highlighted. Horizontal lines through each data point represent standard deviations ...
Spot standard deviations are displayed as a line through each spot and can be turned on or off by checking or unchecking a box in the Chromosome View (Fig. 5: button 6). In addition, standard deviation lines that exceed a user-defined value can be flagged in red (Fig. 5: button 7). One key feature added in the Chromosome View is the 'mouse-over' functionality, which displays specific spot information when the mouse cursor is positioned over a spot. The spot information displayed consists of the clone name, accession number, unique identifier, base pair starting position, ratio, standard deviation, and signal-to-noise ratio for both channel 1 and channel 2 (Fig. 5: button 8). The zoom feature in the Chromosome View functions the same as in the Genomic View and can be applied repeatedly for limitless magnification (Fig. 5: button 9). The Chromosome View can be scrolled up or down at a rate set by the user (Fig. 5: button 10). UCSC base pair positions are given for the displayed image (Fig. 5: button 11). The final feature clears the highlighted results of the Search function (Fig. 5: button 12).
Figure 6 SeeGH "Existing Data" Window. Buttons correspond to descriptions in the text.
Accessing Previously Entered Data
The Existing Data window contains a list of all the files that have been loaded into the program (Fig. 6: buttons 1–3). The displayed list can be limited by searching for data sets that match specific search criteria (Fig. 6: buttons 1–2). Alternatively, the list can be ordered by selecting a field from the drop-down menu and performing a search without entering any search criteria. A data set can be selected by highlighting a row in the list of existing data (Fig. 6: button 3). Once selected, the data set can either be viewed or deleted (Fig. 6: buttons 4–5). Deleting a data set removes all of its tables from the database, whereas viewing opens a Genomic View for that data set.