In addressing R1 it is important to understand the nature of the experiments undertaken, the data generated, and the requirements for access. Figure is a workflow diagram describing a generic microscopy experiment and the data that are output from, or required to describe, each stage.
Workflow diagram illustrating the process of conducting a microscopy experiment. White boxes indicate the outputs from each stage that are required to describe the experiment.
The requirement for the model is that it must represent the output from experiments and results along with descriptive metadata. The metadata should, in keeping with guidelines devised for other experimental metadata [25], provide enough detail to validate the experimental methods used, minimise unnecessary repetition of experiments, and provide enough detail to repeat the experiment. There is no explicit requirement for the images to be archived along with the metadata, and as such it is reasonable to maintain a reference to the location of the image data, rather than storing the raw data in the database.
The microscopy experiments conform to the set-up illustrated in Figure . Cells are observed in multiple locations on a dish. Cells may be transfected with one or more plasmids and treated with one or more compounds. The plasmids and compounds may be identical at every location across the dish or, in the case of high-throughput screens, may vary by location. Each experiment performed may have more than one dish. Associated with the dish, or locations on the dish, are additional treatments and potential environmental perturbations. These in turn have protocols associated with them. This arrangement of entities forms the core of a model that can be used to describe an experiment and its associated results. Beyond that, on a more abstract level, is a description of the context within which it was created.
Elements that form an experimental dish unit.
Figure shows a UML diagram describing the relationships between the main objects that are associated with the Experiment class. Experiments are performed by people who belong to research groups. Experiments have a hypothesis, may be of various types (e.g., FISH (Fluorescence In Situ Hybridisation), FRET (Förster Resonance Energy Transfer) or Fluorescence), and may be performed using various techniques (e.g., confocal or wide field microscopy). Experiments can have many Dishes that were observed with a Microscope, as described above. The Experiment may be a spotted experiment, in which case cells treated in different ways may be applied to particular locations on the dish. Images are produced of Locations on the Dish, which once analysed, yield results that are directly related to that Location.
UML Diagram illustrating relationships between metadata elements and experimentally generated elements.
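The relationships in the UML diagram can be sketched as plain Java classes. The class names below follow the description above; the field types, String-valued attributes, and constructors are illustrative assumptions rather than the repository's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the core experiment model; field types are assumptions.
class Location {
    List<String> plasmids = new ArrayList<>();   // transfections at this location
    List<String> compounds = new ArrayList<>();  // treatments at this location
}

class Dish {
    List<Location> locations = new ArrayList<>();
}

class Experiment {
    String hypothesis;
    String type;       // e.g. "FISH", "FRET", "Fluorescence"
    String technique;  // e.g. "confocal", "wide field"
    boolean spotted;   // cells treated differently at particular locations
    List<Dish> dishes = new ArrayList<>();
}

public class ModelSketch {
    public static void main(String[] args) {
        Experiment e = new Experiment();
        e.type = "FRET";
        Dish d = new Dish();
        Location loc = new Location();
        loc.plasmids.add("p53-GFP");  // hypothetical plasmid name
        d.locations.add(loc);
        e.dishes.add(d);
        System.out.println(e.dishes.size());  // 1
    }
}
```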
The model represents the physical and conceptual relationships between the elements that go to make up the experiment, but it must also capture the additional information provided by the microscope data files. The microscope records a large amount of information about its settings alongside the images it captures, such as laser intensities, filters, objectives, time points and tracks. This information is useful for validating procedures and replicating experimental configurations. Figure shows the elements of the model that directly relate to the information provided in the Microscope data file and their relationship to the Dish, LocationReading and Results objects.
Classes derived from Microscope data files.
The remainder of the model is populated from CellTracker output and subsequent analysis performed on this. As CellTracker outputs XML (see additional files 1), the model represents the CellTracker output format, where a ResultTimeSeries has a sequence of ResultStates. CellTracker records the fluorescence in the nucleus and cytoplasm (CellularCompartments) of each Cell. Both Cells and CellularCompartments have fluorescence data captured about them (CellProperty) on one or more Channels. These Channels have a name, which can be related back to the wavelength of light (and hence the tagged protein of interest) being recorded on that channel, and the intensity of the fluorescence recorded. Additionally, we use this results section of the model to hold information derived from an automated analysis of the CellTracker output (described in "Summarisation of Results" below). For each recorded Result for a location on a dish, we generate AnalysisResults. The AnalysisResults consist of one or more AnalysedCells for that dish location, and for each of these we generate AnalysedChannels corresponding to the Channels that were recorded for that particular cell. The AnalysedChannels contain the details (Time, ratio value) of any Peaks relating to the movement of fluorescence between cytoplasm and nucleus. The content of this section of the model is shown in Figure .
How the Result data structure relates to Location Reading.
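The results section of the model can be sketched as Java classes that mirror the element names above. The nesting follows the text; the field types and the flattening of CellProperty to a channel/intensity pair are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the results model; class names follow the text, types are assumed.
class Channel {
    String name;  // relates back to the recorded wavelength / tagged protein
}

class CellProperty {
    Channel channel;
    double intensity;  // fluorescence recorded on that channel
}

class CellularCompartment {  // e.g. nucleus or cytoplasm
    String name;
    List<CellProperty> properties = new ArrayList<>();
}

class Cell {
    List<CellularCompartment> compartments = new ArrayList<>();
    List<CellProperty> properties = new ArrayList<>();  // whole-cell readings
}

class ResultState {
    double time;
    List<Cell> cells = new ArrayList<>();
}

class ResultTimeSeries {
    List<ResultState> states = new ArrayList<>();
}

// Derived analysis: peaks of nuclear-cytoplasmic translocation per channel.
class Peak {
    double time;
    double ratioValue;
}

class AnalysedChannel {
    String name;
    List<Peak> peaks = new ArrayList<>();
}

class AnalysedCell {
    List<AnalysedChannel> channels = new ArrayList<>();
}

public class AnalysisResult {
    List<AnalysedCell> cells = new ArrayList<>();

    public static void main(String[] args) {
        ResultTimeSeries ts = new ResultTimeSeries();
        ts.states.add(new ResultState());
        System.out.println(ts.states.size());  // 1
    }
}
```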
Implementation of the Model
The data model is implemented as an XML Schema Definition (XSD), and thus the associated data are captured as XML. This is advantageous for two reasons. Firstly, XML is a de-facto standard for the transfer of biological data. Secondly, we are able to make use of existing software infrastructure for capturing, managing and accessing XML data. The full XSD, along with example data files, is available in the additional files.
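Because the model is an XSD, captured documents can be checked against it with the standard Java validation API before submission. The fragment below is a minimal sketch using a hypothetical one-element schema, not the repository's actual XSD.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;

public class ValidateAgainstXsd {
    // Validates an XML document string against an XSD string; true if valid.
    public static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory sf =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));
            Validator v = schema.newValidator();
            v.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;  // parse or validation failure
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment standing in for the experiment schema.
        String xsd = "<?xml version=\"1.0\"?>"
            + "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"Experiment\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"Hypothesis\" type=\"xs:string\"/>"
            + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        String xml = "<Experiment><Hypothesis>example</Hypothesis></Experiment>";
        System.out.println(isValid(xsd, xml));  // true
    }
}
```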
The data capture workflow is illustrated in Figure . The data are captured using Pedro [27]. Pedro is a flexible model-driven data capture tool that is used to populate XML documents that adhere to a predefined schema. The use of Pedro as a data capture tool for cell imaging data is discussed elsewhere [10]. To fulfil R2, data capture for the repository must capture the information specified by the model but minimise the amount of form-filling which must be performed by the experimentalist. This is achieved in two ways:
Workflow diagram illustrating the process of annotating and uploading an experiment to the database using Pedro.
1. Making use of smaller repositories that store model fragments relating to commonly used items, such as Researchers, Protocols, Plasmids and Compounds. These can be selectively added to the main document being edited in Pedro.
2. Extracting metadata and experimental structure from the microscope-generated data files and the CellTracker output files.
The extraction of metadata from the microscope data initially populates a Dish document element with the correct number of LocationReading elements. These in turn are automatically annotated with the correct image data file names and microscope settings.
For each LocationReading an analysis file is produced by CellTracker, which populates the result elements with the relevant time series data. After capturing the data, it may be saved as an XML document, or directly submitted to the database.
Once stored, documents may be imported directly from the database back into Pedro for further editing and updating; for example, results may be added or removed.
The Tamino [28] or eXist [29] native XML DBMS can be used to implement the repository, which allows us to directly store the XML documents generated by Pedro during data capture. Additionally, as discussed below, the use of native XML storage provides for convenient generation of web pages using XSLT (XSL Transformation).
Requirement R3 is for efficient searching over the archived metadata. Data need to be accessed for two purposes: updating, and searching and viewing. For updating, the data may be directly loaded back into Pedro from the database. For searching and browsing we have produced a web-accessible front end.
The front end is implemented as a series of Java Server Pages that send XQuery queries to the database and then transform the returned XML into HTML using XSLT documents.
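The transform step of this pipeline can be sketched with the standard Java XSLT API. The stylesheet below, which renders experiment titles as an HTML list, is illustrative; the repository's real stylesheets and element names will differ.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XmlToHtml {
    // Applies an XSLT stylesheet to an XML string, returning the output.
    public static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stylesheet rendering experiment titles as an HTML list.
        String xslt = "<?xml version=\"1.0\"?>"
            + "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"html\" omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"/\">"
            + "<ul><xsl:for-each select=\"Experiments/Experiment\">"
            + "<li><xsl:value-of select=\"Title\"/></li>"
            + "</xsl:for-each></ul></xsl:template></xsl:stylesheet>";
        String xml = "<Experiments><Experiment><Title>FRET timecourse"
            + "</Title></Experiment></Experiments>";
        System.out.println(transform(xslt, xml));  // an HTML <ul> of titles
    }
}
```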
Following discussions with the experimentalists and modellers (first ascertaining how data were currently consumed, then examining what further questions might be asked of the data once stored in a database), the following requirements for querying the data were identified:
• List experiments by specific Experimentalist
• List experiments using a specific cell line
• List experiments performed on a specific date
• List experiments performed using a specific cell line and a specific plasmid
• List experiments performed using a specific compound treatment
• List experiments performed using a specific compound and specific plasmid
• List experiments performed using two specific plasmids
As such, the emphasis of the search interface is on finding specific experiments rather than on more complex tasks such as comparison of experiments. As a result, the database is performing the role of an experimental catalogue.
Additionally, requirements were identified for retrieval of experiments, the results of which have certain characteristics:
• Show results where an oscillation of a specific period (give or take a certain amount) has taken place.
• Show results where a change in whole cell luminescence has taken place at a specific rate (give or take a certain amount).
All of these queries have been implemented, but as new requirements are identified, it is generally straightforward to add new queries.
The results of queries that return experimental lists are represented as a table of experiment titles. Clicking through provides a summary of the experiment and protocols used (Figure ). From there it is possible to click through to the results of that experiment.
Experimental details from the metadata stored in the database. Canned search queries are in the list on the left hand side of the screen.
Queries interrogating the results yield a table of experiment titles, and selecting one takes the user directly to the results, highlighting any locations that fulfilled the query request. In order to allow querying over the experimental results we have implemented a summarisation algorithm that is run when CellTracker analyses are imported into Pedro.
Summarisation of Results
The facility to search the database for experiments whose results fit certain parameters, specified as R4, is important for modellers and experimentalists alike. We identified the following questions as being relevant for searching over experimental results:
Q1. Is there a movement of a measured fluorescent protein between the cytoplasm and nucleus, and if so when does this occur?
Q2. Are there subsequent movements resulting in an oscillation, and if so what is its period?
Q3. Is there a general trend in the overall level of measured fluorescence in a cell over time?
To meet these requirements, summaries are generated from the results of CellTracker analyses, which are stored in the database.
CellTracker generates a series of nuclear and cytoplasmic fluorescence intensities over time. By calculating and plotting the ratio between these intensities the translocation of a labelled protein from cytoplasm to nucleus may be observed as a peak (Figure ). An algorithm has been developed to automate the detection of these peaks in the CellTracker output. This enables us to annotate the data as it is imported into the database with the times of Nuclear-Cytoplasmic translocations and the period of any oscillation, which in turn allows us to answer questions Q1 and Q2. Q3 is addressed by calculating a regression line through the whole cell fluorescence over time.
A peak in Nuclear:Cytoplasmic ratio of labelled protein, denoting a translocation between cellular compartments. Also indicated are calculated detection thresholds and data points extrapolated by the peak detection algorithm.
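The ratio calculation, and the regression line used for Q3, can be sketched in Java. The method names and the least-squares slope formulation are our own illustrative choices, not taken from the repository code.

```java
public class RatioAndTrend {
    // Element-wise nuclear:cytoplasmic ratio from two intensity traces.
    public static double[] ncRatio(double[] nuclear, double[] cytoplasmic) {
        double[] r = new double[nuclear.length];
        for (int i = 0; i < r.length; i++) r[i] = nuclear[i] / cytoplasmic[i];
        return r;
    }

    // Slope of the least-squares regression line through (t, y); a non-zero
    // slope indicates a trend in whole-cell fluorescence over time (Q3).
    public static double slope(double[] t, double[] y) {
        int n = t.length;
        double st = 0, sy = 0, stt = 0, sty = 0;
        for (int i = 0; i < n; i++) {
            st += t[i]; sy += y[i]; stt += t[i] * t[i]; sty += t[i] * y[i];
        }
        return (n * sty - st * sy) / (n * stt - st * st);
    }

    public static void main(String[] args) {
        double[] nuc = {1.0, 2.0, 4.0};
        double[] cyt = {2.0, 2.0, 2.0};
        double[] r = ncRatio(nuc, cyt);     // {0.5, 1.0, 2.0}
        double[] t = {0, 1, 2};
        System.out.println(slope(t, r));    // 0.75
    }
}
```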
The peak detection algorithm takes as its input nuclear and cytoplasmic fluorescence intensity values over time. It returns peaks, identified by time and nuclear:cytoplasmic (N:C) ratio, and the period of any identifiable oscillation. The algorithm is implemented in Java and based upon Tom O'Haver's PeakFinder function for MatLab [30]. This was chosen as it had been specifically designed to identify positive peaks in noisy time-series data, and provided several parameters that could be adjusted to fit the data gathered from cell imaging.
The function accepts the data to be analysed along with parameters specifying the width of peaks to spot, a height threshold they must pass beyond (in our case this is a nuclear:cytoplasmic ratio), the width of the window to be used in the sliding average smoothing applied to the data, and a threshold gradient for the slope of the peak. The pseudocode for the algorithm is shown in Figure .
Figure 10 Pseudocode for the peak detection algorithm. fitpoly is a Java implementation of the MatLab polyfit function, which finds the coefficients of a polynomial that fits the specified data: fitpoly(x,y,n), where x and y are vectors of the x and y values.
The algorithm has the following stages:
1. Detect pre-stimulation. Some experiments start when a chemical stimulus is added to the cells; others are run with a pre-stimulation period providing a base level for the N:C ratio. Adding the stimulus takes at least 30 seconds, and hence an increase in the spacing of timepoints by at least this amount indicates that a pre-stimulation period has been undertaken.
2. Calculate detection threshold. If a pre-stimulation has been undertaken, the detection threshold should be twice the standard deviation of the N:C ratios in the pre-stimulation period (the criterion used by the experimentalists). Otherwise the threshold is set to 1.75 times the minimum recorded N:C ratio; there is some tolerance in this value, but setting it much lower (1.5) or higher (2) increases the false positive and false negative rates respectively.
3. Extrapolate values. If no pre-stimulation has occurred, the peak detection function is unlikely to identify the first peak, so a run-in of 10 timepoints, with the N:C ratio at 0.1 below the starting value, is prepended to the data. The peak detection function is also unlikely to detect a final peak that does not finish below the calculated detection threshold; hence, if the final ratio values in the time series form a downward slope, the slope is extrapolated until it falls below the detection threshold. This is illustrated in Figure .
4. Optimise detection width. The detection width (that is the width of peak, in numbers of time points, to be detected) to be used is determined by repeatedly calling the peak detection function with increasing detection widths. When the optimal detection width is encountered, there is a jump in the detected location of the first peak (Figure ).
Detected position of first peak with increasing peak detection width.
5. Detection of peaks. Derivatives of the data are smoothed using a sliding-average window 10 time points in width (lower values introduced extra false positives within the available data; higher values increased false negatives). If a maximum in the data is encountered and the ratio value is above the detection threshold, a peak is recorded. If the detection width is less than 7 time points (as implemented in the original algorithm [30]), the location and height of that local maximum are recorded. For larger detection widths, a second-order polynomial is fitted to the data and the location and height of its maximum are recorded.
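The threshold calculation (stage 2) and a simplified version of the maxima search (stage 5) can be sketched as follows. This is a minimal illustration of those two stages only; it omits the pre-stimulation detection, extrapolation, detection-width optimisation, and polynomial fitting of the full algorithm, and the method signatures are our own.

```java
import java.util.ArrayList;
import java.util.List;

public class PeakSketch {
    // Detection threshold: twice the standard deviation of the pre-stimulation
    // ratios if a pre-stimulation period exists (prestimEnd > 1), otherwise
    // 1.75 times the minimum recorded N:C ratio.
    public static double threshold(double[] ratios, int prestimEnd) {
        if (prestimEnd > 1) {
            double mean = 0;
            for (int i = 0; i < prestimEnd; i++) mean += ratios[i];
            mean /= prestimEnd;
            double var = 0;
            for (int i = 0; i < prestimEnd; i++)
                var += (ratios[i] - mean) * (ratios[i] - mean);
            return 2 * Math.sqrt(var / prestimEnd);
        }
        double min = ratios[0];
        for (double r : ratios) min = Math.min(min, r);
        return 1.75 * min;
    }

    // Sliding-average smoothing with the given window width.
    public static double[] smooth(double[] y, int window) {
        double[] s = new double[y.length];
        for (int i = 0; i < y.length; i++) {
            int lo = Math.max(0, i - window / 2);
            int hi = Math.min(y.length - 1, i + window / 2);
            double sum = 0;
            for (int j = lo; j <= hi; j++) sum += y[j];
            s[i] = sum / (hi - lo + 1);
        }
        return s;
    }

    // Indices of local maxima whose smoothed value exceeds the threshold.
    public static List<Integer> peaks(double[] y, double thr, int window) {
        double[] s = smooth(y, window);
        List<Integer> out = new ArrayList<>();
        for (int i = 1; i < s.length - 1; i++)
            if (s[i] > thr && s[i] > s[i - 1] && s[i] >= s[i + 1]) out.add(i);
        return out;
    }

    public static void main(String[] args) {
        double[] ratio = {1.0, 1.0, 1.1, 2.6, 3.0, 2.4, 1.2, 1.0};
        double thr = threshold(ratio, 0);     // no pre-stimulation: 1.75 * 1.0
        System.out.println(peaks(ratio, thr, 1));  // [4]
    }
}
```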