SAIL has been developed as a tool that can be installed and used independently by any biobank or research group in need of a system for keeping track of their samples. However, we believe that the main use of SAIL is in a centralized instance holding data from many providers and used to identify samples across multiple data sources.
For SAIL to function effectively across multiple data sources, their vocabularies must be compatible and the correct relations must be made between data elements in different vocabularies. There are two possible strategies to achieve this: (i) harmonizing the vocabularies before loading them into the system or (ii) using SAIL to create mappings between terms of vocabularies already loaded into the system. In either case this is not a trivial task and, although it is central to the system, describing it is outside the scope of this application note.
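As a rough illustration of strategy (ii), a mapping between terms can be thought of as a lookup from a (vocabulary, local term) pair to a shared term. The sketch below is not SAIL's implementation; the vocabulary names, local terms and the `to_shared_terms` helper are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from (vocabulary, local term) to a shared term.
# All vocabulary and local term names are illustrative, not taken from SAIL.
TERM_MAP = {
    ("cohort_a", "fasting_glucose"): "GLU",
    ("cohort_b", "P-Gluc"): "GLU",
    ("cohort_a", "body_mass_index"): "BMI",
    ("cohort_b", "bmi_kg_m2"): "BMI",
}


def to_shared_terms(vocabulary, local_terms):
    """Translate local terms to shared terms; collect terms with no mapping."""
    mapped, unmapped = set(), []
    for term in local_terms:
        shared = TERM_MAP.get((vocabulary, term))
        if shared is None:
            unmapped.append(term)  # needs manual curation before it can be queried
        else:
            mapped.add(shared)
    return mapped, unmapped


mapped, unmapped = to_shared_terms("cohort_b", ["P-Gluc", "bmi_kg_m2", "smoker"])
print(sorted(mapped), unmapped)
```

Reporting unmapped terms explicitly, rather than dropping them silently, reflects the point that establishing the correct relations is the non-trivial part of the task.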
A particular use case is the design of meta-analysis studies based on genome-wide genotype data availability across many collections. We give two examples of how SAIL can be used in this context.
Meta-analysis of genome-wide association studies for glucose levels in plasma.
A consortium wants to conduct a meta-analysis of genome-wide association studies for fasting plasma glucose levels. Diabetes status must be known for the study, and age, gender and BMI are to be used as study covariates. For the study design, an estimate is sought of how many samples can be included in the study, and from which cohorts.
In the SAIL report constructor, a query is built by selecting the parameters of interest and adding them one by one: glucose concentration (GLU), diabetes status (DB), age (AGE), gender (SEX) and BMI (BMI). We add the requirement for genome-wide genotyping data (GW_GT) and retrieve a report with detailed information about the availability in each collection of samples supporting the query (Supplementary Figure S1). The report tells us that 12 487 samples in three different cohorts may be eligible for the study. A further query restricted to these cohorts shows the exact genome-wide genotyping arrays on which these samples have been measured, and whether or not genome-wide imputed genotypes are available (Supplementary Figure S2). Based on the results from SAIL, it is now easy to contact the administrators of the different cohorts to ask for specific information and coordinate the sharing of data for the meta-analysis.
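The logic of such an availability query can be sketched as follows: a sample supports the query only if every requested parameter has been measured for it, and the report counts supporting samples per collection. This is a minimal sketch under stated assumptions, not SAIL's code; the parameter codes are those used in the example, while the sample records, cohort names and the `availability_report` function are hypothetical toy data and helpers.

```python
from collections import Counter

# Toy availability records: sample id -> (collection, parameters measured).
# Sample ids and cohort names are invented for illustration.
SAMPLES = {
    "s1": ("cohort_a", {"GLU", "DB", "AGE", "SEX", "BMI", "GW_GT"}),
    "s2": ("cohort_a", {"GLU", "DB", "AGE", "SEX", "BMI"}),
    "s3": ("cohort_b", {"GLU", "DB", "AGE", "SEX", "BMI", "GW_GT"}),
    "s4": ("cohort_b", {"GLU", "AGE", "SEX", "GW_GT"}),
    "s5": ("cohort_c", {"GLU", "DB", "AGE", "SEX", "BMI", "GW_GT"}),
}


def availability_report(samples, required):
    """Count, per collection, the samples for which all required parameters exist."""
    required = set(required)
    counts = Counter()
    for collection, measured in samples.values():
        if required <= measured:  # subset test: every required parameter is present
            counts[collection] += 1
    return dict(counts)


# Build the query by adding parameters, then add the genotyping requirement.
query = ["GLU", "DB", "AGE", "SEX", "BMI"]
query.append("GW_GT")
report = availability_report(SAMPLES, query)
print(report, sum(report.values()))
```

Note that the report is about data availability only; no measured values are needed to answer the question of how many samples could be included.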
Metabolic Syndrome is a term for a combination of phenotypes that affect the risk for diseases involving the metabolic system and diseases that may follow as the conditions progress. Metabolic Syndrome is complex to define, and several definitions exist. Three commonly used definitions are those of the International Diabetes Federation (IDF), the US National Cholesterol Education Program (NCEP) and the World Health Organization (WHO). For example, WHO defines Metabolic Syndrome as a combination of impaired glucose regulation and two out of four additional risk factors. Impaired glucose regulation in itself is determined by the presence of either type 2 diabetes, impaired fasting glucose, impaired glucose tolerance or insulin resistance. The four additional risk factors are: (i) central obesity (threshold is gender dependent and determined by waist-to-hip ratio or BMI); (ii) raised plasma triglyceride levels and/or low HDL cholesterol level (where the threshold depends on gender); (iii) raised blood pressure; and (iv) raised urinary albumin excretion rate or raised albumin : creatinine ratio in serum. SAIL supports queries for such a complex combination of phenotypes, with operations such as ‘two out of four’, but encodes them as predefined queries.
To query SAIL for Metabolic Syndrome by the WHO definition, we select the eligible collections, add the predefined MetS_WHO query and submit the request. In the current installation of SAIL at the EBI, 16 903 out of 85 979 samples across 13 collections have sufficient measurements available to determine Metabolic Syndrome status according to WHO (Supplementary Figure S3).
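The structure of the WHO rule described above, one mandatory criterion plus ‘two out of four’ additional risk factors, can be sketched as a simple predicate. This is only an illustration of the combination logic, not the MetS_WHO query itself: the flag names and the `at_least` and `mets_who` helpers are hypothetical, and each flag is assumed to have already been derived from the underlying measurements and thresholds.

```python
# Hypothetical flag names; each is assumed to be a precomputed boolean.
GLUCOSE_FLAGS = (
    "type2_diabetes",
    "impaired_fasting_glucose",
    "impaired_glucose_tolerance",
    "insulin_resistance",
)
RISK_FLAGS = (
    "central_obesity",
    "lipid_criterion",
    "raised_blood_pressure",
    "albumin_criterion",
)


def at_least(k, flags):
    """True if at least k of the boolean flags are set."""
    return sum(bool(flag) for flag in flags) >= k


def mets_who(sample):
    """Impaired glucose regulation AND at least two of the four risk factors."""
    impaired_glucose = any(sample[name] for name in GLUCOSE_FLAGS)
    enough_risk = at_least(2, (sample[name] for name in RISK_FLAGS))
    return impaired_glucose and enough_risk


# Toy sample: type 2 diabetes plus two risk factors.
sample = dict.fromkeys(GLUCOSE_FLAGS + RISK_FLAGS, False)
sample.update(type2_diabetes=True, central_obesity=True, raised_blood_pressure=True)
print(mets_who(sample))
```

Keeping the ‘k out of n’ operation as a separate helper makes it straightforward to express other definitions with the same building block.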
Although SAIL has been developed with a focus on biobanks and biological sample collections, its design allows for the integration of data from other sources where information can be arranged into annotated records. The largest of the four SAIL instances that are currently run for various projects is accessible at http://www.ebi.ac.uk/Tools/sail/ with data from approximately 189 000 samples from 14 collections. Technically, the SAIL software can scale to any number of cohorts, though the system may slow down as more cohorts are added. According to our current assessment, on the existing hardware (Intel Xeon 2.66 GHz, 4 GB RAM) the system can scale up to millions of samples.