The ADNI project has brought together geographically distributed investigators with diverse scientific capabilities to join forces in studying biomarkers that signify the progression of Alzheimer's disease (AD). The quantity of imaging, clinical, cognitive, biochemical, and genetic data acquired and generated throughout the study has required powerful mechanisms for processing, integrating and disseminating these data not only to support the research needs of the investigators who make up the ADNI cores, but also to provide widespread data access to the greater scientific community who benefit from having access to these valuable data. At the junction of this collaborative endeavor, the Laboratory of Neuro Imaging has provided an infrastructure to facilitate data integration, access, and sharing across a diverse and growing community.
ADNI is made up of the eight cores responsible for conducting the study as well as the extended ADNI family of external investigators who have requested and been authorized to use ADNI data. The various information systems used by the cores result in an intricate flow of data into, out of, and between information systems, institutions, and individuals. Ultimately, the data flow into the ADNI data repository at LONI, where they are made available to the community. Well-curated scientific data repositories allow data to be accessed by researchers across the globe and to be preserved over time [9]. To date, more than 1,300 investigators have been granted access to ADNI data, resulting in extensive download activity that exceeds 800,000 downloads of imaging, clinical, biomarker, and genetic data. ADNI investigators come from 35 countries across various sectors, as shown in .
Global distribution of ADNI investigators by sector.
The ADNI Informatics Core has made significant progress in meeting our aim to provide a user-friendly, web-based environment for storing, searching, and sharing data acquired and generated by the ADNI community. In the process, our LONI Data Archive (LDA) has grown to meet the evolving needs of the ADNI community, and we continue making strides toward a more interactive environment for data discovery and visualization. The automated systems we have developed include components for de-identification and secure archiving of imaging data from the 57 ADNI sites; managing the image workflow whereby raw images transition from quarantine status to general availability and then proceed through preprocessing and postprocessing stages; integrating nonimaging data from other cores to enrich the search capabilities; managing data access and data sharing activities for the more than 1,000 investigators using these data; and providing a central repository for disseminating data and related information to the ADNI community.
Parallel efforts by the Australian Imaging Biomarkers and Lifestyle Flagship Study of Ageing (AIBL) have resulted in a subset of AIBL data being placed in the LDA, where it has been made available to the scientific community. The AIBL data were acquired using the same magnetic resonance imaging (MRI) and positron emission tomography (PET) imaging protocols, making them compatible for cross-study collaboration [14]. Investigators may apply for data access from the ADNI and AIBL studies, either individually or in combination, and may search across and obtain data from both projects simultaneously using a common LDA search interface.
2.1. Image data workflow
In short, the acquisition sites collect data from participants and enter or upload data into the clinical and imaging databases; the imaging cores perform quality control and preprocessing of the MR and PET images; the ADNI image analysts perform postprocessing and analysis of the preprocessed images and related data; the biochemical samples are processed and the results compiled; and investigators download and analyze data as best fits their individual research needs.
2.1.1. Raw image data
In keeping with the objectives of the ADNI project to make data available to the scientific community, without embargo, while meeting the needs of the core investigators, the image data workflow shown in was adopted. Initially, each acquisition site uploads image data to the repository through the LDA, a web-based application that incorporates a number of data validation and de-identification operations, including validation of the subject identifier, validation of the dataset as human or phantom, validation of the file format, image file de-identification, encrypted data transmission, database population, secure storage of the image files and metadata, and tracking of data accesses. The image archiving portion of the system is both robust and extremely easy to use, with the bulk of new users requiring little, if any, training. Key system components supporting the process of archiving raw data are as follows:
- The subject identifier is validated against a set of acceptable, site-specific IDs.
- Potentially patient-identifying information is removed or replaced. Raw image data are encoded in the DICOM, ECAT, and HRRT file formats, from different scanner manufacturers and models (e.g., SIEMENS Symphony, GE SIGNA Excite, PHILIPS Intera). The Java applet de-identification engine is customized for each of the image file formats deemed acceptable by the ADNI imaging cores, and any files not of an acceptable format are bypassed. Because the applet is sent to the upload site, all de-identification takes place at the acquisition site and no identifying information is transmitted.
- Images are checked to see that they “look” appropriate for the type of upload. Phantom images uploaded under a patient identifier are flagged and removed from the upload set. This check is accomplished using a classifier that has been trained to identify human and phantom images.
- Image files are transferred encrypted (HTTPS) to the repository in compliance with patient-privacy regulations.
- Metadata elements are extracted from the image files and inserted into the database to support optimal storage and findability. Customized database mappings were constructed for the various image file formats to provide consistency across scanners and image file formats.
- Newly received images are placed into quarantine status, and the images are queued for those charged with performing MR and PET quality assessment.
- Quality assessment results are imported from an external database and applied to the quarantined images. Images passing quality assessment are made available; images not passing are tagged as failing quality control.
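The site-side validation and de-identification steps above can be sketched as follows. This is a minimal illustration, not the actual Java applet: the subject ID pattern, header field names, and roster are hypothetical stand-ins for the real site-specific rules and DICOM/ECAT/HRRT header handling.

```python
import re

# Hypothetical subject ID pattern: three-digit site code, "_S_", four-digit
# participant number (illustrative only, not the actual ADNI convention).
SUBJECT_ID_RE = re.compile(r"^\d{3}_S_\d{4}$")

# Illustrative subset of header fields treated as potentially identifying.
IDENTIFYING_FIELDS = {"PatientName", "PatientBirthDate",
                      "PatientAddress", "OtherPatientIDs"}

def validate_subject_id(subject_id, site_roster):
    """Check the ID is well formed and belongs to the uploading site."""
    return bool(SUBJECT_ID_RE.match(subject_id)) and subject_id in site_roster

def deidentify(header):
    """Return a copy of an image header with identifying fields removed.
    `header` is a dict standing in for a parsed image file header; the
    scrub runs at the acquisition site, before any transmission."""
    return {k: v for k, v in header.items() if k not in IDENTIFYING_FIELDS}

site_roster = {"023_S_0042"}
header = {"PatientName": "DOE^JANE", "PatientID": "023_S_0042",
          "Modality": "MR", "SeriesDescription": "MPRAGE"}
assert validate_subject_id(header["PatientID"], site_roster)
clean = deidentify(header)
assert "PatientName" not in clean and clean["Modality"] == "MR"
```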
Fig. 2 Clinical and imaging data flows from the acquisition sites into separate clinical and imaging databases. Quality assessments and preprocessed images are generated by the imaging cores and returned to the central archive, where image analysts obtain and process the data.
After raw data undergo quality assessment and are released from quarantine, they become immediately available to authorized users.
2.1.2. Processed image data
The imaging cores decided to use preprocessed images as the common, recommended set for analysis. The goals of preprocessing were to produce data standardized across site and scanner, with certain image artifacts corrected [15]. Usability of processed data for further analysis requires an understanding of the data provenance, or information about the origin and subsequent processing applied to a set of data [17]. To provide almost immediate access to preprocessed data in a manner that preserved the relationship between the raw and preprocessed images and captured processing provenance, we developed an upload mechanism that links image data and provenance metadata. The Extensible Markup Language (XML) upload method uses an XML schema that defines required metadata elements as well as standardized taxonomies. As part of the preprocessed image upload process, the unique image identifier(s) of associated images are validated, keeping the relationship(s) among raw and processed images unambiguous, with a clear lineage. The system supports uploading large batches of preprocessed images in a single session with minimal interaction required by the person performing the upload. A key aspect of this process is agreement on the definitions of provenance metadata descriptors. Using standardized terms to describe processing minimizes variability and aids investigators in gaining an unambiguous interpretation of the data [17].
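A validation step of this kind can be sketched with Python's standard XML tooling. The element names below are illustrative assumptions, not the actual ADNI schema; the point is the pattern: require the provenance elements and confirm that the referenced raw image exists, so lineage stays unambiguous.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document accompanying a preprocessed image upload.
xml_doc = """
<preprocessedImage>
  <sourceImageUID>I12345</sourceImageUID>
  <processing>
    <step name="GradWarp"/>
    <step name="B1 Correction"/>
  </processing>
</preprocessedImage>
"""

REQUIRED = ["sourceImageUID", "processing"]

def validate_metadata(text, known_uids):
    """Check required elements exist and the raw-image link resolves.
    Returns (ok, payload): payload is the ordered list of processing
    step names on success, or an error message on failure."""
    root = ET.fromstring(text)
    for tag in REQUIRED:
        if root.find(tag) is None:
            return False, "missing <%s>" % tag
    uid = root.findtext("sourceImageUID")
    if uid not in known_uids:
        return False, "unknown source image %s" % uid
    steps = [s.get("name") for s in root.iter("step")]
    return True, steps

ok, steps = validate_metadata(xml_doc, known_uids={"I12345"})
assert ok and steps == ["GradWarp", "B1 Correction"]
```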
Preprocessed images are uploaded by the quality control sites on a fairly continuous basis. Initially, each analyst had to search the data archive to find and then download data uploaded since the investigator's previous session. Because this proved cumbersome, an automated data collection component was implemented, whereby newly uploaded preprocessed scans are placed into predefined, shared data collections. These shared collections, organized by patient diagnostic group (normal control, mild cognitive impairment, AD) and visit (Baseline, 6 month, etc.), together with a redesigned user interface () that clearly indicates which images have not previously been downloaded, greatly reduced the time and effort needed to obtain new data. The same process may be used for postprocessed data, allowing analysts to share processing protocols through descriptive information contained in the XML metadata files.
The Data Collections interface provides access to shared data collections and is organized to meet the workflow needs of the analysts. The ability to easily select only images not previously downloaded by the user saves time and effort.
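The grouping and "new only" filtering described above amount to keying scans by diagnostic group and visit, then subtracting a user's download history. A minimal sketch, with record fields assumed rather than taken from the LDA schema:

```python
from collections import defaultdict

# Illustrative scan records; field names are assumptions.
scans = [
    {"image_id": "I1", "group": "AD",  "visit": "Baseline"},
    {"image_id": "I2", "group": "MCI", "visit": "6 month"},
    {"image_id": "I3", "group": "AD",  "visit": "Baseline"},
]

def build_collections(scans):
    """Place newly uploaded scans into shared collections keyed by
    (diagnostic group, visit), as the automated component does."""
    collections = defaultdict(list)
    for s in scans:
        collections[(s["group"], s["visit"])].append(s["image_id"])
    return dict(collections)

def new_for_user(collection, already_downloaded):
    """Return only the images the user has not previously downloaded."""
    return [i for i in collection if i not in already_downloaded]

cols = build_collections(scans)
assert cols[("AD", "Baseline")] == ["I1", "I3"]
assert new_for_user(cols[("AD", "Baseline")], {"I1"}) == ["I3"]
```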
2.2. Data integration
A subset of data from the clinical database was integrated into the LDA to support richer queries across the combined set. The selection of the initial set of clinical data elements was based on user surveys in which participants identified the elements they thought would be most useful in supporting their investigations. As a result, a subset of clinical assessment scores, as well the initial diagnostic group of each subject, was integrated into the LDA to be used in searches and incorporated into the metadata files that accompany each downloaded image. Because the clinical data originate in an external database, automated methods for obtaining and integrating the external data were developed that validate and synchronize the data from the two sources and ensure that data from the same subject visit are combined.
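The synchronization step reduces to joining the two sources on subject and visit, while collecting any keys that fail to match for reconciliation. A sketch under assumed record layouts (the real ADNI subject IDs, visit codes, and score names may differ):

```python
# Clinical scores keyed by (subject, visit); layout is illustrative.
clinical = {
    ("023_S_0042", "Baseline"): {"MMSE": 29, "diagnosis": "NL"},
    ("023_S_0042", "6 month"):  {"MMSE": 28, "diagnosis": "NL"},
}
images = [
    {"image_id": "I1", "subject": "023_S_0042", "visit": "Baseline"},
    {"image_id": "I2", "subject": "023_S_0042", "visit": "12 month"},
]

def integrate(images, clinical):
    """Attach clinical data to each image from the same subject visit;
    keys with no clinical match are flagged for later reconciliation."""
    merged, unmatched = [], []
    for img in images:
        key = (img["subject"], img["visit"])
        if key in clinical:
            merged.append({**img, **clinical[key]})
        else:
            unmatched.append(key)
    return merged, unmatched

merged, unmatched = integrate(images, clinical)
assert merged[0]["MMSE"] == 29
assert unmatched == [("023_S_0042", "12 month")]
```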
A robust and reliable infrastructure is a necessity for supporting a resource intended to serve a global community. The hardware infrastructure we built provides high performance, security, and reliability at each level. The fault-tolerant network infrastructure has no single points of failure. There are multiple switches, routers, and Internet connections. A firewall appliance protects and segments the network traffic, permitting only authorized ingress and egress. Multiple redundant database, application, and web servers ensure service continuity in the event of a single system failure and also provide improved performance through load balancing of requests across the multiple machines. To augment the network-based security practices and to ensure compliance with privacy requirements, the servers use Secure Sockets Layer (SSL) encryption for all data transfers. Post-transfer redundancy checking on the files is performed to guarantee the integrity of the data.
Communication with the LDA is managed by a set of redundant load balancers that divides client requests among groups of web servers for optimized resource use. One group of web servers is dedicated to receiving image files from contributors, and the other group sends data to authorized downloaders. Each web server communicates with redundant database servers that are organized in a master–slave configuration. The image files stored in the LDA reside on a multi-node Isilon storage cluster (Isilon Systems Inc., Seattle, WA). The storage system uses a block-based point-in-time snapshot feature that automatically and securely creates an internal copy of all files at the time of creation or modification. In the event of data loss or corruption in the archive, we can readily recover copies of files stored in the snapshot library without resorting to external backups.
Backup systems are designed to ensure data integrity and to protect data in the event of catastrophic failure. Incremental backups are performed nightly, with full backups stored on tape every week and sent off site. Full backups follow the industry's standard grandfather-father-son rotation scheme. To augment the data snapshot functionality, we perform nightly incremental and monthly full backups of the entire data repository. This automated backup is stored on tape in our secondary data center located in another campus building to protect against data loss in the case of a catastrophic event in our primary data center where the storage subsystem is housed. Additionally, we provide a tertiary level of protection against data loss by performing completely independent weekly tape backups of the entire collection which are deposited to an offsite vaulting service (Iron Mountain Inc., Boston, MA). This multipronged approach to data protection minimizes the risk of loss and ensures that a pristine copy of the data archive is always available ().
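The grandfather-father-son rotation mentioned above can be illustrated by classifying each backup date into its retention tier. The cutoffs here (monthly fulls on the first of the month, weekly fulls on Sundays, nightly incrementals otherwise) are illustrative assumptions, not ADNI's actual schedule:

```python
import datetime

def gfs_label(d):
    """Classify a backup date under a simple grandfather-father-son
    scheme: monthly fulls ("grandfather"), weekly fulls ("father"),
    nightly incrementals ("son"). Cutoffs are illustrative."""
    if d.day == 1:
        return "grandfather"   # retained monthly full
    if d.weekday() == 6:       # Sunday
        return "father"        # retained weekly full
    return "son"               # nightly incremental

assert gfs_label(datetime.date(2009, 6, 1)) == "grandfather"
assert gfs_label(datetime.date(2009, 6, 7)) == "father"   # a Sunday
assert gfs_label(datetime.date(2009, 6, 3)) == "son"
```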
Redundant hardware and multiple backup systems ensure data are secure and accessible.
ADNI policy requires participating sites to upload new data within 24 hours of acquisition. To prevent a large number of downloaders from competing for resources needed by uploaders, the application servers are divided by upload/download functionality. To prevent a single downloader from dominating a web server with multiple requests, the activity of each downloader is monitored and his/her download rate is throttled accordingly. Additionally, users are discouraged from downloading the same image files multiple times through the use of dialogs that interrupt and confirm the download process. These measures help to ensure ADNI data and resources are equitably shared while maximizing the efficiency of the upload processes.
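Per-user rate throttling of the kind described is commonly implemented as a token bucket. The following sketch is one plausible mechanism, not the LDA's actual implementation; the rate and burst values are arbitrary:

```python
# Minimal per-user download throttle (token bucket); limits are illustrative.
class DownloadThrottle:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # tokens replenished per second
        self.burst = burst             # maximum stored tokens
        self.state = {}                # user -> (tokens, last timestamp)

    def allow(self, user, now):
        """Spend one token if available; otherwise deny the request."""
        tokens, last = self.state.get(user, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[user] = (tokens - 1.0, now)
            return True
        self.state[user] = (tokens, now)
        return False

t = DownloadThrottle(rate_per_sec=1.0, burst=2)
assert t.allow("u1", now=0.0)       # first request passes
assert t.allow("u1", now=0.0)       # burst allows a second
assert not t.allow("u1", now=0.0)   # third immediate request throttled
assert t.allow("u1", now=1.0)       # token refilled after one second
```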
2.5. Data access and security
Access to ADNI data is restricted to those who are site participants and those who have applied for access and received approval from the Data Sharing and Publication Committee (DPC). Different levels of user access control the system features available to an individual. Those at the acquisition sites are able to upload data for subjects from their site but are not able to access data from other sites, whereas the imaging core leaders may upload, download, edit, or delete data. All data uploads, changes, and deletions are logged.
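The tiered access model described above maps naturally onto role-based permission checks with an audit trail. The role names and permission sets below are illustrative, chosen to mirror the text rather than the system's actual configuration:

```python
# Illustrative role -> permission mapping, mirroring the access tiers above.
PERMISSIONS = {
    "site_user": {"upload_own_site"},
    "approved_investigator": {"download"},
    "imaging_core_leader": {"upload", "download", "edit", "delete"},
}

audit_log = []

def perform(user, role, action):
    """Check permission for an action and log the outcome, reflecting
    the requirement that all uploads, changes, and deletions be logged."""
    allowed = action in PERMISSIONS.get(role, set())
    audit_log.append((user, action, "ok" if allowed else "denied"))
    return allowed

assert perform("siteA", "site_user", "upload_own_site")
assert not perform("siteA", "site_user", "download")   # other-site data blocked
assert perform("core1", "imaging_core_leader", "delete")
assert audit_log[1] == ("siteA", "download", "denied")
```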
The ADNI DPC oversees access by external investigators. An online application and review feature is integrated into the LDA so that applicant information and committee decisions are recorded in the database and the e-mail communications acknowledging application receipt, approval, or disapproval are automatically generated. Approved ADNI data users are required to submit annual progress reports, and the online system provides mechanisms for this function along with related tasks, such as adding team members to an approved application and receiving manuscripts for DPC review. All data accesses are logged and numbers of uploaded and downloaded image data are available to project managers through interactive project summary features.
More than 100,000 image datasets (more than five million files) and related clinical, imaging, biomarker, and genetic datasets are available to approved investigators. More than 800,000 downloads of raw, pre-, and postprocessed scans have been provided to authorized investigators. The clinical, biomarker, image analysis results, and genetic data have been downloaded more than 4,300 times.
Data download activity has increased each year since the data became available, growing from 154,200 image datasets downloaded in 2007 to almost 290,000 image datasets downloaded in 2009. With users from across the globe accessing the archive, activity occurs around the clock ().
The number of image downloads by hour of the day shows maximum activity occurring during U.S. working hours, but still a significant amount of activity at other times, in line with the numbers of investigators inside and outside the U.S.
2.6. Data management
With responsibilities for data and access oversight and administration spread across multiple institutions, we built a set of components to help those involved manage portions of the study. These include a set of data user management tools for reviewing data use applications, managing manuscript submissions, and sending notifications to investigators whose annual ADNI update is due (), and also a set of project summary tools that support interactive views of upload and download activities by site, user, time period, and provide exports of the same (). Other information, documents, and resources geared toward apprising investigators about the status of the study and data available in the archive are provided through the website.
Fig. 6 Project summary components include a tabular listing (above) and a graphical representation (below). Specific sites and time ranges control the information displayed, and tabular data may be exported for use in reports and further analysis.
User management components support the work of the DPC, the body charged with reviewing, approving, and tracking ADNI data usage and related publications.