As discussed in the methods section, the INDD database comprises seven upstream databases: the AD, PD, ALS, and FTLD clinical databases, as well as the bio-fluid, neuropathology, and genetics databases. The INDD database contains measures in the areas of demographics, clinical assessment, neuropsychological tests, imaging, pathology, bio-fluid, genetics, and clinical trials. Table 2 provides a summary of the number of variables in each of these arrays, along with some key variables. As of July 27, 2010, there were a total of 460,000 observations (unique records) in the INDD database.
| Table 2. Summary of Data Fields and Arrays of the INDD database |
Since the inception of the INDD database, there have been many examples of its utility and benefit in data retrieval, analysis and research. This is most clearly illustrated by a recent series of biomarker-targeted proteomic studies that were performed across all disease domains in the INDD, including AD, PD, FTLD and ALS, as reviewed recently by Hu et al. [16]. Since these studies included interrogation of ~1,500 bio-fluid samples from several hundred patients using a multiplex system to measure >150 analytes in each sample, it is hard to imagine how we could have completed these studies without using the Penn INDD. Thus, having a cross-disease database incorporating major neurodegenerative diseases (i.e., AD, PD, FTLD, ALS) along with bio-fluid samples, neuropathology and genetic information has conferred great advantages in the quantity and quality of neurodegenerative disease data sets at Penn. As summarized in the review by Hu et al. [16], abundant data fields within the database, as well as compatible data fields from across neurodegenerative disease centers, provided us with the information that was needed to correlate these biomarker data with clinical features of the different disorders. Thus, these studies illustrate the exceptional data mining capabilities of the INDD database. The accompanying figure provides an example of the INDD database interface with patient background and family history.
One of the best examples showcasing the advantages and strengths of the INDD database was a biomarker study conducted at Penn through the Penn-Pfizer Alliance, in which 1500 plasma and cerebrospinal fluid (CSF) samples from patients with AD, PD, FTLD or ALS and from normal controls (NC) were interrogated using the Rules Based Medicine, Inc. (RBM) human Discovery/MAP panel of 151 analytes configured for the multiplex Luminex platform. The study initially required queries of the INDD database to ensure that Penn had the necessary data from the four clinical disease centers to match the various study criteria, and that the corresponding plasma and CSF samples could be located and extracted. The criteria for selecting cases required that a subject have had either a plasma or a CSF sample drawn at one of the four clinical centers, with emphasis on having both plasma biomarkers (e.g., epidermal growth factor) and CSF biomarkers (e.g., CSF t-tau). Additionally, each patient was required to have had a full clinical evaluation, including psychometric tests (e.g., MMSE), vitals (e.g., blood pressure), and medical history (e.g., stroke).
We compare and contrast two database methods to extract the data that satisfy the above criteria in the Penn-Pfizer collaborative biomarker study. We demonstrate below how two different database schemes differ in design yet arrive at the same results.
The first database method used to generate the data was the traditional database design with separate, disjoint database containers. In this design, each clinical center housed its own data locally using center-specific IDs. A bio-fluid database, a neuropathology database, and a genetics database, among others, were also implemented in individual containers segregated from one another. To perform the data extraction required by the above criteria for the Penn-Pfizer biomarker study, each of the four clinical centers' databases was queried separately, along with the three supporting databases. Once the data were queried and the Excel data files were collected, the next step was to compare the files, ensure that no patient appeared in more than one center's file, and then combine the four separate Excel files. This post-processing requires careful examination of the data to confirm that no duplicate records remain, and extra care when the files are combined. In this example, the resulting dataset contained more than 5000 records, which had to be examined and stitched together during post-processing. In a large study like this biomarker study, investigators commonly request that the data be rerun with additional data fields, or rerun in the future after additional data have been added to the database. With separate databases and the need for post-processing, rerunning the data extraction is time-consuming and challenging. All the steps of extracting and combining the data must be repeated each time, leaving room for human error and possible misrepresentation of the data.
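The post-processing burden of this first method can be illustrated with a minimal sketch. It is not the procedure actually used at Penn: the row fields, the sample IDs, the `crosswalk` mapping from center-specific IDs to a single person-level key, and the function `combine_extracts` are all hypothetical, introduced only to show why duplicate checking across separate extracts needs an extra, manually maintained step.

```python
# Hypothetical per-center extracts; field names and IDs are illustrative only.
ad_rows = [{"center": "AD", "local_id": "AD-001", "mmse_total": 22}]
ftld_rows = [{"center": "FTLD", "local_id": "F-017", "mmse_total": 22}]
pd_rows = [{"center": "PD", "local_id": "PD-044", "mmse_total": 29}]

# Hypothetical cross-reference from (center, local ID) to one person-level key.
# With separate databases, a mapping like this must be maintained by hand.
crosswalk = {
    ("AD", "AD-001"): "P1",
    ("FTLD", "F-017"): "P1",  # same person, also seen at the FTLD center
    ("PD", "PD-044"): "P2",
}


def combine_extracts(extracts, crosswalk):
    """Stack per-center rows, keeping the first row seen for each person.

    Returns the combined rows and the rows set aside as duplicates.
    """
    combined = {}
    duplicates = []
    for rows in extracts:
        for row in rows:
            person = crosswalk[(row["center"], row["local_id"])]
            if person in combined:
                duplicates.append(row)  # needs manual review
            else:
                combined[person] = {**row, "person_id": person}
    return list(combined.values()), duplicates


merged, dupes = combine_extracts([ad_rows, ftld_rows, pd_rows], crosswalk)
print(len(merged), "unique patients;", len(dupes), "duplicate row(s) to review")
```

Every rerun of the extraction repeats this stacking and review step, which is where the risk of human error described above arises.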
The second method used to perform the data extraction was the INDD database method. Taking advantage of the centralized, joined tables of the INDD database, a single query was crafted to join 13 separate tables using the criteria listed above. The query generated 1103 records, with each row representing a unique patient and the data points spanning the columns. This result was exported to Excel, formatted and annotated for each column header, then sent to the investigators for their analysis. To support reruns, the INDD database stores previously executed queries. Because the data extraction was performed via a single query, the query could be modified to include the additional fields investigators were seeking, or the same query could be rerun to update the records of the dataset.
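A single-query extraction of this kind can be sketched as follows. This is an assumption-laden illustration, not the actual INDD schema or query: it uses an in-memory SQLite database, four hypothetical tables instead of 13, and invented column names and values. It shows only the general pattern of joining on a shared patient key and returning one row per patient.

```python
import sqlite3

# Hypothetical miniature schema; table names, columns and values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, center TEXT);
CREATE TABLE biofluid (patient_id INTEGER, fluid TEXT, analyte TEXT, value REAL);
CREATE TABLE psychometrics (patient_id INTEGER, mmse_total INTEGER);
CREATE TABLE medical_history (patient_id INTEGER, stroke INTEGER);
INSERT INTO patient VALUES (1, 'AD'), (2, 'PD'), (3, 'ALS');
INSERT INTO biofluid VALUES
    (1, 'CSF', 't-tau', 98.5), (1, 'plasma', 'EGF', 12.0), (2, 'plasma', 'EGF', 9.5);
INSERT INTO psychometrics VALUES (1, 21), (2, 28);
INSERT INTO medical_history VALUES (1, 0), (2, 1), (3, 0);
""")

# One query joins every table on the shared patient key. Inner joins drop
# patients lacking a required record; GROUP BY yields one row per patient.
QUERY = """
SELECT p.patient_id, p.center, m.mmse_total, h.stroke,
       MAX(CASE WHEN b.fluid = 'CSF' THEN b.value END) AS csf_value,
       MAX(CASE WHEN b.fluid = 'plasma' THEN b.value END) AS plasma_value
FROM patient p
JOIN biofluid b ON b.patient_id = p.patient_id
JOIN psychometrics m ON m.patient_id = p.patient_id
JOIN medical_history h ON h.patient_id = p.patient_id
GROUP BY p.patient_id, p.center, m.mmse_total, h.stroke
ORDER BY p.patient_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Because the whole extraction lives in one stored query string, adding a requested field means editing the `SELECT` list, and refreshing the dataset means executing the same query again.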
In the above case study, one can clearly see the advantage of the INDD database over the traditionally deployed databases. The reduction in time and effort from using the INDD database enables researchers and data managers to focus their efforts elsewhere and eliminates the steps required by manual post-processing, greatly reducing the chances of error in the data. While the conclusions of the two datasets are identical, the two approaches differ significantly in the time and effort required and in the accuracy of the resulting dataset. Table 3 summarizes the key differences between the two database approaches.
| Table 3. Comparison between the INDD Database Approach and the Separated Database Approach |
With the ability to query across multicenter datasets and to match those data with bio-fluid and/or genetic data, the INDD database played a key role in our ability to conduct this study. A portion of the data set queried for the Penn-Pfizer biomarker study from several clinical core databases serves as an example: it includes data from the ALS, AD, and FTLD databases, with education, race, ethnicity, and diagnosis along with Mini Mental State Exam (MMSE) date, MMSE total score, Luminex total CSF tau (t-tau) values, and Luminex CSF phosphorylated tau (p-tau) values. Since the interrogation of these 1500 plasma and CSF samples was completed, several analyses of the data have been published, submitted, or are in preparation. Briefly, several analytical strategies are being used to identify classifying analytes according to clinical and pathological diagnosis, including significance analysis of microarrays (SAM) and random forest analysis. Many analytes differed between AD and NC subjects, but only a few differed between AD and non-AD dementias. This type of analysis required the model to adjust for basic demographic variables (age, gender, education) at the most superficial level, with additional adjustment for more complex time-dependent variables, including disease duration at the time of bio-fluid collection and the cognitive and neurological examination results corresponding to that collection. As some patients had multiple types of bio-fluids collected (plasma and CSF), and a small subset had serial samples from different time points, a comprehensive INDD database is necessary to generate the data points associated with each patient at a particular time point. Novel analytes representing potential CSF biomarkers for AD and FTLD have been studied using the data generated from the INDD database, and the results have been published [17] or submitted (Hu et al., Neurology, submitted). We also investigated plasma biomarkers that distinguish AD from NC and from other neurodegenerative diseases, and these studies are being prepared for publication (Soares et al., in preparation). Thus, we have exploited the Penn INDD database to implement novel biomarker studies that would have been nearly impossible to accomplish in a timely fashion without an integrated database system.