The clinical informatics team met weekly over a two-year period to carry out the steps presented in the methods section. Through an iterative development process, we created the final data model for Oncoshare, depicted in the figure below. Given the multidisciplinary, often multi-institutional nature of breast cancer care, this simpler model proved more feasible than the one we had originally proposed. The model was driven by the need to find common ground between the different EHR systems implemented at the two sites, as well as the historical EHR migrations at each site. These system differences were identified, reviewed, and analyzed at the weekly meetings. The final data model reflects the compromises necessary to integrate data from multiple systems across entities and time.
Figure: Final data model and primary sources for each data type. In this model, relationships between tumor and treatment must be inferred by temporal proximity.
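The caption's point that tumor-treatment relationships must be inferred by temporal proximity can be illustrated with a minimal sketch. It is not the Oncoshare implementation: the function name, the dictionary layout, and especially the 365-day window (`max_gap_days`) are assumptions made for illustration, since the text does not state how proximity was defined.

```python
from datetime import date
from typing import Dict, Optional


def link_treatments_to_tumors(
    tumor_dates: Dict[str, date],
    treatment_dates: Dict[str, date],
    max_gap_days: int = 365,  # hypothetical window; not given in the text
) -> Dict[str, Optional[str]]:
    """Assign each treatment to the tumor whose diagnosis date is closest in time.

    A treatment with no diagnosis within max_gap_days is left unlinked (None).
    """
    links: Dict[str, Optional[str]] = {}
    for treatment_id, treatment_date in treatment_dates.items():
        best_tumor: Optional[str] = None
        best_gap: Optional[int] = None
        for tumor_id, diagnosis_date in tumor_dates.items():
            gap = abs((treatment_date - diagnosis_date).days)
            if gap <= max_gap_days and (best_gap is None or gap < best_gap):
                best_tumor, best_gap = tumor_id, gap
        links[treatment_id] = best_tumor
    return links
```

For a patient with two tumors diagnosed years apart, each treatment is attached to the nearer diagnosis, and a treatment far from both is left unlinked for manual review.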
Our initial dataset, compiled from our respective institutional EMRs from 2000 to 2009 inclusive, contained 8,390 patients seen at PAMF and 11,010 at Stanford, 2,137 of whom were seen at both institutions (approximately 25% of the PAMF patients and 20% of the Stanford patients).
Clinical encounter data drawn from the EMR consisted of billing codes (the code, source vocabulary and descriptive text), patient, provider and date. This table was richly populated, averaging 80 encounter codes per patient, with all columns complete.
Tumor data drawn from the EMR and cancer registry consisted of patient identification code, date of diagnosis, stage, expression of clinically important tumor markers (estrogen receptor (ER), progesterone receptor (PR) and HER2/neu), histology, and pathologic Tumor, Node, Metastasis (TNM) staging. This table was sparsely populated, with TNM scores, staging and tumor markers available for only 50% of the patients originally identified by billing diagnosis codes alone. Because of these significant rates of missing data, we defined an analytical cohort (N=12,116) of patients with adequate information to characterize their breast cancer, including stage, tumor markers, and evidence of some treatment or diagnostic information.
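The cohort definition above can be read as a simple inclusion filter. The sketch below is one possible reading, not the authors' code: the `TumorRecord` fields are hypothetical, and requiring all three tumor markers (rather than any one) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TumorRecord:
    """Hypothetical per-patient summary; field names are illustrative only."""
    stage: Optional[str] = None
    er_status: Optional[str] = None
    pr_status: Optional[str] = None
    her2_status: Optional[str] = None
    n_treatment_records: int = 0
    n_diagnostic_records: int = 0


def in_analytical_cohort(rec: TumorRecord) -> bool:
    """Return True if the record has stage, tumor markers, and some care evidence."""
    has_stage = rec.stage is not None
    # Assumption: all three markers must be known; the text does not say all vs. any.
    has_markers = all(
        marker is not None
        for marker in (rec.er_status, rec.pr_status, rec.her2_status)
    )
    # "Evidence of some treatment or diagnostic information": at least one record.
    has_care_evidence = rec.n_treatment_records > 0 or rec.n_diagnostic_records > 0
    return has_stage and has_markers and has_care_evidence
```

Keeping the three criteria as separate named booleans makes it easy to report how many patients were excluded by each one.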
Surgery data, while potentially available in the EMR, was most reliably extracted from the cancer registry, particularly for patients with complete staging, histology and tumor marker information reported by the registry. We were confident in interpreting a missing report in the state-wide cancer registry as a true absence of surgery, whereas a missing report in the EMR might reflect a billing error or performance of the procedure at a different hospital.
Given recent reports that SEER may under-ascertain specific treatment modalities, particularly radiation therapy [12], we used EMR data to supplement the registry summary of treatment for each complex major modality, namely systemic therapy and radiotherapy. Efforts to add specific details of chemotherapy regimens, such as drug combinations, doses and intervals, proved challenging given the evolution of EMR-based drug ordering over the last decade. We anticipate that prospective data capture of chemotherapy regimens will prove more straightforward, with the increasing use of chemotherapy-specific electronic ordering programs such as Beacon. A major contribution from the EMR was the addition of billing codes for emerging diagnostic interventions including imaging strategies, genetic and tumor genomic tests; this information was not available through the cancer registry.
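The source-precedence rules described for surgery, systemic therapy, and radiotherapy can be summarized in a short sketch. It rests on an assumption: "supplement" is read here as "count the modality as received if either source reports it." The function and constant names are hypothetical.

```python
# Modalities for which EMR data supplement the registry summary (per the text).
SUPPLEMENTED_MODALITIES = {"systemic therapy", "radiotherapy"}


def received_modality(modality: str, registry_reports: bool, emr_reports: bool) -> bool:
    """Decide whether a treatment modality is recorded as received.

    Surgery: the registry alone is trusted, so a missing registry report
    is treated as a true absence of surgery.
    Systemic therapy and radiotherapy: either source counts (assumed reading
    of "supplement").
    """
    if modality == "surgery":
        return registry_reports
    if modality in SUPPLEMENTED_MODALITIES:
        return registry_reports or emr_reports
    raise ValueError(f"unknown modality: {modality!r}")
```

Under this reading, an EMR-only radiotherapy record adds a treatment the registry missed, while an EMR-only surgery record does not override the registry.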
We obtained survival data from the state cancer registry according to their reported algorithm interrogating multiple national databases for last date of follow up [13] and by querying the vital status field. We also integrated data from the Social Security Administration Death Master File (SSA DMF) [14]. We used a consistent algorithm for patients seen at both institutions, to minimize any bias in death ascertainment. Since only 85% of our patients have provided us with a seemingly valid SSN as part of our normal registration process [5], the heuristic used to match EMR patients to the SSA DMF is as follows:
- Match with the patient’s full SSN and the month/year of the birthdate (YYYYMM)
- If not found, match with the patient’s exact last name, exact first name, full birthdate (YYYYMMDD) and the last 4 digits of the SSN.
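The two-step heuristic above can be sketched as follows. This is an illustrative reading rather than the production matching code: the `Person` layout is hypothetical, and the sketch assumes that a patient's `ssn` field may hold either a full nine-digit SSN or only its last four digits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical layout shared by EMR patients and DMF entries."""
    ssn: Optional[str]  # digits only; nine digits, last four only, or None
    first_name: str
    last_name: str
    birthdate: str      # YYYYMMDD


def match_to_dmf(patient: Person, dmf: Iterable[Person]) -> Optional[Person]:
    """Return the first DMF entry matching the two-step heuristic, else None."""
    entries = list(dmf)

    # Step 1: full SSN plus year and month of birth (YYYYMM).
    if patient.ssn and len(patient.ssn) == 9:
        for entry in entries:
            if entry.ssn == patient.ssn and entry.birthdate[:6] == patient.birthdate[:6]:
                return entry

    # Step 2: exact last name, exact first name, full birthdate, last 4 SSN digits.
    if patient.ssn and len(patient.ssn) >= 4:
        last_four = patient.ssn[-4:]
        for entry in entries:
            if (
                entry.ssn is not None
                and entry.ssn[-4:] == last_four
                and entry.last_name == patient.last_name
                and entry.first_name == patient.first_name
                and entry.birthdate == patient.birthdate
            ):
                return entry

    return None
```

Step 1 tolerates name variants and day-of-birth discrepancies because the full SSN carries most of the identifying weight; step 2 compensates for a partial or mistyped SSN by demanding exact agreement on every other field.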
We are currently using the Oncoshare database to prepare a manuscript that will report on the patterns and outcomes of breast cancer care across these community and academic health systems over the last decade. Additionally, we are now collaborating with patient advocates to develop questionnaires for collection of patient-reported information on care preferences, symptoms, and outcomes, which we will integrate into Oncoshare [15]. We plan on using and expanding this resource for other projects over time, e.g., adding genetic testing results and investigating their implications in treatment and outcomes.