presents a set of seven desiderata for the integration of genomic and other high volume biomolecular data into EHRs. We offer these functional characteristics both as a conceptual guideline for the design or extension of EHRs, and for their potential utility as evaluation criteria for the ‘meaningful use’ of EHRs to manage these types of data. The explanation and reasoning behind these desiderata are provided here.
Desiderata for the Integration of Genomic and other high volume biomolecular data into EHRs
1. Maintain separation of primary molecular observations from the clinical interpretations of those data
A common current practice for the reporting of genetic variation by clinical laboratories is to acquire a large number of molecular observations via high throughput technologies, such as solid state chips that measure hundreds to thousands of molecular variants. Laboratories then deliver into the health record a report in document format (on paper, through an electronic interface between the laboratory and the EHR, or in an open document exchange standard such as PDF) that cites only a small number of the observations made, combined with a professional interpretation of the significance of those observations. This parallels the current reporting practice in diagnostic medicine, pathology and radiology. This practice is potentially limiting in the emerging era of personalized medicine in three respects. First, it embodies a lossy sampling approach in which only a subset of the data is reported through a filter of professional opinion (albeit guided by then-current scientific evidence), and the remainder of the primary observation data is either discarded or held inaccessible. Second, it renders the primary data in a document format that is optimized only for human interpretation and is ill-suited to the use of computer-based decision support rules. Third, it represents a point-in-time interpretation in an immature and rapidly changing field of clinical science. The vast majority of molecular variants are currently of unknown significance, but it can reasonably be expected that determinations of significance, or lack thereof, will be assigned to increasing numbers of variants as genomic science evolves. Thus, the separation of primary observations from their interpretations, and the ability to update and improve those interpretations at a later date, will matter more for biomolecular data than for other common clinical data types. Novel approaches to dynamic reporting of clinical laboratory genotyping and associated genomic knowledge bases are currently being developed [6
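The separation argued for above can be illustrated with a minimal sketch in Python. It assumes a hypothetical data model in which a primary observation is stored once and never altered, while dated interpretations accumulate alongside it; the class names, field names, identifiers and dates are illustrative assumptions, not part of any EHR standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """A primary molecular observation; never modified after it is recorded."""
    obs_id: str
    locus: str      # hypothetical locus label
    genotype: str   # observed alleles, e.g. "A/G"


@dataclass(frozen=True)
class Interpretation:
    """A dated clinical reading of one observation."""
    obs_id: str
    significance: str
    asserted_on: date


def current_interpretation(obs_id, interpretations):
    """Return the most recent interpretation for an observation, or None."""
    matching = [i for i in interpretations if i.obs_id == obs_id]
    return max(matching, key=lambda i: i.asserted_on, default=None)


# The observation is stored once; interpretations are appended as knowledge changes.
obs = Observation("obs-1", "locus-A", "A/G")
history = [
    Interpretation("obs-1", "unknown significance", date(2010, 1, 15)),
    Interpretation("obs-1", "associated with altered drug response", date(2012, 6, 1)),
]
latest = current_interpretation(obs.obs_id, history)
```

Under this arrangement a variant of unknown significance can be re-read years later without re-testing the patient, because the later interpretation refers back to the unchanged primary observation.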
2. Support lossless data compression from primary molecular observations to clinically manageable subsets
The large volume of each individual's DNA, protein and related data (hundreds of gigabytes to terabytes in its raw form) exceeds the capacity of commonly available network bandwidth and disk storage in healthcare settings. In the absence of a major advance in data storage and transmission capabilities, this large volume of data will need to be compressed. Other high volume digital datasets do exist in healthcare, notably digital radiography and computed tomography, and specialized digital infrastructures (such as Picture Archiving and Communications Systems, PACS [8]) have been developed to store and display these data. A variety of data compression algorithms and image representation formats have been developed to accommodate the efficient transfer and viewing of clinical digital images. Most of these formats offer ‘lossy’ compression (reduction in file size associated with removal of data such that the ability to faithfully reconstruct all of the content of the original large volume source image is sacrificed). Since the key features of clinical images are often not exquisitely dependent upon single pixel level detail, this is a robust and useful approach to data compression for many types of health-related images.
In contrast, changing even a single letter of the ‘genetic alphabet’ (i.e. a point mutation) may dramatically affect human physiology. In some cases the significance of such changes is well known, as demonstrated by sickle cell disease and other inherited disorders [9
Therefore, any sufficient data compression approach needs to be able to produce a fully accurate copy of the original sequence.
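The lossless requirement can be stated operationally: decompressing must return exactly the sequence that was compressed. The following Python sketch illustrates this with two approaches, a simple two-bits-per-base packing and the general-purpose `zlib` codec from the standard library. It assumes a sequence made up only of the letters A, C, G and T (real data also contain ambiguity codes and quality scores), and the `pack` and `unpack` helpers are hypothetical names introduced for illustration.

```python
import zlib

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the original string; `length` strips padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "GATTACAGATTACAGGCT" * 50
packed = pack(seq)
restored = unpack(packed, len(seq))

# General-purpose lossless compression of the same text.
deflated = zlib.compress(seq.encode("ascii"))
inflated = zlib.decompress(deflated).decode("ascii")

# Both round trips must reproduce every base exactly.
assert restored == seq
assert inflated == seq
```

The round-trip assertions are the essential point: a scheme that cannot pass them for every input is lossy and, by the argument above, unsuitable for primary sequence data.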
3. Maintain linkage of molecular observations to the laboratory methods used to generate them
Measurement technologies for DNA sequence and expressed proteins are rapidly evolving, and all are constrained by non-zero error rates and “blind spots” representing biological phenomena that are not detectable by the method. For the foreseeable future, the laboratory instruments, chemistry and methods used to obtain high throughput molecular measurements such as single nucleotide polymorphism (SNP) arrays, exome sequences and full genome sequences will continue to evolve, with successive generations of instruments having different strengths and weaknesses. Genomic sequence data representation standards such as the Genome Variation Format (GVF) [10] and the Human Genome Variation Society's nomenclature for the description of sequence variants [11] are being proposed to provide common coordination across sequencing platforms. For this reason it will be essential that EHRs maintain provenance that links molecular observations with the laboratory methods used to generate those observations. This binding of methods with results is a structural component of widely used laboratory data standards such as LOINC [12], and is a feature of proposed and evolving data standards such as HL7 [13
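One practical consequence of method "blind spots" is that the absence of a result is ambiguous unless the method is known. The Python sketch below assumes a hypothetical record structure in which each observation carries a reference to the method that produced it, so that a query can distinguish a recorded genotype from a site the method never assayed. The method name, site labels and fields are illustrative assumptions and do not reproduce LOINC or HL7 structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """Laboratory method description (all fields hypothetical)."""
    name: str
    assayed_sites: frozenset  # sites this method is able to detect


@dataclass(frozen=True)
class Observation:
    site: str
    genotype: str
    method: Method  # provenance travels with the result


def status(site, observations, method):
    """Distinguish a recorded result from a site the method never assayed."""
    if site not in method.assayed_sites:
        return "not assayed"
    for obs in observations:
        if obs.site == site and obs.method == method:
            return obs.genotype
    return "assayed, no result recorded"


array_v1 = Method("snp-array-v1", frozenset({"site-1", "site-2"}))
results = [Observation("site-1", "A/G", array_v1)]

print(status("site-1", results, array_v1))
print(status("site-3", results, array_v1))
```

Without the link to `array_v1`, a query for `site-3` could not tell "no variant present" apart from "outside the coverage of this platform".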
4. Support compact representation of clinically actionable subsets for optimal performance
An important functionality of EHRs is the ability to rapidly find, assemble, and display the relevant clinical data for individual patients and groups of patients. Since the amount of molecular sequence data that currently has demonstrated clinical significance is a tiny fraction of the full genome and proteome, it is neither computationally feasible nor desirable to query or analyze one's entire genome in real time to support healthcare-related decisions such as drug prescribing or diagnostic test ordering. EHR systems need to access and display relevant information, and/or recognize and act upon clinically relevant molecular patterns, with sub-second response times [14]. These requirements for speed and efficiency make the creation of compact, derived forms of data representing the underlying molecular variation an attractive technical option in EHR systems. These derived observations can be efficiently represented as short “keywords” or codes representing a physiologic state. For example, the observation that an individual has a minor allele variant such as CYP2C19*2, which is associated with altered metabolism of commonly prescribed drugs, can be represented by a compact code of just a few unique alphanumeric characters or by a globally unique identifier from a structured vocabulary or ontology such as the Clinical Bioinformatics Ontology (CBO) [15
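The performance benefit of compact derived codes can be sketched in Python as follows. The example assumes that each patient's clinically actionable variants have already been reduced to a small set of short codes; the patient identifiers and the code `EXAMPLE-CODE-1` are hypothetical, and only `CYP2C19*2` is taken from the text. Individual checks and cohort queries then become set and dictionary lookups rather than scans over sequence data.

```python
from collections import defaultdict

# Hypothetical derived codes per patient; only CYP2C19*2 comes from the text.
patient_codes = {
    "patient-001": {"CYP2C19*2"},
    "patient-002": set(),
    "patient-003": {"CYP2C19*2", "EXAMPLE-CODE-1"},
}


def build_index(codes_by_patient):
    """Invert patient -> codes into code -> patients for cohort queries."""
    index = defaultdict(set)
    for patient, codes in codes_by_patient.items():
        for code in codes:
            index[code].add(patient)
    return index


index = build_index(patient_codes)

# Single-patient check: a set membership test on a handful of codes.
has_variant = "CYP2C19*2" in patient_codes["patient-001"]

# Group query: every patient carrying the code, from the inverted index.
cohort = sorted(index["CYP2C19*2"])
```

The full primary data remain available elsewhere for later re-interpretation; the codes serve only as the fast path for routine clinical queries.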
5. Simultaneously support human-viewable formats and machine-readable formats in order to facilitate implementation of decision support rules
In its simplest form, a single observation, such as the value of a single nucleotide polymorphism, is recognizable upon inspection by a healthcare professional and, as noted above, genotyping results are commonly displayed as laboratory report documents.
However, molecular variation data introduces into clinical practice volumes of data whose complexity routinely exceeds the bounds of unaided human cognition [16]. The rapidly expanding literature on the association between molecular variation patterns and clinical phenomena [17] makes it difficult for even genetic medicine specialists to stay current, and far exceeds the interpretive capacity of most non-specialist providers. Consequently, perhaps more than for any class of clinical data that has preceded it into the EHR, molecular variation data will benefit from the implementation of clinical decision support rules that are designed to recognize key patterns (such as DNA variation that predicts altered drug response) and guide practitioners via patient-specific alerts and reminders at clinically relevant times. The inherently cryptic nature of genetic polymorphisms argues for systems approaches that guide not only specialists, but also providers who “do not know what they do not know” with respect to clinically important molecular variation. [18
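A minimal Python sketch may help to show how one machine-readable record can serve both purposes named in this desideratum. It assumes that a patient's variants are stored as short codes, and it uses a hypothetical rule table in which `drug-x` is a placeholder rather than a real medication; the function names and message text are likewise illustrative assumptions, not a description of any deployed decision support system.

```python
# Illustrative rule table. "drug-x" is a placeholder, not a real medication.
RULES = [
    {
        "code": "CYP2C19*2",
        "drug": "drug-x",
        "message": "CYP2C19*2 on file: metabolism of drug-x may be altered.",
    },
]


def alerts_for_order(drug, patient_codes, rules=RULES):
    """Machine-readable path: return alerts fired by ordering `drug`."""
    drug = drug.lower()
    return [
        rule["message"]
        for rule in rules
        if rule["drug"] == drug and rule["code"] in patient_codes
    ]


def render_report(patient_id, patient_codes):
    """Human-viewable path: format the same codes as a short text report."""
    lines = [f"Genotype summary for {patient_id}"]
    if patient_codes:
        lines.extend(f"  - {code}" for code in sorted(patient_codes))
    else:
        lines.append("  (no coded variants)")
    return "\n".join(lines)


codes = {"CYP2C19*2"}
alerts = alerts_for_order("Drug-X", codes)
report = render_report("patient-001", codes)
print(report)
```

Because the alert and the report are both derived from the same coded data, the provider who never opens the report can still be reached by the rule at the moment of prescribing.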
6. Anticipate fundamental changes in the understanding of human molecular variation
Designing EHR capacity based on the expectation that an individual has a single, unique genome will be insufficient to accommodate the actual data requirements for EHRs. This premise, commonly associated with the genome contained in germline (i.e. heritable) DNA, needs, at a minimum, to be modified to accommodate diseases such as cancer, in which somatic mutations occur [1]. Thus, EHR systems need to anticipate circumstances such as a “unique genome for each metastasis”. The state of the germline DNA is generally inferred through sampling of leukocyte DNA, usually from blood and less often from saliva. Emerging evidence that the DNA represented in the leukocytes may undergo structural changes as a result of normal aging [20] also suggests that, as genomic science unfolds, EHRs may need to store multiple genome-scale datasets over an individual's lifetime. Other cell-, tissue-, organ-, and disease-specific genetic variations over time may yet be discovered. The oft-cited use case of storing 3 billion bases of DNA (which in reality is a minimum of 6 billion, since humans are diploid organisms), even when supplemented with data on copy number and splicing variation, is only a starting point for a much larger universe of person-specific molecular variation data.
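The scale implied by this argument can be made concrete with a short calculation, sketched here in Python. It assumes the most compact plain encoding of two bits per base and counts only the base calls themselves (no quality scores, reads or annotations, which account for the much larger raw volumes mentioned earlier); the number of genome-scale datasets per person is a hypothetical parameter.

```python
BITS_PER_BASE = 2                 # assumption: four-letter alphabet, no quality scores
HAPLOID_BASES = 3_000_000_000     # the oft-cited 3 billion bases
DIPLOID_BASES = 2 * HAPLOID_BASES # 6 billion, since humans are diploid


def sequence_bytes(bases, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store `bases` base calls at the given density."""
    return bases * bits_per_base // 8


def lifetime_bytes(n_genomes):
    """Bytes for several genome-scale datasets from one person (hypothetical count)."""
    return n_genomes * sequence_bytes(DIPLOID_BASES)


one_genome_gb = sequence_bytes(DIPLOID_BASES) / 1e9
ten_genomes_gb = lifetime_bytes(10) / 1e9
print(one_genome_gb, ten_genomes_gb)
```

Under these assumptions a single diploid genome occupies 1.5 gigabytes, and the requirement grows linearly with every additional tumor, metastasis or time point that is sequenced.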
7. Support both individual clinical care and discovery science
Historically, the intersection of clinical care and biomedical research has been relatively minimal, as evidenced by the small fraction of eligible patients who enroll in clinical trials [21]. Genomic science has unprecedented requirements for large numbers of individuals, each of whom has available large numbers of molecular observations, in order to confront the ‘curse of dimensionality’ (i.e. the expected false discovery rate of patterns that arise by chance when thousands to millions of simultaneous observations are made). Thus, to advance clinical science as rapidly and robustly as possible, the ability to support genomic discovery science as a secondary use of data acquired for person-specific care is at least a compelling opportunity, if not a social obligation to future generations. Such uses will be modulated by issues of consent and privacy; however, a research focus on human molecular variation, and the emerging capability to measure that variation at all locations where it occurs in the genome and proteome, make each individual's genome potentially a uniquely valuable research resource. Well-structured genomic information within the EHR will expedite secondary use of that data to support new discovery.
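The chance-finding problem described above can be illustrated with simple arithmetic, sketched here in Python. The sketch assumes independent tests that are all truly null, a conventional significance level of 0.05, and the Bonferroni correction as one familiar (and conservative) way to compensate; none of these choices is prescribed by the text.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance findings when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_variants = 1_000_000   # hypothetical number of variants tested at once
alpha = 0.05

chance_hits = expected_false_positives(n_variants, alpha)
per_test_alpha = bonferroni_threshold(n_variants, alpha)
print(chance_hits, per_test_alpha)
```

With a million simultaneous tests, roughly fifty thousand associations would be expected by chance at the uncorrected level, and the corrected per-test threshold becomes so strict that only very large cohorts provide adequate power, which is the motivation for drawing on data acquired during routine care.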