Genomic data for clinical decision making
Genomic data are becoming a routine component of clinical diagnosis and treatment. Prospective parents with a familial or ethnic history of genetic disease have long been encouraged to undergo genetic counseling, including genotyping for alleles associated with diseases such as Tay-Sachs and cystic fibrosis [1]. Recent research [3] demonstrating that several treatment responses depend on genomic profile promises to usher in the long-awaited era of personalized medicine, based on the patient's gene sequence or gene expression signature.
Clinically pertinent genomic data extend far beyond the patient's somatic genome sequence. Advanced cancer treatment options include genetic testing of cancer cells for specific markers, e.g. estrogen receptors in breast cancer or the Philadelphia chromosome in CML [4]. This type of diagnostic will likely expand into full genomic profiling of cancer cells to help determine appropriate treatment [5]. In addition, much recent literature has uncovered correlations between gene expression patterns and clinical diagnosis [6]. While genome sequence data change rarely, gene expression data vary across cell types and over time. The cost of both diagnostic tests is falling rapidly with high-throughput techniques [8]. It is likely that all patients will be genotyped at some point in their lives, and that gene expression levels will be measured for many serious ailments. Concurrently with these significant technical developments, the Internet has made patients markedly more autonomous in medical decision making, perhaps even more knowledgeable than care providers, particularly in the realm of genetic testing [9].
A new type of clinical data
Currently, the most efficient way to genotype "most of" a human being is to genotype tag SNPs according to the HapMap [13], which is expected to highlight close to 600,000 SNPs that identify most of the clinically useful genetic diversity. In addition, specific larger mutations need to be checked, including large deletions or insertions. Considering the redundancy necessary for indexing partial records, the size of a single patient's genome sequence might quickly reach 3 or 4 megabytes. Gene expression datasets, such as those from the popular Affymetrix U133 Plus 2.0 platform [14], cover close to 54,000 RNA transcripts and require approximately 500 kilobytes of storage, including indexing. For long-term quality assurance, one may want to store raw instrumentation data, which increases the size of a single transcriptome to 12 megabytes. The size of these data, though much larger than that of typical diagnostic tests, is not entirely unprecedented: MRIs and other imaging results can require many megabytes of storage, too. The key difference is in the granularity of the data. While an MRI is generally shared as an atomic block of data, a patient is unlikely to want to share his entire genomic dataset, as it has been shown that a mere 20 randomly chosen SNPs are enough for unique, perpetual identification [15]. Instead, within these tens of megabytes of genomic data, a patient might want to share a few SNPs with his doctor, nothing more. Even the existence of test results against other SNPs, particularly as the genomic tests are not yet complete scans of the HapMap, should remain private.
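The granularity argument can be made concrete with a little arithmetic on the figures quoted above. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 megabyte = 1,000,000 bytes), and the helper name `bytes_per_item` is hypothetical.

```python
# Back-of-the-envelope sizes per datapoint, using the figures quoted in the text.
# Assumption: decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes).
MB = 1_000_000
KB = 1_000

TAG_SNPS = 600_000      # tag SNPs expected from the HapMap
TRANSCRIPTS = 54_000    # RNA transcripts on the U133 Plus 2.0 platform


def bytes_per_item(total_bytes: int, item_count: int) -> float:
    """Average storage per individual result, including indexing overhead."""
    return total_bytes / item_count


snp_low = bytes_per_item(3 * MB, TAG_SNPS)
snp_high = bytes_per_item(4 * MB, TAG_SNPS)
transcript_indexed = bytes_per_item(500 * KB, TRANSCRIPTS)
transcript_raw = bytes_per_item(12 * MB, TRANSCRIPTS)

print(f"per SNP: {snp_low:.1f} to {snp_high:.1f} bytes")
print(f"per transcript (indexed): {transcript_indexed:.1f} bytes")
print(f"per transcript (raw data): {transcript_raw:.1f} bytes")
```

Under these assumptions each SNP accounts for roughly 5 to 7 bytes and each indexed transcript for roughly 9 bytes, which illustrates how small the individually shareable results are compared with the dataset as a whole.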
Thus, genomic data should be treated as a very large sequence of small results. Each result is clinically meaningful (e.g. a BRCA1 allele), and a small handful of results is enough to genomically fingerprint an individual. In other words, the management of genomic data presents a significant challenge: how can we efficiently store and retrieve patient genomic data while respecting strong privacy constraints?
Extension of PING
GenePING, as its name implies, is an extension to the Personal Internetworked Notary and Guardian (PING) health record management system [16]. PING allows patients and health-care providers to share health record data under access control rules defined by the patient. All data is exchanged in XML with publicly defined schemas, and the protocol is implemented using XML over HTTP (preferably HTTPS). The data store itself persists as a set of encrypted objects, thereby reducing the widely reported threat of disclosure from discarded or mismanaged disk drives [18]. PING also includes a Java client, PING Display, which GenePING extends to present its user interface. Other groups are developing alternative open source PING clients, including ones that use dynamic HTML. On the back-end, GenePING revamps PING's low-level storage and high-level medical document organization in order to enable the secure storage of fine-grained genomic data. The threat model is left unchanged: the PING server remains a semi-trusted information broker with the keys to an encrypted data store. In addition, the new GenePING storage system addresses the specific threats of genomic data leakage while meeting the efficiency requirements.
On the front-end, GenePING integrates into the default PING client (Figure ). Patients can view their genotype labs listed much as typical lab results (Figure ). A single lab can be browsed incrementally, one set of SNPs at a time (Figure ).
GenePING main screen. GenePING fits into the standard PING architecture. This screen shows the main PING screen with the added "Genotype Lab" option supported by the GenePING extension.
GenePING genotype lab list view. A patient's view of his list of genotype datasets.
GenePING single genotype lab view. A patient's view of one of his genotype datasets. Note the interface that allows datapoints to be downloaded in batches, or individually when queried.
GenePING defines a variable-size, keyed-block, low-level storage interface. This interface behaves much like a persistent hash table, and the underlying implementation of this interface is expected to support hundreds of millions of records. Possible implementations of this storage interface include a raw filesystem, a SQL database, an object store, or some distributed storage mechanism. The default GenePING uses Berkeley DB for Java [19], a transactional block store whose functionality maps closely to the PING interface API. Before the data is stored via this low-level interface, it may be altered for various purposes. The GenePing Store interface supports generic extensions, each of which can, in turn, modify the name and value of the stored record. Once these extensions are registered, the name-and-value changes are applied automatically upon interaction with the block storage interface. Storage extensions can be particularly useful for data compression, data encryption, and obfuscation.
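The extension mechanism described above can be illustrated with a minimal sketch. It is written in Python rather than GenePING's Java, uses an in-memory dictionary in place of Berkeley DB, and every class and method name in it (`BlockStore`, `register`, `transform_name`, `encode_value`, `decode_value`) is a hypothetical stand-in rather than the actual GenePING API.

```python
import zlib
from typing import Dict, List


class CompressionExtension:
    """Hypothetical extension: leaves names alone, compresses values."""

    def transform_name(self, name: bytes) -> bytes:
        return name

    def encode_value(self, value: bytes) -> bytes:
        return zlib.compress(value)

    def decode_value(self, value: bytes) -> bytes:
        return zlib.decompress(value)


class PrefixExtension:
    """Hypothetical extension: toy name rewrite standing in for obfuscation."""

    def __init__(self, prefix: bytes) -> None:
        self.prefix = prefix

    def transform_name(self, name: bytes) -> bytes:
        return self.prefix + name

    def encode_value(self, value: bytes) -> bytes:
        return value

    def decode_value(self, value: bytes) -> bytes:
        return value


class BlockStore:
    """Keyed-block store that behaves like a hash table (in memory here)."""

    def __init__(self) -> None:
        self._blocks: Dict[bytes, bytes] = {}
        self._extensions: List[object] = []

    def register(self, extension: object) -> None:
        self._extensions.append(extension)

    def _stored_name(self, name: bytes) -> bytes:
        # Name changes are applied in registration order on every access.
        for ext in self._extensions:
            name = ext.transform_name(name)
        return name

    def put(self, name: bytes, value: bytes) -> None:
        for ext in self._extensions:
            value = ext.encode_value(value)
        self._blocks[self._stored_name(name)] = value

    def get(self, name: bytes) -> bytes:
        value = self._blocks[self._stored_name(name)]
        # Value changes are undone in reverse registration order.
        for ext in reversed(self._extensions):
            value = ext.decode_value(value)
        return value


store = BlockStore()
store.register(CompressionExtension())
store.register(PrefixExtension(b"demo:"))
store.put(b"record-1", b"AG" * 100)
print(store.get(b"record-1") == b"AG" * 100)
```

Callers only ever use the logical name and the plain value; the registered extensions decide what actually reaches the underlying block store.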
Security and obfuscation
The GenePING server requires two cryptographic keys to find and decrypt its data: an HMAC key [20] for hashtable-name obfuscation, and an AES key [21] for hashtable-data encryption. Without both of these keys, the raw storage is useless (to an attacker, for example). In a production GenePING installation, these two keys are expected to be loaded into RAM from an administrator's secure token. They should never be stored on disk.
HMAC is typically used to create an authentication "fingerprint" of a message using a secret key. It is often thought of as a keyed hash function: it is collision resistant, meaning that it is extremely unlikely to ever find two messages with the same MAC, and it is one-way, meaning that, given a random MAC value, it is extremely difficult to identify even a single message whose MAC will match. The secret key adds a further dimension above and beyond hash functions: with it, a MAC on any given message is easy to compute, but without it, it is nearly impossible.
The HMAC is used to obfuscate the name of any record sent to the low-level block store. When GenePING wishes to store a record under the name patient/at/chip.org, the actual low-level block store will instead use the name HMAC_key(patient/at/chip.org). Therefore, without the HMAC key, it is impossible to determine whether a given record corresponds to a given Ping ID, or even whether a certain Ping ID is stored in the given system.
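As a rough illustration of this name obfuscation, the sketch below computes the stored name with Python's standard `hmac` module. The choice of SHA-256 as the underlying hash and the function name `obfuscated_name` are assumptions; the text does not say which hash function GenePING pairs with HMAC.

```python
import hashlib
import hmac
import secrets


def obfuscated_name(hmac_key: bytes, record_name: str) -> str:
    """Return the name actually used in the block store: HMAC_key(record_name).

    Assumption: HMAC-SHA256, hex-encoded. The original system may differ.
    """
    mac = hmac.new(hmac_key, record_name.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()


# In production the key would be loaded into RAM from a secure token,
# never stored on disk; a random key is generated here for the demo.
storage_hmac_key = secrets.token_bytes(32)

stored_as = obfuscated_name(storage_hmac_key, "patient/at/chip.org")
print(stored_as)
```

The same key always maps a given name to the same stored name, so lookups remain a single hash-table access, while someone without the key cannot test whether a particular Ping ID is present.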
In order to secure the data, we use straightforward AES encryption in CBC mode [22]. Every time a record is stored via the low-level block interface, the data field is encrypted using AES with the single storage key and a new, random initialization vector used only that one time. The use of a new initialization vector prevents the inherent redundancy of genomic records from showing through at the ciphertext level: two identical SNPs will never be recorded identically, since their initialization vectors will be different. Both the HMAC and AES algorithms are implemented as extensions to the block-level storage, so that all GenePING calls are automatically passed through these obfuscation and security filters.
Securing granular data
Storing individual blocks securely is not enough. A large block, for example, could inherently reveal the presence of a genomic dataset, given the uniquely large size of genomic data. In addition, storing such a large block would make access to a single SNP extremely inefficient: the entire genome sequence would need to be decrypted before the single SNP datapoint becomes available. Thus, it is crucial to store genomic datasets in small, meaningful chunks. Ideally, reading one SNP should require reading little more than just that encrypted datapoint.
For this purpose, we introduce the concept of a secure array, a mechanism for securely breaking up an array into its elements, storing them individually, and allowing for efficient retrieval of individual items, all while obfuscating the relationship between the elements of that array from anyone not in possession of the cryptographic storage keys. In our implementation, a secure array is defined by a small root block which contains two short pieces of information: an array size, and a unique HMAC key, arraykey. Each element of the secure array is, as expected, indexed by its integer position in the array, call it i. The block that contains the i'th element of the array is then stored in the low-level block storage under the name computed as HMAC_arraykey(i). The collision resistance of the HMAC algorithm makes it overwhelmingly likely that this strategy yields a unique location for every array element. The one-way property ensures that, without the key, two blocks cannot be identified as belonging to the same array.
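A minimal sketch of such a secure array follows. It makes several simplifying assumptions that are not taken from the original implementation: the block store is a plain in-memory dictionary, HMAC-SHA256 is the assumed MAC, the root block (size and array key) is kept in memory instead of being stored as an encrypted block, and element values are left unencrypted. All names, including `SecureArray` and `batch`, are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List


def block_name(key: bytes, label: bytes) -> bytes:
    """Name of a block in the low-level store: HMAC_key(label)."""
    return hmac.new(key, label, hashlib.sha256).digest()


class SecureArray:
    """Array whose elements live in separate, unlinkable blocks."""

    def __init__(self, store: Dict[bytes, bytes], size: int, array_key: bytes):
        # size and array_key together play the role of the root block.
        self.store = store
        self.size = size
        self.array_key = array_key

    @classmethod
    def create(cls, store: Dict[bytes, bytes], elements: List[bytes]) -> "SecureArray":
        array = cls(store, len(elements), secrets.token_bytes(32))
        for i, element in enumerate(elements):
            store[array._name(i)] = element
        return array

    def _name(self, i: int) -> bytes:
        # Element i is stored under HMAC_arraykey(i).
        return block_name(self.array_key, str(i).encode("ascii"))

    def get(self, i: int) -> bytes:
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.store[self._name(i)]

    def batch(self, start: int, count: int = 10) -> List[bytes]:
        """Read a small window of elements without touching the rest."""
        stop = min(start + count, self.size)
        return [self.get(i) for i in range(start, stop)]


blocks: Dict[bytes, bytes] = {}
genotypes = SecureArray.create(blocks, [b"AA", b"AG", b"GG"] * 5)
print(genotypes.get(1))
print(genotypes.batch(10))
```

Reading one element costs one HMAC computation and one block lookup, regardless of how many elements the array holds.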
We then extend the secure array with a secure index, which is effectively an additional unique HMAC key that helps locate elements of the array according to a scheme other than the element's integer position. For example, a SNP element might need to be located by its SNP id. Since the element is already stored at the location given by HMAC_arraykey(i), a secure index requires an additional level of indirection. HMAC_indexkey(snp123) will yield the name of a block which will itself point to the real location of snp123.
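The indirection can be sketched as follows. As before, this is only an illustration under stated assumptions: an in-memory dictionary stands in for the block store, HMAC-SHA256 is the assumed MAC, blocks are left unencrypted, and the pointer block is assumed to hold the element's block name directly. The function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple


def block_name(key: bytes, label: bytes) -> bytes:
    """Name of a block in the low-level store: HMAC_key(label)."""
    return hmac.new(key, label, hashlib.sha256).digest()


def build_indexed_array(
    store: Dict[bytes, bytes], snps: Dict[str, bytes]
) -> Tuple[bytes, bytes]:
    """Store each SNP as an array element plus one index pointer block."""
    array_key = secrets.token_bytes(32)
    index_key = secrets.token_bytes(32)
    for i, (snp_id, genotype) in enumerate(snps.items()):
        # Element i lives at HMAC_arraykey(i).
        element_name = block_name(array_key, str(i).encode("ascii"))
        store[element_name] = genotype
        # HMAC_indexkey(snp_id) names a block that points at the element.
        store[block_name(index_key, snp_id.encode("utf-8"))] = element_name
    return array_key, index_key


def lookup_by_snp_id(store: Dict[bytes, bytes], index_key: bytes, snp_id: str) -> bytes:
    """Follow the pointer block to the element's real location."""
    pointer = store[block_name(index_key, snp_id.encode("utf-8"))]
    return store[pointer]


blocks: Dict[bytes, bytes] = {}
array_key, index_key = build_indexed_array(blocks, {"snp123": b"AG", "snp456": b"CC"})
print(lookup_by_snp_id(blocks, index_key, "snp123"))
```

A lookup by SNP id therefore costs two block reads instead of one, which is the price of keeping the positional layout and the id-based index independent.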
This technique for secure granular storage allows for efficient browsing of genotype data. While a single record may contain megabytes of data, individual SNP data points can be browsed in batches of ten. Decryption and network transfer are thus also done in batches, either via a browsing interface or a search interface for specific SNPs (Figure ).
Privacy profiles & filtered documents
An additional complication in the personal management of genomic health records is patient education. While the average patient will likely understand how to share a blood test result, sharing genomic data becomes complicated: which SNPs should an individual share with his doctor? Is it realistic to expect patients to set permissions on their SNPs individually? With hundreds of thousands of datapoints in a single test, a patient without guidance is likely to simply give any health care provider complete access to his genome, a choice that could seriously affect the patient's privacy.
To address this issue, we introduced Privacy Profiles, each a list of SNPs, rare mutations, and gene names. Each privacy profile fulfills a given clinical purpose, e.g. "Breast Cancer Markers" would include all SNPs relevant to genetic breast cancer predisposition. Privacy profiles are represented using XML with a public schema. The definition of these profiles can be left to the proper organizations, e.g. patient advocacy groups or the FDA.
A patient can then apply a privacy profile to any genotype lab document in GenePING (Figure ), effectively creating a new Filtered Document within his record (Figure ), which can then be shared with the appropriate health care provider. Upon reading a patient's record, a health care provider sees only this new document, which contains only the filtered data, and not the original, complete genotype document (Figure ). The fact that the document is the result of a filter is also invisible to anyone other than the patient himself.
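The filtering step can be sketched as below. The XML layout shown here is purely hypothetical (the actual public schema for privacy profiles is not reproduced in the text), as are the element names, the placeholder SNP ids, and the functions `parse_profile` and `apply_profile`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical profile document; the real schema may look quite different.
PROFILE_XML = """
<privacyProfile name="Breast Cancer Markers">
  <snp id="snp123"/>
  <snp id="snp456"/>
  <gene name="BRCA1"/>
</privacyProfile>
"""


def parse_profile(xml_text: str) -> Dict[str, object]:
    """Extract the profile name and the SNP ids and gene names it lists."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.get("name"),
        "snps": {e.get("id") for e in root.findall("snp")},
        "genes": {e.get("name") for e in root.findall("gene")},
    }


def apply_profile(
    profile: Dict[str, object], genotype_lab: List[Dict[str, Optional[str]]]
) -> List[Dict[str, Optional[str]]]:
    """Return a new filtered document containing only permitted results."""
    return [
        dict(result)  # copy, so the filtered document is independent
        for result in genotype_lab
        if result["snp"] in profile["snps"] or result.get("gene") in profile["genes"]
    ]


full_lab = [
    {"snp": "snp123", "gene": None, "genotype": "AG"},
    {"snp": "snp777", "gene": "BRCA1", "genotype": "CC"},
    {"snp": "snp999", "gene": None, "genotype": "TT"},
]

profile = parse_profile(PROFILE_XML)
filtered_document = apply_profile(profile, full_lab)
print([r["snp"] for r in filtered_document])
```

The filtered document is a separate object that carries no marker of its origin, which mirrors the requirement that a provider cannot tell whether a shared document was filtered.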
GenePING filtering a document. A patient can choose to create a filtered version of a genotype dataset, according to one of a preset list of privacy profiles.
GenePING permission granting. Once this filtered document is created, a patient can grant read permission on this new document to his doctor.
Figure 6 GenePING health care provider filtered view. A doctor viewing a patient's records. Note how only one document shows up, and no information is available as to whether that document is filtered or not. Comparison with the patient's view reveals that it ...