Tissue Microarrays (TMAs) have emerged as a powerful tool for examining the distribution of marker molecules in hundreds of different tissues displayed on a single slide. TMAs have been used successfully to validate candidate molecules discovered in gene array experiments. Like gene expression studies, TMA experiments are data intensive, requiring substantial information to interpret, replicate or validate. Recently, an open access Tissue Microarray Data Exchange Specification has been released that allows TMA data to be organized in a self-describing XML document annotated with well-defined common data elements. While this specification provides sufficient information for the reproduction of the experiment by outside research groups, its initial description did not contain instructions or examples of actual implementations, and no implementation studies have been published. The purpose of this paper is to demonstrate how the TMA Data Exchange Specification is implemented in a prostate cancer TMA.
The Cooperative Prostate Cancer Tissue Resource (CPCTR) is funded by the National Cancer Institute to provide researchers with samples of prostate cancer annotated with demographic and clinical data. The CPCTR now offers prostate cancer TMAs and has implemented a TMA database conforming to the new open access Tissue Microarray Data Exchange Specification. The bulk of the TMA database consists of clinical and demographic data elements for 299 patient samples. These data elements were extracted from an Excel database using a transformative Perl script. The Perl script and the TMA database are open access documents distributed with this manuscript.
TMA databases conforming to the Tissue Microarray Data Exchange Specification can be merged with other TMA files, expanded through the addition of data elements, or linked to data contained in external biological databases. This article describes an open access implementation of the TMA Data Exchange Specification and provides detailed guidance to researchers who wish to use the Specification.
TMA technology was introduced in 1998. A TMA fundamentally differs from a conventional glass slide only in the number of tissue samples included [see Figure 1]. Tissue microarrays typically contain between 100 and 1,000 core tissue samples. A single TMA block can be sectioned and distributed to dozens of laboratories, saving years of preparation time and hundreds of thousands of dollars in tissue collection costs, and conserving experimental reagents by measuring a marker's distribution on hundreds of specimens arrayed on a single glass slide. Several studies have demonstrated the value of TMAs to validate the biologic relevance of candidate genes expressed in prostate cancers [2-6].
Because TMAs are designed to answer questions applicable to pathologic lesions with specific sets of attributes (e.g. stage, grade or diagnostic subtype), preparation of a TMA requires access to large archives of paraffin-embedded tissues. Each TMA core tissue must be annotated with clinical, demographic or histopathologic information so that measurements on the TMA core samples can result in clinically useful correlations. To ensure inter-laboratory reproducibility, information describing the preparation of TMA blocks and slides needs to be provided along with the TMA data records.
The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a multi-institutional virtual tissue bank funded by the U.S. National Cancer Institute (NCI) to provide researchers with samples of prostate cancer tissues. The member institutions of the CPCTR are New York University, George Washington University, University of Pittsburgh and Medical College of Wisconsin. The CPCTR began service to the cancer research community on December 6, 2001. The CPCTR has over 5,000 prostate cancer specimens including radical prostatectomy cases (paraffin and fresh-frozen) and paraffinized needle biopsies. The CPCTR represents the largest repository of histologically-characterized and clinically annotated prostate cancer tissue in the USA. All accrued cases undergo pathology review and all clinical data is collected using methodology standardized across the participating institutions. CPCTR resources are available to all researchers, academic and commercial. Further information can be obtained from the CPCTR website.
The CPCTR has constructed a prostate cancer TMA implemented in conformance with the new TMA Data Exchange Specification (hereinafter designated "the Specification"). The Specification was developed through a series of open workshops sponsored by the National Cancer Institute and the Association for Pathology Informatics. Tissue data included in the CPCTR TMA database is de-identified and assembled in an open access database to permit data sharing, in compliance with current NIH policy on data sharing and in concert with ongoing NIH initiatives to develop new methods for sharing research data [11,12].
The TMA Data Exchange Specification was designed to allow TMA database files to be totally self-describing. The properties of a self-describing database file include:
1. An informative header that explains the purpose of the file and provides all the information needed to understand the file (i.e., its organization)
2. Information regarding the creation of the file (e.g., creator, date of creation)
3. Rights of use (e.g., specifying any restrictions on use)
4. Warranty information
5. Methodology (e.g., how the data contained in the file was obtained)
6. Data (the data records themselves)
7. Metadata (the data that describes or defines the actual data)
8. Metadata definitions (clear descriptions and definitions of the metadata)
The typical database contains data (property #6) but nothing else in the way of self-descriptive annotation. The CPCTR implementation of the Specification has all eight properties and employs the following enhancements:
1. Uses Uniform Resource Locators (URLs) to link the TMA database with web documents that provide detailed information supplementing the metadata tags. These external URLs are:
a. A link to the Dublin Core Meta Data Elements used in the header section of the document.
b. A link to the ISO-11179-compliant listing of Common Data Elements (CDEs) provided in the Specification.
c. A link to the CPCTR CDEs.
2. Supports complex TMAs within a single TMA file. In this case, a single TMA file contains four blocks, with cores from a single tissue sample appearing in multiple locations in more than one block.
3. Protects patient privacy (by de-identifying all data)
4. Allows data sharing (by permitting free distribution of the XML data document)
Tissue microarrays allow for the high throughput analysis of tissue samples and their association with clinical or outcomes data. Yet these experiments require a large amount of information for the subsequent analysis and evaluation, in particular by interested second parties. The Specification provides an accurate and reproducible method for the transfer of this information, as is required for inter-laboratory reproducibility. One of the most important problems with modern data specifications is the daunting technical expertise required for their implementation. The Specification was written to permit maximal flexibility and minimal implementation requirements. This study demonstrates that the Specification can be implemented using a simple Perl script that converts an Excel database into XML-tagged data elements. The resulting large section of core-related XML text can be simply inserted into a conformant document containing header, block and slide information. The resulting TMA database can be validated with a Perl script provided with the Specification document.
All institutions participating in the CPCTR have Institutional Review Board (IRB) approval for human subjects research. Each CPCTR institution develops its own local protocols to protect the confidentiality and privacy of human subjects and obtains local IRB approval for all CPCTR activities. The IRB assurance numbers for each cooperating institution are: New York University – M1177; Medical College of Wisconsin – M1061; University of Pittsburgh Medical Center – M1256; and George Washington University Medical Center – M1125. Tissue data records from the cooperating institutions are submitted to a central data manager (Information Management Services, Inc., contracted by the NCI) as de-identified records. All institutions assign an arbitrary number to each record before submitting the de-identified record to the central database. This ensures that the central database has no links connecting records to patients. In addition, HIPAA's proscribed set of 18 data elements are omitted from core sample records (the so-called safe harbor approach to HIPAA compliance).
The CPCTR maintains a publicly available Manual of Operations that describes its tissue collection procedures and policies.
Pathological characterization of specimens involves review of all cases by a CPCTR pathologist using diagnostic criteria explained in the publicly available CPCTR histologic atlas and manual.
The Specification is an open access document that can be used without restriction.
The Specification requires four general sections for each TMA file:
1) Header, containing the specification Dublin Core identifiers, 2) Block, describing the paraffin-embedded array of tissues, 3) Slide, describing the glass slides produced from the Block, and 4) Core, containing all data related to the individual tissue samples contained in the array. The simplest possible structure for a conforming TMA file consists of nothing more than empty tags designating the four required sections [see Figure 2].
Common Data Elements (CDEs) are metadata tags that describe the data elements included in an XML database. To be of value, CDEs must be well-defined, uniquely identified and available for human review or computer access. Eighty CDEs, conforming to the ISO-11179 specification for data elements, constitute the XML tags provided in the Specification. CDE descriptors are publicly available. However, the only CDEs that must appear in any conforming TMA file are the section CDEs (header, block, slide and core), the root CDE (histo) and the tma CDE itself (tma). A set of six simple semantic rules describes the syntax for the data exchange specification.
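As an illustration of how little a conforming file requires, the following Python sketch builds a skeleton from the six CDE names listed above. It is not taken from the Specification or from the CPCTR scripts. The nesting shown (the four sections placed directly inside tma, which sits inside histo) and the XML declaration line are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The four section CDEs named in the text.
SECTION_CDES = ("header", "block", "slide", "core")


def minimal_tma_document() -> str:
    """Return a skeleton TMA file containing only the required, empty CDEs."""
    root = ET.Element("histo")        # root CDE
    tma = ET.SubElement(root, "tma")  # tma CDE
    for name in SECTION_CDES:
        ET.SubElement(tma, name)      # empty section placeholder
    # Assumed layout: sections are direct children of <tma>.
    body = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return '<?xml version="1.0"?>\n' + body


print(minimal_tma_document())
```

A real TMA file would then be produced by filling each empty section with the appropriate CDEs.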
The Specification was designed for maximal flexibility. Flexibility in the first version of an XML specification permits the addition of greater structure in later versions built on tested implementations. A similar approach has been used for the ANSI/HL7 Clinical Document Architecture (CDA), wherein the earliest version (Level One) is intentionally sparse. At this time, there is no DTD (Document Type Definition) or Schema included in the Specification. For those wishing to use a DTD, a Specification-compliant DTD has been prepared by David G. Nohle, Ohio State University Department of Pathology and the Mid-Region AIDS & Cancer Specimen Resource (ACSR).
Constructing a TMA Database consists of the following:
1. Filling the four sections (header, block, slide and core)
2. Assembling the four sections into a TMA file with a proper file declaration, root element and TMA CDE.
3. Validating that the TMA file conforms to the specification
The header, block and slide sections of the TMA will vary only slightly from project to project within a laboratory. The CPCTR header, block and slide sections were prepared "by hand" using the section-specific CDEs provided in the specification.
The header section contains descriptive information about the file and its contents. With the exception of one CDE (filename), the header CDEs are the same CDEs used in the Dublin Core set of XML identifiers used by librarians. Detailed information describing the Dublin Core elements is available. A link to the Dublin Core elements is also included in the CPCTR TMA database. The first few lines of the TMA database are shown [see Figure 3]. The block and slide headers of the TMA database are short and are also completed manually.
The cores are distributed for each block in an array, with cores assigned to specific locations [see Figure 4], and all the cores in an array are assigned to a slide, which is a numbered section derived from a block [see Figure 5]. The core section contains annotated data for each core in the TMA. The central database for all CPCTR tissues is maintained as an Excel database by an NCI-contracted information management service (IMS, Rockville, MD). IMS extracts an Excel sub-file consisting of records pertaining to the tissues selected for the TMA block. CPCTR-specific data elements included in the IMS records are publicly available.
A Perl script was written that converts Excel files to XML, enclosing the data in each spreadsheet cell in the XML CDE tags corresponding to its column heading. This creates the "core" section of the TMA database. A sample of an XML-tagged extracted data record is shown [see Figure 6]. The Perl script is available as an open access file with this article [see Additional file 1].
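The conversion idea can be sketched in a few lines of Python. This is not the authors' Perl script (which drives Excel directly under Windows). It assumes the spreadsheet has first been exported as comma-separated text, that each record is wrapped in a core element, and that column headings can serve directly as tag names. The headings and values in the sample are invented for illustration.

```python
import csv
import io
from xml.sax.saxutils import escape


def rows_to_core_xml(table_text: str) -> str:
    """Wrap every non-empty cell in a tag named after its column heading."""
    reader = csv.DictReader(io.StringIO(table_text))
    lines = []
    for row in reader:
        lines.append("<core>")  # assumed wrapper: one element per record
        for heading, value in row.items():
            if heading and value:  # skip empty cells and stray extra fields
                tag = heading.strip().lower().replace(" ", "_")
                lines.append(f"  <{tag}>{escape(value.strip())}</{tag}>")
        lines.append("</core>")
    return "\n".join(lines)


# Hypothetical two-record export; headings and values are invented.
sample = "core_id,age_at_diagnosis,gleason_sum\nC001,63,7\nC002,58,\n"
print(rows_to_core_xml(sample))
```

Escaping cell values keeps characters such as & and < from breaking the resulting XML, which matters when free-text annotation columns are included.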
The CPCTR prostate cancer TMA consists of 299 core samples distributed in four blocks, each block having 300 arrayed cores. Each block contains about 150 core samples, each placed in two different locations within the block. The core duplicates are staggered in the array to maximize the chance that a given core will be represented if an area of the slide section is lost in processing. The distribution of one set of core samples in multiple array locations in four blocks yields a complex TMA that cannot be adequately represented by separate descriptions of each block. The Specification permits multi-block TMA files. Within the block CDE are the nested sets of four blocks that compose the complex TMA. Each core CDE is nested within a specific block CDE, and one core may have two associated array locations [see Figure 6].
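The nesting described above can be pictured with a small Python sketch. Only the block and core element names come from the text. The child element names (block_number, sample_id, array_location), the row,column encoding and the placements themselves are hypothetical placeholders, not CDEs from the Specification or values from the CPCTR file.

```python
import xml.etree.ElementTree as ET

# Hypothetical placements: (block number, sample id, [(row, column), ...]).
PLACEMENTS = [
    (1, "S001", [(1, 1), (9, 12)]),
    (1, "S002", [(1, 2), (9, 13)]),
    (2, "S001", [(2, 5), (10, 16)]),
]


def build_block_section(placements) -> ET.Element:
    """Nest one <block> per physical block inside the block section."""
    section = ET.Element("block")
    blocks = {}
    for block_no, sample_id, locations in placements:
        if block_no not in blocks:
            blocks[block_no] = ET.SubElement(section, "block")
            ET.SubElement(blocks[block_no], "block_number").text = str(block_no)
        core = ET.SubElement(blocks[block_no], "core")
        ET.SubElement(core, "sample_id").text = sample_id
        for row, col in locations:  # a core may carry two locations
            ET.SubElement(core, "array_location").text = f"{row},{col}"
    return section


def locations_by_sample(section: ET.Element) -> dict:
    """Count array locations per sample across all nested blocks."""
    counts = {}
    for core in section.iter("core"):
        sample_id = core.findtext("sample_id")
        counts[sample_id] = counts.get(sample_id, 0) + len(core.findall("array_location"))
    return counts


section = build_block_section(PLACEMENTS)
print(ET.tostring(section, encoding="unicode"))
print(locations_by_sample(section))
```

Keeping every location under its core, and every core under its block, lets a single file answer "where does this sample appear?" without cross-referencing separate per-block descriptions.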
The four sections are concatenated as a single XML database file. The CPCTR database file is provided with this manuscript [see Additional file 2].
Once a TMA database is prepared, it needs to be validated to ensure conformance with the Specification. At this time, all TMA files should be validated using a software implementation written in Perl and distributed as an open access supplemental file with the Specification and with this publication [see Additional file 3]. The validating script requires a Perl installation but should operate equally well on any operating system. The validation software has a simple command-line interface. When the file successfully validates, the Perl script outputs the encountered CDEs from the Specification, a statement that the file is valid, and a one-way hash value specific for the validated file [see Figure 7].
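The general shape of such a check can be sketched in Python. This is not the logic of the distributed Perl validator: it only tests that the document is well-formed, that the root is histo, and that the six required CDEs named earlier are present. It does not apply the Specification's six semantic rules. The choice of SHA-256 as the one-way hash is an assumption, since the text does not name the algorithm.

```python
import hashlib
import xml.etree.ElementTree as ET

# Required CDEs as listed in the text: root, tma, and the four sections.
REQUIRED_CDES = ("histo", "tma", "header", "block", "slide", "core")


def check_tma(xml_text: str):
    """Return (is_valid, encountered_tags, sha256_hex) for a TMA document."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    encountered = sorted({element.tag for element in root.iter()})
    missing = [name for name in REQUIRED_CDES if name not in encountered]
    is_valid = root.tag == "histo" and not missing
    # Assumed hash algorithm; the original validator's choice is not stated.
    digest = hashlib.sha256(xml_text.encode("utf-8")).hexdigest()
    return is_valid, encountered, digest


valid, tags, digest = check_tma(
    "<histo><tma><header/><block/><slide/><core/></tma></histo>"
)
print(tags)
print("valid" if valid else "not valid")
print(digest)
```

Reporting a hash alongside the verdict gives recipients a way to confirm that the file they hold is byte-for-byte the one that was validated.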
The Perl scripts and files for the production of TMA databases that meet the Specification are available with this publication. The example prostate cancer TMA database is available as a supplementary file with this article [see Additional file 2]. The actual tissue microarray slides are available after an application process. Although the CPCTR is a non-profit, government-sponsored resource, a surcharge is attached for glass slides to help defray a portion of the costs of TMA production. The application process and charges are described at the CPCTR web site. Questions regarding any aspect of the CPCTR can be directed to the CPCTR email query service [firstname.lastname@example.org].
All authors have reported no competing interests. See funding sources in Acknowledgements.
Jules Berman was an author of the original TMA Data Exchange Specification, developed the Perl tools for implementing the CPCTR TMA in conformance with the Specification, and wrote the first draft of the manuscript. Milton Datta helped in the design and supervised the laboratory that constructed the physical TMA, provided the protocols for TMA construction, and provided annotation data describing the core array grid. Andre Kajdacsy-Balla, Jonathan Melamed, Jan Orenstein, Ashok Patel, Rajiv Dhir and Michael J. Becich assisted in the design of the TMA and were responsible for the technical steps in TMA construction, including block selection, core focus selection and core quality assurance, and for the selection and extraction of all data annotations used in the TMA database. Kevin Dobbin was responsible for data review (for consistency and completeness) for the TMA database and was involved in the TMA design and case selection. All authors reviewed and commented on successive drafts of the manuscript and have provided the first author with approval of the final manuscript.
Excel to XML converting script. Opener7.pl is a Perl script that converts an Excel file to XML-tagged data elements that can be easily inserted into an XML file. This script only works under Windows and requires an installed version of Excel. It is distributed as an open access plain-text file.
CPCTR tissue microarray database file. Cpctrtma.xml is the XML representation of the CPCTR prostate cancer TMA. This file conforms to the TMA Data Exchange Specification and can be viewed as a formatted XML file on most web browsers. The file exceeds 700 KB in length.
TMA validating script. Validtma.pl is a validating Perl script distributed as a plain-text file. It parses through XML files and determines conformance to the TMA Data Exchange Specification. This is an open access Perl script, and is identical to the Perl script distributed with the Specification.
This work was supported by four grants from the National Cancer Institute for the support of the Cooperative Prostate Cancer Tissue Resource: U01 CA86772, U01 CA86743, U01 CA86735, and U01 CA86739. With the exceptions of Jules Berman and Kevin Dobbin, the authors are funding recipients of these grants. Jules Berman and Kevin Dobbin performed this work as part of their regular activities as U.S. government employees. Hang Liu, of the University of Wisconsin, is acknowledged for writing a Perl script that extracted the array locations for core samples.