Stem cell biology has experienced explosive growth over the past decade as researchers attempt to generate therapeutically relevant cell types in the laboratory. Recapitulation of endogenous developmental trajectories is a dominant paradigm in the design of directed differentiation protocols, and attempts to guide stem cell differentiation are often based explicitly on knowledge of in vivo development. Therefore, when designing protocols, stem cell biologists rely heavily upon information including (i) cell type-specific gene expression profiles, (ii) anatomical and developmental relationships between cells and tissues and (iii) signals important for progression from progenitors to target cell types. Here, we present the Stem Cell Lineage Database (SCLD) (http://scld.mcb.uconn.edu) that aims to unify this information into a single resource where users can easily store and access information about cell type gene expression, cell lineage maps and stem cell differentiation protocols for both human and mouse stem cells and endogenous developmental lineages. By establishing the SCLD, we provide scientists with a centralized location to organize, access and share data, dispute and resolve contentious relationships between cell types and within lineages, uncover discriminating cell type marker panels and design directed differentiation protocols.
The field of stem cell biology, and the data associated with it, has expanded vastly as more and more researchers attempt to generate therapeutically relevant cell types in the laboratory. Because most attempts to guide stem cell differentiation are based on in vivo development (1), stem cell biologists rely heavily upon developmental biology data such as cell type-specific gene expression profiles, anatomical and developmental relationships between cells and tissues and signals important for development and differentiation of stem cells to mature cell types. This data is currently accessible through a variety of resources (publications, databases, etc.) and formats. The SCLD aims to consolidate these three key types of information into a database providing a central resource to the stem cell community.
Years of experimental embryology have generated large amounts of tissue-specific and cell type-specific gene expression data (2) that researchers can use to study cell types outside of their normal anatomical context. Cell type-specific gene expression profiles allow researchers to establish the identity of cells, provide insight into the developmental pathways activated in differentiating cells and monitor the progression of directed differentiation protocols. A wide variety of databases store this data and make it available to the entire research community. None, however, can compile data across existing databases while also allowing users to manipulate and edit the information.
During development, embryonic and somatic stem cells progress through a series of transitional cell types ultimately resulting in terminally differentiated cells. This developmental path is often graphically represented as a cell lineage map. Stem cell biologists rely heavily on these maps as guides when designing directed differentiation protocols to mimic endogenous development in vitro. Several resources are dedicated to describing developmental relationships in a variety of model species to assist in this navigation, yet none provide a complete set of cell lineages in a format that can be readily explored and exploited by a stem cell scientist. Stem cell biologists are currently forced to amass cell lineage information piecemeal from a disparate set of sources ranging from textbooks to primary scientific literature.
Culture conditions and environmental cues used to direct progenitor cells toward desired lineages or terminally differentiated cell types are among the most crucial pieces of information for the successful differentiation of stem cells toward useful cell types. While an enormous amount of effort has been invested in identifying these cues in vivo and in vitro, to the best of our knowledge no existing database curates this information in a computationally accessible manner. With a majority of stem cell culture protocols being founded upon our knowledge of developmental biology, ready access to this information would be extremely valuable to the stem cell community.
A large number of databases are dedicated to supporting and disseminating gene expression data [for review, see de Boer et al. (2)]. Some of the expression databases useful to a stem cell biologist include GXD (3) from Mouse Genome Informatics (MGI) at Jackson labs (4), which annotates gene expression profiles across all of mouse development. Another example is the Novartis Gene Expression Atlas (5), which stores human tissue-specific gene expression profiles. As useful as these resources are, they are designed to benefit a wide cross-section of scientists and, for this reason, cannot optimally address the specific needs of a stem cell scientist.
One significant effort to consolidate stem cell data from across labs is the Canadian Stem Cell Network’s (6) StemBase (7), which curates, organizes and stores gene expression data from genome-wide transcript profiling experiments on mouse and human stem cells and their derivatives. While StemBase provides search and analysis tools, it does not provide the ability for users to edit the data or to curate cell types and lineages.
Other recent data sets and databases do address a subset of these needs within specific cell lineages. The most well developed examples can be found for the hematopoietic lineage and include the HaemAtlas (8); the Erythropoiesis database (EpoDB) (9); LymphTF-DB, a database of transcription factors involved in lymphocyte development (10); Hematopoietic Fingerprints (11) and BloodExpress (12), which unite information on the entire hematopoietic lineage. Many of these databases combine lineage information with expression data in ways that would be advantageous to stem cell biologists if extended to additional lineages including musculoskeletal, pancreatic, neuronal, endodermal and cardiac as well as many others.
Well-defined lineage maps are required to unite gene expression data with defined stages of differentiation. Several resources describing developmental and lineage relationships in the embryo are available including the Edinburgh Mouse Atlas Project (EMAP) (13), and the Expressed Sequence Annotation for Humans (eVOC) (14), among others. Typically, these projects rely upon standardized ontologies that allow for the precise description of cell types and structures in the developing embryo. One issue with the use of these ontologies for stem cell biology is an extensive reliance upon anatomical and stage-specific relationships, with a relatively minor representation of the lineage relationships most important for the design of directed differentiation protocols. Recognition of this fact led to the initiation of Cell Type ontology (CL) (15) by Bard and colleagues, which lays the groundwork for curating lineage relationships throughout the vertebrate embryo.
Thus, no existing database brings together the data most useful to stem cell biologists in an expandable, easily accessed and annotated, user-editable interface. To address this need, we present a new resource called the Stem Cell Lineage Database (SCLD). The SCLD provides a user interface for the curation and browsing of cell type gene expression profiles, cell lineage maps and notes on the manipulations and culture conditions necessary to transition between cell types within a lineage. Our database stores information for both human and mouse cell types and lineages, is publicly accessible, directly user editable, and hyperlinked to a series of source databases, thus benefiting from the vast amounts of information presently available and maintaining consistency with current scientific nomenclature. The SCLD provides graphical lineage maps, graphical ontology building, cell type expression data and a variety of other data essential to stem cell biologists. By providing a user-editable interface for the exchange of directed differentiation protocols and sharing of stem cell research advances, the SCLD will help address the need for swift communication of new advances in the rapidly progressing field of stem cell science.
The SCLD is a web application accessed through standard internet browsers (Firefox, Internet Explorer and Safari), which communicate with a server to send and retrieve data. The server hosts both the application server and the database server. The application server (Apache Tomcat) is the central hub that runs the SCLD: it serves web pages to users and controls the flow of data between the user and the database. The database (MySQL) is where the actual data for the SCLD is stored and maintained. Direct access to the database is not provided; instead it is accessed indirectly through the application server. Open source technologies have been used in the design, development and deployment of the SCLD. Additional information on the computational and database design can be found in the Supplementary Data.
Stem cell research is progressing rapidly and laboratories are often working on different cell lineages, making it unlikely that a centrally maintained database would be as useful or timely as an open-source counterpart. For this reason, we have designed the SCLD as a user-editable database. Users submit a request for a user account, which provides them with the means to add and modify data within the SCLD. Through the user accounts, users can enter and edit the cell types, markers and lineages most relevant to their own work. Researchers can also comment on cell types and lineage maps to facilitate discussion of contentious issues. This structure avoids the need to anticipate the specific needs of independent users; each can shape the SCLD to be most beneficial to their research goals. To maintain data integrity, any data entered manually is required to have supporting evidence from one or more published references retrieved automatically from PubMed. In addition, a history is maintained of all edits made to cell types and lineages. If necessary, previous edits can be restored for cell types and lineages.
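The two integrity rules described above (every manual entry needs a supporting reference, and every edit is kept so that earlier versions can be restored) can be illustrated with a minimal Python sketch. This is not the SCLD's implementation; the class name `VersionedRecord`, its methods, the record fields and the placeholder reference strings are all hypothetical, and treating a restore as a new history entry is an assumption made here for simplicity.

```python
import copy


class VersionedRecord:
    """Hypothetical record that requires references and keeps every version."""

    def __init__(self, data, references):
        self._history = []
        self.edit(data, references)

    def edit(self, data, references):
        # Assumed rule: an edit without a supporting reference is rejected.
        if not references:
            raise ValueError("at least one published reference is required")
        # Store a deep copy so later changes to `data` cannot alter history.
        self._history.append((copy.deepcopy(data), tuple(references)))

    @property
    def current(self):
        # Return a copy of the most recent version.
        return copy.deepcopy(self._history[-1][0])

    @property
    def version_count(self):
        return len(self._history)

    def restore(self, version):
        # Assumption: restoring appends the old version as a new entry,
        # so the intermediate edits remain in the history.
        data, references = self._history[version]
        self.edit(data, references)


# Placeholder names and references, for illustration only.
record = VersionedRecord(
    {"name": "cell type A", "markers": ["GeneX"]},
    references=["PMID:placeholder-1"],
)
record.edit(
    {"name": "cell type A", "markers": ["GeneX", "GeneY"]},
    references=["PMID:placeholder-2"],
)
record.restore(0)  # bring back the first version
```

Appending on restore, rather than truncating the history, means that a restore can itself be undone.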
Using existing biological resources, the SCLD database is populated with information such as gene names, cell type names and gene expression data. This significantly reduces the barrier to entry for the annotation of cell types and cell lineages, prevents errors associated with manual entry and provides immediate value-added information to new users. For example, gene names for both mouse and human are automatically downloaded from MGI and the Human Genome Organisation (HUGO) (16), and are periodically updated to reflect the latest changes, ensuring consistency among entries.
While there are many instances where the data on cell types and lineages for mouse is similar to humans, this is not always the case. Keeping the data separated between mouse and human ensures that an accurate picture is presented for cell types and lineages for each respective species.
Cell types are commonly identified by name and by the expression of a unique set of genetic markers (e.g. insulin is expressed solely in pancreatic β-cells). The SCLD compiles gene expression data for cell types and allows searching for cell types by name or by a combination of gene markers expressed or not expressed. This search function allows researchers to efficiently identify the expression profiles of known cell types, determine which genes may be used to identify a cell type, and identify candidate cell types matching known expression profiles. When searching by name, the cell type name is either typed in or selected from the list of stored cell types (Figure 1A). Alternatively, gene markers known to be expressed or not expressed can be entered and the SCLD will produce a list of candidate cell types (Figure 1B). From either of these two searches, additional information on the cell type can be viewed including: (i) species; (ii) name and description of the cell type; (iii) gene markers used to identify the cell type; and (iv) associated publications supporting the data. Each gene marker and its corresponding reference is directly linked to MGI, HGNC or PubMed for easy access to additional information.
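The marker-based search can be thought of as a filter over annotated expression profiles. The following Python sketch illustrates the idea under stated assumptions and is not the SCLD's query code: the function `find_cell_types`, the dictionary layout (gene symbol mapped to `True` for expressed or `False` for not expressed) and the small example profiles are hypothetical. A gene with no annotation is treated as unknown and matches neither condition.

```python
def find_cell_types(profiles, expressed=(), not_expressed=()):
    """Return the names of cell types annotated as expressing every gene in
    `expressed` and as not expressing every gene in `not_expressed`."""
    matches = []
    for name, markers in profiles.items():
        # markers.get(gene) is None when the gene is not annotated at all,
        # so unannotated genes never satisfy either condition.
        has_all = all(markers.get(gene) is True for gene in expressed)
        lacks_all = all(markers.get(gene) is False for gene in not_expressed)
        if has_all and lacks_all:
            matches.append(name)
    return sorted(matches)


# Illustrative profiles only; these entries are not taken from the SCLD.
example_profiles = {
    "pancreatic beta cell": {"Ins1": True, "Pdx1": True},
    "pancreatic progenitor": {"Pdx1": True, "Ins1": False},
    "cardiomyocyte": {"Tnnt2": True, "Ins1": False},
}

candidates = find_cell_types(
    example_profiles, expressed=["Pdx1"], not_expressed=["Ins1"]
)
print(candidates)
```

Distinguishing "annotated as not expressed" from "not annotated" keeps a missing annotation from being read as evidence of absence.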
Along with viewing previously defined cell types [such as those in EMAP (13)], the SCLD supports editing cell types and adding new, user-defined cell types. Editing a cell type allows descriptions to be updated or gene markers and references to be added or removed. The SCLD also facilitates discussion by providing space for users to comment on all data corresponding to a cell type. Using these tools, the definition of a cell type can be modified to reflect the most current available data.
In addition to leveraging existing databases and ontologies, we have extracted condensed cell type definitions from these ontologies that are more applicable to stem cell scientists. Because MGI is annotated as an anatomical lineage, cell types imported from Jackson labs are organized by developmental time point [Theiler stage (TS)]. This organization results in developmentally (lineally) equivalent cell types represented as multiple unique entries, one for each pertinent TS, confounding the study of a single cell type across developmental time. Using the SCLD, we combined these cell types across stages into merged cell groups in a map dubbed ‘Stage All’. This reduced the 16000 entries from MGI into fewer than 5000 Stage All cell types (Figures 2A and B). Users can search Stage All cell types, EMAP cell types or user-defined cell types as desired (Figure 2C). In order to preserve interoperability, mapping relationships between cell types in discrete ontologies are maintained throughout and stored in the SCLD.
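The Stage All merge amounts to grouping stage-specific entries while keeping a mapping back to the originals. The Python sketch below illustrates that idea. It assumes that entries sharing a name across Theiler stages are equivalent, which is a simplification made here and not a description of how the SCLD decides equivalence; the function `merge_across_stages` and the identifiers and names in the example are hypothetical.

```python
from collections import defaultdict


def merge_across_stages(staged_entries):
    """Group (entry_id, theiler_stage, name) tuples by name.

    Returns a dict mapping each merged cell type name to a sorted list of
    (theiler_stage, entry_id) pairs, so the link back to every original
    stage-specific entry is preserved.
    """
    merged = defaultdict(list)
    for entry_id, stage, name in staged_entries:
        merged[name].append((stage, entry_id))
    return {name: sorted(members) for name, members in merged.items()}


# Hypothetical stage-specific entries, for illustration only.
staged = [
    ("entry-1", 17, "heart"),
    ("entry-2", 18, "heart"),
    ("entry-3", 17, "somite"),
]

stage_all = merge_across_stages(staged)
print(len(staged), "entries merged into", len(stage_all), "cell types")
```

Keeping the original identifiers inside each merged group is what allows annotations to be traced between the merged map and the source ontology.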
EMAP also separates lineally equivalent cell types across serially homologous anatomical boundaries within the embryo (such as somites, branchial arches and fore and hind limbs). In order to group these cell types into merged cell types more representative of the lineal relationships of interest to a stem cell biologist, we developed a Python GUI called Cell Type Condenser (CTC). CTC makes use of the ‘part of’ relationships in EMAP and displays a cell type’s path from itself to the root, allowing users to easily identify developmentally equivalent cell types and condense them into a single new cell type within the database. This mapping from a single cell type to a group of cell types enables existing and new annotations to propagate from one ontology or lineage to the other. CTC is available at the SCLD website, and saved files containing mapping information can be uploaded to and processed by the SCLD.
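The two operations described for CTC, showing a cell type's path to the root and condensing equivalent cell types into one, can be sketched in a few lines of Python. This is an illustration and not the CTC source code: the functions `path_to_root` and `condense`, the term names, and the assumption that each term has a single 'part of' parent are all hypothetical simplifications.

```python
def path_to_root(cell_type, part_of):
    """Follow 'part of' links from `cell_type` up to the root term.

    `part_of` maps each term to its single parent (an assumption of this
    sketch); the root is the term that has no parent entry.
    """
    path = [cell_type]
    seen = {cell_type}
    while path[-1] in part_of:
        parent = part_of[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
    return path


def condense(members, new_name, mapping=None):
    """Record that each original term in `members` maps to `new_name`."""
    mapping = {} if mapping is None else mapping
    for member in members:
        mapping[member] = new_name
    return mapping


# Hypothetical 'part of' relationships, for illustration only.
part_of = {
    "forelimb bud mesenchyme": "forelimb bud",
    "forelimb bud": "limb",
    "hindlimb bud mesenchyme": "hindlimb bud",
    "hindlimb bud": "limb",
    "limb": "embryo",
}

fore_path = path_to_root("forelimb bud mesenchyme", part_of)
hind_path = path_to_root("hindlimb bud mesenchyme", part_of)

# Having compared the two paths, a user might decide that the terms are
# developmentally equivalent and condense them into one new cell type.
mapping = condense(
    ["forelimb bud mesenchyme", "hindlimb bud mesenchyme"],
    "limb bud mesenchyme",
)
```

The resulting mapping plays the role of the saved file mentioned above: it records which original terms belong to each condensed cell type.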
Cell lineage maps represent the other major component of the SCLD. Two types of lineage maps are supported: in vivo and in vitro (see Figures 3B and 4A for examples). These maps represent lineal relationships between individual cell types that allow the user to reorder cell types from combined anatomical/developmental ontologies like EMAP into exclusively lineage-based relationships. Stored lineages are defined by the following characteristics: (i) map type (in vivo or in vitro); (ii) description, including starting and terminal cell types; and (iii) published references supporting the data (Figure 3A).
To find, view or edit a lineage map, the SCLD supports searching based on the type of map (in vivo or in vitro) and/or any cell types contained in a map. For in vivo lineages, the SCLD can display the lineage map in a vertical (Figure 3B) or horizontal (Figure 4A) graphical format showing the cell types and their developmental trajectory from progenitor (top or left) to terminal (bottom or right). Highlighting a cell type on a lineage displays information on the gene markers and associated references for that cell type stored in the database. Gene names and references are hyperlinked to MGI, HGNC and PubMed (Figure 3C).
As such, the SCLD lineage interface serves as a powerful and intuitive graphical ontology editor where users can access cell types mined from other databases and align them into a developmental lineage complete with gene expression profiles, transition properties, experimental manipulations required for directed differentiation and user notes.
Specific environmental cues can cause cells to divide or differentiate, or to adopt one fate over another, and this knowledge is essential for effectively guiding stem cells toward therapeutically relevant cell types for use in regenerative medicine. The relationship between cell types in a lineage includes the established ‘derives from’ or ‘develops from’ relationships, but the SCLD has the added value of including information on the amount of time, reagents, physical manipulations and other experimental criteria required for transition between cell types. As shown in the in vitro lineage map example (Figure 4), selecting a transition symbol (<>), will display the conditions used to generate the next cell type. In order to facilitate debate and discussion, any of the transition properties between cell types can be modified, added or commented on by additional users. To the best of our knowledge, in vitro lineage maps including transition properties are not curated in any other database. Ready access to this data will facilitate the comparison and optimization of difficult steps in directed differentiation protocols.
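A lineage map with transition properties can be pictured as a directed graph whose edges carry the experimental conditions. The Python sketch below shows one possible representation and does not reflect the SCLD's actual schema: the classes `Transition` and `LineageMap`, their field names and the placeholder cell types and reagents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transition:
    """Conditions for moving from one cell type to the next (hypothetical fields)."""
    source: str
    target: str
    duration_days: Optional[float] = None
    reagents: List[str] = field(default_factory=list)
    manipulations: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


@dataclass
class LineageMap:
    """A lineage map as a list of annotated transitions between cell types."""
    map_type: str  # "in vivo" or "in vitro"
    description: str
    transitions: List[Transition] = field(default_factory=list)

    def next_steps(self, cell_type):
        # All transitions that start from the given cell type.
        return [t for t in self.transitions if t.source == cell_type]


# Placeholder cell types, reagents and references, for illustration only.
lineage = LineageMap(
    map_type="in vitro",
    description="cell type A to cell type C",
    transitions=[
        Transition("cell type A", "cell type B", duration_days=3,
                   reagents=["factor X"], references=["PMID:placeholder"]),
        Transition("cell type B", "cell type C", duration_days=5,
                   reagents=["factor Y"], manipulations=["replating"]),
    ],
)

# Selecting a transition in the interface corresponds to a lookup like this.
for step in lineage.next_steps("cell type A"):
    print(step.target, step.duration_days, step.reagents)
```

Attaching comments and references to each transition, not only to the map as a whole, mirrors the description above of individual transition properties being open to modification and discussion.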
To date, the SCLD comprises more than 50000 gene markers and 5000 cell types supported by nearly 4400 published references (Table 1). This database was rapidly populated by utilizing gene expression profiles catalogued within existing data resources, which are extensive for mouse but less comprehensive for human cell types.
To test the functionality of the SCLD, our laboratory curated in vivo lineage maps for the primary mesoderm-derived cell lineages outlined in Figure 5. Each of the seven mesodermal sub-lineages (bone, cartilage, cardiac, vascular, muscle, adipose and hematopoietic) was assigned to one lab member for curation. Referencing the primary literature, the curator established the critical cell types for their sub-lineage and updated existing relevant cell types or created new cell types with names, gene markers and references in the SCLD. Curators used Stage All cell types and the CTC to link individual EMAP cell types to specific cell types in each lineage. This allowed the SCLD to populate each cell type with known gene expression data from GXD. When available, additional markers were added from relevant publications. Then, using the lineage map editing function in the SCLD, the curator arranged the relevant cell types into a lineage map representing the major developmental transitions and decisions that occur from the mesodermal progenitor to the terminally differentiated cells in each sub-lineage.
In total, this mapping of mesoderm development covered seven sub-lineages, 112 cell types and 242 gene expression markers, thus summarizing a significant percentage of known mesoderm cell types and developmental lineages. By establishing maps of these mesodermal sub-lineages, SCLD users can now visually explore the relationships between the major mesodermal lineages and cell types, readily access and download gene expression profiles for mesodermal progenitors and differentiated cells, easily map the mesodermal sub-lineage to major developmental ontologies, design directed differentiation protocols based upon these in vivo lineage maps, and debate and discuss the finer points of the annotated developmental relationships with other SCLD users around the world.
This case study took approximately 5 h of data assembly, cell type curation, ontology mapping and lineage building by each of seven individuals. It should be noted that none of these individuals was an expert on the chosen lineages, and that this time investment is likely to be even smaller for experts, especially when annotating the endogenous developmental cues that comprise the in vivo transition properties (not included in the described case study).
The very rapid manner in which data can be entered and stored for use is an indication of the low barrier to entry and the potential for true community shaping of this stem cell resource. With the support of the general stem cell community, primary literature will be quickly catalogued in the SCLD to produce a unifying reference source for all stem cell biologists.
Over the past decade, stem cell biology has become one of the most rapidly advancing fields in biology generating massive amounts of experimental data. A variety of databases have been established to organize, store and distribute this data, yet no resource has been developed that brings together the key components of developmental biology most helpful to a stem cell biologist. The user-friendly SCLD provides the stem cell field with a customizable framework to easily edit and navigate gene expression data, lineal relationships and transition properties between cell types. Its power lies in the ability to rapidly and effectively construct developmental lineages relevant to the user and serve as the basis for hypothesis building and experimental design as demonstrated by our case study.
We will continue to expand the SCLD’s compatibility so that we can utilize future cell type ontologies, genome-wide expression databases and literature sources and maintain the most current data in the database. Another important goal will be programming the automatic extraction of unique marker profiles for specific cell types with estimates of discriminative power to aid in in vitro cell type identification. Along these lines, we plan to provide an API providing direct read-only access to the database. This will provide better integration and interoperability among current biological databases as well as providing an interface allowing users to write bioinformatic programs to analyze the data according to their own needs. We also intend to implement private lab-specific domains that will allow users to manipulate their data in the SCLD and store proprietary data until publication. Finally, we expect to expand the user communication and annotation functions of the SCLD in order to facilitate discussion between scientists researching complementary problems. This will include automatic flagging of related lineages and apparent conflicts between related cell types and lineages to exploit the power of social networking between investigators for resolving persistent areas of uncertainty.
Supplementary Data are available at NAR Online.
Funding for open access charge: University of Connecticut Regional Campus Incentive Grant.
Conflict of interest statement. None declared.
We thank Dr Jin Jun for his advice on technical matters; Dr Mark Carter for annotating lineages and useful feedback; Kyung-Min Chung, Mark Marchitto, Ariel Gonzales and Erika Eitland for annotating the case study lineages; and Alexia Lalande for help with the figures.