The international Functional Annotation Of the Mammalian Genomes 4 (FANTOM4) research collaboration set out to better understand the transcriptional network that regulates macrophage differentiation and to uncover novel components of the transcriptome by employing a series of high-throughput experiments. The primary and unique technique is cap analysis of gene expression (CAGE), which sequences mRNA 5′-ends with a second-generation sequencer to quantify promoter activities even in the absence of gene annotation. Additional genome-wide experiments complement the setup, including short RNA sequencing, microarray gene expression profiling on large-scale perturbation experiments and ChIP–chip for epigenetic marks and transcription factors. All the experiments are performed in a differentiation time course of the THP-1 human leukemic cell line. Furthermore, we performed a large-scale mammalian two-hybrid (M2H) assay between transcription factors and monitored their expression profile across human and mouse tissues with qRT-PCR to address combinatorial effects of regulation by transcription factors. These interdependent data have been analyzed individually and in combination with each other and are published in related but distinct papers. We provide all data together with systematic annotation in an integrated view as a resource for the scientific community (http://fantom.gsc.riken.jp/4/). Additionally, we assembled a rich set of derived analysis results including published predicted and validated regulatory interactions. Here we introduce the resource and its update after the initial release.
A wide range of the molecular information encoded in genomes has been addressed thanks to technological progress in molecular biology. We have focused on the landscape of the mammalian transcriptome and revealed its striking complexity, including a substantial population of non-coding RNA and frequently occurring sense/antisense transcription (1–4). Despite these efforts, it remains a major challenge to fully understand the processes responsible for determining the shape of the transcriptome.
In Functional Annotation of the Mammalian Genomes 4 (FANTOM4), an international collaborative research project, we focused on the differentiation process of a human myeloid leukemia cell line to deepen the understanding of the complex layers of the transcriptome and to reverse-engineer the transcriptional regulatory network in a data-driven manner (5–8). We performed a series of high-throughput experiments using second-generation sequencing together with microarrays to follow the time course of the differentiation process, as well as systematic perturbations on a large scale to characterize the transcriptional regulatory network (7). Furthermore, we addressed the combinatorial roles of transcription factors in human and mouse based on a large-scale screening of physical interactions among transcription factors (6). The data obtained from this wide range of experiments were analyzed individually as well as in combination and published in related but distinct papers, although all results are closely connected. We provide all the data we produced together with detailed annotation to facilitate easy visual inspection and to allow users to obtain parts of, or the whole, data set for global analysis (9). In this way, we provide a basis for further experiments and facilitate additional analyses in cellular differentiation. Here we introduce the resource and its update after the initial release.
We first give an overview of the whole set of experiments and their analysis and then describe the various ways to access the data. After our initial release of the FANTOM web resource (9) we added several data sets and made them visible to external systems (see below). All the data content together with the available interfaces are summarized in Table 1, and updates since our initial release are indicated.
We selected a human myeloid leukemia cell line, THP-1 (10), as a model of macrophage differentiation. Upon stimulation with phorbol myristate acetate (PMA), the THP-1 cells cease proliferation, become adherent and differentiate into a mature monocyte- and macrophage-like phenotype. We conducted a set of high-throughput experiments on this model system.
The primary technology employed is the cap analysis of gene expression (CAGE) with a next-generation sequencer (termed deepCAGE), which identifies active transcription start sites (TSSs) and quantifies their activities even in the absence of gene annotation by sequencing mRNA 5′-ends in a high-throughput way (11–13). We sequenced 24 million mRNA 5′-ends (CAGE tags) over the differentiation time course, consisting of six time points in biological triplicates. Promoter activities are quantified by counting the CAGE tags aligned to the reference genome and normalizing the counts to fit a power-law distribution (14). We developed motif activity response analysis (MARA) on the promoter activities profiled by CAGE over the differentiation time course, which leads to the prediction of transcriptional regulatory interactions as well as the identification of key transcription factors (7). Predicted regulatory interactions for the THP-1 time course profiles are available from the FANTOM web resource, while MARA analysis based on microarray gene expression data is available at SwissRegulon (15).
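The counting step described above can be illustrated with a minimal sketch. It assumes each aligned CAGE tag has already been assigned to a promoter identifier, and it scales counts to tags per million (TPM), which is a plain library-size scaling and not the power-law normalization of reference (14). The function name and the toy identifiers are hypothetical.

```python
from collections import Counter

def tags_per_million(tag_promoters):
    """Count tags per promoter and scale the counts to tags per million.

    tag_promoters: iterable of promoter identifiers, one per aligned CAGE tag.
    Note: simple library-size scaling, NOT the power-law normalization
    used in the original analysis.
    """
    counts = Counter(tag_promoters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {promoter: count * 1_000_000 / total
            for promoter, count in counts.items()}

# Hypothetical toy library: four tags assigned to two made-up promoters.
toy_tags = ["p1", "p1", "p1", "p2"]
tpm = tags_per_million(toy_tags)
```

Scaling by library size makes counts comparable across time points sequenced to different depths, which is the minimum needed before expression profiles can be compared.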
We complemented the transcriptome characterization with qRT–PCR profiling for around 2000 transcription factors (16), gene expression microarrays and small RNA sequencing. In particular, the small RNA sequencing led to the discovery of tinyRNAs (8) and to the accurate identification of RNA editing (17). The profiling of RNA polymerase II binding and histone acetylation (H3K9) with ChIP–chip on genome tiling arrays revealed unique epigenetic patterns surrounding core promoters (18). ChIP–chip experiments on promoter tiling arrays were also performed to investigate genomic binding sites of PU.1, SP1, EGR-1 and IRF8 (7,19,20). Furthermore, we profiled copy number variation of the THP-1 cells to assess how their genome differs from the reference genome (21).
The transcriptional changes observed during the differentiation time course reflect the underlying transcriptional regulatory network that maintains the stable state of the cells before and after differentiation and defines the transition between these stable states. We further performed perturbation experiments of known and likely key regulators to elucidate the network architecture beyond the level that can be obtained from the differentiation alone.
First, we individually perturbed 52 transcription factors by small interfering RNA (siRNA) knockdown. Since around half of the transcription factors were chosen based on the results of the deepCAGE MARA analysis, we used these knockdowns as validation experiments. We additionally over-expressed microRNAs (miRNAs), regulatory small RNAs that reduce expression of targeted genes in a wide range of biological processes, by introducing over-expression vectors (22). The series of perturbation experiments was performed in biological triplicates and followed by gene expression profiling with microarrays.
Transcription factors typically form complexes with the same or other transcription factors, with histone modifiers, cofactors and with regulatory DNA regions to directly or indirectly control expression of targeted genes. To investigate combinatorial effects of transcription factor complexes, we screened for physical interactions (protein–protein interactions) among human transcription factors and among mouse transcription factors in a large-scale mammalian two-hybrid (M2H) assay. We additionally profiled transcription factor expression across 34 human tissues and 20 mouse tissues with qRT–PCR. Analysis of these data demonstrated the conservation of physical interactions between the two species and highlighted the importance of transcription factor complexes for determining cell fate (6).
We assembled a wide range of regulatory interactions in the course of the analysis of the data sets above. As validation for the MARA-predicted transcriptional regulatory interactions, we compiled a list of genes responding to siRNA perturbation experiments and a list of genes bound by transcription factors based on our experiments above as well as other publicly available large-scale experiments. Furthermore, we screened over 1000 publications to extract 440 manually curated regulatory interactions between regulators, requiring that the interactions were validated with EMSA or ChIP in human cells. We provide the corresponding PubMed IDs as evidence for the interactions. We obtained ElMMo (23) miRNA target predictions from the MirZ web server (24) and complemented them with genes responding to our miRNA over-expression experiments described above.
We produced several million CAGE tags from a wide range of tissues and conditions of human and mouse in the transition between the FANTOM3 and FANTOM4 projects (13,25). Re-mapping all of these data to the same genome assemblies used in FANTOM4 (hg18 for human and mm9 for mouse) led us to the finding that retrotransposon transcription substantially regulates the transcriptional output of the mammalian genome (5). We consistently aggregated all data into CAGE tag clusters, units of CAGE tags overlapping on the genome (25), and thereby facilitate access to one of the largest resources of TSSs. We additionally provide the converted coordinates of FANTOM3 tag cluster data (26) to enable comparison to our earlier results.
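The notion of a tag cluster as a unit of overlapping CAGE tags can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the project: tags are given as half-open intervals, only tags on the same chromosome and strand are merged, and the function name and coordinates are hypothetical.

```python
def cluster_tags(tags):
    """Merge overlapping tags on the same chromosome and strand.

    tags: iterable of (chrom, strand, start, end) with half-open [start, end).
    Returns a list of (chrom, strand, start, end, tag_count) clusters.
    """
    clusters = []
    # Sorting groups tags by chromosome and strand, then orders them by start.
    for chrom, strand, start, end in sorted(tags):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and last[1] == strand and start < last[3]:
            # The tag overlaps the current cluster: extend it and count the tag.
            last[3] = max(last[3], end)
            last[4] += 1
        else:
            # No overlap: the tag opens a new cluster.
            clusters.append([chrom, strand, start, end, 1])
    return [tuple(c) for c in clusters]

# Hypothetical toy tags; coordinates are made up for illustration.
toy_tags = [
    ("chr1", "+", 100, 120),
    ("chr1", "+", 110, 130),
    ("chr1", "+", 200, 220),
    ("chr1", "-", 105, 125),
]
toy_clusters = cluster_tags(toy_tags)
```

Here the first two plus-strand tags merge into one cluster with a tag count of two, while the minus-strand tag stays separate even though its coordinates overlap, because TSS activity is strand-specific.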
We prepared multiple ways to access the different data types for visual inspection and for analysis. Graphical user interfaces facilitate immediate visualization of data and analysis results (Figure 1). The Generic Genome Browser (GBrowse) (27) provides a genome-based view of our data (9). To further facilitate interpretation of the data, we prepared an instance of the EdgeExpressDB (28) to view regulatory interactions combined with expression profiles in an integrated way.
For bioinformatics analysis beyond manual inspection, we prepared an archive of data files including a standardized description of meta-data (such as experimental protocols and parameters, conditions, and relationships between samples) as well as the processed data describing the transcriptional input, output and regulatory interactions. For all experiments we adopted the sample and data relationship format (SDRF), a standardized way to describe details of analysis in a tab-delimited file. SDRF was proposed as a part of MAGE-tab (29), which is employed by ArrayExpress (30), and is now also employed by ISA-tab (31), which covers a wider range of omics data. A graphical representation of the meta-data is available via SDRF2GRAPH (32) to facilitate the understanding of the complex details, in particular the relationships between samples and data sets.
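How a tab-delimited SDRF file encodes relationships between samples and data can be illustrated with a short sketch that reads rows and derives graph edges. It is not the SDRF2GRAPH implementation. As a simplifying assumption, every column whose header ends in "Name" or "File" is treated as a graph node and all other columns as annotation; the function name and the toy rows are hypothetical.

```python
import csv
import io

def sdrf_edges(sdrf_text):
    """Return the set of (from_node, to_node) edges implied by SDRF rows."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    # Simplifying assumption: node columns end in "Name" or "File";
    # everything else (e.g. Characteristics[...]) is treated as annotation.
    node_cols = [i for i, h in enumerate(header)
                 if h.strip().endswith(("Name", "File"))]
    edges = set()
    for row in reader:
        nodes = [row[i] for i in node_cols if i < len(row) and row[i]]
        # Each row is a path through the graph; link consecutive nodes.
        edges.update(zip(nodes, nodes[1:]))
    return edges

# Hypothetical toy SDRF: one source material yields two samples and two files.
toy_sdrf = (
    "Source Name\tSample Name\tCharacteristics[time]\tArray Data File\n"
    "THP-1 culture\tsample_0h\t0h\tdata_0h.txt\n"
    "THP-1 culture\tsample_96h\t96h\tdata_96h.txt\n"
)
edges = sdrf_edges(toy_sdrf)
```

Because both rows share the same source name, the resulting graph branches at that node, which is the kind of sample-to-data relationship that is hard to see in the flat table.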
The entire set of meta-data is useful for understanding the experiments as a whole, but working through it requires considerable effort. The essential part of the meta-data, coupled with the data file itself, would help a wide range of specific analyses. From this perspective we adopted a simple tab-delimited format in which the meaning of the columns is described in the file header following a minimal set of rules. We termed this data description scheme the order switchable column table (OSC table) format; its specification is available from the web resource, while each file is at the same time self-explanatory.
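A reader for such a self-describing table might look like the sketch below. The header layout used here (lines starting with "#" of the form `##ColumnVariables[name] = description`, followed by a column-name line) is an assumption made for illustration; the authoritative rules are in the specification on the web resource. The function name and the toy content are hypothetical.

```python
import re

def read_described_table(text):
    """Parse a tab-delimited table whose '#' header lines describe its columns.

    Returns (descriptions, rows): descriptions maps column name -> description,
    and rows is a list of dicts keyed by column name, so that the order of
    the columns in the file does not matter to downstream code.
    """
    descriptions = {}
    columns = None
    rows = []
    # Assumed header form: ##ColumnVariables[name] = description
    pattern = re.compile(r"^#+\s*ColumnVariables\[(.+?)\]\s*=\s*(.*)$")
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            match = pattern.match(line)
            if match:
                descriptions[match.group(1)] = match.group(2).strip()
        elif columns is None:
            # The first non-comment line holds the column names.
            columns = line.split("\t")
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return descriptions, rows

# Hypothetical toy file content.
toy_table = (
    "##ColumnVariables[id] = hypothetical promoter identifier\n"
    "##ColumnVariables[tpm_0h] = hypothetical normalized expression at 0 h\n"
    "id\ttpm_0h\n"
    "p1\t12.5\n"
)
descriptions, rows = read_described_table(toy_table)
```

Addressing columns by name rather than by position means that a file with reordered or additional columns can be read by the same code.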
On top of the tightly connected interfaces and the primary data archives within our system, integration with other relevant resources outside of our system enables researchers to view our data in a different and even wider context. Our data are visible through the RIKEN integrated database of mammals on the SciNetS (Scientists’ Networking System) (33), which indexes a wide range of data resources and connects them based on the semantic web framework (34) using structured ontologies. We also provide our data in the UCSC Genome Browser (35) bigWig and bigBed file formats. In this way, other genome browsers, applications or command line tools can point to our large indexed binary data sets and import details from specific genomic regions, avoiding the need to transfer all data from a track. Using the UCSC Genome Browser to overlay data produced by the ENCODE project (http://www.genome.gov/10005107) with FANTOM data is one example of jointly inspecting both data sets in an interface many researchers are familiar with. Conversely, the GBrowse instance running in the FANTOM web resource can be pointed to the UCSC data files or other data sources to facilitate an integrated view.
We successfully assembled and updated a set of genome-wide experiments performed and published by the FANTOM consortium into a single web resource. This provides an integrated view and resource of all FANTOM data covering a wide range of aspects of transcriptome complexity. With the recognition that the transcriptome is more complex than previously assumed, an accurate understanding of transcriptional regulation has become more important. Cell reprogramming reports, in particular, emphasize this need, with the goal of manipulating the transcriptional state of cells to drive the transition between cell types at will. We keep developing new technologies such as nanoCAGE, which facilitates the identification of promoters from very small sample sizes (~10 ng of total RNA), and CAGEscan, which links promoters and internal exons by adopting mate-pair sequencing (36). Additionally, we will keep updating our FANTOM web resource with related data to improve our efforts to provide a baseline for currently available data.
Ministry of Education, Culture, Sports, Science and Technology, Japan, Genome Network Project (to Y.H.); Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan (to Y.H.); RIKEN Omics Science Center from MEXT, Research Grant (to Y.H.). Funding for open access charge: RIKEN Omics Science Center from MEXT, Research Grant (to Y.H.).
Conflict of interest statement. None declared.
We would like to thank all of the members in the FANTOM consortium for fruitful collaboration.