Next-generation sequencers are revolutionizing our ability to sequence the genomes of new algae efficiently and cost-effectively. Several assembly tools have been developed that take short-read data and assemble them into large continuous fragments of DNA. Gene prediction tools are also available that identify coding structures within these fragments. The resulting transcripts can then be analyzed to generate predicted protein sequences. The functions of these proteins are subsequently inferred by searching for close homologs in protein databases and transferring the annotation from the homolog to the predicted protein. While versions of this data processing pipeline have become commonplace in genome projects, the resulting functional annotation is typically minimal and includes only limited biological pathway information and protein structure annotation. In contrast, integrating a variety of pathway, function and protein databases allows for the generation of much richer and more valuable annotations for each protein.
A second challenge is the use of these protein-level annotations to interpret the output of genome-scale profiling experiments. High-throughput genomic techniques, such as RNA-seq experiments, produce measurements of large numbers of genes relevant to the biological processes being studied. In order to interpret the biological relevance of these gene lists, which commonly range in size from hundreds to thousands of genes, the members must be functionally classified into biological pathways and cellular mechanisms. Traditionally, the genes within these lists are examined using independent annotation databases to assign functions and pathways. Several of these annotation databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1], MetaCyc [2], and Pfam [3], include a rich set of functional data useful for these purposes.
However, presently researchers must explore these different knowledge bases separately, which requires a substantial amount of time and effort. Furthermore, without systematic integration of annotation data, it may be difficult to arrive at a cohesive biological picture. In addition, many of these annotation databases were designed to accommodate a single gene search, a methodology not optimal for functionally interpreting the large lists of genes derived from high-throughput genomic techniques. Thus, while modern genomic experiments generate data for many genes in parallel, their output must often still be analyzed on a gene-by-gene basis across different databases. This fragmented analysis approach presents a significant bottleneck in the pipeline of biological discovery.
One approach to solving this problem is to integrate information from multiple annotation databases and provide access to the combined biological data through a single comprehensive portal that is equipped with the proper statistical foundations to analyze large gene lists effectively. For example, the DAVID database integrates information from several pathway, ontology, and protein family databases [4]. Similarly, Ingenuity Pathway Analysis (IPA) provides an integrated knowledge base derived from published literature for the human genome [5]. The integrated functional information and annotation terms are then assigned to lists of genes and, for some analyses, enrichment tests are performed to determine which biological terms are overrepresented within the group of genes. By combining the information found in a number of knowledge bases and analyzing lists of genes as a whole, these tools permit the efficient processing of high-throughput genomic experiments and thus expedite the process of biological discovery. However, most of these integrated databases have been developed for well-annotated and thoroughly studied organisms, and comparable resources are lacking for many newly genome-enabled organisms.
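As an illustration of the enrichment tests mentioned above, the following is a minimal sketch of one common formulation, a one-sided hypergeometric test. The text does not state which statistic DAVID, IPA, or any other tool applies, so the choice of test, the function name, and the numbers in the usage example are all assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(population_size, annotated_in_population,
                                list_size, annotated_in_list):
    """Return P(X >= annotated_in_list) for a hypergeometric variable X.

    Assumed model (not taken from the source text): a gene list of
    `list_size` genes is drawn without replacement from a genome of
    `population_size` genes, of which `annotated_in_population` carry
    the annotation term being tested.
    """
    # The overlap cannot exceed either the list size or the term size.
    upper = min(annotated_in_population, list_size)
    total = comb(population_size, list_size)
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, list_size - k)
        for k in range(annotated_in_list, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 15,000 genes in the genome, 120 annotated with a
# pathway term, and a 300-gene list that contains 12 of them.
p_value = hypergeom_enrichment_pvalue(15000, 120, 300, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value suggests that the term occurs in the list more often than expected by chance. In practice such a test would be repeated for every annotation term, so a multiple-testing correction would normally follow.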
One large group of organisms for which integrated functional databases are lacking is the algae. The algae are traditionally placed as a branch in the plant kingdom, although they form a polyphyletic group, as their members do not descend from a single common ancestor exclusive to the group. As many as 10 algal genomes have been sequenced, including those of a red alga and several chlorophyte algae, with several more in the pipeline [6]. Algal genomic studies have provided insights into photosymbiosis, the evolutionary relationships between the different species of algae, and their unique properties and adaptations. Recently, there has been renewed interest in the study of algal biochemistry and biology because of their potential use in the development of renewable biofuels (reviewed in [12]). This has promoted the study of varied biochemical processes in diverse algae, such as hydrogen metabolism, fermentation, lipid biosynthesis, photosynthesis and nutrient assimilation [13]. One of the most studied algae is Chlamydomonas reinhardtii. It has a sequenced genome that has been assembled into large scaffolds that are placed onto chromosomes [6]. For many years, Chlamydomonas has served as a reference organism for the study of photosynthesis, photoreceptors, chloroplast biology and diseases involving flagellar dysfunction [21]. Its transcriptome has recently been profiled by RNA-seq experiments under various conditions of nutrient deprivation [[26], unpublished data (Castruita M., et al.)].
Although Chlamydomonas has been extensively characterized experimentally, annotation of its genome is still approximate. While KEGG categorizes some C. reinhardtii gene models into biological pathways, other databases, such as Reactome [28], do not directly provide information for proteins of this green alga. Complicating the analysis of Chlamydomonas genes is the fact that two assemblies of the genome are in use (version 3 and version 4) and that multiple sets of gene models have been developed and catalogued under diverse identifiers: Joint Genome Institute (JGI) FM3.1 protein IDs for the version 3 assembly, and JGI FM4 protein IDs and Augustus version 5 IDs for the version 4 assembly [11]. The differences between these assemblies are significant; for example, the version 3 assembly contains 1,557 continuous segments of sequence, while the version 4 assembly contains 88. Although the version 3 assembly has been superseded by version 4, users still access version 3 because of its richer user-based functional annotations. In addition, other sets of gene predictions have been generated using a variety of additional data, including ESTs and RNA-seq data, to delineate start and stop positions more accurately and to improve upon existing gene models. One such gene prediction set is Augustus u10.2. As a result, researchers simultaneously use a variety of gene models from different assemblies, which complicates genomics studies.
To facilitate the analysis of Chlamydomonas genome-scale data, we developed the Algal Functional Annotation Tool, which provides a comprehensive analysis suite for functionally interpreting C. reinhardtii genes across all available protein identifiers. This web-based tool provides an integrative data-mining environment that assigns pathway, ontology, and protein family terms to proteins of C. reinhardtii and enables term enrichment analysis for lists of genes. Expression data for several experimental conditions are also integrated into the tool, allowing users to determine the conditions under which the genes in a list are overrepresented among differentially expressed genes. Additionally, a gene similarity search tool allows genes with similar expression patterns to be identified based on their expression levels across these conditions.
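To make the idea of an expression-based gene similarity search concrete, the sketch below ranks genes by the Pearson correlation of their expression profiles across conditions. This section does not specify the similarity measure the tool uses, so Pearson correlation is an assumption, and the gene identifiers, expression values, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat profile has no defined correlation; treat it as unrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def most_similar_genes(query_id, expression, top_n=5):
    """Rank all other genes by correlation with the query gene's profile."""
    query = expression[query_id]
    scored = [
        (gene_id, pearson(query, profile))
        for gene_id, profile in expression.items()
        if gene_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical expression levels across four experimental conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 1.0, 1.0, 1.0],
}
for gene_id, score in most_similar_genes("geneA", expression, top_n=2):
    print(gene_id, round(score, 3))
```

Correlation compares the shape of a profile rather than its magnitude, so a gene expressed at twice the level of the query but with the same pattern still ranks first. Other measures, such as Euclidean distance on normalized values, would be equally plausible choices.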