The assembly of each TIGR Gene Index builds upon previous releases, incorporating new ESTs and annotated gene sequences deposited in dbEST and GenBank, respectively. The first step in the process is the construction of a database of annotated gene sequences. For each species-specific Gene Index, all sequences from GenBank are downloaded, and the CDS and CDS Join features for full-length genes and mRNA sequences are parsed from the records. For redundant entries, one representative is chosen, although links to alternative GenBank accession numbers are maintained. The annotation of these expressed transcript sequences (ET; alternatively, Human Transcript, HT, for human records) is checked for consistency and the records are loaded into the TIGR Expressed Gene Anatomy Database (EGAD; http://www.tigr.org/tdb/egad/egad.html). ESTs are downloaded daily from dbEST and cleaned to remove untrimmed vector, linker, ribosomal, mitochondrial, low-quality and poly(A)/poly(T) sequences.
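As an illustration of one of the cleaning steps, the following Python sketch trims a trailing poly(A) run and a leading poly(T) run from an EST. It is not the TIGR cleaning code: the function name `trim_poly_tails` and the minimum run length of 10 bases are assumptions made for the example, and vector, linker, ribosomal, mitochondrial and low-quality filtering are not covered.

```python
import re


def trim_poly_tails(seq: str, min_run: int = 10) -> str:
    """Remove a trailing poly(A) run and a leading poly(T) run.

    min_run is a hypothetical threshold chosen for this sketch;
    the source text does not state the value used in practice.
    """
    seq = seq.upper()
    # Trailing poly(A): a run of at least min_run A's anchored at the 3' end.
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    # Leading poly(T): a run of at least min_run T's anchored at the 5' end.
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


if __name__ == "__main__":
    print(trim_poly_tails("ACGTACGT" + "A" * 15))
    print(trim_poly_tails("T" * 12 + "GCGC"))
```

Runs shorter than the threshold are left in place, so short A or T stretches that belong to the transcript itself are not removed.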
Cleaned ESTs, ET sequences from EGAD, TC sequences from the previous build and previously unclustered sequences (singletons) are compared pairwise to identify overlaps. Sequences sharing a minimum of 95% identity over a region of 40 nt or longer, with fewer than 20 bases of mismatched sequence at either end, are grouped into a cluster. Each cluster is then assembled separately. For TCs appearing in a cluster, component EST and ET sequences are downloaded and added to any new EST or ET sequences. Clustered sequences are then assembled using CAP3 (13), a sequence assembly program developed by Xiaoqiu Huang of Michigan Technological University. Assembly produces one or more consensus sequences for each cluster and rejects any chimeric, low-quality and non-overlapping sequences. Each cluster is assembled in the same fashion until the entire set has been exhausted. A second round of clustering and assembly, using only the newly constructed TC sequences as input, allows the identification and elimination of most redundancy introduced during the process. The resulting set of TCs is loaded into the appropriate species-specific Gene Index database for annotation.
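The clustering criteria described above can be sketched in Python as follows. This is a minimal illustration, not the TIGR pipeline: the `Overlap` record and its field names are hypothetical, the pairwise comparison is assumed to have been run already, and grouping by transitive closure (single linkage, implemented here with a union-find structure) is an assumption about how qualifying pairs are merged into clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    """One pairwise comparison result (hypothetical record layout)."""
    a: str                # identifier of the first sequence
    b: str                # identifier of the second sequence
    identity: float       # fraction of identical bases in the overlap
    length: int           # overlap length in nucleotides
    left_overhang: int    # mismatched bases at one end
    right_overhang: int   # mismatched bases at the other end


def passes(o: Overlap) -> bool:
    """Apply the thresholds stated in the text."""
    return (
        o.identity >= 0.95
        and o.length >= 40
        and o.left_overhang < 20
        and o.right_overhang < 20
    )


def cluster(ids, overlaps):
    """Group sequences linked by qualifying overlaps (assumed single linkage)."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for o in overlaps:
        if passes(o):
            parent[find(o.a)] = find(o.b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


if __name__ == "__main__":
    ids = ["est1", "est2", "est3", "tc1"]
    overlaps = [
        Overlap("est1", "est2", 0.97, 120, 3, 0),
        Overlap("est2", "tc1", 0.96, 45, 0, 19),
        Overlap("est3", "tc1", 0.99, 200, 25, 0),  # overhang too long
        Overlap("est3", "est1", 0.94, 300, 0, 0),  # identity too low
    ]
    for members in cluster(ids, overlaps):
        print(sorted(members))
```

In this sketch a sequence with no qualifying overlap ends up in a group of size one, which corresponds to a singleton; each multi-member group would then be passed to the assembler separately.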
Each Gene Index, consisting of the assembled TC sequences and singletons, is released through the TIGR web site. The TC presentation includes a FASTA-formatted consensus, a graphical representation of each component sequence within the TC, links to GenBank and other relevant records for each component sequence, and functional and mapping information where available.