Approximately 46% of the singletons and assembled contigs in our common carp EST project failed to yield an identity using the unattended BLASTx procedure [10], some of which represent non-overlapping or 3' sequences of identified genes. Other clones failed to yield a sequence on automated analysis yet provided suitable hybridisation probes on the microarray. Given these special problems and complications in this species due to a recently duplicated genome, the sometimes incoherent nature of subsequent gene losses [24], and the divergent tissue- and response-specific expression patterns so generated [9], we have explored how expression alignment techniques might complement the more usual sequence alignment methods to assign an identity for an otherwise unidentifiable sequence.
Our approach was based on the idea that expression profiles for non-overlapping probes derived from the same gene should be highly correlated when tested across a range of experimental treatments, and that this should enable unidentified clones to be identified by comparison with identified clones. Similarly, comparing expression profiles for cDNA microarray probes possessing the same BLASTx identity offers a means of testing their common identity, given that they may represent unrecognised isoforms or variants of a given gene. Thus, combining alignment of sequence data with that from gene expression data offers a useful means of improving the quality of gene identification, and of discriminating isoforms or members of gene families whose separate identity may not easily be made evident using conventional methods.
For this work we chose to include all available cDNA clones on the microarray, resulting in up to 80 clones per contig, and to gather expression profiles from a wide range of major organs and tissues, exposed to a range of experimental treatments. Consequently, the carp array included substantial repetition of some genes, and this provided greater support for the identified gene clusters. Our approach was based on the comparison of the expression profiles of pairs of probes using Pearson correlation coefficients, which were used to create a network linking genes together on the basis of their shared expression properties. The VxInsight algorithm uses a force-repulsion mechanism to gather the distributed gene networks into discrete clusters, which are then presented in an easy-to-understand landscape metaphor.
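The pairwise comparison step can be illustrated with a minimal sketch. It computes Pearson correlation coefficients between every pair of probe profiles and keeps the strongly correlated pairs as network edges. The probe names, expression values and the 0.9 threshold are hypothetical and chosen only for illustration; the force-directed clustering performed by VxInsight is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def correlation_network(profiles, threshold=0.9):
    """Return edges (probe_a, probe_b, r) for pairs with r >= threshold.

    `profiles` maps a probe name to its expression values across arrays.
    The threshold is an illustrative assumption, not a value from the study.
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical log-ratio profiles across six arrays.
profiles = {
    "clone_A": [0.1, 1.2, 2.3, 0.4, 1.8, 2.9],
    "clone_B": [0.2, 1.1, 2.5, 0.3, 1.9, 3.0],  # tracks clone_A closely
    "clone_C": [2.0, 0.5, 0.1, 2.2, 0.4, 0.2],  # roughly opposite pattern
}
edges = correlation_network(profiles)
print(edges)
```

In this toy input only clone_A and clone_B are linked, which is the behaviour expected of two non-overlapping probes derived from the same gene.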
We show that the resulting landscape features and the associated clusters were robust, first, because permuting and randomising the expression values generated neither high correlation coefficients nor landscape features, and second, because the form of the clusters is largely retained when using different scales of array data from small to large. We show that datasets that contain a wider range of experimental treatments and tissues can fragment the gene clusters into smaller forms, each with a distinctive character. Thus, the exact level of discrimination achieved depends upon the diversity of the data used in its construction, with extra experimental treatments offering additional changes in the expression relationships between genes, thereby refining the resulting correlations.
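The logic of the permutation control can be sketched for a single pair of probes. The example shuffles one profile repeatedly and asks how often the shuffled data reach the observed correlation. This is a simplified per-pair version written under our own assumptions (the number of permutations, the random seed and the expression values are hypothetical), not the array-wide randomisation used to regenerate the landscape.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=200, seed=0):
    """Estimate how often shuffled data match or exceed the observed r."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks the pairing between arrays
        if pearson(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical profiles for two probes across eight arrays.
probe_1 = [0.1, 1.2, 2.3, 0.4, 1.8, 2.9, 0.7, 2.1]
probe_2 = [0.3, 1.0, 2.6, 0.2, 1.7, 3.1, 0.9, 2.0]
p_value = permutation_p_value(probe_1, probe_2)
print(p_value)
```

A small estimate indicates that a correlation as high as the observed one rarely arises once the expression values are randomised.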
Gene identification using ExprAlign
Many of the resulting landscape features or mountains generated by VxInsight contained predominantly just one kind of BLAST-identified gene, and we show that there is substantial enrichment of these genes within the features compared to chance alone. Thus unidentified probes within that mountain were also tentatively labelled with that gene identity. Using this approach we were able to impute an identity to 522 unidentified clones in the GE landscape, which represented ~17% of all unidentified clones on the map. The validity of this assignment can be tested by the attended analysis of the clone sequences. This was achieved for mountain GE10, which possessed 5 different probes identified as myoglobin by BLASTx, and another 5 probes lacking an identity. Closer inspection of the corresponding sequences, and manual attempts at alignment, were subsequently able to demonstrate that all of the unidentified ESTs were also myoglobin, including additional examples of a unique brain-specific isoform [11]. This indicates the limitations of unattended BLAST analysis and, where appropriate, the need for manual verification of clones assigned an ExprAlign identity. A similar result was obtained for parvalbumin [32].
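The imputation step can be sketched as follows. The example finds the dominant BLAST identity within one landscape feature, checks whether it is over-represented relative to the whole array, and if so transfers that identity to the unidentified probes in the feature. The use of a hypergeometric tail probability, the 0.01 cut-off, the array size and all probe names are assumptions made for illustration; the text above does not specify how enrichment was calculated.

```python
from collections import Counter
from math import comb


def hypergeom_tail(k, cluster_size, n_labelled, population):
    """P(X >= k) when cluster_size probes are drawn without replacement
    from a population containing n_labelled probes with the identity."""
    upper = min(cluster_size, n_labelled)
    total = comb(population, cluster_size)
    favourable = sum(
        comb(n_labelled, i) * comb(population - n_labelled, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def impute_identity(cluster, array_labels, alpha=0.01):
    """Tentatively label unidentified probes with the enriched identity.

    cluster: probe name -> BLAST identity, or None if unidentified.
    array_labels: identity (or None) of every probe on the array.
    Returns an empty dict when no identity is enriched at level alpha.
    """
    identified = [label for label in cluster.values() if label is not None]
    if not identified:
        return {}
    label, count = Counter(identified).most_common(1)[0]
    p = hypergeom_tail(count, len(cluster),
                       array_labels.count(label), len(array_labels))
    if p >= alpha:
        return {}
    return {probe: label for probe, lab in cluster.items() if lab is None}


# Hypothetical feature: five myoglobin probes and five unidentified probes.
cluster = {f"myo_{i}": "myoglobin" for i in range(5)}
cluster.update({f"unk_{i}": None for i in range(5)})
# Hypothetical array of 1000 probes, six of them labelled myoglobin.
array_labels = ["myoglobin"] * 6 + ["other"] * 494 + [None] * 500
imputed = impute_identity(cluster, array_labels)
print(imputed)
```

Identities assigned in this way remain tentative and would still call for manual verification of the clone sequences.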
Separation of isoforms using ExprAlign
ExprAlign has also proved useful in separating clones that have been assigned a common BLASTx identity but which have distinctive expression profiles. If clones possessing the same gene BLAST identity were indeed sourced from the same transcriptional start site, then they should display identical expression patterns. On the other hand, if the clones were representative of different isoforms with distinctive expression properties, then they would occupy different features on the expression landscape, or could be distinguished using additional clustering techniques. We show that these expectations are largely met for a series of test cases, as described above. As a further example, in the case of fatty acid-binding protein (mountain GE17c, not shown), we identified 13 cDNA clones all possessing the same BLASTx sequence alignment, but which displayed two contrasting expression profiles. In this particular case, the differences were very subtle and limited to just one tissue (liver) responding to just one treatment condition (hypoxia at 17°C). Separation required application of K-means clustering to the genes contained within the landscape feature. This again indicates that the level of functional dissection possible by ExprAlign depends to a significant extent on the diversity of treatments for which array data is generated.
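The separation of clones sharing a BLASTx identity can be illustrated with a minimal K-means sketch using two clusters. The clone names and expression values are hypothetical, with the final column standing in for the single tissue and treatment in which the profiles diverge. The deterministic farthest-point seeding is our own assumption, adopted so that the example is reproducible.

```python
from math import dist


def kmeans(points, k=2, n_iter=20):
    """Plain K-means on a list of equal-length profiles; returns labels."""
    # Deterministic seeding: start from the first profile, then add the
    # profile farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Hypothetical profiles for four clones with the same BLASTx identity.
# Columns: three unresponsive conditions, then liver under hypoxia.
profiles = {
    "fabp_1": [1.0, 0.9, 1.1, 1.0],
    "fabp_2": [1.1, 1.0, 1.0, 0.9],
    "fabp_3": [1.0, 1.0, 1.1, 2.4],
    "fabp_4": [0.9, 1.1, 1.0, 2.6],
}
names = list(profiles)
labels = kmeans([profiles[name] for name in names])
print(dict(zip(names, labels)))
```

Here the clones split into two groups solely on the basis of the last condition, which illustrates why discrimination depends on the treatments included in the dataset.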