The D. melanogaster genome annotations version 3.1 [62] were obtained from the BDGP. Only genes in the euchromatic portion of the genome were used for analysis. C. elegans genomic data were obtained from WormBase genome freeze WS100 [63].
Expression data for D. melanogaster were obtained from two independent sources. First, we determined the number of 'Expression and Phenotype' tags for all D. melanogaster genes listed in FlyBase [65]. Second, we measured embryonic expression complexity by counting the 'body parts' listed in the BDGP in situ hybridization database [66] (accessed 10 October 2003). This project uses a controlled vocabulary to annotate the expression of each gene during embryogenesis [29]. C. elegans expression data were obtained through AQL (Acedb Query Language) queries of WormBase for all genes that possessed 'Expr_pattern' entries.
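The counting step behind these expression indices can be sketched as follows. This is a minimal illustration, not the original pipeline: the function name `expression_complexity`, the (gene, term) input layout, and the choice to count a repeated term only once are all assumptions, and the example records are invented.

```python
from collections import defaultdict

def expression_complexity(annotations):
    """Count distinct annotation terms per gene.

    `annotations` is an iterable of (gene_id, term) pairs. A term that
    is listed more than once for the same gene is counted once
    (an assumption; the source does not say how repeats were handled).
    """
    terms_by_gene = defaultdict(set)
    for gene_id, term in annotations:
        terms_by_gene[gene_id].add(term)
    return {gene: len(terms) for gene, terms in terms_by_gene.items()}

# Hypothetical records, for illustration only.
example = [
    ("geneA", "ventral nerve cord"),
    ("geneA", "midgut"),
    ("geneA", "midgut"),   # duplicate term, counted once
    ("geneB", "trachea"),
]
complexity = expression_complexity(example)
```

Here `complexity` maps each gene to its index value, which is the quantity later used to sort genes into bins.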
The housekeeping (HK) gene set was generated by combining three lists of proposed human housekeeping genes [6]. This nonredundant list was compared by BLAST [67] to the D. melanogaster and C. elegans genomes. We retained only the best hit in each genome, provided it passed an E-value threshold of 1 × 10^-20. The CDY (C. elegans, D. melanogaster, and yeast) dataset is derived from single-copy genes shared by Saccharomyces
]. We infer that these genes will largely have shared basal functions and few cell-type-specific functions [38]. Gene lists and sequences were retrieved by EnsMart from the Ensembl Genome Browser [68]. Because the C. elegans genome annotation employs different GO terms from that of Drosophila, we placed C. elegans genes into corresponding GO categories by BLAST of the D. melanogaster GO gene sets against the C. elegans
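The best-hit filtering used for the HK set can be sketched as below. Several things here are assumptions rather than details from the text: the input is taken to be 12-column tab-delimited BLAST output (the layout produced by BLAST+ `-outfmt 6`), the E-value criterion is read as "E-value no larger than 1 × 10^-20", and the function name `best_hits` and the example rows are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-20):
    """Return {query: (subject, evalue)} with the lowest-E-value hit per query.

    Assumes 12 tab-separated columns per line, with the query ID in
    column 1, the subject ID in column 2 and the E-value in column 11.
    Hits with an E-value above `max_evalue` are discarded.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

# Hypothetical rows, for illustration only.
example = [
    "humanHK1\tflyP1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-50\t300",
    "humanHK1\tflyP2\t60.0\t200\t80\t0\t1\t200\t1\t200\t1e-25\t150",
    "humanHK2\tflyP3\t30.0\t100\t70\t0\t1\t100\t1\t100\t1e-5\t40",
]
hits = best_hits(example)
```

In this toy input, `humanHK1` keeps only its stronger hit and `humanHK2` is dropped because its single hit fails the threshold.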
Data analysis and statistics
Data management and analysis were performed using a combination of Perl programs, Microsoft Excel and JMP 3.0 (SAS Institute).
Composition of individual indices and bins. FlyBase index (1,879 genes): Bin 1, genes with an index value of 1, corresponding to one 'Expression and Phenotype' entry in FlyBase, N = 108 genes; Bin 2, two entries, N = 227; Bin 3, three entries, N = 172; Bin 4, four to five entries, N = 184; Bin 5, six to eight, N = 206; Bin 6, 9-13, N = 235; Bin 7, 14-18, N = 184; Bin 8, 19-29, N = 187; Bin 9, 30-49, N = 193; Bin 10, 50-336, N = 183.
BDGP index (1,698 genes): Bin 1, one body part listed, N = 163; Bin 2, two body parts, N = 184; Bin 3, three body parts, N = 172; Bin 4, four body parts, N = 159; Bin 5, five body parts, N = 145; Bin 6, six to seven body parts, N = 201; Bin 7, eight to nine body parts, N = 180; Bin 8, 10-11, N = 144; Bin 9, 12-14, N = 142; Bin 10, 15-42, N = 208.
WormBase index (1,130 genes): Bin 1, one 'Expr_pattern' entry, N = 357; Bin 2, two entries, N = 192; Bin 3, three entries, N = 116; Bin 4, four entries, N = 123; Bin 5, five entries, N = 98; Bin 6, six entries, N = 61; Bin 7, seven entries, N = 52; Bin 8, eight entries, N = 39; Bin 9, 9-11, N = 43; Bin 10, 12-27, N = 49.
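The bin assignment described above can be illustrated with a short sketch using the FlyBase boundaries and bin sizes quoted in the text. The helper `assign_bin` and the constant names are hypothetical; only the numbers come from the bin descriptions.

```python
from bisect import bisect_left

# Inclusive upper bound of each FlyBase bin, as listed in the text.
FLYBASE_UPPER = [1, 2, 3, 5, 8, 13, 18, 29, 49, 336]
# Number of genes reported for each FlyBase bin.
FLYBASE_SIZES = [108, 227, 172, 184, 206, 235, 184, 187, 193, 183]

def assign_bin(index_value, upper_bounds):
    """Return the 1-based bin whose range contains `index_value`."""
    if index_value < 1 or index_value > upper_bounds[-1]:
        raise ValueError(f"index value {index_value} falls outside all bins")
    # bisect_left finds the first bin whose upper bound is >= index_value.
    return bisect_left(upper_bounds, index_value) + 1

# The reported bin sizes add up to the stated total of 1,879 genes.
flybase_total = sum(FLYBASE_SIZES)
```

For example, a gene with four or five FlyBase entries falls in Bin 4 and a gene with 50 entries falls in Bin 10. The same function applies to the BDGP and WormBase indices once their upper bounds are substituted.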
Comparison of all pairs of bins in each index was performed using Tukey-Kramer HSD. Because the size of intergenic DNA in each bin approximates a log-normal distribution (Figure , and data not shown), we compared both raw and log-transformed measurements. In all cases, bins of higher inferred complexity tended to have higher average measures of intergenic DNA than bins of lower inferred complexity (Tukey-Kramer HSD, α = 0.05).
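For reference, the standard textbook form of the Tukey-Kramer comparison for two bins $i$ and $j$ of unequal size is given below. The notation is introduced here for illustration and is not taken from the source: $\bar{y}_i$ is the mean (raw or log-transformed) intergenic DNA size in bin $i$, $n_i$ is the number of genes in that bin, $k$ is the number of bins, $N$ is the total number of genes, and $\mathrm{MSE}$ is the within-bin mean square from the one-way analysis of variance.

```latex
% Tukey-Kramer statistic for bins i and j (unequal sample sizes)
q_{ij} = \frac{\lvert \bar{y}_i - \bar{y}_j \rvert}
              {\sqrt{\dfrac{\mathrm{MSE}}{2}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}},
\qquad
\text{bins } i \text{ and } j \text{ differ at level } \alpha
\text{ if } q_{ij} > q_{\alpha;\,k,\,N-k}
```

Here $q_{\alpha;\,k,\,N-k}$ is the upper-$\alpha$ critical value of the studentized range distribution for $k$ groups and $N-k$ error degrees of freedom.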
Composition of functional groups: CDY, Ce N = 1,237, Dm N = 1,250; general transcription factors, Ce N = 43, Dm N = 43; HK, Ce N = 540, Dm N = 609; pattern specification, Ce N = 73, Dm N = 73; embryonic development, Ce N = 88, Dm N = 88; specific transcription factors, Ce N = 45, Dm N = 45; metabolism, Ce N = 881, Dm N = 881; cell differentiation, Ce N = 46, Dm N = 46; receptor activity, Ce N = 106, Dm N = 106; ribosome constituents, Ce N = 93, Dm N = 93.
The mean size of the intergenic DNA associated with each group suggested that the simple gene groups are not significantly different between species, but that both simple groups are smaller than both complex groups and that the C. elegans complex group is smaller than the D. melanogaster complex group (Tukey-Kramer HSD, α < 1e-4). This interpretation was confirmed by independent inspection of the intergenic DNA size distributions for each group. Complex groups had many more genes with large intergenic regions than simple groups did. Comparison between the C. elegans complex group and the D. melanogaster complex group was complicated by the observation that the D. melanogaster group contained both more genes with smaller than average intergenic regions and many more genes with much larger than average intergenic measures. We divided both raw and log-transformed measures from D. melanogaster and C. elegans into halves containing the largest and smallest 50% of genes. The largest 50% of complex genes in D. melanogaster is flanked by significantly more DNA than the largest 50% of C. elegans complex genes (Wilcoxon two-sample test, p < 0.001).
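The split-and-compare step can be sketched as follows. This is a minimal illustration under stated assumptions: the Wilcoxon two-sample (rank-sum) test is computed with the large-sample normal approximation, without tie or continuity corrections, which may differ from what JMP reports. The function names are hypothetical and the input values are invented, not measurements from this study.

```python
import math

def upper_half(values):
    """Return the largest 50% of values (the middle value is kept if n is odd)."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2:]

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). A positive z means x tends to have larger values than y.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided tail probability
    return z, p

# Hypothetical intergenic sizes (arbitrary units), for illustration only.
dm_complex = [0.5, 1, 2, 8, 20, 40]
ce_complex = [0.8, 1.5, 2, 3, 4, 6]
z, p = rank_sum_test(upper_half(dm_complex), upper_half(ce_complex))
```

With samples as small as this toy input, the normal approximation is crude. It is used here only to keep the sketch free of external dependencies.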