The two genesets contained similar numbers of genes. The first geneset comprised 483 candidate genes whose expression varied reproducibly and highly significantly between the strains, with no sex or sex-by-line effects (P < 0.01 for line and P > 0.05 for sex and sex by line). The second geneset comprised 527 candidate genes whose expression showed no significant variation (P > 0.1 for sex, line, and sex by line). The remaining 11007 genes, whose significance values fell between these thresholds, were not analyzed. Within the first geneset, 172 genes were more actively transcribed in Russian 2b females than in Oregon R females, and 311 genes were more actively transcribed in Oregon R females than in Russian 2b females. In males, expression of 154 genes was greater in Russian 2b than in Oregon R, and expression of 329 genes was greater in Oregon R than in Russian 2b. Thus, in both sexes, more genes were expressed at higher levels in the Oregon R strain than in the Russian 2b strain (Table ).
| Table 1. Categorization of microarray data set. |
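The categorization rule described above can be written as a small filter. The following is a minimal sketch only: the `GeneP` record, its field names, and the example P values are hypothetical and do not come from the original analysis; only the thresholds and the gene counts are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class GeneP:
    """Hypothetical record holding one gene's ANOVA P values."""
    gene: str
    p_line: float
    p_sex: float
    p_sex_by_line: float


def categorize(g: GeneP) -> str:
    """Assign a gene to a geneset using the thresholds stated in the text."""
    # First geneset: strong line effect, no sex or sex-by-line effect.
    if g.p_line < 0.01 and g.p_sex > 0.05 and g.p_sex_by_line > 0.05:
        return "first"
    # Second geneset: no detectable effect of any factor.
    if g.p_line > 0.1 and g.p_sex > 0.1 and g.p_sex_by_line > 0.1:
        return "second"
    # Everything in between was not analyzed.
    return "not analyzed"


# Illustrative (made-up) P values, one per category.
examples = [
    GeneP("geneA", 0.001, 0.40, 0.70),
    GeneP("geneB", 0.55, 0.30, 0.90),
    GeneP("geneC", 0.03, 0.20, 0.60),
]
for g in examples:
    print(g.gene, categorize(g))

# Gene counts reported in the text.
total_genes = 483 + 527 + 11007
print("total genes categorized:", total_genes)
```

Because the two threshold sets cannot both hold for one gene (P < 0.01 and P > 0.1 for line are mutually exclusive), the order of the two checks does not affect the result.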
Many genes in the first geneset are false positives. The number of candidate genes expected by chance in the first geneset, 325, is lower than the observed number, 483, resulting in a false discovery rate (FDR) of 67.4%. Thus, of the 483 genes chosen for the first geneset, 67.4%, or ~325 genes, are called significantly different when they are not. In other words, the expression of only ~158 of the 483 genes in the first geneset presumably differs significantly between the two strains.
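The arithmetic behind this estimate can be reproduced as follows. This is a sketch that assumes the FDR was computed as the expected number of chance calls divided by the observed number of calls; the function name is ours. With the rounded count of 325 the ratio comes out at about 67.3%, slightly below the reported 67.4%, presumably because the reported figure was derived from an unrounded expected count.

```python
def false_discovery_rate(expected_false: float, observed: int) -> float:
    """Expected chance calls divided by observed calls (assumed definition)."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return expected_false / observed


observed_genes = 483   # genes in the first geneset (from the text)
expected_false = 325   # genes expected by chance (rounded, from the text)

fdr = false_discovery_rate(expected_false, observed_genes)
likely_true = observed_genes - expected_false

print(f"FDR = {fdr:.1%}")
print(f"genes presumably differing between strains = {likely_true}")
```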
Polymorphisms are numerous in both genesets. Of the 34 promoters analyzed, six do not differ between strains; however, they belong to genes whose transcripts differ between strains. The remaining 28 proximal promoters contain at least one polymorphism, with one promoter containing as many as 37 polymorphisms (Figures and ). The mean, median, and mode of polymorphisms per gene are 8.5, 6, and 0, respectively. Of the 288 total polymorphisms detected in at least 1 kb of the proximal promoters, 258 (89.6%) were SNPs and 30 (10.4%) were indels. Of the SNPs, 56.2% were transitions within a nucleotide class, while 43.8% were transversions between nucleotide classes (Figure ). These proximal promoter polymorphisms created or removed 239 putative binding sites for transcription factors; thus 258/288 or 90% of proximal promoter polymorphisms fall within putative transcription factor binding sites (Figures and ).
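The transition/transversion distinction used above follows the standard definition: a transition exchanges two purines (A, G) or two pyrimidines (C, T), whereas a transversion exchanges a purine for a pyrimidine or vice versa. A minimal sketch of that classification is shown here; the function name is illustrative and this is not the authors' pipeline.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify_snp(ref: str, alt: str) -> str:
    """Classify a single-nucleotide substitution as transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES:
        raise ValueError("ref and alt must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(classify_snp("A", "G"))  # purine to purine
print(classify_snp("C", "T"))  # pyrimidine to pyrimidine
print(classify_snp("A", "C"))  # purine to pyrimidine

# Shares of SNPs and indels among the 288 polymorphisms reported in the text.
snps, indels = 258, 30
print(f"SNPs: {snps / (snps + indels):.1%}, indels: {indels / (snps + indels):.1%}")
```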
Although few, indels varied between the strains by kind and from 1 to 43 nt in length [see Additional File 1]. Indels were classified as direct repeats (dr), homopolymer repeats (hpr), microsatellites (mcs), or non-repetitive (nr) according to designations of Schaeffer [19], page 165. In most cases, indels in one strain were the same as in the Celera strain used as a reference. However, six repeats in five different genes resulted in differences in sequence among the Russian 2b, Oregon R, and Celera strains. In bin (CG18647), five additional Ts are found in the Oregon R strain in comparison with the poly-T₄ tract in the Russian 2b strain. This caused a transition and insertion from T₄CT₃ in the Celera strain. Distal to this in bin, the ATACCCGTACCCGTACCCAT sequence in the Russian 2b strain was shortened to ATACCCGTACCCAT in the Celera strain but absent altogether in the Oregon R strain. In tacc (CG9765), the poly-A tract varies from A₁₄–₁₇ nt among individuals in the Russian 2b strain, to A₂₃–₂₆ nt among individuals in the Oregon R strain, to A₂₂ in the Celera strain. In qkr58E-3 (CG3584), a SNP and variation in the length of a poly-T tract resulted in T₃GT₄ in the Oregon R strain, T₉ in the Celera strain and T₁₁ in the Russian 2b strain. In KP78a (CG6715), the dinucleotide microsatellite AC was repeated 9× in the Russian 2b strain, 10× in the Celera strain, and 11× in the Oregon R strain. In stan (CG11895), the homopolymer repeat was T₁₀ nt long in the Celera strain, T₁₁ in the Russian 2b strain, and T₉ in the Oregon R strain.
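The tract notation used in this paragraph (a base followed by its repeat count, as in the poly-T tracts of bin and qkr58E-3) is a run-length encoding of the sequence. The helper below is an illustrative sketch with a name of our choosing; it writes counts as plain digits and only collapses single-base runs, so a dinucleotide microsatellite such as the AC repeat in KP78a is left unchanged.

```python
from itertools import groupby


def tract_notation(seq: str) -> str:
    """Collapse runs of identical bases, e.g. a run of four Ts becomes 'T4'."""
    parts = []
    for base, run in groupby(seq.upper()):
        n = sum(1 for _ in run)
        # Single bases are written without a count.
        parts.append(base if n == 1 else f"{base}{n}")
    return "".join(parts)


print(tract_notation("TTTTCTTT"))     # Celera tract in bin, per the text
print(tract_notation("TTTGTTTT"))     # Oregon R tract in qkr58E-3, per the text
print(tract_notation("T" * 11))       # Russian 2b tract in qkr58E-3
print(tract_notation("ACACACACAC"))   # dinucleotide repeat is not collapsed
```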
Indels in Scab, Cry, Ih, and bin (CG8095, CG16963, CG8585, CG18647) contained small regions (12–15 nt long) that shared sequence similarity with known transposable elements. Despite these matches, the sequences are not long enough to discriminate confidently between a TE footprint and chance occurrence of the same sequence. Flanking the indel in Cry listed in the Supplementary Table [see Additional File 1], a 152 bp sequence from -889 to -1041 relative to the translational start site in the Oregon R strain matched a DNA LINE retroelement. In the same location in the Russian 2b strain a 152 bp sequence matched two overlapping DNAREP1_DM LINE elements [20].
The diversity and frequency of polymorphisms did not differ between proximal promoters of genes differing in expression and those of genes with similar expression (P = 0.911 for indels, P = 0.935 for transition SNPs, and P = 0.842 for transversion SNPs) (Figures and ). For example, we identified 59 transversion SNPs, 76 transition SNPs (135 total SNPs), and 16 indels in the first geneset, and 54 transversion SNPs, 69 transition SNPs (123 total SNPs), and 14 indels in the second geneset. In addition, the average promoter length was similar: 1629 nt in the first geneset and 1620 nt in the second geneset (Table ). Thus, for the 34 genes examined in this study, the similar levels of proximal promoter sequence variation in the two genesets implicate alternative sources for divergent patterns of gene expression.
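The counts quoted in this paragraph can be cross-checked against the overall totals reported earlier (258 SNPs, 30 indels, 288 polymorphisms, 43.8% transversions). The tally below is a simple consistency check on the published numbers, not a reanalysis; the dictionary layout is ours, and no significance test is attempted because the test used is not named in this section.

```python
# Polymorphism counts per geneset, as reported in the text.
counts = {
    "first": {"transversion": 59, "transition": 76, "indel": 16},
    "second": {"transversion": 54, "transition": 69, "indel": 14},
}

# SNPs per geneset are the sum of transitions and transversions.
snps_per_set = {
    name: c["transversion"] + c["transition"] for name, c in counts.items()
}
total_snps = sum(snps_per_set.values())
total_indels = sum(c["indel"] for c in counts.values())
total_polymorphisms = total_snps + total_indels
transversion_share = (
    sum(c["transversion"] for c in counts.values()) / total_snps
)

print("SNPs per geneset:", snps_per_set)
print("total SNPs:", total_snps)
print("total indels:", total_indels)
print("total polymorphisms:", total_polymorphisms)
print(f"transversion share of SNPs: {transversion_share:.1%}")
```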
| Table 2. Genes used in this study. From the data of Gibson et al. (2004), 34 candidate genes were chosen for study whose transcripts were expressed to higher levels in either Oregon R (OrR) or Russian 2b (R2b) strain (P < 0.01) (first geneset) or did not ... |