The two genesets contained similar numbers of genes. The first geneset comprised 483 candidate genes whose expression varied reproducibly and highly significantly between the strains, with no sex or sex by line effects (P < 0.01 for line and P > 0.05 for sex and sex by line). The second geneset comprised 527 candidate genes whose expression showed no evidence of variation (P > 0.1 for sex, line, and sex by line). The remaining 11007 genes, whose significance values fell between these thresholds, were not analyzed. Within the first geneset, 172 genes were more actively transcribed in Russian 2b females than in Oregon R females, and 311 genes were more actively transcribed in Oregon R females than in Russian 2b females. Likewise, the expression of 154 genes was greater in Russian 2b males than in Oregon R males, and the expression of 329 genes was greater in Oregon R males than in Russian 2b males. Thus, in both sexes, roughly twice as many genes were expressed at higher levels in the Oregon R strain than in the Russian 2b strain (Table ).
Categorization of microarray data set.
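The categorization rule described above can be written as a short function. This is only a sketch of the stated P-value thresholds; the function name and argument names are hypothetical and are not taken from the original analysis.

```python
def classify_gene(p_line, p_sex, p_sex_by_line):
    """Assign a gene to a geneset from its three P values.

    Hypothetical helper: it only encodes the thresholds quoted in the text.
    """
    # First geneset: strong line effect, no sex or sex-by-line effect.
    if p_line < 0.01 and p_sex > 0.05 and p_sex_by_line > 0.05:
        return "first"
    # Second geneset: no detectable effect of any factor.
    if p_line > 0.1 and p_sex > 0.1 and p_sex_by_line > 0.1:
        return "second"
    # Everything in between was not analyzed.
    return "unanalyzed"


# Gene counts reported in the text for the three categories.
REPORTED_COUNTS = {"first": 483, "second": 527, "unanalyzed": 11007}
TOTAL_GENES = sum(REPORTED_COUNTS.values())
```

With the reported counts, the three categories together account for 12017 genes.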
Many genes in the first geneset are likely false positives. The number of candidate genes expected by chance in the first geneset, 325, is less than the observed number, 483, resulting in a False Discovery Rate (FDR) of 67.4%. Thus, of the 483 genes in the first geneset, 67.4%, or ~325 genes, are called significantly different when they are not. In other words, the expression of only ~158 of the 483 genes in the first geneset presumably differs significantly between the two strains.
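The arithmetic behind this estimate can be checked in a few lines. The sketch below assumes the FDR is simply the expected number of false candidates divided by the observed number; the function name is hypothetical. Using the rounded count of 325 gives a ratio of about 67%, which agrees with the reported 67.4% up to rounding of the expected count.

```python
def fdr_summary(expected_false, observed):
    """Return (FDR, presumed true positives) from two counts.

    Assumes FDR = expected false candidates / observed candidates.
    """
    if observed <= 0:
        raise ValueError("observed must be positive")
    fdr = expected_false / observed
    presumed_true = observed - expected_false
    return fdr, presumed_true


# Counts quoted in the text for the first geneset.
fdr, presumed_true = fdr_summary(expected_false=325, observed=483)
print(f"FDR = {fdr:.3f}, presumed true differences = {presumed_true}")
```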
Polymorphisms are numerous in both genesets. Of the 34 promoters analyzed, six do not differ between strains; however, they belong to genes whose transcripts differ between strains. The remaining 28 proximal promoters contain at least one polymorphism, with one promoter containing as many as 37 polymorphisms (Figures and ). The mean, median, and mode of polymorphisms per gene are 8.5, 6, and 0, respectively. Of the 288 total polymorphisms detected in at least 1 kb of the proximal promoters, 258 (89.6%) were SNPs and 30 (10.4%) were indels. Of the SNPs, 56.2% were transitions within a nucleotide class, while 43.8% were transversions between nucleotide classes (Figure ). These proximal promoter polymorphisms created or removed 239 putative binding sites for transcription factors; thus 258/288 or 90% of proximal promoter polymorphisms fall within putative transcription factor binding sites (Figures and ).
Figure 1 Schematics of proximal promoters. At least one kb of the proximal promoters of 34 candidate genes whose transcripts vary (left and center columns) or do not vary (right column) in expression between D. melanogaster strains. Genes whose expression is greater (more ...)
Figure 2 Schematics of proximal promoters. At least one kb of the proximal promoters of 34 candidate genes whose transcripts vary (left and center columns) or do not vary (right column) in expression between D. melanogaster strains. Genes whose expression is greater (more ...)
Figure 3 Categories of SNPs between two strains. We identified 258 single nucleotide polymorphisms (SNPs) in 1–2 kb of the promoter region 5' of the translational start site of genes whose expression does and does not vary on microarray. SNPs are reported (more ...)
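The transition/transversion distinction used for the SNPs can be made explicit with a small classifier. This is a minimal sketch using the standard definition (purines A and G, pyrimidines C and T); the function name is hypothetical and the code is not the pipeline used in the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(allele_1, allele_2):
    """Classify a two-allele SNP as a transition or a transversion."""
    a, b = allele_1.upper(), allele_2.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("alleles must be one of A, C, G, T")
    if a == b:
        raise ValueError("alleles are identical; not a SNP")
    # Same nucleotide class (purine<->purine or pyrimidine<->pyrimidine)
    # is a transition; a change of class is a transversion.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"
```

For example, `classify_snp("A", "G")` is a transition, whereas `classify_snp("A", "C")` is a transversion.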
Although few, indels varied between the strains by kind and from 1 to 43 nt in length [see Additional File 1]. Indels were classified as direct repeats (dr), homopolymer repeats (hpr), microsatellites (mcs), or non-repetitive (nr) according to designations of Schaeffer [19], page 165. In most cases, indels in one strain were the same as in the Celera strain used as a reference. However, six repeats in five different genes resulted in differences in sequence among the Russian 2b, Oregon R, and Celera strains. In bin (CG18647), five additional Ts are found in the Oregon R strain in comparison with the poly-T4 tract in the Russian 2b strain. This caused a transition and insertion from T4 in the Celera strain. Distal to this in bin, the ATACCCGTACCCGTACCCAT sequence in the Russian 2b strain was shortened to ATACCCGTACCCAT in the Celera strain but absent altogether in the Oregon R strain. In tacc (CG9765), the poly-A tract varies from A14–17 nt among individuals in the Russian 2b strain, to A23–26 nt among individuals in the Oregon R strain, to A22 in the Celera strain. In qkr58E-3 (CG3584), a SNP and variation in the length of a poly-T tract resulted in T3 in the Oregon R strain, T9 in the Celera strain and T11 in the Russian 2b strain. In KP78a (CG6715), the dinucleotide microsatellite AC was repeated 9X in the Russian 2b strain, 10X in the Celera strain, and 11X in the Oregon R strain. In stan (CG11895), the homopolymer repeat was T10 nt long in the Celera strain, T11 in the Russian 2b strain, and T9 in the Oregon R strain.
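Repeat-length differences of this kind can be tallied by counting consecutive copies of a repeat unit. The sketch below is illustrative only: the function name is hypothetical, and the choice of GTACCC as the repeat unit in the bin example is an assumption read off the two quoted sequences, not a designation from the source.

```python
def max_tandem_copies(seq, unit):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    best = 0
    step = len(unit)
    for start in range(len(seq) - step + 1):
        copies = 0
        pos = start
        # Walk forward one unit at a time while the unit keeps matching.
        while seq.startswith(unit, pos):
            copies += 1
            pos += step
        best = max(best, copies)
    return best


# Sequences quoted in the text for the distal bin indel.
russian_2b = "ATACCCGTACCCGTACCCAT"
celera = "ATACCCGTACCCAT"
print(max_tandem_copies(russian_2b, "GTACCC"), max_tandem_copies(celera, "GTACCC"))
```

The same function handles homopolymer tracts (unit `"T"`) and dinucleotide microsatellites (unit `"AC"`).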
Indels in Scab, Cry, Ih, and bin (CG8095, CG16963, CG8585, CG18647) contained small regions (12–15 nt long) that shared sequence similarity with known transposable elements. Despite these matches, the sequences are not long enough to discriminate confidently between a TE footprint and chance occurrence of the same sequence. Flanking the indel in Cry listed in the Supplementary Table [see Additional File 1], a 152 bp sequence from -889 to -1041 relative to the translational start site in the Oregon R strain matched a DNA LINE retroelement. In the same location in the Russian 2b strain a 152 bp sequence matched two overlapping DNAREP1_DM LINE elements [20].
The diversity and frequency of polymorphisms did not differ between proximal promoters of genes differing in expression and those of genes with similar expression (P = 0.911 for indels, P = 0.935 for transition SNPs, and P = 0.842 for transversion SNPs) (Figures and ). For example, we identified 59 transversion SNPs, 76 transition SNPs (summing to 135 total SNPs), and 16 indels in the first geneset, and 54 transversion SNPs, 69 transition SNPs (123 total SNPs), and 14 indels in the second geneset. Also, the average promoter length was 1629 nt in the first geneset and 1620 nt in the second geneset (Table ). Thus, for the 34 genes examined in this study, the lack of variation in proximal promoter sequence between the two genesets implicates alternative sources for divergent patterns of gene expression.
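As an informal sanity check on the counts above, one can ask whether the mix of polymorphism types differs between the two genesets. The sketch below uses a chi-square test of homogeneity on the 2 x 3 table of counts. This is an assumption made for illustration: the text does not state which test produced the reported per-class P values, so the result is not expected to reproduce them. All function names are hypothetical.

```python
import math


def chi_square_homogeneity(table):
    """Return (chi-square statistic, degrees of freedom) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_p_value_df2(stat):
    """Upper-tail p-value of a chi-square statistic; valid for df = 2 only."""
    return math.exp(-stat / 2.0)


# Columns: transversion SNPs, transition SNPs, indels (counts from the text).
counts = [
    [59, 76, 16],  # first geneset
    [54, 69, 14],  # second geneset
]
stat, df = chi_square_homogeneity(counts)
p_value = chi2_p_value_df2(stat)  # applicable here because df == 2
print(f"chi-square = {stat:.4f}, df = {df}, p = {p_value:.3f}")
```

The closed-form p-value is used only because a 2 x 3 table has two degrees of freedom; a table of any other shape would need a general chi-square survival function.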
Table 2 Genes used in this study. From the data of Gibson et al. (2004), 34 candidate genes were chosen for study whose transcripts were expressed to higher levels in either Oregon R (OrR) or Russian 2b (R2b) strain (P < 0.01) (first geneset) or did not (more ...)