To find a suitable strain for SNP discovery, we investigated four independent wild isolates that grow well in culture, are interfertile with the sequenced AF16 strain, and represent both tropical and temperate groups [10
] (). We initially aligned a small number of random genomic sequences against the AF16 assembled sequence to ascertain the approximate incidence of single nucleotide variation. Two temperate strains of Japanese origin (HK104 and HK105) both had relatively high rates of difference (~1 SNP/110 bases) while the Hawaiian (VT847) and the Ohio (PB800) strains (tropical and temperate respectively) had apparently lower rates (see Materials and Methods
for details of SNP detection). We selected one strain of each level (HK104 and VT847) for more extensive SNP discovery. From 8,405 and 9,970 aligned sequence reads from whole genome shotgun libraries from each strain we identified respectively 32,246 and 14,183 substitutions with Phred [12
] quality scores of greater than 35, giving overall rates of 8.7 and 3.2 per kilobase. We also identified a number of candidate small insertion/deletion differences (7,118 events affecting 18,196 bases and 3,575 events affecting 8,315 bases, respectively).
Variations Detected in Four C. briggsae Strains
Construction of the Genetic Map
To construct a genetic map, 390 SNPs distributed across the sequence were tested against 93 AF16 × HK104 RI lines and the parental strains [10
] using the fluorescent polarization detection (FP-TDI) assay (Vieux et al. 2002; see Materials and Methods
for details of SNP assay). To maximize the amount of sequence mapped and to provide an independent check of the assembly, the 390 SNPs were selected from the larger supercontigs, thus ensuring that the larger physical map–based contigs (called fpc contigs for simplicity, after the program used to assemble the physical map [13
]) would contain multiple markers and thus serve to check the assembly. In about a quarter of the cases, a second SNP was selected within a single supercontig to test the assembly at this level. A SNP was declared as mapped when the assays were successful on between 80% and 100% of the 95 strains tested, with a total of 248 SNPs (64%) meeting this criterion. Some 84 SNPs (22%) had success rates between 0% and 40% and were deemed failures. The high rate of failures was likely caused by PCR problems due to unaccounted SNPs in primer sites, a problem faced by all investigated genotyping platforms [14
]. Some five SNPs (1.3%) were monomorphic and likely due to false SNP calls. Other SNPs failed quality control tests or had success rates of 40–80%.
These same SNP assays were also tested against the VT847 strain, for which RI lines were also available. Relatively few of the AF16/HK104 SNPs were polymorphic between AF16 and VT847, suggesting that the overlap in variation between the HK104 and VT847 is very limited. This meant that genotyping of these additional RI lines with these markers added little new map information.
We tested several different parameters for map construction, using the program Map Manager QTXb20 (http://www.mapmanager.org/
]). The different versions varied in map length per chromosome, total map length, and in the local order of markers within a chromosome, but assignment of markers to common linkage groups was a robust feature of the maps. The latter was due in part to the large number of nonrecombinant chromosomes in the RI lines (35–60% per chromosome), which allowed ready assignment to linkage groups. Based on these experiences, we adopted the following strategy to build version 3.0 of the genetic map: we used an initial set of 115 very high quality markers (>95% call rates) and a second set of slightly lower quality (80–95% call rates). We used the Haldane function and an initial probability of incorporation of a SNP into the map of 10−6
. Seven linkage groups were formed, one with only two SNPs (cb23233 and cb23314). We reduced the probability required for incorporation to 10−3
, and the latter group was incorporated into the end of chromosome CbIV. Thus the number of linkage groups matched the observed number of chromosomes (). The program provided map positions in centimorgans (cM) for each of the incorporated markers, with each of the chromosomes approximately 50 cM in length.
The 247 Markers Fall into Six Linkage Groups
Inspection of the raw data in the version 3.0 map in conjunction with the marker order in the sequence assembly highlighted places where markers of equivalent or nearly equivalent position in the genetic map could be shuffled to reconcile their order with that in the assembly. In addition the initial genetic map of the X chromosome (CbX; see below for chromosome assignments) showed a number of inconsistencies with the sequence assembly that could all be reconciled by a single inversion of the central segment of the genetic map for CbX. Additional recombinant data obtained for CbX from an experimental cross (see Materials and Methods
) supported the revised genetic marker order. These changes were incorporated into version 3.1. Finally, inspection of the raw data in conjunction with the known groupings of markers based on the assembly suggested alternate orders of markers on chromosomes CbI, CbIV, and CbV that reduced the number of multiply recombinant chromosomes. These changes reduced overall map length by over 16 cM and did not reduce logarithm of the odds scores (logarithm of the odds score is a statistical estimate of whether two loci are likely to be near each other on a chromosome and therefore likely to be inherited together) of any markers below the threshold; they were incorporated into the genetic map to produce version 3.2.
Using this framework map, inspection of the remaining markers with lower call rates indicated that 44 of them could be readily linked to chromosomes and tentatively positioned within the chromosome. These added markers sometimes helped in orienting contigs and in five cases, positioned previously unplaced contigs. However, the lower overall call rates of these markers make their placement less certain.
Segregation of Parental Markers
With the genetic map in place, we examined the frequency of parental alleles within the RI lines across the chromosomes. For chromosomes CbII and CbX, there was little variation from the expected value of 50% for each marker. But for other chromosomes, there were regions of biased representation of the AF16 and HK104 alleles. For example, the AF16 allele was consistently underrepresented for most of CbIII, whereas it was overrepresented for much of chromosome CbIV (). Chromosomes CbI and CbV also showed biased representation, but over more limited regions (see Datasets S1
). The biased representation of alleles presumably reflects some selective advantage for offspring with these regions, either singularly or in combination. The selection of the first progeny at each generation in establishing the RI lines may have contributed to this bias. The relatively small number of recombinant events in these lines however precludes finer localization of such factors.
Percent Recovery of the AF16 Allele in the RI Lines for Each of the Markers Plotted against Its Position in cM on C. briggsae Chromosomes CbIII and CbIV
Integrating Genetic and Sequence Maps
The sequence-based markers used in the construction of the genetic map allowed for ready integration of the genetic and sequence maps into a genome map. The association of a genetic marker with a supercontig and, in turn, an fpc contig positioned that sequence on a specific chromosome, and when multiple, genetically separated markers were assigned to a single sequence assembly, that sequence could be oriented. Generally, multiple markers from the same supercontig or fpc contig had adjacent positions in the genetic map, confirming the assembly in these instances.
However, markers assigned to 21 sequence assemblies were derived from more than one linkage group, indicating an error in either the genetic linkage assignment or in the sequence assembly. Because marker assignment to linkage groups was a robust feature of the genetic map and inspection of the raw data revealed no problems with the assignment in these discordant cases, the sequence map was probed for possible errors. Only one discrepancy was noted among 68 supercontigs with more than one marker, suggesting that misassemblies within supercontigs (constructed by using read-pair information to link sequence contigs) were unlikely to account for the bulk of the observed discrepancies. On the other hand, we noted that most markers with discordant linkage fell on fpc contigs (in which supercontigs were linked based on the physical clone map information). Detailed inspection showed that in these cases, a join based on the physical clone map information fell between discordant markers.
Once the conservation of synteny between C. elegans and C. briggsae chromosomes was established (see below), the 1:1 orthology landmarks were used to delimit the region with the assembly problem, making it clear that the discrepancies arose because of false joins based on the lower resolution physical clone map (). Inspection of the physical map in a sample of these regions revealed questionable clone overlaps often accompanied by an editor's comment to that effect, consistent with a misassembly at that point. As a result, 27 breaks were made in the fpc contigs at the site defined by the orthology landmarks (renamed as segments a, b, etc. of the parent contig). The single discordant supercontig was also broken at a site bounded by the ortholog landmarks. These breaks in the sequence assembly eliminated the inconsistencies between assignment of the markers to sequence assemblies and linkage groups ().
Example of the Transition in Synteny for an Assembly with Discrepant Genetic Markers
Integration with Sequence Map
Obviously, other misassemblies may remain undetected, because misassembled regions failed to have a genetic marker. Investigation of the entire sequence for clusters of discordant orthologs suggests five regions of more than 50 kilobases (kb) that are likely candidates for misassembly. Further, our analysis is less sensitive to misassemblies within the same chromosome, because precise order within linkage groups is less robust, making detection harder. Nonetheless, with one exception, markers in a single sequence assembly lie adjacent to one another in the current map. In the exception (cb25.fpc4010), a high-quality marker maps to the right end of chromosome CbIII, whereas two low-confidence markers suggest positions near the opposite end. Further, with one exception, multiple markers in a single sequence assembly fall in an order consistent with the genetic map order. In the single exception, a simple inversion of a pair of SNP markers in cb25.fpc3752 would reconcile the maps. However, we only altered the sequence assembly where there were compelling genetic data that the assembly was in error.
The integrated genetic and sequence map provide an initial genome map. The confidently placed genetic markers position 141 sequence assemblies, accounting for 89.4 Mb of the sequence along the chromosomes, with 42 of these oriented, accounting for 47.7 Mb. Inclusion of the lower-confidence markers provides tentative positions for an additional five assemblies, containing 1.8 Mb. By using read-pair information for assemblies adjacent in the genetic map, we were able to orient an additional 45 contigs, bringing the total oriented sequence to 67.3 Mb. In addition, by considering local order of 1:1 orthologs in both species (see below), we could tentatively order an additional 4.4 Mb. This reconciled genome map is reflected in version 3.3 of the genetic map.
Recombination Rates Vary along the Chromosomes
In C. elegans
, a distinctive feature of the genetic map is the reduced recombination per Mb of the centers of the autosomes compared with the arms [16
]. We looked at the recombination rates across the C. briggsae
autosomes and the putative X chromosome (see below) to see if the same features existed. Similar to that of C. elegans,
each of the C. briggsae
autosomes shows reduced recombination in the centers compared to the arms (A, Datasets S3
, and Figures S1
). Indeed, no recombinant events were observed in the RI lines over several megabases of the centers of several chromosomes, even though 60–70 recombinant events were observed on the 11–16-Mb autosomes. In contrast, recombination rates are more uniform on the presumptive X chromosome (B).
Recombination and Physical Distance in C. briggsae
These observations must be interpreted with some caution, because the C. briggsae genome map contains only 85% of the sequence, and assembly biases could mean that much of the unassigned sequence belongs on the arms. Further, some biases were introduced in the recovery of the RI lines, as noted above, which might also distort recombination rates. Finally the sequence differences and perhaps even inversions between the two strains could reduce recombination rates in local regions. Nonetheless, the general features seen here seem likely to be reflected in a more comprehensive map.
Repeats, Genes, and Conserved Gene Distribution
In addition to the marked variation in recombination rates along the autosomes in C. elegans,
repeat density and gene density were found to vary by region [6
] . We observed similar variation in density of these features in the C. briggsae
autosomes, with the repeat density higher and intron length greater on the arms and exon density greater in the centers (, Datasets S3
, and Figures S1
). Again, as seen in C. elegans
, telomere related repeats (TTAGGC) show a particularly marked difference in distribution. Strikingly, 1:1 orthologs, even after accounting for the greater exon density in the centers, are more common in the centers.
Variation of Features by Chromosomal Region Illustrated Using C. briggsae Chromosome III
Conservation of Synteny between C. elegans and C. briggsae Genomes
With the bulk of the C. briggsae
genome placed along chromosomes, the conservation of synteny (using synteny here in the originally defined sense of genes on the same linkage group or chromosome) and colinearity (meaning the order of genes along the chromosome) between C. elegans
and C. briggsae
could be investigated directly across the whole genome. Early analyses of colinearity, using clone-based datasets of limited sequence continuity, estimated median tract lengths of <10 kb in one study [5
] and 17 kb for autosomes and 41 kb for the sex chromosome in a second study [17
]. The initial analysis of the C. briggsae
whole-genome assembly observed a mean of 37,472 base pairs (bp) and a median 5,557 bp with a maximum block of 1.68 Mb [8
]. This initial analysis used genome-wide alignment data and allowed regions to match as many as five segments in the reciprocal genome. Inspection of the junctions between the 4,837 candidate colinear blocks (minimum length 1.8 kb) suggested the breakpoints represented 1,384 inversions, 244 translocations, and 2,735 transpositions.
To make the present analysis less sensitive to repeated sequences and to small blocks of similarity that may have arisen by the large number of transposition events, we began by identifying 9,767 1:1 gene pairs (where each gene was represented only once in its genome and matched only one gene in the other genome) using the previously defined gene set [8
]. These data provide an unambiguous orthologous landmark on average about every 10 kb. For those sequence assemblies that had only one genetic marker or that had genetic markers all on a single linkage group in the initial map, we found that the 1:1 orthologs on that assembly overwhelmingly derived from a single C. elegans
chromosome. The same observations held for the corrected assemblies. More remarkably, we noted for sequence assemblies assigned unambiguously to the same C. briggsae
linkage group that the 1:1 orthologs assignments were consistently from a single C. elegans
chromosome (). Exceptional orthologs were often isolated, singular events. This remarkable conservation of synteny between the two species allowed us to assign not just regions but each of the entire C. briggsae
linkage groups to its corresponding C. elegans
Reciprocal Mapping of 1:1 Orthologs
To look at the colinearity of the orthologs within chromosomes, we plotted their location in each of the pairs of syntenic chromosomes (, Dataset S5
, and Figure S3
). There have been extensive intrachromosomal rearrangements, but large colinear blocks remain, especially in the centers. More interestingly, sequences that are in the central, low-recombination segment of one species tend to be in the corresponding region in the other species. By contrast, there is mixing between the two arms.
Chromosomal Positions of Orthologs
To quantify this, we established blocks of sequence with the same order of genes in the two genomes allowing minor exceptions (see Materials and Methods
). Our methods yielded only 851 blocks using a minimum block size of one ortholog, with only a third of these more than 50 kb long. Because our analysis excludes repeated sequences, these numbers do not reflect most transposition events, which formed the bulk of the rearrangements detected in [8
]. Nonetheless, 351 of the 1:1 ortholog blocks are small enough (<20 kb) to be consistent with transposition events. Only 12 blocks greater than 20 kb involve nonsyntenic orthologs and might represent translocations; none of these have confirmatory genetic markers and could all represent assembly problems. Thus the only confirmed rearrangements represent intrachromosomal events. Their distribution across the chromosomes is distinctly nonrandom. As seen in , the block size of the X chromosomes is considerably larger than for the autosomes, and similarly within the autosomes, the block size in the centers is much larger than the arms. The median for the autosomes is similar to that obtained in [17
], whereas the median for the X is considerably larger, perhaps because of the greater continuity of the sequence in our study.
Colinear Block Size Characteristics
Syntenic and Nonsyntenic Orthologs
Given the overwhelming tendency of orthologs to remain on the same chromosome, we investigated the nonsyntenic ortholog pairs to see what features might distinguish them from syntenic pairs. To minimize the likely contamination of the nonsyntenic set with misassemblies, we excluded 12 larger clusters of nonsyntenic orthologs (see Materials and Methods
). The most distinctive difference between the two sets was the lower percent identity of the aligned nonsyntenic pairs (). These differences existed among pairs regardless of whether the members of the pair lay both on chromosome arms, both in chromosome centers, or one on an arm and one in the center.
Sequence Similarity of Syntenic and Nonsyntenic Ortholog Pairs
One explanation for the greater divergence of the nonsyntenic ortholog pairs might be that the true ortholog is missing in the draft C. briggsae sequence. We looked for evidence of this by finding the 1:1 orthologs (e.g., A and C) flanking the C. elegans member of a nonsyntenic ortholog pair, (ABC, where B is from the nonsyntenic pair) and then searching the region between the C. briggsae orthologs of A and C for evidence of large gaps or partial genes. Of 175 nonsyntenic ortholog pairs, we detected homology in the interval defined by the flanking orthologs for only 19 cases, and only 29 regions had 4% or more of the interval as uncalled bases (Ns). Almost half the intervals had less than 1% of the sequence as Ns. Thus, while the draft nature of the C. briggsae sequence may result in incorrect assignment of 1:1 orthology, producing an apparent increased divergence, it seems unlikely to account for the bulk of our observations.