A set of 22 551 unique human NotI flanking sequences (16.2 Mb) was generated. More than 40% of the set had regions with significant similarity to known proteins and expressed sequences. The data demonstrate that regions flanking NotI sites are less likely to form nucleosomes efficiently and resemble promoter regions. The draft human genome sequence contained 55.7% of the NotI flanking sequences, Celera’s database contained matches to 57.2% of the clones and all public databases (including non-human and previously sequenced NotI flanks) matched 89.2% of the NotI flanking sequences (identity ≥90% over at least 50 bp, data from December 2001). The data suggest that the shotgun sequencing approach used to generate the draft human genome sequence resulted in a bias against cloning and sequencing of NotI flanks. A rough estimation (based primarily on chromosomes 21 and 22) is that the human genome contains 15 000–20 000 NotI sites, of which 6000–9000 are unmethylated in any particular cell. The results of the study suggest that the existing tools for computational determination of CpG islands fail to identify a significant fraction of functional CpG islands, and unmethylated DNA stretches with a high frequency of CpG dinucleotides can be found even in regions with low CG content.
Draft sequences of the human genome were recently reported (1,2). Much work remains to be done to produce a complete finished sequence and progress can be best assured by a diversity of approaches (1). At present, one of the principal goals for genome research is a careful and systematic validation of the assembled sequence (1–4). In this respect it becomes critically important to develop and to apply strategies capable of fulfilling this crucial aim. These strategies should exploit approaches independent of those used to generate the original genome sequence. Short sequences flanking the rare restriction sites, for instance NotI, might serve as a tool for validation of human genome structure.
NotI linking clones contain pairs of sequences flanking a single NotI recognition site, while NotI jumping clones contain DNA sequences spanning between neighboring NotI restriction sites. Such clones were shown to be tightly associated with CpG islands and genes (5,6). The use of NotI linking and jumping clones as framework markers was proposed to define the structure of large regions of human chromosomes (7–13). To achieve this goal, simplified procedures for the construction of NotI jumping and NotI linking libraries were developed and a number of chromosome 3-specific and other chromosome-specific and total human NotI linking libraries were prepared (7–15).
One thousand human chromosome 3-specific NotI linking clones were partially sequenced (6). Among these, 249 unique clones were identified and 152 were carefully analyzed. To localize these clones, PCR, Southern hybridization, pulsed field gel electrophoresis (PFGE) and two- or three-color fluorescent in situ hybridization (FISH) were applied. In many cases, chromosome jumping was successfully used to resolve ambiguous mapping (6,13). This NotI map was compared to the chromosome 3 map, based on yeast artificial chromosome clones and radiation hybrids (14), and significant differences in several chromosome 3 regions were noticed. Importantly, these differences included a 3p14–p22 region with homozygous deletions and most likely containing tumor suppressor genes (6). These data supported earlier notions (13,15) that a NotI physical map can be more informative than genetic or radiation hybrid maps.
To enable a direct assessment of the value of NotI clones in genome research, high-density grids with 50 000 NotI linking clones derived from six representative NotI linking and three NotI jumping libraries were constructed. Altogether, these libraries contained nearly 100 times the total estimated number of NotI sites in the human genome. Sequencing of 20 000 NotI clones was projected to provide information linked to 10–20% of all human genes (9) and may help in the identification of new genes. Before starting a large-scale project, a pilot study to validate the proposed strategy was performed (16). In that work 3265 unique NotI flanking sequences were generated. Analysis of sequences demonstrated that ~50% of these clones displayed significant similarity to protein and cDNA sequences. Among these unique sequences, 1868 (57.2%) were novel sequences, not present in the EMBL or expressed sequence tag (EST) databases (similarity ≤90% over 50 bp). The work also showed tight, specific association of NotI sites with the first exons of genes. From that NotI resource several new genes have been identified, isolated and mapped (17–22).
As the pilot experiments confirmed expectations, the sequencing of NotI clones was continued and ~22 500 unique NotI sequences were generated. This work provides the initial analysis of these data.
Common molecular and microbiological methods were performed according to standard procedures (23). Plasmid DNA was isolated using a Biorobot 9600 (Qiagen) with REAL-prep kits according to the manufacturer’s instructions. Sequencing gels were run on ABI 377 automated sequencers (PE Applied Biosystems) according to the manufacturer’s protocols. Sequencing was done as described previously (16).
Construction of NotI linking and jumping libraries was as described (10,16). The CBMI-Ral-Sto cell line, selected for its unusually low level of methylation, was established by immortalization of human B cells with Epstein–Barr virus (EBV) strain B95-8 (24). Thus, the DNA isolated from this cell line contained EBV sequences. The nomenclature for the NotI linking libraries and clones used in this study is the same as in a previous work (16).
The EMBL/GenBank accession numbers for the NotI sequences used in this work are AQ936570–AQ939834 and AJ322533–AJ343893.
The analysis of sequences was performed at the Karolinska Institute Sequence Analysis Center (kisac.cgr.ki.se), using local versions of programs and public databases.
Protein and nucleotide similarity searches were performed with BLAST 2.0 (25,26). The high scoring segment pairs report cut-off (BLAST parameter –b) was restricted to 100 for protein and to 50 for nucleotide databases. The statistical significance threshold (BLAST parameter –e) was default (–e = 10) for the TREMBL (Translated EMBL) and SWISSPROT databases and for other databases searches was set to: EMBL and EST, –e = 1.E–10; Unigene (non-redundant set of gene-oriented clusters database), –e = 0.1; RefSeq (Reference Sequences) nucleotide and protein databases, –e = 0.001.
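The per-database significance thresholds listed above can be collected in a small lookup table. The following is a minimal sketch only: the dictionary and function names are hypothetical, and the code does not run BLAST. It only shows how a reported expect value would be compared with the threshold for each database.

```python
# Expect-value (-e) thresholds per database, as listed in the text.
# The names E_VALUE_CUTOFF and passes_threshold are illustrative assumptions.
E_VALUE_CUTOFF = {
    "TREMBL": 10.0,      # BLAST default
    "SWISSPROT": 10.0,   # BLAST default
    "EMBL": 1e-10,
    "EST": 1e-10,
    "Unigene": 0.1,
    "RefSeq": 1e-3,
}

# Maximum number of reported alignments (-b), by database type.
MAX_ALIGNMENTS = {"protein": 100, "nucleotide": 50}


def passes_threshold(database: str, e_value: float) -> bool:
    """Return True if a hit's expect value is within the database threshold."""
    return e_value <= E_VALUE_CUTOFF[database]


if __name__ == "__main__":
    print(passes_threshold("EST", 1e-12))     # a strong EST hit
    print(passes_threshold("RefSeq", 0.01))   # too weak for RefSeq
```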
MSPcrunch (version 2.4) was used to filter the BLAST output and select significant matches (27). This filtering ensures that domains with weak but significant hits are not missed because of other, higher scoring domains, and that ‘junk’ matches with biased composition are eliminated. Similarity data were sorted with MSPcrunch using default parameters (–B = 0.8 and 5, –C upper = 75, –C lower = 35 for protein alignments; –B = 0.8 and 5, –C upper = 140, –C lower = 90 for nucleotide alignments) and stringent parameters (–B = 0.85 and 0, –C upper = 85, –C lower = 45 for protein alignments; –B = 0.85 and 0, –C upper = 150, –C lower = 100 for nucleotide alignments).
Default parameters were used to search the RefSeq, EST and Unigene databases. Stringent parameters were used for the EMBL, HTGS (High Throughput Genomic Sequences) and SWISSPROT + TREMBL databases. Empirical testing suggests that these parameters are effective in removing false matches (27).
All short, simple and low complexity repeats were excluded from the analysis using RepeatMasker with the default minimum Smith-Waterman score of 225 (http://repeatmasker.genome.washington.edu).
We used the RefSeq database release of December 2001 (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html) with 14 300 human gene entries.
Initial searches with NotI sequences were performed in June 2000 (EMBL database release 59, SWISSPROT database release 38, TREMBL database release 14) and December 2000 (EMBL database release 65, SWISSPROT database release 38, TREMBL database release 15). Additional comparisons were done in March 2001 against the draft of the human genome sequence (1,2). NotI sequences that failed to match to any of the above databases were searched again in June and December 2001 (EMBL database releases 67 and 69, SWISSPROT database releases 39 and 40, TREMBL database releases 17 and 19 and the draft human genome sequence).
In order to focus on NotI restriction sites within the large body of data from the human genome sequencing project, a NotI sequence database from the human genome sequencing data was constructed. Sequences (1 kb) extending out from NotI sites were taken from the HTGS database for subsequent analysis. We allowed for one mismatch to the NotI target sequence to ensure maximal coverage of NotI sites.
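The site search described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the standard NotI recognition sequence GCGGCCGC, scans a sequence for windows with at most one mismatch, and returns up to 1 kb on each side of a site. The function names are hypothetical.

```python
NOTI_SITE = "GCGGCCGC"  # NotI recognition sequence (8 bp, palindromic)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_noti_sites(seq: str, max_mismatches: int = 1) -> list[int]:
    """Return 0-based start positions of windows within max_mismatches of the NotI site."""
    seq = seq.upper()
    k = len(NOTI_SITE)
    return [
        i
        for i in range(len(seq) - k + 1)
        if mismatches(seq[i:i + k], NOTI_SITE) <= max_mismatches
    ]


def flanks(seq: str, pos: int, flank_len: int = 1000) -> tuple[str, str]:
    """Return the sequences extending left and right from the site at pos."""
    k = len(NOTI_SITE)
    left = seq[max(0, pos - flank_len):pos]
    right = seq[pos + k:pos + k + flank_len]
    return left, right


if __name__ == "__main__":
    # Toy sequence: one exact site and one site with a single mismatch.
    toy = "AAAA" + "GCGGCCGC" + "TTTT" + "GCGGCTGC" + "CCCC"
    print(find_noti_sites(toy))
    print(flanks(toy, 4, flank_len=3))
```

Because NotI's recognition sequence is its own reverse complement, scanning one strand is sufficient for exact matches; a one-mismatch variant on the opposite strand also appears as a one-mismatch window on the scanned strand.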
The analysis of 1 kb NotI flanking sequences for nucleosome formation potential was performed as described (28).
In this study 23 574 sequences flanking NotI sites were generated. Among them, 217 (0.9%) matched genomic sequences for Escherichia coli, 296 (1.3%) were from Epstein–Barr virus (EBV) and 131 (0.6%) sequences were most probably of Pseudomonas sp. origin. Escherichia coli DNA contaminated the vector preparation and EBV B95-8 was present in the human genomic DNA (see Materials and Methods). The level of identity in these matches was at least 90% over 100 bp. Therefore, 22 930 sequences were classified as human. However, Pseudomonas sp.-related sequences may be a real component of the human genome, as previously suggested (1). A few sequences related to Synechocystis sp., Bacillus sp. and Streptomyces sp. were also detected. Therefore, these data support the suggestion that the human genome contains sequences related to different bacterial species (1). The origin of such sequences is unclear (1,29).
Within the human subset, 379 (1.7%) redundant sequences were identified. Redundant sequences were defined as sequences sharing at least 99% identity over 360 bp of their length (this safeguard was used so as not to remove unique sequences that nevertheless have partial overlap due to common repeats or localization close to NotI, BamHI or EcoRI restriction sites).
When the stringency for the identification of redundant sequences was reduced to 93% then only 15 702 sequences were classified as unique. However, we have previously found that NotI flanking sequences originating from different chromosomes can have 95–100% identity over >700 bp (6,16,22). Therefore, to exclude the possibility of removing unique flanks we defined the unique human subset to include the 22 551 sequences identified with more stringent criteria.
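The counts reported above follow from simple subtraction, which the short sketch below reproduces. All numbers are taken from the text; the variable names are illustrative only.

```python
# Counts reported in the text.
total_sequences = 23_574
non_human = {
    "Escherichia coli": 217,
    "Epstein-Barr virus": 296,
    "Pseudomonas sp.": 131,
}
redundant = 379

# Sequences classified as human after removing the three contaminant classes.
human = total_sequences - sum(non_human.values())

# Unique human set after removing redundant sequences (>=99% identity criterion).
unique_human = human - redundant

# Percentages as reported (share of all sequences, or of the human subset).
non_human_pct = {k: round(100 * v / total_sequences, 1) for k, v in non_human.items()}
redundant_pct = round(100 * redundant / human, 1)

if __name__ == "__main__":
    print(human, unique_human)
    print(non_human_pct, redundant_pct)
```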
The sequence set covered a total of 16 213 509 bp with an average sequence length of 720 bp. After sequencing, 0.3% (55 144 cases) of all nucleotides were reported as ambiguous (N). Comparisons between multiple sequencing reactions of the same clones indicated a sequencing accuracy of at least 98.5% over the first 160 bp. The sequence accuracy varied among clones with a negative relation to overall CG content and fluctuated along the sequence length. The achieved sequence fidelity was appropriate for the study as a short length cut-off for matching sequences was used (see Materials and Methods).
The repeat masking procedure (see Materials and Methods) identified 7227 (32.0%) of the NotI sequences as containing known repeats (Table 1). The repeats comprise 1 066 657 bp or 6.6% of the 16.2 Mb total and 20% of the total sequence from the repeat-containing clones. Comparison of these data with the draft human genome sequence revealed striking differences. All interspersed repeats occupied 44.8% of the total human genome sequence but only 4.7% of the NotI flanks. Alu repeats were present 6.6 times less frequently in the NotI sequences, while for LINE repeats this difference was almost 14-fold. Simple sequence repeats were also deficient in the vicinity of NotI sites (0.54 versus 3%). On the other hand, the youngest human LINE1 elements (L1Hs) represented 0.5% of all LINE elements in NotI flanking sequences compared to 0.1% in the total human genome. L1Hs are the only interspersed repeats that still actively transpose in the human genome (1). One possibility is that NotI flanking sequences are located in actively transcribed chromosomal regions and therefore are available for transposition. However, sequences surrounding NotI sites may not tolerate such insertions and thus the older inserted elements were eliminated during evolution.
To estimate the influence of repeats on the sequence analysis results, we compared masked and unmasked NotI sequences with the RefSeq nucleotide and protein databases (Fig. 1). The presence of repeats in the NotI flanking sequences does not affect the analysis results when stringent comparison criteria are used. In fact, only under relaxed stringency (e.g. 60% similarity over 50 bp) do repeats influence the results dramatically. All further analyses utilized masked sequences.
Several data collections have been compared against the human genome (euchromatic) assembly to estimate sequence coverage, including the RefSeq cDNA database, the set of STS markers, the set of radiation hybrid markers and randomly produced raw sequences. The public consortium estimated the draft genome sequence covered 88–90% (together with other public databases up to 94%) of the genome and Celera estimated that their draft sequence contained 91–99% (2).
The set of NotI sequences can be used to assess the coverage of the draft sequences. Unmethylated NotI sites from the CBMI-Ral-Sto cell line were anticipated to be present in the NotI clone grids, while the HTGS database contains both methylated and unmethylated NotI sites.
The draft human genome sequence (1,2) contains a significant portion of the NotI sequence collection (Fig. 2). With stringent criteria, 55.7% of the NotI flanking sequences were present in a recent public assembly of the human genome (December 2001, identity ≥90%). Inclusion of the Celera sequences identified an additional 1.5% of NotI flanks. All public databases (EMBL + HTGS + EST) matched 89.2% of the NotI flanking sequences, and search stringency is important here: this number increased to 91.1% at identity ≥78% and went down to 84.1% at identity ≥95%. The public draft sequence contained 19 552 NotI sites (i.e. 39 104 NotI flanking sequences). The EMBL coverage is misleading, as the EMBL database contains more than 4500 NotI flanking sequences generated in the previous studies (6,16).
Comparison with the complete chromosome 21 and 22 sequences (3,4) revealed several interesting features. The assembled chromosome 21 sequence contains 122 NotI sites (methylated and unmethylated). Ichikawa et al. (13) cloned 40 NotI sites and it was sufficient to construct the complete NotI restriction map. This map contained 43 NotI fragments but, using incomplete digestion with 40 NotI clones, it was possible to order all NotI fragments. The NotI flanking sequence database contains 49 NotI sites for chromosome 21 (Table 2). Altogether, out of 390 possible NotI sites on chromosomes 21 and 22, the NotI database contains 168 (43%) sites. From our data we can conclude that unmethylated NotI sites represent at least 43%. Eighteen clones that were identified in our work (5%) were present in public sequences with one nucleotide mismatch in the NotI site. Thus these clones either represent polymorphic NotI sites or result from sequencing errors in the public data.
Considering the redundancies in the draft genomic sequences and large differences in methylation status of NotI sites across cell lines (15) it is difficult to estimate the coverage of NotI flanks in this study. Based on the completed chromosome 21 and 22 sequences we can draw some conclusions. First, it was shown that chromosome 22 contains >2-fold more genes than chromosome 21, and we see the same ratio within the NotI flanking sequences. We have demonstrated that nearly all of the NotI clones contained genes (6,9) and suggested that 12.5–20% of all genes contain NotI sites (9). This correlates well with the number of genes on chromosomes 21 and 22 (168/770 = 22%). Second, the two chromosomes contain 390 NotI sites. Therefore, if we assume that each NotI site is associated with a gene (in reality we have shown that sometimes two genes are located close to the same NotI site; 30), then almost half of the genes contain NotI sites. This estimate appears excessive. We suggest that there are two distinct classes of NotI sites. The first group is ‘live’ NotI sites that are unmethylated or, more accurately, are not always methylated. They are located in CpG islands and associated with genes. The ‘dead’ NotI sites comprise the second group and they are (always) methylated and located outside functional CpG islands and genes. Further research is necessary to test this hypothesis.
More stringent parameters were applied to the gene discovery pipeline than in the pilot study (Fig. 3; 16) (see Materials and Methods). A check against the SWISSPROT and TREMBL databases indicated that 23.2% of the total NotI flanking sequences were significantly similar to known proteins.
Of the 22 551 unique NotI flanking sequences 48.7% were novel, as they were not previously present in the EMBL and EST databases. For these novel sequences, potential novel coding sequences were analyzed. Based on the stringent selection criteria 8.9% of the total sequences were identified with similarity to known proteins. Among the remaining 8972 novel clone sequences, 1649 (7.3%) sequences had identity of >78% to sequences in the EMBL and EST databases and 7323 (32.5%) clones were not similar to previously identified sequences.
Results of a sequence comparison with full-length human cDNA protein coding sequences from Unigene are shown in Figure 4A. As compared to results obtained in the pilot experiment (16) the portion with significant matches increased ~2-fold.
The number of sequences matching 5′ and 3′ ESTs is higher than the total number of NotI sequences that are likely to be expressed, e.g. 11.3% + 33.9% = 45.2% > 37.1% (for 90% similarity; see Fig. 4B). This is because the same NotI sequence can match 5′ as well as 3′ ESTs. These data further support a previous suggestion that many of the matching ‘3′ EST’ sequences are actually situated in the 5′ ends of genes that contain NotI sites in their first exons (6,16). Venter et al. (2) extended these results for the entire genome and demonstrated a strong correlation between CpG islands and first coding exons.
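The overlap implied by these percentages can be estimated by inclusion and exclusion. The sketch below assumes, for illustration only, that the 37.1% figure is exactly the union of sequences matching 5′ ESTs and sequences matching 3′ ESTs; if the total also includes matches to ESTs of unassigned orientation, the true overlap would be larger.

```python
# Percentages quoted in the text for 90% similarity.
match_5prime = 11.3   # % of NotI sequences matching 5' ESTs
match_3prime = 33.9   # % of NotI sequences matching 3' ESTs
match_any = 37.1      # % likely to be expressed (assumed here to be the union)

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
match_both = round(match_5prime + match_3prime - match_any, 1)

if __name__ == "__main__":
    print(f"Sequences matching both 5' and 3' ESTs: {match_both}%")
```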
To estimate how many NotI flanking sequences matched genes from other organisms, NotI sequences were compared to ESTs from all organisms. Several hundred additional ESTs (e.g. 661 for identity ≥78%) were similar to NotI flanking sequences. These NotI clones most likely represented human genes evolutionarily related to the genes from other organisms.
It is well known that CpG islands are associated with genes and their most important feature is an absence of cytosine methylation (1,5). The human genome sequence data cannot discriminate between methylated and unmethylated cytosines. There are several algorithms for the identification of CpG islands on the basis of primary sequence. One quantitative definition holds that CpG islands are regions of DNA >200 bp long with a C+G content of >50% and a ratio of ‘observed versus expected’ frequency of CG dinucleotides which exceeds 0.6 (1,31,32). The ratio for the entire genome is approximately 0.2 (1). According to the previous data 82% of NotI sites are located in CpG islands (32,33). It is important to note that these data were obtained using either computational methods or limited experimental data sets. Using the NotI cloning method only unmethylated NotI sites can be isolated. An analysis of CG content for the first 350 bp is shown in Table 3. Comparing these data with Lander et al. (1), two main features are apparent: the fraction of sequences with >80% CG content is nine times higher in the NotI collection, i.e. 142 versus 22 sequences. Another striking finding is that even NotI flanking sequences with a CG content <50% have a very high ratio of observed versus expected frequency of CG dinucleotides (0.71). This suggests that essentially all NotI flanking sequences generated in the study are located in CpG islands and, therefore, the computational method misses at least 8.7% of CpG islands associated with NotI sites.
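The quantitative definition quoted above can be illustrated with a short calculation. This is a sketch under the assumption that the ‘observed versus expected’ ratio is computed in the commonly used way, as (number of CG dinucleotides × sequence length) / (number of C × number of G). The function names are hypothetical and the code is not the procedure used in the study.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (C+G fraction, observed/expected CG dinucleotide ratio)."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = cg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def meets_island_criteria(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Apply the three thresholds from the quoted definition."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc_fraction > min_gc and obs_exp > min_oe


if __name__ == "__main__":
    # Toy sequence: low C+G content but a high observed/expected CG ratio.
    at_rich = "ATCGAT" * 50
    print(cpg_stats(at_rich))
    print(meets_island_criteria(at_rich))
```

The toy sequence fails the C+G threshold although its observed/expected ratio is high, which is the kind of case the text argues is missed when overall CG content is required to exceed 50%.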
Regulatory regions, especially promoters, are negatively associated with the formation of nucleosomes (28). A total of 142 chromosome 3-specific NotI sequences for which 1 kb flanks were available in the human genome sequence (phase 3 ‘finished’ sequence) were selected for an analysis of nucleosome formation potential (NFP). Positive NFP values indicate sequences that are likely to form nucleosomes efficiently, while negative scores indicate sequences likely to have poor nucleosome stacking ability. The results demonstrate (Fig. 5) that regions flanking NotI sites are less likely to form nucleosomes efficiently as their NFP values are below –1 and therefore resemble promoter regions in this feature.
It should be emphasized that the enormous efforts deployed on sequencing the human genome (1,2) are extremely important; however, there remains a critical role for verified, integrated maps. In sequencing, the short and long repeats spread throughout the genome are sources of numerous errors. These errors are difficult to identify with a shotgun strategy, but they become evident when mapping information is combined with the sequence. Furthermore, difficulties in sequence assembly caused by the existence of large families of recently duplicated genes and pseudogenes are easier to resolve using integrated maps.
In many cases, sequence and mapping information is duplicated, overlapping or contradictory. One must always keep in mind that even absolutely correct and long nucleotide sequences may be localized incorrectly along the chromosomal DNA if the appropriate accompanying mapping information is ignored. For this reason, in spite of the vast amount of information presently available, there is an urgent need to reconcile this information in a unified framework, to generate an integrated non-controversial map for each individual chromosome.
We believe that the NotI flanking sequences generated in this study will be helpful in verifying contig assemblies and in connecting orphan sequence contigs into a final genome assembly. NotI clones can serve as STSs that can be mapped precisely using PFGE and FISH. These flanking sequences have already been helpful in the isolation and mapping of new genes and resolving ambiguities in chromosome 3 maps (6,17–22,34). We think that the NotI clones will also be helpful as probes to close existing gaps in the draft human genome sequence and in estimating the completeness of the human genome sequence due to the independent approach used in this study. The data demonstrate that the draft human genome sequence has a strong bias against NotI flanking sequences, as a significant number of the human NotI sequences were not detected.
Several explanations can be offered to account for the low representation of NotI flanking sequences in the draft human genome sequences. We have cloned all of the NotI sites and constructed a physical map for two chromosome 3 regions containing tumor suppressor genes (6,34,35; A.I.Protopopov, V.I.Kashuba, V.Zabarovska, O.Muravenko, M.I.Lerman, G.Klein and E.R.Zabarovsky, unpublished results). In the course of these studies it became apparent that large-insert vectors from these regions were unstable and sensitive to deletions and rearrangements and that the original map was erroneous (35–37; A.I.Protopopov, V.I.Kashuba, V.Zabarovska, O.Muravenko, M.I.Lerman, G.Klein and E.R.Zabarovsky, unpublished results; http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/maps.cgi?org=hum&chr=3). Thus one potential explanation is that the cloning of some NotI site-containing regions may be selected against in experiments with large-insert cloning vectors. Our experience has also shown that even in small-insert plasmid vectors some human sequences are easier to clone than others. In our procedure, we directly selected for clones containing NotI sites, while in a shotgun sequencing approach such sequences could be under-represented. An alternative explanation, based on the observation that some NotI flanking sequences can have 100% identity over long DNA stretches (22), is that some NotI sites were incorrectly fused in the assembly process. Furthermore, our experience demonstrates that sometimes it is very difficult to read NotI flanking sequences because of the extremely high CG content. During human genome assembly such sequences would be eliminated as low quality data. Further experimental analysis is needed to conclusively identify the cause(s) of the bias.
The results of this work show that NotI flanking sequences are a rich source for identification of new genes. A difference in this study compared with the pilot experiment is the lower fraction of NotI sequences with protein similarity (23.2% now versus 51% in the pilot study), resulting from more stringent analysis criteria. The comparatively high level of sequence errors (1.5% for the first 160 bp) is likely to be a consequence of the CG-rich stretches of DNA. In some cases CG content exceeded 95% over 100 bp. However, the sequencing accuracy does not affect the main results of the study, as short matches were used for searches. Moreover, lower stringency criteria for searches (25% for proteins and 75–78% for DNA) did not significantly alter the results.
It is difficult to precisely determine the number of NotI sites in the human genome and the unmethylated portion in any particular cell type. Our rough estimation (based on chromosomes 21 and 22) is that the human genome contains 15 000–20 000 NotI sites, of which 6000–9000 are unmethylated in a subset of cells.
The detection of CpG islands is difficult using only sequence data, as evidenced by existing computational methods missing a significant fraction of functional CpG islands. We conclude that unmethylated DNA stretches with a high frequency of CpG dinucleotides can be found in regions with low CG content. This conclusion is consistent with the surprising deduction made by Venter et al. (2): significantly more genes than expected are located in DNA regions with low CG content. This suggests that computational identification of CpG islands will improve if more weight is placed on the ratio of observed versus expected frequency of CG dinucleotides, rather than overall CG content.
In summary, this work has demonstrated that sequences flanking NotI restriction sites can be used to complement large-scale human genome sequencing. As the organization of the human genome and that of other mammals is similar, this approach will contribute to the success of future sequencing projects.
This work was supported by research funds from Pharmacia Corp. to the Center for Genomics and Bioinformatics, as well as grants from the Swedish Cancer Society, The Swedish Research Council, the Ingabritt och Arne Lundbergs Foundation, The Royal Swedish Academy of Sciences, the Karolinska Institute, The Swedish Foundation for International Cooperation in Research and Higher Education (to A.S.K.) and partly by the Russian National Human Genome Program (grants to O.V.M. and L.L.K.).
DDBJ/EMBL/GenBank accession nos AQ936570–AQ939834, AJ322533–AJ343893. To whom correspondence should be addressed at: Microbiology and Tumor Biology Center, Karolinska Institute, Box 280, 171 77 Stockholm, Sweden. Tel: +46 8 728 6750; Fax: +46 8 319 470. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.