Nucleic Acids Res. 1992 June 11; 20(11): 2741–2747.
PMCID: PMC336916

Corruption of genomic databases with anomalous sequence.


We describe evidence that DNA sequences from vectors used for cloning and sequencing have been incorporated accidentally into eukaryotic entries in the GenBank database. These incorporations were not restricted to one type of vector or to a single mechanism. Many minor instances may have been the result of simple editing errors, but some entries contained large blocks of vector sequence that had been incorporated by contamination or other accidents during cloning. Some cases involved unusual rearrangements and areas of vector distant from the normal insertion sites. Matches to vector were found in 0.23% of 20,000 sequences analyzed in GenBank Release 63. Although the possibility of anomalous sequence incorporation has been recognized since the inception of GenBank and should be easy to avoid, recent evidence suggests that this problem is increasing more quickly than the database itself. The presence of anomalous sequence may have serious consequences for the interpretation and use of database entries, and will have an impact on issues of database management. The incorporated vector fragments described here may also be useful for a crude estimate of the fidelity of sequence information in the database. In alignments with well-defined ends, the matching sequences showed 96.8% identity to vector; when poorer matches with arbitrary limits were included, the aggregate identity to vector sequence was 94.8%.
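As a rough illustration of the quantities reported above, the following Python sketch shows one way an entry could be screened against a vector sequence and how a per-alignment and an aggregate (length-weighted) percent identity might be computed. It is not the authors' procedure: the exact k-mer seeding, the function names, the choice to pool identity by alignment length, and the toy sequences are all assumptions made for illustration. For scale, 0.23% of 20,000 sequences corresponds to about 46 entries.

```python
from typing import Dict, Iterable, List, Tuple


def shared_kmer_hits(entry: str, vector: str, k: int = 12) -> List[Tuple[int, int]]:
    """Return (entry_pos, vector_pos) pairs where an exact k-mer is shared.

    Hypothetical seeding step: real screens would extend and score these
    seeds with a proper alignment tool rather than stop at exact matches.
    """
    index: Dict[str, List[int]] = {}
    for i in range(len(vector) - k + 1):
        index.setdefault(vector[i:i + k], []).append(i)
    hits: List[Tuple[int, int]] = []
    for j in range(len(entry) - k + 1):
        for i in index.get(entry[j:j + k], ()):
            hits.append((j, i))
    return hits


def count_matches(aligned_a: str, aligned_b: str) -> Tuple[int, int]:
    """Return (identical columns, total columns) for two gapped, aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return matches, len(aligned_a)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base."""
    matches, columns = count_matches(aligned_a, aligned_b)
    if columns == 0:
        raise ValueError("alignment is empty")
    return 100.0 * matches / columns


def aggregate_identity(alignments: Iterable[Tuple[str, str]]) -> float:
    """Pool matches and columns over many alignments (assumed length-weighted)."""
    total_matches = 0
    total_columns = 0
    for a, b in alignments:
        matches, columns = count_matches(a, b)
        total_matches += matches
        total_columns += columns
    if total_columns == 0:
        raise ValueError("no aligned columns")
    return 100.0 * total_matches / total_columns


if __name__ == "__main__":
    # Toy sequences only; these are not real vector or GenBank data.
    toy_vector = "GGATCCTCTAGAGTCGACCTGCAGG"
    toy_entry = "TTTT" + toy_vector[5:20] + "CCCC"
    print(shared_kmer_hits(toy_entry, toy_vector, k=8)[:3])
    print(percent_identity("ACGT", "ACGA"))
    print(aggregate_identity([("ACGT", "ACGA"), ("AAAAAAAA", "AAAAAAAA")]))
```

Pooling by alignment length means that long matches dominate the aggregate figure; averaging the per-alignment percentages instead would weight short and long matches equally and would generally give a different number.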

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press