1.  Accuracy and coverage assessment of Oryctolagus cuniculus (Rabbit) Genes Encoding Immunoglobulins in the Whole Genome Sequence Assembly (OryCun2.0) and Localization of the IGH Locus to Chromosome 20 
Immunogenetics  2013;65(10):749-762.
We report analyses of genes encoding immunoglobulin heavy and light chains in the rabbit 6.51x whole genome assembly. This OryCun2.0 assembly confirms previous mapping of the duplicated IGK1 and IGK2 loci to chromosome 2 and the IGL lambda light chain locus to chromosome 21. The most frequently rearranged and expressed IGHV1 that is closest to IG DH and IGHJ genes encodes rabbit VHa allotypes. The partially inbred Thorbecke strain rabbit used for whole-genome sequencing was homozygous at the IGK but heterozygous with the IGHV1a1 allele in one of 79 IGHV-containing unplaced scaffolds and IGHV1a2, IGHM, IGHG and IGHE sequences in another. Some IGKV, IGLV and IGHA genes are also in other unplaced scaffolds. By fluorescence in situ hybridization, we assigned the previously unmapped IGH locus to the q-telomeric region of rabbit chromosome 20. An approximately 3 Mb segment of human chromosome 14 including IGH genes predicted to map to this telomeric region based on synteny analysis could not be located on assembled chromosome 20. Unplaced scaffold chrUn0053 contains some of the genes that comparative mapping predicts to be missing. We identified discrepancies between previous targeted studies and the OryCun2.0 assembly and some new BAC clones with IGH sequences that can guide other studies to further sequence and improve the OryCun2.0 assembly. Complete knowledge of gene sequences encoding variable regions of rabbit heavy, kappa and lambda chains will lead to better understanding of how and why rabbits produce antibodies of high specificity and affinity through gene conversion and somatic hypermutation.
PMCID: PMC3780782  PMID: 23925440
Rabbit; Immunoglobulin Genes; Heavy Chains; Fluorescence in situ hybridization; Chromosome 20; Light Chains
2.  Database resources of the National Center for Biotechnology Information 
Nucleic Acids Research  2013;42(Database issue):D7-D17.
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, PubReader, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Primer-BLAST, COBALT, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, ClinVar, MedGen, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page.
PMCID: PMC3965057  PMID: 24259429
4.  A radiation hybrid map of river buffalo (Bubalus bubalis) chromosome one (BBU1) 
Cytogenetic and genome research  2007;119(0):100-104.
The largest chromosome in the river buffalo karyotype, BBU1, is a submetacentric chromosome with reported homology between BBU1q and bovine chromosome 1 and between BBU1p and BTA27. We present the first radiation hybrid map of this chromosome containing 69 cattle derived markers including 48 coding genes, 17 microsatellites and four ESTs distributed in two linkage groups spanning a total length of 1330.1 cR5000. The RH map was constructed based on analysis of a recently developed river buffalo-hamster whole genome radiation hybrid (BBURH5000) panel. The retention frequency of individual markers across the panel ranged from 17.8% to 52.2%. With few exceptions, the order of markers within linkage groups is identical to the order established for corresponding cattle RH maps. The BBU1 map provides a starting point for comparison of gene order rearrangements between river buffalo chromosome 1 and its bovine homologs.
PMCID: PMC3780412  PMID: 18160788
5.  A 1.5 Megabase Resolution Radiation Hybrid Map of the Cat Genome and Comparative Analysis with the Canine and Human Genomes 
Genomics  2006;89(2):189-196.
We report the construction of a 1.5 Mb resolution radiation hybrid map of the domestic cat genome. This new map includes novel microsatellite loci and markers derived from the 2X genome sequence that target previous gaps in the feline-human comparative map. Ninety-six percent of the 1793 cat markers we mapped have identifiable orthologues in the canine and human genome sequences. The updated autosomal and X chromosome comparative maps identify 152 cat-human and 134 cat-dog homologous synteny blocks. Comparative analysis shows the marked change in chromosomal evolution in the canid lineage relative to the felid lineage since divergence from their carnivoran ancestor. The canid lineage has a thirty-fold difference in the number of interchromosomal rearrangements relative to felids, while the felid lineage has primarily undergone intrachromosomal rearrangements. We have also refined the pseudoautosomal region and boundary in the cat and show that it is markedly longer than those of human or mouse. This improved RH comparative map provides a useful tool to facilitate positional cloning studies in the feline model.
PMCID: PMC3760348  PMID: 16997530
domestic cat; radiation hybrid map; canine genome; genome evolution; synteny; chromosome rearrangement
6.  Comparative analysis of genome sequences of the Th2 cytokine region of rabbit (Oryctolagus cuniculus) with those of nine different species 
The regions encoding the coordinately regulated Th2 cytokines IL5, IL4 and IL13 are located on chromosomes 5 of man and 11 of mouse. They have been intensively studied because these interleukins have protective roles in helminth infections, but may lead to detrimental effects such as allergy, asthma, and fibrosis in lung and liver. We added to previous studies by comparing sequences of syntenic regions on chromosome 3 of the rabbit (Oryctolagus cuniculus) genome OryCun 2.0 assembly from a tuberculosis-susceptible strain, with the corresponding region of ENCODE ENm002 from a normal rabbit as well as with 9 other mammalian species. We searched for rabbit transcription factor binding sites in putative promoter and other non-coding regions of IL5, RAD50, IL13 and IL4. Although we identified several differences between the two donor rabbits in coding and non-coding regions of potential functional significance, confirmation awaits additional sequencing of other rabbits.
PMCID: PMC3519392  PMID: 23239928
7.  Domain enhanced lookup time accelerated BLAST 
Biology Direct  2012;7:12.
BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch.
We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI’s Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST.
DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the “Protein BLAST” link at
This article was reviewed by Arcady Mushegian, Nick V. Grishin, and Frank Eisenhaber.
PMCID: PMC3438057  PMID: 22510480
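The DELTA-BLAST abstract above reports retrieval quality as a ROC5000 score. As a minimal sketch, assuming the usual ROC_n definition (for each of the first n false positives in the ranked hit list, count the true positives ranked above it, then normalize by n times the total number of true positives), the score can be computed as follows. The function name, its arguments, and the padding rule for short lists are illustrative assumptions, not taken from the paper.

```python
def roc_n(ranked_labels, n, total_true):
    """Approximate ROC_n for a ranked hit list.

    ranked_labels: booleans in rank order (True = true positive).
    n: number of false positives to consider.
    total_true: total number of true positives that could be found.
    """
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumption: if the list ends before n false positives are seen, the
    # missing false positives rank below every retrieved true positive.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


if __name__ == "__main__":
    hits = [True, True, False, True, False]
    print(roc_n(hits, n=2, total_true=3))
```

A perfect ranking, with every true positive ahead of the first false positive, scores 1.0 under this definition.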
8.  A novel mitofusin 2 mutation causes canine fetal-onset neuroaxonal dystrophy 
Neurogenetics  2011;12(3):223-232.
We recently reported autosomal recessive fetal-onset neuroaxonal dystrophy (FNAD) in a large family of dogs that is not caused by mutation in the PLA2G6 locus (2010 J Comp Neurol 518:3771–3784). Here we report a genome-wide linkage analysis using 333 microsatellite markers to map canine FNAD to the telomeric end of chromosome 2. The interval of zero recombination was refined by single nucleotide polymorphism (SNP) haplotype analysis to ~ 200 kb, and the included genes were sequenced. We found a homozygous 3-nucleotide deletion in exon 13 of mitofusin 2 (MFN2), predicting loss of a glutamate residue at position 539 in the protein of affected dogs. RT-PCR demonstrated near normal expression of the mutant mRNA, but MFN2 expression was undetectable to very low on western blots of affected dog brainstem, cerebrum, kidney, and cultured fibroblasts and by immunohistochemistry on brainstem sections. MFN2 is a multifunctional, membrane-bound GTPase of mitochondria and endoplasmic reticulum most commonly associated with human Charcot-Marie-Tooth disease type 2A2. The canine disorder extends the range of MFN2-associated phenotypes and suggests MFN2 as a candidate gene for rare cases of human FNAD.
PMCID: PMC3165057  PMID: 21643798
Neuroaxonal dystrophy; mitofusin 2; linkage analysis; animal model; disease gene
11.  PedHunter 2.0 and its usage to characterize the founder structure of the Old Order Amish of Lancaster County 
BMC Medical Genetics  2010;11:68.
Because they are a closed founder population, the Old Order Amish (OOA) of Lancaster County have been the subject of many medical genetics studies. We constructed four versions of Anabaptist Genealogy Database (AGDB) using three sources of genealogies and multiple updates. In addition, we developed PedHunter, a suite of query software that can solve pedigree-related problems automatically and systematically.
We report on how we have used new features in PedHunter to quantify the number and expected genetic contribution of founders to the OOA. The queries and utility of PedHunter programs are illustrated by examples using AGDB in this paper. For example, we calculated the number of founders expected to be contributing genetic material to the present-day living OOA and estimated the mean relative founder representation for each founder. New features in PedHunter also include pedigree trimming and pedigree renumbering, which should prove useful for studying large pedigrees.
With PedHunter version 2.0 querying AGDB version 4.0, we identified 34,160 presumed living OOA individuals and connected them into a 14-generation pedigree descending from 554 founders (332 females and 222 males) after trimming. From the analysis of cumulative mean relative founder representation, 128 founders (78 females and 50 males) accounted for over 95% of the mean relative founder contribution among living OOA descendants.
The OOA are a closed founder population in which a modest number of founders account for the genetic variation present in the current OOA population. Improvements to the PedHunter software will be useful in future studies of both the OOA and other populations with large and computerized genealogies.
PMCID: PMC2880975  PMID: 20433770
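The PedHunter abstract above describes estimating each founder's expected genetic contribution and counting how many founders account for most of it. The sketch below is one plausible reading of that idea, not PedHunter's actual code or its exact definition of mean relative founder representation: it assumes each parent passes on half of its genome, so a founder's expected share in a descendant halves at each generation along every line of descent. The pedigree layout and all function names are hypothetical.

```python
def founder_shares(pedigree, individual):
    """Expected share of each founder in one individual.

    pedigree maps child -> (father, mother); anyone absent from the
    mapping is treated as a founder.
    """
    if individual not in pedigree:
        return {individual: 1.0}
    shares = {}
    for parent in pedigree[individual]:
        for founder, share in founder_shares(pedigree, parent).items():
            shares[founder] = shares.get(founder, 0.0) + 0.5 * share
    return shares


def mean_founder_shares(pedigree, living):
    """Average the per-individual founder shares over the living set."""
    totals = {}
    for person in living:
        for founder, share in founder_shares(pedigree, person).items():
            totals[founder] = totals.get(founder, 0.0) + share / len(living)
    return totals


def founders_needed(mean_shares, threshold):
    """Smallest number of top-ranked founders whose shares reach threshold."""
    running = 0.0
    count = 0
    for share in sorted(mean_shares.values(), reverse=True):
        running += share
        count += 1
        if running >= threshold:
            break
    return count


if __name__ == "__main__":
    # Toy pedigree: F1, F2 and F3 are founders.
    pedigree = {"C1": ("F1", "F2"), "G1": ("C1", "F3")}
    means = mean_founder_shares(pedigree, ["G1"])
    print(means, founders_needed(means, 0.75))
```

The recursion revisits shared ancestors, which is fine for a toy example; a 14-generation pedigree like the one in the abstract would need memoization.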
12.  A High-Resolution Cat Radiation Hybrid and Integrated FISH Mapping Resource for Phylogenomic Studies across Felidae 
Genomics  2008;93(4):299-304.
We describe the construction of a high-resolution radiation hybrid (RH) map of the domestic cat genome, which includes 2,662 markers, translating to an estimated average intermarker distance of 939 kilobases (Kb). Targeted marker selection utilized the recent feline 1.9x genome assembly, concentrating on regions of low marker density on feline autosomes and the X chromosome, in addition to regions flanking interspecies chromosomal breakpoints. Average gap (breakpoint) size between cat-human ordered conserved segments is less than 900 Kb. The map was used for a fine-scale comparison of conserved syntenic blocks with the human and canine genomes. Corroborative fluorescence in situ hybridization (FISH) data were generated using 129 domestic cat BAC-clones as probes, providing independent confirmation of the long-range correctness of the map. Cross-species hybridization of BAC probes on divergent felids from the genera Profelis (serval) and Panthera (snow leopard) provides further evidence for karyotypic conservation within felids, and demonstrates the utility of such probes for future studies of chromosome evolution within the cat family and in related carnivores. The integrated map constitutes a comprehensive framework for identifying genes controlling feline phenotypes of interest, and to aid in assembly of a higher coverage feline genome sequence.
PMCID: PMC2656592  PMID: 18951970
domestic cat; radiation hybrid map; comparative mapping; FISH mapping; synteny; chromosome rearrangement; Felidae
14.  Lineage-Specific Biology Revealed by a Finished Genome Assembly of the Mouse 
PLoS Biology  2009;7(5):e1000112.
A finished clone-based assembly of the mouse genome reveals extensive sequence duplication during recent evolution and rodent-specific expansion of certain gene families. Newly assembled duplications contain protein-coding genes that are mostly involved in reproductive function.
The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.
Author Summary
The availability of an accurate genome sequence provides the bedrock upon which modern biomedical research is based. Here we describe a high-quality assembly, Build 36, of the mouse genome. This assembly was put together by aligning overlapping individual clones representing parts of the genome, and it provides a more complete picture than previous assemblies, because it adds much rodent-specific sequence that was previously unavailable. The addition of these sequences provides insight into both the genomic architecture and the gene complement of the mouse. In particular, it highlights recent gene duplications and the expansion of certain gene families during rodent evolution. An improved understanding of the mouse genome and thus mouse biology will enhance the utility of the mouse as a model for human disease.
PMCID: PMC2680341  PMID: 19468303
15.  A 4103 marker integrated physical and comparative map of the horse genome 
Cytogenetic and genome research  2008;122(1):28-36.
A comprehensive second-generation whole genome radiation hybrid (RH II), cytogenetic and comparative map of the horse genome (2n=64) has been developed using the 5000rad horse × hamster radiation hybrid panel and fluorescence in situ hybridization (FISH). The map contains 4,103 markers (3,816 RH, 1,144 FISH) assigned to all 31 pairs of autosomes and the X chromosome. The RH maps of individual chromosomes are anchored and oriented using 857 cytogenetic markers. The overall resolution of the map is one marker per 775 kilobase-pairs (kb), which represents a more than five-fold improvement over the first-generation map. The RH II incorporates 920 markers shared jointly with the two recently reported meiotic maps. Consequently the two maps were aligned with the RH II maps of individual autosomes and the X chromosome. Additionally, a comparative map of the horse genome was generated by connecting 1,904 loci on the horse map with genome sequences available for eight diverse vertebrates to highlight regions of evolutionarily conserved syntenies, linkages and chromosomal breakpoints. The integrated map thus obtained presents the most comprehensive information on the physical and comparative organization of the equine genome and will assist future assemblies of whole genome BAC fingerprint maps and the genome sequence. It will also serve as a tool to identify genes governing health, disease and performance traits in horses and assist us in understanding the evolution of the equine genome in relation to other species.
PMCID: PMC2587302  PMID: 18931483
radiation hybrid map; horse; comparative; whole genome
16.  PSI-BLAST pseudocounts and the minimum description length principle 
Nucleic Acids Research  2008;37(3):815-824.
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
PMCID: PMC2647318  PMID: 19088134
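The pseudocount abstract above notes that using fewer pseudocounts in a conserved column sharpens the resulting scores. The sketch below illustrates that effect under a simplifying assumption: pseudocounts are spread in proportion to background frequencies, whereas PSI-BLAST itself derives data-dependent pseudocounts from substitution-matrix frequencies. The four-letter alphabet, the counts, and the function name are made up for illustration.

```python
import math


def column_scores(counts, background, pseudocounts):
    """Log-odds scores for one alignment column.

    counts: observed residue counts in the column.
    background: background frequency of each residue.
    pseudocounts: total pseudocount mass, distributed here in proportion
        to the background frequencies (a simplification).
    """
    total = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        q = (counts.get(residue, 0) + pseudocounts * p) / (total + pseudocounts)
        scores[residue] = math.log2(q / p)
    return scores


if __name__ == "__main__":
    background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # toy alphabet
    counts = {"A": 8, "C": 2}  # a fairly conserved column
    for m in (2, 10):
        s = column_scores(counts, background, m)
        print(m, {r: round(v, 2) for r, v in s.items()})
```

With the smaller pseudocount total, the conserved residue scores higher and the unobserved residues score lower, so the spread of scores within the column widens.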
17.  Application of dissociation curve analysis to radiation hybrid panel marker scoring: generation of a map of river buffalo (B. bubalis) chromosome 20 
BMC Genomics  2008;9:544.
Fluorescence of dyes bound to double-stranded PCR products has been utilized extensively in various real-time quantitative PCR applications, including post-amplification dissociation curve analysis, or differentiation of amplicon length or sequence composition. Despite the current era of whole-genome sequencing, mapping tools such as radiation hybrid DNA panels remain useful aids for sequence assembly, focused resequencing efforts, and for building physical maps of species that have not yet been sequenced. For placement of specific, individual genes or markers on a map, low-throughput methods remain commonplace. Typically, PCR amplification of DNA from each panel cell line is followed by gel electrophoresis and scoring of each clone for the presence or absence of PCR product. To improve sensitivity and efficiency of radiation hybrid panel analysis in comparison to gel-based methods, we adapted fluorescence-based real-time PCR and dissociation curve analysis for use as a novel scoring method.
As proof of principle for this dissociation curve method, we generated new maps of river buffalo (Bubalus bubalis) chromosome 20 by both dissociation curve analysis and conventional marker scoring. We also obtained sequence data to augment dissociation curve results. Few genes have been previously mapped to buffalo chromosome 20, and sequence detail is limited, so 65 markers were screened from the orthologous chromosome of domestic cattle. Thirty bovine markers (46%) were suitable as cross-species markers for dissociation curve analysis in the buffalo radiation hybrid panel under a standard protocol, compared to 25 markers suitable for conventional typing. Computational analysis placed 27 markers on a chromosome map generated by the new method, while the gel-based approach produced only 20 mapped markers. Among 19 markers common to both maps, the marker order on the map was maintained perfectly.
Dissociation curve analysis is reliable and efficient for radiation hybrid panel scoring, and is more sensitive and robust than conventional gel-based typing methods. Several markers could be scored only by the new method, and ambiguous scores were reduced. PCR-based dissociation curve analysis decreases both time and resources needed for construction of radiation hybrid panel marker maps and represents a significant improvement over gel-based methods in any species.
PMCID: PMC2621213  PMID: 19014630
18.  Investigations of the Y Chromosome, Male Founder Structure and YSTR Mutation Rates in the Old Order Amish 
Human Heredity  2007;65(2):91-104.
Using Y chromosome short tandem repeat (YSTR) genotypes, (1) evaluate the accuracy and completeness of the Lancaster County Old Order Amish (OOA) genealogical records and (2) estimate YSTR mutation rates.
Nine YSTR markers were genotyped in 739 Old Order Amish males who participated in several ongoing genetic studies of complex traits and could be connected into one of 28 all-male lineage pedigrees constructed using the Anabaptist Genealogy Database and the query software PedHunter. A putative founder YSTR haplotype was constructed for each pedigree, and observed and inferred father-son transmissions were used to estimate YSTR mutation rates.
We inferred 27 distinct founder Y chromosome haplotypes in the 28 male lineages, which encompassed 27 surnames accounting for 98% of Lancaster OOA households. Nearly all deviations from founder haplotypes were consistent with mutation events rather than errors. The estimated marker-specific mutation rates ranged from 0 to 1.09% (average 0.33% using up to 283 observed meioses only and 0.28% using up to 1,232 observed and inferred meioses combined).
These data confirm the accuracy and completeness of the male lineage portion of the Anabaptist Genealogy Database and contribute mutation rate estimates for several commonly used Y chromosome STR markers.
PMCID: PMC2857628  PMID: 17898540
Amish; STR mutation rates; Y chromosomes STRs; Genealogy; Founder population
19.  The National Center for Biotechnology Information's Protein Clusters Database 
Nucleic Acids Research  2008;37(Database issue):D216-D223.
Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at
PMCID: PMC2686591  PMID: 18940865
20.  Database indexing for production MegaBLAST searches 
Bioinformatics  2008;24(16):1757-1764.
Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new ‘indexed MegaBLAST’ is faster than the ‘non-indexed’ version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
Availability: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2696921  PMID: 18567917
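The indexed MegaBLAST abstract above contrasts preprocessing the query with preprocessing the database into an index for seed searching. A minimal sketch of the second idea follows, assuming a plain dictionary from every k-mer in the database to its positions. This is not the data structure built by makembindex; the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-mer in the database to its (sequence name, offset) hits."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Return (query offset, sequence name, database offset) for exact k-mer matches."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    database = {"s1": "ACGTACGT", "s2": "TTTTACGA"}
    index = build_index(database, k=4)  # built once, reused for every query
    print(find_seeds("TACG", index, k=4))
```

The index is built once and reused across queries, which is the trade-off the abstract describes: more preprocessing and storage in exchange for faster seed lookup per query.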
21.  rh_tsp_map 3.0: End-to-end radiation hybrid mapping with improved speed and quality control 
Bioinformatics (Oxford, England)  2007;23(9):1156-1158.
rh_tsp_map is a software package for computing radiation hybrid (RH) maps and for integrating physical and genetic maps. It solves the central mapping instances by reducing them to the traveling salesman problem (TSP) and using a modification of the CONCORDE package to solve the TSP instances. We present some of the features added between the initial rh_tsp_map version 1.0 and the current version 3.0, emphasizing the automation of many steps and addition of various checks designed to find problems with the input data. Iterations of improved input data followed by fast re-computation of the maps improve the quality of the final maps.
rh_tsp_map source code and documentation including a tutorial is available at CONCORDE modified for RH mapping is available in the directory The QSopt library needed for CONCORDE is available at
Contact, FAX: 301-480-2288 (Please send email concurrently with any fax.)
PMCID: PMC2266093  PMID: 17332018
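The rh_tsp_map abstract above describes reducing RH map construction to the traveling salesman problem. The toy sketch below shows the shape of that reduction under stated assumptions: each marker is a vector of presence/absence scores across hybrid cell lines, the distance between two markers is the number of cell lines in which their scores differ (an obligate-breaks style criterion, assumed here), and the best map is the marker order with the smallest total distance between neighbors. It uses brute force over permutations, which only works for a handful of markers, rather than CONCORDE. Marker names and vectors are invented.

```python
from itertools import permutations


def breaks(a, b):
    """Number of hybrid cell lines in which two markers' scores differ."""
    return sum(x != y for x, y in zip(a, b))


def path_cost(order, vectors):
    """Total distance between consecutive markers in a proposed order."""
    return sum(breaks(vectors[u], vectors[v]) for u, v in zip(order, order[1:]))


def best_order(vectors):
    """Exhaustively find a minimum-cost marker order (tiny inputs only)."""
    return min(permutations(vectors), key=lambda order: path_cost(order, vectors))


if __name__ == "__main__":
    # Hypothetical retention patterns: '1' = marker retained in that hybrid.
    vectors = {
        "M1": "110000",
        "M2": "111000",
        "M3": "011100",
        "M4": "001110",
    }
    order = best_order(vectors)
    print(order, path_cost(order, vectors))
```

An order and its reverse always have the same cost, so the orientation of the returned map carries no meaning without outside anchoring information.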
22.  Protein Database Searches Using Compositionally Adjusted Substitution Matrices 
The FEBS journal  2005;272(20):5101-5109.
Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions.
Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.
PMCID: PMC1343503  PMID: 16218944
substitution matrices; compositional adjustment; protein database searches; BLAST; BLOSUM
23.  Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST 
BMC Biology  2006;4:41.
TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.
We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.
TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
PMCID: PMC1779365  PMID: 17156431
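The six-frame translation that the TBLASTN abstract above refers to can be illustrated with a minimal sketch. This is not the BLAST implementation: the function names are hypothetical, the standard genetic code is assumed, and ambiguous or incomplete codons are simply rendered as "X".

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Hypothetical helper: translate both strands at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")["+1"]` gives `"MA*"`. A translated search would then align the protein query against each of the six translations; the scoring and the composition-based statistics described in the abstract are outside the scope of this sketch.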
24.  Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches 
Nucleic Acids Research  2006;34(20):5966-5973.
Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
PMCID: PMC1635310  PMID: 17068079
25.  Novel Gene Acquisition on Carnivore Y Chromosomes 
PLoS Genetics  2006;2(3):e43.
Despite its importance in harboring genes critical for spermatogenesis and male-specific functions, the Y chromosome has been largely excluded as a priority in recent mammalian genome sequencing projects. Only the human and chimpanzee Y chromosomes have been well characterized at the sequence level. This is primarily due to the presumed low overall gene content and highly repetitive nature of the Y chromosome and the ensuing difficulties using a shotgun sequence approach for assembly. Here we used direct cDNA selection to isolate and evaluate the extent of novel Y chromosome gene acquisition in the genome of the domestic cat, a species from a different mammalian superorder than human, chimpanzee, and mouse (currently being sequenced). We discovered four novel Y chromosome genes that do not have functional copies in the finished human male-specific region of the Y or on other mammalian Y chromosomes explored thus far. Two genes are derived from putative autosomal progenitors, and the other two have X chromosome homologs from different evolutionary strata. All four genes were shown to be multicopy and expressed predominantly or exclusively in testes, suggesting that their duplication and specialization for testis function were selected for because they enhance spermatogenesis. Two of these genes have testis-expressed, Y-borne copies in the dog genome as well. The absence of the four newly described genes on other characterized mammalian Y chromosomes demonstrates the gene novelty on this chromosome between mammalian orders, suggesting it harbors many lineage-specific genes that may go undetected by traditional comparative genomic approaches. Specific plans to identify the male-specific genes encoded in the Y chromosome of mammals should be a priority.
Y chromosomes are typically gene poor and enriched with repetitive elements, making them difficult to sequence by standard methods. Hence, the Y chromosome gene repertoire in mammalian species other than human has not been explored until very recently. Here the authors used a directed approach to isolate Y chromosome genes of the domestic cat, a species evolutionarily divergent from human and mouse. They found that the feline Y chromosome harbors its own unique set of genes that are expressed specifically in the testes, where they presumably play an important role in spermatogenesis. Paralleling the discoveries from the full human Y chromosome sequence, the feline Y chromosome has acquired and remodeled some genes from autosomes, while other genes share ancestry with the X chromosome. However, none of the four new genes are found on the Y chromosomes of human or mouse, although two are shared with the canine Y chromosome. This work highlights the Y chromosome as a source of potential gene novelty in different species and suggests that more directed efforts at characterizing this hitherto understudied chromosome will further enrich our understanding of the types of genes found there and the roles they may play in mammalian spermatogenesis.
PMCID: PMC1420679  PMID: 16596168
