The largest chromosome in the river buffalo karyotype, BBU1, is a submetacentric chromosome with reported homology between BBU1q and bovine chromosome 1 and between BBU1p and BTA27. We present the first radiation hybrid map of this chromosome containing 69 cattle-derived markers including 48 coding genes, 17 microsatellites and four ESTs distributed in two linkage groups spanning a total length of 1330.1 cR5000. The RH map was constructed based on analysis of a recently developed river buffalo-hamster whole genome radiation hybrid (BBURH5000) panel. The retention frequency of individual markers across the panel ranged from 17.8% to 52.2%. With few exceptions, the order of markers within linkage groups is identical to the order established for corresponding cattle RH maps. The BBU1 map provides a starting point for comparison of gene order rearrangements between river buffalo chromosome 1 and its bovine homologs.
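The retention frequency quoted above is, for each marker, the fraction of hybrid cell lines in the panel that retain it. A minimal sketch of that calculation follows; the score-vector encoding ('1' retained, '0' absent, '2' ambiguous), the choice to drop ambiguous scores from the denominator, and the marker names and vectors are all assumptions made for illustration, not data from the BBURH5000 panel.

```python
def retention_frequency(scores: str) -> float:
    """Fraction of unambiguously typed hybrids that retain the marker."""
    typed = [c for c in scores if c in "01"]  # ignore ambiguous '2' scores
    if not typed:
        raise ValueError("no unambiguous scores for this marker")
    return typed.count("1") / len(typed)


# Hypothetical score vectors, one character per hybrid cell line.
panel = {"MARKER_A": "1001020110", "MARKER_B": "0000100001"}
freqs = {name: retention_frequency(vec) for name, vec in panel.items()}

for name, freq in freqs.items():
    print(f"{name}: {freq:.1%}")
```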
We report the construction of a 1.5 Mb resolution radiation hybrid map of the domestic cat genome. This new map includes novel microsatellite loci and markers derived from the 2X genome sequence that target previous gaps in the feline-human comparative map. Ninety-six percent of the 1793 cat markers we mapped have identifiable orthologues in the canine and human genome sequences. The updated autosomal and X chromosome comparative maps identify 152 cat-human and 134 cat-dog homologous synteny blocks. Comparative analysis shows the marked change in chromosomal evolution in the canid lineage relative to the felid lineage since divergence from their carnivoran ancestor. The canid lineage has a thirty-fold difference in the number of interchromosomal rearrangements relative to felids, while the felid lineage has primarily undergone intrachromosomal rearrangements. We have also refined the pseudoautosomal region and boundary in the cat and show that it is markedly longer than those of human or mouse. This improved RH comparative map provides a useful tool to facilitate positional cloning studies in the feline model.
domestic cat; radiation hybrid map; canine genome; genome evolution; synteny; chromosome rearrangement
The regions encoding the coordinately regulated Th2 cytokines IL5, IL4 and IL13 are located on chromosomes 5 of man and 11 of mouse. They have been intensively studied because these interleukins have protective roles in helminth infections, but may lead to detrimental effects such as allergy, asthma, and fibrosis in lung and liver. We added to previous studies by comparing sequences of syntenic regions on chromosome 3 of the rabbit (Oryctolagus cuniculus) genome OryCun 2.0 assembly from a tuberculosis-susceptible strain, with the corresponding region of ENCODE ENm002 from a normal rabbit as well as with 9 other mammalian species. We searched for rabbit transcription factor binding sites in putative promoter and other non-coding regions of IL5, RAD50, IL13 and IL4. Although we identified several differences between the two donor rabbits in coding and non-coding regions of potential functional significance, confirmation awaits additional sequencing of other rabbits.
BLAST is a commonly used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch.
We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI’s Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST.
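For readers unfamiliar with the ROC scores quoted above, the sketch below computes a ROC_n value from a ranked hit list using the usual definition: for each of the first n false positives, count the true positives ranked above it, then normalize by n times the total number of true positives. The padding rule for lists with fewer than n false positives and the toy ranking are assumptions; the exact pooling and conventions used in the DELTA-BLAST evaluation are not specified here.

```python
def roc_n(ranked_labels, n, total_true):
    """ROC_n for a hit list ordered best score first.

    ranked_labels: iterable of booleans (True = true positive).
    total_true:    number of true positives that could have been found.
    """
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Assumed convention: if fewer than n false positives were reported,
    # credit the missing ones with every true positive found so far.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


# Toy ranking: T T F T F F, with four true positives in the database.
score = roc_n([True, True, False, True, False, False], n=2, total_true=4)
print(score)
```

A perfect ranking, with every true positive ahead of the first false positive, gives a score of 1.0 under this definition.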
DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the “Protein BLAST” link at http://blast.ncbi.nlm.nih.gov.
This article was reviewed by Arcady Mushegian, Nick V. Grishin, and Frank Eisenhaber.
We recently reported autosomal recessive fetal-onset neuroaxonal dystrophy (FNAD) in a large family of dogs that is not caused by mutation in the PLA2G6 locus (2010 J Comp Neurol 518:3771–3784). Here we report a genome-wide linkage analysis using 333 microsatellite markers to map canine FNAD to the telomeric end of chromosome 2. The interval of zero recombination was refined by single nucleotide polymorphism (SNP) haplotype analysis to ~200 kb, and the included genes were sequenced. We found a homozygous 3-nucleotide deletion in exon 13 of mitofusin 2 (MFN2), predicting loss of a glutamate residue at position 539 in the protein of affected dogs. RT-PCR demonstrated near normal expression of the mutant mRNA, but MFN2 expression was undetectable to very low on western blots of affected dog brainstem, cerebrum, kidney, and cultured fibroblasts and by immunohistochemistry on brainstem sections. MFN2 is a multifunctional, membrane-bound GTPase of mitochondria and endoplasmic reticulum most commonly associated with human Charcot-Marie-Tooth disease type 2A2. The canine disorder extends the range of MFN2-associated phenotypes and suggests MFN2 as a candidate gene for rare cases of human FNAD.
Neuroaxonal dystrophy; mitofusin 2; linkage analysis; animal model; disease gene
Because they are a closed founder population, the Old Order Amish (OOA) of Lancaster County have been the subject of many medical genetics studies. We constructed four versions of Anabaptist Genealogy Database (AGDB) using three sources of genealogies and multiple updates. In addition, we developed PedHunter, a suite of query software that can solve pedigree-related problems automatically and systematically.
We report on how we have used new features in PedHunter to quantify the number and expected genetic contribution of founders to the OOA. The queries and utility of PedHunter programs are illustrated by examples using AGDB in this paper. For example, we calculated the number of founders expected to be contributing genetic material to the present-day living OOA and estimated the mean relative founder representation for each founder. New features in PedHunter also include pedigree trimming and pedigree renumbering, which should prove useful for studying large pedigrees.
With PedHunter version 2.0 querying AGDB version 4.0, we identified 34,160 presumed living OOA individuals and connected them into a 14-generation pedigree descending from 554 founders (332 females and 222 males) after trimming. From the analysis of cumulative mean relative founder representation, 128 founders (78 females and 50 males) accounted for over 95% of the mean relative founder contribution among living OOA descendants.
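One common way to define a founder's relative representation in a descendant is the expected fraction of that descendant's genes inherited from the founder, which halves with each parent-child step and sums over all lines of descent. The sketch below averages that quantity over a set of living individuals. It is an illustration only: the toy pedigree and names are invented, and PedHunter's own definition and implementation may differ.

```python
from functools import lru_cache

# Hypothetical pedigree: child -> (father, mother). Founders have no entry.
PEDIGREE = {
    "C1": ("F1", "F2"),
    "C2": ("F1", "F2"),
    "C3": ("F3", "F4"),
    "G1": ("C1", "C3"),
    "G2": ("C2", "F5"),
}


@lru_cache(maxsize=None)
def contribution(founder, person):
    """Expected fraction of person's genes that descend from founder."""
    if person == founder:
        return 1.0
    if person not in PEDIGREE:  # a different founder
        return 0.0
    father, mother = PEDIGREE[person]
    return 0.5 * contribution(founder, father) + 0.5 * contribution(founder, mother)


def mean_founder_representation(living):
    """Average each founder's contribution over the living individuals."""
    people = set(PEDIGREE) | {p for parents in PEDIGREE.values() for p in parents}
    founders = sorted(people - set(PEDIGREE))
    return {f: sum(contribution(f, p) for p in living) / len(living) for f in founders}


mean_rep = mean_founder_representation(["G1", "G2"])
for founder, value in sorted(mean_rep.items(), key=lambda kv: -kv[1]):
    print(founder, value)
```

Sorting founders by this value and accumulating the totals gives a cumulative curve of the kind used above to find how many founders account for a chosen share of the contribution.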
The OOA are a closed founder population in which a modest number of founders account for the genetic variation present in the current OOA population. Improvements to the PedHunter software will be useful in future studies of both the OOA and other populations with large and computerized genealogies.
We describe the construction of a high-resolution radiation hybrid (RH) map of the domestic cat genome, which includes 2,662 markers, translating to an estimated average intermarker distance of 939 kilobases (Kb). Targeted marker selection utilized the recent feline 1.9x genome assembly, concentrating on regions of low marker density on feline autosomes and the X chromosome, in addition to regions flanking interspecies chromosomal breakpoints. Average gap (breakpoint) size between cat-human ordered conserved segments is less than 900 Kb. The map was used for a fine-scale comparison of conserved syntenic blocks with the human and canine genomes. Corroborative fluorescence in situ hybridization (FISH) data were generated using 129 domestic cat BAC-clones as probes, providing independent confirmation of the long-range correctness of the map. Cross-species hybridization of BAC probes on divergent felids from the genera Profelis (serval) and Panthera (snow leopard) provides further evidence for karyotypic conservation within felids, and demonstrates the utility of such probes for future studies of chromosome evolution within the cat family and in related carnivores. The integrated map constitutes a comprehensive framework for identifying genes controlling feline phenotypes of interest, and to aid in assembly of a higher coverage feline genome sequence.
domestic cat; radiation hybrid map; comparative mapping; FISH mapping; synteny; chromosome rearrangement; Felidae
A finished clone-based assembly of the mouse genome reveals extensive sequence duplication during recent evolution and rodent-specific expansion of certain gene families. Newly assembled duplications contain protein-coding genes that are mostly involved in reproductive function.
The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.
The availability of an accurate genome sequence provides the bedrock upon which modern biomedical research is based. Here we describe a high-quality assembly, Build 36, of the mouse genome. This assembly was put together by aligning overlapping individual clones representing parts of the genome, and it provides a more complete picture than previous assemblies, because it adds much rodent-specific sequence that was previously unavailable. The addition of these sequences provides insight into both the genomic architecture and the gene complement of the mouse. In particular, it highlights recent gene duplications and the expansion of certain gene families during rodent evolution. An improved understanding of the mouse genome and thus mouse biology will enhance the utility of the mouse as a model for human disease.
A comprehensive second-generation whole genome radiation hybrid (RH II), cytogenetic and comparative map of the horse genome (2n=64) has been developed using the 5000rad horse × hamster radiation hybrid panel and fluorescence in situ hybridization (FISH). The map contains 4,103 markers (3,816 RH, 1,144 FISH) assigned to all 31 pairs of autosomes and the X chromosome. The RH maps of individual chromosomes are anchored and oriented using 857 cytogenetic markers. The overall resolution of the map is one marker per 775 kilobase-pairs (kb), which represents a more than five-fold improvement over the first-generation map. The RH II incorporates 920 markers shared jointly with the two recently reported meiotic maps. Consequently the two maps were aligned with the RH II maps of individual autosomes and the X chromosome. Additionally, a comparative map of the horse genome was generated by connecting 1,904 loci on the horse map with genome sequences available for eight diverse vertebrates to highlight regions of evolutionarily conserved syntenies, linkages and chromosomal breakpoints. The integrated map thus obtained presents the most comprehensive information on the physical and comparative organization of the equine genome and will assist future assemblies of whole genome BAC fingerprint maps and the genome sequence. It will also serve as a tool to identify genes governing health, disease and performance traits in horses and assist us in understanding the evolution of the equine genome in relation to other species.
radiation hybrid map; horse; comparative; whole genome
Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
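To make the role of pseudocounts concrete, the sketch below scores one alignment column as a log-odds ratio after mixing observed counts with pseudocounts. It is a deliberate simplification: the pseudocounts here are distributed according to background frequencies, the background is taken as uniform, and scores are in bits, whereas PSI-BLAST derives data-dependent pseudocounts from a substitution matrix. The function name and parameter values are assumptions.

```python
import math
from collections import Counter


def column_scores(column, background, pseudocount):
    """Log-odds scores (bits) for one alignment column.

    Estimated frequency: q_a = (n_a + B * p_a) / (N + B), where n_a is the
    observed count, N the column depth, p_a the background frequency and
    B the total number of pseudocounts.
    """
    counts = Counter(column)
    depth = len(column)
    scores = {}
    for aa, p in background.items():
        q = (counts[aa] + pseudocount * p) / (depth + pseudocount)
        scores[aa] = math.log2(q / p)
    return scores


# Assumed uniform background over the 20 amino acids.
background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

# A perfectly conserved column scored with few versus many pseudocounts.
few = column_scores("LLLLLLLL", background, pseudocount=2.0)
many = column_scores("LLLLLLLL", background, pseudocount=20.0)
print(round(few["L"], 2), round(few["A"], 2))
print(round(many["L"], 2), round(many["A"], 2))
```

With fewer pseudocounts the conserved residue scores higher and the unobserved residues score lower, which is the increased contrast the abstract describes for highly conserved columns.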
Fluorescence of dyes bound to double-stranded PCR products has been utilized extensively in various real-time quantitative PCR applications, including post-amplification dissociation curve analysis, or differentiation of amplicon length or sequence composition. Despite the current era of whole-genome sequencing, mapping tools such as radiation hybrid DNA panels remain useful aids for sequence assembly, focused resequencing efforts, and for building physical maps of species that have not yet been sequenced. For placement of specific, individual genes or markers on a map, low-throughput methods remain commonplace. Typically, PCR amplification of DNA from each panel cell line is followed by gel electrophoresis and scoring of each clone for the presence or absence of PCR product. To improve sensitivity and efficiency of radiation hybrid panel analysis in comparison to gel-based methods, we adapted fluorescence-based real-time PCR and dissociation curve analysis for use as a novel scoring method.
As proof of principle for this dissociation curve method, we generated new maps of river buffalo (Bubalus bubalis) chromosome 20 by both dissociation curve analysis and conventional marker scoring. We also obtained sequence data to augment dissociation curve results. Few genes have been previously mapped to buffalo chromosome 20, and sequence detail is limited, so 65 markers were screened from the orthologous chromosome of domestic cattle. Thirty bovine markers (46%) were suitable as cross-species markers for dissociation curve analysis in the buffalo radiation hybrid panel under a standard protocol, compared to 25 markers suitable for conventional typing. Computational analysis placed 27 markers on a chromosome map generated by the new method, while the gel-based approach produced only 20 mapped markers. Among 19 markers common to both maps, the marker order on the map was maintained perfectly.
Dissociation curve analysis is reliable and efficient for radiation hybrid panel scoring, and is more sensitive and robust than conventional gel-based typing methods. Several markers could be scored only by the new method, and ambiguous scores were reduced. PCR-based dissociation curve analysis decreases both time and resources needed for construction of radiation hybrid panel marker maps and represents a significant improvement over gel-based methods in any species.
Using Y chromosome short tandem repeat (YSTR) genotypes, we aimed to (1) evaluate the accuracy and completeness of the Lancaster County Old Order Amish (OOA) genealogical records and (2) estimate YSTR mutation rates.
Nine YSTR markers were genotyped in 739 Old Order Amish males who participated in several ongoing genetic studies of complex traits and could be connected into one of 28 all-male lineage pedigrees constructed using the Anabaptist Genealogy Database and the query software PedHunter. A putative founder YSTR haplotype was constructed for each pedigree, and observed and inferred father-son transmissions were used to estimate YSTR mutation rates.
We inferred 27 distinct founder Y chromosome haplotypes in the 28 male lineages, which encompassed 27 surnames accounting for 98% of Lancaster OOA households. Nearly all deviations from founder haplotypes were consistent with mutation events rather than errors. The estimated marker-specific mutation rates ranged from 0 to 1.09% (average 0.33% using up to 283 observed meioses only and 0.28% using up to 1,232 observed and inferred meioses combined).
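The marker-specific rates above are ratios of observed mutation events to the number of father-son transmissions (meioses) examined, and the reported average is taken across markers. A minimal sketch of that arithmetic follows; the marker names and counts are invented for illustration and are not data from the study.

```python
# Hypothetical (mutations, meioses) counts per marker; not data from the study.
observed = {
    "MARKER_1": (0, 280),
    "MARKER_2": (3, 283),
    "MARKER_3": (1, 250),
}


def mutation_rates(counts):
    """Per-marker mutation rate: mutation events divided by meioses scored."""
    return {marker: mut / mei for marker, (mut, mei) in counts.items()}


rates = mutation_rates(observed)
average_rate = sum(rates.values()) / len(rates)  # unweighted mean across markers

for marker, rate in rates.items():
    print(f"{marker}: {rate:.2%}")
print(f"average: {average_rate:.2%}")
```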
These data confirm the accuracy and completeness of the male lineage portion of the Anabaptist Genealogy Database and contribute mutation rate estimates for several commonly used Y chromosome STR markers.
Amish; STR mutation rates; Y chromosome STRs; Genealogy; Founder population
Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.
Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
Results: We developed a new version of the MegaBLAST module of BLAST that does the initial phase of finding short seeds for matches by searching a database index. We also developed a program makembindex that preprocesses the database into a data structure for rapid seed searching. We show that the new ‘indexed MegaBLAST’ is faster than the ‘non-indexed’ version for most practical uses. We show that indexed MegaBLAST is faster than miBLAST, another implementation of BLAST nucleotide searching with a preprocessed database, for most of the 200 queries we tested. To deploy indexed MegaBLAST as part of NCBI's Web BLAST service, the storage of databases and the queueing mechanism were modified, so that some machines are now dedicated to serving queries for a specific database. The response time for such Web queries is now faster than it was when each computer handled queries for multiple databases.
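The idea of preprocessing the database rather than the query can be illustrated with a toy k-mer index: every length-k word in the database is recorded once with its positions, and each query word is then looked up directly to produce seeds. This is only a conceptual sketch; the function names, the dictionary-based index and the tiny sequences are assumptions and do not reflect the data structure built by makembindex.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-mer in the database to its (sequence name, position) hits."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Return exact k-mer matches as (query position, sequence name, db position)."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


# Hypothetical two-sequence database, indexed once and reusable for any query.
database = {"chrA": "ACGTACGTTAGC", "chrB": "TTAGCCGA"}
index = build_index(database, k=5)
seeds = find_seeds("GTTAGC", index, k=5)
print(seeds)
```

The indexing cost is paid once per database, so each query only pays for lookups, which is where the speed advantage for highly similar matches comes from.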
Availability: The code for indexed MegaBLAST is part of the blastn program in the NCBI C++ toolkit. The preprocessor program makembindex is also in the toolkit. Indexed MegaBLAST has been used in production on NCBI's Web BLAST service to search one version of the human and mouse genomes since October 2007. The Linux command-line executables for blastn and makembindex, documentation, and some query sets used to carry out the tests described below are available in the directory: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast
Supplementary information: Supplementary data are available at Bioinformatics online.
rh_tsp_map is a software package for computing radiation hybrid (RH) maps and for integrating physical and genetic maps. It solves the central mapping instances by reducing them to the traveling salesman problem (TSP) and using a modification of the CONCORDE package to solve the TSP instances. We present some of the features added between the initial rh_tsp_map version 1.0 and the current version 3.0, emphasizing the automation of many steps and addition of various checks designed to find problems with the input data. Iterations of improved input data followed by fast re-computation of the maps improve the quality of the final maps.
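The reduction to the TSP can be pictured as follows: treat each marker as a city, define the distance between two markers from their retention patterns, and look for the marker order with the smallest total distance. The sketch below uses the number of hybrids in which two markers differ (an obligate-breaks style criterion) and brute-force enumeration in place of CONCORDE, so it only works for a handful of markers. The marker names, vectors and function names are hypothetical, and the criteria actually used by rh_tsp_map may differ.

```python
from itertools import permutations


def breaks(a, b):
    """Number of hybrids in which two markers have different retention scores."""
    return sum(x != y for x, y in zip(a, b))


def best_order(vectors):
    """Exhaustively find the marker order minimizing total adjacent breaks."""
    best = None
    for perm in permutations(sorted(vectors)):
        if perm[0] > perm[-1]:
            continue  # an order and its reverse describe the same map
        cost = sum(breaks(vectors[p], vectors[q]) for p, q in zip(perm, perm[1:]))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best


# Hypothetical retention vectors, one character per hybrid cell line.
vectors = {
    "A": "1110000",
    "B": "1000000",
    "C": "1111000",
    "D": "1100000",
}
cost, order = best_order(vectors)
print(cost, order)
```

A dedicated TSP solver replaces the enumeration step for realistic marker counts, since the number of possible orders grows factorially.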
rh_tsp_map source code and documentation, including a tutorial, are available at ftp://ftp.ncbi.nih.gov/pub/agarwala/rhmapping/rh_tsp_map.tar.gz. CONCORDE modified for RH mapping is available in the directory http://www.isye.gatech.edu/~wcook/rh/. The QSopt library needed for CONCORDE is available at http://www2.isye.gatech.edu/~wcook/qsopt/downloads/downloads.htm.
Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions.
Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.
substitution matrices; compositional adjustment; protein database searches; BLAST; BLOSUM
TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.
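Translation in all six frames means reading the nucleotide sequence in three offsets on the forward strand and three on the reverse complement. The sketch below shows this with the standard genetic code; the function names and frame labels are assumptions, ambiguous or incomplete codons are rendered as 'X', and none of this reflects how TBLASTN itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse-complement an uppercase A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations keyed by strand and frame, e.g. '+1' or '-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

A protein query can then be aligned against each of the six translations, with stop codons ('*') marking where a reading frame ends.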
We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.
TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.
Despite its importance in harboring genes critical for spermatogenesis and male-specific functions, the Y chromosome has been largely excluded as a priority in recent mammalian genome sequencing projects. Only the human and chimpanzee Y chromosomes have been well characterized at the sequence level. This is primarily due to the presumed low overall gene content and highly repetitive nature of the Y chromosome and the ensuing difficulties using a shotgun sequence approach for assembly. Here we used direct cDNA selection to isolate and evaluate the extent of novel Y chromosome gene acquisition in the genome of the domestic cat, a species from a different mammalian superorder than human, chimpanzee, and mouse (currently being sequenced). We discovered four novel Y chromosome genes that do not have functional copies in the finished human male-specific region of the Y or on other mammalian Y chromosomes explored thus far. Two genes are derived from putative autosomal progenitors, and the other two have X chromosome homologs from different evolutionary strata. All four genes were shown to be multicopy and expressed predominantly or exclusively in testes, suggesting that their duplication and specialization for testis function were selected for because they enhance spermatogenesis. Two of these genes have testis-expressed, Y-borne copies in the dog genome as well. The absence of the four newly described genes on other characterized mammalian Y chromosomes demonstrates the gene novelty on this chromosome between mammalian orders, suggesting it harbors many lineage-specific genes that may go undetected by traditional comparative genomic approaches. Specific plans to identify the male-specific genes encoded in the Y chromosome of mammals should be a priority.
Y chromosomes are typically gene poor and enriched with repetitive elements, making them difficult to sequence by standard methods. Hence, the Y chromosome gene repertoire in mammalian species other than human has not been explored until very recently. Here the authors used a directed approach to isolate Y chromosome genes of the domestic cat, an evolutionarily divergent species from human and mouse. They found that the feline Y chromosome harbors its own unique set of genes that are expressed specifically in the testes, presumably where they play an important role in spermatogenesis. Paralleling the discoveries seen from the full human Y chromosome sequence, the feline Y chromosome has acquired and remodeled some genes from autosomes, while other genes have a shared ancestry with the X chromosome. However, none of the four new genes are found on the Y chromosomes of human or mouse, although two are shared with the canine Y chromosome. This work highlights the Y chromosome as a source of potential gene novelty in different species and suggests that more directed efforts at characterizing this hitherto understudied chromosome will further enrich our understanding of the types of genes found there and the roles they may play in mammalian spermatogenesis.