1.  Genetic history of an archaic hominin group from Denisova Cave in Siberia 
Nature  2010;468(7327):1053-1060.
Using DNA extracted from a finger bone found in Denisova Cave in southern Siberia, we have sequenced the genome of an archaic hominin to about 1.9-fold coverage. This individual is from a group that shares a common origin with Neanderthals. This population was not involved in the putative gene flow from Neanderthals into Eurasians; however, the data suggest that it contributed 4–6% of its genetic material to the genomes of present-day Melanesians. We designate this hominin population ‘Denisovans’ and suggest that it may have been widespread in Asia during the Late Pleistocene epoch. A tooth found in Denisova Cave carries a mitochondrial genome highly similar to that of the finger bone. This tooth shares no derived morphological features with Neanderthals or modern humans, further indicating that Denisovans have an evolutionary history distinct from Neanderthals and modern humans.
PMCID: PMC4306417  PMID: 21179161
2.  Whole genome sequencing of Turkish genomes reveals functional private alleles and impact of genetic interactions with Europe, Asia and Africa 
BMC Genomics  2014;15(1):963.
Turkey is a crossroads of major population movements throughout history and has been a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers. However, to date, there have been no studies to assess the genetic variation at the whole genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32×–48×).
We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contribution from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish populations and European populations. Furthermore, we have analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found an allele frequency of 31.25% for the H1/H2 inversion polymorphism, higher than the allele frequency of about 25% seen in European populations.
This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in analyzing natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-963) contains supplementary material, which is available to authorized users.
PMCID: PMC4236450  PMID: 25376095
3.  Annotated features of domestic cat – Felis catus genome 
GigaScience  2014;3:13.
Domestic cats enjoy an extensive veterinary medical surveillance which has described nearly 250 genetic diseases analogous to human disorders. Feline infectious agents offer powerful natural models of deadly human diseases, which include feline immunodeficiency virus, feline sarcoma virus and feline leukemia virus. A rich veterinary literature of feline disease pathogenesis and the demonstration of a highly conserved ancestral mammal genome organization make the cat genome annotation a highly informative resource that facilitates multifaceted research endeavors.
Here we report a preliminary annotation of the whole genome sequence of Cinnamon, a domestic cat living in Columbia (MO, USA), bisulfite sequencing of Boris, a male cat from St. Petersburg (Russia), and light 30× sequencing of Sylvester, a European wildcat progenitor of cat domestication. The annotation includes 21,865 protein-coding genes identified by a comparative approach, 217 loci of endogenous retrovirus-like elements, repetitive elements which comprise about 55.7% of the whole genome, 99,494 new SNVs, 8,355 new indels, 743,326 evolutionary constrained elements, and 3,182 microRNA homologues. The methylation sites study shows that 10.5% of cat genome cytosines are methylated. An assisted assembly of a European wildcat, Felis silvestris silvestris, was performed; variants between F. silvestris and F. catus genomes were derived and compared to F. catus.
The presented genome annotation extends beyond earlier ones by closing gaps of sequence that were unavoidable with previous low-coverage shotgun genome sequencing. The assembly and its annotation offer an important resource for connecting the rich veterinary and natural history of cats to genome discovery.
PMCID: PMC4138527  PMID: 25143822
Felis catus; Domestic cat; Felis silvestris silvestris; European wildcat; Genome sequence; Annotation; Assembly
4.  Genome structural variation discovery and genotyping 
Nature reviews. Genetics  2011;12(5):363-376.
Comparisons of human genomes show that more base pairs are altered as a result of structural variation — including copy number variation — than as a result of point mutations. Here we review advances and challenges in the discovery and genotyping of structural variation. The recent application of massively parallel sequencing methods has complemented microarray-based methods and has led to an exponential increase in the discovery of smaller structural-variation events. Some global discovery biases remain, but the integration of experimental and computational approaches is proving fruitful for accurate characterization of the copy, content and structure of variable regions. We argue that the long-term goal should be routine, cost-effective and high quality de novo assembly of human genomes to comprehensively assess all classes of structural variation.
PMCID: PMC4108431  PMID: 21358748
5.  mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications 
Nucleic Acids Research  2014;42(Web Server issue):W494-W500.
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure and are not simple post-processing steps, and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST.
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at
PMCID: PMC4086126  PMID: 24810850
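The SNP-aware, multi-mapping behaviour described in the abstract above can be illustrated with a minimal sketch. This is a brute-force scan written only for illustration, not the mrsFAST-Ultra index or its actual interface; the function names, the `snp_positions` set, and the toy reference are all hypothetical.

```python
def snp_aware_mismatches(read, ref, ref_start, snp_positions):
    """Count mismatches between read and ref[ref_start:], ignoring known SNP sites."""
    window = ref[ref_start:ref_start + len(read)]
    if len(window) < len(read):
        raise ValueError("read runs past the end of the reference")
    return sum(
        1
        for offset, (r, g) in enumerate(zip(read, window))
        # A mismatch is discounted when it falls on a known SNP coordinate.
        if r != g and (ref_start + offset) not in snp_positions
    )


def map_read(read, ref, snp_positions, max_mismatches):
    """Return every start position where the read fits within the error threshold."""
    hits = []
    for start in range(len(ref) - len(read) + 1):
        if snp_aware_mismatches(read, ref, start, snp_positions) <= max_mismatches:
            hits.append(start)
    return hits


ref = "ACGTACGTTT"   # toy reference (hypothetical)
read = "ACGA"        # differs from the reference at its last base

print(map_read(read, ref, set(), 0))   # no SNP knowledge, exact matching
print(map_read(read, ref, {3}, 0))     # position 3 is treated as a known SNP
print(map_read(read, ref, set(), 1))   # one mismatch allowed: multi-mapping
```

With position 3 marked as a known SNP, the read maps at position 0 even under exact matching, which is the sense in which discounting SNP mismatches can raise the number of mappable reads.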
6.  Great ape genetic diversity and population history 
Prado-Martinez, Javier | Sudmant, Peter H. | Kidd, Jeffrey M. | Li, Heng | Kelley, Joanna L. | Lorente-Galdos, Belen | Veeramah, Krishna R. | Woerner, August E. | O’Connor, Timothy D. | Santpere, Gabriel | Cagan, Alexander | Theunert, Christoph | Casals, Ferran | Laayouni, Hafid | Munch, Kasper | Hobolth, Asger | Halager, Anders E. | Malig, Maika | Hernandez-Rodriguez, Jessica | Hernando-Herraez, Irene | Prüfer, Kay | Pybus, Marc | Johnstone, Laurel | Lachmann, Michael | Alkan, Can | Twigg, Dorina | Petit, Natalia | Baker, Carl | Hormozdiari, Fereydoun | Fernandez-Callejo, Marcos | Dabad, Marc | Wilson, Michael L. | Stevison, Laurie | Camprubí, Cristina | Carvalho, Tiago | Ruiz-Herrera, Aurora | Vives, Laura | Mele, Marta | Abello, Teresa | Kondova, Ivanela | Bontrop, Ronald E. | Pusey, Anne | Lankester, Felix | Kiyang, John A. | Bergl, Richard A. | Lonsdorf, Elizabeth | Myers, Simon | Ventura, Mario | Gagneux, Pascal | Comas, David | Siegismund, Hans | Blanc, Julie | Agueda-Calpena, Lidia | Gut, Marta | Fulton, Lucinda | Tishkoff, Sarah A. | Mullikin, James C. | Wilson, Richard K. | Gut, Ivo G. | Gonder, Mary Katherine | Ryder, Oliver A. | Hahn, Beatrice H. | Navarro, Arcadi | Akey, Joshua M. | Bertranpetit, Jaume | Reich, David | Mailund, Thomas | Schierup, Mikkel H. | Hvilsom, Christina | Andrés, Aida M. | Wall, Jeffrey D. | Bustamante, Carlos D. | Hammer, Michael F. | Eichler, Evan E. | Marques-Bonet, Tomas
Nature  2013;499(7459):10.1038/nature12228.
PMCID: PMC3822165  PMID: 23823723
7.  Genome Sequencing Highlights the Dynamic Early History of Dogs 
PLoS Genetics  2014;10(1):e1004016.
To identify genetic changes underlying dog domestication and reconstruct their early evolutionary history, we generated high-quality genome sequences from three gray wolves, one from each of the three putative centers of dog domestication, two basal dog lineages (Basenji and Dingo) and a golden jackal as an outgroup. Analysis of these sequences supports a demographic model in which dogs and wolves diverged through a dynamic process involving population bottlenecks in both lineages and post-divergence gene flow. In dogs, the domestication bottleneck involved at least a 16-fold reduction in population size, a much more severe bottleneck than estimated previously. A sharp bottleneck in wolves occurred soon after their divergence from dogs, implying that the pool of diversity from which dogs arose was substantially larger than represented by modern wolf populations. We narrow the plausible range for the date of initial dog domestication to an interval spanning 11–16 thousand years ago, predating the rise of agriculture. In light of this finding, we expand upon previous work regarding the increase in copy number of the amylase gene (AMY2B) in dogs, which is believed to have aided digestion of starch in agricultural refuse. We find standing variation for amylase copy number variation in wolves and little or no copy number increase in the Dingo and Husky lineages. In conjunction with the estimated timing of dog origins, these results provide additional support to archaeological finds, suggesting the earliest dogs arose alongside hunter-gatherers rather than agriculturists. Regarding the geographic origin of dogs, we find that, surprisingly, none of the extant wolf lineages from putative domestication centers is more closely related to dogs, and, instead, the sampled wolves form a sister monophyletic clade.
This result, in combination with dog-wolf admixture during the process of domestication, suggests that a re-evaluation of past hypotheses regarding dog origins is necessary.
Author Summary
The process of dog domestication is still poorly understood, largely because no studies thus far have leveraged deeply sequenced whole genomes from wolves and dogs to simultaneously evaluate support for the proposed source regions: East Asia, the Middle East, and Europe. To investigate dog origins, we sequence three wolf genomes from the putative centers of origin, two basal dog breeds (Basenji and Dingo), and a golden jackal as an outgroup. We find that none of the wolf lineages from the hypothesized domestication centers is supported as the source lineage for dogs, and that dogs and wolves diverged 11,000–16,000 years ago in a process involving extensive admixture and that was followed by a bottleneck in wolves. In addition, we investigate the amylase (AMY2B) gene family expansion in dogs, which has recently been suggested as being critical to domestication in response to increased dietary starch. We find standing variation in AMY2B copy number in wolves and show that some breeds, such as Dingo and Husky, lack the AMY2B expansion. This suggests that, at the beginning of the domestication process, dogs may have been characterized by a more carnivorous diet than their modern day counterparts, a diet held in common with early hunter-gatherers.
PMCID: PMC3894170  PMID: 24453982
8.  Complete Khoisan and Bantu genomes from southern Africa 
Nature  2010;463(7283):943-947.
The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial [1] and small sets of nuclear markers [2] have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans [1,3]. However, until now, fully sequenced human genomes have been limited to recently diverged populations [4–8]. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
PMCID: PMC3890430  PMID: 20164927
9.  SCALCE: boosting sequence compression algorithms using locally consistent encoding 
Bioinformatics  2012;28(23):3051-3057.
Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, using the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in the compression rate and a 1.26-fold improvement in running time.
Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when gzip option is selected, and the pigz binary is available. It is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3509486  PMID: 23047557
10.  The genome sequencing of an albino Western lowland gorilla reveals inbreeding in the wild 
BMC Genomics  2013;14:363.
The only known albino gorilla, named Snowflake, was a male wild born individual from Equatorial Guinea who lived at the Barcelona Zoo for almost 40 years. He was diagnosed with non-syndromic oculocutaneous albinism, i.e. white hair, light eyes, pink skin, photophobia and reduced visual acuity. Despite previous efforts to explain the genetic cause, this is still unknown. Here, we study the genetic cause of his albinism and making use of whole genome sequencing data we find a higher inbreeding coefficient compared to other gorillas.
We successfully identified the causal genetic variant for Snowflake’s albinism, a non-synonymous single nucleotide variant located in a transmembrane region of SLC45A2. This transporter is known to be involved in oculocutaneous albinism type 4 (OCA4) in humans. We provide experimental evidence that shows that this amino acid replacement alters the membrane spanning capability of this transmembrane region. Finally, we provide a comprehensive study of genome-wide patterns of autozygosity revealing that Snowflake’s parents were related, this being the first report of inbreeding in a wild born Western lowland gorilla.
In this study we demonstrate how the use of whole genome sequencing can be extended to link genotype and phenotype in non-model organisms and it can be a powerful tool in conservation genetics (e.g., inbreeding and genetic diversity) with the expected decrease in sequencing cost.
PMCID: PMC3673836  PMID: 23721540
Gorilla; Albinism; Inbreeding; Genome; Conservation
Science (New York, N.Y.)  2012;338(6104):222-226.
We present a DNA library preparation method that has allowed us to reconstruct a high coverage (30X) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of “missing evolution” in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.
PMCID: PMC3617501  PMID: 22936568
12.  Accelerating read mapping with FastHASH 
BMC Genomics  2013;14(Suppl 1):S13.
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS.
We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection.
We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.
PMCID: PMC3549798  PMID: 23369189
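The abstract above names Adjacency Filtering without describing it. One plausible reading, sketched below purely as an assumption and not taken from the FastHASH code, is that a candidate location found through one k-mer of the read is kept only if the read's following k-mers also occur at the adjacent reference positions, so that obviously wrong locations are discarded before any expensive alignment. All names and the toy reference are hypothetical.

```python
from collections import defaultdict


def build_index(ref, k):
    """Hash table from each k-mer to the set of reference positions where it starts."""
    index = defaultdict(set)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].add(i)
    return index


def adjacency_filter(read, index, k, max_missing=0):
    """Keep candidate starts whose neighbouring k-mers are also found in place.

    The read is cut into non-overlapping k-mers (assumes len(read) >= k).
    A start position proposed by the first k-mer survives only if at most
    `max_missing` of the remaining k-mers are absent from the expected
    adjacent reference positions.
    """
    seeds = [read[j:j + k] for j in range(0, len(read) - k + 1, k)]
    candidates = []
    for start in sorted(index.get(seeds[0], ())):
        missing = sum(
            1
            for n, seed in enumerate(seeds[1:], start=1)
            if start + n * k not in index.get(seed, ())
        )
        if missing <= max_missing:
            candidates.append(start)
    return candidates


ref = "AAACCCGGGTTTAAACCCTTT"   # toy reference
index = build_index(ref, 3)
print(adjacency_filter("AAACCCGGG", index, 3))                  # strict filter
print(adjacency_filter("AAACCCGGG", index, 3, max_missing=1))   # tolerate one absent k-mer
```

Only the surviving candidates would then be passed to the full seed-and-extend verification step; the tolerance parameter stands in for the edit-distance cutoff mentioned in the abstract.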
13.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;28(1):150.
PMCID: PMC3276229
14.  The bonobo genome compared with the chimpanzee and human genomes 
Nature  2012;486(7404):527-531.
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours [1–4], and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
PMCID: PMC3498939  PMID: 22722832
15.  Identification and validation of a novel mature microRNA encoded by the Merkel cell polyomavirus in human Merkel cell carcinomas 
Merkel cell polyomavirus (MCPyV) is present in approximately 80% of human Merkel cell carcinomas (MCCs). A previous in silico prediction suggested MCPyV encodes a microRNA (miRNA) that may regulate cellular and viral genes.
To determine the presence and prevalence of a putative MCPyV-encoded miRNA in human MCC tumors.
Study Design
Over 30 million small RNAs from 7 cryopreserved MCC tumors and 1 perilesional sample were sequenced. 45 additional MCC tumors were examined for expression of an MCPyV-encoded mature miRNA by reverse transcription real-time PCR.
An MCPyV-encoded mature miRNA, “MCV-miR-M1-5p”, was detected by direct sequencing in 2 of 3 MCPyV-positive MCC tumors. Although a precursor miRNA, MCV-miR-M1, had been predicted in silico and studied in vitro by Seo et al., no MCPyV-encoded miRNAs have been directly detected in human tissues. Importantly, the mature sequence of MCV-miR-M1 found in vivo was identical in all 79 reads obtained but differed from the in silico predicted mature miRNA by a 2-nucleotide shift, resulting in a distinct seed region and a different set of predicted target genes. This mature miRNA was detected by real-time PCR in 50% of MCPyV-positive MCCs (n=38) and in 0% of MCPyV-negative MCCs (n=13).
MCV-miR-M1-5p is expressed at low levels in 50% of MCPyV-positive MCCs. This virus-encoded miRNA is predicted to target genes that may play a role in promoting immune evasion and regulating viral DNA replication.
PMCID: PMC3196837  PMID: 21907614
MCV-miR-M1; Merkel cell polyomavirus; Merkel cell carcinoma; microRNA
16.  A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD 
Renton, Alan E. | Majounie, Elisa | Waite, Adrian | Simón-Sánchez, Javier | Rollinson, Sara | Gibbs, J. Raphael | Schymick, Jennifer C. | Laaksovirta, Hannu | van Swieten, John C. | Myllykangas, Liisa | Kalimo, Hannu | Paetau, Anders | Abramzon, Yevgeniya | Remes, Anne M. | Kaganovich, Alice | Scholz, Sonja W. | Duckworth, Jamie | Ding, Jinhui | Harmer, Daniel W. | Hernandez, Dena G. | Johnson, Janel O. | Mok, Kin | Ryten, Mina | Trabzuni, Danyah | Guerreiro, Rita J. | Orrell, Richard W. | Neal, James | Murray, Alex | Pearson, Justin | Jansen, Iris E. | Sondervan, David | Seelaar, Harro | Blake, Derek | Young, Kate | Halliwell, Nicola | Callister, Janis | Toulson, Greg | Richardson, Anna | Gerhard, Alex | Snowden, Julie | Mann, David | Neary, David | Nalls, Michael A. | Peuralinna, Terhi | Jansson, Lilja | Isoviita, Veli-Matti | Kaivorinne, Anna-Lotta | Hölttä-Vuori, Maarit | Ikonen, Elina | Sulkava, Raimo | Benatar, Michael | Wuu, Joanne | Chiò, Adriano | Restagno, Gabriella | Borghero, Giuseppe | Sabatelli, Mario | Heckerman, David | Rogaeva, Ekaterina | Zinman, Lorne | Rothstein, Jeffrey | Sendtner, Michael | Drepper, Carsten | Eichler, Evan E. | Alkan, Can | Abdullaev, Zied | Pack, Svetlana D. | Dutra, Amalia | Pak, Evgenia | Hardy, John | Singleton, Andrew | Williams, Nigel M. | Heutink, Peter | Pickering-Brown, Stuart | Morris, Huw R. | Tienari, Pentti J. | Traynor, Bryan J.
Neuron  2011;72(2):257-268.
The chromosome 9p21 amyotrophic lateral sclerosis-frontotemporal dementia (ALS-FTD) locus contains one of the last major unidentified autosomal dominant genes underlying these common neurodegenerative diseases. We have previously shown that a founder haplotype, covering the MOBKL2b, IFNK and C9ORF72 genes, is present in the majority of cases linked to this region. Here we show that there is a large hexanucleotide (GGGGCC) repeat expansion in the first intron of C9ORF72 on the affected haplotype. This repeat expansion segregates perfectly with disease in the Finnish population, underlying 46.0% of familial ALS and 21.1% of sporadic ALS in that population. Taken together with the D90A SOD1 mutation, 87% of familial ALS in Finland is now explained by a simple monogenic cause. The repeat expansion is also present in one third of familial ALS cases of outbred European descent making it the most common genetic cause of these fatal neurodegenerative diseases identified to date.
PMCID: PMC3200438  PMID: 21944779
17.  Insights into hominid evolution from the gorilla genome sequence 
Nature  2012;483(7388):169-175.
Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
PMCID: PMC3303130  PMID: 22398555
18.  Detection of structural variants and indels within exome data 
Nature methods  2011;9(2):176-178.
We report an algorithm to detect structural variation and indels from 1 base pair to 1 megabase pair within exome sequence datasets. Splitread uses one-end anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with good specificity and high sensitivity. The algorithm discovers indels, structural variants, de novo events and copy-number polymorphic processed pseudogenes missed by other methods.
PMCID: PMC3269549  PMID: 22179552
19.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;27(14):1915-1921.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at
PMCID: PMC3129524  PMID: 21586516
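For readers unfamiliar with the di-base 'color-space' encoding mentioned in the abstract above, the sketch below shows the standard two-base scheme, in which each color is determined by a pair of adjacent bases. With bases coded A=0, C=1, G=2, T=3, the color equals the XOR of the two codes. The function names are hypothetical and none of this is drFAST code.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(seq):
    """Translate a base sequence into colors, one color per adjacent base pair."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the color string."""
    bases = [first_base]
    for color in colors:
        # Each base is recovered from the previous base and the current color.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)                        # [3, 1, 0, 3]
print(decode_colors("A", colors))    # ATGGC

# A single wrong color changes every base decoded after it:
print(decode_colors("A", [3, 0, 0, 3]))   # ATTTA
```

Because each base is decoded from the one before it, one miscalled color corrupts the rest of the decoded read, whereas a true single-base variant changes two adjacent colors. This is one reason color-space reads are usually mapped by mappers designed for that encoding rather than translated to bases first.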
20.  Mapping copy number variation by population scale genome sequencing 
Nature  2011;470(7332):59-65.
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
PMCID: PMC3077050  PMID: 21293372
21.  Comparative and demographic analysis of orangutan genomes 
Locke, Devin P. | Hillier, LaDeana W. | Warren, Wesley C. | Worley, Kim C. | Nazareth, Lynne V. | Muzny, Donna M. | Yang, Shiaw-Pyng | Wang, Zhengyuan | Chinwalla, Asif T. | Minx, Pat | Mitreva, Makedonka | Cook, Lisa | Delehaunty, Kim D. | Fronick, Catrina | Schmidt, Heather | Fulton, Lucinda A. | Fulton, Robert S. | Nelson, Joanne O. | Magrini, Vincent | Pohl, Craig | Graves, Tina A. | Markovic, Chris | Cree, Andy | Dinh, Huyen H. | Hume, Jennifer | Kovar, Christie L. | Fowler, Gerald R. | Lunter, Gerton | Meader, Stephen | Heger, Andreas | Ponting, Chris P. | Marques-Bonet, Tomas | Alkan, Can | Chen, Lin | Cheng, Ze | Kidd, Jeffrey M. | Eichler, Evan E. | White, Simon | Searle, Stephen | Vilella, Albert J. | Chen, Yuan | Flicek, Paul | Ma, Jian | Raney, Brian | Suh, Bernard | Burhans, Richard | Herrero, Javier | Haussler, David | Faria, Rui | Fernando, Olga | Darré, Fleur | Farré, Domènec | Gazave, Elodie | Oliva, Meritxell | Navarro, Arcadi | Roberto, Roberta | Capozzi, Oronzo | Archidiacono, Nicoletta | Valle, Giuliano Della | Purgato, Stefania | Rocchi, Mariano | Konkel, Miriam K. | Walker, Jerilyn A. | Ullmer, Brygg | Batzer, Mark A. | Smit, Arian F. A. | Hubley, Robert | Casola, Claudio | Schrider, Daniel R. | Hahn, Matthew W. | Quesada, Victor | Puente, Xose S. | Ordoñez, Gonzalo R. | López-Otín, Carlos | Vinar, Tomas | Brejova, Brona | Ratan, Aakrosh | Harris, Robert S. | Miller, Webb | Kosiol, Carolin | Lawson, Heather A. | Taliwal, Vikas | Martins, André L. | Siepel, Adam | RoyChoudhury, Arindam | Ma, Xin | Degenhardt, Jeremiah | Bustamante, Carlos D. | Gutenkunst, Ryan N. | Mailund, Thomas | Dutheil, Julien Y. | Hobolth, Asger | Schierup, Mikkel H. | Chemnick, Leona | Ryder, Oliver A. | Yoshinaga, Yuko | de Jong, Pieter J. | Weinstock, George M. | Rogers, Jeffrey | Mardis, Elaine R. | Gibbs, Richard A. | Wilson, Richard K.
Nature  2011;469(7331):529-533.
“Orangutan” is derived from the Malay term “man of the forest” and aptly describes the Southeast Asian great apes native to Sumatra and Borneo. The orangutan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evolution. Here we present a Sumatran orangutan draft genome assembly and short read sequence data from five Sumatran and five Bornean orangutan genomes. Our analyses reveal that, compared to other primates, the orangutan genome has many unique features. Structural evolution of the orangutan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, a lower rate of gene family turnover and surprisingly quiescent Alu repeats, which have played a major role in restructuring other primate genomes. We also describe the first primate polymorphic neocentromere, found in both Pongo species, emphasizing the gradual evolution of orangutan genome structure. Orangutans have extremely low energy usage for a eutherian mammal1, far lower than their hominid relatives. Adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400k years ago (ya), is more recent than most previous studies and underscores the complexity of the orangutan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (Ne) expanded exponentially relative to the ancestral Ne after the split, while Bornean Ne declined over the same period. 
Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.
PMCID: PMC3060778  PMID: 21270892
22.  Limitations of next-generation genome sequence assembly 
Nature methods  2010;8(1):61-65.
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.
PMCID: PMC3115693  PMID: 21102452
23.  Haplotype-resolved genome sequencing of a Gujarati Indian individual 
Nature biotechnology  2010;29(1):59-63.
Haplotype information is essential to the complete description and interpretation of genomes1, genetic diversity2 and genetic ancestry3. Although individual human genome sequencing is increasingly routine4, nearly all such genomes are unresolved with respect to haplotype. Here we combine the throughput of massively parallel sequencing5 with the contiguity information provided by large-insert cloning6 to experimentally determine the haplotype-resolved genome of a South Asian individual. A single fosmid library was split into a modest number of pools, each providing ~3% physical coverage of the diploid genome. Sequencing of each pool yielded reads overwhelmingly derived from only one homologous chromosome at any given location. These data were combined with whole-genome shotgun sequence to directly phase 94% of ascertained heterozygous single nucleotide polymorphisms (SNPs) into long haplotype blocks (N50 of 386 kilobases (kbp)). This method also facilitates the analysis of structural variation, for example, to anchor novel insertions7,8 to specific locations and haplotypes.
PMCID: PMC3116788  PMID: 21170042
25.  Detection and characterization of novel sequence insertions using paired-end next-generation sequencing 
Bioinformatics  2010;26(10):1277-1283.
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data, however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.
Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.); thus, it requires significantly fewer computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.
Availability: The implementation of the NovelSeq pipeline is available at;
PMCID: PMC2865866  PMID: 20385726
