Turkey is a crossroads of major population movements throughout history and has been a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers. However, to date, there have been no studies to assess the genetic variation at the whole genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32 × -48×).
We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contribution from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish populations and European populations. Furthermore, we have analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found increased allele frequency of 31.25% for H1/H2 inversion polymorphism when compared to European populations that show about 25% of allele frequency.
This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in analyzing natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-963) contains supplementary material, which is available to authorized users.
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to reference genome while discounting the mismatches that occur at common SNP locations provided by db-SNP; this significantly increases the number of reads that can be mapped to the reference genome. Notice that all of the above features are implemented within the index structure and are not simple post-processing steps and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
Motivation: The high throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated world wide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19—when the goal is to compress the reads alone. In fact, on SCALCE reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly even the running time of SCALCE + gzip improves that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip through the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to 3.34 improvement in the compression rate and 1.26 improvement in running time.
Availability: Our algorithm, SCALCE (Sequence Compression Algorithm using Locally Consistent Encoding), is implemented in C++ with both gzip and bzip2 compression options. It also supports multithreading when gzip option is selected, and the pigz binary is available. It is available at http://scalce.sourceforge.net.
Contact: email@example.com or firstname.lastname@example.org
Supplementary data are available at Bioinformatics online.
Next-generation sequencing is making sequence-based molecular pathology and personalized oncology viable. We selected an individual initially diagnosed with conventional but aggressive prostate adenocarcinoma and sequenced the genome and transcriptome from primary and metastatic tissues collected prior to hormone therapy. The histology-pathology and copy number profiles were remarkably homogeneous, yet it was possible to propose the quadrant of the prostate tumour that likely seeded the metastatic diaspora. Despite a homogeneous cell type, our transcriptome analysis revealed signatures of both luminal and neuroendocrine cell types. Remarkably, the repertoire of expressed but apparently private gene fusions, including C15orf21:MYC, recapitulated this biology. We hypothesize that the amplification and over-expression of the stem cell gene MSI2 may have contributed to the stable hybrid cellular identity. This hybrid luminal-neuroendocrine tumour appears to represent a novel and highly aggressive case of prostate cancer with unique biological features and, conceivably, a propensity for rapid progression to castrate-resistance. Overall, this work highlights the importance of integrated analyses of genome, exome and transcriptome sequences for basic tumour biology, sequence-based molecular pathology and personalized oncology.
RNA sequencing; DNA sequencing; prostate cancer; fusion genes; neuroendocrine; personalized medicine; cancer genetics
Chimeric transcripts, including partial and internal tandem duplications (PTDs, ITDs) and gene fusions, are important in the detection, prognosis, and treatment of human cancers.
We describe Barnacle, a production-grade analysis tool that detects such chimeras in de novo assemblies of RNA-seq data, and supports prioritizing them for review and validation by reporting the relative coverage of co-occurring chimeric and wild-type transcripts. We demonstrate applications in large-scale disease studies, by identifying PTDs in MLL, ITDs in FLT3, and reciprocal fusions between PML and RARA, in two deeply sequenced acute myeloid leukemia (AML) RNA-seq datasets.
Our analyses of real and simulated data sets show that, with appropriate filter settings, Barnacle makes highly specific predictions for three types of chimeric transcripts that are important in a range of cancers: PTDs, ITDs, and fusions. High specificity makes manual review and validation efficient, which is necessary in large-scale disease studies. Characterizing an extended range of chimera types will help generate insights into progression, treatment, and outcomes for complex diseases.
Transcriptome assembly; Chimeric transcripts; Fusion; Partial tandem duplication; PTD; Internal tandem duplication; ITD; RNA-seq; Transcriptome
The current paradigm of cancer care relies on predictive nomograms which integrate detailed histopathology with clinical data. However, when predictions fail, the consequences for patients are often catastrophic, especially in prostate cancer where nomograms influence the decision to therapeutically intervene. We hypothesized that the high dimensional data afforded by massively parallel sequencing (MPS) is not only capable of providing biological insights, but may aid molecular pathology of prostate tumours. We assembled a cohort of six patients with high-risk disease, and performed deep RNA and shallow DNA sequencing in primary tumours and matched metastases where available. Our analysis identified copy number abnormalities, accurately profiled gene expression levels, and detected both differential splicing and expressed fusion genes. We revealed occult and potentially dormant metastases, unambiguously supporting the patients’ clinical history, and implicated the REST transcriptional complex in the development of neuroendocrine prostate cancer, validating this finding in a large independent cohort. We massively expand on the number of novel fusion genes described in prostate cancer; provide fresh evidence for the growing link between fusion gene aetiology and gene expression profiles; and show the utility of fusion genes for molecular pathology. Finally, we identified chromothripsis in a patient with chronic prostatitis. Our results provide a strong foundation for further development of MPS-based molecular pathology.
molecular pathology; massively parallel sequencing; neuroendocrine prostate cancer; REST repressor; chromothripsis
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
Motivation: Computational identification of genomic structural variants via high-throughput sequencing is an important problem for which a number of highly sophisticated solutions have been recently developed. With the advent of high-throughput transcriptome sequencing (RNA-Seq), the problem of identifying structural alterations in the transcriptome is now attracting significant attention.
In this article, we introduce two novel algorithmic formulations for identifying transcriptomic structural variants through aligning transcripts to the reference genome under the consideration of such variation. The first formulation is based on a nucleotide-level alignment model; a second, potentially faster formulation is based on chaining fragments shared between each transcript and the reference genome. Based on these formulations, we introduce a novel transcriptome-to-genome alignment tool, Dissect (DIScovery of Structural Alteration Event Containing Transcripts), which can identify and characterize transcriptomic events such as duplications, inversions, rearrangements and fusions. Dissect is suitable for whole transcriptome structural variation discovery problems involving sufficiently long reads or accurately assembled contigs.
Results: We tested Dissect on simulated transcripts altered via structural events, as well as assembled RNA-Seq contigs from human prostate cancer cell line C4-2. Our results indicate that Dissect has high sensitivity and specificity in identifying structural alteration events in simulated transcripts as well as uncovering novel structural alterations in cancer transcriptomes.
Availability: Dissect is available for public use at: http://dissect-trans.sourceforge.net
email@example.com; firstname.lastname@example.org; email@example.com
Motivation: Molecular profiles of tumour samples have been widely and successfully used for classification problems. A number of algorithms have been proposed to predict classes of tumor samples based on expression profiles with relatively high performance. However, prediction of response to cancer treatment has proved to be more challenging and novel approaches with improved generalizability are still highly needed. Recent studies have clearly demonstrated the advantages of integrating protein–protein interaction (PPI) data with gene expression profiles for the development of subnetwork markers in classification problems.
Results: We describe a novel network-based classification algorithm (OptDis) using color coding technique to identify optimally discriminative subnetwork markers. Focusing on PPI networks, we apply our algorithm to drug response studies: we evaluate our algorithm using published cohorts of breast cancer patients treated with combination chemotherapy. We show that our OptDis method improves over previously published subnetwork methods and provides better and more stable performance compared with other subnetwork and single gene methods. We also show that our subnetwork method produces predictive markers that are more reproducible across independent cohorts and offer valuable insight into biological processes underlying response to therapy.
Availability: The implementation is available at: http://www.cs.sfu.ca/~pdao/personal/OptDis.html
Contact: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org
Gene fusions created by somatic genomic rearrangements are known to play an important role in the onset and development of some cancers, such as lymphomas and sarcomas. RNA-Seq (whole transcriptome shotgun sequencing) is proving to be a useful tool for the discovery of novel gene fusions in cancer transcriptomes. However, algorithmic methods for the discovery of gene fusions using RNA-Seq data remain underdeveloped. We have developed deFuse, a novel computational method for fusion discovery in tumor RNA-Seq data. Unlike existing methods that use only unique best-hit alignments and consider only fusion boundaries at the ends of known exons, deFuse considers all alignments and all possible locations for fusion boundaries. As a result, deFuse is able to identify fusion sequences with demonstrably better sensitivity than previous approaches. To increase the specificity of our approach, we curated a list of 60 true positive and 61 true negative fusion sequences (as confirmed by RT-PCR), and have trained an adaboost classifier on 11 novel features of the sequence data. The resulting classifier has an estimated value of 0.91 for the area under the ROC curve. We have used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We report herein the first gene fusions discovered in ovarian cancer. We conclude that gene fusions are not infrequent events in ovarian cancer and that these events have the potential to substantially alter the expression patterns of the genes involved; gene fusions should therefore be considered in efforts to comprehensively characterize the mutational profiles of ovarian cancer transcriptomes.
Genome rearrangements and associated gene fusions are known to be important oncogenic events in some cancers. We have developed a novel computational method called deFuse for detecting gene fusions in RNA-Seq data and have applied it to the discovery of novel gene fusions in sarcoma and ovarian tumors. We assessed the accuracy of our method and found that deFuse produces substantially better sensitivity and specificity than two other published methods. We have also developed a set of 60 positive and 61 negative examples that will be useful for accurate identification of gene fusions in future RNA-Seq datasets. We have trained a classifier on 11 novel features of the 121 examples, and show that the classifier is able to accurately identify real gene fusions. The 45 gene fusions reported in this study represent the first ovarian cancer fusions reported, as well as novel sarcoma fusions. By examining the expression patterns of the affected genes, we find that many fusions are predicted to have functional consequences and thus merit experimental followup to determine their clinical relevance.
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data, however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.
Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), thus it requires significantly less-computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.
Availability: The implementation of the NovelSeq pipeline is available at http://compbio.cs.sfu.ca/strvar.htm
Although many RNA molecules contain pseudoknots, computational prediction of pseudoknotted RNA structure is still in its infancy due to high running time and space consumption implied by the dynamic programming formulations of the problem.
In this paper, we introduce sparsification to significantly speedup the dynamic programming approaches for pseudoknotted RNA structure prediction, which also lower the space requirements. Although sparsification has been applied to a number of RNA-related structure prediction problems in the past few years, we provide the first application of sparsification to pseudoknotted RNA structure prediction specifically and to handling gapped fragments more generally - which has a much more complex recursive structure than other problems to which sparsification has been applied. We analyse how to sparsify four pseudoknot structure prediction algorithms, among those the most general method available (the Rivas-Eddy algorithm) and the fastest one (Reeder-Giegerich algorithm). In all algorithms the number of "candidate" substructures to be considered is reduced.
Our experimental results on the sparsified Reeder-Giegerich algorithm suggest a linear speedup over the unsparsified implementation.
Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507).
Availability: The implementation of algorithm is available at http://compbio.cs.sfu.ca/strvar.htm.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Despite their importance in gene innovation and phenotypic variation, duplicated regions have remained largely intractable due to difficulties in accurately resolving their structure, copy number and sequence content. We present an algorithm (mrFAST) to comprehensively map next-generation sequence reads allowing for the prediction of absolute copy-number variation of duplicated segments and genes. We examine three human genomes and experimentally validate genome-wide copy-number differences. We estimate that 73–87 genes will be on average copy-number variable between two human genomes and find that these genic differences overwhelmingly correspond to segmental duplications (OR=135; p<2.2e-16). Our method can distinguish between different copies of highly identical genes, providing a more accurate census of gene content and insight into functional constraint without the limitations of array-based technology.
Motivation: PSORTb has remained the most precise bacterial protein subcellular localization (SCL) predictor since it was first made available in 2003. However, the recall needs to be improved and no accurate SCL predictors yet make predictions for archaea, nor differentiate important localization subcategories, such as proteins targeted to a host cell or bacterial hyperstructures/organelles. Such improvements should preferably be encompassed in a freely available web-based predictor that can also be used as a standalone program.
Results: We developed PSORTb version 3.0 with improved recall, higher proteome-scale prediction coverage, and new refined localization subcategories. It is the first SCL predictor specifically geared for all prokaryotes, including archaea and bacteria with atypical membrane/cell wall topologies. It features an improved standalone program, with a new batch results delivery system complementing its web interface. We evaluated the most accurate SCL predictors using 5-fold cross validation plus we performed an independent proteomics analysis, showing that PSORTb 3.0 is the most accurate but can benefit from being complemented by Proteome Analyst predictions.
Availability: http://www.psort.org/psortb (download open source software or use the web interface).
Supplementary Information: Supplementary data are availableat Bioinformatics online.
Regulatory antisense RNAs are a class of ncRNAs that regulate gene expression by prohibiting the translation of an mRNA by establishing stable interactions with a target sequence. There is great demand for efficient computational methods to predict the specific interaction between an ncRNA and its target mRNA(s). There are a number of algorithms in the literature which can predict a variety of such interactions - unfortunately at a very high computational cost. Although some existing target prediction approaches are much faster, they are specialized for interactions with a single binding site.
In this paper we present a novel algorithm to accurately predict the minimum free energy structure of RNA-RNA interaction under the most general type of interactions studied in the literature. Moreover, we introduce a fast heuristic method to predict the specific (multiple) binding sites of two interacting RNAs.
We verify the performance of our algorithms for joint structure and binding site prediction on a set of known interacting RNA pairs. Experimental results show our algorithms are highly accurate and outperform all competitive approaches.
Recent interests, such as RNA interference and antisense RNA regulation, strongly motivate the problem of predicting whether two nucleic acid strands interact.
Motivation: Regulatory non-coding RNAs (ncRNAs) such as microRNAs play an important role in gene regulation. Studies on both prokaryotic and eukaryotic cells show that such ncRNAs usually bind to their target mRNA to regulate the translation of corresponding genes. The specificity of these interactions depends on the stability of intermolecular and intramolecular base pairing. While methods like deep sequencing allow to discover an ever increasing set of ncRNAs, there are no high-throughput methods available to detect their associated targets. Hence, there is an increasing need for precise computational target prediction. In order to predict base-pairing probability of any two bases in interacting nucleic acids, it is necessary to compute the interaction partition function over the whole ensemble. The partition function is a scalar value from which various thermodynamic quantities can be derived. For example, the equilibrium concentration of each complex nucleic acid species and also the melting temperature of interacting nucleic acids can be calculated based on the partition function of the complex.
Results: We present a model for analyzing the thermodynamics of two interacting nucleic acid strands considering the most general type of interactions studied in the literature. We also present a corresponding dynamic programming algorithm that computes the partition function over (almost) all physically possible joint secondary structures formed by two interacting nucleic acids in O(n6) time. We verify the predictive power of our algorithm by computing (i) the melting temperature for interacting RNA pairs studied in the literature and (ii) the equilibrium concentration for several variants of the OxyS–fhlA complex. In both experiments, our algorithm shows high accuracy and outperforms competitors.
Availability: Software and web server is available at http://compbio.cs.sfu.ca/taverna/pirna/
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are avaliable at Bioinformatics online.
Non-coding RNAs (ncRNAs) have important functional roles in the cell: for example, they regulate gene expression by means of establishing stable joint structures with target mRNAs via complementary sequence motifs. Sequence motifs are also important determinants of the structure of ncRNAs. Although ncRNAs are abundant, discovering novel ncRNAs on genome sequences has proven to be a hard task; in particular past attempts for ab initio ncRNA search mostly failed with the exception of tools that can identify micro RNAs.
We present a very general ab initio ncRNA gene finder that exploits differential distributions of sequence motifs between ncRNAs and background genome sequences.
Our method, once trained on a set of ncRNAs from a given species, can be applied to a genome sequences of other organisms to find not only ncRNAs homologous to those in the training set but also others that potentially belong to novel (and perhaps unknown) ncRNA families. Availability: http://compbio.cs.sfu.ca/taverna/smyrna
New generation sequencing technologies offer unique opportunities and challenges for re-sequencing studies. In this article, we focus on re-sequencing experiments using the Solexa technology, based on bacterial artificial chromosome (BAC) clones, and address an experimental design problem. In these specific experiments, approximate coordinates of the BACs on a reference genome are known, and fine-scale differences between the BAC sequences and the reference are of interest. The high-throughput characteristics of the sequencing technology makes it possible to multiplex BAC sequencing experiments by pooling BACs for a cost-effective operation. However, the way BACs are pooled in such re-sequencing experiments has an effect on the downstream analysis of the generated data, mostly due to subsequences common to multiple BACs. The experimental design strategy we develop in this article offers combinatorial solutions based on approximation algorithms for the well-known max n-cut problem and the related max n-section problem on hypergraphs. Our algorithms, when applied to a number of sample cases give more than a 2-fold performance improvement over random partitioning.
Protein–protein interaction (PPI) networks of many organisms share global topological features such as degree distribution, k-hop reachability, betweenness and closeness. Yet, some of these networks can differ significantly from the others in terms of local structures: e.g. the number of specific network motifs can vary significantly among PPI networks.
Counting the number of network motifs provides a major challenge to compare biomolecular networks. Recently developed algorithms have been able to count the number of induced occurrences of subgraphs with k≤ 7 vertices. Yet no practical algorithm exists for counting non-induced occurrences, or counting subgraphs with k≥ 8 vertices. Counting non-induced occurrences of network motifs is not only challenging but also quite desirable as available PPI networks include several false interactions and miss many others.
In this article, we show how to apply the ‘color coding’ technique for counting non-induced occurrences of subgraph topologies in the form of trees and bounded treewidth subgraphs. Our algorithm can count all occurrences of motif G′ with k vertices in a network G with n vertices in time polynomial with n, provided k=O(log n). We use our algorithm to obtain ‘treelet’ distributions for k≤ 10 of available PPI networks of unicellular organisms (Saccharomyces cerevisiae Escherichia coli and Helicobacter Pyloris), which are all quite similar, and a multicellular organism (Caenorhabditis elegans) which is significantly different. Furthermore, the treelet distribution of the unicellular organisms are similar to that obtained by the ‘duplication model’ but are quite different from that of the ‘preferential attachment model’. The treelet distribution is robust w.r.t. sparsification with bait/edge coverage of 70% but differences can be observed when bait/edge coverage drops to 50%.
The major DNA constituent of primate centromeres is alpha satellite DNA. As much as 2%–5% of sequence generated as part of primate genome sequencing projects consists of this material, which is fragmented or not assembled as part of published genome sequences due to its highly repetitive nature. Here, we develop computational methods to rapidly recover and categorize alpha-satellite sequences from previously uncharacterized whole-genome shotgun sequence data. We present an algorithm to computationally predict potential higher-order array structure based on paired-end sequence data and then experimentally validate its organization and distribution by experimental analyses. Using whole-genome shotgun data from the human, chimpanzee, and macaque genomes, we examine the phylogenetic relationship of these sequences and provide further support for a model for their evolution and mutation over the last 25 million years. Our results confirm fundamental differences in the dispersal and evolution of centromeric satellites in the Old World monkey and ape lineages of evolution.
Centromeric DNA has been described as the last frontier of genomic sequencing; such regions are typically poorly assembled during the whole-genome shotgun sequence assembly process due to their repetitive complexity. This paper develops a computational algorithm to systematically extract data regarding primate centromeric DNA structure and organization from that ∼5% of sequence that is not included as part of standard genome sequence assemblies. Using this computational approach, we identify and reconstruct published human higher-order alpha satellite arrays and discover new families in human, chimpanzee, and Old World monkeys. Experimental validation confirms the utility of this computational approach to understanding the centromere organization of other nonhuman primates. An evolutionary analysis in diverse primate genomes supports fundamental differences in the structure and organization of centromere DNA between ape and Old World monkey lineages. The ability to extract meaningful biological data from random shotgun sequence data helps to fill an important void in large-scale sequencing of primate genomes, with implications for other genome sequencing projects.
The (asymptotic) degree distributions of the best-known “scale-free” network models are all similar and are independent of the seed graph used; hence, it has been tempting to assume that networks generated by these models are generally similar. In this paper, we observe that several key topological features of such networks depend heavily on the specific model and the seed graph used. Furthermore, we show that starting with the “right” seed graph (typically a dense subgraph of the protein–protein interaction network analyzed), the duplication model captures many topological features of publicly available protein–protein interaction networks very well.
The interactions among proteins in an organism can be represented as a protein–protein interaction (PPI) network, where each protein is represented with a node, and each interaction is represented with an edge between two nodes. As PPI networks of several model organisms become available, their topological features attract considerable attention. It is believed that the available PPI networks are (1) “small-world” networks, and (2) their degree distribution is in the form of a “power law.” In other words, (1) it is possible to reach from a protein to any other protein in only a small (approximately six) number of hops, and (2) although most proteins have only a few interactions (one or two), there are a few proteins with many more interactions (200 or more) and that act as “hubs.” It has thus been tempting to develop simple mathematical network generators with topological features similar to those of the available PPI networks. One such model, the “duplication model,” is based on Ohno's model of genome growth. It starts with a small “seed network” and grows by “duplicating” one of the existing nodes at a time, with an identical set of interactions; a randomly selected subset of these interactions is then deleted, and a few new interactions are added at random. It has been mathematically proven that the duplication model provides a small-world network and also has a power-law degree distribution. What we show in this paper is that by choosing the “right” seed network, many other topological features of the available PPI networks can be captured by the duplication model. The right seed network in this case turns out to include two sizable “cliques” (subnetworks where all node pairs are connected) with many interactions in between. In this paper, we also consider the preferential attachment model, which again grows by adding to a seed network one node at a time and connecting the new node to every other node with probability proportional to the existing degree of the second node. Because the preferential attachment model also provides a small-world network and has a power-law degree distribution, it has been considered equivalent to the duplication model. We show that the two models are vastly different in terms of other topological features we consider, and the preferential attachment model cannot capture some key features of the available PPI networks.
We present taveRNA, a web server package that hosts three RNA web services: alteRNA, inteRNA and pRuNA. alteRNA is a new alternative for RNA secondary structure prediction. It is based on a dynamic programming solution that minimizes the sum of energy density and free energy of an RNA structure. inteRNA is the first RNA–RNA interaction structure prediction web service. It also employs a dynamic programming algorithm to minimize the free energy of the resulting joint structure of the two interacting RNAs. Lastly, pRuNA is an efficient database pruning service; which given a query RNA, eliminates a significant portion of an ncRNA database and returns only a few ncRNAs as potential regulators. taveRNA is available at http://compbio.cs.sfu.ca/taverna.