Search tips
Search criteria

Results 1-25 (1247779)

Clipboard (0)

Related Articles

1.  Sequencing technologies and genome sequencing 
Journal of Applied Genetics  2011;52(4):413-435.
The high-throughput - next generation sequencing (HT-NGS) technologies are currently the hottest topic in the field of human and animals genomics researches, which can produce over 100 times more data compared to the most sophisticated capillary sequencers based on the Sanger method. With the ongoing developments of high throughput sequencing machines and advancement of modern bioinformatics tools at unprecedented pace, the target goal of sequencing individual genomes of living organism at a cost of $1,000 each is seemed to be realistically feasible in the near future. In the relatively short time frame since 2005, the HT-NGS technologies are revolutionizing the human and animal genome researches by analysis of chromatin immunoprecipitation coupled to DNA microarray (ChIP-chip) or sequencing (ChIP-seq), RNA sequencing (RNA-seq), whole genome genotyping, genome wide structural variation, de novo assembling and re-assembling of genome, mutation detection and carrier screening, detection of inherited disorders and complex human diseases, DNA library preparation, paired ends and genomic captures, sequencing of mitochondrial genome and personal genomics. In this review, we addressed the important features of HT-NGS like, first generation DNA sequencers, birth of HT-NGS, second generation HT-NGS platforms, third generation HT-NGS platforms: including single molecule Heliscope™, SMRT™ and RNAP sequencers, Nanopore, Archon Genomics X PRIZE foundation, comparison of second and third HT-NGS platforms, applications, advances and future perspectives of sequencing technologies on human and animal genome research.
PMCID: PMC3189340  PMID: 21698376
CHIP-chip; Chip-seq; De novo assembling; High-throughput next generation sequencing; Personal genomics; Re-sequencing; RNA-seq
2.  Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery 
Bioinformatics  2010;26(12):i350-i357.
Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507).
Availability: The implementation of algorithm is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2881400  PMID: 20529927
3.  REDHORSE-REcombination and Double crossover detection in Haploid Organisms using next-geneRation SEquencing data 
BMC Genomics  2015;16(1):133.
Next-generation sequencing technology provides a means to study genetic exchange at a higher resolution than was possible using earlier technologies. However, this improvement presents challenges as the alignments of next generation sequence data to a reference genome cannot be directly used as input to existing detection algorithms, which instead typically use multiple sequence alignments as input. We therefore designed a software suite called REDHORSE that uses genomic alignments, extracts genetic markers, and generates multiple sequence alignments that can be used as input to existing recombination detection algorithms. In addition, REDHORSE implements a custom recombination detection algorithm that makes use of sequence information and genomic positions to accurately detect crossovers. REDHORSE is a portable and platform independent suite that provides efficient analysis of genetic crosses based on Next-generation sequencing data.
We demonstrated the utility of REDHORSE using simulated data and real Next-generation sequencing data. The simulated dataset mimicked recombination between two known haploid parental strains and allowed comparison of detected break points against known true break points to assess performance of recombination detection algorithms. A newly generated NGS dataset from a genetic cross of Toxoplasma gondii allowed us to demonstrate our pipeline. REDHORSE successfully extracted the relevant genetic markers and was able to transform the read alignments from NGS to the genome to generate multiple sequence alignments. Recombination detection algorithm in REDHORSE was able to detect conventional crossovers and double crossovers typically associated with gene conversions whilst filtering out artifacts that might have been introduced during sequencing or alignment. REDHORSE outperformed other commonly used recombination detection algorithms in finding conventional crossovers. In addition, REDHORSE was the only algorithm that was able to detect double crossovers.
REDHORSE is an efficient analytical pipeline that serves as a bridge between genomic alignments and existing recombination detection algorithms. Moreover, REDHORSE is equipped with a recombination detection algorithm specifically designed for Next-generation sequencing data. REDHORSE is portable, platform independent Java based utility that provides efficient analysis of genetic crosses based on Next-generation sequencing data. REDHORSE is available at
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1309-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4348101  PMID: 25766039
Next-generation sequencing; Recombination detection; Conventional crossovers; Double crossovers; Haploid genome; Toxoplasma gondii; Multiple sequence alignments; Single nucleotide variations; Merged allele file and allele extraction
4.  PeSV-Fisher: Identification of Somatic and Non-Somatic Structural Variants Using Next Generation Sequencing Data 
PLoS ONE  2013;8(5):e63377.
Next-generation sequencing technologies expedited research to develop efficient computational tools for the identification of structural variants (SVs) and their use to study human diseases. As deeper data is obtained, the existence of higher complexity SVs in some genomes becomes more evident, but the detection and definition of most of these complex rearrangements is still in its infancy. The full characterization of SVs is a key aspect for discovering their biological implications. Here we present a pipeline (PeSV-Fisher) for the detection of deletions, gains, intra- and inter-chromosomal translocations, and inversions, at very reasonable computational costs. We further provide comprehensive information on co-localization of SVs in the genome, a crucial aspect for studying their biological consequences. The algorithm uses a combination of methods based on paired-reads and read-depth strategies. PeSV-Fisher has been designed with the aim to facilitate identification of somatic variation, and, as such, it is capable of analysing two or more samples simultaneously, producing a list of non-shared variants between samples. We tested PeSV-Fisher on available sequencing data, and compared its behaviour to that of frequently deployed tools (BreakDancer and VariationHunter). We have also tested this algorithm on our own sequencing data, obtained from a tumour and a normal blood sample of a patient with chronic lymphocytic leukaemia, on which we have also validated the results by targeted re-sequencing of different kinds of predictions. This allowed us to determine confidence parameters that influence the reliability of breakpoint predictions.
PeSV-Fisher is available at
PMCID: PMC3660373  PMID: 23704902
5.  Massively Parallel Sequencing Approaches for Characterization of Structural Variation 
The emergence of next-generation sequencing (NGS) technologies offers an incredible opportunity to comprehensively study DNA sequence variation in human genomes. Commercially available platforms from Roche (454), Illumina (Genome Analyzer and Hiseq 2000), and Applied Biosystems (SOLiD) have the capability to completely sequence individual genomes to high levels of coverage. NGS data is particularly advantageous for the study of structural variation (SV) because it offers the sensitivity to detect variants of various sizes and types, as well as the precision to characterize their breakpoints at base pair resolution. In this chapter, we present methods and software algorithms that have been developed to detect SVs and copy number changes using massively parallel sequencing data. We describe visualization and de novo assembly strategies for characterizing SV breakpoints and removing false positives.
PMCID: PMC3679911  PMID: 22228022
Next-generation sequencing; Paired-end sequencing; 454; Illumina; Solexa; Abi solid; Insertions; Deletions; Duplications; Inversions; Translocations; Indels; Copy number variants
6.  Genome Wide Distributions and Functional Characterization of Copy Number Variations between Chinese and Western Pigs 
PLoS ONE  2015;10(7):e0131522.
Copy number variations (CNVs) refer to large insertions, deletions and duplications in the genomic structure ranging from one thousand to several million bases in size. Since the development of next generation sequencing technology, several methods have been well built for detection of copy number variations with high credibility and accuracy. Evidence has shown that CNV occurring in gene region could lead to phenotypic changes due to the alteration in gene structure and dosage. However, it still remains unexplored whether CNVs underlie the phenotypic differences between Chinese and Western domestic pigs. Based on the read-depth methods, we investigated copy number variations using 49 individuals derived from both Chinese and Western pig breeds. A total of 3,131 copy number variation regions (CNVRs) were identified with an average size of 13.4 Kb in all individuals during domestication, harboring 1,363 genes. Among them, 129 and 147 CNVRs were Chinese and Western pig specific, respectively. Gene functional enrichments revealed that these CNVRs contribute to strong disease resistance and high prolificacy in Chinese domestic pigs, but strong muscle tissue development in Western domestic pigs. This finding is strongly consistent with the morphologic characteristics of Chinese and Western pigs, indicating that these group-specific CNVRs might have been preserved by artificial selection for the favored phenotypes during independent domestication of Chinese and Western pigs. In this study, we built high-resolution CNV maps in several domestic pig breeds and discovered the group specific CNVs by comparing Chinese and Western pigs, which could provide new insight into genomic variations during pigs’ independent domestication, and facilitate further functional studies of CNV-associated genes.
PMCID: PMC4496047  PMID: 26154170
7.  Huvariome: a web server resource of whole genome next-generation sequencing allelic frequencies to aid in pathological candidate gene selection 
Next generation sequencing provides clinical research scientists with direct read out of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false positive variations selected, this number can be reduced further with the use of high quality whole genome reference data to minimize false positives variants prior to candidate gene selection. In addition the use of platform related sequencing error models can help in the recovery of ambiguous genotypes from lower coverage data.
We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes), Huvariome Core which comprises 31 healthy individuals from the Benelux region, and Diversity Panel consisting of 46 healthy individuals representing 10 different populations and 21 samples in three Pedigrees. Users can query the database by gene or position via a web interface and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments. Huvariome has been used to support the selection of candidate cardiomyopathy related genes which have a homozygous genotype in the reference cohorts. This database allows the users to see which selected variants are common variants (> 5% minor allele frequency) in the Huvariome core samples, thus aiding in the selection of potentially pathogenic variants by filtering out common variants that are not listed in one of the other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provides the user with the possibility of identifying platform dependent errors associated with specific regions of the human genome.
Huvariome is a simple to use resource for validation of resequencing results obtained by NGS experiments. The high sequence coverage and low error rates provide scientists with the ability to remove false positive results from pedigree studies. Results are returned via a web interface that displays location-based genetic variation frequency, impact on protein function, association with known genetic variations and a quality score of the variation base derived from Huvariome Core and the Diversity Panel data. These results may be used to identify and prioritize rare variants that, for example, might be disease relevant. In testing the accuracy of the Huvariome database, alleles of a selection of ambiguously called coding single nucleotide variants were successfully predicted in all cases. Data protection of individuals is ensured by restricted access to patient derived genomes from the host institution which is relevant for future molecular diagnostics.
PMCID: PMC3549785  PMID: 23164068
Medical genetics; Medical genomics; Whole genome sequencing; Allele frequency; Cardiomyopathy
8.  The use of ultra-dense array CGH analysis for the discovery of micro-copy number alterations and gene fusions in the cancer genome 
BMC Medical Genomics  2011;4:16.
Molecular alterations critical to development of cancer include mutations, copy number alterations (amplifications and deletions) as well as genomic rearrangements resulting in gene fusions. Massively parallel next generation sequencing, which enables the discovery of such changes, uses considerable quantities of genomic DNA (> 5 ug), a serious limitation in ever smaller clinical samples. However, a commonly available microarray platforms such as array comparative genomic hybridization (array CGH) allows the characterization of gene copy number at a single gene resolution using much smaller amounts of genomic DNA. In this study we evaluate the sensitivity of ultra-dense array CGH platforms developed by Agilent, especially that of the 1 million probe array (1 M array), and their application when whole genome amplification is required because of limited sample quantities.
We performed array CGH on whole genome amplified and not amplified genomic DNA from MCF-7 breast cancer cells, using 244 K and 1 M Agilent arrays. The ADM-2 algorithm was used to identify micro-copy number alterations that measured less than 1 Mb in genomic length.
DNA from MCF-7 breast cancer cells was analyzed for micro-copy number alterations, defined as measuring less than 1 Mb in genomic length. The 4-fold extra resolution of the 1 M array platform relative to the less dense 244 K array platform, led to the improved detection of copy number variations (CNVs) and micro-CNAs. The identification of intra-genic breakpoints in areas of DNA copy number gain signaled the possible presence of gene fusion events. However, the ultra-dense platforms, especially the densest 1 M array, detect artifacts inherent to whole genome amplification and should be used only with non-amplified DNA samples.
This is a first report using 1 M array CGH for the discovery of cancer genes and biomarkers. We show the remarkable capacity of this technology to discover CNVs, micro-copy number alterations and even gene fusions. However, these platforms require excellent genomic DNA quality and do not tolerate relatively small imperfections related to the whole genome amplification.
PMCID: PMC3041991  PMID: 21272361
9.  Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence 
PLoS Genetics  2011;7(9):e1002280.
Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (<1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.
Author Summary
An individual's genetic profile plays an important role in determining risk for disease and response to medical therapy. The development of technologies that facilitate rapid whole-genome sequencing will provide unprecedented power in the estimation of disease risk. Here we develop methods to characterize genetic determinants of disease risk and response to medical therapy in a nuclear family of four, leveraging population genetic profiles from recent large scale sequencing projects. We identify the way in which genetic information flows through the family to identify sequencing errors and inheritance patterns of genes contributing to disease risk. In doing so we identify genetic risk factors associated with an inherited predisposition to blood clot formation and response to blood thinning medications. We find that this aligns precisely with the most significant disease to occur to date in the family, namely pulmonary embolism, a blood clot in the lung. These ethnicity-specific, family-based approaches to interpretation of individual genetic profiles are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.
PMCID: PMC3174201  PMID: 21935354
10.  Evaluation of Paired-End Sequencing Strategies for Detection of Genome Rearrangements in Cancer 
PLoS Computational Biology  2008;4(4):e1000051.
Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structural variation on a genome-wide scale. This technique is particularly useful for detecting copy-neutral rearrangements, such as inversions and translocations, which are common in cancer and can produce novel fusion genes. We address the question of how much sequencing is required to detect rearrangement breakpoints and to localize them precisely using both theoretical models and simulation. We derive a formula for the probability that a fusion gene exists in a cancer genome given a collection of paired-end sequences from this genome. We use this formula to compute fusion gene probabilities in several breast cancer samples, and we find that we are able to accurately predict fusion genes in these samples with a relatively small number of fragments of large size. We further demonstrate how the ability to detect fusion genes depends on the distribution of gene lengths, and we evaluate how different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, and fusion gene detection, even in the presence of errors that suggest false rearrangements. These results will be useful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomes that are enabled by next-generation sequencing technologies.
Author Summary
Cancer is driven by genomic mutations that can range from single nucleotide changes to chromosomal aberrations that rearrange large pieces of DNA. Often, these chromosomal aberrations disrupt a gene sequence, and even fuse the sequences of two genes, producing a “fusion gene.” Fusion genes have been identified as key participants in the development of several types of cancer. Using genome-sequencing technology it is now possible to identify chromosomal aberrations genome-wide and at high resolution. In this paper, we address the question of how much sequencing is required to detect a chromosomal aberration and to determine the location of the aberration precisely enough to identify if a fusion gene is created by this aberration. We derive a mathematical formula that accurately predicts a number of fusion genes in a breast cancer sequencing study. We also demonstrate how the ability to detect chromosomal aberrations and fusion genes depends on both the size of the fusion gene and the parameters of the genome sequencing strategy that is used. These results will be useful in calibrating future cancer sequencing efforts, especially those using next-generation sequencing technologies.
PMCID: PMC2278375  PMID: 18404202
11.  Dynamics of Genome Rearrangement in Bacterial Populations 
PLoS Genetics  2008;4(7):e1000128.
Genome structure variation has profound impacts on phenotype in organisms ranging from microbes to humans, yet little is known about how natural selection acts on genome arrangement. Pathogenic bacteria such as Yersinia pestis, which causes bubonic and pneumonic plague, often exhibit a high degree of genomic rearrangement. The recent availability of several Yersinia genomes offers an unprecedented opportunity to study the evolution of genome structure and arrangement. We introduce a set of statistical methods to study patterns of rearrangement in circular chromosomes and apply them to the Yersinia. We constructed a multiple alignment of eight Yersinia genomes using Mauve software to identify 78 conserved segments that are internally free from genome rearrangement. Based on the alignment, we applied Bayesian statistical methods to infer the phylogenetic inversion history of Yersinia. The sampling of genome arrangement reconstructions contains seven parsimonious tree topologies, each having different histories of 79 inversions. Topologies with a greater number of inversions also exist, but were sampled less frequently. The inversion phylogenies agree with results suggested by SNP patterns. We then analyzed reconstructed inversion histories to identify patterns of rearrangement. We confirm an over-representation of “symmetric inversions”—inversions with endpoints that are equally distant from the origin of chromosomal replication. Ancestral genome arrangements demonstrate moderate preference for replichore balance in Yersinia. We found that all inversions are shorter than expected under a neutral model, whereas inversions acting within a single replichore are much shorter than expected. We also found evidence for a canonical configuration of the origin and terminus of replication. Finally, breakpoint reuse analysis reveals that inversions with endpoints proximal to the origin of DNA replication are nearly three times more frequent. Our findings represent the first characterization of genome arrangement evolution in a bacterial population evolving outside laboratory conditions. Insight into the process of genomic rearrangement may further the understanding of pathogen population dynamics and selection on the architecture of circular bacterial chromosomes.
Author Summary
Whole-genome sequencing has revealed that organisms exhibit extreme variability in chromosome structure. One common type of chromosome structure variation is genome arrangement variation: changes in the ordering of genes on the chromosome. Not only do we find differences in genome arrangement across species, but in some organisms, members of the same species have radically different genome arrangements. We studied the evolution of genome arrangement in pathogenic bacteria from the genus Yersinia. The Yersinia exhibit substantial variation in genome arrangement both within and across species. We reconstructed the history of genome rearrangement by inversion in a group of eight Yersinia, and we statistically quantified the forces shaping their genome arrangement evolution. In particular, we discovered an excess of rearrangement activity near the origin of chromosomal replication and found evidence for a preferred configuration for the relative orientations of the origin and terminus of replication. We also found real inversions to be significantly shorter than expected. Finally, we discovered that no single reconstruction of inversion history is parsimonious with respect to the total number of inversion mutations, but on average, reconstructed genome arrangements favor “balanced” genomes—where the replication origin is positioned opposite the terminus on the circular chromosome.
PMCID: PMC2483231  PMID: 18650965
12.  Detection of identity by descent using next-generation whole genome sequencing data 
BMC Bioinformatics  2012;13:121.
Identity by descent (IBD) has played a fundamental role in the discovery of genetic loci underlying human diseases. Both pedigree-based and population-based linkage analyses rely on estimating recent IBD, and evidence of ancient IBD can be used to detect population structure in genetic association studies. Various methods for detecting IBD, including those implemented in the soft- ware programs fastIBD and GERMLINE, have been developed in the past several years using population genotype data from microarray platforms. Now, next-generation DNA sequencing data is becoming increasingly available, enabling the comprehensive analysis of genomes, in- cluding identifying rare variants. These sequencing data may provide an opportunity to detect IBD with higher resolution than previously possible, potentially enabling the detection of disease causing loci that were previously undetectable with sparser genetic data.
Here, we investigate how different levels of variant coverage in sequencing and microarray genotype data influences the resolution at which IBD can be detected. This includes microarray genotype data from the WTCCC study, denser genotype data from the HapMap Project, low coverage sequencing data from the 1000 Genomes Project, and deep coverage complete genome data from our own projects. With high power (78%), we can detect segments of length 0.4 cM or larger using fastIBD and GERMLINE in sequencing data. This compares to similar power to detect segments of length 1.0 cM or higher with microarray genotype data. We find that GERMLINE has slightly higher power than fastIBD for detecting IBD segments using sequencing data, but also has a much higher false positive rate.
We further quantify the effect of variant density, conditional on genetic map length, on the power to resolve IBD segments. These investigations into IBD resolution may help guide the design of future next generation sequencing studies that utilize IBD, including family-based association studies, association studies in admixed populations, and homozygosity mapping studies.
PMCID: PMC3403908  PMID: 22672699
13.  ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information 
BMC Bioinformatics  2011;12(Suppl 14):S7.
Structural variations (SVs) change the structure of the genome and are therefore the causes of various diseases. Next-generation sequencing allows us to obtain a multitude of sequence data, some of which can be used to infer the position of SVs.
We developed a new method and implementation named ClipCrop for detecting SVs with single-base resolution using soft-clipping information. A soft-clipped sequence is an unmatched fragment in a partially mapped read. To assess the performance of ClipCrop with other SV-detecting tools, we generated various patterns of simulation data – SV lengths, read lengths, and the depth of coverage of short reads – with insertions, deletions, tandem duplications, inversions and single nucleotide alterations in a human chromosome. For comparison, we selected BreakDancer, CNVnator and Pindel, each of which adopts a different approach to detect SVs, e.g. discordant pair approach, depth of coverage approach and split read approach, respectively.
Our method outperformed BreakDancer and CNVnator in both discovering rate and call accuracy in any type of SV. Pindel offered a similar performance as our method, but our method crucially outperformed for detecting small duplications. From our experiments, ClipCrop infer reliable SVs for the data set with more than 50 bases read lengths and 20x depth of coverage, both of which are reasonable values in current NGS data set.
ClipCrop can detect SVs with higher discovering rate and call accuracy than any other tool in our simulation data set.
PMCID: PMC3287472  PMID: 22373054
14.  Detection and characterization of novel sequence insertions using paired-end next-generation sequencing 
Bioinformatics  2010;26(10):1277-1283.
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data, however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.
Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), thus it requires significantly less-computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.
Availability: The implementation of the NovelSeq pipeline is available at;
PMCID: PMC2865866  PMID: 20385726
15.  HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data 
Nucleic Acids Research  2013;41(22):e202.
Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at
PMCID: PMC3905877  PMID: 24174545
16.  Genome-wide detection of copy number variations using high-density SNP genotyping platforms in Holsteins 
BMC Genomics  2013;14:131.
Copy number variations (CNVs) are widespread in the human or animal genome and are a significant source of genetic variation, which has been demonstrated to play an important role in phenotypic diversity. Advances in technology have allowed for identification of a large number of CNVs in cattle. Comprehensive explore novel CNVs in the bovine genome would provide valuable information for functional analyses of genome structural variation and facilitating follow-up association studies between complex traits and genetic variants.
In this study, we performed a genome-wide CNV detection based on high-density SNP genotyping data of 96 Chinese Holstein cattle. A total of 367 CNV regions (CNVRs) across the genome were identified, which cover 42.74Mb of the cattle genome and correspond to 1.61% of the genome sequence. The length of the CNVRs on autosomes range from 10.76 to 2,806.42 Kb with an average of 96.23 Kb. 218 out of these CNVRs contain 610 annotated genes, which possess a wide spectrum of molecular functions. To confirm these findings, quantitative PCR (qPCR) was performed for 17 CNVRs and 13(76.5%) of them were successfully validated.
Our study demonstrates the high density SNP array can significantly improve the accuracy and sensitivity of CNV calling. Integration of different platforms can enhance the detection of genomic structure variants. Our results provide a significant replenishment for the high resolution map of copy number variation in the bovine genome and valuable information for investigation of genomic structural variation underlying traits of interest in cattle.
PMCID: PMC3639935  PMID: 23442346
Copy number variations; Cattle; BovineHD beadChip; Genome variation; Quantitative real time PCR
17.  Global Mutational Profiling of Formalin Fixed Human Colon Cancers from a Pathology Archive 
The advent of Next-Generation sequencing technologies, which significantly increases the throughput and reduces the cost of large scale sequencing efforts, provides an unprecedented opportunity for discovery of novel gene mutations in human cancers. However, it remains a challenge to apply Next-Generation technologies to DNA extracted from formalin fixed paraffin embedded cancer specimens. We describe here the successful development of a custom DNA capture method using Next-Generation for detection of 140 driver genes in 5 formalin fixed paraffin embedded human colon cancer samples using an improved extraction process to produce high quality DNA. Isolated DNA was enriched for targeted exons and sequenced using the Illumina Next-Generation platform. An analytical pipeline using 3 software platforms to define single nucleotide variants was used to evaluate the data output. Approximately 250x average coverage was obtained with >96% of target bases having at least 30 sequence reads. Results were then compared to previously performed high throughput Sanger sequencing. Using an algorithm of needing a positive call from all 3 callers to give a positive result, 98% of the verified Sanger sequencing somatic driver gene mutations were identified by our method with a specificity of 90%. 13 insertions and deletions identified by Next-Generation were confirmed by Sanger sequencing. We also applied this technology to two components of a biphasic colon cancer which had strikingly differing histology. Remarkably, no new driver gene mutation accumulation was identified in the more undifferentiated component. Applying this method to profiling of formalin fixed paraffin embedded colon cancer tissue samples yields equivalent sensitivity and specificity for mutation detection as Sanger sequencing of matched cell lines derived from these cancers. This method directly enables high throughput comprehensive mutational profiling of colon cancer samples, and is easily extendable to enable targeted sequencing from formalin fixed paraffin embedded material for other tumor types.
PMCID: PMC3697090  PMID: 22878650
next generation sequencing; colon cancer; driver gene mutations
18.  WaveCNV: allele-specific copy number alterations in primary tumors and xenograft models from next-generation sequencing 
Bioinformatics  2013;30(6):768-774.
Motivation: Copy number variations (CNVs) are a major source of genomic variability and are especially significant in cancer. Until recently microarray technologies have been used to characterize CNVs in genomes. However, advances in next-generation sequencing technology offer significant opportunities to deduce copy number directly from genome sequencing data. Unfortunately cancer genomes differ from normal genomes in several aspects that make them far less amenable to copy number detection. For example, cancer genomes are often aneuploid and an admixture of diploid/non-tumor cell fractions. Also patient-derived xenograft models can be laden with mouse contamination that strongly affects accurate assignment of copy number. Hence, there is a need to develop analytical tools that can take into account cancer-specific parameters for detecting CNVs directly from genome sequencing data.
Results: We have developed WaveCNV, a software package to identify copy number alterations by detecting breakpoints of CNVs using translation-invariant discrete wavelet transforms and assign digitized copy numbers to each event using next-generation sequencing data. We also assign alleles specifying the chromosomal ratio following duplication/loss. We verified copy number calls using both microarray (correlation coefficient 0.97) and quantitative polymerase chain reaction (correlation coefficient 0.94) and found them to be highly concordant. We demonstrate its utility in pancreatic primary and xenograft sequencing data.
Availability and implementation: Source code and executables are available at The segmentation algorithm is implemented in MATLAB, and copy number assignment is implemented Perl.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3957071  PMID: 24192544
19.  An Integrative Approach for Interpretation of Clinical NGS Genomic Variant Data 
Antibody (Ab) discovery research has accelerated as monoclonal Ab (mAb)-based biologic strategies have proved efficacious in the treatment of many human diseases, ranging from cancer to autoimmunity. Initial steps in the discovery of therapeutic mAb require epitope characterization and preclinical studies in vitro and in animal models often using limited quantities of Ab. To facilitate this research, our Shared Resource Laboratory (SRL) offers microscale Ab conjugation. Ab submitted for conjugation may or may not be commercially produced, but have not been characterized for use in immunofluorescence applications. Purified mAb and even polyclonal Ab (pAb) can be efficiently conjugated, although the advantages of direct conjugation are more obvious for mAb. To improve consistency of results in microscale (<100ug) conjugation reactions, we chose to utilize several different varieties of commercial kits. Kits tested were limited to covalent fluorophore labeling. Established quality control (QC) processes to validate fluorophore labeling either rely solely on spectrophotometry or utilize flow cytometry of cells expected to express the target antigen. This methodology is not compatible with microscale reactions using uncharacterized Ab. We developed a novel method for cell-free QC of our conjugates that reflects conjugation quality, but is independent of the biological properties of the Ab itself. QC is critical, as amine reactive chemistry relies on the absence of even trace quantities of competing amine moieties such as those found in the Good buffers (HEPES, MOPS, TES, etc.) or irrelevant proteins. Herein, we present data used to validate our method of assessing the extent of labeling and the removal of free dye by using flow cytometric analysis of polystyrene Ab capture beads to verify product quality. This microscale custom conjugation and QC allows for the rapid development and validation of high quality reagents, specific to the needs of our colleagues and clientele. Next generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis. Most analysis pipelines do not connect genomic variants to disease and protein specific information during the initial filtering and selection of relevant variants. Robust bioinformatics pipelines were implemented for trimming, genome alignment, SNP, INDEL, or structural variation detection of whole genome or exon-capture sequencing data from Illumina. Quality control metrics were analyzed at each step of the pipeline to ensure data integrity for clinical applications. We further annotate the variants with statistics regarding the diseased population and variant impact. Custom algorithms were developed to analyze the variant data by filtering variants based upon criteria such as quality of variant, inheritance pattern (e.g. dominant, recessive, X-linked), and impact of variant. The resulting variants and their associated genes are linked to Integrated Genome Browser (IGV) in a genome context, and to the PIR iProXpress system for rich protein and disease information. This poster will present detailed analysis of whole exome sequencing performed on patients with facio-skeletal anomalies. We will compare and contrast data analysis methods and report on potential clinically relevant leads discovered by implementing our new clinical variant pipeline. Our variant analysis of these patients and their unaffected family members resulted in more than 500,000 variants. By applying our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for further filtering of disease relevant variants that impact protein coding genes. Taken together, the integrative approach allows better selection of disease relevant genomic variants by using both genomic and disease/protein centric information. This type of clustering approach can help clinicians better understand the association of variants to the disease phenotype, enabling application to personalized medicine approaches.
PMCID: PMC4162289
20.  Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance 
BMC Genomics  2010;11:385.
Massively parallel DNA sequencing technologies have enabled the sequencing of several individual human genomes. These technologies are also being used in novel ways for mRNA expression profiling, genome-wide discovery of transcription-factor binding sites, small RNA discovery, etc. The multitude of sequencing platforms, each with their unique characteristics, pose a number of design challenges, regarding the technology to be used and the depth of sequencing required for a particular sequencing application. Here we describe a number of analytical and empirical results to address design questions for two applications: detection of structural variations from paired-end sequencing and estimating mRNA transcript abundance.
For structural variation, our results provide explicit trade-offs between the detection and resolution of rearrangement breakpoints, and the optimal mix of paired-read insert lengths. Specifically, we prove that optimal detection and resolution of breakpoints is achieved using a mix of exactly two insert library lengths. Furthermore, we derive explicit formulae to determine these insert length combinations, enabling a 15% improvement in breakpoint detection at the same experimental cost. On empirical short read data, these predictions show good concordance with Illumina 200 bp and 2 Kbp insert length libraries. For transcriptome sequencing, we determine the sequencing depth needed to detect rare transcripts from a small pilot study. With only 1 Million reads, we derive corrections that enable almost perfect prediction of the underlying expression probability distribution, and use this to predict the sequencing depth required to detect low expressed genes with greater than 95% probability.
Together, our results form a generic framework for many design considerations related to high-throughput sequencing. We provide software tools to derive platform independent guidelines for designing sequencing experiments (amount of sequencing, choice of insert length, mix of libraries) for novel applications of next generation sequencing.
PMCID: PMC3091630  PMID: 20565853
21.  Genomic Variation Research Group (GVRG): Analysis of Copy Number Variation in C. elegans using Next Generation Sequencing 
Structural variations (SV) found in eukaryotic genomes include insertions, deletions, inversions, translocations, and copy number variations (CNV). The emerging body of literature clearly illustrates the role of SVs, such as CNVs, in the susceptibility or resistance to certain diseases. While microarrays have traditionally been an effective tool for identifying CNVs, recent advances in paired-end, next-generation sequencing technology now provide an alternative approach.To detect copy number variants using paired-end sequencing it is essential to leverage a bioinformatics pipeline that can distinguish the mapped ends, and identify statistically significant differences as compared to a reference sequence. Several commercial and open source software tools for automatic detection of SVs and CNVs are available and this list is rapidly growing. Though the number of tools continues to increase, they are neither as robust nor mature as those currently available for analyzing microarray experiments.As such, packages are being constantly evaluated with the intent of determining which performs best in this capacity.For its annual research project, the GVRG has hypothesized that an optimal combination of the statistical models and paired-end reads will have the most traction in next-generation sequencing for CNV detection. Using C. elegans as a model organism, we have performed and directed experiments to study the capability of next-generation sequencing for such a purpose. It has been reported that there is nearly 2% natural gene content variation between the Bristol and Hawaii C. elegans strains as determined by aCGH.These published differences, which include a number of CNVs, have provided a valuable framework for conducting the experiments and analyzing results.
PMCID: PMC2918071
22.  TIARA: a database for accurate analysis of multiple personal genomes based on cross-technology 
Nucleic Acids Research  2010;39(Database issue):D883-D888.
High-throughput genomic technologies have been used to explore personal human genomes for the past few years. Although the integration of technologies is important for high-accuracy detection of personal genomic variations, no databases have been prepared to systematically archive genomes and to facilitate the comparison of personal genomic data sets prepared using a variety of experimental platforms. We describe here the Total Integrated Archive of Short-Read and Array (TIARA; database, which contains personal genomic information obtained from next generation sequencing (NGS) techniques and ultra-high-resolution comparative genomic hybridization (CGH) arrays. This database improves the accuracy of detecting personal genomic variations, such as SNPs, short indels and structural variants (SVs). At present, 36 individual genomes have been archived and may be displayed in the database. TIARA supports a user-friendly genome browser, which retrieves read-depths (RDs) and log2 ratios from NGS and CGH arrays, respectively. In addition, this database provides information on all genomic variants and the raw data, including short reads and feature-level CGH data, through anonymous file transfer protocol. More personal genomes will be archived as more individuals are analyzed by NGS or CGH array. TIARA provides a new approach to the accurate interpretation of personal genomes for genome research.
PMCID: PMC3013693  PMID: 21051338
23.  iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data 
BMC Systems Biology  2013;7(Suppl 6):S8.
Structural variations (SVs), such as insertions, deletions, inversions, and duplications, are a common feature in human genomes, and a number of studies have reported that such SVs are associated with human diseases. Although the progress of next generation sequencing (NGS) technologies has led to the discovery of a large number of SVs, accurate and genome-wide detection of SVs remains challenging. Thus far, various calling algorithms based on NGS data have been proposed. However, their strategies are diverse and there is no tool able to detect a full range of SVs accurately.
We focused on evaluating the performance of existing deletion calling algorithms for various spanning ranges from low- to high-coverage simulation data. The simulation data was generated from a whole genome sequence with artificial SVs constructed based on the distribution of variants obtained from the 1000 Genomes Project. From the simulation analysis, deletion calls of various deletion sizes were obtained with each caller, and it was found that the performance was quite different according to the type of algorithms and targeting deletion size. Based on these results, we propose an integrated structural variant calling pipeline (iSVP) that combines existing methods with a newly devised filtering and merging processes. It achieved highly accurate deletion calling with >90% precision and >90% recall on the 30× read data for a broad range of size. We applied iSVP to the whole-genome sequence data of a CEU HapMap sample, and detected a large number of deletions, including notable peaks around 300 bp and 6,000 bp, which corresponded to Alus and long interspersed nuclear elements, respectively. In addition, many of the predicted deletions were highly consistent with experimentally validated ones by other studies.
We present iSVP, a new deletion calling pipeline to obtain a genome-wide landscape of deletions in a highly accurate manner. From simulation and real data analysis, we show that iSVP is broadly applicable to human whole-genome sequencing data, which will elucidate relationships between SVs across genomes and associated diseases or biological functions.
PMCID: PMC4029547  PMID: 24564972
24.  BreakDancer: An algorithm for high resolution mapping of genomic structural variation 
Nature methods  2009;6(9):677-681.
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods under which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including indels, inversions, and translocations. We examined BreakDancer's performance in simulation, comparison with other methods, analysis of an acute myeloid leukemia sample, and the 1,000 Genomes trio individuals. We found that it substantially improved the detection of small and intermediate size indels from 10 bp to 1 Mbp that are difficult to detect via a single conventional approach.
PMCID: PMC3661775  PMID: 19668202
25.  High throughput sequencing in mice: a platform comparison identifies a preponderance of cryptic SNPs 
BMC Genomics  2009;10:379.
Allelic variation is the cornerstone of genetically determined differences in gene expression, gene product structure, physiology, and behavior. However, allelic variation, particularly cryptic (unknown or not annotated) variation, is problematic for follow up analyses. Polymorphisms result in a high incidence of false positive and false negative results in hybridization based analyses and hinder the identification of the true variation underlying genetically determined differences in physiology and behavior. Given the proliferation of mouse genetic models (e.g., knockout models, selectively bred lines, heterogeneous stocks derived from standard inbred strains and wild mice) and the wealth of gene expression microarray and phenotypic studies using genetic models, the impact of naturally-occurring polymorphisms on these data is critical. With the advent of next-generation, high-throughput sequencing, we are now in a position to determine to what extent polymorphisms are currently cryptic in such models and their impact on downstream analyses.
We sequenced the two most commonly used inbred mouse strains, DBA/2J and C57BL/6J, across a region of chromosome 1 (171.6 – 174.6 megabases) using two next generation high-throughput sequencing platforms: Applied Biosystems (SOLiD) and Illumina (Genome Analyzer). Using the same templates on both platforms, we compared realignments and single nucleotide polymorphism (SNP) detection with an 80 fold average read depth across platforms and samples. While public datasets currently annotate 4,527 SNPs between the two strains in this interval, thorough high-throughput sequencing identified a total of 11,824 SNPs in the interval, including 7,663 new SNPs. Furthermore, we confirmed 40 missense SNPs and discovered 36 new missense SNPs.
Comparisons utilizing even two of the best characterized mouse genetic models, DBA/2J and C57BL/6J, indicate that more than half of naturally-occurring SNPs remain cryptic. The magnitude of this problem is compounded when using more divergent or poorly annotated genetic models. This warrants full genomic sequencing of the mouse strains used as genetic models.
PMCID: PMC2743714  PMID: 19686600

Results 1-25 (1247779)