Search tips
Search criteria

Results 1-25 (919689)

Clipboard (0)

Related Articles

1.  Genome-wide SNP identification in multiple morphotypes of allohexaploid tall fescue (Festuca arundinacea Schreb) 
BMC Genomics  2012;13:219.
Single nucleotide polymorphisms (SNPs) provide essential tools for the advancement of research in plant genomics, and the development of SNP resources for many species has been accelerated by the capabilities of second-generation sequencing technologies. The current study aimed to develop and use a novel bioinformatic pipeline to generate a comprehensive collection of SNP markers within the agriculturally important pasture grass tall fescue; an outbreeding allopolyploid species displaying three distinct morphotypes: Continental, Mediterranean and rhizomatous.
A bioinformatic pipeline was developed that successfully identified SNPs within genotypes from distinct tall fescue morphotypes, following the sequencing of 414 polymerase chain reaction (PCR) – generated amplicons using 454 GS FLX technology. Equivalent amplicon sets were derived from representative genotypes of each morphotype, including six Continental, five Mediterranean and one rhizomatous. A total of 8,584 and 2,292 SNPs were identified with high confidence within the Continental and Mediterranean morphotypes respectively. The success of the bioinformatic approach was demonstrated through validation (at a rate of 70%) of a subset of 141 SNPs using both SNaPshot™ and GoldenGate™ assay chemistries. Furthermore, the quantitative genotyping capability of the GoldenGate™ assay revealed that approximately 30% of the putative SNPs were accessible to co-dominant scoring, despite the hexaploid genome structure. The sub-genome-specific origin of each SNP validated from Continental tall fescue was predicted using a phylogenetic approach based on comparison with orthologous sequences from predicted progenitor species.
Using the appropriate bioinformatic approach, amplicon resequencing based on 454 GS FLX technology is an effective method for the identification of polymorphic SNPs within the genomes of Continental and Mediterranean tall fescue. The GoldenGate™ assay is capable of high-throughput co-dominant SNP allele detection, and minimises the problems associated with SNP genotyping in a polyploid by effectively reducing the complexity to a diploid system. This SNP collection may now be refined and used in applications such as cultivar identification, genetic linkage map construction, genome-wide association studies and genomic selection in tall fescue. The bioinformatic pipeline described here represents an effective general method for SNP discovery within outbreeding allopolyploid species.
PMCID: PMC3444928  PMID: 22672128
Lolium arundinaceum; Molecular marker; DNA sequencing; Haplotype; Sub-genome
2.  Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence 
BMC Genomics  2011;12:59.
Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence.
An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii-the diploid source of the wheat D genome, and with a genome size of 4.02 Gb, of which 90% is repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 was sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated.
An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (
PMCID: PMC3041743  PMID: 21266061
3.  Science and Technology at a High Throughput Genome Center 
The Genome Center (GC) at Washington University School of Medicine has developed a state of the art genomics facility. Our scientists work on a variety of cutting edge projects with researchers from around the world. These collaborative research projects lead to cutting edge advances in the field of genomics. The structural organization at the GC reflects these efforts and is centered around six major scientific areas: Transcriptome Sequencing, Genome Assembly, Whole Genome Sequencing, Human Microbiome, Human Genetics, and Targeted Resequencing. These specific scientific areas are supported by one central data production pipeline. Attributes of this pipeline include detailed sample screening protocols, sample barcoding capabilities that allow for a broad range of sample cohorts, multiplatform data production, and the ability to select from more than one method of sequencing strategies. All of this is supported by one centralized LIMS group dedicated to maintaining and developing the data production capabilities. The technology development group investigates new techniques and instrumentation prior to any changes in the main data production pipeline. Only robust protocols and instrumentation are allowed into the data production pipeline. This strategy allows The Genome Center to run a base data production pipeline while constantly infusing high quality advances. Sequence data for each project is sent into an advanced analysis pipeline built to conduct a multitude of assessments. When needed, validation (a second sequence event) can be used to confirm variants detected by the analysis software.
PMCID: PMC3186537
4.  A graph-based approach for designing extensible pipelines 
BMC Bioinformatics  2012;13:163.
In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline is expanding because the incorporation of new tools is often error prone and time consuming. Current approaches to pipeline development such as workflow management systems focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism on the pipeline composition is necessary when each execution requires a different combination of steps.
We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities that require different combinations of steps in each execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where the use of multiple software is necessary to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at The system has been tested on Linux and Windows platforms.
Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited for implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
PMCID: PMC3496580  PMID: 22788675
5.  SNP-PHAGE – High throughput SNP discovery pipeline 
BMC Bioinformatics  2006;7:468.
Single nucleotide polymorphisms (SNPs) as defined here are single base sequence changes or short insertion/deletions between or within individuals of a given species. As a result of their abundance and the availability of high throughput analysis technologies SNP markers have begun to replace other traditional markers such as restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs) and simple sequence repeats (SSRs or microsatellite) markers for fine mapping and association studies in several species. For SNP discovery from chromatogram data, several bioinformatics programs have to be combined to generate an analysis pipeline. Results have to be stored in a relational database to facilitate interrogation through queries or to generate data for further analyses such as determination of linkage disequilibrium and identification of common haplotypes. Although these tasks are routinely performed by several groups, an integrated open source SNP discovery pipeline that can be easily adapted by new groups interested in SNP marker development is currently unavailable.
We developed SNP-PHAGE (SNP discovery Pipeline with additional features for identification of common haplotypes within a sequence tagged site (Haplotype Analysis) and GenBank (-dbSNP) submissions. This tool was applied for analyzing sequence traces from diverse soybean genotypes to discover over 10,000 SNPs. This package was developed on UNIX/Linux platform, written in Perl and uses a MySQL database. Scripts to generate a user-friendly web interface are also provided with common queries for preliminary data analysis. A machine learning tool developed by this group for increasing the efficiency of SNP discovery is integrated as a part of this package as an optional feature. The SNP-PHAGE package is being made available open source at .
SNP-PHAGE provides a bioinformatics solution for high throughput SNP discovery, identification of common haplotypes within an amplicon, and GenBank (dbSNP) submissions. SNP selection and visualization are aided through a user-friendly web interface. This tool is useful for analyzing sequence tagged sites (STSs) of genomic sequences, and this software can serve as a starting point for groups interested in developing SNP markers.
PMCID: PMC1626092  PMID: 17059604
6.  Removing Noise From Pyrosequenced Amplicons 
BMC Bioinformatics  2011;12:38.
In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that because of the large read numbers and the lack of consensus sequences it is vital to distinguish noise from true sequence diversity in this data. Otherwise this leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.
AmpliconNoise outperforms alternative algorithms substantially reducing per base error rates for both the GS FLX and latest Titanium protocol. All three sources of error lead to inflation of diversity estimates. In particular, chimera formation has a hitherto unrealised importance which varies according to amplification protocol. We show that AmpliconNoise allows accurate estimates of OTU number. Just as importantly AmpliconNoise generates the right OTUs even at low sequence differences. We demonstrate that Perseus has very high sensitivity, able to find 99% of chimeras, which is critical when these are present at high frequencies.
AmpliconNoise followed by Perseus is a very effective pipeline for the removal of noise. In addition the principles behind the algorithms, the inference of true sequences using Expectation-Maximization (EM), and the treatment of chimera detection as a classification or 'supervised learning' problem, will be equally applicable to new sequencing technologies as they appear.
PMCID: PMC3045300  PMID: 21276213
7.  Differentially expressed alternatively spliced genes in Malignant Pleural Mesothelioma identified using massively parallel transcriptome sequencing 
BMC Medical Genetics  2009;10:149.
Analyses of Expressed Sequence Tags (ESTs) databases suggest that most human genes have multiple alternative splice variants. The alternative splicing of pre-mRNA is tightly regulated during development and in different tissue types. Changes in splicing patterns have been described in disease states. Recently, we used whole-transcriptome shotgun pryrosequencing to characterize 4 malignant pleural mesothelioma (MPM) tumors, 1 lung adenocarcinoma and 1 normal lung. We hypothesized that alternative splicing profiles might be detected in the sequencing data for the expressed genes in these samples.
We developed a software pipeline to map the transcriptome read sequences of the 4 MPM samples and 1 normal lung sample onto known exon junction sequences in the comprehensive AceView database of expressed sequences and to count how many reads map to each junction. 13,274,187 transcriptome reads generated by the Roche/454 sequencing platform for 5 samples were compared with 151,486 exon junctions from the AceView database. The exon junction expression index (EJEI) was calculated for each exon junction in each sample to measure the differential expression of alternative splicing events. Top ten exon junctions with the largest EJEI difference between the 4 mesothelioma and the normal lung sample were then examined for differential expression using Quantitative Real Time PCR (qRT-PCR) in the 5 sequenced samples. Two of the differentially expressed exon junctions (ACTG2.aAug05 and CDK4.aAug05) were further examined with qRT-PCR in additional 18 MPM and 18 normal lung specimens.
We found 70,953 exon junctions covered by at least one sequence read in at least one of the 5 samples. All 10 identified most differentially expressed exon junctions were validated as present by RT-PCR, and 8 were differentially expressed exactly as predicted by the sequence analysis. The differential expression of the AceView exon junctions for the ACTG2 and CDK4 genes were also observed to be statistically significant in an additional 18 MPM and 18 normal lung samples examined using qRT-PCR. The differential expression of these two junctions was shown to successfully classify these mesothelioma and normal lung specimens with high sensitivity (89% and 78%, respectively).
Whole-transcriptome shotgun sequencing, combined with a downstream bioinformatics pipeline, provides powerful tools for the identification of differentially expressed exon junctions resulting from alternative splice variants. The alternatively spliced genes discovered in the study could serve as useful diagnostic markers as well as potential therapeutic targets for MPM.
PMCID: PMC2808307  PMID: 20043850
8.  Quantitative and qualitative differences in celiac disease epitopes among durum wheat varieties identified through deep RNA-amplicon sequencing 
BMC Genomics  2013;14:905.
Wheat gluten is important for the industrial quality of bread wheat (Triticum aestivum L.) and durum wheat (T. turgidum L.). Gluten proteins are also the source of immunogenic peptides that can trigger a T cell reaction in celiac disease (CD) patients, leading to inflammatory responses in the small intestine. Various peptides with three major T cell epitopes involved in CD are derived from alpha-gliadin fraction of gluten. Alpha-gliadins are encoded by a large multigene family and amino acid variation in the CD epitopes is known to influence the immunogenicity of individual gene family members. Current commercial methods of gluten detection are unable to distinguish between immunogenic and non-immunogenic CD epitope variants and thus to accurately quantify the overall CD epitope load of a given wheat variety. Such quantification is indispensable for correct selection of wheat varieties with low potential to cause CD.
A 454 RNA-amplicon sequencing method was developed for alpha-gliadin transcripts encompassing the three major CD epitopes and their variants. The method was used to screen developing grains on plants of 61 different durum wheat cultivars and accessions. A dedicated sequence analysis pipeline returned a total of 304 unique alpha-gliadin transcripts, corresponding to a total of 171 ‘unique deduced protein fragments’ of alpha-gliadins. The numbers of these fragments obtained in each plant were used to calculate quantitative and quantitative differences between the CD epitopes expressed in the endosperm of these wheat plants. A few plants showed a lower fraction of CD epitope-encoding alpha-gliadin transcripts, but none were free of CD epitopes.
The dedicated 454 RNA-amplicon sequencing method enables 1) the grouping of wheat plants according to the genetic variation in alpha-gliadin transcripts, and 2) the screening for plants which are potentially less CD-immunogenic. The resulting alpha-gliadin sequence database will be useful as a reference in proteomics analysis regarding the immunogenic potential of mature wheat grains.
PMCID: PMC3890609  PMID: 24354426
9.  Bioinformatics Pipelines for Targeted Resequencing and Whole-Exome Sequencing of Human and Mouse Genomes: A Virtual Appliance Approach for Instant Deployment 
PLoS ONE  2014;9(4):e95217.
Targeted resequencing by massively parallel sequencing has become an effective and affordable way to survey small to large portions of the genome for genetic variation. Despite the rapid development in open source software for analysis of such data, the practical implementation of these tools through construction of sequencing analysis pipelines still remains a challenging and laborious activity, and a major hurdle for many small research and clinical laboratories. We developed TREVA (Targeted REsequencing Virtual Appliance), making pre-built pipelines immediately available as a virtual appliance. Based on virtual machine technologies, TREVA is a solution for rapid and efficient deployment of complex bioinformatics pipelines to laboratories of all sizes, enabling reproducible results. The analyses that are supported in TREVA include: somatic and germline single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway and significantly mutated genes analyses. TREVA is flexible and easy to use, and can be customised by Linux-based extensions if required. TREVA can also be deployed on the cloud (cloud computing), enabling instant access without investment overheads for additional hardware. TREVA is available at
PMCID: PMC3994043  PMID: 24752294
10.  Applications of the pipeline environment for visual informatics and genomics computations 
BMC Bioinformatics  2011;12:304.
Contemporary informatics and genomics research require efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols.
This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges - graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures, the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface, and integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to enable integration of bioinformatics resources which provide a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls.
The LONI Pipeline environment provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators - experienced developers and novice users, user with or without access to advanced computational-resources (e.g., Grid, data), as well as basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.
PMCID: PMC3199760  PMID: 21791102
11.  Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing 
Genome Medicine  2013;5(3):28.
To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be.
We sequenced 15 exomes from four families using commercial kits (Illumina HiSeq 2000 platform and Agilent SureSelect version 2 capture kit), with approximately 120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools). We additionally sequenced a single whole genome using the sequencing and analysis pipeline from Complete Genomics (CG), with 95% of the exome region being covered by 20 or more reads per base. Finally, we validated 919 single-nucleotide variations (SNVs) and 841 insertions and deletions (indels), including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with approximately 5000X mean coverage.
SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. There were 11% of CG variants falling within targeted regions in exome sequencing that were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2%, and 99.1% of the GATK-only, SOAP-only and shared SNVs could be validated, but only 54.0%, 44.6%, and 78.1% of the GATK-only, SOAP-only and shared indels could be validated. Additionally, our analysis of two families (one with four individuals and the other with seven), demonstrated additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family.
Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels. We advocate for renewed collection and sequencing of multi-generational families to increase the overall accuracy of whole genomes.
PMCID: PMC3706896  PMID: 23537139
12.  Data handling strategies for high throughput pyrosequencers 
BMC Bioinformatics  2007;8(Suppl 1):S22.
New high throughput pyrosequencers such as the 454 Life Sciences GS 20 are capable of massively parallelizing DNA sequencing providing an unprecedented rate of output data as well as potentially reducing costs. However, these new pyrosequencers bear a different error profile and provide shorter reads than those of a more traditional Sanger sequencer. These facts pose new challenges regarding how the data are handled and analyzed, in addition, the steep increase in the sequencers throughput calls for much computation power at a low cost.
To address these challenges, we created an automated multi-step computation pipeline integrated with a database storage system. This allowed us to store, handle, index and search (1) the output data from the GS20 sequencer (2) analysis projects, possibly multiple on every dataset (3) final results of analysis computations (4) intermediate results of computations (these allow hand-made comparisons and hence further searches by the biologists). Repeatability of computations was also a requirement. In order to access the needed computation power, we ported the pipeline to the European Grid: a large community of clusters, load balanced as a whole. In order to better achieve this Grid port we created Vnas: an innovative Grid job submission, virtual sandbox manager and job callback framework.
After some runs of the pipeline aimed at tuning the parameters and thresholds for optimal results, we successfully analyzed 273 sequenced amplicons from a cancerous human sample and correctly found punctual mutations confirmed by either Sanger resequencing or NCBI dbSNP. The sequencing was performed with our 454 Life Sciences GS 20 pyrosequencer.
We handled the steep increase in throughput from the new pyrosequencer by building an automated computation pipeline associated with database storage, and by leveraging the computing power of the European Grid. The Grid platform offers a very cost effective choice for uneven workloads, typical in many scientific research fields, provided its peculiarities can be accepted (these are discussed). The mentioned infrastructure was used to analyze human amplicons for mutations. More analyses will be performed in the future.
PMCID: PMC1885852  PMID: 17430567
13.  Massively parallel tag sequencing reveals the complexity of anaerobic marine protistan communities 
BMC Biology  2009;7:72.
Recent advances in sequencing strategies make possible unprecedented depth and scale of sampling for molecular detection of microbial diversity. Two major paradigm-shifting discoveries include the detection of bacterial diversity that is one to two orders of magnitude greater than previous estimates, and the discovery of an exciting 'rare biosphere' of molecular signatures ('species') of poorly understood ecological significance. We applied a high-throughput parallel tag sequencing (454 sequencing) protocol adopted for eukaryotes to investigate protistan community complexity in two contrasting anoxic marine ecosystems (Framvaren Fjord, Norway; Cariaco deep-sea basin, Venezuela). Both sampling sites have previously been scrutinized for protistan diversity by traditional clone library construction and Sanger sequencing. By comparing these clone library data with 454 amplicon library data, we assess the efficiency of high-throughput tag sequencing strategies. We here present a novel, highly conservative bioinformatic analysis pipeline for the processing of large tag sequence data sets.
The analyses of ca. 250,000 sequence reads revealed that the number of detected Operational Taxonomic Units (OTUs) far exceeded previous richness estimates from the same sites based on clone libraries and Sanger sequencing. More than 90% of this diversity was represented by OTUs with less than 10 sequence tags. We detected a substantial number of taxonomic groups like Apusozoa, Chrysomerophytes, Centroheliozoa, Eustigmatophytes, hyphochytriomycetes, Ichthyosporea, Oikomonads, Phaeothamniophytes, and rhodophytes which remained undetected by previous clone library-based diversity surveys of the sampling sites. The most important innovations in our newly developed bioinformatics pipeline employ (i) BLASTN with query parameters adjusted for highly variable domains and a complete database of public ribosomal RNA (rRNA) gene sequences for taxonomic assignments of tags; (ii) a clustering of tags at k differences (Levenshtein distance) with a newly developed algorithm enabling very fast OTU clustering for large tag sequence data sets; and (iii) a novel parsing procedure to combine the data from individual analyses.
Our data highlight the magnitude of the under-sampled 'protistan gap' in the eukaryotic tree of life. This study illustrates that our current understanding of the ecological complexity of protist communities, and of the global species richness and genome diversity of protists, is severely limited. Even though 454 pyrosequencing is not a panacea, it allows for more comprehensive insights into the diversity of protistan communities, and combined with appropriate statistical tools, enables improved ecological interpretations of the data and projections of global diversity.
PMCID: PMC2777867  PMID: 19886985
14.  Version VI of the ESTree db: an improved tool for peach transcriptome analysis 
BMC Bioinformatics  2008;9(Suppl 2):S9.
The ESTree database (db) is a collection of Prunus persica and Prunus dulcis EST sequences that in its current version encompasses 75,404 sequences from 3 almond and 19 peach libraries. Nine peach genotypes and four peach tissues are represented, from four fruit developmental stages. The aim of this work was to implement the already existing ESTree db by adding new sequences and analysis programs. Particular care was given to the implementation of the web interface, that allows querying each of the database features.
A Perl modular pipeline is the backbone of sequence analysis in the ESTree db project. Outputs obtained during the pipeline steps are automatically arrayed into the fields of a MySQL database. Apart from standard clustering and annotation analyses, version VI of the ESTree db encompasses new tools for tandem repeat identification, annotation against genomic Rosaceae sequences, and positioning on the database of oligomer sequences that were used in a peach microarray study. Furthermore, known protein patterns and motifs were identified by comparison to PROSITE. Based on data retrieved from sequence annotation against the UniProtKB database, a script was prepared to track positions of homologous hits on the GO tree and build statistics on the ontologies distribution in GO functional categories. EST mapping data were also integrated in the database. The PHP-based web interface was upgraded and extended. The aim of the authors was to enable querying the database according to all the biological aspects that can be investigated from the analysis of data available in the ESTree db. This is achieved by allowing multiple searches on logical subsets of sequences that represent different biological situations or features.
The version VI of ESTree db offers a broad overview on peach gene expression. Sequence analyses results contained in the database, extensively linked to external related resources, represent a large amount of information that can be queried via the tools offered in the web interface. Flexibility and modularity of the ESTree analysis pipeline and of the web interface allowed the authors to set up similar structures for different datasets, with limited manual intervention.
PMCID: PMC2323672  PMID: 18387211
15.  Applying Next Generation Sequencing Technologies in a Core Facility Environment 
Journal of Biomolecular Techniques : JBT  2010;21(3 Suppl):S73-S74.
New DNA sequencing technologies presents an exceptional opportunity for novel and creative applications with the potential for breakthrough discoveries. To support such research efforts, the Cornell University Life Sciences Core Laboratories Center has implemented the Illumina Solexa Genome Analyzer IIx and the Roche 454 Genome Sequencer FLX platforms as academic core facility shared research resources. We have established sample handling methods, wikiLIMS tools and informatics analysis pipelines in support of these new technologies. Our DNA sequencing and genotyping core laboratory provides sample preparation and data generation services and in collaboration with the microarrays and informatics core facilities, provides both project consultation and analysis support for a wide range of possible applications, including de novo or reference based genome assembly, detection of genetic variation, transcriptome sequencing, small RNA profiling, and genome-wide epigenomic measurements of protein-nucleic interactions. Implementation of next generation sequencing platforms as shared resources with multi-disciplinary core facility support enables cost effective access and broad based use of these technologies.
PMCID: PMC2918155
16.  The IGOR Cloud Platform: Collaborative, Scalable, and Peer-Reviewed NGS Data Analysis 
Technical challenges facing researchers performing next-generation sequencing (NGS) analysis threaten to slow the pace of discovery and delay clinical applications of genomics data. Particularly for core laboratories, these challenges include: (1) Computation and storage have to scale with the vase amount of data generated. (2) Analysis pipelines are complex to design, set up, and share. (3) Collaboration, reproducibility, and sharing are hampered by privacy concerns and the sheer volume of data involved. Based on hands-on experience from large-scale NGS projects such as the 1000 Genomes Project, Seven Bridges Genomics has developed IGOR, a comprehensive cloud platform for NGS Data analysis that fully addresses these challenges: IGOR is a cloud-based platform for researchers and facilities to manage NGS data, design and run complex analysis pipelines, and efficiently collaborate on projects.Over a dozen curated and peer-reviewed NGS data analysis pipelines are publicly available for free, including alignment, variant calling, and RNA-Seq. All pipelines are based on open source tools and built to peer-reviewed specifications in close collaboration with researchers at leading institutions such as the Harvard Stem Cell Institute.Without any command-line knowledge, NGS pipelines can be built and customized in an intuitive graphical editor choosing from over 50 open source tools.When executing pipelines, IGOR automatically takes care of all resource management. Resources are seamlessly and automatically made available from Amazon Web Services and optimized for time and cost.Collaboration is facilitated through a project structure that allows researchers working in and across institutions to share files and pipelines. Fine-grained permissions allow detailed access control on a user-by-user basis for each project. Pipelines can be embedded and accessed through web pages akin to YouTube videos.Extensive batch processing and parallelization capabilities mean that hundreds of samples can be analyzed in the same amount of time that a single sample can be processed. Using file metadata, batch processing can be automated, e.g., by file, library, sample or lane.
The IGOR platform enables NGS research as a “turnkey” solution: Researchers can set up and run complex pipelines without expertise in command-line utilities or cloud computing. From a lab and facility perspective, the cloud-based architecture also eliminates the need to set up and maintain a large-scale infrastructure, typically resulting in at least 50% cost savings on infrastructure. By facilitating collaboration and easing analysis replication, the IGOR platform frees up the time of core laboratories to emphasize and focus on the research questions that ultimately guide them.
PMCID: PMC3635388
17.  A software pipeline for processing and identification of fungal ITS sequences 
Fungi from environmental samples are typically identified to species level through DNA sequencing of the nuclear ribosomal internal transcribed spacer (ITS) region for use in BLAST-based similarity searches in the International Nucleotide Sequence Databases. These searches are time-consuming and regularly require a significant amount of manual intervention and complementary analyses. We here present software – in the form of an identification pipeline for large sets of fungal ITS sequences – developed to automate the BLAST process and several additional analysis steps. The performance of the pipeline was evaluated on a dataset of 350 ITS sequences from fungi growing as epiphytes on building material.
The pipeline was written in Perl and uses a local installation of NCBI-BLAST for the similarity searches of the query sequences. The variable subregion ITS2 of the ITS region is extracted from the sequences and used for additional searches of higher sensitivity. Multiple alignments of each query sequence and its closest matches are computed, and query sequences sharing at least 50% of their best matches are clustered to facilitate the evaluation of hypothetically conspecific groups. The pipeline proved to speed up the processing, as well as enhance the resolution, of the evaluation dataset considerably, and the fungi were found to belong chiefly to the Ascomycota, with Penicillium and Aspergillus as the two most common genera. The ITS2 was found to indicate a different taxonomic affiliation than did the complete ITS region for 10% of the query sequences, though this figure is likely to vary with the taxonomic scope of the query sequences.
The present software readily assigns large sets of fungal query sequences to their respective best matches in the international sequence databases and places them in a larger biological context. The output is highly structured to be easy to process, although it still needs to be inspected and possibly corrected for the impact of the incomplete and sometimes erroneously annotated fungal entries in these databases. The open source pipeline is available for UNIX-type platforms, and updated releases of the target database are made available biweekly. The pipeline is easily modified to operate on other molecular regions and organism groups.
PMCID: PMC2649129  PMID: 19146660
18.  Integrated Core Facility Support and Optimization of Next Generation Sequencing Technologies 
New DNA sequencing technologies present an exceptional opportunity for novel and creative applications with the potential for breakthrough discoveries. To support such research efforts, the Cornell University Life Sciences Core Laboratories Center has implemented the Illumina HiSeq 2000 and the Roche 454 GS FLX platforms as academic core facility shared research resources. We have established sample handling methods, LIMS tools and BioHPC informatics analysis pipelines in support of these new technologies. Our genomics core laboratory, in collaboration with our epigenomics core and bioinformatics core, provides sample preparation and data generation services and both project consultation and analysis support for a wide range of possible applications, including de novo or reference based genome assembly, detection of genetic variation, transcriptome sequencing, small RNA profiling, and genome-wide epigenomic measurements of methylation and protein-nucleic acid interactions. Implementation of next generation sequencing platforms as shared resources with multidisciplinary core facility support enables cost effective access and broad based use of these technologies.
PMCID: PMC3186497
19.  CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies 
BMC Research Notes  2010;3:3.
Next generation sequencing (NGS) technologies have substantially increased the sequence output while the costs were dramatically reduced. In addition to the use in whole genome sequencing, the 454 GS-FLX platform is becoming a widely used tool for biodiversity surveys based on amplicon sequencing. In order to use NGS for biodiversity surveys, software tools are required, which perform quality control, trimming of the sequence reads, removal of PCR primers, and generation of input files for downstream analyses. A user-friendly software utility that carries out these steps is still lacking.
We developed CANGS (Cleaning and Analyzing Next Generation Sequences) a flexible and user-friendly integrated software utility: CANGS is designed for amplicon based biodiversity surveys using the 454 sequencing platform. CANGS filters low quality sequences, removes PCR primers, filters singletons, identifies barcodes, and generates input files for downstream analyses. The downstream analyses rely either on third party software (e.g.: rarefaction analyses) or CANGS-specific scripts. The latter include modules linking 454 sequences with the name of the closest taxonomic reference retrieved from the NCBI database and the sequence divergence between them. Our software can be easily adapted to handle sequencing projects with different amplicon sizes, primer sequences, and quality thresholds, which makes this software especially useful for non-bioinformaticians.
CANGS performs PCR primer clipping, filtering of low quality sequences, links sequences to NCBI taxonomy and provides input files for common rarefaction analysis software programs. CANGS is written in Perl and runs on Mac OS X/Linux and is available at
PMCID: PMC2830946  PMID: 20180949
20.  Parallel-META: efficient metagenomic data analysis based on high-performance computation 
BMC Systems Biology  2012;6(Suppl 1):S16.
Metagenomics method directly sequences and analyses genome information from microbial communities. There are usually more than hundreds of genomes from different microbial species in the same community, and the main computational tasks for metagenomic data analyses include taxonomical and functional component examination of all genomes in the microbial community. Metagenomic data analysis is both data- and computation- intensive, which requires extensive computational power. Most of the current metagenomic data analysis softwares were designed to be used on a single computer or single computer clusters, which could not match with the fast increasing number of large metagenomic projects' computational requirements. Therefore, advanced computational methods and pipelines have to be developed to cope with such need for efficient analyses.
In this paper, we proposed Parallel-META, a GPU- and multi-core-CPU-based open-source pipeline for metagenomic data analysis, which enabled the efficient and parallel analysis of multiple metagenomic datasets and the visualization of the results for multiple samples. In Parallel-META, the similarity-based database search was parallelized based on GPU computing and multi-core CPU computing optimization. Experiments have shown that Parallel-META has at least 15 times speed-up compared to traditional metagenomic data analysis method, with the same accuracy of the results
The parallel processing of current metagenomic data would be very promising: with current speed up of 15 times and above, binning would not be a very time-consuming process any more. Therefore, some deeper analysis of the metagenomic data, such as the comparison of different samples, would be feasible in the pipeline, and some of these functionalities have been included into the Parallel-META pipeline.
PMCID: PMC3403166  PMID: 23046922
21.  Calling SNPs without a reference sequence 
BMC Bioinformatics  2010;11:130.
The most common application for the next-generation sequencing technologies is resequencing, where short reads from the genome of an individual are aligned to a reference genome sequence for the same species. These mappings can then be used to identify genetic differences among individuals in a population, and perhaps ultimately to explain phenotypic variation. Many algorithms capable of aligning short reads to the reference, and determining differences between them have been reported. Much less has been reported on how to use these technologies to determine genetic differences among individuals of a species for which a reference sequence is not available, which drastically limits the number of species that can easily benefit from these new technologies.
We describe a computational pipeline, called DIAL (De novo Identification of Alleles), for identifying single-base substitutions between two closely related genomes without the help of a reference genome. The method works even when the depth of coverage is insufficient for de novo assembly, and it can be extended to determine small insertions/deletions. We evaluate the software's effectiveness using published Roche/454 sequence data from the genome of Dr. James Watson (to detect heterozygous positions) and recent Illumina data from orangutan, in each case comparing our results to those from computational analysis that uses a reference genome assembly. We also illustrate the use of DIAL to identify nucleotide differences among transcriptome sequences.
DIAL can be used for identification of nucleotide differences in species for which no reference sequence is available. Our main motivation is to use this tool to survey the genetic diversity of endangered species as the identified sequence differences can be used to design genotyping arrays to assist in the species' management. The DIAL source code is freely available at
PMCID: PMC2851604  PMID: 20230626
22.  A community-based resource for automatic exome variant-calling and annotation in Mendelian disorders 
BMC Genomics  2014;15(Suppl 3):S5.
Mendelian disorders are mostly caused by single mutations in the DNA sequence of a gene, leading to a phenotype with pathologic consequences. Whole Exome Sequencing of patients can be a cost-effective alternative to standard genetic screenings to find causative mutations of genetic diseases, especially when the number of cases is limited. Analyzing exome sequencing data requires specific expertise, high computational resources and a reference variant database to identify pathogenic variants.
We developed a database of variations collected from patients with Mendelian disorders, which is automatically populated thanks to an associated exome-sequencing pipeline. The pipeline is able to automatically identify, annotate and store insertions, deletions and mutations in the database. The resource is freely available online The exome sequencing pipeline automates the analysis workflow (quality control and read trimming, mapping on reference genome, post-alignment processing, variation calling and annotation) using state-of-the-art software tools. The exome-sequencing pipeline has been designed to run on a computing cluster in order to analyse several samples simultaneously. The detected variants are annotated by the pipeline not only with the standard variant annotations (e.g. allele frequency in the general population, the predicted effect on gene product activity, etc.) but, more importantly, with allele frequencies across samples progressively collected in the database itself, stratified by Mendelian disorder.
We aim at providing a resource for the genetic disease community to automatically analyse whole exome-sequencing samples with a standard and uniform analysis pipeline, thus collecting variant allele frequencies by disorder. This resource may become a valuable tool to help dissecting the genotype underlying the disease phenotype through an improved selection of putative patient-specific causative or phenotype-associated variations.
PMCID: PMC4083405  PMID: 25078076
23.  Management of High-Throughput DNA Sequencing Projects: Alpheus 
High-throughput DNA sequencing has enabled systems biology to begin to address areas in health, agricultural and basic biological research. Concomitant with the opportunities is an absolute necessity to manage significant volumes of high-dimensional and inter-related data and analysis. Alpheus is an analysis pipeline, database and visualization software for use with massively parallel DNA sequencing technologies that feature multi-gigabase throughput characterized by relatively short reads, such as Illumina-Solexa (sequencing-by-synthesis), Roche-454 (pyrosequencing) and Applied Biosystem’s SOLiD (sequencing-by-ligation). Alpheus enables alignment to reference sequence(s), detection of variants and enumeration of sequence abundance, including expression levels in transcriptome sequence. Alpheus is able to detect several types of variants, including non-synonymous and synonymous single nucleotide polymorphisms (SNPs), insertions/deletions (indels), premature stop codons, and splice isoforms. Variant detection is aided by the ability to filter variant calls based on consistency, expected allele frequency, sequence quality, coverage, and variant type in order to minimize false positives while maximizing the identification of true positives. Alpheus also enables comparisons of genes with variants between cases and controls or bulk segregant pools. Sequence-based differential expression comparisons can be developed, with data export to SAS JMP Genomics for statistical analysis.
PMCID: PMC2819532  PMID: 20151039
Alpheus; sequencing-by-synthesis; pyrosequencing; GMAP; GSNAP; resequencing; transcriptome sequencing
24.  Novel computational methods for increasing PCR primer design effectiveness in directed sequencing 
BMC Bioinformatics  2008;9:191.
Polymerase chain reaction (PCR) is used in directed sequencing for the discovery of novel polymorphisms. As the first step in PCR directed sequencing, effective PCR primer design is crucial for obtaining high-quality sequence data for target regions. Since current computational primer design tools are not fully tuned with stable underlying laboratory protocols, researchers may still be forced to iteratively optimize protocols for failed amplifications after the primers have been ordered. Furthermore, potentially identifiable factors which contribute to PCR failures have yet to be elucidated. This inefficient approach to primer design is further intensified in a high-throughput laboratory, where hundreds of genes may be targeted in one experiment.
We have developed a fully integrated computational PCR primer design pipeline that plays a key role in our high-throughput directed sequencing pipeline. Investigators may specify target regions defined through a rich set of descriptors, such as Ensembl accessions and arbitrary genomic coordinates. Primer pairs are then selected computationally to produce a minimal amplicon set capable of tiling across the specified target regions. As part of the tiling process, primer pairs are computationally screened to meet the criteria for success with one of two PCR amplification protocols. In the process of improving our sequencing success rate, which currently exceeds 95% for exons, we have discovered novel and accurate computational methods capable of identifying primers that may lead to PCR failures. We reveal the laboratory protocols and their associated, empirically determined computational parameters, as well as describe the novel computational methods which may benefit others in future primer design research.
The high-throughput PCR primer design pipeline has been very successful in providing the basis for high-quality directed sequencing results and for minimizing costs associated with labor and reprocessing. The modular architecture of the primer design software has made it possible to readily integrate additional primer critique tests based on iterative feedback from the laboratory. As a result, the primer design software, coupled with the laboratory protocols, serves as a powerful tool for low and high-throughput primer design to enable successful directed sequencing.
PMCID: PMC2396641  PMID: 18405373
25.  Extraction of human kinase mutations from literature, databases and genotyping studies 
BMC Bioinformatics  2009;10(Suppl 8):S1.
There is a considerable interest in characterizing the biological role of specific protein residue substitutions through mutagenesis experiments. Additionally, recent efforts related to the detection of disease-associated SNPs motivated both the manual annotation, as well as the automatic extraction, of naturally occurring sequence variations from the literature, especially for protein families that play a significant role in signaling processes such as kinases. Systematic integration and comparison of kinase mutation information from multiple sources, covering literature, manual annotation databases and large-scale experiments can result in a more comprehensive view of functional, structural and disease associated aspects of protein sequence variants. Previously published mutation extraction approaches did not sufficiently distinguish between two fundamentally different variation origin categories, namely natural occurring and induced mutations generated through in vitro experiments.
We present a literature mining pipeline for the automatic extraction and disambiguation of single-point mutation mentions from both abstracts as well as full text articles, followed by a sequence validation check to link mutations to their corresponding kinase protein sequences. Each mutation is scored according to whether it corresponds to an induced mutation or a natural sequence variant. We were able to provide direct literature links for a considerable fraction of previously annotated kinase mutations, enabling thus more efficient interpretation of their biological characterization and experimental context. In order to test the capabilities of the presented pipeline, the mutations in the protein kinase domain of the kinase family were analyzed. Using our literature extraction system, we were able to recover a total of 643 mutations-protein associations from PubMed abstracts and 6,970 from a large collection of full text articles. When compared to state-of-the-art annotation databases and high throughput genotyping studies, the mutation mentions extracted from the literature overlap to a good extent with the existing knowledgebases, whereas the remaining mentions suggest new mutation records that were not previously annotated in the databases.
Using the proposed residue disambiguation and classification approach, we were able to differentiate between natural variant and mutagenesis types of mutations with an accuracy of 93.88. The resulting system is useful for constructing a Gold Standard set of mutations extracted from the literature by human experts with minimal manual curation effort, providing direct pointers to relevant evidence sentences. Our system is able to recover mutations from the literature that are not present in state-of-the-art databases. Human expert manual validation of a subset of the literature extracted mutations conducted on 100 mutations from PubMed abstracts highlights that almost three quarters (72%) of the extracted mutations turned out to be correct, and more than half of these had not been previously annotated in databases.
PMCID: PMC2745582  PMID: 19758464

Results 1-25 (919689)