This article reviews basic concepts, general applications, and the potential impact of next-generation sequencing (NGS) technologies on genomics, with particular reference to currently available and possible future platforms and bioinformatics. NGS technologies have demonstrated the capacity to sequence DNA at unprecedented speed, thereby enabling previously unimaginable scientific achievements and novel biological applications. However, the massive volumes of data produced by NGS also present significant challenges for data storage, analysis, and management. Advanced bioinformatic tools are essential for the successful application of NGS technology. As evidenced throughout this review, NGS technologies will have a striking impact on genomic research and the entire biological field. With its ability to tackle challenges left unconquered by previous genomic technologies, NGS is likely to unravel the complexity of the human genome in terms of genetic variations, some of which may map to susceptibility loci for common human conditions. The impact of NGS technologies on genomics will be far-reaching and is likely to change the field for years to come.
Next-generation sequencing; Genomics; Genetic variation; Polymorphism; Targeted sequence enrichment; Bioinformatics
We focus specifically on fast data analysis and retrieval in bioinformatics, which will have a direct impact on the quality of human health and the environment. The exponential growth of data generated in biology research, from small atoms to big ecosystems, necessitates an increasingly large computational component to perform analyses. Novel DNA sequencing technologies and complementary high-throughput approaches, such as proteomics, genomics, metabolomics, and metagenomics, drive data-intensive bioinformatics. While individual research centers or universities could once provide the computing these applications require, this is no longer the case. Today, only specialized national centers can deliver the level of computing resources required to meet the challenges posed by rapid data growth and the resulting computational demand. Consequently, we are developing massively parallel applications to analyze the growing flood of biological data and contribute to the rapid discovery of novel knowledge.
Previous National Science Foundation (NSF) projects supported the generation of parallel modules for widely used bioinformatics applications on the Kraken supercomputer. We have profiled and optimized the code of some of the scientific community's most widely used desktop and small-cluster applications, including BLAST from the National Center for Biotechnology Information (NCBI), HMMER, and MUSCLE; scaled them to tens of thousands of cores on high-performance computing (HPC) architectures; made them robust and portable to next-generation architectures; and incorporated these parallel applications into science gateways with a web-based portal.
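The project's actual code is not shown in this summary, but the embarrassingly parallel sharding pattern that underlies this kind of scaling can be sketched. The fragment below is a minimal illustration, assuming mpi4py, NCBI BLAST+ (blastp) on the PATH, and hypothetical file and database names; it is not the optimized production code described above.

```python
# Hypothetical sketch: embarrassingly parallel BLAST via MPI rank-based sharding.
# Assumes mpi4py and NCBI BLAST+ are installed; all paths are illustrative.
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank reads the query set and takes every size-th record (small input assumed).
with open("queries.fasta") as fh:
    records = fh.read().split(">")[1:]           # crude FASTA record split

shard = records[rank::size]                      # round-robin assignment to ranks
shard_file = f"shard_{rank}.fasta"
with open(shard_file, "w") as out:
    out.writelines(">" + rec for rec in shard)

# Each rank searches its shard against a shared database, writing its own hits file.
subprocess.run(
    ["blastp", "-query", shard_file, "-db", "nr_subset",
     "-outfmt", "6", "-out", f"hits_{rank}.tsv"],
    check=True,
)
comm.Barrier()                                   # synchronize before any downstream merge
```

Because each query is independent, this pattern scales to as many cores as there are shards; the real engineering effort reported above lies in I/O optimization and robustness at tens of thousands of cores.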
This paper will discuss the various developmental stages, challenges, and solutions involved in taking bioinformatics applications from the desktop to petascale with a front-end portal for very-large-scale data analysis in the life sciences.
This research will help to bridge the gap between the rate of data generation and the speed at which scientists can study this data. The ability to rapidly analyze data at such a large scale is having a significant, direct impact on science achieved by collaborators who are currently using these tools on supercomputers.
With the completion of the human genome sequence, the functional analysis and characterization of the encoded proteins has become the next pressing challenge of the post-genome era. The lack of comprehensive ORFeome resources has thus far hampered systematic protein gain-of-function analyses. Gene and ORF coverage with full-length ORF clones thus needs to be extended. In combination with a unique and versatile cloning system, these clones will provide the tools for genome-wide systematic functional analyses and a deeper insight into complex biological processes.
Here we describe the generation of a full-ORF clone resource of human genes using the Gateway cloning technology (Invitrogen). A pipeline for efficient cloning and sequencing was developed, and a sample-tracking database was implemented to streamline the clone production process, targeting more than 2,200 different ORFs. In addition, a robust cloning strategy was established that permits the simultaneous generation of two clone variants containing a particular ORF both with and without a stop codon, at the cost of only one additional step in the cloning procedure. Up to 92% of the targeted ORFs were successfully amplified by PCR, and more than 93% of the amplicons were successfully cloned.
The German cDNA Consortium ORFeome resource currently consists of more than 3,800 sequence-verified entry clones, representing ORFs cloned with and without stop codons for about 1,700 different gene loci. A total of 177 splice variants, representing 121 of these genes, have also been cloned. The entry clones have been used to generate over 5,000 different expression constructs, providing the basis for functional profiling applications. As a member of the recently formed international ORFeome collaboration, we contribute substantially to generating a whole-genome human ORFeome collection in a unique cloning system that is made freely available to the community.
With the advent of DNA sequencing technologies, reference genome sequences are available for more and more organisms, and analyzing sequence variation and understanding its biological importance have become major research aims. However, storing and processing the huge amount of eukaryotic genome data, such as those of human, mouse and rice, has become a challenge for biologists. Currently available bioinformatics tools for compressing genome sequence data have limitations, such as requiring a reference single nucleotide polymorphism (SNP) map or information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS processes genome sequence data without reference SNPs or other sequence variation information and automatically rebuilds the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS achieved ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. When tested on sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB and the A. thaliana genome data from 115.1 MB to 6.5 KB. The compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
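GRS's actual algorithm is not reproduced here; purely as a minimal sketch of the general idea behind reference-based compression, the hypothetical code below stores a resequenced genome as a list of (position, base) differences from the reference and reconstructs it from the reference alone. Real tools also handle insertions and deletions; this toy handles substitutions only.

```python
# Minimal sketch of reference-based genome compression (not GRS itself):
# store only (position, base) substitutions relative to a reference sequence.

def compress(reference: str, individual: str) -> list[tuple[int, str]]:
    """Record positions where the individual differs from the reference.
    Assumes equal length, i.e. substitutions only (real tools handle indels)."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, individual)) if a != b]

def decompress(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the individual genome from the reference plus the diff list."""
    seq = list(reference)
    for pos, base in diffs:
        seq[pos] = base
    return "".join(seq)

ref = "ACGTACGTACGT"
ind = "ACGTACCTACGA"
diffs = compress(ref, ind)          # [(6, 'C'), (11, 'A')]
assert decompress(ref, diffs) == ind
```

Since most of an individual genome matches the reference, storing only the differences yields the very high compression ratios reported above.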
Motivation: The avalanche of data arriving since the development of NGS technologies has prompted the need for fast, accurate and easily automated bioinformatic tools capable of dealing with massive datasets. Among the most productive applications of NGS technologies is the sequencing of cellular RNA, known as RNA-Seq. Although RNA-Seq provides a dynamic range similar or superior to that of microarrays at similar or lower cost, the lack of standard, user-friendly pipelines is a bottleneck preventing RNA-Seq from becoming the standard for transcriptome analysis.
Results: In this work we present a pipeline for processing and analyzing RNA-Seq data that we have named Grape (Grape RNA-Seq Analysis Pipeline Environment). Grape supports raw sequencing reads produced by a variety of technologies, either in FASTA or FASTQ format, or as prealigned reads in SAM/BAM format. A minimal Grape configuration consists of the file location of the raw sequencing reads, the genome of the species and the corresponding gene and transcript annotation.
Grape first runs a set of quality control steps, and then aligns the reads to the genome, a step that is omitted for prealigned read formats. Grape next estimates gene and transcript expression levels, calculates exon inclusion levels and identifies novel transcripts.
Grape can be run on a single computer or in parallel on a computer cluster. It is distributed with specific mapping and quantification tools, but given its modular design, any tool supporting popular data interchange formats can be integrated.
Availability: Grape can be obtained from the Bioinformatics and Genomics website at: http://big.crg.cat/services/grape.
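Grape's internal interfaces are not described in this abstract; purely as an illustration of the modular, format-driven design sketched above (standard interchange formats so that tools can be swapped in, and prealigned inputs can skip alignment), here is a hypothetical Python fragment. All names and the step interface are invented for the example.

```python
# Hypothetical sketch of a modular pipeline in the spirit described above
# (not Grape's actual API): each step consumes and produces standard formats,
# so any tool speaking FASTQ/BAM/GTF could be swapped in.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]     # takes and returns a dict of file paths

def quality_control(files: dict) -> dict:
    print(f"QC on {files['reads']}")
    return files

def align(files: dict) -> dict:
    files["bam"] = files["reads"].replace(".fastq", ".bam")
    print(f"aligning {files['reads']} -> {files['bam']}")
    return files

def quantify(files: dict) -> dict:
    print(f"quantifying {files['bam']} with {files['annotation']}")
    return files

pipeline = [Step("qc", quality_control), Step("align", align), Step("quantify", quantify)]
state = {"reads": "sample.fastq", "genome": "genome.fa", "annotation": "genes.gtf"}
for step in pipeline:
    state = step.run(state)        # a prealigned BAM input could skip "align"
```

The design choice this illustrates is the one the abstract emphasizes: because each step only reads and writes common interchange formats, the mapping or quantification tool can be replaced without changing the rest of the pipeline.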
The Human Genome Project's completion of the human genome sequence in 2003 was a landmark scientific achievement of historic significance. It also signified a critical transition for the field of genomics, as the new foundation of genomic knowledge started to be used in powerful ways by researchers and clinicians to tackle increasingly complex problems in biomedicine. To exploit the opportunities provided by the human genome sequence and to ensure the productive growth of genomics as one of the most vital biomedical disciplines of the 21st century, the National Human Genome Research Institute (NHGRI) is pursuing a broad vision for genomics research beyond the Human Genome Project. This vision includes facilitating and supporting the highest-priority research areas that interconnect genomics to biology, to health, and to society. Current efforts in genomics research are focused on using genomic data, technologies, and insights to acquire a deeper understanding of biology and to uncover the genetic basis of human disease. Some of the most profound advances are being catalyzed by revolutionary new DNA sequencing technologies; these methods are already producing prodigious amounts of DNA sequence data, including from large numbers of individual patients. Such a capability, coupled with better associations between genetic diseases and specific regions of the human genome, is accelerating our understanding of the genetic basis of complex genetic disorders and of drug response. Together, these developments will usher in the era of genomic medicine.
Systems biology and infectious disease (host-pathogen-environment) research and development is becoming increasingly dependent on integrating data from diverse and dynamic sources, and maintaining integrated resources over long periods of time presents distinct challenges. This paper describes experiences and lessons learned from integrating data in two five-year projects focused on pathosystems biology: the Pathosystems Resource Integration Center (PATRIC, http://patric.vbi.vt.edu/), with a goal of developing bioinformatics resources for the research and countermeasures development communities based on genomics data, and the Resource Center for Biodefense Proteomics Research (RCBPR, http://www.proteomicsresource.org/), with a goal of developing resources based on experimental data, such as microarray and proteomics data, from diverse sources and technologies. Challenges include integrating genomic sequence and experimental data, data synchronization, data quality control, and usability engineering. We present examples of a variety of data integration problems drawn from our experiences with PATRIC and RCBPR, as well as open research questions related to long-term sustainability, and describe the next steps toward meeting these challenges. Novel contributions of this work include (1) an approach for addressing discrepancies between experiment results and interpreted results and (2) expanding the range of data integration techniques to include usability engineering at the presentation level.
High-density genotyping arrays have become established as a valuable research tool in human genetics. To date, more than 300 genome-wide association studies have been published for humans, reporting about 1,000 SNPs associated with a phenotype. In animal sciences, too, high-density genotyping arrays are being harnessed to analyse genetic variation. To exploit the full potential of this technology, the single nucleotide polymorphisms (SNPs) on the chips should be well characterized and their chromosomal positions precisely known. This, however, is a challenge if the genome sequence is still subject to changes.
We have developed a mapping strategy and a suite of software scripts to update the chromosomal positions of oligomer sequences used for SNP genotyping on high-density arrays. We describe the mapping procedure in detail so that scientists with moderate bioinformatics skills can reproduce it. We furthermore present a case study in which we re-mapped 54,001 oligomer sequences from Illumina's BovineSNP50 beadchip to the bovine genome sequence. In 992 cases we found substantial discrepancies between the manufacturer's annotations and our results. The software scripts in the Perl and R programming languages are provided as supplements.
The positions of oligomer sequences in the genome are volatile, even within one build of the genome. To facilitate the analysis of data from a GWAS or an expression study, especially in species whose genome assembly is still unstable, we recommend updating the oligomer positions before data analysis.
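The authors' actual scripts are Perl and R and are provided as supplements; as a hedged sketch of the kind of discrepancy check the case study describes, the hypothetical Python fragment below compares a manufacturer's probe annotation against freshly mapped coordinates and flags probes whose positions disagree. File names and the tab-separated layout are assumptions for the example.

```python
# Hypothetical sketch: flag probes whose freshly mapped position disagrees
# with the manufacturer's annotation (the paper's real scripts are Perl/R).
import csv

def load_positions(path):
    """Read probe_id, chrom, pos from a tab-separated file (assumed layout)."""
    with open(path) as fh:
        return {row[0]: (row[1], int(row[2])) for row in csv.reader(fh, delimiter="\t")}

annotated = load_positions("manufacturer_annotation.tsv")
remapped = load_positions("fresh_alignment.tsv")

discrepancies = {
    probe: (annotated[probe], remapped[probe])
    for probe in annotated
    if probe in remapped and annotated[probe] != remapped[probe]
}
print(f"{len(discrepancies)} probes moved; update positions before GWAS analysis")
```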
High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. However, the immense amount of sequence data requires computational skills and hardware resources that pose a challenge to molecular biologists. The DNA Data Bank of Japan (DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for high-throughput annotation of NGS reads. The DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets using decentralized processing on NIG supercomputers, currently free of charge. The pipeline consists of two analysis components: basic analysis for reference genome mapping and de novo assembly, and subsequent high-level analysis for structural and functional annotation. Users may smoothly switch between the two components, facilitating web-based operations on a supercomputer for high-throughput data analysis. Moreover, public NGS reads from the DDBJ Sequence Read Archive, located on the same supercomputer, can be imported into the pipeline by entering only an accession number. The pipeline thus facilitates research by providing unified analytical workflows for NGS data. The DDBJ Pipeline is accessible at http://p.ddbj.nig.ac.jp/.
next-generation sequencing; sequence read archive; cloud computing; analytical pipeline; genome analysis
EST sequencing projects are increasing in scale and scope as genome sequencing technologies migrate from core sequencing centers to individual research laboratories. Generating EST data is effectively no longer a bottleneck for investigators; however, processing large amounts of EST data remains a non-trivial challenge for many. Web-based EST analysis tools are proving to be the most convenient option for biologists, so these tools must continuously improve in utility to keep pace with the growing needs of research communities. We have developed a web-based EST analysis pipeline called ESTPiper, which streamlines typical large-scale EST analysis components.
The intuitive web interface guides users through each step of base calling, data cleaning, assembly, genome alignment, annotation, gene ontology (GO) analysis, and microarray oligonucleotide probe design. Each step is modularized, so a user can execute steps separately or together in batch mode. In addition, the user has control over the parameters used by the underlying programs. Extensive documentation of ESTPiper's functionality is embedded throughout the web site to facilitate understanding of the required input and interpretation of the computational results. The user can also download intermediate results and pass files to separate programs for further analysis. In addition, our server provides a time-stamped description of the run history for reproducibility. The pipeline can also be installed locally, allowing researchers to modify ESTPiper to suit their own needs.
ESTPiper streamlines the typical process of EST analysis. The pipeline was initially designed in part to support the Daphnia pulex cDNA sequencing project, and a web server hosting ESTPiper is now provided to support projects of all sizes. The software is also freely available from the authors for local installation.
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge from the detection of variants in individual sequencing.
Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP.
Availability: Implementation of this method is available at http://polymorphism.scripps.edu/∼vbansal/software/CRISP/
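To make the two approaches concrete without reproducing CRISP's implementation, the sketch below illustrates both tests on toy numbers: a binomial tail probability for approach (ii), asking how likely it is to see at least k non-reference calls among n reads from sequencing error alone, and a contingency-table comparison of allele counts across pools for approach (i). scipy is assumed; the error rate and counts are invented for illustration.

```python
# Minimal sketch of the two tests described above (not CRISP itself).
from scipy.stats import binom, chi2_contingency

def error_only_pvalue(k: int, n: int, error_rate: float = 0.01) -> float:
    """Approach (ii): P(X >= k) for X ~ Binomial(n, error_rate), i.e. the
    probability that k or more non-reference base calls arise from errors alone."""
    return binom.sf(k - 1, n, error_rate)

# Approach (i): compare allele counts across pools with a contingency table.
# Rows = pools, columns = (non-reference, reference) read counts (toy numbers).
counts = [[12, 188], [1, 199], [2, 198]]
chi2, p_pools, _, _ = chi2_contingency(counts)

print(error_only_pvalue(12, 200))   # small => unlikely to be sequencing errors alone
print(p_pools)                      # small => allele counts differ across pools
```

A real caller such as CRISP additionally incorporates strand distribution and pool size to filter false positives, as the abstract notes.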
Application of single nucleotide polymorphisms (SNPs) is revolutionizing human biomedical research. However, discovery of polymorphisms in low-polymorphic species is still a challenging and costly endeavor, despite the widespread availability of Sanger sequencing technology. We present CRoPS™ as a novel approach for polymorphism discovery that combines the power of reproducible genome complexity reduction by AFLP® with Genome Sequencer (GS) 20/GS FLX next-generation sequencing technology. With CRoPS, hundreds of thousands of sequence reads derived from complexity-reduced genome sequences of two or more samples are processed and mined for SNPs using a fully automated bioinformatics pipeline. We show that over 75% of putative maize SNPs discovered using CRoPS are successfully converted to SNPWave® assays, confirming them to be true SNPs derived from unique (single-copy) genome sequences. By using CRoPS, polymorphism discovery becomes affordable in organisms with high levels of repetitive DNA in the genome and/or low levels of polymorphism in the (breeding) germplasm, without the need for prior sequence information.
Until recently, sequencing has primarily been carried out in large genome centers, which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. Recent advances in next-generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data to highly diverse research groups, and it is expected that clinical sequencing will shortly become part of diagnostic routines. However, limited access to computational infrastructure and high-quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation, remain serious bottlenecks. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via Amazon Web Services using the Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole-exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
Technical challenges facing researchers performing next-generation sequencing (NGS) analysis threaten to slow the pace of discovery and delay clinical applications of genomics data. Particularly for core laboratories, these challenges include: (1) computation and storage have to scale with the vast amount of data generated; (2) analysis pipelines are complex to design, set up, and share; (3) collaboration, reproducibility, and sharing are hampered by privacy concerns and the sheer volume of data involved. Based on hands-on experience from large-scale NGS projects such as the 1000 Genomes Project, Seven Bridges Genomics has developed IGOR, a comprehensive cloud platform for NGS data analysis that fully addresses these challenges:
- IGOR is a cloud-based platform for researchers and facilities to manage NGS data, design and run complex analysis pipelines, and efficiently collaborate on projects.
- Over a dozen curated and peer-reviewed NGS data analysis pipelines are publicly available for free, including alignment, variant calling, and RNA-Seq. All pipelines are based on open source tools and built to peer-reviewed specifications in close collaboration with researchers at leading institutions such as the Harvard Stem Cell Institute.
- Without any command-line knowledge, NGS pipelines can be built and customized in an intuitive graphical editor, choosing from over 50 open source tools.
- When executing pipelines, IGOR automatically takes care of all resource management. Resources are seamlessly and automatically made available from Amazon Web Services and optimized for time and cost.
- Collaboration is facilitated through a project structure that allows researchers working in and across institutions to share files and pipelines. Fine-grained permissions allow detailed access control on a user-by-user basis for each project. Pipelines can be embedded in and accessed through web pages, akin to YouTube videos.
- Extensive batch processing and parallelization capabilities mean that hundreds of samples can be analyzed in the same amount of time that a single sample can be processed. Using file metadata, batch processing can be automated, e.g., by file, library, sample or lane.
The IGOR platform enables NGS research as a “turnkey” solution: researchers can set up and run complex pipelines without expertise in command-line utilities or cloud computing. From a lab and facility perspective, the cloud-based architecture also eliminates the need to set up and maintain large-scale infrastructure, typically resulting in at least 50% cost savings on infrastructure. By facilitating collaboration and easing analysis replication, the IGOR platform frees core laboratories to focus on the research questions that ultimately guide them.
Mitochondrial (mt) genomics represents an understudied but important field of molecular biology. Increasingly, mt dysfunction is being linked to a range of human diseases, including neurodegenerative disorders, diabetes and impairment of childhood development. In addition, mt genomes provide important markers for systematic, evolutionary and population genetic studies. Some technological limitations have prevented the expanded generation and utilization of mt genomic data for some groups of organisms. These obstacles most acutely impede, but are not limited to, studies requiring the determination of complete mt genomic data from minute amounts of material (e.g. biopsy samples or microscopic organisms). Furthermore, post-sequencing bioinformatic annotation and analyses of mt genomes are time consuming and inefficient. Herein, we describe a high-throughput sequencing and bioinformatic pipeline for mt genomics, which will have implications for the annotation and analysis of other organellar (e.g. plastid or apicoplast genomes) and virus genomes as well as long, contiguous regions in nuclear genomes. We utilize this pipeline to sequence and annotate the complete mt genomes of 12 species of parasitic nematode (order Strongylida) simultaneously, each from an individual organism. These mt genomic data provide a rich source of markers for studies of the systematics and population genetics of a group of socioeconomically important pathogens of humans and other animals.
Summary: Reduced representation bisulfite sequencing (RRBS) is a cost-effective approach for genome-wide methylation pattern profiling. Analyzing RRBS data is challenging, and specialized alignment/mapping programs are needed. Although such programs have been developed, a comprehensive solution that provides researchers with good-quality, analyzable data is still lacking. To address this need, we have developed a Streamlined Analysis and Annotation Pipeline for RRBS data (SAAP-RRBS) that integrates read quality assessment/clean-up, alignment, methylation data extraction, annotation, reporting and visualization. This package facilitates a rapid transition from sequencing reads to a fully annotated CpG methylation report and on to biological interpretation.
Availability and implementation: SAAP-RRBS is freely available to non-commercial users at the web site http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.
Supplementary data are available at Bioinformatics online.
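SAAP-RRBS's internals are not detailed beyond the steps listed above; as a generic sketch of the methylation-extraction step common to RRBS pipelines, the hypothetical code below calls per-CpG methylation levels from the bases observed at cytosine positions in bisulfite-converted reads, where a C indicates a methylated (conversion-protected) cytosine and a T an unmethylated one. The input format is an assumption for the example.

```python
# Generic sketch of methylation extraction (not SAAP-RRBS internals):
# in bisulfite data, an unmethylated C reads as T, a methylated C stays C.
from collections import defaultdict

def methylation_report(observations):
    """observations: iterable of (cpg_position, base_observed_at_cytosine)."""
    counts = defaultdict(lambda: {"C": 0, "T": 0})
    for pos, base in observations:
        if base in counts[pos]:
            counts[pos][base] += 1
    return {
        pos: c["C"] / (c["C"] + c["T"])          # fraction methylated at this CpG
        for pos, c in counts.items()
        if c["C"] + c["T"] > 0
    }

reads = [(1042, "C"), (1042, "C"), (1042, "T"), (2210, "T"), (2210, "T")]
print(methylation_report(reads))   # {1042: 0.666..., 2210: 0.0}
```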
The Genetics and Genomics group of the Advanced Technology Program of SAIC-Frederick exists to bring innovative genomic expertise, tools and analysis to NCI and the scientific community. The Sequencing Facility (SF) provides next-generation short-read (Illumina) sequencing capacity to investigators using a streamlined production approach. The Laboratory of Molecular Technology (LMT) offers a wide range of genomics core services, including microarray expression analysis, miRNA analysis, array comparative genome hybridization, long-read (Roche) next-generation sequencing, quantitative real-time PCR, transgenic genotyping, Sanger sequencing, and clinical mutation detection services, to investigators from across the NIH. SF and LMT are working together to bring online the third-generation Pacific Biosciences SMRT sequencing platform. As the technology supporting this genomic research becomes more complex, the need for basic quality processes within all aspects of the core service groups becomes critical. The Quality Management group works alongside members of these labs to establish or improve processes supporting operations control (equipment, reagent and materials management), process improvement (reengineering/optimization, automation, acceptance criteria for new technologies and tech transfer), and quality assurance and customer support (controlled documentation/SOPs, training, service deficiencies and continual improvement efforts). Implementation and expansion of quality programs within unregulated environments demonstrates SAIC-Frederick's dedication to providing the highest quality products and services to the NIH community.
High-throughput DNA sequencing platforms have become widely available. As a result, personal genomes are increasingly being sequenced in research and clinical settings. However, the resulting massive amounts of variant data pose significant challenges to biologists and clinicians without bioinformatics skills.
Methods and results
We developed a web server called wANNOVAR to address the critical need for functional annotation of genetic variants from personal genomes. The server provides a simple and intuitive interface to help users determine the functional significance of variants. Its functions include annotating single nucleotide variants and insertions/deletions for their effects on genes, reporting their conservation levels (such as PhyloP and GERP++ scores), calculating their predicted functional importance scores (such as SIFT and PolyPhen scores), retrieving allele frequencies in public databases (such as the 1000 Genomes Project and NHLBI-ESP 5400 exomes), and implementing a ‘variants reduction’ protocol to identify a subset of potentially deleterious variants/genes. We illustrate how wANNOVAR can help draw biological insights from sequencing data by analysing genetic variants generated in studies of two Mendelian diseases.
We conclude that wANNOVAR will help biologists and clinicians take advantage of the personal genome information to expedite scientific discoveries. The wANNOVAR server is available at http://wannovar.usc.edu, and will be continuously updated to reflect the latest annotation information.
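The exact 'variants reduction' protocol is configurable on the server; as a hedged sketch of the general idea only, the hypothetical filter below retains rare, predicted-deleterious variants using allele frequency and score thresholds. All field names and cutoffs are illustrative, not wANNOVAR's actual defaults.

```python
# Hypothetical sketch of a variants-reduction filter (thresholds illustrative):
# keep variants that are rare in population databases and predicted deleterious.

def reduce_variants(variants, max_af=0.01, max_sift=0.05, min_polyphen=0.85):
    """SIFT: lower = more deleterious; PolyPhen: higher = more deleterious."""
    kept = []
    for v in variants:
        rare = v["af_1000g"] <= max_af and v["af_esp5400"] <= max_af
        deleterious = v["sift"] <= max_sift or v["polyphen"] >= min_polyphen
        if rare and deleterious:
            kept.append(v)
    return kept

candidates = [
    {"gene": "GENE_A", "af_1000g": 0.001, "af_esp5400": 0.0,
     "sift": 0.01, "polyphen": 0.98},
    {"gene": "GENE_B", "af_1000g": 0.20, "af_esp5400": 0.18,
     "sift": 0.02, "polyphen": 0.99},   # common variant -> filtered out
]
print([v["gene"] for v in reduce_variants(candidates)])   # ['GENE_A']
```

For a dominant Mendelian disease, successive filters of this kind can reduce tens of thousands of exome variants to a short candidate list, which is the use case the abstract describes.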
Enabling deft data integration from numerous, voluminous and heterogeneous data sources is a major bioinformatic challenge. Several approaches have been proposed to address this challenge, including data warehousing and federated databasing. Yet despite the rise of these approaches, integration of data from multiple sources remains problematic and toilsome. These two approaches follow a user-to-computer communication model for data exchange and do not facilitate a broader concept of data sharing or collaboration among users. In this report, we discuss the potential of Web 2.0 technologies to transcend this model and enhance bioinformatics research. We propose a Web 2.0-based Scientific Social Community (SSC) model for the implementation of these technologies. By establishing a social, collective and collaborative platform for data creation, sharing and integration, we promote a pipeline built on web services for computer-to-computer data exchange that gains value as users contribute. This pipeline aims to simplify data integration and creation, to enable automatic analysis, and to facilitate the reuse and sharing of data. SSC can foster collaboration and harness collective intelligence to create and discover new knowledge. In addition to its research potential, we also describe its potential role as an e-learning platform in education. We discuss lessons from information technology, predict the next generation of the Web (Web 3.0), and describe its potential impact on the future of bioinformatics studies.
Web 2.0; bioinformatics; scientific social community; web service; pipelines
Emerging next-generation sequencing methods based on PCR technology boost genome sequencing speed considerably while also reducing costs, and they have been utilized to address a broad range of bioinformatics problems. Limited by the reliable output sequence length of next-generation sequencing technologies, studies are generally confined to gene fragments of roughly 30-50 bp, considerably shorter than traditional gene fragments. Anchoring gene fragments in a long reference sequence is an essential prerequisite for further assembly and analysis. Due to the sheer number of fragments produced by next-generation sequencing technologies and the huge size of reference sequences, anchoring rapidly becomes a computational bottleneck.
Results and discussion
We compared algorithm efficiency for BLAT, SOAP and EMBF, where efficiency is defined as the count of total output results divided by the time consumed to retrieve them. The data show that our algorithm EMBF has a 3-4 times efficiency advantage over SOAP, and at least 150 times over BLAT. Moreover, when the reference sequence size increases, the efficiency of SOAP degrades by as much as 30%, while the efficiency of EMBF tends to increase.
In conclusion, we deem that EMBF is more suitable for the short-fragment anchoring problem, where completeness and accuracy of the results are predominant concerns and the reference sequences are relatively large.
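As a worked illustration of the efficiency metric defined above (results divided by retrieval time), the toy comparison below uses invented numbers chosen only to be consistent with the reported ratios (EMBF roughly 3-4 times SOAP, at least 150 times BLAT); they are not the paper's measurements.

```python
# Toy illustration of the paper's efficiency metric (numbers are invented,
# chosen to be consistent with the reported 3-4x and >=150x ratios).

def efficiency(results: int, seconds: float) -> float:
    """Efficiency = total output results / time consumed to retrieve them."""
    return results / seconds

runs = {"EMBF": (1_000_000, 50.0), "SOAP": (1_000_000, 180.0), "BLAT": (1_000_000, 8_000.0)}
for tool, (hits, t) in runs.items():
    print(f"{tool}: {efficiency(hits, t):,.0f} results/second")
```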
Bioinformatics and genomics are closely related disciplines that hold great promise for the advancement of research and development in complex biomedical systems, as well as public health, drug design, comparative genomics, personalized medicine and so on. Research and development in these two important areas are shaping science and technology.
High-throughput sequencing and molecular imaging technologies marked the beginning of a new era for modern translational medicine and personalized healthcare. The availability of the human sequence and personalized digital images has also created tremendous demand for powerful supercomputing, statistical learning and artificial intelligence approaches to handle massive bioinformatics and personalized healthcare data, which will have a profound effect on how biomedical research is conducted toward the improvement of human health and the prolonging of human life. The International Society of Intelligent Biological Medicine (http://www.isibm.org) and its official journals, the International Journal of Functional Informatics and Personalized Medicine (http://www.inderscience.com/ijfipm) and the International Journal of Computational Biology and Drug Design (http://www.inderscience.com/ijcbdd), in collaboration with the International Conference on Bioinformatics and Computational Biology (Biocomp), are shaping tomorrow's bioinformatics and personalized medicine through today's efforts to promote research, education and awareness of this emerging integrated inter/multidisciplinary field. The 2007 International Conference on Bioinformatics and Computational Biology (BIOCOMP07) was held in Las Vegas, in the United States of America, on June 25-28, 2007. The conference attracted over 400 papers covering broad research areas in genomics, biomedicine and bioinformatics. Biocomp 2007 provided a common platform for the cross-fertilization of ideas and helped shape knowledge and scientific achievements by bridging these two very important disciplines into an interactive and attractive forum. Keeping this objective in mind, Biocomp 2007 aimed to promote interdisciplinary and multidisciplinary education and research. Twenty-five high-quality peer-reviewed papers were selected from over 400 submissions for this supplementary issue of BMC Genomics. These papers contribute to a wide range of important research fields, including gene expression data analysis and applications, high-throughput genome mapping, sequence analysis, gene regulation, protein structure prediction, disease prediction by machine learning techniques, systems biology, databases and biological software development. We encourage participants to submit proposals for genomics sessions, special-interest research sessions, workshops and tutorials to Professor Hamid R. Arabnia (firstname.lastname@example.org) to ensure that Biocomp continues to play a leadership role in promoting inter/multidisciplinary research and education in these fields. Biocomp received a top conference ranking with a high score of 0.95/1.00. Biocomp is academically co-sponsored by the International Society of Intelligent Biological Medicine and research laboratories and centers of Harvard University-Massachusetts Institute of Technology, Indiana University-Purdue University, Georgia Tech-Emory University, UIUC, UCLA, Columbia University, the University of Texas at Austin, the University of Iowa and others. Biocomp-Worldcomp brings leading scientists together from across the nation and all over the world and promotes synergistic components such as keynote lectures, special-interest sessions, workshops and tutorials in response to advances in cutting-edge research.
The arrival of both high-throughput and bench-top next-generation sequencing technologies and sequence enrichment methods has revolutionized our approach to dissecting the genetic basis of cancer. These technologies have been almost invariably employed in whole-genome sequencing (WGS) and whole-exome sequencing (WES) studies. Both WGS and WES approaches have been widely applied to interrogate the somatic mutational landscape of sporadic cancers and identify novel germline mutations underlying familial cancer syndromes. The clinical implications of cancer genome sequencing have become increasingly clear, for example in diagnostics. In this editorial, we present these advances in the context of research discovery and discuss both the clinical relevance of cancer genome sequencing and the challenges associated with the adoption of these genomic technologies in a clinical setting.
Next-generation sequencing; Exome; Cancer; Diagnostics; Familial cancer syndrome; Somatic mutation
The arrival of next-generation sequencing (NGS) technologies has led to novel opportunities for expression profiling and genome analysis by utilizing vast amounts of short read sequence data. Here, we demonstrate that expression profiling in organisms lacking any genome or transcriptome sequence information is feasible by combining Illumina’s mRNA-seq technology with a novel bioinformatics pipeline that integrates assembled and annotated Chinese hamster ovary (CHO) sequences with information derived from related organisms. We applied this pipeline to the analysis of CHO cells, which were chosen as a model system owing to their relevance in the production of therapeutic proteins. Specifically, we analysed CHO cells undergoing butyrate treatment, which is known to affect cell cycle regulation and to increase the specific productivity of recombinant proteins. By this means, we identified sequences for >13 000 CHO genes, which added sequence information for ∼5000 novel genes to the CHO model. More than 6000 transcript sequences are predicted to be complete, as they covered >95% of the corresponding mouse orthologs. Detailed analysis of selected biological functions, such as DNA replication and cell cycle control, demonstrated the potential of NGS expression profiling in organisms without an extended genome sequence to improve both data quantity and quality.
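As a hedged sketch of the completeness criterion mentioned above (a transcript counts as complete if it covers >95% of its mouse ortholog), the hypothetical code below computes ortholog coverage from aligned lengths. The data structure and identifiers are invented for the example; the paper's pipeline derives these lengths from actual alignments.

```python
# Hypothetical sketch of the completeness criterion described above:
# a CHO transcript counts as complete if it covers >95% of its mouse ortholog.

def is_complete(aligned_len: int, ortholog_len: int, threshold: float = 0.95) -> bool:
    """True if the assembled transcript covers more than `threshold` of the ortholog."""
    return aligned_len / ortholog_len > threshold

transcripts = [
    ("cho_tx_001", 1920, 2000),   # (id, bases aligned to ortholog, ortholog length)
    ("cho_tx_002", 900, 2000),    # partial assembly -> not complete
]
complete = [tx for tx, aln, ortho in transcripts if is_complete(aln, ortho)]
print(complete)                   # ['cho_tx_001']
```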
Recent rapid advances in next-generation RNA sequencing (RNA-Seq) provide researchers with unprecedentedly large data sets and open new perspectives in transcriptomics. Furthermore, RNA-Seq-based transcript profiling can be applied to non-model and newly discovered organisms because it does not require a predefined measuring platform (such as microarrays). However, these novel technologies pose new challenges: the raw data need to be rigorously quality checked and filtered prior to analysis, and proper statistical methods have to be applied to extract biologically relevant information. Given the sheer volume of data, this is no trivial task and requires a combination of considerable technical resources and bioinformatics expertise. To aid the individual researcher, we have developed RobiNA as an integrated solution that consolidates all steps of RNA-Seq-based differential gene-expression analysis in one user-friendly cross-platform application featuring a rich graphical user interface. RobiNA accepts raw FastQ files, SAM/BAM alignment files and count tables as input. It supports quality checking, flexible filtering and statistical analysis of differential gene expression based on state-of-the-art biostatistical methods developed in the R/Bioconductor projects. In-line help and a step-by-step manual guide users through the analysis. Installer packages for Mac OS X, Windows and Linux are available under the LGPL licence from http://mapman.gabipd.org/web/guest/robin.