Related Articles

1.  SHRiMP: Accurate Mapping of Short Color-space Reads 
PLoS Computational Biology  2009;5(5):e1000386.
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at
Author Summary
Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or “dibase”, reads generated by AB SOLiD sequencers.
PMCID: PMC2678294  PMID: 19461883
2.  ComB: SNP Calling and Mapping Analysis for Color and Nucleotide Space Platforms 
Journal of Computational Biology  2011;18(6):795-807.
The determination of single nucleotide polymorphisms (SNPs) has become faster and more cost effective since the advent of short read data from next generation sequencing platforms such as Roche's 454 Sequencer, Illumina's Solexa platform, and Applied Biosystems' SOLiD sequencer. The SOLiD sequencing platform, which is capable of producing more than 6 GB of sequence data in a single run, uses a unique encoding scheme where color reads represent transitions between adjacent nucleotides. The determination of SNPs from color reads usually involves the translation of color alignments to likely nucleotide strings to facilitate the use of tools designed for nucleotide reads. This technique results in the loss of significant information in the color read, producing many incorrect SNP calls, especially if regions exist with dense or adjacent polymorphism. Additionally, color reads align ambiguously and incorrectly more often than nucleotide reads, making integrated SNP calling a difficult challenge. We have developed ComB, a SNP calling tool which operates directly in color space, using a Bayesian model to incorporate unique and ambiguous reads to iteratively determine SNP identity. ComB is capable of accurately calling short consecutive nucleotide polymorphisms and densely clustered SNPs, both of which other SNP calling tools fail to identify. ComB, which is capable of using billions of short reads to accurately and efficiently perform whole human genome SNP calling in parallel, is also capable of using sequence data or even integrating sequence and color space data sets. We use real and simulated data to demonstrate that ComB's iterative strategy and recalibration of quality scores allow it to discover more true SNPs while calling fewer false positives than tools which use only color alignments as well as tools which translate color reads to nucleotide strings.
PMCID: PMC3122929  PMID: 21563978
algorithms; genome analysis; SNPs; next generation sequencing; statistics
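The color encoding described in the abstract above, where each color represents the transition between two adjacent nucleotides, can be illustrated with a minimal sketch. It assumes the standard two-base encoding in which a color is the XOR of 2-bit nucleotide codes (A=0, C=1, G=2, T=3); the function names and the example sequence are hypothetical and are not taken from ComB or any other tool listed here.

```python
# 2-bit codes chosen so that the XOR of two adjacent bases gives the color (0-3).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode(seq):
    """Return the first base plus one color per pair of adjacent nucleotides."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the nucleotide string by chaining colors from the known first base."""
    out = [first_base]
    for c in colors:
        out.append(BASE[CODE[out[-1]] ^ c])
    return "".join(out)

ref = "ACGGTCA"                      # hypothetical reference fragment
first, colors = encode(ref)

# A true SNP (G->A at index 3) changes two adjacent colors.
_, snp_colors = encode("ACGATCA")
changed_colors = sum(x != y for x, y in zip(colors, snp_colors))

# A single color miscall, once translated, corrupts every downstream base.
miscalled = list(colors)
miscalled[2] ^= 1
decoded_bad = decode(first, miscalled)
wrong_bases = sum(x != y for x, y in zip(ref, decoded_bad))

print(colors, changed_colors, decoded_bad, wrong_bases)
```

The contrast between the two cases is why translating color reads to nucleotides before calling SNPs loses information: an isolated color change looks like an error, while two adjacent, consistent color changes look like a polymorphism.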
3.  Detection of microRNAs in color space 
Bioinformatics  2011;28(3):318-323.
Motivation: Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color-space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs.
Results: Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3′ end of the read such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs.
Availability and implementation: A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3268249  PMID: 22171334
4.  Digital Gene Expression of miRNA in Osteosarcoma Xenografts: Finding Biological Relevance in miRNA High Throughput Sequencing Data 
High throughput DNA sequencing is a powerful tool for profiling miRNA expression but presents many computational challenges. We have used xenograft osteosarcoma tumor lines with SOLiD sequencing technology to identify expression of miRNAs that is altered by anti-IGF1R antibody treatment. The quality of the sequencing runs was assessed with quality values for color space calls, total reads, correlation between biological replicates of mapped reads and percent mapped reads. To take advantage of the low error rate for base calling of SOLiD sequencing technology, sequence alignments must be performed in color space. In addition, the size range of miRNAs (18-28 bp) must be considered when aligning reads. The Small RNA Analysis Tool (RNA2MAP v 0.5), an Applied Biosystems free open source software tool that takes these points into account, was used to filter out reads resulting from adapters, transfer RNA (tRNA), ribosomal RNA, repetitive elements (including Alu, LINE and LTR), centromeric and telomeric satellites, small cytoplasmic RNA (scRNA) and small nuclear RNA (snRNA). Remaining reads were sequentially aligned to miRNAs present in miRBase v14 and then to the human genome reference sequence (assembly GRCh37) using RNA2MAP. Differentially expressed miRNAs, determined with a rank product non-parametric method using RankProd (a Bioconductor package), include hsa-mir-654 and hsa-mir-370. Interestingly, IRS1 mRNA is a TargetScan predicted target of hsa-mir-370 and IRS1 protein interacts with IGF1R. Throughout this analysis, several tools including custom scripts were needed to parse and format data. Differentially expressed miRNAs and their target genes may provide insight into the susceptibility of some osteosarcoma tumors and the resistance of others to anti-IGF1R antibody treatment. These studies may also lead to the identification of biomarkers for drug resistance.
PMCID: PMC2918036
5.  Fast and accurate short read alignment with Burrows–Wheeler transform 
Bioinformatics  2009;25(14):1754-1760.
Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.
Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package.
PMCID: PMC2705234  PMID: 19451168
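The backward search with the Burrows–Wheeler Transform mentioned in the abstract above can be sketched for the exact-match case. This is a toy illustration only: it builds the suffix array by naive sorting, stores full occurrence tables, and does not handle the mismatches and gaps that BWA supports. All function names are hypothetical and do not reflect BWA's implementation.

```python
from collections import Counter

def bwt_index(text):
    """Build a suffix array, the C table, and occurrence counts for text."""
    text += "$"  # sentinel: assumed absent from text and smaller than any letter
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = Counter(text)
    # C[ch]: number of characters in text strictly smaller than ch
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for k in occ:
            occ[k][i + 1] = occ[k][i] + (k == ch)
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = bwt_index("BANANA")
hits = backward_search("ANA", *index)
misses = backward_search("NAB", *index)
print(hits, misses)
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the size of the reference, which is the property that makes this approach attractive for large genomes.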
6.  Single Nucleotide Polymorphisms (SNPs) Detection in Kinases of Lung Tumors by Exon Capture and SOLiD Sequencing Technologies 
Next-generation sequencing (NGS) technologies are highly affordable and powerful tools for biomedical research. Methods for exon enrichment provide a means for focused study of protein-coding regions that may be involved in diseases. For this study, we combined Agilent exon capture and Applied Biosystems SOLiD sequencing technologies to determine single nucleotide polymorphisms (SNPs) specifically in the kinase genes of human lung tumors. Quantity and quality of genomic DNA from paired samples of normal and tumor lung tissue (N=5) were assessed with the Qubit® 2.0 fluorometer and Agilent 2100 Bioanalyzer. A Covaris® S2 system was utilized for shearing DNA (∼165 bp). After ligation of specific adaptors, the kinase exome library was enriched using the Agilent SureSelect human kinome capture system utilizing 120-bp biotinylated RNA probes. Bound DNA was purified and barcoded for sample identification. Kinase captured libraries were quantified and pooled for amplification by emulsion PCR. Template beads for all 5 tumor-normal paired samples were pooled on a single slide and sequenced on the SOLiD 4 platform. Image analysis and base calling were performed with the SOLiD System Analysis Pipeline tool. Read length was 50 bp. GenomeQuest NGS analysis tools were used to map SOLiD reads to the reference human genome (Build-hg19) as well as for SNP identification. An average of ∼11 million reads was mapped per sample. Data analyses showed that 88.8% of the kinome probes are fully covered at a depth of 1, whereas 95.55% of the probes had a depth of coverage greater than 10X. In addition, 72.2% of the sequence reads were either within or overlapping the target. Off-target reads (27.8%) were evenly distributed among chromosomes with no bias toward GC-rich regions. Overall, exon capture and NGS technologies are reliable and cost-effective approaches for SNP detection and suitable for other applications in biomedical research.
PMCID: PMC3630643
7.  A robust framework for detecting structural variations in a genome 
Bioinformatics  2008;24(13):i59-i67.
Motivation: Recently, structural genomic variants have come to the forefront as a significant source of variation in the human population, but the identification of these variants in a large genome remains a challenge. The complete sequencing of a human individual is prohibitive at current costs, while current polymorphism detection technologies, such as SNP arrays, are not able to identify many of the large scale events. One of the most promising methods to detect such variants is the computational mapping of clone-end sequences to a reference genome.
Results: Here, we present a probabilistic framework for the identification of structural variants using clone-end sequencing. Unlike previous methods, our approach does not rely on an a priori determined mapping of all reads to the reference. Instead, we build a framework for finding the most probable assignment of sequenced clones to potential structural variants based on the other clones. We compare our predictions with the structural variants identified in three previous studies. While there is a statistically significant correlation between the predictions, we also find a significant number of previously uncharacterized structural variants. Furthermore, we identify a number of putative cross-chromosomal events, primarily located proximally to the centromeres of the chromosomes.
Availability: Our dataset, results and source code are available at,,
PMCID: PMC2718654  PMID: 18586745
8.  Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms 
PLoS ONE  2011;6(5):e19534.
Next-generation sequencing technologies enable the rapid cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, investigation of the quality of sequence reads obtained from these methods is important. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequencers, i.e., Roche Genome Sequencer FLX System (FLX), Illumina Genome Analyzer (GA), and Applied Biosystems SOLiD system (SOLiD). A common genomic DNA sample obtained from Escherichia coli strain DH1 was applied to these sequencers. The obtained sequence reads were aligned to the complete genome sequence of E. coli DH1, to evaluate the accuracy and sequence bias of these sequence methods. We found that the fraction of “junk” data, which could not be aligned to the reference genome, was largest in the data set of SOLiD, in which about half of reads could not be aligned. Among data sets after alignment to the reference, sequence accuracy was poorest in GA data sets, suggesting relatively low fidelity of the elongation reaction in the GA method. Furthermore, by aligning the sequence reads to the E. coli strain W3110, we screened sequence differences between two E. coli strains using data sets of three different next-generation platforms. The results revealed that the detected sequence differences were similar among these three methods, while the sequence coverage required for the detection was significantly small in the FLX data set. These results provided valuable information on the quality of short sequence reads and the performance of SNP detection in three next-generation sequencing platforms.
PMCID: PMC3096631  PMID: 21611185
9.  U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line 
PLoS Genetics  2010;6(1):e1000832.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Author Summary
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
PMCID: PMC2813426  PMID: 20126413
10.  Coverage Bias and Sensitivity of Variant Calling for Four Whole-genome Sequencing Technologies 
PLoS ONE  2013;8(6):e66621.
The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics. Currently, there are four commonly used whole-genome sequencing platforms on the market: Illumina’s HiSeq2000, Life Technologies’ SOLiD 4 and its completely redesigned 5500xl SOLiD, and Complete Genomics’ technology. A number of earlier studies have compared a subset of those sequencing platforms or compared those platforms with Sanger sequencing, which is prohibitively expensive for whole genome studies. Here we present a detailed comparison of the performance of all currently available whole genome sequencing platforms, especially regarding their ability to call SNVs and to evenly cover the genome and specific genomic regions. Unlike earlier studies, we base our comparison on four different samples, allowing us to assess the between-sample variation of the platforms. We find a pronounced GC bias in GC-rich regions for Life Technologies’ platforms, with Complete Genomics performing best here, while we see the least bias in GC-poor regions for HiSeq2000 and 5500xl. HiSeq2000 gives the most uniform coverage and displays the least sample-to-sample variation. In contrast, Complete Genomics exhibits by far the smallest fraction of bases not covered, while the SOLiD platforms reveal remarkable shortcomings, especially in covering CpG islands. When comparing the performance of the four platforms for calling SNPs, HiSeq2000 and Complete Genomics achieve the highest sensitivity, while the SOLiD platforms show the lowest false positive rate. Finally, we find that integrating sequencing data from different platforms offers the potential to combine the strengths of different technologies. In summary, our results detail the strengths and weaknesses of all four whole-genome sequencing platforms. They indicate application areas that call for a specific sequencing platform and disallow other platforms. This helps to identify the proper sequencing platform for whole genome studies with different application scopes.
PMCID: PMC3679043  PMID: 23776689
11.  SOPRA: Scaffolding algorithm for paired reads via statistical optimization 
BMC Bioinformatics  2010;11:345.
High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome.
We have developed SOPRA, a tool designed to exploit the mate pair/paired-end information for assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with vertices and with edges of the contig connectivity graph. Vertices of this graph are individual contigs with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor quality results in the current context. SOPRA circumvents this problem by treating all the constraints on equal footing for solving the optimization problem, the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing of constraints is iterated till one reaches a core set of consistent constraints. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors.
Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.
PMCID: PMC2909219  PMID: 20576136
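The abstract above reports scaffold quality as an N50 value. As a reminder of what that statistic measures, here is a minimal sketch using the common definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembled length. The function name and the example lengths are hypothetical and are not taken from SOPRA.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # the longest scaffolds now cover half the assembly
            return length
    raise ValueError("no lengths given")

# Hypothetical scaffold lengths in kb; the total is 290, so half is 145.
example = n50([80, 70, 50, 40, 30, 20])
print(example)
```

Because N50 only summarizes contiguity, it says nothing about correctness, which is why the abstract reports mismatch and rearrangement error rates alongside it.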
12.  Comparison of Sequencing Platforms for Single Nucleotide Variant Calls in a Human Sample 
PLoS ONE  2013;8(2):e55089.
Next-generation sequencing platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high speed and with large cost savings. We compare sequencing platforms from Roche/454 (GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regard to coverage and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry, a method that is independent of DNA sequencing. We establish several causes why variants remained unreported, specific to each platform. We report the indels called using the three sequencing technologies, and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when a high level of accuracy is required.
PMCID: PMC3566181  PMID: 23405114
13.  Management of High-Throughput DNA Sequencing Projects: Alpheus 
High-throughput DNA sequencing has enabled systems biology to begin to address areas in health, agricultural and basic biological research. Concomitant with the opportunities is an absolute necessity to manage significant volumes of high-dimensional and inter-related data and analysis. Alpheus is an analysis pipeline, database and visualization software for use with massively parallel DNA sequencing technologies that feature multi-gigabase throughput characterized by relatively short reads, such as Illumina-Solexa (sequencing-by-synthesis), Roche-454 (pyrosequencing) and Applied Biosystems’ SOLiD (sequencing-by-ligation). Alpheus enables alignment to reference sequence(s), detection of variants and enumeration of sequence abundance, including expression levels in transcriptome sequence. Alpheus is able to detect several types of variants, including non-synonymous and synonymous single nucleotide polymorphisms (SNPs), insertions/deletions (indels), premature stop codons, and splice isoforms. Variant detection is aided by the ability to filter variant calls based on consistency, expected allele frequency, sequence quality, coverage, and variant type in order to minimize false positives while maximizing the identification of true positives. Alpheus also enables comparisons of genes with variants between cases and controls or bulk segregant pools. Sequence-based differential expression comparisons can be developed, with data export to SAS JMP Genomics for statistical analysis.
PMCID: PMC2819532  PMID: 20151039
Alpheus; sequencing-by-synthesis; pyrosequencing; GMAP; GSNAP; resequencing; transcriptome sequencing
14.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;27(14):1915-1921.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at
PMCID: PMC3129524  PMID: 21586516
15.  Revising a Personal Genome by Comparing and Combining Data from Two Different Sequencing Platforms 
PLoS ONE  2013;8(4):e60585.
For the robust practice of genomic medicine, sequencing results must be compatible, regardless of the sequencing technologies and algorithms used. Presently, genome sequencing is still an imprecise science and is complicated by differences in the chemistry, coverage, alignment, and variant-calling algorithms. We identified ∼3.33 million single nucleotide variants (SNVs) and ∼3.62 million SNVs in the SJK genome using SOLiD and Illumina data, respectively. Approximately 3 million SNVs were concordant between the two platforms while 68,532 SNVs were discordant; 219,616 SNVs were SOLiD-specific and 516,080 SNVs were Illumina-specific (i.e., platform-specific). Concordant, discordant, and platform-specific SNVs were further analyzed and characterized. Overall, a large portion of heterozygous SNVs that were discordant with genotyping calls of single nucleotide polymorphism chips were highly confident. Approximately 70% of the platform-specific SNVs were located in regions containing repetitive sequences. Such platform-specificity may arise from differences between platforms, with regard to read length (36 bp and 72 bp vs. 50 bp), insert size (∼100–300 bp vs. ∼1–2 kb), sequencing chemistry (sequencing-by-synthesis using single nucleotides vs. ligation-based sequencing using oligomers), and sequencing quality. When data from the two platforms were merged for variant calling, the proportion of callable regions of the reference genome increased to 99.66%, which was 1.43% higher than the average callability of the two platforms, representing ∼40 million bases. In this study, we compared the differences in sequencing results between two sequencing platforms. Approximately 90% of the SNVs were concordant between the two platforms, yet ∼10% of the SNVs were either discordant or platform-specific, indicating that each platform had its own strengths and weaknesses. 
When data from the two platforms were merged, both the overall callability of the reference genome and the overall accuracy of the SNVs improved, demonstrating the likelihood that a re-sequenced genome can be revised using complementary data.
PMCID: PMC3620462  PMID: 23593254
16.  A Balanced Barcoding System for Multiplexed DNA library and SOLiD SAGE “Sequencing” 
A set of 96 molecular barcode adaptors specifically designed for the SOLiD™ platform have been validated for use with DNA fragment and paired end libraries. Moreover, the barcode system is adapted for multiplexed Serial Analysis of Gene Expression (SAGE). DNA libraries are constructed with a multiplex adaptor which consists of three segments: (1) an internal sequencing primer binding site, (2) a barcode decamer sequence and (3) a P2 PCR priming site. The barcode and target DNA are then sequenced as two separate reads from the same strand allowing for the libraries to be pooled in a multiplexed emulsion PCR and deposited into a single spot on a SOLiD™ slide. Similarly, SAGE libraries are constructed with a modified adaptor allowing for the addition of unique barcode primers with a short cycle amplification consistent with the SOLiD™ barcoding system. The modular barcoding design requires only 5bp of sequencing to distinguish 16-plex samples and 10bp of sequencing to distinguish 96-plex samples. The barcodes are optimized in sets of four wherein each set is color balanced at every position. Importantly, clear discrimination between barcode samples is achieved by maintaining a minimum Hamming distance of 3 colorspace calls for optimal data integrity. The DNA barcode system was validated by sequencing of E. coli fragment libraries. Error rates and quality value (QV) scores for the barcode reads were found to be consistent across the final set. Importantly, QV scores were also consistent for the reads, indicating minimal effects of the barcode decamers on bead templating and ligation sequencing efficiency. Furthermore, the set of 16 SAGE barcoded samples yielded Pearson correlations above 0.98. Ongoing development studies include integration with methods of target enrichment that will further enable high levels of DNA and RNA expression library multiplexing afforded by the increasing throughput of the SOLiD™ system.
PMCID: PMC2918128
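The two design constraints described in this abstract (a minimum Hamming distance between barcodes, and color balance at every position within a set of four) can be checked with a few lines of code. The following is a minimal sketch, not the authors' design tool: the function names and the four-color toy barcodes are hypothetical, and colors are assumed to be written as the digits 0 to 3.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length color strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distance(barcodes):
    """Smallest pairwise Hamming distance within a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def is_color_balanced(barcodes):
    """True if each of the four colors is equally frequent at every position."""
    n = len(barcodes)
    for column in zip(*barcodes):
        if any(column.count(color) * 4 != n for color in "0123"):
            return False
    return True


# Hypothetical toy set of four barcodes (not taken from the paper).
toy_set = ["0123", "1032", "2301", "3210"]
print(min_hamming_distance(toy_set))  # 4
print(is_color_balanced(toy_set))     # True
```

A set passing both checks with a minimum distance of at least 3 would satisfy the criteria stated in the abstract; the real barcodes are decamers rather than the 4-color strings used here.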
17.  Error-correcting properties of the SOLiD Exact Call Chemistry 
BMC Bioinformatics  2012;13:145.
The Exact Call Chemistry for the SOLiD Next-Generation Sequencing platform augments the two-base-encoding chemistry with an additional round of ligation, using an alternative set of probes, that allows some mistakes made when reading the first set of probes to be corrected. Additionally, the Exact Call Chemistry allows reads produced by the platform to be decoded directly into nucleotide sequence rather than its two-base ‘color’ encoding.
We apply the theory of linear codes to analyse the new chemistry, showing the types of sequencing mistakes it can correct and identifying those where the presence of an error can only be detected. For isolated mistakes that cannot be unambiguously corrected, we show that the type of substitution can be determined, and its location can be narrowed down to two or three positions, leading to a significant reduction in the number of plausible alternative reads.
The Exact Call Chemistry increases the accuracy of the SOLiD platform, enabling many potential miscalls to be prevented. However, single miscalls in the color sequence can produce complex but localised patterns of error in the decoded nucleotide sequence. Analysis of similar codes shows that some exist that, if implemented in alternative chemistries, should have superior performance.
PMCID: PMC3464616  PMID: 22726842
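The two-base 'color' encoding that this chemistry builds on can be sketched as follows. This is an illustrative sketch only: it models the basic color code (the XOR of 2-bit base labels with A=0, C=1, G=2, T=3), not the additional Exact Call ligation round, and it anchors each read on its first base rather than on a primer base, which is a simplifying assumption. The function names are hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a color is the XOR of adjacent base labels


def encode(seq):
    """Return the first base followed by one color digit per adjacent base pair."""
    if not seq:
        raise ValueError("sequence must not be empty")
    colors = (str(BASES.index(a) ^ BASES.index(b)) for a, b in zip(seq, seq[1:]))
    return seq[0] + "".join(colors)


def decode(read):
    """Translate an anchored color read back into nucleotides."""
    current = BASES.index(read[0])
    out = [read[0]]
    for color in read[1:]:
        current ^= int(color)  # each color gives the transition to the next base
        out.append(BASES[current])
    return "".join(out)


print(encode("ATGGCA"))  # A31031
print(decode("A31031"))  # ATGGCA
# One miscalled color (second color 1 -> 2) changes every base after it:
print(decode("A32031"))  # ATCCGT
```

The last call shows why a single color miscall is harmful when reads are naively translated to bases: in the plain two-base code, every base downstream of the miscall is altered.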
18.  FadE: whole genome methylation analysis for multiple sequencing platforms 
Nucleic Acids Research  2012;41(1):e14.
DNA methylation plays a central role in genomic regulation and disease. Sodium bisulfite treatment (SBT) causes unmethylated cytosines to be sequenced as thymine, which allows methylation levels to be reflected in the number of ‘C’-‘C’ alignments covering reference cytosines. Di-base color reads produced by Life Technologies’ SOLiD sequencer provide unreliable results when translated to bases because single sequencing errors affect the downstream sequence. We describe FadE, an algorithm to accurately determine genome-wide methylation rates directly in color or nucleotide space. FadE uses SBT unmethylated and untreated data to determine background error rates and incorporate them into a model which uses Newton–Raphson optimization to estimate the methylation rate and provide a credible interval describing its distribution at every reference cytosine. We sequenced two slides of human fibroblast cell-line bisulfite-converted fragment library with the SOLiD sequencer to investigate genome-wide methylation levels. FadE reported widespread differences in methylation levels across CpG islands and a large number of differentially methylated regions adjacent to genes, which compares favorably to the results of an investigation on the same cell-line using nucleotide-space reads at higher coverage levels, suggesting that FadE is an accurate method to estimate genome-wide methylation with color or nucleotide reads.
PMCID: PMC3592442  PMID: 22965123
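The idea that methylation is reflected in the number of ‘C’-‘C’ alignments covering a reference cytosine can be illustrated with a naive estimator. This sketch is not FadE's model: it ignores background error rates and the Newton–Raphson fit described above, and simply takes the fraction of C calls among the C and T calls at one forward-strand reference cytosine. The function name and input format (a string of base calls from a pileup) are assumptions made for illustration.

```python
def naive_methylation_rate(pileup_bases):
    """Fraction of C among C/T calls at a reference cytosine, or None if uncovered.

    After bisulfite treatment a methylated cytosine is still read as C,
    while an unmethylated cytosine is read as T. Other calls are ignored.
    """
    c_calls = pileup_bases.count("C")
    t_calls = pileup_bases.count("T")
    if c_calls + t_calls == 0:
        return None
    return c_calls / (c_calls + t_calls)


print(naive_methylation_rate("CCCT"))  # 0.75
```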
19.  A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling 
BMC Genomics  2010;11:282.
RNA-Seq exploits the rapid generation of gigabases of sequence data by Massively Parallel Nucleotide Sequencing, allowing for the mapping and digital quantification of whole transcriptomes. Whilst previous comparisons between RNA-Seq and microarrays have been performed at the level of gene expression, in this study we adopt a more fine-grained approach. Using RNA samples from a normal human breast epithelial cell line (MCF-10a) and a breast cancer cell line (MCF-7), we present a comprehensive comparison between RNA-Seq data generated on the Applied Biosystems SOLiD platform and data from Affymetrix Exon 1.0ST arrays. The use of Exon arrays makes it possible to assess the performance of RNA-Seq in two key areas: detection of expression at the granularity of individual exons, and discovery of transcription outside annotated loci.
We found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. For example, over 80% of exons detected as expressed in RNA-Seq were also detected on the Exon array, and 91% of exons flagged as changing from Absent to Present on at least one platform had fold-changes in the same direction. The greatest detection correspondence was seen when the read count threshold at which to flag exons Absent in the SOLiD data was set to t<1 suggesting that the background error rate is extremely low in RNA-Seq. We also found RNA-Seq more sensitive to detecting differentially expressed exons than the Exon array, reflecting the wider dynamic range achievable on the SOLiD platform. In addition, we find significant evidence of novel protein coding regions outside known exons, 93% of which map to Exon array probesets, and are able to infer the presence of thousands of novel transcripts through the detection of previously unreported exon-exon junctions.
By focusing on exon-level expression, we present the most fine-grained comparison between RNA-Seq and microarrays to date. Overall, our study demonstrates that data from a SOLiD RNA-Seq experiment are sufficient to generate results comparable to those produced from Affymetrix Exon arrays, even using only a single replicate from each platform, and when presented with a large genome.
PMCID: PMC2877694  PMID: 20444259
20.  MALINA: a web service for visual analytics of human gut microbiota whole-genome metagenomic reads 
MALINA is a web service for bioinformatic analysis of whole-genome metagenomic data obtained from human gut microbiota sequencing. As input data, it accepts metagenomic reads of various sequencing technologies, including long reads (such as Sanger and 454 sequencing) and next-generation reads (including SOLiD and Illumina). To the authors’ knowledge, it is the first metagenomic web service that is capable of processing SOLiD color-space reads. The web service allows phylogenetic and functional profiling of metagenomic samples using coverage depth resulting from the alignment of the reads to the catalogue of reference sequences which are built into the pipeline and contain prevalent microbial genomes and genes of human gut microbiota. The obtained metagenomic composition vectors are processed by the statistical analysis and visualization module containing methods for clustering, dimension reduction and group comparison. Additionally, the MALINA database includes vectors of bacterial and functional composition for human gut microbiota samples from a large number of existing studies allowing their comparative analysis together with user samples, namely datasets from Russian Metagenome project, MetaHIT and Human Microbiome Project (downloaded from MALINA is made freely available on the web at The website is implemented in JavaScript (using Ext JS), Microsoft .NET Framework, MS SQL, Python, with all major browsers supported.
PMCID: PMC3599743  PMID: 23216677
Metagenomics; Human gut microbiota; Web-server; Statistical analysis; Visualization
21.  PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds 
Bioinformatics  2009;25(19):2514-2521.
Motivation: The explosion of next-generation sequencing data has spawned the design of new algorithms and software tools to provide efficient mapping for different read lengths and sequencing technologies. In particular, ABI's sequencer (SOLiD system) poses a big computational challenge with its capacity to produce very large amounts of data, and its unique strategy of encoding sequence data into color signals.
Results: We present the mapping software, named PerM (Periodic Seed Mapping), that uses periodic spaced seeds to significantly improve mapping efficiency for large reference genomes when compared with state-of-the-art programs. The data structure in PerM requires only 4.5 bytes per base to index the human genome, allowing entire genomes to be loaded to memory, while multiple processors simultaneously map reads to the reference. Weight maximized periodic seeds offer full sensitivity for up to three mismatches and high sensitivity for four and five mismatches while minimizing the number of random hits per query, significantly speeding up the running time. Such sensitivity makes PerM a valuable mapping tool for SOLiD and Solexa reads.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2752623  PMID: 19675096
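The general idea behind spaced seeds can be sketched as follows: only the positions marked 1 in a seed mask must match, so a read still hits the index when its mismatches fall on the 0 ("don't care") positions. This is a minimal sketch of a generic spaced seed, not PerM's periodic seeds or its 4.5-bytes-per-base index; the mask, the toy reference, and the function names are hypothetical.

```python
from collections import defaultdict


def seed_key(window, mask):
    """Keep only the characters at positions where the mask has a '1'."""
    return "".join(ch for ch, m in zip(window, mask) if m == "1")


def build_index(reference, mask):
    """Map each spaced-seed key to the reference offsets where it occurs."""
    index = defaultdict(list)
    span = len(mask)
    for i in range(len(reference) - span + 1):
        index[seed_key(reference[i:i + span], mask)].append(i)
    return index


def seed_hits(window, index, mask):
    """Reference offsets whose seed key equals that of the read window."""
    return index.get(seed_key(window, mask), [])


reference = "TTACGTGCAAGG"       # hypothetical toy reference
read_window = "ACTAGC"           # reference[2:8] with mismatches at offsets 2 and 3

spaced = "110011"                # the two middle positions are "don't care"
contiguous = "111111"            # ordinary exact-match seed of the same span

print(seed_hits(read_window, build_index(reference, spaced), spaced))          # [2]
print(seed_hits(read_window, build_index(reference, contiguous), contiguous))  # []
```

The spaced mask recovers the true location despite two mismatches, whereas the contiguous seed of the same span misses it; a real mapper would then verify each candidate by counting mismatches over the full read.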
22.  Crystallizing short-read assemblies around seeds 
BMC Bioinformatics  2009;10(Suppl 1):S16.
New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics than those produced by Sanger sequencing, and require different approaches than the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology.
This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short-paired reads using a trivially small number (5 – 10) of seeds of length 300 – 500 bp. These seeds enable us to produce significant assemblies using short-read coverage no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads.
We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data; however, our results on these data are substantially better than those achieved by competing assemblers.
PMCID: PMC2648751  PMID: 19208115
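The N50 statistic quoted in this and other assembly abstracts has a simple definition: it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("need at least one contig length")


# Hypothetical contig lengths (total 290): 80 + 70 = 150 first reaches half (145).
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```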
23.  Combining Two Technologies for Full Genome Sequencing of Human 
Acta Naturae  2009;1(3):102-107.
New DNA sequencing technologies are developing rapidly, allowing quick and efficient characterisation of organisms at the level of the genome structure. In this study, the whole genome sequencing of a human (Russian man) was performed using two technologies currently present on the market - Sequencing by Oligonucleotide Ligation and Detection (SOLiD™) (Applied Biosystems) and sequencing technologies of molecular clusters using fluorescently labeled precursors (Illumina). In total, 108.3 billion base pairs were generated (60.2 billion from Illumina technology and 48.1 billion from SOLiD technology). Statistics performed on reads generated by GAII and SOLiD showed that they covered 75% and 96% of the genome respectively. Short polymorphic regions were detected with comparable accuracy; however, the absolute number revealed by SOLiD was several times lower than that revealed by GAII. An optimal algorithm for using the latest sequencing methods was established for the analysis of individual human genomes. The study is the first Russian effort towards whole human genome sequencing.
PMCID: PMC3347526  PMID: 22649622
24.  Fine De Novo Sequencing of a Fungal Genome Using only SOLiD Short Read Data: Verification on Aspergillus oryzae RIB40 
PLoS ONE  2013;8(5):e63673.
The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, the SOLiD read data with lengths of <60 bp have been considered to be too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only SOLiD read data of 50 bp generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, a 22-fold increase over those obtained using only SOLiD short reads in other published reports. In addition, almost 99% of the reference genome was accurately aligned by the assembled scaffold fragments in long lengths. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also highly reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effect of filtering low quality data, library insert size, and k-mer size on the assembly performance, and recommend assembling from mildly filtered read data, for which the N50 was not substantially degraded, with a library insert size of ∼2.0 kb and a k-mer size of 33.
PMCID: PMC3646829  PMID: 23667655
25.  An alignment algorithm for bisulfite sequencing using the Applied Biosystems SOLiD System 
Bioinformatics  2010;26(15):1901-1902.
Summary: Bisulfite sequencing allows cytosine methylation, an important epigenetic marker, to be detected via nucleotide substitutions. Since the Applied Biosystems SOLiD System uses a unique di-base encoding that increases confidence in the detection of nucleotide substitutions, it is a potentially advantageous platform for this application. However, the di-base encoding also makes reads with many nucleotide substitutions difficult to align to a reference sequence with existing tools, preventing the platform's potential utility for bisulfite sequencing from being realized. Here, we present SOCS-B, a reference-based, un-gapped alignment algorithm for the SOLiD System that is tolerant of both bisulfite-induced nucleotide substitutions and a parametric number of sequencing errors, facilitating bisulfite sequencing on this platform. An implementation of the algorithm has been integrated with the previously reported SOCS alignment tool, and was used to align CpG methylation-enriched Arabidopsis thaliana bisulfite sequence data, exhibiting a 2-fold increase in sensitivity compared to existing methods for aligning SOLiD bisulfite data.
Availability: Executables, source code, and sample data are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2905549  PMID: 20562417
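The notion of an un-gapped alignment that tolerates bisulfite-induced substitutions plus a bounded number of sequencing errors can be sketched in nucleotide space. This is a simplified illustration, not the SOCS-B algorithm: SOCS-B operates on di-base (color-space) reads, whereas this sketch works on bases, considers only the forward strand (a read T over a reference C is not counted as a mismatch), and scans the reference exhaustively. The function names and sequences are hypothetical.

```python
def bisulfite_mismatches(read, ref_window):
    """Count mismatches, treating a read T opposite a reference C as a match."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    mismatches = 0
    for read_base, ref_base in zip(read, ref_window):
        if read_base == ref_base:
            continue
        if ref_base == "C" and read_base == "T":
            continue  # unmethylated cytosine converted by bisulfite
        mismatches += 1
    return mismatches


def align_ungapped(read, reference, max_mismatches):
    """Return (offset, mismatches) for every window within the error budget."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        m = bisulfite_mismatches(read, reference[i:i + len(read)])
        if m <= max_mismatches:
            hits.append((i, m))
    return hits


reference = "GGACGTCCATGG"   # hypothetical toy reference
read = "ATGTTCA"             # reference[2:9] with two C->T conversions

print(align_ungapped(read, reference, max_mismatches=0))  # [(2, 0)]
```

Here `max_mismatches` plays the role of the parametric number of sequencing errors mentioned in the abstract; a conventional exact matcher would reject this read because of its two converted cytosines.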
