The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.
Helminth parasites ensure their survival by regulating host immunity through mechanisms that dampen inflammation. These properties have recently been exploited therapeutically to treat human diseases. The biocomplexity of the intestinal lumen suggests that interactions between the parasite and the intestinal microbiota would also influence inflammation. In this study, we characterized the microbiota in the porcine proximal colon in response to Trichuris suis (whipworm) infection using 16S rRNA gene-based and whole-genome shotgun (WGS) sequencing. A 21-day T. suis infection in four pigs induced a significant change in the composition of the proximal colon microbiota compared to that of three parasite-naive pigs. Among the 15 phyla identified, the abundances of Proteobacteria and Deferribacteres were changed in infected pigs. The abundances of approximately 13% of genera were significantly altered by infection. Changes in relative abundances of Succinivibrio and Mucispirillum, for example, may relate to alterations in carbohydrate metabolism and niche disruptions in mucosal interfaces induced by parasitic infection, respectively. Of note, infection by T. suis led to a significant shift in the metabolic potential of the proximal colon microbiota, where 26% of all metabolic pathways identified were affected. Besides carbohydrate metabolism, lysine biosynthesis was repressed as well. A metabolomic analysis of volatile organic compounds (VOCs) in the luminal contents showed a relative absence in infected pigs of cofactors for carbohydrate and lysine biosynthesis, as well as an accumulation of oleic acid, suggesting altered fatty acid absorption contributing to local inflammation. Our findings should facilitate development of strategies for parasitic control in pigs and humans.
Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.
Supplementary data are available at Bioinformatics online.
Summary: Numerous metagenomics projects have produced tremendous amounts of sequencing data. Aligning these sequences to reference genomes is an essential analysis in metagenomics studies. Large-scale alignment data call for intuitive and efficient visualization tools. However, current tools such as various genome browsers are highly specialized to handle intraspecies mapping results. They are not suitable for alignment data in metagenomics, which are often interspecies alignments. We have developed a web browser-based desktop application for interactively visualizing alignment data of metagenomic sequences. This viewer is easy to use on all computer systems with modern web browsers and requires no software installation.
The rapid advances of high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis also addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.
clustering; metagenomics; next-generation sequencing; protein families; artificial duplicates; OTU
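The idea of representing a redundant data set by one representative per cluster can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm of any particular clustering program: the `identity` function is a crude position-wise match fraction without alignment, and the names `identity`, `greedy_cluster` and the `threshold` parameter are assumptions introduced for the example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment).

    Assumption for illustration: real tools compute identity from an alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering.

    Sequences are processed longest first. Each sequence joins the first
    cluster whose representative it matches at or above the threshold;
    otherwise it becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative, members)
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
    for rep, members in greedy_cluster(reads, threshold=0.85):
        print(rep, len(members))
```

The non-redundant set is simply the list of representatives; a consensus per cluster could be computed from `members` instead if error correction is the goal.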
As a signaling molecule and an inhibitor of histone deacetylases (HDACs), butyrate exerts its impact on a broad range of biological processes, such as apoptosis and cell proliferation, in addition to its critical role in energy metabolism in ruminants. This study examined the effect of butyrate on alternative splicing in bovine epithelial cells using RNA-seq technology. Junction reads accounted for 11.28 and 12.32% of total mapped reads in the butyrate-treated (BT) and control (CT) groups, respectively. A total of 201,326 potential splicing junctions were detected, each supported by ≥3 junction reads. Approximately 94% of these junctions conformed to the consensus sequence (GT/AG) while ∼3% were GC/AG junctions. No AT/AC junctions were observed. A total of 2,834 exon skipping events, supported by a minimum of 3 junction reads, were detected. At least 7 genes whose mRNA expression was significantly affected by butyrate also had exon skipping events differentially regulated by butyrate. Furthermore, COL5A3, which was induced 310-fold by butyrate (FDR <0.001) at the gene level, had a significantly higher number of junction reads mapped to Exon#8 (Donor) and Exon#11 (Acceptor) in BT. This event had the potential to result in the formation of a COL5A3 mRNA isoform with 2 of the 69 exons missing. In addition, 216 differentially expressed transcript isoforms regulated by butyrate were detected. For example, Isoform 1 of ORC1 was strongly repressed by butyrate while Isoform 2 remained unchanged. Butyrate physically binds to and inhibits all zinc-dependent HDACs except HDAC6 and HDAC10. Our results provided evidence that butyrate also regulated deacetylase activities of classical HDACs via its transcriptional control. Moreover, thirteen gene fusion events differentially affected by butyrate were identified. Our results provided a snapshot into complex transcriptome dynamics regulated by butyrate, which will facilitate our understanding of the biological effects of butyrate and other HDAC inhibitors.
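The junction classification reported above (GT/AG, GC/AG, AT/AC, with a minimum of 3 supporting reads) can be illustrated with a short Python sketch. It assumes each junction is given as the intron sequence on the transcribed strand together with its supporting read count; the function names and the input layout are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

# Donor/acceptor dinucleotide pairs named in the text above.
KNOWN_MOTIFS = {
    ("GT", "AG"): "GT/AG",
    ("GC", "AG"): "GC/AG",
    ("AT", "AC"): "AT/AC",
}


def junction_class(intron: str) -> str:
    """Classify an intron by its first two (donor) and last two (acceptor) bases."""
    intron = intron.upper()
    if len(intron) < 4:
        return "other"
    return KNOWN_MOTIFS.get((intron[:2], intron[-2:]), "other")


def motif_fractions(junctions, min_reads=3):
    """Return the fraction of each motif class among sufficiently supported junctions.

    junctions: iterable of (intron_sequence, supporting_read_count) pairs.
    """
    counts = Counter(
        junction_class(seq) for seq, reads in junctions if reads >= min_reads
    )
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


if __name__ == "__main__":
    example = [
        ("GTAAGTCCAG", 5),
        ("GCAAGTTTAG", 3),
        ("GTCCCCCCAG", 2),  # dropped: fewer than 3 supporting reads
        ("ATATCCTTAC", 4),
        ("GTTTTTTTAG", 10),
    ]
    print(motif_fractions(example))
```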
Short-chain fatty acids (SCFAs), such as butyrate, produced by gut microorganisms, play a critical role in energy metabolism and physiology of ruminants as well as in human health. In this study, the temporal effect of elevated butyrate concentrations on the transcriptome of the rumen epithelium was quantified via serial biopsy sampling using RNA-seq technology. The mean number of genes transcribed in the rumen epithelial transcriptome was 17,323.63 ± 277.20 (±SD; N = 24) while the core transcriptome consisted of 15,025 genes. Collectively, 80 genes were identified as being significantly impacted by butyrate infusion across all time points sampled. Maximal transcriptional effect of butyrate on the rumen epithelium was observed at the 72-h infusion when the abundance of 58 genes was altered. The initial reaction of the rumen epithelium to elevated exogenous butyrate may represent a stress response as Gene Ontology (GO) terms identified were predominantly related to responses to bacteria and biotic stimuli. An algorithm for the reconstruction of accurate cellular networks (ARACNE) inferred regulatory gene networks with 113,738 direct interactions in the butyrate-epithelium interactome using a combined cutoff of an error tolerance (ɛ = 0.10) and a stringent P-value threshold of mutual information (5.0 × 10⁻¹¹). Several regulatory networks were controlled by transcription factors, such as CREBBP and TTF2, which were regulated by butyrate. Our findings provide insight into the regulation of butyrate transport and metabolism in the rumen epithelium, which will guide our future efforts in exploiting potential beneficial effects of butyrate in animal well-being and human health.
butyrate; epithelial; networks; RNA-seq; ruminant; transcriptome
Short-chain fatty acids (SCFAs), especially butyrate, affect cell differentiation, proliferation, and motility. Butyrate also induces cell cycle arrest and apoptosis through its inhibition of histone deacetylases (HDACs). In addition, butyrate is a potent inducer of histone hyper-acetylation in cells. Therefore, this SCFA provides an excellent in vitro model for studying the epigenomic regulation of gene expression induced by histone acetylation. In this study, we analyzed the differential in vitro expression of genes induced by butyrate in bovine epithelial cells by using deep RNA-sequencing technology (RNA-seq). Between 57,303,693 and 78,933,744 sequence reads were generated per sample. Approximately 11,408 genes were significantly impacted by butyrate, with a false discovery rate (FDR) <0.05. The predominant cellular processes affected by butyrate included cell morphological changes, cell cycle arrest, and apoptosis. Our results provided insight into the transcriptome alterations induced by butyrate, which will undoubtedly facilitate our understanding of the molecular mechanisms underlying butyrate-induced epigenomic regulation in bovine cells.
Summary: Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains, a problem known as homologous over-extension (HOE). PSI-Search combines an optimal Smith–Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile construction strategy. An optional sequence boundary-masking procedure, which prevents alignments from being extended after they are initially included, can reduce HOE errors in the PSSM profile. Preventing HOE improves selectivity for both PSI-BLAST and PSI-Search, but PSI-Search has ~4-fold better selectivity than PSI-BLAST and similar sensitivity at 50% and 60% family coverage. PSI-Search also produces 2- to 4-fold fewer false-positives than JackHMMER, but is ~5% less sensitive.
Availability and implementation: PSI-Search is available from the authors as a standalone implementation written in Perl for Linux-compatible platforms. It is also available through a web interface (www.ebi.ac.uk/Tools/sss/psisearch) and SOAP and REST Web Services (www.ebi.ac.uk/Tools/webservices).
Helminth infection in pigs serves as an excellent model for the study of the interaction between human malnutrition and parasitic infection and could have important implications in human health. We had observed that pigs infected with Trichuris suis for 21 days showed significant changes in the proximal colon microbiota. In this study, interactions between worm burden and severity of disruptions to the microbial composition and metabolic potentials in the porcine proximal colon microbiota were investigated using metagenomic tools. Pigs were infected with a single dose of T. suis eggs for 53 days. Among infected pigs, two cohorts were differentiated that either had adult worms or were worm-free. Infection resulted in a significant change in the abundance of approximately 13% of genera detected in the proximal colon microbiota regardless of worm status, suggesting a relatively persistent change over time in the microbiota due to the initial infection. A significant reduction in the abundance of Fibrobacter and Ruminococcus indicated a change in the fibrolytic capacity of the colon microbiota in T. suis infected pigs. In addition, ∼10% of identified KEGG pathways were affected by infection, including ABC transporters, peptidoglycan biosynthesis, and lipopolysaccharide biosynthesis as well as α-linolenic acid metabolism. Trichuris suis infection modulated host immunity to Campylobacter because there was a 3-fold increase in the relative abundance in the colon microbiota of infected pigs with worms compared to naïve controls, but a 3-fold reduction in worm-free infected pigs compared to controls. The level of pathology observed in infected pigs with worms compared to worm-free infected pigs may relate to the local host response because expression of several Th2-related genes was enhanced in infected pigs with worms versus those worm-free. Our findings provided insight into the dynamics of the proximal colon microbiota in pigs in response to T. suis infection.
The capacity of the rumen microbiota to produce volatile fatty acids (VFAs) has important implications in animal well-being and production. We investigated temporal changes of the rumen microbiota in response to butyrate infusion using pyrosequencing of the 16S rRNA gene. Twenty-one phyla were identified in the rumen microbiota of dairy cows. The rumen microbiota harbored 54.5±6.1 genera (mean ± SD) and 127.3±4.4 operational taxonomic units (OTUs), respectively. However, the core microbiome comprised 26 genera and 82 OTUs. Butyrate infusion altered molar percentages of 3 major VFAs. Butyrate perturbation had a profound impact on the rumen microbial composition. A 72 h-infusion led to a significant change in the numbers of sequence reads derived from 4 phyla, including the 2 most abundant phyla, Bacteroidetes and Firmicutes. As many as 19 genera and 43 OTUs were significantly impacted by butyrate infusion. Elevated butyrate levels in the rumen seemingly had a stimulating effect on butyrate-producing bacteria populations. The resilience of the rumen microbial ecosystem was evident as the abundance of the microorganisms returned to their pre-disturbed status after infusion withdrawal. Our findings provide insight into perturbation dynamics of the rumen microbial ecosystem and should guide efforts in formulating optimal uses of probiotic bacteria treating human diseases.
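The distinction between per-sample richness and a core microbiome can be made concrete with a small Python sketch. The abstract does not state how the core was defined, so the rule used here (a taxon is "core" if it is detected in every sample) is an assumption, as are the function names, the `min_count` parameter and the toy counts.

```python
def taxa_present(counts, min_count=1):
    """Set of taxa with at least min_count reads in one sample."""
    return {taxon for taxon, n in counts.items() if n >= min_count}


def richness(sample_counts, min_count=1):
    """Number of detected taxa per sample, keyed by sample name."""
    return {
        sample: len(taxa_present(counts, min_count))
        for sample, counts in sample_counts.items()
    }


def core_taxa(sample_counts, min_count=1):
    """Taxa detected in every sample (assumed definition of the core)."""
    present = [taxa_present(c, min_count) for c in sample_counts.values()]
    return set.intersection(*present) if present else set()


if __name__ == "__main__":
    # Toy genus-level read counts; values are invented for illustration.
    samples = {
        "cow1": {"Prevotella": 120, "Butyrivibrio": 8, "Fibrobacter": 3},
        "cow2": {"Prevotella": 95, "Butyrivibrio": 12, "Ruminococcus": 5},
        "cow3": {"Prevotella": 110, "Butyrivibrio": 0, "Ruminococcus": 7,
                 "Fibrobacter": 1, "Succinivibrio": 2},
    }
    print(richness(samples))
    print(core_taxa(samples))
```

Because a taxon missing from a single sample drops out of the core, the core is necessarily no larger than the smallest per-sample richness, which is consistent with the core counts above being lower than the per-sample means.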
The phosphatidylinositol 3-kinase (PI3K)/Akt signaling pathway, activated during influenza A virus infection, can promote viral replication via multiple mechanisms. Direct binding of NS1 protein to the p85β subunit of PI3K is required for activation of PI3K/Akt signaling. Binding and subsequent activation of PI3K is believed to be a conserved characteristic of influenza A virus NS1 protein. Sequence variation of NS1 proteins in different influenza A viruses led us to investigate possible deviations from this conservation.
In the present study, NS1 proteins from four different influenza A virus subtypes/strains were tested for their ability to bind p85β subunit of PI3K and to activate PI3K/Akt. All NS1 proteins efficiently bound to p85β and activated PI3K/Akt, with the exception of NS1 protein from an H5N1 virus (A/Chicken/Guangdong/1/05, abbreviated as GD05), which bound to p85β but failed to activate PI3K/Akt, implying that as-yet-unidentified domain(s) in NS1 may alternatively mediate the activation of PI3K. Moreover, PI3K inhibitor, LY294002, did not suppress but significantly increased the replication of GD05 virus.
Our study indicates that activation of PI3K/Akt by NS1 protein is not highly conserved among influenza A viruses and inhibition of the PI3K/Akt pathway as an anti-influenza strategy may not work for all influenza A viruses.
We present a report of the BIOCOMP'10 - The 2010 International Conference on Bioinformatics & Computational Biology and other related work in the area of systems biology.
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
bioinformatics; hidden Markov models; multiple sequence alignment
Due to its diverse, wondrous plants and unique topography, Western China has drawn great attention from explorers and naturalists from the Western World. Among them, Ernest Henry Wilson (1876–1930), known as ‘Chinese’ Wilson, travelled to Western China five times from 1899 to 1918. He took more than 1,000 photos during his travels. These valuable photos illustrated the natural and social environment of Western China a century ago. Since 1997, we have collected E.H. Wilson's old pictures, and from 2004, following his expedition route, we spent 7 years re-photographing 250 of them. Comparing Wilson's photos with ours, we found an obvious warming trend over the 100 years, not only in specific areas but throughout the entire Western China. This warming trend was manifested in phenological changes, community shifts and melting snow in alpine mountains. In this study, we also noted remarkable vegetation changes. Of the 62 picture pairs related to vegetation change, 39 indicated that vegetation had improved, 17 showed degraded vegetation and six showed no obvious change. In these photos taken a century apart, we also found not only rapid urbanization in Western China, but also the disappearance of traditional cultures. Through such comparisons, we should not only be amazed by the significant environmental changes through time in Western China, but also consider their implications for protecting the environment while meeting the needs of economic development.
Infections in cattle by the abomasal nematode Ostertagia ostertagi result in impaired gastrointestinal function. Six partially immune animals were developed using multiple drug-attenuated infections, and these animals displayed reduced worm burdens and a slightly elevated abomasal pH upon reinfection. In this study, we characterized the abomasal microbiota in response to reinfection using metagenomic tools. Compared to uninfected controls, infection did not induce a significant change in the microbial community composition in immune animals. 16S rRNA gene-based phylogenetic analysis identified 15 phyla in the bovine abomasal microbiota with Bacteroidetes (60.5%), Firmicutes (27.1%), Proteobacteria (7.2%), Spirochaetes (2.9%), and Fibrobacteres (1.5%) being the most predominant. The number of prokaryotic genera and operational taxonomic units (OTU) identified in the abomasal microbial community was 70.8±19.8 (mean ± SD) and 90.3±2.9, respectively. However, the core microbiome comprised 32 genera and 72 OTU. Infection seemingly had a minimal impact on the abomasal microbial diversity at a genus level in immune animals. Proteins predicted from whole genome shotgun (WGS) DNA sequences were assigned to 5,408 Pfam and 3,381 COG families, demonstrating dazzling arrays of functional diversity in bovine abomasal microbial communities. However, none of the COG functional classes was significantly impacted by infection. Our results demonstrate that immune animals may develop abilities to maintain proper stability of their abomasal microbial ecosystem. A minimal disruption in the bovine abomasal microbiota by reinfection may contribute equally to the restoration of gastric function in immune animals.
The new field of metagenomics studies microorganism communities by culture-independent sequencing. With the advances in next-generation sequencing techniques, researchers are facing tremendous challenges in metagenomic data analysis due to the huge quantity and high complexity of sequence data. Analyzing large datasets is extremely time-consuming; in addition, metagenomic annotation involves a wide range of computational tools, which are difficult for ordinary users to install and maintain. The tools provided by the few available web servers are also limited and have various constraints such as login requirements, long waiting times and the inability to configure pipelines.
We developed WebMGA, a customizable web server for fast metagenomic analysis. WebMGA includes over 20 commonly used tools such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contaminations, taxonomic analysis, functional annotation etc. WebMGA provides users with rapid metagenomic data analysis using fast and effective tools, which have been implemented to run in parallel on our local computer cluster. Users can access WebMGA through web browsers or programming scripts to perform individual analysis or to configure and run customized pipelines. WebMGA is freely available at http://weizhongli-lab.org/metagenomic-analysis.
WebMGA offers to researchers many fast and unique tools and great flexibility for complex metagenomic data analysis.
The NS1 protein of influenza A virus is able to bind with many proteins that affect cellular signal transduction and protein synthesis in infected cells. The NS1 protein consists of approximately 230 amino acids and the last 4 amino acids of the NS1 C-terminal form a PDZ binding motif. Postsynaptic Density Protein-95 (PSD-95), which is mainly expressed in neurons, has 3 PDZ domains. We hypothesise that NS1 binds to PSD-95, and this binding is able to affect neuronal function.
We conducted a yeast two-hybrid analysis, GST-pull down assays and co-immunoprecipitations to detect the interaction between NS1 and PSD-95. The results showed that NS1 of avian influenza virus H5N1 (A/chicken/Guangdong/1/2005) is able to bind to PSD-95, whereas NS1 of human influenza virus H1N1 (A/Shantou/169/2006) is unable to do so. The results also revealed that NS1 of H5N1 significantly reduces the production of nitric oxide (NO) in rat hippocampal neurons.
In summary, our study indicates that NS1 of influenza A virus can bind with neuronal PSD-95, and the avian H5N1 and human H1N1 influenza A viruses possess distinct binding properties.
NS1; PSD-95; influenza virus; nitric oxide; neurons
Summary: Fragment recruitment, a process of aligning sequencing reads to reference genomes, is a crucial step in metagenomic data analysis. The available sequence alignment programs are either slow or insufficient for recruiting metagenomic reads. We implemented an efficient algorithm, FR-HIT, for fragment recruitment. We applied FR-HIT and several other tools including BLASTN, MegaBLAST, BLAT, LAST, SSAHA2, SOAP2, BWA and BWA-SW to recruit four metagenomic datasets from different types of sequencers. On average, FR-HIT and BLASTN recruited significantly more reads than other programs, while FR-HIT is about two orders of magnitude faster than BLASTN. FR-HIT is slower than the fastest SOAP2, BWA and BWA-SW, but it recruited 1–5 times more reads.
Supplementary information: Supplementary data are available at Bioinformatics online.
The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data.
The microbes that inhabit particular environments must be able to perform molecular functions that provide them with a competitive advantage to thrive in those environments. As most molecular functions are performed by proteins and are conserved between related proteins, we can expect that organisms successful in a given environmental niche would contain protein families that are specific for functions that are important in that environment. For instance, the human gut is rich in polysaccharides from the diet or secreted by the host, and is dominated by Bacteroides, whose genomes contain a highly expanded repertoire of protein families involved in carbohydrate metabolism. To identify other protein families that are specific to this environment, we investigated the distribution of protein families in the currently available human gut genomic and metagenomic data. Using an automated procedure, we identified a group of protein families strongly overrepresented in the human gut. These not only include many families described previously but also, interestingly, a large group of previously unrecognized protein families, which suggests that we still have much to discover about this environment. The identification and analysis of these families could provide us with new information about an environment critical to our health and well being.
Metagenomics provides a unique opportunity to sample the gene content of microbial communities adapted to specific environments and for the study of the correlations between the presence or absence of gene families that occur in organisms within that environment. Such studies provide detailed information about the adaptation of microbes to a given environment and, indirectly, provide clues about the most important molecular processes that are specific for that environment. Having performed such an analysis for the community of the human distal gut, we report many new protein families and identify many others that are highly specific for this particular environment. The function of most of these proteins is unknown, which illustrates the extent of our ignorance about the organisms within this environment that are so important for human health and well being.
The EMBL-EBI provides access to various mainstream sequence analysis applications. These include sequence similarity search services such as BLAST, FASTA, InterProScan and multiple sequence alignment tools such as ClustalW, T-Coffee and MUSCLE. Through the sequence similarity search services, the users can search mainstream sequence databases such as EMBL-Bank and UniProt, and more than 2000 completed genomes and proteomes. We present here a new framework aimed at both novice and expert users that exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available over the web and via Web Services interfaces for users who require systematic access or want to interface with customized pipelines and workflows using common programming languages. The framework features novel result visualizations and integration of domain and functional predictions for protein database searches. It is available at http://www.ebi.ac.uk/Tools/sss for sequence similarity searches and at http://www.ebi.ac.uk/Tools/msa for multiple sequence alignments.
Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads were filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may also cause underestimation of abundance associated with natural duplicates.
We implemented a method for identification of exact and nearly identical duplicates from pyrosequencing reads. This method performs an all-against-all sequence comparison and clusters the duplicates into groups using an algorithm modified from our previous sequence clustering method cd-hit. This method can process a typical dataset in ~10 minutes; it also provides a consensus sequence for each group of duplicates. We applied this method to the underlying raw reads of 39 genomic projects and 10 metagenomic projects that utilized the pyrosequencing technique. We compared the occurrences of the duplicates identified by our method and the natural duplicates made by independent simulations. We observed that the duplicates, including both artificial and natural duplicates, make up 4-44% of reads. The number of natural duplicates highly correlates with the samples' read density (number of reads divided by genome size). For high-complexity metagenomic samples lacking dominant species, natural duplicates only make up <1% of all duplicates. But for some other samples like transcriptomic samples, the majority of the observed duplicates might be natural duplicates.
Our method is available from http://cd-hit.org as a downloadable program and a web server. It is important not only to identify the duplicates from metagenomic datasets but also to distinguish whether they are artificial or natural duplicates. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.
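As a rough illustration of duplicate grouping and of the read density defined above (number of reads divided by genome size), the following Python sketch treats two reads as duplicates when the shorter one is an exact prefix of the longer one. That rule is a deliberate simplification assumed for the example: it ignores the mismatches and indels that a near-identical duplicate may contain, and it is not the cd-hit-based implementation described in the text. All function names are hypothetical.

```python
def group_prefix_duplicates(reads):
    """Group reads that start identically (the shorter is a prefix of the longer).

    Reads are processed longest first; the first read of each group acts
    as its representative.
    """
    groups = []
    for read in sorted(reads, key=len, reverse=True):
        for group in groups:
            if group[0].startswith(read):
                group.append(read)
                break
        else:
            groups.append([read])
    return groups


def duplicate_fraction(reads):
    """Fraction of reads that are redundant copies of a group representative."""
    if not reads:
        return 0.0
    return (len(reads) - len(group_prefix_duplicates(reads))) / len(reads)


def read_density(num_reads, genome_size):
    """Read density as defined in the text: number of reads / genome size."""
    return num_reads / genome_size


if __name__ == "__main__":
    reads = ["ACGTACGTTT", "ACGTACG", "ACGTACGTTT",
             "GGGTTTAAAC", "GGGTTT", "TTTTTTTT"]
    print(group_prefix_duplicates(reads))
    print(duplicate_fraction(reads))
    print(read_density(200_000, 4_000_000))
```

The sketch cannot tell artificial from natural duplicates; as the text notes, that distinction depends on comparing observed duplicate counts with what read density alone would predict.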
Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. In order to further assist the CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels.
Availability: Free access at http://cd-hit.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Sequence identification of ESTs from non-model species offers distinct challenges particularly when these species have duplicated genomes and when they are phylogenetically distant from sequenced model organisms. For the common carp, an environmental model of aquacultural interest, large numbers of ESTs remained unidentified using BLAST sequence alignment. We have used the expression profiles from large-scale microarray experiments to suggest gene identities.
Expression profiles from ~700 cDNA microarrays describing responses of 7 major tissues to multiple environmental stressors were used to define a co-expression landscape. This was based on the Pearson correlation coefficient relating each gene with all other genes, from which a network description provided clusters of highly correlated genes as 'mountains'. We show that these contain genes with known identities and genes with unknown identities, and that the correlation constitutes evidence of identity in the latter. This procedure has suggested identities for 522 of 2701 unknown carp EST sequences. We also discriminate several common carp genes and gene isoforms that were not discriminated by BLAST sequence alignment alone. Precision in identification was substantially improved by use of data from multiple tissues and treatments.
The detailed analysis of co-expression landscapes is a sensitive technique for suggesting an identity for the large number of BLAST unidentified cDNAs generated in EST projects. It is capable of detecting even subtle changes in expression profiles, and thereby of distinguishing genes with a common BLAST identity into different identities. It benefits from the use of multiple treatments or contrasts, and from the large-scale microarray data.
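The principle behind the co-expression approach, using the Pearson correlation between expression profiles as evidence of identity, can be sketched in Python as follows. This is a simplified nearest-neighbour version assumed for illustration: it does not build the network or the 'mountain' clusters described above, and the gene names, the `min_r` cutoff and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def suggest_identities(profiles, annotations, min_r=0.9):
    """For each unannotated gene, suggest the identity of its best-correlated
    annotated gene, provided the correlation reaches min_r.

    profiles: dict gene -> list of expression values across arrays.
    annotations: dict gene -> known identity.
    """
    suggestions = {}
    known = [g for g in annotations if g in profiles]
    for gene, profile in profiles.items():
        if gene in annotations or not known:
            continue
        r, best = max((pearson(profile, profiles[k]), k) for k in known)
        if r >= min_r:
            suggestions[gene] = (annotations[best], round(r, 3))
    return suggestions


if __name__ == "__main__":
    profiles = {
        "known_a": [1, 2, 3, 4, 5],
        "known_b": [5, 4, 3, 2, 1],
        "est_001": [2, 4, 6, 8, 10.5],
        "est_002": [3, 1, 4, 1, 5],
    }
    annotations = {"known_a": "hsp70", "known_b": "actin"}
    print(suggest_identities(profiles, annotations))
```

Profiles drawn from more tissues and treatments lengthen each vector, which makes a high correlation less likely to arise by chance; this is in line with the improved precision reported above.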