Thakkar A, Wavreille A-S, Pei D. Traceless capping agent for peptide sequencing by partial Edman degradation and mass spectrometry. Analytical Chemistry 78;2006:5935–5939. [PubMed]
Partial Edman degradation is a procedure in which peptides are coupled to phenylisothiocyanate (PITC) in the presence of a small amount of a capping reagent so that, when treated with strong anhydrous acid after coupling, peptide molecules coupled to PITC release the N-terminal residue while those coupled to the capping reagent do not. Multiple cycles of this procedure produce a ladder of truncation products from which the N-terminal sequence can be read by MALDI mass spectrometry. The authors of the present paper use this approach extensively as a simple, rapid, and inexpensive method for sequencing biologically active peptides selected from one-bead/one-peptide combinatorial libraries. However, they experience lower success rates with peptides rich in tryptophan, proline, and tyrosine residues. The present paper describes an improved capping reagent, N-(9-fluorenylmethoxycarbonyloxy)succinimide. This is highly reactive, even with secondary amino acids, and, after several rounds of partial Edman degradation, can be removed from the N-termini and side chains of all peptide molecules by treatment with 20% piperidine in dimethylformamide. This produces simpler spectra, and eliminates the previously observed problems with Trp, Pro, and Tyr residues.
Hamberg A, Kempka M, Sjödahl J, Roeraade J, Hult K. C-terminal ladder sequencing of peptides using an alternative nucleophile in carboxypeptidase Y digests. Analytical Biochemistry 357;2006:167–172. [PubMed]
Carboxypeptidase digestion has become the basis of a popular method for ladder sequencing from the C-terminus of peptides/proteins in which mass spectrometry is used to measure the mass differences between the truncated species in order to deduce amino acid sequence. To avoid gaps in sequence resulting from rapid cleavage of amino acids, this paper demonstrates that introducing into the enzymatic reaction mixture an alternative nucleophile, 2-pyridylmethylamine, to compete with water, results in the carboxypeptidase-catalyzed aminolytic incorporation of a protecting group into a fraction of the peptide molecules. These are resistant to further cleavage by carboxypeptidase, so the effect is to stabilize the ladder. The products of digesting peptides with carboxypeptidase Y are analyzed by MALDI mass spectrometry, and substantial enhancement of sequence information is demonstrated.
Nordgård O, Kvaløy JT, Farmen RK, Heikkilä R. Error propagation in relative real-time reverse transcription polymerase chain reaction quantification models: The balance between accuracy and precision. Analytical Biochemistry 356;2006:182–193. [PubMed]
This paper applies the principles of error propagation to the best-established models for real-time reverse transcription polymerase chain reaction (RT-PCR) quantification to evaluate the contributions of various components of the models to the overall random error of mRNA quantitation. Normalization against a calibration standard is shown to increase comparability between runs, but also increases overall random error. However, normalization against multiple reference genes may be preferable because it does not increase overall random error. It is also shown that run-to-run efficiency variations observed for the same sample are dominated by random error of efficiency determination. Incorporating sample-specific amplification efficiencies determined from individual amplification curves therefore severely reduces the overall reproducibility of quantitation, and should be avoided. Finally, the authors point out that new, more complex models for the PCR process, designed to improve accuracy, also have the potential to reduce reproducibility, and their adoption should therefore be balanced against the possible loss of precision.
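The role of efficiency error described above can be illustrated with a minimal sketch. It assumes the widely used efficiency-corrected model R = E^ΔCt and independent random errors in E and ΔCt; this model, the function names, and the example numbers are illustrative assumptions and are not taken from the paper.

```python
import math


def relative_quantity(efficiency: float, delta_ct: float) -> float:
    """Relative quantity under the assumed model R = E ** delta_ct."""
    return efficiency ** delta_ct


def relative_sd(efficiency: float, sd_efficiency: float,
                delta_ct: float, sd_delta_ct: float) -> float:
    """First-order propagated relative standard deviation of R.

    Since ln R = delta_ct * ln E, the relative error of R combines
    (delta_ct / E) * sd_E and ln(E) * sd_delta_ct in quadrature,
    assuming the two errors are independent.
    """
    term_efficiency = (delta_ct / efficiency) * sd_efficiency
    term_ct = math.log(efficiency) * sd_delta_ct
    return math.sqrt(term_efficiency ** 2 + term_ct ** 2)


if __name__ == "__main__":
    # Hypothetical values: E = 1.9 +/- 0.05, delta Ct = 5 +/- 0.2 cycles.
    print(relative_quantity(1.9, 5.0))
    print(relative_sd(1.9, 0.05, 5.0, 0.2))
```

Under this model the efficiency term scales with ΔCt, so even a small uncertainty in a per-sample efficiency estimate is amplified for large cycle differences, which is consistent with the caution expressed above.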
Patwa TH, Zhao J, Anderson MA, Simeone DM, Lubman DM. Screening of glycosylation patterns in serum using natural glycoprotein microarrays and multi-lectin fluorescence detection. Analytical Chemistry 78;2006:6411–6421. [PubMed]
A method is presented for screening complex mixtures of glycoproteins, such as those in serum, for changes in glycan structure at both the global and individual protein levels. Glycoprotein enrichment is first performed using a general lectin such as wheat germ agglutinin. The enriched glycoproteins are then separated using reverse-phase HPLC, and the chromatographic eluent is spotted onto nitrocellulose slides. The resulting array of spots is then screened for various glycan structures using five lectins, including concanavalin A, Maackia lectin II, Sambucus bark lectin, and peanut agglutinin. These lectins are conjugated to biotin, and their binding to the array is detected in a sandwich assay using streptavidin conjugated to the fluorophore Alexa Fluor 555. Using this methodology, differences in glycosylation patterns are demonstrated between normal, pancreatitis, and cancer sera, particularly with respect to sialylation, mannosylation, and fucosylation. The approach has potential for identifying disease-related biomarkers.
Blair S, Richmond K, Rodesch M, Bassetti M, Cerrina F. A scalable method for multiplex LED-controlled synthesis of DNA in capillaries. Nucleic Acids Research 34;2006:e110. [PubMed]
This paper presents a proof of principle for instrumentation enabling large numbers of oligonucleotides to be synthesized in small amounts with high quality and at modest cost. Standard DNA synthesizers utilizing acid-labile phosphoramidite chemistry frequently make products in much larger amounts than needed, and are not well suited for the development of large-scale libraries of oligonucleotides in small amounts. The process described in the present paper is based on light-directed 3-nitrophenylpropyloxycarbonyl (NPPOC) chemistry. Synthesis is performed in a capillary flow cell. The capillary is illuminated by a series of ultraviolet light–emitting diodes (LEDs) arranged along its length, emitting light directed normal to the capillary axis. The light from each LED deprotects oligonucleotide in the region immediately facing it. When connected via fluid fittings to a fluidics delivery system such as a standard DNA synthesizer, the capillary becomes a multiplex synthesis cell. Through synchronization of fluid delivery and UV light emission from each LED, discrete species of oligonucleotide are grown on the inner surface of the capillary. The system is readily scalable in terms of the amounts and numbers of oligonucleotides synthesized.
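The synchronization of fluid delivery with LED illumination can be pictured as a scheduling problem. The sketch below is a hypothetical illustration, not the authors' control software: it assumes one target sequence per LED position, written in order of synthesis, and a fixed A, C, G, T delivery cycle in which only the LEDs whose next base matches the delivered monomer are switched on.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"  # assumed fixed order of monomer delivery


def led_schedule(targets: Dict[int, str]) -> List[Tuple[str, List[int]]]:
    """Return (base, LEDs to illuminate) steps that build every target.

    targets maps an LED index to the sequence grown in front of that LED,
    written in the order in which bases are added.
    """
    for seq in targets.values():
        if any(base not in BASES for base in seq):
            raise ValueError("sequences may contain only A, C, G, T")

    progress = {led: 0 for led in targets}  # bases already added per LED
    schedule: List[Tuple[str, List[int]]] = []
    while any(progress[led] < len(seq) for led, seq in targets.items()):
        for base in BASES:
            # Illuminate only regions whose next required base is this one.
            lit = [led for led, seq in sorted(targets.items())
                   if progress[led] < len(seq) and seq[progress[led]] == base]
            if lit:  # skip deliveries that no region needs
                schedule.append((base, lit))
                for led in lit:
                    progress[led] += 1
    return schedule


if __name__ == "__main__":
    for base, leds in led_schedule({0: "AC", 1: "CA"}):
        print(base, leds)
```

Replaying the schedule (appending each delivered base to the regions that were illuminated) reproduces the target sequences, which is a simple way to check such a plan before running it.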
Ejsing CS, Duchoslav E, Sampaio J, Simons K, Bonner R, Thiele C, Ekroos K, Shevchenko A. Automated identification and quantification of glycerophospholipid molecular species by multiple precursor ion scanning. Analytical Chemistry 78;2006:6202–6214. [PubMed]
Lipid profiling seeks to describe lipid content quantitatively with respect to the different lipid head groups that are present and the various fatty acid or fatty alcohol tail groups with which each head group is associated. The present paper extends previous lipid-profiling work in which a hybrid quadrupole-time-of-flight mass spectrometer acquires CID precursor ion spectra for multiple fatty acyl product ions. Software is here introduced for automated acquisition and processing of the precursor ion spectra for 41 fatty acyl product ions. The software, called Lipid Profiler, is available from MDS Sciex. The program incorporates correction algorithms to calculate the intensity of lipid precursors within overlapping isotopic clusters. Absolute quantitation is realized using a synthetic, internal standard consisting of heptadecanoyl (17:0/17:0) fatty acids attached to the six common glycerophospholipid head groups. Spectra are acquired with the help of a NanoMate robotic nanoflow source from Advion Biosciences, Inc., Ithaca, NY. The results demonstrate a linear dynamic range of 10 nM–100 μM for quantitation.
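Quantitation against a spiked internal standard can be sketched as a simple intensity ratio. The example below assumes one standard per head-group class at a known concentration and an equal response for analyte and standard within a class; it omits the isotopic-overlap correction mentioned above, and the function name and data layout are hypothetical rather than part of Lipid Profiler.

```python
from typing import Dict, Tuple


def quantify_species(
    species: Dict[str, Tuple[str, float]],
    standard_intensity: Dict[str, float],
    standard_conc_um: float,
) -> Dict[str, float]:
    """Estimate concentrations (in micromolar) from intensity ratios.

    species maps a lipid name to (head-group class, measured intensity).
    standard_intensity maps a head-group class to the intensity of its
    internal standard, spiked at standard_conc_um.
    """
    concentrations = {}
    for name, (head_group, intensity) in species.items():
        reference = standard_intensity[head_group]
        if reference <= 0:
            raise ValueError(f"no usable standard signal for {head_group}")
        concentrations[name] = intensity / reference * standard_conc_um
    return concentrations


if __name__ == "__main__":
    measured = {"PC 34:1": ("PC", 2000.0), "PE 36:2": ("PE", 500.0)}
    standards = {"PC": 1000.0, "PE": 1000.0}
    print(quantify_species(measured, standards, standard_conc_um=1.0))
```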
Nakamura K, Suzuki Y, Goto-Inoue N, Yoshida-Noro C, Suzuki A. Structural characterization of neutral glycosphingolipids by thin-layer chromatography coupled to matrix-assisted laser desorption/ionization quadrupole ion-trap time-of-flight MS/MS. Analytical Chemistry 78;2006:5736–5743. [PubMed]
The method described here for analysis of neutral glycosphingolipids (GSLs) involves direct coupling of thin-layer chromatography (TLC) for the separation of GSLs with MALDI mass spectrometry for their identification. The problem with this marriage of analytical techniques is that the rough surface of TLC plates may result in poor mass accuracy and resolution of mass spectra. These problems are avoided by the use of an ion trap for storage of ions derived from MALDI prior to their analysis in a time-of-flight mass analyzer. Using dihydroxybenzoic acid as the matrix, structural characterization of GSLs is achieved on the picomole scale.
Khatib-Shahidi S, Andersson M, Herman JL, Gillespie TA, Caprioli RM. Direct molecular analysis of whole-body animal tissue sections by imaging MALDI mass spectrometry. Analytical Chemistry 78;2006:6448–6456. [PubMed]
This study demonstrates that MALDI imaging mass spectrometry can be used to deduce the tissue localization of a drug molecule in sections of a whole animal. Rats are dosed orally with olanzapine, and sagittal sections of whole animals 2 h and 6 h post dose are prepared with a cryostat. Sections are immobilized on a target plate, and dihydroxybenzoic acid matrix solution is applied by spraying. The drug distribution is ascertained by multiple reaction monitoring in a tandem TOF/TOF mass spectrometer system. Olanzapine is observed to be generally distributed, but with significant localization in specific organs. The distribution of the olanzapine metabolites N-desmethylolanzapine and 2-hydroxymethyl olanzapine is also determined, and shown to represent 21% of the total MS/MS signal. The results correlate well with previous autoradiographic distribution studies.
Monroe EB, Jurchen JC, Koszczuk BA, Losh JL, Rubakhin SS, Sweedler JV. Massively parallel sample preparation for the MALDI MS analyses of tissues. Analytical Chemistry 78;2006:6826–6832. [PubMed]
In MALDI mass spectral imaging, matrix solution is applied to the tissue section to be imaged and extracts analytes for incorporation into the matrix crystals. There is an inherent advantage in sensitivity to exposing the sample to matrix for a longer time to increase the amount of analyte extracted. Unfortunately, however, this advantage is offset by diffusional blurring and loss of spatial resolution in the image. The present report describes a method to overcome this problem. Tissue slices are applied to a monolayer of 38-μm glass beads that are pressed into a square of Parafilm M. The Parafilm is then stretched, causing the tissue to become divided into thousands of pieces individually attached to the beads. In this way, the spatial organization of the tissue is preserved, but the pieces become separated from one another by the hydrophobic Parafilm surface to minimize analyte migration between them. Matrix solution is then applied with an airbrush. For this purpose, the sample is placed in a humidified environment and cooled by a Peltier device below the dew point of water to prolong analyte extraction and control the size of the coalescing liquid droplets. The Peltier device is then allowed to warm to ambient temperature to permit controlled matrix crystallization. The mass spectra obtained for such samples are equivalent to those acquired from single cells.
Kraft ML, Weber PK, Longo ML, Hutcheon ID, Boxer SG. Phase separation of lipid membranes analyzed with high-resolution secondary ion mass spectrometry. Science 313;2006:1948–1951. [PubMed]
Secondary ion mass spectrometry (SIMS) is used here to provide very high resolution mass spectral images. A focused beam of 133Cs+ ions is employed as the source of primary ions. Lipid bilayers are formed on warmed silicon wafers (which help dissipate charge during the SIMS analysis). Upon cooling to room temperature, the bilayers undergo phase separation, and are then frozen and lyophilized. The distribution of lipids is then determined by mass spectrometry at a spatial resolution of approximately 100 nm, sufficient to describe the composition of small lipid domains. Variations in composition are detected within some domains.
Chen X, Murawski A, Kuang G, Sexton DJ, Galbraith W. Sample preparation for MALDI mass spectrometry using an elastomeric device reversibly sealed on the MALDI target. Analytical Chemistry 78;2006:6160–6168. [PubMed]
A concentrator device consisting of an array of wells open at both the top and the bottom is fabricated to interface with standard MALDI targets. The device is made of poly(dimethylsiloxane), an elastomer that can make a reversible connection to the MALDI target without leaking fluid placed in the wells. Large-volume samples (5–200 μL) can be placed in the wells and dried down in a vacuum centrifuge. This allows acquisition of spectra at high sensitivity from dilute samples. The hydrophobic nature of the elastomer permits peptides to be bound to the walls of the wells and desalted by washing, then eluted onto the target for in situ sample preparation. In-well trypsin digestion is also demonstrated, followed by desalting and concentrating the digestion products all in the same well. The concentrator devices are available from BD Biosciences, Bedford, MA.
Chen Y, Vertes A. Adjustable fragmentation in laser desorption/ionization from laser-induced silicon microcolumn arrays. Analytical Chemistry 78;2006:5835–5844. [PubMed]
In 1999, Siuzdak and coworkers described the use of silicon wafers galvanostatically etched with hydrofluoric acid as substrates to support matrix-free laser desorption/ionization. The present paper produces similar nano-porous surfaces on silicon by irradiating with a pulsed Nd:YAG laser in environments of air, sulfur hexafluoride, or water. The morphology of the surface varies with the chemical environment, the laser pulse length, fluence, and number of pulses. Surfaces produced in a water environment are shown to support laser desorption/ionization of peptides and synthetic polymers using a nitrogen laser at low fluence. Fluence thresholds and ion yields similar to those seen in MALDI are observed, and low femtomole sensitivity and a mass range up to 6000 Da are demonstrated. Furthermore, at elevated laser fluence, in-source decay can be induced to provide sequence information.
Giritch A, Marillonnet S, Enger C, van Eldik G, Botterman J, Klimyuk V, Gleba Y. Rapid high-yield expression of full-size IgG antibodies in plants coinfected with noncompeting viral vectors. Proceedings of the National Academy of Sciences, U.S.A. 103;2006:14,701–14,706.
Expression of proteins for therapeutic application in stably transformed plants has suffered from very long development times and low yields. The present article remedies these problems by adopting a transient transfection approach. Genes are designed for efficient nuclear processing in the plant host Nicotiana by removing cryptic splice sites and adding several introns compatible with the host transcription/translation machinery. The genes for the immunoglobulin heavy and light chains are delivered by separate plant viruses that do not display mutual exclusion from host cells, a phenomenon that has previously interfered with the expression of hetero-oligomeric proteins. The viral vectors are delivered by an Agrobacterium sp., which mediates the primary infection and systemic spread through the plant, while the viral vector provides for cell-to-cell spread, amplification, and high expression. The vectors are engineered with specific recombinase sites that, when co-delivered with the cognate integrase, recombine in the host to form a complete viral replicon. The results show co-expression of heavy and light chains in 82% of leaf cells. Yields of up to 0.5 g of assembled antibody per kg of leaf biomass are reported in just 14 d from gene delivery to harvested plant cells. The plants are best grown in controlled growth rooms, mitigating fears about transgenic containment. The system promises to provide a platform for rapid, large-scale manufacturing of antibodies as well as other protein complexes.
Montgomery R, Jia X, Tolley L. Dynamic isoelectric focusing for proteomics. Analytical Chemistry 78;2006:6511–6518. [PubMed]
Isoelectric focusing is a separation technique capable of producing separations of very high peak capacity. However, isoelectric focusing in solution is difficult to interface with other separation methods without serious loss of resolution through zone broadening. Here, a dynamic isoelectric focusing technique is described that uses additional power supplies to control the shape of the electric field within an isoelectric focusing capillary. The field is manipulated to change the pH gradient so that zones can be moved to a collection point without ever becoming de-focused. Peak capacities of over 1000 are demonstrated with this technique. Applications in multidimensional separations, for example for proteomics, are envisioned.
Seebacher J, Mallick P, Zhang N, Eddes JS, Aebersold R, Gelb MH. Protein cross-linking analysis using mass spectrometry, isotope-coded cross-linkers, and integrated computational data processing. Journal of Proteome Research 5;2006:2270–2282. [PubMed]
Chemical cross-linking provides a way to identify the proteins that bind to one another in multi-protein complexes and to localize the sites of interaction between binding partners. Proteolytic digestion of the interacting proteins is followed by identification of the cross-linked peptides by mass spectrometry. The task of recognizing cross-linked peptides is complicated by the excess of peptides underivatized by the cross-linking reagent, and peptides that have reacted with a reagent molecule without becoming cross-linked to another peptide. The present paper uses isotopically substituted cross-linking reagents and isotopically substituted water to distinguish between unmodified peptides, mono-substituted or loop-linked peptides, and cross-linked peptides. A sample of protein in [16O]-water is split into two. One half is treated with a light (undeuterated) form of the cross-linker (a bis-NHS ester), and the other half is treated with heavy (deuterated) cross-linker. After reaction, the two samples are recombined, dried, redissolved in [16O]-water, and digested. Peptides are fractionated by reverse-phase chromatography and then analyzed by MALDI mass spectrometry. Cross-linked peptides will appear as doublets. An identical protein sample is prepared in a 50/50 mixture of [16O/18O]-water, split into two, and treated with light and heavy forms of the cross-linker as before. This permits cross-links to be distinguished from mono-links: mono-links show an additional 2-Da splitting because 18O is incorporated from the solvent when a functional group not attached to protein is inactivated by hydrolysis. A software package for analyzing the data is available under open-source license from http://www.systemsbiology.org/Resources_and_Development/Downloadable_Software.
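The peak-pattern logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the light/heavy mass difference of the cross-linker is not given here and is passed in as the hypothetical parameter linker_shift, the tolerance is arbitrary, and the labels use only the two splittings described (the linker doublet and the roughly 2-Da 18O shift).

```python
from typing import Dict, List, Optional

O18_SHIFT = 2.0042  # mass difference between 18O and 16O, in Da


def find_peak(peaks: List[float], target: float, tol: float) -> Optional[int]:
    """Return the index of the first peak within tol of target, else None."""
    for index, mass in enumerate(peaks):
        if abs(mass - target) <= tol:
            return index
    return None


def classify_peaks(masses: List[float], linker_shift: float,
                   tol: float = 0.02) -> Dict[float, str]:
    """Label the lightest peak of each pattern in a 16O/18O-water spectrum."""
    peaks = sorted(masses)
    used = set()  # indices of heavy or 18O partner peaks already explained
    labels: Dict[float, str] = {}
    for index, mass in enumerate(peaks):
        if index in used:
            continue
        heavy = find_peak(peaks, mass + linker_shift, tol)
        if heavy is None:
            labels[mass] = "unmodified"  # no light/heavy linker doublet
            continue
        used.add(heavy)
        o18 = find_peak(peaks, mass + O18_SHIFT, tol)
        if o18 is None:
            labels[mass] = "cross-link candidate"  # doublet only
        else:
            labels[mass] = "mono-link"  # doublet plus 18O splitting
            used.add(o18)
            heavy_o18 = find_peak(peaks, mass + linker_shift + O18_SHIFT, tol)
            if heavy_o18 is not None:
                used.add(heavy_o18)
    return labels


if __name__ == "__main__":
    spectrum = [1000.0, 1500.0, 1504.0, 2000.0, 2002.0042, 2004.0, 2006.0042]
    print(classify_peaks(spectrum, linker_shift=4.0))
```

Real spectra would also need isotope-envelope handling and intensity checks, which this sketch leaves out.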
Prince JT, Marcotte EM. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical Chemistry 78;2006:6140–6152. [PubMed]
This paper addresses the problem of aligning chromatograms produced in multiple experiments to compare protein/peptide expression levels, such as those employing LC-ESI/MS (shotgun) techniques. Although dataset comparisons may be based on peptide identities discovered by mass spectrometry, uncertainties in peptide assignment, potentially large changes in peptide abundance, and stochastic variation in the representation of peptides in replicate datasets all contribute to the complexity of the operation. This paper presents a method for chromatographic alignment based on dynamic time warping, a methodology first used in speech processing to align single or multivariate signals across time while preserving the ordering of the signals. As a measure of spectral similarity, Pearson’s correlation coefficient provides the best alignments. Using optimized parameters, and a global gap penalty function as opposed to a local weighting scheme alone, smooth warping functions are calculated. The success of the approach is demonstrated even for runs on samples that differ substantially due to biological variation or pre-fractionation. Software for the procedure is available under an MIT-style license at http://obi-warp.sourceforge.net/.
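The core idea of dynamic time warping scored by Pearson correlation can be shown with a small sketch. It is a textbook-style illustration under stated assumptions, not the OBI-Warp implementation: each run is a list of equally binned spectra, a constant gap_penalty (a hypothetical parameter) stands in for the paper's gap penalty function, and no interpolation or smoothing of the warp is attempted.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat spectrum carries no correlation information
    return cov / (sd_x * sd_y)


def align(run_a: List[List[float]], run_b: List[List[float]],
          gap_penalty: float = 0.5) -> List[Tuple[int, int]]:
    """Order-preserving alignment path maximizing summed similarity."""
    n, m = len(run_a), len(run_b)
    sim = [[pearson(a, b) for b in run_b] for a in run_a]
    score = [[float("-inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    score[0][0] = sim[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            moves = []
            if i > 0 and j > 0:  # advance in both runs
                moves.append((score[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:  # advance in run A only (penalized)
                moves.append((score[i - 1][j] - gap_penalty, (i - 1, j)))
            if j > 0:  # advance in run B only (penalized)
                moves.append((score[i][j - 1] - gap_penalty, (i, j - 1)))
            best, previous = max(moves, key=lambda move: move[0])
            score[i][j] = best + sim[i][j]
            back[i][j] = previous
    path = [(n - 1, m - 1)]  # trace back from the final scan pair
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return path


if __name__ == "__main__":
    run_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    run_b = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
    print(align(run_a, run_b))
```

In the toy example the second spectrum of run A is matched to both of its repeated copies in run B, which is the kind of local stretching that a warping function encodes.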
Craig R, Cortens JC, Fenyo D, Beavis RC. Using annotated peptide mass spectrum libraries for protein identification. Journal of Proteome Research 5;2006:1843–1849. [PubMed]
Frewen BE, Merrihew GE, Wu CC, Noble WS, MacCoss MJ. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Analytical Chemistry 78;2006:5678–5684. [PubMed]
Both these papers explore the possibility of assigning peptide MS/MS spectra by comparison with previously assigned spectra instead of by the more conventional method of comparing peak patterns with those predicted from library sequences based on the known rules governing the CID process. Both groups construct a library of assigned spectra and describe algorithms for searching them. Craig et al. provide their library and software for library population and searching in the supporting information for their paper at http://pubs.acs.org. Frewen et al. provide theirs at http://proteome.gs.washington.edu. The process of assignment using this approach is shown to be successful for the majority of spectra that can be assigned using a conventional search engine, and to be rapid.
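Matching a query spectrum against a library of assigned spectra can be sketched as follows. The scoring shown (cosine similarity of unit-width binned peak lists, after a precursor m/z filter) is a generic choice assumed for illustration; it is not claimed to be the algorithm of either paper, and the record layout, tolerance, and peptide labels are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def binned(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    vector: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        vector[key] = vector.get(key, 0.0) + intensity
    return vector


def cosine(u: Dict[int, float], v: Dict[int, float]) -> float:
    """Cosine similarity of two sparse intensity vectors."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(precursor_mz: float, peaks: List[Peak], library: List[dict],
           precursor_tol: float = 2.0) -> Tuple[Optional[str], float]:
    """Return (peptide, score) of the best-scoring library spectrum."""
    query = binned(peaks)
    best: Tuple[Optional[str], float] = (None, 0.0)
    for entry in library:
        # Only compare spectra whose precursor m/z is close to the query's.
        if abs(entry["precursor_mz"] - precursor_mz) > precursor_tol:
            continue
        score = cosine(query, binned(entry["peaks"]))
        if score > best[1]:
            best = (entry["peptide"], score)
    return best


if __name__ == "__main__":
    library = [
        {"peptide": "PEPTIDEK", "precursor_mz": 464.7,
         "peaks": [(147.1, 50.0), (276.2, 100.0), (391.2, 80.0)]},
        {"peptide": "SAMPLEK", "precursor_mz": 465.2,
         "peaks": [(147.1, 40.0), (260.2, 90.0), (373.3, 100.0)]},
    ]
    query = [(147.1, 5.0), (276.2, 10.0), (391.2, 8.0)]
    print(search(464.8, query, library))
```

Because no theoretical fragmentation has to be generated, each comparison is a cheap vector operation, which is in keeping with the speed reported above.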
Lynch JL, deSilva CJS, Peeva VK, Swanson NR. Comparison of commercial probe labeling kits for microarray: Towards quality assurance and consistency of reactions. Analytical Biochemistry 355;2006:224–231. [PubMed]
Seven commercially available kits for the synthesis of labeled cDNA probes are compared for use in two-dye expression analysis employing oligonucleotide microarrays. The kits utilize differing methods for cDNA probe synthesis, including direct labeling with cyanine 3-dCTP/cyanine 5-dCTP, amino allyl indirect labeling, and dendrimer technology. Although all the kits were found to label probes successfully, variation in performance was observed. The Stratagene Fairplay Microarray Labeling Kit supported identification of the largest number of expression changes and the lowest incidence of spot signal intensities lower than background. The Invitrogen SuperScript Indirect cDNA Labeling System displayed the lowest gene-associated variability and technical variation between replicates. The Promega Pronto! Plus System, which uses direct labeling, showed the smallest dye bias effect. This result challenges the widespread assumption that indirect labeling affords lower dye bias.
Tong W, Lucas AB, Shippy R, Fan X, Fang H, Hong H, Orr MS, Chu T-M, Guo X, Collins PJ, Sun YA, Wang S-J, Bao W, Wolfinger RD, Shchegrova S, Guo L, Warrington JA, Shi L. Evaluation of external RNA controls for the assessment of microarray performance. Nature Biotechnology 24;2006:1132–1139.
This article, one of a recent collection focusing on data quality in genomics and microarrays (see Ji H, Davis RW, Nature Biotechnology 24;2006:1112–1113), evaluates the utility of adding external RNA controls to test samples for assessing microarray performance. External controls may be added to total RNA samples before performing cDNA synthesis and in vitro transcription, or the controls may be added to cRNA samples immediately before they are hybridized to arrays. These approaches provide different information, and may both be used in a single experiment. Together, they provide quantitative measures of assay performance, including the ability to assess dynamic range.
Kusnezow W, Syagailo YV, Rüffer S, Baudenstiel N, Gauer C, Hoheisel JD, Wild D, Goychuk I. Optimal design of microarray immunoassays to compensate for kinetic limitations: Theory and experiment. Molecular and Cellular Proteomics 5;2006:1681–1696. [PubMed]
The kinetics of binding in antibody array experiments is demonstrated to be surprisingly strongly mass transport limited, sometimes requiring prolonged incubation times. Assay formats that permit stirring are therefore strongly favored, and attention to the geometry of the reaction chamber to optimize mixing is recommended. The effects of sample viscosity on diffusion rates must be allowed for in choosing incubation times, although it is recommended that the concentration of samples be kept as high as possible to promote the rate of binding of low-abundance proteins. Binding-site density within array features is shown to affect signal strength, and is optimized by modifying antibody spotting concentrations. Sample volume must exceed that at which binding to the array, whether by specific interactions or by nonspecific adsorption, causes analyte depletion in solution. Caution is advised in choosing washing times, because significant signal loss may be incurred during prolonged washing due to dissociation.
Kudva IT, Krastins B, Sheng H, Griffin RW, Sarracino DA, Tarr PI, Hovde CJ, Calderwood SB, John M. Proteomics-based expression library screening (PELS): A novel method for rapidly defining microbial immunoproteomes. Molecular and Cellular Proteomics 5;2006:1514–1519. [PubMed]
A method is described for rapidly identifying microbial proteins that are immunogenic in the host. Recombinant proteins encoded by the pathogen’s DNA are expressed from an inducible, microbial genomic DNA expression library. The expressed proteins are collected for affinity chromatography on a polyclonal antibody column made from affinity-purified sera of acute or convalescing infected hosts. Proteins captured by the column are fractionated by SDS-PAGE and then identified by LC-MS/MS analysis as putative targets for host immune response. This approach is validated by testing proteins from E. coli strain O157:H7 for binding to antibodies from the pooled sera of hyperimmune cattle. Two hundred seven proteins, representing 3.8% of the proteome of O157, are identified in the screen, of which 35 are known to be immunogenic in humans. The 207 proteins are candidates for development of a vaccine for eliminating this pathogen from the gastrointestinal tracts of cattle, an important source of human infection. In principle, this approach is suitable for any cultivable, sequenced pathogen that elicits host antibody response.
Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet J-P, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggerty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313;2006:1929–1935. [PubMed]
This work represents a pilot study to establish the potential utility of a public database documenting the perturbations in mRNA expression patterns that are induced by an extensive panel of drug molecules in mammalian cells in culture. One hundred sixty-four small molecules representing a broad range of activities are tested against a small panel of one to four cell lines. To recognize patterns that may suggest mechanisms of action common between drugs, a nonparametric pattern-matching strategy is adopted to assess the similarities between a query signature, representing the changes in gene expression produced by a given drug, and a set of reference signatures in the database. These reference signatures are also represented in a nonparametric fashion. The study shows that such expression signatures can be used to recognize common mechanisms of action (e.g., histone deacetylase inhibitors and estrogen receptor modulators), to discover unknown mechanisms of drug action (e.g., gedunin as an HSP90 inhibitor), and to identify potential new therapeutic agents (e.g., sirolimus for overcoming dexamethasone resistance in acute lymphoblastic leukemia). Software for these analyses is available from www.broad.mit.edu/cmap. Expansion of this effort as a community resource project to include a much larger range of drugs and cell lines is anticipated.
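A rank-based comparison of a query signature with a reference profile can be sketched with a Kolmogorov-Smirnov-style running statistic. The summary above does not state the exact scoring, so this choice, together with the function names and the toy gene labels, is an assumption made for illustration only. The reference is taken to be a list of genes ordered from most up-regulated to most down-regulated, and the query is a pair of up and down gene sets.

```python
from typing import List, Set


def ks_score(ranked_genes: List[str], tags: Set[str]) -> float:
    """Signed KS-style enrichment of tags within a ranked gene list.

    Positive values mean the tags cluster near the top of the list,
    negative values mean they cluster near the bottom.
    """
    n = len(ranked_genes)
    positions = sorted(rank + 1 for rank, gene in enumerate(ranked_genes)
                       if gene in tags)
    t = len(positions)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(positions))
    b = max(v / n - j / t for j, v in enumerate(positions))
    return a if a > b else -b


def connectivity(ranked_genes: List[str], up_tags: Set[str],
                 down_tags: Set[str]) -> float:
    """Combine the up and down enrichments into one signed score."""
    ks_up = ks_score(ranked_genes, up_tags)
    ks_down = ks_score(ranked_genes, down_tags)
    # If both are positive, or neither is, the evidence is inconsistent.
    if (ks_up > 0) == (ks_down > 0):
        return 0.0
    return ks_up - ks_down


if __name__ == "__main__":
    reference = [f"g{i}" for i in range(1, 11)]  # g1 most up, g10 most down
    print(connectivity(reference, {"g1", "g2"}, {"g9", "g10"}))
    print(connectivity(reference, {"g9", "g10"}, {"g1", "g2"}))
```

Because only ranks enter the calculation, the score does not depend on the measurement scale of the reference or query data, which is the practical appeal of a nonparametric representation.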
McAfee KJ, Duncan DT, Assink M, Link AJ. Analyzing proteomes and protein function using graphical comparative analysis of tandem mass spectrometry results. Molecular and Cellular Proteomics 5;2006:1497–1513. [PubMed]
This paper contributes to the solution of the numerous and substantial problems associated with the storage, dissemination, and analysis of the large mass-spectrometric datasets acquired in proteomics. A relational database built upon the Oracle Relational Database Management System is described. The system, called Bioinformatic Graphical Comparative Analysis Tools (BIGCAT), incorporates not only mass-spectrometric search and archiving utilities but also a suite of data-mining applications, accessed via a Web-based browser interface, that are structured in a manner suitable for addressing biological questions related to protein function and cell state.
Widmann J, Hamady M, Knight R. DivergentSet, a tool for picking non-redundant sequences from large sequence collections. Molecular and Cellular Proteomics 5;2006:1520–1532. [PubMed]
In tasks such as identifying functional motifs that aim to correlate sequence with function, it is desirable to minimize the contribution of sequences that are similar simply because they derive from recent common ancestry rather than shared functional constraints. For this purpose, representative sequences from a taxon of closely similar sequences are selected for inclusion in the analysis. Phylogenetic analysis of a large number of sequences using a full distance matrix for this purpose can be exceedingly time-consuming. The present paper provides an automated method for accomplishing this task much more rapidly. A Web-based utility called DivergentSet is described. The user starts with a single sequence, identifier, or set of sequences. The program recovers additional sequences using utilities such as BLAST, and a tree relating the sequences is then used to choose a candidate divergent set. To validate the set, all pairs of sequences in the set are compared and non-divergent ones are discarded. The resulting divergent set can then be recovered in the form of database identifiers or sequences for use in motif finding, phylogenetic analysis, covariation analysis, etc., or may be subjected to another round of refinement. DivergentSet can be accessed as a Web-based tool at bmf.colorado.edu/divergentset.
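The validation step, in which pairs within the candidate set are compared and non-divergent sequences are discarded, can be sketched as a greedy filter. This is an illustration under simplifying assumptions rather than the DivergentSet code: sequences are assumed to be pre-aligned or of comparable length, identity is a plain position-by-position match fraction, and the max_identity threshold is a hypothetical parameter.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are trivially identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def validate_divergent_set(candidates: List[str],
                           max_identity: float = 0.8) -> List[str]:
    """Keep a sequence only if it is divergent from every sequence kept so far."""
    kept: List[str] = []
    for sequence in candidates:
        if all(identity(sequence, other) <= max_identity for other in kept):
            kept.append(sequence)
    return kept


if __name__ == "__main__":
    candidates = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
    print(validate_divergent_set(candidates))
```

The greedy pass depends on input order, so a tree-derived candidate ordering, as described above, would determine which member of a closely related group is retained.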