1.  Comparison of Speech Recognition and Localization Performance in Bilateral and Unilateral Cochlear Implant Users Matched on Duration of Deafness and Age at Implantation 
Ear and hearing  2008;29(3):352-359.
The purpose of this investigation was to compare speech recognition and localization performance of subjects who wear bilateral cochlear implants (CICI) with subjects who wear a unilateral cochlear implant (true CI-only).
A total of 73 subjects participated in this study. Specifically, of the 73 subjects, 64 (32 CICI and 32 true CI-only) participated in the word recognition testing; 66 (33 CICI and 33 true CI-only) participated in the sentence recognition testing; and 24 (12 CICI and 12 true CI-only) participated in the localization testing. Because of time constraints, not all subjects completed all testing. The average age at implantation for the CICI and true CI-only listeners who participated in the speech perception testing was 54 and 55 yrs, respectively, and the average duration of deafness was 8 yrs for both groups of listeners. The average age at implantation for the CICI and true CI-only listeners who participated in the localization testing was 54 and 53 yrs, respectively, and the average duration of deafness was 10 yrs for the CICI listeners and 11 yrs for the true CI-only listeners. All speech stimuli were presented from the front. The test setup for everyday-sound localization comprised an eight-speaker array spanning an arc of approximately 108° in the frontal horizontal plane.
Average group results were transformed to Rationalized Arcsine Unit scores. A comparison in performance between the CICI score and the true CI-only score in quiet revealed a significant difference between the two groups with the CICI group scoring 19% higher for sentences and 24% higher for words. In addition, when both cochlear implants were used together (CICI) rather than when either cochlear implant was used alone (right CI or left CI) for the CICI listeners, results indicated a significant binaural summation effect for sentences and words.
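The rationalized arcsine transform mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly cited Studebaker (1985) form of the transform; the abstract does not state which variant or which test-list lengths were used, so the function name and the example numbers are hypothetical.

```python
import math


def rau(correct: int, total: int) -> float:
    """Rationalized arcsine unit for `correct` items right out of `total`.

    Assumed form (Studebaker, 1985):
        theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
        RAU   = (146 / pi) * theta - 23
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0


# Hypothetical scores on a 50-item list: the transform is close to the
# percentage in the mid-range and stretches the scale near 0% and 100%.
mid_score = rau(25, 50)
floor_score = rau(0, 50)
ceiling_score = rau(50, 50)
```

The point of the transform is to stabilize variance near the floor and ceiling, which makes group differences in percent-correct scores more suitable for parametric comparison.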
The average group results in this study showed significantly greater benefit on words and sentences in quiet and localization for listeners using two cochlear implants over those using only one cochlear implant. One explanation of this result might be that the same information from both sides is combined, which results in a better representation of the stimulus. A second explanation might be that CICI allow for the transfer of different neural information from two damaged peripheral auditory systems, leading to different patterns of information summating centrally and resulting in enhanced speech perception. A future study using similar methodology to the current one will have to be conducted to determine if listeners with two cochlear implants are able to perform better than listeners with one cochlear implant in noise.
PMCID: PMC4266575  PMID: 18453885
2.  Motif-based analysis of large nucleotide data sets using MEME-ChIP 
Nature protocols  2014;9(6):1428-1450.
MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by CLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. This protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods.
PMCID: PMC4175909  PMID: 24853928
3.  Comparative analysis of metazoan chromatin organization 
Ho, Joshua W. K. | Jung, Youngsook L. | Liu, Tao | Alver, Burak H. | Lee, Soohyun | Ikegami, Kohta | Sohn, Kyung-Ah | Minoda, Aki | Tolstorukov, Michael Y. | Appert, Alex | Parker, Stephen C. J. | Gu, Tingting | Kundaje, Anshul | Riddle, Nicole C. | Bishop, Eric | Egelhofer, Thea A. | Hu, Sheng’en Shawn | Alekseyenko, Artyom A. | Rechtsteiner, Andreas | Asker, Dalal | Belsky, Jason A. | Bowman, Sarah K. | Chen, Q. Brent | Chen, Ron A-J | Day, Daniel S. | Dong, Yan | Dose, Andrea C. | Duan, Xikun | Epstein, Charles B. | Ercan, Sevinc | Feingold, Elise A. | Ferrari, Francesco | Garrigues, Jacob M. | Gehlenborg, Nils | Good, Peter J. | Haseley, Psalm | He, Daniel | Herrmann, Moritz | Hoffman, Michael M. | Jeffers, Tess E. | Kharchenko, Peter V. | Kolasinska-Zwierz, Paulina | Kotwaliwale, Chitra V. | Kumar, Nischay | Langley, Sasha A. | Larschan, Erica N. | Latorre, Isabel | Libbrecht, Maxwell W. | Lin, Xueqiu | Park, Richard | Pazin, Michael J. | Pham, Hoang N. | Plachetka, Annette | Qin, Bo | Schwartz, Yuri B. | Shoresh, Noam | Stempor, Przemyslaw | Vielle, Anne | Wang, Chengyang | Whittle, Christina M. | Xue, Huiling | Kingston, Robert E. | Kim, Ju Han | Bernstein, Bradley E. | Dernburg, Abby F. | Pirrotta, Vincenzo | Kuroda, Mitzi I. | Noble, William S. | Tullius, Thomas D. | Kellis, Manolis | MacAlpine, David M. | Strome, Susan | Elgin, Sarah C. R. | Liu, Xiaole Shirley | Lieb, Jason D. | Ahringer, Julie | Karpen, Gary H. | Park, Peter J.
Nature  2014;512(7515):449-452.
PMCID: PMC4227084  PMID: 25164756
4.  Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra 
Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly out-performs the de-facto standard tools for this task: SEQUEST and Mascot.
PMCID: PMC4221238  PMID: 25383048
5.  Implications of COMT long-range interactions on the phenotypic variability of 22q11.2 deletion syndrome 
Nucleus  2013;4(6):487-493.
22q11.2 deletion syndrome (22q11DS) results from a hemizygous microdeletion on chromosome 22 and is characterized by extensive phenotypic variability. Penetrance of signs, including congenital heart, craniofacial, and neurobehavioral abnormalities, varies widely and is not well correlated with genotype. The three-dimensional structure of the genome may help explain some of this variability. The physical interaction profile of a given gene locus with other genetic elements, such as enhancers and co-regulated genes, contributes to its regulation. Thus, it is possible that regulatory interactions with elements outside the deletion region are disrupted in the disease state and modulate the resulting spectrum of symptoms. COMT, a gene within the commonly deleted ~3 Mb region, has been implicated as a contributor to the neurological features frequently found in 22q11DS patients. We used this locus as bait in a 4C-seq experiment to investigate genome-wide interaction profiles in B lymphocyte and fibroblast cell lines derived from both 22q11DS and unaffected individuals. All normal B lymphocyte lines displayed local, conserved chromatin looping interactions with regions that are lost in atypical and distal deletions, which may mediate similarities between typical, atypical, and distal 22q11 deletion phenotypes. There are also distinct clusterings of cis interactions based on disease state. We identified regions of differential trans interactions present in normal, and lost in deletion-carrying, B lymphocyte cell lines. These data suggest that hemizygous chromosomal deletions such as 22q11DS can have widespread effects on chromatin organization, and may contribute to the inherent phenotypic variability.
PMCID: PMC3925693  PMID: 24448439
DiGeorge syndrome; long-range interactions; chromosome conformation capture; genome organization; schizophrenia
6.  Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry 
We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.
PMCID: PMC4185971  PMID: 25298752
7.  Inferring Clonal Composition from Multiple Sections of a Breast Cancer 
PLoS Computational Biology  2014;10(7):e1003703.
Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger numbers of tumors should cast light on the clonal evolution of cancers in space and time.
Author Summary
Cancers arise from a series of mutations that occur over time. As a result, as a tumor grows each cell inherits a distinctive genotype, defined by the set of all somatic mutations that distinguish the tumor cell from normal cells. Ascertaining these genotype patterns, and identifying which ones are associated with the growth of the cancer and its ability to metastasize, can potentially give clinicians insights into how to treat the cancer. In this work, we describe a method for inferring the predominant genotypes within a single tumor. The method requires that a tumor be sectioned and that each section be subjected to a high-throughput sequencing procedure. The resulting mutations and their associated frequencies within each tumor section are then used as input to a probabilistic model that infers the underlying genotypes and their relative frequencies within the tumor. We use simulated data to demonstrate the validity of the approach, and then we apply our algorithm to data from a primary breast cancer and associated metastatic lymph node. We demonstrate that our algorithm predicts genotypes that are consistent with an evolutionary model and with the physical topology of the tumor itself. Applying this method to larger numbers of tumors should cast light on the evolution of cancers in space and time.
PMCID: PMC4091710  PMID: 25010360
8.  Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics 
Journal of proteomics  2012;0:123-131.
The analysis of a shotgun proteomics experiment results in a list of peptide-spectrum matches (PSMs) in which each fragmentation spectrum has been matched to a peptide in a database. Subsequently, most protein inference algorithms rank peptides according to the best-scoring PSM for each peptide. However, there is disagreement in the scientific literature on the best method to assess the statistical significance of the resulting peptide identifications. Here, we use a previously described calibration protocol to evaluate the accuracy of three different peptide-level statistical confidence estimation procedures: the classical Fisher’s method, and two complementary procedures that estimate significance, respectively, before and after selecting the top-scoring PSM for each spectrum. Our experiments show that the latter method, which is employed by MaxQuant and Percolator, produces the most accurate, well-calibrated results.
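Of the three procedures named above, the classical Fisher's method has a simple closed form, sketched below. The sketch assumes independent p-values (for example, the PSM-level p-values assigned to one peptide); how the paper prepares those p-values is not stated in the abstract, and the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic x = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        P(X > x) = exp(-x / 2) * sum_{i=0}^{k-1} (x / 2)**i / i!
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is x / 2
    term = 1.0   # (x/2)**i / i! for i = 0
    total = 1.0  # running sum of the series
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two hypothetical PSM p-values of 0.05 for the same peptide combine
# into a smaller peptide-level p-value.
combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the series.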
PMCID: PMC3683086  PMID: 23268117
Shotgun proteomics; peptides; statistics
9.  A statistical approach for inferring the 3D structure of the genome 
Bioinformatics  2014;30(12):i26-i33.
Motivation: Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA–DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction.
Methods: We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
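The statistical model described in the Methods paragraph can be illustrated with a small log-likelihood function. This is a sketch under stated assumptions, not the authors' implementation: it assumes a power-law transfer function, intensity = beta * distance**alpha with alpha < 0, and the parameter values, function name, and toy data are hypothetical.

```python
import math


def poisson_log_likelihood(coords, counts, alpha=-3.0, beta=1.0):
    """Log-likelihood of Hi-C contact counts given a 3D structure.

    coords: list of (x, y, z) positions, one per locus.
    counts: dict mapping (i, j) with i < j to the observed contact count.
    Assumed model: c_ij ~ Poisson(beta * d_ij ** alpha), alpha < 0, so the
    expected count decreases as the distance d_ij between loci grows.
    """
    ll = 0.0
    for (i, j), c in counts.items():
        d = math.dist(coords[i], coords[j])
        lam = beta * d ** alpha
        # Poisson log-pmf: c * log(lam) - lam - log(c!)
        ll += c * math.log(lam) - lam - math.lgamma(c + 1)
    return ll


# Toy data: loci 0-1 and 1-2 are in frequent contact, loci 0-2 are not.
toy_counts = {(0, 1): 8, (1, 2): 8, (0, 2): 1}
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
swapped = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ll_chain = poisson_log_likelihood(chain, toy_counts, alpha=-3.0, beta=8.0)
ll_swapped = poisson_log_likelihood(swapped, toy_counts, alpha=-3.0, beta=8.0)
```

A structure-inference method of this kind would search over `coords` (and, when the transfer function is optimized, over `alpha` and `beta`) to maximize this quantity; here the linear chain explains the toy counts better than the arrangement with loci 1 and 2 swapped.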
Results: We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms—two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method—on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions.
Availability and implementation: A Python implementation of the proposed method is available at
PMCID: PMC4229903  PMID: 24931992
10.  Upregulation of the mammalian X chromosome is associated with enhanced transcription initiation, MOF-mediated H4K16 acetylation, and longer RNA half-life 
Developmental cell  2013;25(1):55-68.
X upregulation in mammals increases levels of expressed X-linked transcripts to compensate for autosomal bi-allelic expression. Here, we present molecular mechanisms that enhance X expression at transcriptional and posttranscriptional levels. Active mouse X-linked promoters are enriched in the initiation form of RNA polymerase II (PolII-S5p) and in specific histone marks including H4K16ac and histone variant H2AZ. The H4K16 acetyltransferase MOF, known to mediate the Drosophila X upregulation, is also enriched on the mammalian X. Depletion of MOF or MSL1 in mouse ES cells causes a specific decrease in PolII-S5p and in expression of a subset of X-linked genes. Analyses of RNA half-life datasets show increased stability of mammalian X-linked transcripts. Both ancestral X-linked genes, defined as those conserved on chicken autosomes, and newly acquired X-linked genes are upregulated by similar mechanisms but to a different extent, suggesting that subsets of genes are distinctly regulated dependent on their evolutionary history.
PMCID: PMC3662796  PMID: 23523075
11.  A short form of the Speech, Spatial and Qualities of Hearing scale suitable for clinical use: The SSQ12 
International journal of audiology  2013;52(6):10.3109/14992027.2013.781278.
To develop and evaluate a 12-item version of the Speech, Spatial and Qualities of Hearing Scale for use in clinical research and rehabilitation settings, and provide a formula for converting scores between the full (SSQ49) and abbreviated (SSQ12) versions.
Items were selected independently at the three centres (Eriksholm, MRC Institute of Hearing Research, University of New England) to be representative of the complete scale. A consensus was achieved after discussion.
Study Sample
The data set (n=1220) used for a factor analysis (Akeroyd et al., submitted) was re-analysed to compare original SSQ scores (SSQ49) with scores on the short version (SSQ12).
A scatter-plot of SSQ12 scores against SSQ49 scores showed that SSQ12 score was about 0.6 of a scale point lower than the SSQ49 (0-10 scale) in the re-analysis of the Akeroyd et al. data. SSQ12 scores lay on a slightly steeper slope than scores on the SSQ49.
The SSQ12 provides similar results to SSQ49 in a large clinical research sample. The slightly lower average SSQ12 score and the slightly steeper slope reflect the composition of this short form relative to the SSQ49.
PMCID: PMC3864780  PMID: 23651462
Speech; Spatial and Qualities of Hearing scale; short version; clinical use; SSQ
12.  A genome-wide 3C-method for characterizing the three-dimensional architectures of genomes 
Methods (San Diego, Calif.)  2012;58(3):277-288.
Accumulating evidence demonstrates that the three-dimensional (3D) organization of chromosomes within the eukaryotic nucleus reflects and influences genomic activities, including transcription, DNA replication, recombination and DNA repair. In order to uncover structure-function relationships, it is necessary first to understand the principles underlying the folding and the 3D arrangement of chromosomes. Chromosome conformation capture (3C) provides a powerful tool for detecting interactions within and between chromosomes. A high throughput derivative of 3C, chromosome conformation capture on chip (4C), executes a genome-wide interrogation of interaction partners for a given locus. We recently developed a new method, a derivative of 3C and 4C, which, similar to Hi-C, is capable of comprehensively identifying long-range chromosome interactions throughout a genome in an unbiased fashion. Hence, our method can be applied to decipher the 3D architectures of genomes. Here, we provide a detailed protocol for this method.
PMCID: PMC3477625  PMID: 22776363
Chromatin; chromosome; chromosome conformation capture (3C); chromosome conformation capture on chip (4C); genome architecture; three-dimensional (3D) organization
13.  Learning score function parameters for improved spectrum identification in tandem mass spectrometry experiments 
Journal of proteome research  2012;11(9):4499-4508.
The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives 10–40% improvement in the number of distinct peptide identifications relative to a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as similarity score between experimental spectra and peptide sequences.
PMCID: PMC3436966  PMID: 22866926
14.  Genomic Interaction Profiles in Breast Cancer Reveal Altered Chromatin Architecture 
PLoS ONE  2013;8(9):e73974.
Gene transcription can be regulated by remote enhancer regions through chromosome looping either in cis or in trans. Cancer cells are characterized by wholesale changes in long-range gene interactions, but the role that these long-range interactions play in cancer progression and metastasis is not well understood. In this study, we used IGFBP3, a gene involved in breast cancer pathogenesis, as bait in a 4C-seq experiment comparing normal breast cells (HMEC) with two breast cancer cell lines (MCF7, an ER positive cell line, and MDA-MB-231, a triple negative cell line). The IGFBP3 long-range interaction profile was substantially altered in breast cancer. Many interactions seen in normal breast cells are lost and novel interactions appear in cancer lines. We found that in HMEC, the breast carcinoma amplified sequence gene family (BCAS) 1–4 were among the top 10 most significantly enriched regions of interaction with IGFBP3. 3D-FISH analysis indicated that the translocation-prone BCAS genes, which are located on chromosomes 1, 17, and 20, are in close physical proximity with IGFBP3 and each other in normal breast cells. We also found that epidermal growth factor receptor (EGFR), a gene implicated in tumorigenesis, interacts significantly with IGFBP3 and that this interaction may play a role in their regulation. Breakpoint analysis suggests that when an IGFBP3 interacting region undergoes a translocation an additional interaction detectable by 4C is gained. Overall, our data from multiple lines of evidence suggest an important role for long-range chromosomal interactions in the pathogenesis of cancer.
PMCID: PMC3760796  PMID: 24019942
15.  Analysis of Secondary Structure in Proteins by Chemical Cross-Linking Coupled to Mass Spectrometry 
Proteomics  2012;12(17):2746-2752.
Chemical cross-linking is an attractive technique for the study of the structure of protein complexes due to its low sample consumption and short analysis time. Furthermore, distance constraints obtained from the identification of cross-linked peptides by mass spectrometry can be used to construct and validate protein models. If a sufficient number of distance constraints are obtained, then determining the secondary structure of a protein can allow inference of the protein’s fold. In this work, we show how the distance constraints obtained from cross-linking experiments can identify secondary structures within the protein sequence. Molecular modeling of alpha helices and beta sheets indicates cross-linking patterns based on the topological distances between reactive residues. DSS cross-linking experiments with model alpha-helix-containing proteins corroborated the molecular modeling predictions. The patterns established here can be extended to other cross-linkers with known spacing lengths.
PMCID: PMC3655428  PMID: 22778071
16.  A statistical approach to peptide identification from clustered tandem mass spectrometry data 
Tandem mass spectrometry experiments generate from thousands to millions of spectra. These spectra can be used to identify the presence of proteins in biological samples. In this work, we propose a new method to identify peptides, substrings of proteins, based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method uses all available information to score all the spectra in a cluster against candidate peptides using Bayesian model selection. We illustrate the performance of our method by applying it to seven-standard-protein mixture data.
PMCID: PMC3698614  PMID: 23828149
Bayesian analysis; Bioinformatics; Clustered tandem mass spectra; False discovery rate; Peptide identification; Proteomics
17.  Faster Mass Spectrometry-based Protein Inference: Junction Trees are More Efficient than Sampling and Marginalization by Enumeration 
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The Python code used for this paper is available at
PMCID: PMC3389307  PMID: 22331862
Mass spectrometry; protein identification; graphical models; Bayesian inference
18.  Self-assessed hearing abilities in middle- and older-age adults: A stratified sampling approach 
For evaluation of audiological service outcomes, the primary objective was to determine baseline and target profiles on the Speech, Spatial and Qualities of Hearing scale (SSQ); a secondary objective was to test a short form of the SSQ; opportunity was also taken to compare responses of samples providing consistent versus inconsistent self-assessments.
2×2×2 factorial design crossed age, reported presence versus absence of hearing difficulty, and low versus high self-rated hearing ability.
Study Sample
Eight samples (total n=413), representing two age ranges; a response of “yes” or “no” to a question about having hearing difficulty, and either low or high self-rated hearing ability on six items from the SSQ.
Using present and previous results, baseline SSQ profiles were determined indicating the pattern of response likely to be observed prior to clinical intervention, and both an achieved outcome and “ideal” target outcome from such intervention. The six-item SSQ yielded better test-retest results in consistent versus inconsistent samples. The inconsistent samples showed signs of different interpretations of “hearing difficulty”.
Baseline and both actual and ideal target outcomes can guide comparative appraisal of clinical achievements; more research is needed to determine a robust short form of the SSQ.
PMCID: PMC3635014  PMID: 22115161
19.  Epigenetic priors for identifying active transcription factor binding sites 
Bioinformatics  2011;28(1):56-62.
Motivation: Accurate knowledge of the genome-wide binding of transcription factors in a particular cell type or under a particular condition is necessary for understanding transcriptional regulation. Using epigenetic data such as histone modification and DNase I accessibility data has been shown to improve motif-based in silico methods for predicting such binding, but this approach has not yet been fully explored.
Results: We describe a probabilistic method for combining one or more tracks of epigenetic data with a standard DNA sequence motif model to improve our ability to identify active transcription factor binding sites (TFBSs). We convert each data type into a position-specific probabilistic prior and combine these priors with a traditional probabilistic motif model to compute a log-posterior odds score. Our experiments, using histone modifications H3K4me1, H3K4me3, H3K9ac and H3K27ac, as well as DNase I sensitivity, show conclusively that the log-posterior odds score consistently outperforms a simple binary filter based on the same data. We also show that our approach performs competitively with a more complex method, CENTIPEDE, and suggest that the relative simplicity of the log-posterior odds scoring method makes it an appealing and very general method for identifying functional TFBSs on the basis of DNA and epigenetic evidence.
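The log-posterior odds score described above can be sketched as a log-likelihood ratio plus a log prior odds term. The sketch below is an illustration under stated assumptions and is not FIMO's code: it assumes a position weight matrix with independent columns, a zero-order background model, base-2 logarithms, and a single prior value for the candidate site; all names and numbers are hypothetical.

```python
import math


def log_posterior_odds(site, pwm, background, prior):
    """Score a candidate binding site with a position-specific prior.

    site:       DNA string with the same length as the motif.
    pwm:        list of dicts, one per motif column, base -> probability.
    background: dict, base -> background probability.
    prior:      prior probability (0 < prior < 1) that a site starts here,
                e.g. derived from an epigenetic data track.
    Returns log2 of [P(site | motif) * prior] / [P(site | bg) * (1 - prior)].
    """
    if len(site) != len(pwm):
        raise ValueError("site and motif must have the same length")
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_likelihood_ratio = sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site, pwm)
    )
    return log_likelihood_ratio + math.log2(prior / (1.0 - prior))


# Hypothetical two-column motif preferring "AG", uniform background.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score_open = log_posterior_odds("AG", toy_pwm, uniform, prior=0.2)
score_closed = log_posterior_odds("AG", toy_pwm, uniform, prior=0.001)
```

Unlike a binary filter, which discards every site below a cutoff on the epigenetic signal, the prior shifts the score continuously, so a strong motif match in a weakly supported region can still rank above a weak match in a strongly supported one.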
Availability and implementation: FIMO, part of the MEME Suite software toolkit, now supports log-posterior odds scoring using position-specific priors for motif search. A web server and source code are available at Utilities for creating priors are at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3244768  PMID: 22072382
20.  Integrative annotation of chromatin elements from ENCODE data 
Nucleic Acids Research  2012;41(2):827-841.
The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotations of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.
PMCID: PMC3553955  PMID: 23221638
21.  Estimating relative abundances of proteins from shotgun proteomics data 
BMC Bioinformatics  2012;13:308.
Spectral counting methods provide an easy means of identifying proteins with differing abundances between complex mixtures using shotgun proteomics data. The crux spectral-counts command, part of the Crux software toolkit, implements four previously reported spectral counting methods: the spectral index (SIN), the exponentially modified protein abundance index (emPAI), the normalized spectral abundance factor (NSAF), and the distributed normalized spectral abundance factor (dNSAF).
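Of the four metrics listed above, NSAF has the simplest definition: each protein's spectral count is divided by its length, and the result is normalized so that the values sum to one across the sample. The sketch below illustrates that definition only; it is not the Crux implementation, and the function name, protein identifiers, and numbers are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor per protein.

    spectral_counts: dict, protein id -> number of matched spectra.
    lengths:         dict, protein id -> protein length in residues.
    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)
    """
    # Spectral abundance factor: counts corrected for protein length,
    # since longer proteins yield more peptides and hence more spectra.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectra were assigned to any protein")
    return {p: value / total for p, value in saf.items()}


# Two hypothetical proteins with equal counts but different lengths:
# the shorter protein receives the larger abundance estimate.
toy_nsaf = nsaf({"P1": 10, "P2": 10}, {"P1": 100, "P2": 200})
```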
We compared the reproducibility and the linearity relative to each protein’s abundance of the four spectral counting metrics. Our analysis suggests that NSAF yields the most reproducible counts across technical and biological replicates, and both SIN and NSAF achieve the best linearity.
With the crux spectral-counts command, Crux provides open-source modular methods to analyze mass spectrometry data for identifying and now quantifying peptides and proteins. The C++ source code, compiled binaries, spectra and sequence databases are available at
PMCID: PMC3599300  PMID: 23164367
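As an illustration of one of the metrics named in the abstract above, the sketch below computes NSAF using its commonly cited definition: each protein's spectral count is divided by its length, and the result is normalized by the sum of those ratios over all proteins. This is not the Crux implementation; the function name `nsaf` and the dictionary inputs are assumptions made for this example.

```python
def nsaf(spectral_counts, lengths):
    """Return the normalized spectral abundance factor for each protein.

    spectral_counts: dict mapping protein ID -> number of matched spectra
    lengths: dict mapping protein ID -> protein length in residues
    (Both input shapes are hypothetical, chosen for illustration.)
    """
    # Spectral abundance factor: spectral count scaled by protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        # No spectra matched at all; report zero abundance everywhere.
        return {p: 0.0 for p in saf}
    # Normalize so the values sum to 1 across the sample.
    return {p: value / total for p, value in saf.items()}


example = nsaf({"A": 10, "B": 20}, {"A": 100, "B": 400})
```

Because the values are normalized within a sample, they are relative abundances and sum to 1, which is what makes them comparable across replicates.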
22.  A cross-validation scheme for machine learning algorithms in shotgun proteomics 
BMC Bioinformatics  2012;13(Suppl 16):S3.
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
PMCID: PMC3489528  PMID: 23176259
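The target-decoy error estimate described in the abstract above can be sketched in a few lines. This is a minimal illustration assuming one score per spectrum match and a simple ratio estimator (decoy matches at or above a threshold divided by target matches at or above it); it is not the cross-validation scheme proposed in the paper, and the function name `estimate_fdr` is hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate among target matches scoring
    at or above `threshold`, using decoy matches as a proxy for
    incorrect target matches."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        # Nothing accepted, so there are no false discoveries to count.
        return 0.0
    # Cap at 1.0, since a rate above 100% is not meaningful.
    return min(1.0, n_decoy / n_target)


fdr = estimate_fdr([5.0, 4.0, 3.0, 2.0, 1.0], [3.0, 1.0, 0.5], threshold=2.0)
```

When a learning algorithm is trained on the same target and decoy matches it is later scored on, this estimate can become optimistic, which is the overfitting risk that motivates validating on held-out data.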
23.  Unsupervised pattern discovery in human chromatin structure through genomic segmentation 
Nature methods  2012;9(5):473-476.
We applied a dynamic Bayesian network method that identifies joint patterns from multiple functional genomics experiments to ChIP-seq histone modification and transcription factor data, and DNaseI-seq and FAIRE-seq open chromatin readouts from the human cell line K562. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. Software and genome browser tracks are at
PMCID: PMC3340533  PMID: 22426492
24.  Faster SEQUEST Searching for Peptide Identification from Tandem Mass Spectra 
Journal of proteome research  2011;10(9):3871-3879.
Computational analysis of mass spectra remains the bottleneck in many proteomics experiments. SEQUEST was one of the earliest software packages to identify peptides from mass spectra by searching a database of known peptides. Though still popular, SEQUEST performs slowly. Crux and TurboSEQUEST have successfully sped up SEQUEST by adding a precomputed index to the search, but the demand for ever-faster peptide identification software continues to grow. Tide, introduced here, is a software program that implements the SEQUEST algorithm for peptide identification and that achieves a dramatic speedup over Crux and SEQUEST. The optimization strategies detailed here employ a combination of algorithmic and software engineering techniques to achieve speeds up to 170 times faster than a recent version of SEQUEST that uses indexing. For example, on a single Xeon CPU, Tide searches 10,000 spectra against a tryptic database of 27,499 C. elegans proteins at a rate of 1,550 spectra per second, which compares favorably with a rate of 8.8 spectra per second for a recent version of SEQUEST with index running on the same hardware.
PMCID: PMC3166376  PMID: 21761931
shotgun proteomics; peptide identification
25.  A review of statistical methods for protein identification using tandem mass spectrometry 
Tandem mass spectrometry has emerged as a powerful tool for the characterization of complex protein samples, an increasingly important problem in biology. The effort to efficiently and accurately perform inference on data from tandem mass spectrometry experiments has resulted in several statistical methods. We use a common framework to describe the predominant methods and discuss them in detail. These methods are classified using the following categories: set cover methods, iterative methods, and Bayesian methods. For each method, we analyze and evaluate the outcome and methodology of published comparisons to other methods; we use this comparison to comment on the qualities and weaknesses, as well as the overall utility, of all methods. We discuss the similarities between these methods and suggest directions for the field that would help unify these similar assumptions in a more rigorous manner and help enable efficient and reliable protein inference.
PMCID: PMC3402235  PMID: 22833779
Mass spectrometry; Proteomics; Bayesian methods
