1.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements 
Nucleic Acids Research  2001;29(14):2994-3005.
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 ± 0.005 to 0.895 ± 0.003. This does not include the benefits from four modifications we included in the ‘baseline’ version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence’s amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
PMCID: PMC55814  PMID: 11452024
2.  Cloning the human and mouse MMS19 genes and functional complementation of a yeast mms19 deletion mutant 
Nucleic Acids Research  2001;29(9):1884-1891.
The MMS19 gene of the yeast Saccharomyces cerevisiae encodes a polypeptide of unknown function which is required for both nucleotide excision repair (NER) and RNA polymerase II (RNAP II) transcription. Here we report the molecular cloning of human and mouse orthologs of the yeast MMS19 gene. Both human and Drosophila MMS19 cDNAs correct thermosensitive growth and sensitivity to killing by UV radiation in a yeast mutant deleted for the MMS19 gene, indicating functional conservation between the yeast and mammalian gene products. Alignment of the translated sequences of MMS19 from multiple eukaryotes, including mouse and human, revealed the presence of several conserved regions, including a HEAT repeat domain near the C-terminus. The presence of HEAT repeats, coupled with functional complementation of yeast mutant phenotypes by the orthologous protein from higher eukaryotes, suggests a role of Mms19 protein in the assembly of a multiprotein complex(es) required for NER and RNAP II transcription. Both the mouse and human genes are ubiquitously expressed as multiple transcripts, some of which appear to derive from alternative splicing. The ratio of different transcripts varies in several different tissue types.
PMCID: PMC37259  PMID: 11328871
3.  The COG database: new developments in phylogenetic classification of proteins from complete genomes 
Nucleic Acids Research  2001;29(1):22-28.
The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae. In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis.
PMCID: PMC29819  PMID: 11125040

