PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (84)
 

Clipboard (0)
None

Select a Filter Below

Journals
more »
Year of Publication
more »
1.  Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources 
F1000Research  2016;5:ELIXIR-160.
Data from open access biomolecular data resources, such as the European Nucleotide Archive and the Protein Data Bank are extensively reused within life science research for comparative studies, method development and to derive new scientific insights. Indicators that estimate the extent and utility of such secondary use of research data need to reflect this complex and highly variable data usage. By linking open access scientific literature, via Europe PubMedCentral, to the metadata in biological data resources we separate data citations associated with a deposition statement from citations that capture the subsequent, long-term, reuse of data in academia and industry.  We extend this analysis to begin to investigate citations of biomolecular resources in patent documents. We find citations in more than 8,000 patents from 2014, demonstrating substantial use and an important role for data resources in defining biological concepts in granted patents to both academic and industrial innovators. Combined together our results indicate that the citation patterns in biomedical literature and patents vary, not only due to citation practice but also according to the data resource cited. The results guard against the use of simple metrics such as citation counts and show that indicators of data use must not only take into account citations within the biomedical literature but also include reuse of data in industry and other parts of society by including patents and other scientific and technical documents such as guidelines, reports and grant applications.
doi:10.12688/f1000research.7911.1
PMCID: PMC4821287  PMID: 27092246
Data citations; Data reuse; Data repositories; Data archiving; Open data; Bibliometrics; Patent analysis; Research impact
2.  The Pfam protein families database: towards a more sustainable future 
Nucleic Acids Research  2015;44(Database issue):D279-D285.
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.
doi:10.1093/nar/gkv1344
PMCID: PMC4702930  PMID: 26673716
3.  Creating a specialist protein resource network: a meeting report for the protein bioinformatics and community resources retreat 
During 11–12 August 2014, a Protein Bioinformatics and Community Resources Retreat was held at the Wellcome Trust Genome Campus in Hinxton, UK. This meeting brought together the principal investigators of several specialized protein resources (such as CAZy, TCDB and MEROPS) as well as those from protein databases from the large Bioinformatics centres (including UniProt and RefSeq). The retreat was divided into five sessions: (1) key challenges, (2) the databases represented, (3) best practices for maintenance and curation, (4) information flow to and from large data centers and (5) communication and funding. An important outcome of this meeting was the creation of a Specialist Protein Resource Network that we believe will improve coordination of the activities of its member resources. We invite further protein database resources to join the network and continue the dialogue.
doi:10.1093/database/bav063
PMCID: PMC4499208
4.  Key challenges for the creation and maintenance of specialist protein resources 
Proteins  2015;83(6):1005-1013.
As the volume of data relating to proteins increases, researchers rely more and more on the analysis of published data, thus increasing the importance of good access to these data that varies from the supplemental material of individual papers, all the way to major reference databases with professional staff and long-term funding. Specialist protein resources fill an important middle ground, providing interactive web interfaces to their databases for a focused topic or family of proteins, using specialised approaches that are not feasible in the major reference databases. Many are labours of love, run by a single lab with little or no dedicated funding and there are many challenges to building and maintaining them. This perspective arose from a meeting of several specialist protein resources and major reference databases held at the Wellcome Trust Genome Campus (Cambridge, UK) on the 11th and 12th of August 2014. During this meeting some common key challenges involved in creating and maintaining such resources were discussed, along with various approaches to address them. In laying out these challenges, we aim to inform users about how these issues impact our resources and illustrate ways in which our working together could enhance their accuracy, currency, and overall value.
doi:10.1002/prot.24803
PMCID: PMC4446195  PMID: 25820941
Specialist Protein Resource; Key Challenges; Biocuration; Longevity; Big data; Mis-annotation
5.  HMMER web server: 2015 update 
Nucleic Acids Research  2015;43(Web Server issue):W30-W38.
The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.
doi:10.1093/nar/gkv397
PMCID: PMC4489315  PMID: 25943547
6.  Domain atrophy creates rare cases of functional partial protein domains 
Genome Biology  2015;16(1):88.
Background
Protein domains display a range of structural diversity, with numerous additions and deletions of secondary structural elements between related domains. We have observed a small number of cases of surprising large-scale deletions of core elements of structural domains. We propose a new concept called domain atrophy, where protein domains lose a significant number of core structural elements.
Results
Here, we implement a new pipeline to systematically identify new cases of domain atrophy across all known protein sequences. The output of this pipeline was carefully checked by hand, which filtered out partial domain instances that were unlikely to represent true domain atrophy due to misannotations or un-annotated sequence fragments. We identify 75 cases of domain atrophy, of which eight cases are found in a three-dimensional protein structure and 67 cases have been inferred based on mapping to a known homologous structure. Domains with structural variations include ancient folds such as the TIM-barrel and Rossmann folds. Most of these domains are observed to show structural loss that does not affect their functional sites.
Conclusion
Our analysis has significantly increased the known cases of domain atrophy. We discuss specific instances of domain atrophy and see that there has often been a compensatory mechanism that helps to maintain the stability of the partial domain. Our study indicates that although domain atrophy is an extremely rare phenomenon, protein domains under certain circumstances can tolerate extreme mutations giving rise to partial, but functional, domains.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0655-8) contains supplementary material, which is available to authorized users.
doi:10.1186/s13059-015-0655-8
PMCID: PMC4432964  PMID: 25924720
7.  The InterPro protein families database: the classification resource after 15 years 
Nucleic Acids Research  2014;43(Database issue):D213-D221.
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
doi:10.1093/nar/gku1243
PMCID: PMC4383996  PMID: 25428371
8.  Rfam 12.0: updates to the RNA families database 
Nucleic Acids Research  2014;43(Database issue):D130-D137.
The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.
doi:10.1093/nar/gku1063
PMCID: PMC4383904  PMID: 25392425
9.  Structure and computational analysis of a novel protein with metallopeptidase-like and circularly permuted winged-helix-turn-helix domains reveals a possible role in modified polysaccharide biosynthesis 
BMC Bioinformatics  2014;15:75.
Background
CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein sequence space.
Results
The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The structure revealed two extra domains: an α+β domain inserted in the metallopeptidase-like domain and a C-terminal circularly permuted winged-helix-turn-helix domain.
Conclusions
Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide the groundwork for the experimental verification of the function.
doi:10.1186/1471-2105-15-75
PMCID: PMC4000134  PMID: 24646163
CA_C2195; Peptidase; DUF4910; DUF2172; HTH_47; Structural genomics
11.  iPfam: a database of protein family and domain interactions found in the Protein Data Bank 
Nucleic Acids Research  2013;42(Database issue):D364-D373.
The database iPfam, available at http://ipfam.org, catalogues Pfam domain interactions based on known 3D structures that are found in the Protein Data Bank, providing interaction data at the molecular level. Previously, the iPfam domain–domain interaction data was integrated within the Pfam database and website, but it has now been migrated to a separate database. This allows for independent development, improving data access and giving clearer separation between the protein family and interactions datasets. In addition to domain–domain interactions, iPfam has been expanded to include interaction data for domain bound small molecule ligands. Functional annotations are provided from source databases, supplemented by the incorporation of Wikipedia articles where available. iPfam (version 1.0) contains >9500 domain–domain and 15 500 domain–ligand interactions. The new website provides access to this data in a variety of ways, including interactive visualizations of the interaction data.
doi:10.1093/nar/gkt1210
PMCID: PMC3965099  PMID: 24297255
12.  Pfam: the protein families database 
Nucleic Acids Research  2013;42(Database issue):D222-D230.
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
doi:10.1093/nar/gkt1223
PMCID: PMC3965110  PMID: 24288371
13.  LUD, a new protein domain associated with lactate utilization 
BMC Bioinformatics  2013;14:341.
Background
A novel highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by a highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence, structure, and recent experimental evidence reported by other groups, we hereby redefine DUF162 as the LUD domain family.
Results
JCSG solved the first crystal structure [PDB:2G40] from the LUD domain family: LutC protein, encoded by ORF DR_1909, of Deinococcus radiodurans. LutC shares features with domains in the functionally diverse ISOCOT superfamily. We have observed that the LUD domain has an increased abundance in the human gut microbiome.
Conclusions
We propose a model for the substrate and cofactor binding and regulation in LUD domain. The significance of LUD-containing proteins in the human gut microbiome, and the implication of lactate metabolism in the radiation-resistance of Deinococcus radiodurans are discussed.
doi:10.1186/1471-2105-14-341
PMCID: PMC3924224  PMID: 24274019
LUD; DUF162; LutB; LutC; Domain of unknown function; Deinococcus radiodurans
14.  Filling out the structural map of the NTF2-like superfamily 
BMC Bioinformatics  2013;14:327.
Background
The NTF2-like superfamily is a versatile group of protein domains sharing a common fold. The sequences of these domains are very diverse and they share no common sequence motif. These domains serve a range of different functions within the proteins in which they are found, including both catalytic and non-catalytic versions. Clues to the function of protein domains belonging to such a diverse superfamily can be gleaned from analysis of the proteins and organisms in which they are found.
Results
Here we describe three protein domains of unknown function found mainly in bacteria: DUF3828, DUF3887 and DUF4878. Structures of representatives of each of these domains: BT_3511 from Bacteroides thetaiotaomicron (strain VPI-5482) [PDB:3KZT], Cj0202c from Campylobacter jejuni subsp. jejuni serotype O:2 (strain NCTC 11168) [PDB:3K7C], rumgna_01855) and RUMGNA_01855 from Ruminococcus gnavus (strain ATCC 29149) [PDB:4HYZ] have been solved by X-ray crystallography. All three domains are similar in structure and all belong to the NTF2-like superfamily. Although the function of these domains remains unknown at present, our analysis enables us to present a hypothesis concerning their role.
Conclusions
Our analysis of these three protein domains suggests a potential non-catalytic ligand-binding role. This may regulate the activities of domains with which they are combined in the same polypeptide or via operonic linkages, such as signaling domains (e.g. serine/threonine protein kinase), peptidoglycan-processing hydrolases (e.g. NlpC/P60 peptidases) or nucleic acid binding domains (e.g. Zn-ribbons).
doi:10.1186/1471-2105-14-327
PMCID: PMC3924330  PMID: 24246060
NTF2-like superfamily; Protein function prediction; Protein structure; Ligand-binding; JCSG; 3D structure; Protein family
15.  TreeFam v9: a new website, more species and orthology-on-the-fly 
Nucleic Acids Research  2013;42(Database issue):D922-D925.
TreeFam (http://www.treefam.org) is a database of phylogenetic trees inferred from animal genomes. For every TreeFam family we provide homology predictions together with the evolutionary history of the genes. Here we describe an update of the TreeFam database. The TreeFam project was resurrected in 2012 and has seen two releases since. The latest release (TreeFam 9) was made available in March 2013. It has orthology predictions and gene trees for 109 species in 15 736 families covering ∼2.2 million sequences. With release 9 we made modifications to our production pipeline and redesigned our website with improved gene tree visualizations and Wikipedia integration. Furthermore, we now provide an HMM-based sequence search that places a user-provided protein sequence into a TreeFam gene tree and provides quick orthology prediction. The tool uses Mafft and RAxML for the fast insertion into a reference alignment and tree, respectively. Besides the aforementioned technical improvements, we present a new approach to visualize gene trees and alternative displays that focuses on showing homology information from a species tree point of view. From release 9 onwards, TreeFam is now hosted at the EBI.
doi:10.1093/nar/gkt1055
PMCID: PMC3965059  PMID: 24194607
16.  MEROPS: the database of proteolytic enzymes, their substrates and inhibitors 
Nucleic Acids Research  2013;42(Database issue):D503-D509.
Peptidases, their substrates and inhibitors are of great relevance to biology, medicine and biotechnology. The MEROPS database (http://merops.sanger.ac.uk) aims to fulfill the need for an integrated source of information about these. The database has hierarchical classifications in which homologous sets of peptidases and protein inhibitors are grouped into protein species, which are grouped into families, which are in turn grouped into clans. Recent developments include the following. A community annotation project has been instigated in which acknowledged experts are invited to contribute summaries for peptidases. Software has been written to provide an Internet-based data entry form. Contributors are acknowledged on the relevant web page. A new display showing the intron/exon structures of eukaryote peptidase genes and the phasing of the junctions has been implemented. It is now possible to filter the list of peptidases from a completely sequenced bacterial genome for a particular strain of the organism. The MEROPS filing pipeline has been altered to circumvent the restrictions imposed on non-interactive blastp searches, and a HMMER search using specially generated alignments to maximize the distribution of organisms returned in the search results has been added.
doi:10.1093/nar/gkt953
PMCID: PMC3964991  PMID: 24157837
18.  Two Pfam protein families characterized by a crystal structure of protein lpg2210 from Legionella pneumophila 
BMC Bioinformatics  2013;14:265.
Background
Every genome contains a large number of uncharacterized proteins that may encode entirely novel biological systems. Many of these uncharacterized proteins fall into related sequence families. By applying sequence and structural analysis we hope to provide insight into novel biology.
Results
We analyze a previously uncharacterized Pfam protein family called DUF4424 [Pfam:PF14415]. The recently solved three-dimensional structure of the protein lpg2210 from Legionella pneumophila provides the first structural information pertaining to this family. This protein additionally includes the first representative structure of another Pfam family called the YARHG domain [Pfam:PF13308]. The Pfam family DUF4424 adopts a 19-stranded beta-sandwich fold that shows similarity to the N-terminal domain of leukotriene A-4 hydrolase. The YARHG domain forms an all-helical domain at the C-terminus. Structure analysis allows us to recognize distant similarities between the DUF4424 domain and individual domains of M1 aminopeptidases and tricorn proteases, which form massive proteasome-like capsids in both archaea and bacteria.
Conclusions
Based on our analyses we hypothesize that the DUF4424 domain may have a role in forming large, multi-component enzyme complexes. We suggest that the YARGH domain may play a role in binding a moiety in proximity with peptidoglycan, such as a hydrophobic outer membrane lipid or lipopolysaccharide.
doi:10.1186/1471-2105-14-265
PMCID: PMC3848476  PMID: 24004689
Domain of unknown function; Protein family; Protein structure; DUF4424; YARHG domain; Sequence analysis
19.  The COMBREX Project: Design, Methodology, and Initial Results 
Anton, Brian P. | Chang, Yi-Chien | Brown, Peter | Choi, Han-Pil | Faller, Lina L. | Guleria, Jyotsna | Hu, Zhenjun | Klitgord, Niels | Levy-Moonshine, Ami | Maksad, Almaz | Mazumdar, Varun | McGettrick, Mark | Osmani, Lais | Pokrzywa, Revonda | Rachlin, John | Swaminathan, Rajeswari | Allen, Benjamin | Housman, Genevieve | Monahan, Caitlin | Rochussen, Krista | Tao, Kevin | Bhagwat, Ashok S. | Brenner, Steven E. | Columbus, Linda | de Crécy-Lagard, Valérie | Ferguson, Donald | Fomenkov, Alexey | Gadda, Giovanni | Morgan, Richard D. | Osterman, Andrei L. | Rodionov, Dmitry A. | Rodionova, Irina A. | Rudd, Kenneth E. | Söll, Dieter | Spain, James | Xu, Shuang-yong | Bateman, Alex | Blumenthal, Robert M. | Bollinger, J. Martin | Chang, Woo-Suk | Ferrer, Manuel | Friedberg, Iddo | Galperin, Michael Y. | Gobeill, Julien | Haft, Daniel | Hunt, John | Karp, Peter | Klimke, William | Krebs, Carsten | Macelis, Dana | Madupu, Ramana | Martin, Maria J. | Miller, Jeffrey H. | O'Donovan, Claire | Palsson, Bernhard | Ruch, Patrick | Setterdahl, Aaron | Sutton, Granger | Tate, John | Yakunin, Alexander | Tchigvintsev, Dmitri | Plata, Germán | Hu, Jie | Greiner, Russell | Horn, David | Sjölander, Kimmen | Salzberg, Steven L. | Vitkup, Dennis | Letovsky, Stanley | Segrè, Daniel | DeLisi, Charles | Roberts, Richard J. | Steffen, Martin | Kasif, Simon
PLoS Biology  2013;11(8):e1001638.
Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
doi:10.1371/journal.pbio.1001638
PMCID: PMC3754883  PMID: 24013487
21.  The challenge of increasing Pfam coverage of the human proteome 
It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ∼38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized.
Database URL: http://pfam.sanger.ac.uk/
doi:10.1093/database/bat023
PMCID: PMC3630804  PMID: 23603847
22.  Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions 
Nucleic Acids Research  2013;41(12):e121.
Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under convergent evolution. Here, we test the ability of the profile hidden Markov model method HMMER3 to correctly assign homologous sequences to >13 000 manually curated families from the Pfam database. We identify problem families using protein regions that match two or more Pfam families not currently annotated as related in Pfam. We find that HMMER3 E-value estimates seem to be less accurate for families that feature periodic patterns of compositional bias, such as the ones typically observed in coiled-coils. These results support the continued use of manually curated inclusion thresholds in the Pfam database, especially on the subset of families that have been identified as problematic in experiments such as these. They also highlight the need for developing new methods that can correct for this particular type of compositional bias.
doi:10.1093/nar/gkt263
PMCID: PMC3695513  PMID: 23598997
23.  A comparison of dense transposon insertion libraries in the Salmonella serovars Typhi and Typhimurium 
Nucleic Acids Research  2013;41(8):4549-4564.
Salmonella Typhi and Typhimurium diverged only ∼50 000 years ago, yet have very different host ranges and pathogenicity. Despite the availability of multiple whole-genome sequences, the genetic differences that have driven these changes in phenotype are only beginning to be understood. In this study, we use transposon-directed insertion-site sequencing to probe differences in gene requirements for competitive growth in rich media between these two closely related serovars. We identify a conserved core of 281 genes that are required for growth in both serovars, 228 of which are essential in Escherichia coli. We are able to identify active prophage elements through the requirement for their repressors. We also find distinct differences in requirements for genes involved in cell surface structure biogenesis and iron utilization. Finally, we demonstrate that transposon-directed insertion-site sequencing is not only applicable to the protein-coding content of the cell but also has sufficient resolution to generate hypotheses regarding the functions of non-coding RNAs (ncRNAs) as well. We are able to assign probable functions to a number of cis-regulatory ncRNA elements, as well as to infer likely differences in trans-acting ncRNA regulatory networks.
doi:10.1093/nar/gkt148
PMCID: PMC3632133  PMID: 23470992
24.  The SHOCT Domain: A Widespread Domain Under-Represented in Model Organisms 
PLoS ONE  2013;8(2):e57848.
We have identified a new protein domain, which we have named the SHOCT domain (Short C-terminal domain). This domain is widespread in bacteria with over a thousand examples. But we found it is missing from the most commonly studied model organisms, despite being present in closely related species. It's predominantly C-terminal location, co-occurrence with numerous other domains and short size is reminiscent of the Gram-positive anchor motif, however it is present in a much wider range of species. We suggest several hypotheses about the function of SHOCT, including oligomerisation and nucleic acid binding. Our initial experiments do not support its role as an oligomerisation domain.
doi:10.1371/journal.pone.0057848
PMCID: PMC3581485  PMID: 23451277
25.  Genome of Acanthamoeba castellanii highlights extensive lateral gene transfer and early evolution of tyrosine kinase signaling 
Genome Biology  2013;14(2):R11.
Background
The Amoebozoa constitute one of the primary divisions of eukaryotes, encompassing taxa of both biomedical and evolutionary importance, yet its genomic diversity remains largely unsampled. Here we present an analysis of a whole genome assembly of Acanthamoeba castellanii (Ac) the first representative from a solitary free-living amoebozoan.
Results
Ac encodes 15,455 compact intron-rich genes, a significant number of which are predicted to have arisen through inter-kingdom lateral gene transfer (LGT). A majority of the LGT candidates have undergone a substantial degree of intronization and Ac appears to have incorporated them into established transcriptional programs. Ac manifests a complex signaling and cell communication repertoire, including a complete tyrosine kinase signaling toolkit and a comparable diversity of predicted extracellular receptors to that found in the facultatively multicellular dictyostelids. An important environmental host of a diverse range of bacteria and viruses, Ac utilizes a diverse repertoire of predicted pattern recognition receptors, many with predicted orthologous functions in the innate immune systems of higher organisms.
Conclusions
Our analysis highlights the important role of LGT in the biology of Ac and in the diversification of microbial eukaryotes. The early evolution of a key signaling facility implicated in the evolution of metazoan multicellularity strongly argues for its emergence early in the Unikont lineage. Overall, the availability of an Ac genome should aid in deciphering the biology of the Amoebozoa and facilitate functional genomic studies in this important model organism and environmental host.
doi:10.1186/gb-2013-14-2-r11
PMCID: PMC4053784  PMID: 23375108

Results 1-25 (84)