Figure 1 shows the framework for T3DB construction, which involves four steps: (1) identification of T3SS-containing bacterial genera and species; (2) T3SS gene identification and categorization; (3) T3SS gene annotation; and (4) ortholog annotation.
Figure 1 Construction of T3DB and T3SS gene categorization. Construction of T3DB involved 4 steps. The annotation work and main achievements for each step are shown in (A). (B) lists the major T3SS gene categories annotated in T3DB.
First, a text-based literature search was used to obtain a comprehensive list of T3SS-related publications. ‘Type III Secretion System’, ‘Type 3 Secretion System’, ‘TTSS’ and ‘T3SS’ were each used as keywords to search the PubMed database. This search returned more than 3000 non-redundant publications. The abstract of each publication was scanned manually, and the bacterial genera and species were recorded and examined. Some bacteria may contain candidate T3SSs that have not yet been reported; rather than running comprehensive sequence alignments to find them, we included only those candidates supported by published reports in which the authors presented sequence alignments, genome localization, and phylogenetic evidence. This procedure generated a list of 26 bacterial genera from different classes and even different phyla (http://biocomputer.bio.cuhk.edu.hk/T3DB/browse). The phylogenetic relationships among these genera were annotated from Bergey’s Manual [21]. For each genus, the model species and strains with the most complete experimental data and molecular information were then selected. The genomes (chromosomes and plasmids) of most of the selected model strains have been sequenced, and the current release contains 35 model species. The host type (animal or plant) and interaction type (pathogenesis or symbiosis) were annotated for each species according to Bergey’s Manual [21].
All T3SS-related genes were then collected for each selected model strain. Because sequence similarity among genera is low, especially among distantly related genera, it is difficult to identify T3SS-related genes by sequence alignment alone using fixed parameters and reference gene sets. Sequence comparison was therefore combined with manual curation of the literature. We retrieved genes that have been reported to be related to functional T3SSs (e.g., being secreted or translocated through T3SSs, regulating putative T3SS genes, or assisting secretion of T3SS effectors), or that show sequence or functional conservation with T3SS genes in other bacterial species or genera. Because orthologs are much easier to identify among bacteria of the same genus, each genus name was used as a keyword in combination with ‘Type III Secretion System’, ‘Type 3 Secretion System’, ‘TTSS’, or ‘T3SS’ to search the PubMed database. Each hit was manually curated, and T3SS-related genes were collected together with their bacterial host strain, aliases, gene accession, and detailed function. The candidate gene sequences and their genomic coordinates were then tracked and compared, and T3SS orthologs in different species or strains were identified. A strain may contain more than one T3SS, and genes in different T3SSs of the same strain that share sequence, structure, function and genomic clustering features cannot be accurately defined as either paralogs or orthologs; we therefore introduced the term ‘T3 ortholog’ for this case. Specifically, all genes with these shared features among different T3SSs, in the same or different strains, were collectively termed ‘T3 orthologs’. A non-redundant T3 ortholog cluster set was obtained for each genus after clustering the within-genus and inter-T3SS T3 orthologs.
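The genus-by-genus literature search described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the text does not give the exact query strings, so the quoting and the `AND` operator are assumptions, and `build_queries` and `merge_hits` are hypothetical helper names. Retrieval from PubMed itself is left out; `merge_hits` only shows how the hits of the separate keyword searches could be reduced to a non-redundant list.

```python
# The four T3SS keywords named in the text.
T3SS_KEYWORDS = [
    "Type III Secretion System",
    "Type 3 Secretion System",
    "TTSS",
    "T3SS",
]


def build_queries(genus, keywords=T3SS_KEYWORDS):
    """Return one query string per keyword, each combined with the genus.

    The quoting and the AND operator are assumptions for illustration.
    """
    return ['"{}" AND "{}"'.format(genus, kw) for kw in keywords]


def merge_hits(hit_lists):
    """Merge PMID lists from separate searches into a non-redundant list,
    keeping the order in which each PMID was first seen."""
    seen = set()
    merged = []
    for hits in hit_lists:
        for pmid in hits:
            if pmid not in seen:
                seen.add(pmid)
                merged.append(pmid)
    return merged
```

Each merged hit would still need the manual curation described in the text; the sketch covers only the bookkeeping around the search.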
Each gene cluster in the genus-based non-redundant T3 ortholog cluster set was assigned a unique name of the form ‘XXX-YYY’, where ‘XXX’ denotes the genus and ‘YYY’ is the traditional name of that gene in most studied strains. The genus was included in the name so that users can easily tell which genus a candidate gene comes from. Even within a genus, orthologs in different strains may have different names; once a unique name had been set for each ortholog cluster in a genus, other names for the same gene were treated as aliases. For strains with more than one T3SS, gene names take the form ‘XXX-ZZZ-YYY’, where ‘XXX’ and ‘YYY’ denote the genus and gene name, respectively, and ‘ZZZ’ denotes the T3SS. Each T3SS was classified into one of five putative categories [22] according to phylogenetic analysis of the T3SS proteins conserved among all T3SS-containing bacteria.
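The two naming forms can be illustrated with a short sketch. It assumes that the genus code, T3SS name and gene name contain no hyphens themselves, which the text does not state; `cluster_name` and `parse_cluster_name` are hypothetical helpers, not part of T3DB.

```python
def cluster_name(genus_code, gene, t3ss=None):
    """Compose 'XXX-YYY', or 'XXX-ZZZ-YYY' when a T3SS name is given."""
    if t3ss is None:
        parts = [genus_code, gene]
    else:
        parts = [genus_code, t3ss, gene]
    return "-".join(parts)


def parse_cluster_name(name):
    """Split a cluster name into (genus_code, t3ss_or_None, gene).

    Assumes none of the components contains a hyphen.
    """
    parts = name.split("-")
    if len(parts) == 2:
        return parts[0], None, parts[1]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("unexpected cluster name: {!r}".format(name))
```

For example, `cluster_name("Sal", "SipB", "SPI1")` gives `Sal-SPI1-SipB`, and parsing a two-part name returns `None` for the T3SS component.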
After the within-genus T3 ortholog clusters were set up for the T3SSs in each genus, most genes in the clusters could be traced directly from the literature for most representative strains (Case I). For other strains, however, many T3SS-related genes were not well annotated (Case II), and the T3 ortholog clusters could not be traced directly from the literature by functional annotation. A comparative genomic analysis, combined with sequence alignment, was therefore conducted between representative bacterial strains. If a Case II gene in strain A and its two flanking genes were in the same order as a Case I gene and its flanking genes in strain B, and the two corresponding genes shared >90% amino acid identity, the Case II gene in strain A was considered a putative T3SS-related gene and was assigned to the corresponding T3 ortholog cluster. Because of evolutionary gene loss, pseudogenes, sequencing errors, and large sequence divergence, many strains still contain only a subset of the within-genus T3 ortholog genes after this annotation.
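The flanking-gene rule for Case II genes can be sketched as below. This is one possible reading of the rule under stated assumptions, not the authors' implementation: each strain is an ordered list of `(gene_id, cluster)` pairs, `cluster` is `None` for unannotated genes, the two flanking genes in strain A are assumed to be already assigned to clusters so that gene order can be compared, reversed gene order is not handled, and percent amino acid identity comes from a caller-supplied function (the alignment itself is outside the sketch). `assign_by_synteny` is a hypothetical name.

```python
def assign_by_synteny(strain_a, i, strain_b, identity, threshold=90.0):
    """Return the T3 ortholog cluster for the gene at strain_a[i], or None.

    strain_a, strain_b: ordered lists of (gene_id, cluster_or_None).
    identity: callable (gene_id_a, gene_id_b) -> percent amino acid identity.
    """
    # The query gene needs a neighbour on each side.
    if i < 1 or i + 1 >= len(strain_a):
        return None
    left = strain_a[i - 1][1]
    right = strain_a[i + 1][1]
    # Assumption: both flanking genes are already assigned to clusters.
    if left is None or right is None:
        return None
    query_id = strain_a[i][0]
    for j in range(1, len(strain_b) - 1):
        ref_id, ref_cluster = strain_b[j]
        if ref_cluster is None:
            continue  # only annotated (Case I) genes can serve as reference
        same_order = (strain_b[j - 1][1] == left
                      and strain_b[j + 1][1] == right)
        if same_order and identity(query_id, ref_id) > threshold:
            return ref_cluster
    return None
```

The function returns `None` whenever either condition fails, which matches the observation that many strains keep only a subset of the within-genus T3 ortholog genes after annotation.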
The T3SS genes annotated in T3DB are divided into 4 major categories (Figure 1B): apparatus (Category I), chaperone (Category II), effector (Category III), and transcription regulator (Category IV). Apparatus genes encode the proteins that assemble the needle-like structure, together with accessory proteins. Genes in this category are further sub-classified into functional clusters (Figure 1B). Chaperone genes encode proteins that assist the secretion of effector proteins through the T3SS conduit. Effector genes encode proteins that are specifically secreted through the T3SS conduit. Some effectors also function as structural proteins, such as the translocon proteins (e.g., Sal-SPI1-SipB and SipC); these were classified under ‘Translocon’ in Category I. T3SS transcription regulators were collected as an independent category. For the chaperone, effector and transcription regulator categories, at least one reference with experimental evidence was required to support the functional annotation. For apparatus genes (Category I), sequence similarity and genomic organization were used as evidence, and both conditions had to be met. In bacteria that contain multiple T3SSs, some effectors cannot be assigned to a specific T3SS; in such cases, the orthologous gene cluster is named with the ‘XXX-YYY’ system instead of ‘XXX-ZZZ-YYY’.
For gene annotation, orthologous genes in different strains of the same genus share the same gene name. To distinguish these genes among strains, a unique ID was assigned to each gene. The ID has the form T3X, where ‘X’ is one of four characters (‘A’: Apparatus; ‘C’: Chaperone; ‘E’: Effector; ‘R’: Regulator), followed by 11 digits representing, in order, the phylum (1 digit), class (1 digit), order (1 digit), family (1 digit), genus (2 digits), strain (2 digits), T3SS within the strain (1 digit), and individual gene (2 digits). When more than one T3SS is present in a single strain and the T3SS to which a gene belongs cannot be determined, the corresponding digit is replaced by the character ‘x’. For each gene, the genome type (chromosome or plasmid), genome ID and gene coordinates (if available), strand direction, nucleic acid and protein sequences, major function category, detailed function annotation, structure information, and reference PMIDs were all annotated.
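The ID layout lends itself to a small parser. The sketch below assumes that the fields are concatenated without separators in the order listed and that the placeholder is a lowercase ‘x’; neither detail is confirmed by the text beyond the description above. `parse_t3db_id` is a hypothetical helper, and the IDs in the usage note are made up for illustration.

```python
import re

# 'T3', one category letter, then 11 positions:
# phylum(1) class(1) order(1) family(1) genus(2) strain(2) T3SS(1) gene(2).
# The T3SS position may be 'x' when the T3SS cannot be determined.
_ID_PATTERN = re.compile(
    r"^T3(?P<category>[ACER])"
    r"(?P<phylum>[0-9])(?P<class_>[0-9])(?P<order>[0-9])(?P<family>[0-9])"
    r"(?P<genus>[0-9]{2})(?P<strain>[0-9]{2})"
    r"(?P<t3ss>[0-9x])(?P<gene>[0-9]{2})$"
)

CATEGORIES = {
    "A": "Apparatus",
    "C": "Chaperone",
    "E": "Effector",
    "R": "Regulator",
}


def parse_t3db_id(gene_id):
    """Split a T3DB-style gene ID into its named fields (all as strings)."""
    match = _ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError("not a T3DB-style ID: {!r}".format(gene_id))
    fields = match.groupdict()
    fields["category"] = CATEGORIES[fields["category"]]
    return fields
```

With the made-up ID `T3E12340501207`, the parser reports category `Effector`, genus `05`, strain `01`, T3SS `2` and gene `07`; for `T3E12340501x07` the T3SS field is `x`.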
In the last step, an inter-genus ‘T3 ortholog’ cluster was annotated for each gene. As defined above, the term ‘T3 orthologs’ was proposed because widespread horizontal gene transfer events have led to the loss or gain of T3SS clusters at different genomic loci. For T3SS proteins, sequence similarity among orthologs in different genera, especially distantly related genera, was very low. Therefore, in addition to significant sequence similarity (at least 30% amino acid identity), within-genome synteny was also considered. Two genes in two different T3SS clusters were also annotated as T3 orthologs if they and their respective two flanking genes had exactly the same gene order and they shared at least 10% amino acid identity. Genes that did not satisfy these criteria could still be considered ‘T3 orthologs’ if they shared high similarity in structure (structure orthologs) or function (function orthologs) based on experimental evidence. For structure orthologs, the authors of the original report must have observed and discussed the similarity between the two proteins or key domains, and experimental evidence must show that this structural similarity has a similar influence on important protein-protein interactions or protein function. For function orthologs, experimental evidence is required to show that the two proteins have similar functions. For example, Salmonella SptP and Yersinia YopE belong to the same ‘T3 ortholog’ cluster because both effectors show GTPase activity and can activate host cytoskeleton proteins [23]. Although they lack apparent sequence similarity, E. coli SopE and SifA, and Shigella IpgB2 all contain a ‘WxxxE’ motif and can mimic guanine nucleotide exchange factors; they were consequently annotated as ‘T3 orthologs’ [25