Uropathogenic
Escherichia coli (UPEC) strains cause 70-90% of the estimated 150 million community-acquired urinary tract infections (UTIs) that occur annually [
1]. The complete genome of the UPEC strain CFT073 (serotype O6:K2:H1) was sequenced in 2002 [GenBank:
AE014075.1] [
2]. It consists of a 5,231,428 bp chromosome with no plasmid and is 590,209 bp longer than that of the well-studied K-12 MG1655 strain. The additional sequence in the CFT073 genome is mostly accounted for by five unique cryptic prophage insertions that contain a large proportion of virulence or virulence-associated genes, referred to as pathogenicity islands (PAIs) [
3]. At the time of this writing, the release in RefSeq annotates 5,339 protein-coding genes, 89 tRNA genes, 21 rRNA genes, and 6 miscellaneous RNA genes [
2]. In-depth analysis reveals that 3,190 genes (3,925,047 bp, 75.0%) are conserved backbone genes, while the rest (1,306,391 bp, 25.0%), known as CFT073-specific islands, are inserted into the backbone regions in an extensively mosaic manner. Regarding virulence and virulence-associated genes, the annotation includes 12 types of fimbriae, 7 autotransporters, and toxin operons such as
hlyCABD and
upxBDA [
2]. Since its first release, the annotation has not only presented an overview of the complexity of the pathogen's lifestyle but has also served as a guide for experimental design.
However, several lines of evidence suggest a need for the reannotation of the
Escherichia coli CFT073 genome, partly because of discoveries and corrections that have accumulated over time, even though the original RefSeq annotation has been updated with some minor corrections. For example, new autotransporter-encoding genes and some vital population density control factors are missing from the annotation [
4,
5], while a growing number of novel small RNAs (sRNAs) have recently been found to add to the complexity of virulence regulatory networks [
6]. In addition, a computational estimate suggests that the annotation quality of translation initiation sites is surprisingly lower in this strain than in its close relative, K-12 MG1655 [
7]. Moreover, a similar observation, along with low annotation quality of CDSs, has been made in other
E. coli strains (for example, APEC O1), by syntactic annotation methods [
8]. Such observations suggest that the highly diverse adaptive paths of different
E. coli strains call for more sophisticated annotation methods than the traditional ones. Research on how CFT073 establishes its virulence during the UTI process needs a comprehensive and precise picture of the genomic structure of this pathogen rather than piecemeal information. Therefore, a thorough reannotation of CFT073 is justified for future studies.
Reannotation is the process of annotating a previously annotated genome using better bioinformatics methods and more complete databases [
9]. Aimed at improving both gene structure and functional information, genome reannotation was recognized as important even before the completion of the first genome sequence [
9,
10]. However, among the sequenced microbial genomes (845 at the time of writing), examples of genome-wide reannotation are surprisingly rare [
11]. Nevertheless, several common features can be drawn from the small number of documented projects [
11-
14]. Firstly, the functional examination of previously annotated genes has become common practice in reannotation, thanks to advances in sequence comparison and new experimental data from the literature [
11-
14]. Secondly, new genes may also be described, with evidence mostly from
de novo gene prediction or sequence comparison to public databases such as SWISS-PROT [
13], and to a lesser degree from experimental genome analysis data [
12]. Finally, almost all projects involve manual curation by experts to assign more precise designations, which helps avoid flawed research. In addition to a genome-wide analysis, particular interest may be directed to subsets of genes. For instance, Chen
et al. [
14] focused on assignment of function to genes recognized as being "hypothetical" in previous annotations.
In this work, we combine automated annotation tools with manual efforts to provide a comprehensive and precise reannotation of the
Escherichia coli CFT073 genome. Here we refer to the current RefSeq release as the original annotation [RefSeq: NC_004431] for CFT073, although the very first annotation from 2002 has since been updated with some minor corrections. With a focus on virulence genes, the reannotation was achieved through literature curation and the application of several analytical methods, including gene-finding tools, sequence/domain similarity searches and transmembrane region analysis. As a result, 608 coding sequences (CDSs) annotated in RefSeq were excluded, while a total of 299 CDSs are new relative to the original annotation, one third of which are found in genomic island (GI) regions. Subsequent analyses were conducted through both general and case studies on genes that are crucial during the UTI process, including those involved in invasion, colonization, nutrient uptake and population density control. Besides virulence factors, miscellaneous RNAs are believed to contribute to the virulence of strain CFT073 [
6]. Therefore, the reannotation presents a total of 40 new miscellaneous RNA genes based on literature curation and database searching. The CFT073 reannotation resource is freely available
via http://mech.ctb.pku.edu.cn/CFT073/. Following the proposal by Salzberg [
10], the reannotation website includes three sections: a brief overview of the methods for reannotation, links to browse the reannotation, and links for data download.
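To make the bookkeeping behind counts such as "excluded" and "new" CDSs concrete, the following is a minimal Python sketch of one common way to compare two annotations of the same genome. It is not the pipeline used in this work: the `CDS` record, the helper names (`stop_key`, `compare_annotations`, `in_any_island`), the matching rule (same strand and same 3' end) and all coordinates are illustrative assumptions.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    start: int   # 1-based, inclusive (hypothetical toy coordinates)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def stop_key(cds):
    """Key a CDS by strand and the coordinate of its 3' (stop) end.

    Assumption: two predictions sharing strand and stop position describe
    the same gene, even if their start sites differ.
    """
    return (cds.strand, cds.end if cds.strand == "+" else cds.start)


def compare_annotations(original, reannotated):
    """Return (excluded, added, restarted) CDS lists, each sorted."""
    orig = {stop_key(c): c for c in original}
    new = {stop_key(c): c for c in reannotated}
    # Present only in the original annotation
    excluded = sorted(orig[k] for k in orig.keys() - new.keys())
    # Present only in the reannotation
    added = sorted(new[k] for k in new.keys() - orig.keys())
    # Same stop end, but a different translation initiation site
    restarted = sorted(new[k] for k in orig.keys() & new.keys()
                       if orig[k] != new[k])
    return excluded, added, restarted


def in_any_island(cds, islands):
    """True if the CDS lies entirely inside one (start, end) island interval."""
    return any(s <= cds.start and cds.end <= e for s, e in islands)


# Toy data, not taken from the CFT073 genome
original = [CDS(100, 400, "+"), CDS(500, 900, "-"), CDS(1000, 1300, "+")]
reannotated = [CDS(130, 400, "+"), CDS(500, 900, "-"), CDS(1500, 1800, "+")]
islands = [(1400, 2000)]

excluded, added, restarted = compare_annotations(original, reannotated)
island_fraction = sum(in_any_island(c, islands) for c in added) / len(added)

print("excluded:", excluded)
print("added:", added)
print("start site changed:", restarted)
print("fraction of added CDSs in islands:", island_fraction)
```

Keying on the stop end rather than on both coordinates lets a corrected translation initiation site count as a revision of an existing gene rather than as one exclusion plus one new gene. Other matching rules, such as reciprocal overlap thresholds, are equally plausible.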
In general, the new CDSs and miscellaneous RNA genes bring new perspectives on the virulence properties of this pathogen. We expect the reannotation to complement the original annotation and hope that it will facilitate the study of new mechanisms of uropathogenicity in CFT073 by a variety of research communities.