Two types of acetylation occur widely in proteins (1–8). The first, Nα-terminal acetylation, is catalyzed by a variety of N-terminal acetyltransferases (NATs), which co-translationally transfer acetyl moieties from acetyl-coenzyme A (Acetyl-CoA) to the α-amino (Nα) group of protein amino-terminal residues (1,2). Although Nα-terminal acetylation is rare in prokaryotes, an estimated ~85% of eukaryotic proteins are Nα-terminally modified (1,2). The second type, Nε-lysine acetylation, specifically modifies the ε-amino group of protein lysine residues (3–8). Although Nε-lysine acetylation is less common, it is one of the most important and ubiquitous post-translational modifications (PTMs) and is conserved in prokaryotes and eukaryotes (1,2). Moreover, acetylation and deacetylation are dynamically and temporally regulated by histone acetyltransferases (HATs) and histone deacetylases (HDACs), respectively (4–8).
In 1964, Allfrey et al. (9) first observed that lysine acetylation of histones plays an essential role in the regulation of gene expression. Subsequent studies in epigenetics solidified this seminal discovery and proposed acetylation as a key component of the ‘histone code’ (6). Beyond histones, a wide range of non-histone proteins can also be lysine acetylated and are involved in a variety of biological processes, such as transcription regulation (10), DNA replication (11), cellular signaling (12) and stress response (13). Aberrant lysine acetylation and deacetylation are associated with various diseases, including cancers (5,7,14). In particular, acetylation has been implicated in cellular metabolism and aging (15–17), and the sirtuins, a class of NAD+-dependent HDACs, might be potent drug targets for promoting longevity (13,17).
Although considerable effort has been made during the past four decades, the functions of lysine acetylation are still far from fully understood. In this regard, identifying acetylated substrates together with their sites is fundamental for understanding the molecular mechanisms and regulatory roles of acetylation. In contrast with labor-intensive and time-consuming conventional experimental approaches, recent acetylome studies using high-throughput mass spectrometry (MS) have detected thousands of acetylation sites. In 2006, Kim et al. (14) performed a large-scale identification of the acetylome with an anti-acetyllysine antibody and detected 195 acetylated proteins with 388 sites in HeLa cells and mouse liver mitochondria (14). With a similar strategy, Choudhary et al. (11) experimentally identified 3600 acetylation sites in human. In 2010, Zhao et al. (16) discovered 1047 acetylated substrates in human liver and demonstrated that acetylation plays a major role in metabolic regulation. Furthermore, two acetylomic studies revealed that the functions of lysine acetylation are conserved in Escherichia coli (18) and Salmonella enterica (15).
Since the number of known acetylation sites has increased rapidly, there is an urgent need to collect the experimental data and provide an integrated resource for the community. Several public databases, such as PhosphoSitePlus (19), HPRD (20), SysPTM (21) and dbPTM (22), already contain protein acetylation information. These databases curate both Nα-terminal and Nε-lysine acetylation data, and lysine acetylation sites usually make up only a limited part of the total. For example, SysPTM 1.1 contains 3001 acetylation sites in 2000 proteins, with only 345 lysine sites (~11.5%) in 397 substrates (21). In dbPTM 2.0, 2071 experimentally verified acetylation sites were collected in 1525 proteins, with only 792 lysine sites (~38.2%) in 299 targets (22). Interestingly, HPRD release 9 contains 4691 total sites in 1987 proteins, with 4420 lysine sites (~94.2%) in 1821 substrates (20). However, HPRD focuses only on human protein information (20), and thousands of lysine acetylation sites in other species remain to be collected.
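The lysine-site percentages quoted above follow directly from the site counts. A minimal sketch of that arithmetic is given below; the function name `lysine_share` is introduced here for illustration only.

```python
def lysine_share(lysine_sites, total_sites):
    """Return the percentage of acetylation sites that are lysine sites, to one decimal."""
    return round(100 * lysine_sites / total_sites, 1)


# Site counts as quoted in the text for each database release.
counts = {
    "SysPTM 1.1": (345, 3001),
    "dbPTM 2.0": (792, 2071),
    "HPRD release 9": (4420, 4691),
}

for name, (lysine, total) in counts.items():
    print(f"{name}: {lysine_share(lysine, total)}% lysine sites")
```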
To meet the need for complete acetylomes, we developed a novel database, the Compendium of Protein Lysine Acetylation (CPLA). From the scientific literature in PubMed, we manually curated 3311 acetylated proteins with 7151 lysine sites (Table 1). The CPLA database provides the primary references and other annotations of these substrates, and also integrates protein–protein interaction (PPI) information. Based on the Gene Ontology (GO) and InterPro annotations, we analyzed the functional diversity and regulatory roles of lysine acetylation. As 75.64% of all lysine acetylation sites are from Homo sapiens, we constructed a potential human lysine acetylation network (HLAN) among HATs, substrates and HDACs, with 1019 PPIs among 199 proteins. Interestingly, we revealed 1862 potential HAT-substrate-HDAC triplet relationships, of which at least 13 had previously been experimentally verified. Taken together, the CPLA database might serve as an integrated resource for protein lysine acetylation and provide useful information for further experimental or computational studies.
To ensure the quality of the CPLA database, we searched PubMed with the major keyword ‘acetylation’ and collected experimentally identified lysine-acetylated proteins with their sites from more than 18500 published articles (before 1 March 2010). To avoid missing data, we also searched for further articles with the keywords ‘acetylated’ and ‘acetyl’. After all substrates with unambiguous acetylated lysines were collected, we searched the UniProt Knowledgebase (23) to obtain the corresponding protein sequences and associated annotation information. The theoretical pI (isoelectric point) and Mw (molecular weight) were calculated for each protein (http://www.expasy.org/tools/pi_tool.html) (24,25).
PPI information was also integrated into the CPLA database where available. We took experimental PPIs from several major public databases (on 10 April 2010), namely HPRD (20), BioGRID (26), DIP (27), MINT (28) and IntAct (29), and removed redundant PPIs. In addition, the well-known STRING database of predicted interactions (30) was also used. All proteins were mapped to UniProt sequences by BLAST. For human, we collected a total of 59481 experimental PPIs among 12221 proteins and 1212607 predicted PPIs among 16523 proteins. Detailed statistics of the PPI information are shown in Supplementary Table S1.
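The redundancy-removal step can be sketched as follows. This is only an illustration, under the assumption that two records are redundant when they name the same unordered pair of proteins; the text does not specify the exact criterion, and the function name `deduplicate_ppis` and the identifiers in the usage example are hypothetical.

```python
def deduplicate_ppis(pairs):
    """Collapse redundant interactions, treating (A, B) and (B, A) as the same pair.

    Assumption: each interaction is a 2-tuple of protein identifiers that have
    already been mapped to a common namespace (e.g. UniProt accessions).
    """
    unique = set()
    for a, b in pairs:
        # Sorting the two identifiers gives one canonical key per unordered pair.
        unique.add(tuple(sorted((a, b))))
    return sorted(unique)


# Hypothetical records merged from several source databases.
merged = [("P1", "P2"), ("P2", "P1"), ("P1", "P3"), ("P1", "P2")]
print(deduplicate_ppis(merged))
```

A canonical sorted key keeps the merge independent of the order in which the source databases list the two partners.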
The CPLA database 1.0 was developed in a user-friendly manner. The search option (http://cpla.biocuckoo.org/search.php) provides an interface for querying the CPLA 1.0 database with one or several keywords, such as gene/protein name, UniProt ID or CPLA ID. For example, if the keyword ‘STAT3’ is entered and submitted (Figure 1A), the result is shown in tabular format, with the CPLA ID, UniProt accession and protein/gene names/aliases (Figure 1B). Clicking on the CPLA ID (CPLA-000136) shows the detailed information for human STAT3 (Figure 1C). The acetylation information provided includes the acetylated positions, flanking peptides, experimental reagents or upstream HATs, and primary references. The protein sequence, GO annotation, domain organization, molecular weight, computed/theoretical pI and PPI information are also presented.
Furthermore, we provided three additional advanced options: (i) advanced search, (ii) browse and (iii) BLAST search (Supplementary Figure S1). (i) Advanced search: users can combine up to two search terms to locate precise information. The search interface permits querying by different database fields and linking the queries through one of three operators, ‘and’, ‘or’ and ‘exclude’ (Supplementary Figure S1A). (ii) Browse: instead of searching for a specific protein, users can list all entries of the CPLA database by species name (Supplementary Figure S1B). (iii) BLAST search: this option was designed for quickly finding related information in the CPLA database. The blastall program of the NCBI BLAST package (31) was included in the CPLA 1.0 database (Supplementary Figure S1C). Users can input a protein sequence in FASTA format to search for identical or homologous proteins.
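The way the three operators link two queries can be illustrated with set operations. The sketch below is not the CPLA implementation; it assumes each query has already been resolved to a set of entry identifiers, and the function name `combine_hits` and the identifiers in the usage example are hypothetical.

```python
def combine_hits(first, second, operator):
    """Combine the hit sets of two search terms with 'and', 'or' or 'exclude'."""
    if operator == "and":
        return first & second   # entries matching both terms
    if operator == "or":
        return first | second   # entries matching either term
    if operator == "exclude":
        return first - second   # entries matching the first term but not the second
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical hit sets for two search terms.
term_one = {"E1", "E2", "E3"}
term_two = {"E2", "E4"}
print(sorted(combine_hits(term_one, term_two, "exclude")))
```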
Recent progress toward understanding the full functional content of the acetylome has experimentally revealed several thousand lysine-acetylated substrates with their sites. Besides experimental efforts, computational studies such as predictor construction and database development have also attracted much attention. The currently available computational resources are summarized in Supplementary Table S2. Among these studies, database development is particularly important for integrating experimental data from heterogeneous sources and providing a high-quality benchmark for further experimental or computational designs. Although several public databases (19–22) already maintain acetylation information, lysine acetylation is usually collected together with Nα-terminal acetylation, which is less tightly regulated. In this work, we focused only on protein lysine acetylation and manually curated 7151 lysine acetylation sites in 3311 proteins.
Since a large proportion of acetylation sites were taken from Homo sapiens, we had the opportunity to analyze the abundance and functional diversity of lysine acetylation at the acetylome level. We surveyed the GO terms of 2585 acetylated proteins from UniProt annotations. Using the human proteome as the background, we identified over-represented biological processes, molecular functions and cellular components in the acetylome with the hypergeometric distribution (P<0.01). The top five most enriched GO entries in each category are shown in Table 2. Our analyses revealed several potentially interesting results. For example, the three most abundant biological processes, translational elongation, RNA splicing and mRNA processing, suggest that acetylation predominantly regulates gene expression in a post-transcriptional manner (Table 2). Also, the four most over-represented molecular functions, ATP binding, protein binding, RNA binding and nucleotide binding, suggest that acetylation modulates enzyme activity and protein interaction ability (Table 2). In addition, the statistical analysis of cellular components revealed that acetylated proteins are highly enriched in distinct cellular compartments. For instance, ~30% of cytosol proteins and ~62% of mitochondrial matrix proteins are acetylated (Table 2). For more detailed information, the top 15 most over-represented GO terms and InterPro domains are shown in Supplementary Tables S3 and S4.
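A hypergeometric over-representation test of this kind can be sketched as below. This is a minimal illustration rather than the exact procedure used for Table 2: it assumes a one-sided upper-tail P-value, the function and parameter names are introduced here, and the numbers in the usage example are toy values, not CPLA data.

```python
from math import comb


def hypergeom_enrichment_p(background_size, background_with_term, sample_size, sample_with_term):
    """Upper-tail hypergeometric P-value, P(X >= sample_with_term).

    background_size      -- proteins in the background proteome (N)
    background_with_term -- background proteins annotated with the GO term (K)
    sample_size          -- acetylated proteins examined (n)
    sample_with_term     -- acetylated proteins annotated with the GO term (k)
    """
    upper = min(sample_size, background_with_term)
    # Count the ways to draw i annotated and (n - i) unannotated proteins, for i = k..upper.
    favourable = sum(
        comb(background_with_term, i)
        * comb(background_size - background_with_term, sample_size - i)
        for i in range(sample_with_term, upper + 1)
    )
    return favourable / comb(background_size, sample_size)


# Toy example: 10 background proteins, 4 carry the term, 3 are sampled, 2 of those carry the term.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(f"P = {p_value:.4f}")
```

Exact integer binomial coefficients avoid floating-point underflow in the tail; for proteome-scale inputs a statistics library would be faster.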
The acetylation and deacetylation of proteins are carried out by HATs and HDACs, which antagonistically and dynamically control protein function. Combining experimental and predicted PPIs, we constructed a potential HLAN among HATs, substrates and HDACs, with 1019 PPIs among 199 proteins (Supplementary Table S5). If only experimental PPIs are considered, the core HLAN contains 369 PPIs among 77 proteins, including 12 HATs and 12 HDACs (Figure 2). From the whole HLAN, we retrieved 1862 potential HAT–substrate–HDAC triplet relations (Supplementary Table S6). If a substrate is itself a HAT or HDAC, it must be acetylated or deacetylated by a different HAT or HDAC. We carefully surveyed the scientific literature and found that at least 13 triplet interactions have been experimentally identified (Supplementary Table S6). For example, Gaughan et al. (32) observed that Tip60 (KAT5) and histone deacetylase 1 (HDAC1) regulate the transcriptional activity of the androgen receptor (AR) by changing its acetylation status, forming a KAT5-AR-HDAC1 relation (Supplementary Table S6). Moreover, our analysis also yielded a number of potentially interesting predictions. For instance, EP300 acetylates BCL6 at K379 and inhibits its function, while the deacetylases were not clearly identified (33). In our results, the relations EP300-BCL6-HDAC5, EP300-BCL6-SIRT2, EP300-BCL6-HDAC11, EP300-BCL6-HDAC3, EP300-BCL6-HDAC2 and EP300-BCL6-HDAC8 suggest that BCL6 might be deacetylated by multiple HDACs (Supplementary Table S6). Moreover, human GCMa/GCM1 was reported to be acetylated by CBP/CREBBP at K367, K406 and K409 (34). In our results, the relations CREBBP-GCM1-HDAC3, CREBBP-GCM1-HDAC3, CREBBP-GCM1-HDAC1 and CREBBP-GCM1-HDAC4 suggest that at least four HDACs might deacetylate GCM1 (Supplementary Table S6).
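The triplet retrieval can be sketched as a simple graph walk. The code below is an illustration under stated assumptions, not the pipeline behind Supplementary Table S6: it assumes a triplet is reported whenever a substrate interacts with both a HAT and an HDAC, it enforces the rule that a substrate cannot be its own HAT or HDAC, and the function name `find_triplets` and the toy edge list are introduced here for demonstration.

```python
from itertools import product


def find_triplets(ppis, hats, hdacs, substrates):
    """List (HAT, substrate, HDAC) triplets supported by the interaction network."""
    # Build an undirected adjacency map from the interaction pairs.
    neighbors = {}
    for a, b in ppis:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triplets = []
    for substrate in sorted(substrates):
        partners = neighbors.get(substrate, set())
        # A substrate that is itself a HAT or HDAC must be paired with a different enzyme.
        partner_hats = sorted((partners & hats) - {substrate})
        partner_hdacs = sorted((partners & hdacs) - {substrate})
        triplets.extend(
            (hat, substrate, hdac) for hat, hdac in product(partner_hats, partner_hdacs)
        )
    return triplets


# Toy network; the edges are illustrative input, not records taken from CPLA.
toy_ppis = [("EP300", "BCL6"), ("BCL6", "HDAC3"), ("BCL6", "HDAC2"),
            ("KAT5", "AR"), ("AR", "HDAC1")]
for triplet in find_triplets(toy_ppis, {"EP300", "KAT5"},
                             {"HDAC1", "HDAC2", "HDAC3"}, {"AR", "BCL6"}):
    print("-".join(triplet))
```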
Taken together, we developed a comprehensive database of protein lysine acetylation. The statistical analyses revealed the functional diversity and enrichment of acetylation, while the network studies generated a large number of potentially useful results for further experimental or computational research. The CPLA database will be routinely updated as new acetylated substrates are reported.
Supplementary Data are available at NAR Online.
Funding for open access charge: National Basic Research Program (973 project) (2010CB945400, 2007CB947401); National Natural Science Foundation of China (90919001, 30700138, 30900835, 30830036, 31071154); Chinese Academy of Sciences (INFO-115-C01-SDB4-36).
Conflict of interest statement. None declared.