Lysine acetylation is an important post-translational modification of both histone and non-histone proteins. Thousands of acetylated proteins are known, yet the lysine acetyltransferases (KATs) responsible for acetylating most of them have not been identified. After analyzing the sequence features of proteins acetylated by different KAT families, we found that KAT-catalyzed acetylation appears to be substrate-specific. Based on this observation and on the known acetylated proteins, we developed the Acetylation Set Enrichment-Based (ASEB) method to predict which KAT families are responsible for acetylating a given protein. A total of 280 experimentally validated CBP/p300 and 84 GCN5/PCAF family acetylated lysine sites were manually collected. The ASEB method predicts novel KAT-specific acetylated sites based on the differing characteristics of these two sets of lysine sites.
Data Update :
We manually collected new data for CBP/p300, PCAF/GCN5 and Class I HDAC, bringing the totals to 341, 112 and 42 sites, respectively. In addition, 35 acetylation sites of the MYST family and 129 deacetylation sites of SIRT1 were added.
Method Update :
The original ASEB prediction method considered only sequence features. Functional context also matters: for example, proteins that participate in different biological processes generally do not interact with each other. We therefore optimized the algorithm by adding functional features, namely Biological Process, Cellular Component and Molecular Function features from the GO database, Protein-Protein Interaction features from the STRING database, and Protein Functional Domain features from the Pfam database.
Human proteins acetylated by the CBP/p300, GCN5/PCAF and MYST KAT families and deacetylated by Class I HDAC and SIRT1 were collected by searching the PubMed literature with keywords. Papers reporting identified acetylation or deacetylation sites together with KAT or HDAC information were examined and selected, and the acetylated and deacetylated sites were carefully reviewed. With this service, users can look up the known modified sites and the responsible KATs or HDACs for a query protein, either by Swiss-Prot accession number or by downloading the dataset directly.
This service predicts the responsible KATs or HDACs for a query submitted either as a protein sequence or as a Swiss-Prot accession number.
Predict(S) : Sequence-based prediction
To predict novel acetylation and deacetylation sites in a KAT- or HDAC-specific way, we used the ASEB method, which employs a strategy similar to GSEA (Mootha et al., 2003; Subramanian et al., 2005; Guttman et al., 2009). We focused on finding sites similar in sequence to the known sites of each KAT or HDAC family: CBP/p300, GCN5/PCAF, the MYST family, Class I HDAC and SIRT1. Each query is assigned a P-value according to its similarity to the known modified sites. P-values for query peptides range from 0.0001 to 1, with a resolution of 0.0001. The smaller the P-value, the stronger the evidence that the given peptide is acetylated by the KAT family or deacetylated by the HDAC family.
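The P-value range described above (a minimum of 0.0001 and a step of 0.0001) is what one would expect from an empirical P-value estimated against 10,000 random background scores. The following Python sketch illustrates that idea with an unweighted GSEA-style running-sum statistic. It is only an illustration under stated assumptions: the function names, the unweighted scoring and the background size of 10,000 are hypothetical and are not taken from the ASEB implementation.

```python
def enrichment_score(ranked_ids, member_set):
    """Unweighted GSEA-style running sum over a ranked list.

    Walks down the ranked list, stepping up for members of the set and
    down for non-members, and returns the largest deviation from zero.
    (Assumption: ASEB's actual scoring may weight steps differently.)
    """
    n_hit = sum(1 for x in ranked_ids if x in member_set)
    n_miss = len(ranked_ids) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for x in ranked_ids:
        if x in member_set:
            running += 1.0 / n_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def empirical_p_value(observed, null_scores):
    """Fraction of background scores at least as large as the observed one.

    The count is floored at 1, so with 10,000 background scores the
    smallest reportable P-value is 1/10,000 = 0.0001.
    """
    count = sum(1 for s in null_scores if s >= observed)
    return max(count, 1) / len(null_scores)
```

With a background of 10,000 scores, `empirical_p_value` can only return multiples of 0.0001, which matches the resolution stated above.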
• Users can view one of the shortest paths between the query protein and each KAT or HDAC. A separate page shows all of the shortest paths. When the shortest path is determined, the edge weights in the protein-protein interaction (PPI) network are estimated from the PINA and STRING databases. An example is available on the Predict(S) page.
• Users can use the provided template script to run predictions for many proteins. The script analyzes query proteins programmatically rather than through manual interaction.
Note: The protein-protein interaction view service is only available for human proteins.
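As an illustration of the shortest-path view described above, here is a minimal Python sketch of Dijkstra's algorithm on a weighted, undirected PPI graph. The toy network, the node names and the edge costs are all hypothetical; how the service actually converts PINA and STRING information into edge weights is not specified here, so treating a smaller cost as a stronger interaction is an assumption.

```python
import heapq

# Hypothetical toy network: node -> {neighbour: cost}. Lower cost is assumed
# to mean a more reliable interaction; the real weighting is not documented here.
TOY_PPI = {
    "QUERY": {"P1": 0.2, "P2": 0.9},
    "P1": {"QUERY": 0.2, "KAT": 0.3},
    "P2": {"QUERY": 0.9, "KAT": 0.05},
    "KAT": {"P1": 0.3, "P2": 0.05},
}


def shortest_path(graph, source, target):
    """Return (path, total_cost) for one lowest-cost path, or (None, inf)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]
```

Calling `shortest_path(TOY_PPI, "QUERY", "KAT")` returns one lowest-cost route through the toy network; listing all equally short paths, as the separate page does, would require additionally tracking every predecessor that ties for the minimum cost.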
Predict(S+F) : Sequence- and function-based prediction
In this new method, sequence and functional features are converted into numeric vectors. Each amino acid is encoded as a 20-bit binary tuple. For each functional feature, the corresponding bit is set to 1 if the candidate has the feature and to 0 otherwise. Nine Support Vector Machine models were constructed with the same positive set (partitioned differently into training and testing sets) and different negative sets. If at least five of the nine models predict that acetylation or deacetylation occurs at a site, the site is considered modified.
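The encoding and voting steps described above can be sketched in a few lines of Python. This is an illustration only: it assumes the 20-bit tuple is a one-hot encoding, that the residue order is alphabetical by one-letter code, and that unknown residues map to all zeros. The function names and the feature identifiers are hypothetical, and the SVM models themselves are not shown.

```python
# Assumed residue order; the order used by the service is not documented here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(aa):
    """One-hot 20-bit tuple for one residue (all zeros if not a standard residue)."""
    return tuple(1 if aa == ref else 0 for ref in AMINO_ACIDS)


def encode_peptide(peptide):
    """Concatenate the 20-bit tuples of every residue in the peptide."""
    vector = []
    for aa in peptide:
        vector.extend(encode_residue(aa))
    return vector


def encode_features(candidate_features, feature_index):
    """One bit per functional feature: 1 if the candidate has it, else 0."""
    return [1 if f in candidate_features else 0 for f in feature_index]


def majority_vote(votes, threshold=5):
    """True if at least `threshold` of the model votes are positive."""
    return sum(1 for v in votes if v) >= threshold
```

For example, a 17-residue window yields a sequence vector of 340 bits, and `majority_vote` applied to the nine model outputs returns True only when five or more of them are positive.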
Note: Because the similarity check requires a full sequence window, sites with fewer than 8 amino acids on either side of the lysine are excluded.
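A minimal Python sketch of this filtering step is given below. It assumes the rule means that a lysine is kept only when at least 8 residues are present on both its N-terminal and C-terminal sides, giving a 17-residue window; the function name and the 1-based position convention are illustrative choices, not part of the service.

```python
def lysine_windows(sequence, flank=8):
    """Return (position, window) pairs for lysines with full flanking context.

    position is 1-based; window is the lysine plus `flank` residues on each
    side. Lysines closer than `flank` residues to either terminus are skipped.
    """
    windows = []
    for i, aa in enumerate(sequence):
        if aa != "K":
            continue
        if i < flank or i + flank >= len(sequence):
            continue  # not enough residues on one side
        windows.append((i + 1, sequence[i - flank:i + flank + 1]))
    return windows
```

Under this interpretation, a lysine at position 5 of a protein would never be scored, regardless of how long the sequence is on its C-terminal side.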
Suggestions: Functional features describe the protein as a whole, so all sites on the same protein share the same functional features. Predict(S+F) is therefore better suited to screening for substrate proteins, whereas Predict(S) performs better at finding substrate sites on known substrate proteins.
• First, leave-one-out validation was applied to both methods.
• Second, an independent data set from other species was predicted with both methods.
• Third, verification experiments by immunoprecipitation combined with Western blotting were conducted to test the prediction results.
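The leave-one-out step in the list above can be sketched generically in Python: each known site is held out in turn and scored against the remaining sites. The scoring function used here (best fraction of identical positions) is a hypothetical stand-in chosen only to make the sketch runnable; it is not the scoring used by Predict(S) or Predict(S+F).

```python
def identity(a, b):
    """Fraction of aligned positions at which two equal-length peptides match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_identity(query, references):
    """Hypothetical stand-in score: highest identity to any reference peptide."""
    return max(identity(query, ref) for ref in references)


def leave_one_out_scores(known_sites, score_fn):
    """Score each known site against all the others, holding it out in turn."""
    results = []
    for i, held_out in enumerate(known_sites):
        rest = known_sites[:i] + known_sites[i + 1:]
        results.append((held_out, score_fn(held_out, rest)))
    return results
```

A call such as `leave_one_out_scores(sites, best_identity)` returns one score per known site, which can then be compared with scores of background peptides to estimate how well known sites are recovered.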
Detailed validation results for the two methods can be found on the Predict(S) and Predict(S+F) pages, respectively.
• Download the template program in Perl to access the web services and parse the output data.
• Download the collected data set of human proteins modified by the CBP/p300, PCAF/GCN5 and MYST families, Class I HDAC and SIRT1.
• Download the ASEB R package from Bioconductor (>= 2.10) for sequence-based prediction.
• Download the Predict(S+F) package to predict based on sequence and functional features.
All comments, suggestions, questions, and bug reports are welcome. For inquiries, please send an e-mail to Tingting Li, Ph.D., Peking University Health Science Center via firstname.lastname@example.org.