What is the optimal number of input peptides for the performance of the Proteasix Prediction tool?
The processing time for a list of input peptides depends on: a) the client/end-user machine; and b) the internet connection speed.
Based on the stress testing performed, 1000 input peptides (i.e. 1K) is the recommended number. However, good performance was repeatedly obtained with up to 6K input peptides in the observed mode (finding proteases by matching against cleavage site associations collected from the literature) and up to 3K in the predicted mode (finding proteases by calculating the probability of cleavage by a protease based on MEROPS specificity matrices).
Where does the information contained in the Proteasix Knowledge Base come from?
The Proteasix Knowledge Base incorporates information from reference databases including the UniProt Knowledgebase (UniProtKB), the MEROPS database and CutDB, as well as from the literature and from ontologies such as the Gene Ontology. Previous observations of protease/cleavage site associations were extracted from the MEROPS database, UniProtKB and the literature, through both manual literature mining and automated Java scripts. Cleavage site sequences and scissile bond positions were aligned with the UniProtKB parent protein sequences and stored in octapeptide form (i.e. P4 P3 P2 P1-P1' P2' P3' P4', with the scissile bond between the P1 and P1' residues). Further curation was performed to annotate each substrate and protease with a stable UniProt unique identifier (a.k.a. UniProt accession number, or UniProt AC for short).
All information contained in the Proteasix Knowledge Base is cross-referenced to its external source.
Why can't peptide sequences (e.g. the ones identified by mass spectrometry) be pasted directly into the Prediction tool?
Freely available online tools already exist that are specifically dedicated to retrieving the start and end amino acid positions of a peptide from its sequence, for example: 1) the Peptide search tool from the Universal Protein Resource (UniProt) finds all UniProtKB sequences that exactly match a query peptide sequence; 2) the Batch Peptide Match tool from the Protein Information Resource (established in 1984 by the National Biomedical Research Foundation) retrieves protein sequences that exactly match a peptide query.
Proteasix seeks to build upon current resources with the aim of complementing existing tools, rather than replicating functionality already provided by other freely available online tools. For this reason, Proteasix takes, for each input peptide, the parent protein (UniProt identifier or UniProt accession number) together with the start and end amino acid positions, instead of the peptide sequence.
What do P4, P3, P2, P1, P1', P2', P3', P4' mean?
These labels belong to the cleavage site nomenclature. Amino acid residues in the N-terminal direction from the scissile bond (i.e. the peptide bond hydrolysed by the protease) are designated P1, P2, P3 and P4, counting away from the bond, so they are written P4 P3 P2 P1 in sequence order. Amino acid residues in the C-terminal direction are designated P1', P2', P3' and P4'. Cleavage occurs at the scissile bond between the P1 and P1' residues.
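As an illustration of this nomenclature, the following Python sketch extracts the octapeptide around a scissile bond from a parent protein sequence, and applies it to both ends of a peptide described by its start and end positions. It is only a sketch under stated assumptions: the function name `octapeptide`, the `-` padding near the protein termini and the example sequence are hypothetical and are not taken from the Proteasix implementation.

```python
def octapeptide(sequence: str, p1_position: int, pad: str = "-") -> str:
    """Return the residues P4 P3 P2 P1 P1' P2' P3' P4' around a scissile bond.

    `p1_position` is the 1-based position of the P1 residue, i.e. the bond
    lies between residues p1_position and p1_position + 1.
    Positions falling outside the protein are filled with `pad`
    (an assumption made for this sketch).
    """
    if not 1 <= p1_position < len(sequence):
        raise ValueError("the scissile bond must lie between two residues")
    residues = []
    # P4 is three residues before P1; P4' is four residues after P1.
    for pos in range(p1_position - 3, p1_position + 5):
        inside = 1 <= pos <= len(sequence)
        residues.append(sequence[pos - 1] if inside else pad)
    return "".join(residues)


# Made-up parent protein and peptide coordinates (1-based, inclusive).
protein = "MKTAYIAKQRQISFVKSHFSRQ"
start, end = 6, 12  # the peptide IAKQRQI

# Bond that released the peptide N-terminus: P1 is the residue before `start`.
n_term_site = octapeptide(protein, start - 1)
# Bond that released the peptide C-terminus: P1 is the residue at `end`.
c_term_site = octapeptide(protein, end)

print(n_term_site[:4] + "-" + n_term_site[4:])  # KTAY-IAKQ
print(c_term_site[:4] + "-" + c_term_site[4:])  # QRQI-SFVK
```

The hyphen in the printed form marks the scissile bond between P1 and P1'.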
What are the MEROPS specificity matrices?
Proteases exhibit varying binding affinities for amino acids, ranging from strict restriction to one or a few critical amino acids at given positions, to generic binding with little discrimination between different amino acids. The MEROPS database records this information in specificity matrices, which show how frequently each amino acid occurs at each of the eight positions of a cleavage site. More information about MEROPS specificity matrices can be found on the MEROPS database website.
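To make the idea of a specificity matrix concrete, the sketch below tallies how often each amino acid occurs at each of the eight positions in a set of aligned cleavage sites. The three octapeptides are made up for illustration and the function name `count_matrix` is hypothetical; real matrices should be taken from MEROPS itself.

```python
from collections import Counter

POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]


def count_matrix(sites):
    """Count residue occurrences per cleavage site position."""
    matrix = {name: Counter() for name in POSITIONS}
    for site in sites:
        if len(site) != len(POSITIONS):
            raise ValueError(f"expected an octapeptide, got {site!r}")
        for name, residue in zip(POSITIONS, site):
            matrix[name][residue] += 1
    return matrix


# Made-up aligned cleavage sites for one hypothetical protease.
sites = ["AAPRGSLK", "GVPRSTLE", "LTPKGSWD"]
matrix = count_matrix(sites)

for name in POSITIONS:
    print(name, dict(matrix[name]))
```

In this toy example P2 is strictly proline and P1 is restricted to arginine or lysine, while P4 shows little discrimination, which mirrors the range of behaviours described above.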
How is the probability for cleavage site prediction calculated?
Confidence thresholds of predicted proteolysis were determined using MEROPS specificity weight matrices for 319 proteases. Out of the 319 proteases, 138 were associated with more than 40 observed proteolysis cleavage sites in Proteasix. For these proteases, two thirds of the 50213 observed cleavage sites contained in Proteasix were randomly selected to establish the predicted proteolysis confidence thresholds (training set). The probability of cleavage for each cleavage site from the training set was estimated using a log-likelihood based on the corresponding protease MEROPS specificity weight matrix. Receiver operating characteristic (ROC) curves were plotted for each of the proteases. Predicted confidence thresholds were then established to achieve a sensitivity of prediction of 80% (i.e. percentage of cleavage sites correctly predicted to be cleaved by the protease). Based on these thresholds, specificity of prediction (i.e. percentage of cleavage sites correctly predicted not to be cleaved by the protease) varied from 30% to 97% (70±18%).
For the remaining proteases, the probability of cleavage for the approximately 25.6 billion possible amino acid combinations (20 possible amino acids at each of the 8 positions of the cleavage site, i.e. 20^8) was estimated using a log-likelihood based on the corresponding protease MEROPS specificity weight matrix. Predicted confidence thresholds were then set to the 99th percentile of the population distribution of all possible sequences. We chose this strict threshold because we observed, for the 144 proteases with a training set, that the 99th percentile achieved a lower sensitivity (fewer true positive predictions) but a higher specificity (fewer false positive predictions).
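The scoring and thresholding steps described above can be sketched in Python as follows. The exact log-likelihood formula used by Proteasix is not given here, so this sketch assumes a common choice: a log-odds score against a uniform amino acid background, with a pseudocount to avoid taking the logarithm of zero. The toy counts stand in for a MEROPS specificity matrix, and all function names and example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Number of possible octapeptides: 20 residues at each of 8 positions.
N_OCTAPEPTIDES = len(AMINO_ACIDS) ** 8  # 25,600,000,000


def log_likelihood(site, counts, pseudocount=1.0):
    """Assumed scoring: sum over positions of log(frequency / background)."""
    background = 1.0 / len(AMINO_ACIDS)
    score = 0.0
    for column, residue in zip(counts, site):
        total = sum(column.values()) + pseudocount * len(AMINO_ACIDS)
        frequency = (column.get(residue, 0) + pseudocount) / total
        score += math.log(frequency / background)
    return score


def threshold_for_sensitivity(scores, sensitivity=0.80):
    """Highest threshold at which at least `sensitivity` of the known
    cleavage sites score at or above the threshold."""
    if not scores:
        raise ValueError("at least one score is required")
    ranked = sorted(scores, reverse=True)
    needed = math.ceil(sensitivity * len(ranked))
    return ranked[needed - 1]


# Toy per-position counts (P4 .. P4') built from made-up cleavage sites.
training_sites = ["AAPRGSLK", "GVPRSTLE", "LTPKGSWD", "AVPRGTLK", "GAPKSSLE"]
counts = [Counter(column) for column in zip(*training_sites)]

training_scores = [log_likelihood(site, counts) for site in training_sites]
threshold = threshold_for_sensitivity(training_scores, sensitivity=0.80)
print(f"threshold for 80% sensitivity: {threshold:.2f}")
```

A candidate cleavage site would then be reported as predicted when its score is at or above the threshold for the protease in question. For proteases without a training set, the same scoring function could be applied to the population of possible octapeptides and the threshold set at its 99th percentile.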
Sensitivity and specificity of predicted proteolysis were assessed using two independent validation sets: sensitivity (i.e. percentage of cleavage sites correctly predicted to be cleaved by the protease) was calculated using the remaining third of the 50213 observed cleavage sites contained in Proteasix, and specificity (i.e. percentage of cleavage sites correctly predicted not to be cleaved by the protease) was calculated using a set of 15000 random sequences of 8 amino acids. Using the first validation set, we achieved proteolysis prediction with 42±30% sensitivity, with most of the proteases having a sensitivity of 0-19% (for those with the lowest number of cleavage sites used for the validation) or 60-79% (for those with the highest number of cleavage sites used for the validation). Using the second validation set, proteolysis prediction demonstrated 95±11% specificity, with most of the proteases having a specificity of 80-100%.
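The two validation measures can be written down in a few lines of Python. This is a minimal sketch, assuming that a site counts as predicted when its score is at or above the protease threshold and that the random octapeptides are treated as non-cleaved sequences; the function names, the seed and the example scores are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_octapeptides(n, seed=0):
    """Draw n random sequences of 8 amino acids (uniform residue choice)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(8)) for _ in range(n)]


def sensitivity(positive_scores, threshold):
    """Fraction of observed cleavage sites correctly predicted as cleaved."""
    return sum(s >= threshold for s in positive_scores) / len(positive_scores)


def specificity(negative_scores, threshold):
    """Fraction of non-cleaved sequences correctly predicted as not cleaved."""
    return sum(s < threshold for s in negative_scores) / len(negative_scores)


# Made-up scores for held-out observed sites and for random sequences.
held_out_scores = [2.0, 3.5, 0.5, 4.0]
random_scores = [-1.0, 0.2, 1.5, -3.0, 0.9]

print(sensitivity(held_out_scores, threshold=1.0))  # 0.75
print(specificity(random_scores, threshold=1.0))    # 0.8

# A background set of the same size as the one described above.
background = random_octapeptides(15000)
```

In practice the scores would come from applying a protease's specificity matrix to each held-out cleavage site and to each random octapeptide.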
Confidence thresholds, sensitivity and specificity for each of the MEROPS specificity matrices used for prediction.
What is the Proteasix Ontology (PxO)?
Developing new resources that integrate existing data typically involves bringing together the external data within new bespoke schemas. Ontologies have become popular in the life sciences for the annotation of data and offer novel ways for the analysis and integration of biological data. An ontology language, such as the Web Ontology Language (OWL), provides a precise semantics for the language that can be used to build a conceptualisation of a domain, along with querying and inference over data. Hence, in order to consolidate the data within the Proteasix Knowledge Base, we designed the Proteasix Ontology (PxO), an ontology describing the biological underpinnings of the generation of peptides and supporting the Proteasix peptide-centric prediction tool.
The Proteasix Ontology. Arguello Casteleiro M, Klein J, Stevens R. J Biomed Semantics. 2016 Jun 4;7(1):33. doi: 10.1186/s13326-016-0078-9. PMID: 27259807
How can I contribute to Proteasix?
If you have evidence of proteolytic cleavages that are not yet included in the Proteasix Knowledge Base, please let us know by sending an email to firstname.lastname@example.org
How can I cite Proteasix in a publication?
Proteasix: a tool for automated and large-scale prediction of proteases involved in naturally occurring peptide generation. Klein J, Eales J, Zürbig P, Vlahou A, Mischak H, Stevens R. Proteomics. 2013 Apr;13(7):1077-82. doi: 10.1002/pmic.201200493. Epub 2013 Feb 26. PMID: 23348921