SPROUTS has been designed to give scientists access to data related to protein folding prediction. To this end, we processed a set of proteins with five different tools devoted to the prediction of stability changes upon point mutation. We also provide the results obtained with two further methods: one directly predicts the residues involved in the core of a protein structure, and the other characterizes fragments whose ends are assumed to be part of the folding nucleus. To illustrate the possibilities offered, we present here a few small use cases that describe some of the features implemented on the server.
Use case 1: The user is planning to
synthesize the L14A mutant of the engrailed homeodomain, PDB code:
1enh. Will this mutant be folded and play the role of an analog to
the wild homeodomain or not?
This question can be answered easily by querying the 1enh structure with the appropriate filter on the residue to mutate. The results consist of ΔΔG values computed by the five tools. Each one predicts a stability change, ranging from 2.62 kcal/mol for I-Mutant sequence only to 3.59 kcal/mol for PoPMuSiC, which indicates that the L14A mutation greatly destabilizes the structure of the protein. These results converge and tell the user that this particular mutation would give rise to an unstable mutant that would be very difficult to manipulate afterwards.
Use case 2: Which positions are globally more
sensitive to mutation than the other ones along the 1enh sequence
according to I-Mutant sequence only?
The first way of answering this question is to query the database on the 1enh structure with a restriction on the tool, here I-Mutant sequence only. The user gets 1026 ΔΔG values split into 6 pages. The question requires a higher-level interpretation of the results, so a more complex analysis is needed. The 2D visualization mode offers this possibility by generating the graph of the stability score for each position of the sequence. It summarizes, for each position, the stability change upon the 19 possible mutations. By looking at the graph, the user can directly locate 7 extrema around positions 3, 7, 35, 46, 50, 52 and 54. These positions may be of interest, and the user can then retrieve the exact ΔΔG values in order to integrate them into further studies.
Use case 3: The user is interested in
retrieving the potential residues involved in the folding core of
the 1enh structure. More specifically, they want to know which
positions have a negative stability score regarding the DFIRE tool
prediction and are characterized as MIR along the 1enh structure.
The query consists of the 1enh structure with a single restriction on the DFIRE tool. ΔΔG values alone give no information about the MIR prediction and, as the query concerns protein folding, the 3D mode is the most appropriate way of getting the answer. The user only has to activate the display of stability scores for DFIRE and toggle the MIR prediction button. All the amino acids represented by a small solid sphere in a predominantly red color, surrounded by a bigger transparent purple sphere, correspond to the requested positions. The user can then investigate residues 3, 8, 13, 16, 48, 19 (according to the PDB numbering).
Gibbs free energy change due to mutation is a good approximation to characterize the stability of a given structure. It consists of a succession of energetic terms that attempt to capture all the properties and forces that drive the conformation of a protein. In our study we focus on the difference of these energies for the wild type structure ΔGwild and for the mutant structure ΔGmutant. Considering that in the literature various stability prediction methods use different nomenclature, ΔΔG is defined as follows:
ΔΔG = ΔG(mutant) - ΔG(wild)
The unit is kcal/mol. ΔΔG describes whether it costs more in energy to have the mutated amino acid or the wild type one. For example, if ΔΔG < 0 then it costs more in energy to have the wild type structure than the mutant one thus the mutation is more favorable to the structure stability. Conversely, if ΔΔG > 0, the mutant structure ΔG is higher than the wild type one thus the mutation is less favorable to the structure stability.
To summarize the overall stability change tendency for a given amino acid, the 19 ΔΔG values are combined into a single value. To normalize this result, instead of summing the ΔΔG values, a score is given to each ΔΔG. If ΔΔG < 0, the mutation is considered stabilizing and it is granted a value of +1. Conversely, if ΔΔG > 0, the mutation is considered destabilizing and the value is -1. This procedure produces a score in the range [-19,+19] which reflects the global stability change for an amino acid upon its mutation. The lower the score, the more sensitive the position is to mutation, i.e., the native residue is the most stable one at that position.
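The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the server's code: the function name and the example ΔΔG values are hypothetical, and a ΔΔG of exactly 0 is assumed to contribute nothing, since the text does not say how that case is handled.

```python
from typing import Iterable


def stability_score(ddg_values: Iterable[float]) -> int:
    """Add +1 per stabilizing mutation (ΔΔG < 0) and -1 per destabilizing
    one (ΔΔG > 0).

    Assumption: ΔΔG == 0 contributes nothing to the score.
    """
    score = 0
    for ddg in ddg_values:
        if ddg < 0:
            score += 1
        elif ddg > 0:
            score -= 1
    return score


# Hypothetical ΔΔG values (kcal/mol) for the 19 substitutions at one position.
example_ddg = [1.2, 0.8, -0.3, 2.1, 0.5, 1.7, 0.9, -0.1, 1.1, 0.4,
               2.6, 0.7, 1.3, 0.2, 1.9, 0.6, -0.5, 1.0, 0.3]

print(stability_score(example_ddg))  # 3 stabilizing, 16 destabilizing -> -13
```

With 19 substitutions per position the result necessarily stays within [-19,+19].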
The query interface currently offers the minimum set of options needed to retrieve the results associated with one structure. Figure 1 is a snapshot of the page and shows the six parameters the user can modify. The default selection for skipping a criterion is indicated by "---".
Figure 1: Database query form
Once data is available in the database, it can be visualized with
three different methods.
The first way of visualizing results is to read the data extracted from the database directly. The user retrieves all the information contained in the database that is relevant to the submitted query. There is no post-processing of the data, and it is the simplest way of getting the results. Figure 2 is an example of the results corresponding to a simple query for the MUpro data for the 1asu protein.
Figure 2: Raw results obtained for the 1asu protein
Eight columns are available:
Right side menu:
The user can sort the results by clicking on any column header. By default, the list is ordered by residue number. Results are split into pages to avoid scrolling through a long list, and the number of results per page can be changed in the query form. We recommend the CSV download for users who need to manipulate or view large portions of the data.
This visualization mode is intended for quick interpretation of the results for a given structure. In one graph, the user can superimpose the results for any set of tools and can smooth the curves for a clearer view. The user may also display a consensus, which is calculated as the mean of the data provided by each tool. We also provide the results of the Most Interacting Residues (MIR) prediction and the Tightened End Fragment (TEF) assignment (if a PDB file is available). Figure 3 shows the pop-up that manages this functionality. Graphs are generated with the gnuplot software.
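As an illustration of the consensus curve, the following minimal sketch averages per-position scores over several tools. The function name and the score values are hypothetical; the text only states that the consensus is the mean of the data provided by each tool.

```python
from statistics import mean
from typing import Dict, List


def consensus(scores_by_tool: Dict[str, List[float]]) -> List[float]:
    """Average the per-position scores over all tools.

    Assumption: every tool provides one score per position, in the same order.
    """
    profiles = list(scores_by_tool.values())
    if not profiles:
        return []
    length = len(profiles[0])
    if any(len(profile) != length for profile in profiles):
        raise ValueError("all tools must cover the same positions")
    # zip(*profiles) groups the scores of all tools position by position.
    return [mean(position) for position in zip(*profiles)]


# Hypothetical stability scores for three positions.
scores = {
    "DFIRE": [-15, 3, -9],
    "MUpro": [-17, 1, -11],
    "PoPMuSiC": [-19, 5, -7],
}
print(consensus(scores))
```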
Figure 3: Pop-up of the 2D visualization mode. The graphs represent the stability score along the 1asu sequence
On the right is the graph. It represents the stability score of each amino acid along the sequence of a given structure. The stability score summarizes the effect of the 19 possible mutations for an amino acid. If the ΔΔG of a mutation is negative and beyond the required threshold (thus the mutation increases the stability of the structure), we add +1 to the score. Conversely, if the ΔΔG of a mutation is positive and beyond the required threshold (thus the mutation decreases the stability of the structure), we add -1 to the score. We repeat this operation for the 19 possible mutations of each amino acid and finally obtain a score in the range [-19:+19]. The word stability therefore refers here to the stability after the mutation.
This kind of graph is useful to quickly see which regions of a sequence are sensitive to mutations. Indeed, if almost every mutation has a destabilizing effect, the score will be around -19. Conversely, if almost every mutation has a stabilizing effect, the score will be around +19. The ability to distinguish these extrema is of great importance as it highlights the positions which are very sensitive and which should therefore play a role in the folding of the structure. By adding the scores in this way, we hope to filter out insignificant individual ΔΔG contributions. We recommend focusing mainly on the maxima and minima of this curve, because individual tools may diverge elsewhere. The positions of maxima and minima are often consistent among the various tools; this is a general trend, not a rule. The interpretation should therefore be the following: positions close to -19 must be mutated with caution, while positions close to +19 are rather safe. This is the only conclusion one can draw from such a rough model.
The smoothing process is useful because the original graphs are very sharp and difficult to evaluate. Currently we smooth them with the Pascal triangle. This technique takes into account the neighborhood of a point (4 neighbors on each side of the point in this case) and thus reduces the number of peaks. The downside is a loss of accuracy, but it helps to locate the regions of interest. This smoothing procedure is also the origin of the peaked values at both ends of the curve, because some neighbors are missing for a complete application of the triangle. We think that much better smoothing of the ΔΔG values is possible and we plan to enhance this feature in the future.
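A Pascal triangle smoothing with 4 neighbors on each side can be sketched as a weighted moving average whose 9 weights are the binomial coefficients of row 8 of the triangle (1, 8, 28, 56, 70, 56, 28, 8, 1). The code below is a minimal sketch, not the server's implementation: the function name is hypothetical, and the treatment of the curve ends (dropping the missing neighbors and renormalizing the remaining weights) is an assumption that may differ from what the server does.

```python
from math import comb
from typing import List, Sequence


def pascal_smooth(values: Sequence[float], half_width: int = 4) -> List[float]:
    """Smooth a profile with binomial (Pascal triangle) weights.

    Assumption: near the ends, missing neighbors are dropped and the
    remaining weights are renormalized.
    """
    row = 2 * half_width
    weights = [comb(row, k) for k in range(row + 1)]  # 1, 8, 28, ..., 8, 1
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0
        for offset in range(-half_width, half_width + 1):
            j = i + offset
            if 0 <= j < len(values):
                w = weights[offset + half_width]
                total += w * values[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


# A single sharp peak is spread over its neighbors.
profile = [0, 0, 0, 0, 256, 0, 0, 0, 0]
print(pascal_smooth(profile))
```

A flat profile is left unchanged by this weighting, while an isolated peak is lowered and widened, which is the loss of accuracy mentioned above.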
In the scope of providing a meta server with information related to the folding core of proteins, we offer the possibility to get the MIR prediction and the TEF assignment of the studied protein. These two methods are summarized below and for further details, see the related references.
This simulation is repeated 100 times with different initial conformations. The number of neighbors is recorded after each series of 10 Monte Carlo steps, and at the end of the process, an average Number of Contact Neighbors (NCN) is calculated for each amino acid of the sequence, excluding the first neighbors along the main chain. Amino acids surrounded by many others play a role in the compactness of the protein and thus are called Most Interacting Residues (MIR).
Later on, it has been shown that the ends of these closed loops are mainly occupied by hydrophobic amino acids. A thorough analysis demonstrated that these hydrophobic amino acids were highly conserved among structures of the same family, although containing distantly related sequences: these positions were called topohydrophobic.
The concept of TEF emerged from the junction between closed loops and topohydrophobic positions.
Under the graphs, the protein sequence, the TEF assignment and the MIR prediction are displayed. We are looking into mapping zones of the graphs to this information to better locate the regions of interest.
This mode is intended to aid structural biologists who are more accustomed to manipulating 3D structures and objects. The Jmol 3D applet has been used in our project as its development has been targeted specifically at Web browser integration.
Several options are available, such as the ability to display the stability score computed for each amino acid with a selected tool. Please note that consensus data is not available here yet. A small sphere located on the alpha carbon of each amino acid, colored along a gradient, displays this information. The color palette goes from red, representing a score of -19, to blue for a score of +19. The user can also locate the TEFs assigned to the protein structure. In this case, the fragments are selected and the cartoon representation is colored differently for each TEF. When two TEFs overlap, the color of the overlap is a mix of the colors of the two TEFs. For example, if the first TEF is in blue and the second one in yellow, the overlapping segment will be in green. The last option is the localization of the residues characterized as MIR. A transparent purple sphere, with a greater diameter than the one used for the stability score, is drawn on the alpha carbons of the MIR. Figure 4 shows an example of the 3D applet for the 1asu structure.
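The red-to-blue palette can be illustrated with a simple linear interpolation between the two colors. This is a sketch under assumptions: the text only gives the two end colors, so the linear interpolation in RGB space and the function name are hypothetical and may not match the color scheme actually used in the applet.

```python
from typing import Tuple


def score_to_rgb(score: float, lo: float = -19.0, hi: float = 19.0) -> Tuple[int, int, int]:
    """Map a stability score to an RGB color from red (lo) to blue (hi).

    Assumption: colors are interpolated linearly in RGB space.
    """
    clamped = max(lo, min(hi, score))   # keep out-of-range scores on the palette
    t = (clamped - lo) / (hi - lo)      # 0.0 at -19, 1.0 at +19
    red = round(255 * (1.0 - t))
    blue = round(255 * t)
    return (red, 0, blue)


print(score_to_rgb(-19))  # (255, 0, 0), pure red
print(score_to_rgb(19))   # (0, 0, 255), pure blue
```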
Figure 4: Pop-up of the 3D visualization mode for the 1asu structure. All the display options have been selected: stability score for the DFIRE tool, MIR prediction and the TEF assignment.
The Jmol applet is a powerful visualization tool which offers many options. We only provide some predefined scripts that we consider the best representations of the available information. The user may access all the Jmol options by right-clicking inside the applet and selecting the desired options. This may cause the loss of the information currently displayed. In this case, the user can reset the view by clicking the appropriate button. The raw structure button displays the whole structure, i.e. all the chains if several exist, in a wireframe representation.
The submission interface offers two basic ways to submit proteins for analysis. A user may submit a valid PDB ID and the server will automatically retrieve all needed data from the PDB. Alternatively, a user may submit custom data in FASTA format. In the latter case, the user is also required to enter a 4-letter alphanumeric code that identifies the submission and is later used to retrieve the results from the SPROUTS database, and is given the option to submit data in DSSP and/or PDB formats. A PDB ID or all three files (FASTA, DSSP and PDB) must be submitted to execute the complete workflow. Figure 5 is a snapshot of the page and shows the main options available to the user. The first part of the form allows the user to select the source of data to use, while the second part requires the user to enter an email address so they can be notified when the request is complete.
There are two data sources:
The initial project was a comparison of several tools devoted to the prediction of stability changes upon single point mutations on a given structure. The amount of data produced gave rise to the idea that this information should also benefit other scientists who may be interested in it. The decision to create such a database followed, and after some discussion, the desire to offer more services than a data repository emerged.
From the original 10 structures, we now have 129 structures with related results for the five mentioned tools. We also offer two modes to interpret and analyze them in a friendlier way: graphs, and annotated 3D structures for scientists more accustomed to structural bioinformatics.
Our efforts are still underway to provide more functions in order to increase the number of levels of analysis. Another aspect will be to add other kinds of related data; we thus aim at offering a non-exhaustive but coherent meta server on the protein folding core theme, based on protein stability.
Tool: List of the tools used to predict the stability changes upon point mutations. 4+2+2 are available: DFIRE, MUpro, PoPMuSiC, and FoldX 3.0 beta 5.1, plus 2 versions of I-Mutant 2.0.6 (one with the sequence only as input and the other with sequence + structure data as input), plus 2 versions of I-Mutant 3.0.6 (same two modes). The two modes of I-Mutant are considered two distinct tools as they do not use the same input.
Residue number: The result selection can also be restricted to a residue number in the protein sequence. For this purpose, the user has to enter a valid positive integer. The numbering follows the sequence indices, thus it always runs from 1 to n. The PDB numbering is not used here.
DDG Signs: ΔΔG indicates whether the stability of a given structure increases or decreases after a mutation has been applied. With this selection, the user can filter mutations by the sign of their ΔΔG.
DDG Threshold: The user can choose to filter mutations by the magnitude of ΔΔG that should be considered. For instance, if 2.00 is selected, then any |ΔΔG| values less than 2.00 will be assumed to be neutral and not be classified as stabilizing or destabilizing.
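The combined effect of the sign and threshold filters can be sketched as follows. The function name and labels are hypothetical, and a ΔΔG of exactly 0 with no threshold is assumed to be neutral; only the rule that |ΔΔG| below the threshold counts as neutral comes from the description above.

```python
def classify_mutation(ddg: float, threshold: float = 0.0) -> str:
    """Label a mutation from its ΔΔG (kcal/mol) and a magnitude threshold."""
    if abs(ddg) < threshold:
        return "neutral"        # too small to be classified
    if ddg < 0:
        return "stabilizing"
    if ddg > 0:
        return "destabilizing"
    return "neutral"            # assumption: ΔΔG == 0 with no threshold


print(classify_mutation(1.50, threshold=2.00))  # neutral
print(classify_mutation(3.59, threshold=2.00))  # destabilizing
```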
Mutation letter: The user can select the substituting amino acid for which results are wanted. The list contains the 20 different amino acids that can replace the wild type one. In the results page, the one-letter code is used.
Smooth graphs: The smoothing process is useful because the original graphs are very sharp and difficult to evaluate. We have decided to smooth them with the Pascal triangle. This technique takes into account the neighbourhood of a point (4 neighbours on each side of the point in this case) and thus reduces the number of peaks. The drawback is a loss of accuracy, but it helps to locate the regions of interest.
Gradient of stability score: The stability score of each amino acid of a given protein, for a specific tool, can be represented by a colored sphere located on the alpha carbon of the residue. A gradient from red to blue is used to represent the [-19:+19] score range, from -19 to +19 respectively. Normally, the five tools are available, plus the consensus which averages the results of the five tools for each residue, thus resulting in mean stability score values. If the query is tool specific, only the given tool can be selected. If the query is residue specific, all the gradients are disabled and we suggest that the user refer to the table of results. In case of missing residues in the structure file or exception errors for a given tool, the gradient is not available as the stability depends on the whole protein. Only the tools providing results for all the residue mutations are taken into account.
View MIR: A Monte Carlo algorithm is used to simulate the early steps of protein folding on a (2,1,0) lattice. An amino acid is randomly selected and displaced to a new available position on the lattice. The energy of both initial and final conformations is computed from the Miyazawa and Jernigan potential of mean force [Miyazawa and Jernigan, 1996] and the Metropolis criterion is then applied [Papandreou et al., 2004; Chomilier et al., 2004]. The starting point is the protein structure in a random coil conformation and the simulation typically runs for 10^6 Monte Carlo steps.
This simulation is repeated 100 times with different initial conformations. The number of neighbors is recorded after each series of 10 Monte Carlo steps, and at the end of the process, an average Number of Contact Neighbors (NCN) is calculated for each amino acid of the sequence. Actually, amino acids surrounded by many others play a role in the compactness of the protein and thus are called Most Interacting Residues (MIR).
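Two steps of the procedure described above lend themselves to a short illustration: the Metropolis acceptance test applied to each trial move, and the averaging of the recorded neighbor counts into an NCN per residue. The sketch below is not the published implementation: the lattice move set and the Miyazawa and Jernigan energy function are left out, and the function names and the `kt` temperature parameter are hypothetical.

```python
import math
import random
from typing import List, Sequence


def metropolis_accept(e_old: float, e_new: float, kt: float,
                      rng: random.Random) -> bool:
    """Standard Metropolis criterion: always accept a move that does not
    raise the energy, otherwise accept with probability exp(-ΔE / kT)."""
    delta = e_new - e_old
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / kt)


def average_ncn(snapshots: Sequence[Sequence[int]]) -> List[float]:
    """Average the per-residue neighbor counts over all recorded snapshots.

    Each snapshot holds one neighbor count per residue (hypothetical layout).
    """
    if not snapshots:
        return []
    n_residues = len(snapshots[0])
    return [sum(s[i] for s in snapshots) / len(snapshots)
            for i in range(n_residues)]


rng = random.Random(0)
print(metropolis_accept(e_old=-3.0, e_new=-4.5, kt=1.0, rng=rng))  # downhill move
print(average_ncn([[2, 4, 1], [4, 6, 1]]))
```

Residues with the highest average counts would then be the candidates for MIR; the cutoff used to select them is not given in the text.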
View TEF: Along the backbone of a protein, some pairs of amino acids can be very close in several places, with a typical distance between their alpha carbons below 10 Å. The histogram of the sequence separation between these "contact" amino acids is not smooth, and presents a maximum around 25 amino acids [Berezovsky et al., 2000]. These sequence fragments were initially called closed loops [Ittah and Haas, 1995].
Later on, it has been shown that the ends of these closed loops are mainly occupied by hydrophobic amino acids. A thorough analysis demonstrated that these hydrophobic amino acids were highly conserved among structures of the same family, although containing distantly related sequences: these positions were called topohydrophobic [Poupon and Mornon, 1998].
The concept of TEF emerged from the junction between closed loops and topohydrophobic positions.
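The contact criterion behind closed loops (alpha carbons closer than 10 Å) and the histogram of sequence separations can be sketched as below. This is only an illustration of the criterion, not the TEF assignment itself: the function names are hypothetical, and the `min_separation` parameter, which skips residues adjacent in sequence, is an assumption not stated in the text.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_separations(ca_coords: Sequence[Coord], cutoff: float = 10.0,
                        min_separation: int = 2) -> List[int]:
    """Return the sequence separation j - i of every residue pair whose
    alpha carbons lie closer than `cutoff` angstroms.

    Assumption: pairs closer than `min_separation` in sequence are skipped.
    """
    separations = []
    for i in range(len(ca_coords)):
        for j in range(i + min_separation, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                separations.append(j - i)
    return separations


def separation_histogram(separations: Sequence[int]) -> Counter:
    """Count how many contacts occur at each sequence separation."""
    return Counter(separations)


# Hypothetical alpha carbon coordinates: four residues on a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(separation_histogram(contact_separations(coords)))
```

On a real structure, the histogram would be computed over all chains of interest and inspected for the maximum around 25 residues reported above.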