The SBSA program takes its name from its originally intended function as a Salt Bridge Statistical Analyzer. In the end, however, it does far more than that. The program consists of several modules that take a list of protein sequences in FASTA format, predict secondary structure, extract the helical segments, search them for motifs, run a statistical analysis on the result, and produce a set of graphs.
The program was written by Charlie Hall and Jimmy Saw, two UH graduate students, for a class called Computational Astrobiology, taught by Kim Binsted.
If you would like to use this program, click here to download a zipped file. Click here to get the README document for SBSA. I use this program on a MacOS operating system via the terminal interface, but I think that it will also work on a Unix system. Please note several things about this program:
-There is no technical support other than this web page. Please do not email me questions about how to install/use/make it work. If I don’t have it written down here then I don’t know the answer.
-If you have questions, some of the answers can be found in the README file that is part of the program.
-Depending on what programs you already have on your computer, you may need to download extra dependent programs. For the most part, these programs are documented in the README file included with the program. Other programs that may be needed are: the Perl GD module (available from cpan.org), the C library libgd (available from http://www.boutell.com/gd/), and libpng (available from http://www.libpng.org/pub/png/libpng.html).
-If you use this program in putting together a paper, please cite our paper that originally used this program (“Composition of α-helices in Proteins from Extremophilic Microorganisms” Boal, A. K.; Hall, C. M.; Saw, J. H. W.; Binsted, K.; Brown, M. V. Manuscript in Preparation) as well as the authors of JNET, the secondary structure prediction program, which we used with permission of its authors (“Application of Enhanced Multiple Sequence Alignment Profiles to Improve Protein Secondary Structure Prediction” Cuff, J. A.; Barton, G. J. Proteins 1999, 40, 502-511).
What SBSA does:
Secondary structure prediction
In the first step, SBSA takes the raw primary protein sequence and runs it through the JNET program to predict secondary structure. When all of the proteins have been fed through this routine, SBSA then goes through the predicted structures and extracts the sequences that correspond to predicted helical sections of the protein. These are then searched for motif distributions.
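The extraction step can be illustrated with a minimal Python sketch. It is not SBSA's own code. It assumes the prediction is a per-residue string of the same length as the sequence, with "H" marking helix, "E" strand, and "-" coil; the function name extract_helices and the example sequence are hypothetical.

```python
import re

def extract_helices(sequence, prediction, min_length=1):
    """Return the parts of `sequence` aligned with runs of 'H' in `prediction`."""
    if len(sequence) != len(prediction):
        raise ValueError("sequence and prediction must have the same length")
    return [
        sequence[m.start():m.end()]
        for m in re.finditer(r"H+", prediction)   # each run of helix states
        if m.end() - m.start() >= min_length
    ]

# Made-up example: two predicted helical stretches in a 22-residue sequence.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
pred = "--HHHHHHHH---EEEE-HHHH"
helices = extract_helices(seq, pred)
print(helices)
```

The resulting list of helix sequences is what the motif search in the next step would operate on.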
Motif searches
SBSA then goes through the data file and counts the number of times various motifs are found in the data set. On the original protein list, SBSA first counts the number of times each amino acid is found (Nopai) and the total number of amino acids (atp) in the data set. Similar values are counted and correlated for the α-helices (Nohai and ath). Finally, SBSA counts the number of motifs observed (Nomi) in the helix list. Here, we define a motif as a pattern of amino acids which, when in a helical structure, are positioned so that their side chains could interact with each other. This means amino acids placed either three or four residues apart (called the i,i+3 and i,i+4 positions). In a primary sequence, this means that patterns such as X1..X2 or X1...X2 are counted. Here, X1 and X2 are specific amino acids and each “.” represents any amino acid. Taking the standard set of 20 amino acids, there are 800 possible motifs of this nature (A..A to Y...Y). Additionally, SBSA counts three-amino-acid patterns. The configurations are X1..X2..X3, X1...X2..X3, X1..X2...X3, and X1...X2...X3, which yields an additional 32000 motifs.
Statistical analysis
Once the counting is done, SBSA then computes the percent occurrence (%o) for amino acids in the proteins. For amino acids and motifs in helices, %o, propensity (Pr), and z-values are calculated. For protein amino acids, the percent occurrence (%opai) is calculated with the following equation:
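Written out from the definitions above (a reconstruction, assuming the percentage is expressed on a 0 to 100 scale), with Nopai the count of amino acid i in the proteins and atp the total number of amino acids in the proteins:

```latex
\[
\%o_{pa_i} = \frac{No_{pa_i}}{a_{tp}} \times 100
\]
```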

For amino acids in the helix, a similar %o value (%ohai) is calculated using Nohai and ath. The number of each amino acid predicted to be used in a helix (NPhai), based on the percent distribution in proteins, is calculated as:
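A reconstruction from the description (an assumption, not copied from the SBSA source): the protein-wide fraction of amino acid i is applied to the total number of helix residues, written here as a_th.

```latex
\[
NP_{ha_i} = \frac{\%o_{pa_i}}{100} \times a_{th}
\]
```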

This value is then used to calculate the propensity for each amino acid in the helix (Prhai) using:
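Assuming the usual observed-over-expected definition of propensity (which matches the Pr>1 and Pr<1 reading used for the graphs), this would be:

```latex
\[
Pr_{ha_i} = \frac{No_{ha_i}}{NP_{ha_i}}
\]
```

Under the reconstruction of NPhai as the protein-wide fraction times the helix total, this is algebraically the same as the ratio of the helix percent occurrence to the protein percent occurrence.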

Finally, the statistical reliability is calculated using the z-test. The z-test is a change-of-proportions test, which is calculated as:
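The text does not pin down which variant of the proportions test is meant. One common one-sample form, assumed here purely for illustration, compares the observed helix fraction of amino acid i with its protein-wide fraction:

```latex
\[
z_{ha_i} = \frac{\hat{p}_{h} - p_{p}}{\sqrt{p_{p}\,(1 - p_{p}) / a_{th}}},
\qquad
\hat{p}_{h} = \frac{No_{ha_i}}{a_{th}},
\quad
p_{p} = \frac{No_{pa_i}}{a_{tp}}
\]
```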

Here, a z-value of 2 indicates that the value is significant at the 95% confidence level, and for the 99% confidence level, a value of 3 is used. (For more information on Z-values, see: http://www.adamssixsigma.com/Newsletters/standard_normal_table.htm)
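The amino acid statistics described above can be tried out on a toy data set with the following Python sketch. It is not SBSA's code. Percent occurrence, expected helix count, and propensity follow the descriptions in the text; the z-value uses a one-sample proportions test, which is an assumption because the exact variant is not specified. The function name amino_acid_stats and the sequences are made up.

```python
from collections import Counter
from math import sqrt

def amino_acid_stats(proteins, helices):
    """Per-amino-acid percent occurrence, expected helix count, propensity, z."""
    protein_counts = Counter("".join(proteins))  # Nopai for each amino acid
    helix_counts = Counter("".join(helices))     # Nohai for each amino acid
    atp = sum(protein_counts.values())           # total residues in proteins
    ath = sum(helix_counts.values())             # total residues in helices
    if atp == 0 or ath == 0:
        raise ValueError("need at least one protein residue and one helix residue")

    stats = {}
    for aa, no_p in sorted(protein_counts.items()):
        frac_p = no_p / atp                      # protein-wide fraction
        no_h = helix_counts.get(aa, 0)
        np_h = frac_p * ath                      # expected count in helices
        # ASSUMPTION: one-sample z-test of the helix fraction against frac_p.
        std_err = sqrt(frac_p * (1 - frac_p) / ath)
        z = (no_h / ath - frac_p) / std_err if std_err > 0 else float("nan")
        stats[aa] = {
            "pct_protein": 100 * frac_p,
            "pct_helix": 100 * no_h / ath,
            "expected_in_helix": np_h,
            "propensity": no_h / np_h,           # observed / expected
            "z": z,
        }
    return stats

# Made-up toy data: two short "proteins" and the helices taken from them.
stats = amino_acid_stats(["AAKKEELL", "GGAAKKLL"], ["AAKK", "AAKK"])
for aa, row in stats.items():
    print(aa, row)
```

In this toy set, alanine makes up a quarter of the protein residues but half of the helix residues, so its propensity comes out above 1, while glutamate never appears in a helix and gets a propensity of 0.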
For motifs, the total number of motifs possible (mt) was first calculated using the following equation:
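A reconstruction consistent with the symbol definitions that follow (an assumption, not taken from the SBSA source), in which a helix of length hli offers hli - ml + 1 starting positions for a motif of length ml, and helices shorter than the motif contribute nothing:

```latex
\[
m_t = \sum_{i} nh_{l_i}\,\left(h_{l_i} - m_l + 1\right)
\]
```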

Where ml is the length of the motif and hli is the length of a helix, both in number of amino acids. nhli is the number of helices of a given length, and the total is summed over all of the helix lengths in the data set. mt was then used to calculate the number of a given motif that is expected (NPmi) using:

%o, Pr, and z-values were then calculated using variations of the above equations.
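For a two-amino-acid motif, the counting and the motif statistics can be sketched in Python as follows. This is not SBSA's code. The observed count and the number of possible positions follow the descriptions in the text. The expected count is an assumption: it multiplies the number of possible positions by the helix fractions of the two amino acids, which may differ from what SBSA actually does. All function names and sequences are hypothetical.

```python
from collections import Counter

def count_pair_motif(helices, first, second, gap):
    """Count `first` followed by `second` with `gap` arbitrary residues between.

    gap=2 corresponds to X1..X2 (i,i+3); gap=3 to X1...X2 (i,i+4).
    Overlapping occurrences are all counted.
    """
    span = gap + 1
    return sum(
        1
        for helix in helices
        for i in range(len(helix) - span)
        if helix[i] == first and helix[i + span] == second
    )

def possible_positions(helices, motif_length):
    """Total number of places a motif of this length could start (mt)."""
    return sum(max(len(helix) - motif_length + 1, 0) for helix in helices)

def expected_pair_count(helices, first, second, gap):
    """ASSUMED expected count: mt times the helix fractions of both residues."""
    counts = Counter("".join(helices))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mt = possible_positions(helices, gap + 2)    # motif length = gap + 2
    return mt * (counts[first] / total) * (counts[second] / total)

# Made-up helices for illustration.
helices = ["AEEKAEEK", "AKEAK"]
observed = count_pair_motif(helices, "A", "K", gap=2)
mt = possible_positions(helices, motif_length=4)
expected = expected_pair_count(helices, "A", "K", gap=2)
propensity = observed / expected
print(observed, mt, round(expected, 3), round(propensity, 3))
```

The three-amino-acid motifs could be handled the same way by checking a third position, with the motif length increased accordingly.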
Output files
SBSA has two main output files, one graphical and the other a comma-delimited text file that can be imported into spreadsheet programs for further analysis. The graphical files consist of a square showing all the amino acids on both the x- and y-axes, so the point where two amino acids intersect is the data point for that motif. In the case of three-amino-acid motifs, the first amino acid in the motif is assigned a graph and the other two are on the axes (so all motifs like A..X1..X2 will be on a single graph, C..X1..X2 on a second graph, and so on). There are three types of graphs: one for %o, one for Pr, and one for the z-value. The %o graphs are built from squares, and the shading of each square indicates the relative value for that motif: darker means a higher value. In the Pr graphs, each motif is indicated by either a blue or red square, blue for Pr>1 and red for Pr<1. The z-value graphs have three colors, blue for z>2, white for 2
o_X1..X2 for %o graphs,
p_X1..X2 for Pr graphs, and
z_X1..X2 for z-value graphs.
The full output is also available as a comma-delimited file that is listed in the main outfile subdirectory as “datestamp.s”, where datestamp is the same date stamp that is applied to the main output subdirectory. I typically open these files in Microsoft Excel, and to do so I have to change the file suffix from “.s” to “.txt” or Excel will not recognize the file type.
The first part of the file contains information on the amino acid distribution in the data set. The first row shows the motif, in this case amino acids. The second row (ta) gives the total number of amino acids in all of the proteins in the data set, the third row (no(a_i)) gives the number of times each amino acid was found in the proteins, and the fourth row (po(a_i)) gives the percent occurrence of each amino acid in the proteins.
The rest of the rows in this section are for helix amino acid composition. Row ta_h(a_i) gives the total number of amino acids identified as part of a helix, and no_h(a_i) gives the number of each amino acid type. Rows po_h(a_i), p_h(a_i), and z_h(a_i) give the percent occurrence, propensity, and z-values, respectively, for those amino acids.
Below the amino acid section is the motif section. The first row, m, gives the motif. Row tm(m_i) gives the total number of times that a motif of that type can occur (one value for X..Y motifs, another for X...Y motifs, and so on). For each motif, there are several data rows, which include:
no_m(m_i): number of times that motif i is observed
np_m(m_i): number of times motif i is predicted to occur
po_m(m_i): percent occurrence for motif i
p_m(m_i): propensity for motif i
z_m(m_i): z value for motif i
Click here for an example of the “.s” output file from SBSA.
Other files included in the output are:
Datestamp.eh- a text file that contains a list of the helices extracted from the protein data set
Datestamp.ms- contains the raw motif count output as a comma-delimited file
Datestamp.ss- contains the JNET secondary structure output files for all proteins
How to use SBSA.
First, you need to install the program. On a Mac this is easy: simply download the zip file above and the computer should unpack it by itself (or just double-click the zip file to do it manually). This should put a folder called sbsa09 somewhere on your computer. Move it to where you want and you’re ready to go.
The next thing that you need to do is to prepare the input file. This is necessary because some unseen control characters (put in if you take a FASTA file from MS Word, even if you export it as a plain text file!) can crash the program. Other things that need to be removed include some of the characters present in some FASTA file information lines, unassigned residues in the protein sequence, and sequences over 1500 amino acids (the limit imposed by JNET). To do this, move your file into the util subdirectory in sbsa09. Open a terminal window and move into the util subdirectory, and then use the prepare script as:
user% ./prepare
The outfile will contain your original list of proteins, with any illegal characters fixed and with sequences removed that either contain unassigned amino acids or are over 1500 amino acids long. While the program is running, it will print a message in the terminal window every time it removes a sequence, along with the reason. The one case I know of in which the program will terminate itself is when it finds a duplicate sequence title, which produces the following error message:
“Error: duplicate title found: '130430'
Recommendation: check that you haven't accidentally
included the same sequence twice. If not
(or if that is what you intended to do)
then rename one of them. For example, change
'prot3bc' to 'prot3bc1'.”
To fix this problem, either delete or rename the offending sequence and run prepare again.
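If you want to check a FASTA file yourself before running prepare, the same kinds of checks can be sketched in Python. This is not the prepare script and does not reproduce everything it does. It assumes that unassigned residues are written as "X", that the text after ">" on an information line is the title, and it applies the 1500-residue limit mentioned above. The function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (title, sequence) pairs, dropping non-printable characters."""
    title, chunks = None, []
    for raw in text.splitlines():
        line = "".join(ch for ch in raw if ch.isprintable()).strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:].strip(), []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)

def clean_records(records, max_length=1500, unassigned="X"):
    """Split records into kept and removed; stop on a duplicate title."""
    seen, kept, removed = set(), [], []
    for title, seq in records:
        if title in seen:
            raise ValueError(f"duplicate title found: {title!r}")
        seen.add(title)
        seq = seq.upper()
        if len(seq) > max_length:
            removed.append((title, "too long"))
        elif any(ch in unassigned for ch in seq):   # ASSUMPTION: 'X' = unassigned
            removed.append((title, "unassigned residue"))
        else:
            kept.append((title, seq))
    return kept, removed

# Made-up input: one clean record, one with an unassigned residue, one too long.
fasta = ">p1\nMKTAY\nIAKQR\n>p2\nMKXAY\n>p3\n" + "A" * 1501 + "\n"
kept, removed = clean_records(parse_fasta(fasta))
print(kept)
print(removed)
```

Raising an error on a duplicate title mirrors the behavior described above, where the fix is to delete or rename one of the two sequences.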
Now, move the cleaned file into the bin folder, and here you are ready to run the program. To do so, enter the following into the command line:
user% ./sbsa infile.txt
which will run the full sbsa program in the accurate mode. Immediately, sbsa will produce a new subdirectory in which all of the output files will be placed. The name of this directory is a time stamp of when the program was run (from the year to the second) so you can keep track of which file is which when running the program.
Note that the above command runs the program in the accurate mode, but the fast mode can be run by adding an extra argument to the command line:
user% ./sbsa --fast infile.txt
See above to read about the difference between fast and accurate sbsa modes. You can also run each of the modules independently; each is listed under a separate name in the bin folder. To get help running any of the modules, type the following at the command line:
user% ./sbsa --help
For more information, email me at andy@kiskadden.net


