This program quickly detects the amino acids that characterize differences between two groups of sequences. It compares the two groups and looks for a "signature" pattern: the set of amino acids that is conserved within each set but differs between the sets. It picks out those distinguishing amino acids and calculates their frequencies in each set. (Nucleotide alignments can also be used; however, the following discussion uses amino acids as representative examples.)
Align all of your sequences and divide them into two sets for comparison. All sequences must be the same length, so if some sequences are shorter than others, insert stars (*) at positions where no information is available. Positions with stars are excluded from frequency calculations. Insertions made to maintain the alignment should be dashes (-); positions with dashes are counted and included in the signature pattern analysis. Example:
alphabet-OK ABCDEFGHIJKLMNOPQRSTUVWXYZ
mutant-ABCs ZBCDEFGH-JKLMNOPQRSTVVWX**
In the above sequence alignment, the sequence names are alphabet-OK and mutant-ABCs. For the second sequence, no sequence information was available for the last two positions. The "I" in the first sequence was deleted in the second sequence. U has "mutated" to V, and A to Z. Hence the signature pattern for mutant-ABCs relative to alphabet-OK is:
signature   Z.......-...........V...**
or 3/24 characters.
The periods (.) in the above signature indicate positions where the two sequences agree. The Z, -, and V show where the sequences disagree, defining a signature for the "mutant-ABCs" sequence. The denominator for the three-character signature is 24, not 26, because no sequence information was available for the last two positions.
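The two-sequence example above can be reproduced with a short Python sketch. This is only an illustration of the rules described here, not the program's own code; the function name pairwise_signature is hypothetical, and treating a star in either sequence as "no information" is an assumption (the example only shows stars in the second sequence).

```python
def pairwise_signature(reference: str, query: str) -> tuple[str, int, int]:
    """Return (pattern, differing, informative) for two aligned sequences.

    '.' marks agreement, the query character marks a difference, and '*'
    marks a position with no information (assumed: a star in either sequence).
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")

    pattern = []
    differing = 0    # numerator: positions where the sequences disagree
    informative = 0  # denominator: positions without a star

    for ref_char, query_char in zip(reference.upper(), query.upper()):
        if ref_char == "*" or query_char == "*":
            pattern.append("*")  # excluded from the tally
            continue
        informative += 1
        if ref_char == query_char:
            pattern.append(".")
        else:
            pattern.append(query_char)  # dashes count like any other character
            differing += 1

    return "".join(pattern), differing, informative


pattern, differing, informative = pairwise_signature(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "ZBCDEFGH-JKLMNOPQRSTVVWX**",
)
print(pattern, f"{differing}/{informative}")
```

With the example alignment, the pattern contains Z, -, and V at the three differing positions, and the tally is 3 of 24.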
The allowed characters for inclusion in an alignment are A-Z, -, and *; a-z can be used but will be treated as equivalent to uppercase letters, i.e., A = a. Any other character that is used will be treated as a star, and not counted in the signature pattern tally. Therefore, if you have a stop codon, and you label it as a dollar sign, it will be treated as if you have no information at that site. If, on the other hand, you label it with a Z, it will be included in the signature pattern analysis.
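The character rules above can be sketched as a small normalization step in Python. The name normalize_sequence is hypothetical, and the sketch assumes the input is a single sequence string with no whitespace or line breaks (which would otherwise also be turned into stars).

```python
import string

# Characters kept as-is after uppercasing: A-Z, dash, and star.
ALLOWED = set(string.ascii_uppercase) | {"-", "*"}


def normalize_sequence(sequence: str) -> str:
    """Uppercase letters and replace any other character with a star."""
    normalized = []
    for char in sequence:
        upper = char.upper()
        # Anything outside A-Z, '-', '*' is treated as "no information".
        normalized.append(upper if upper in ALLOWED else "*")
    return "".join(normalized)


print(normalize_sequence("acd$-*Z"))
```

Here the dollar sign becomes a star and is dropped from the tally, while the Z is kept and would be included in the signature pattern analysis.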
Show amino acid frequencies?
Checking the button answers "yes". Leaving it unchecked gives a short output: just the signatures and the frequencies of the signature amino acids in the query and background sets. Checking it gives a long output with the signatures and the count of every amino acid found at every position in both alignments. The long output can be useful if your sequence sets contain positions that are, for example, 50% one amino acid and 50% another.
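The kind of per-position tally that the long output reports can be sketched in Python as follows. The function names position_counts and position_frequencies are hypothetical, and the output layout of the actual program is not reproduced; the sketch only applies the rule that stars are excluded while dashes are counted.

```python
from collections import Counter


def position_counts(sequences: list[str]) -> list[Counter]:
    """Count every character at every alignment position, skipping stars."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("aligned sequences must have the same length")
    counts = []
    for column in zip(*sequences):
        # Stars carry no information; dashes are counted like residues.
        counts.append(Counter(ch.upper() for ch in column if ch != "*"))
    return counts


def position_frequencies(sequences: list[str]) -> list[dict[str, float]]:
    """Convert the counts to frequencies among informative sequences."""
    frequencies = []
    for counter in position_counts(sequences):
        total = sum(counter.values())
        frequencies.append(
            {ch: n / total for ch, n in counter.items()} if total else {}
        )
    return frequencies


example_set = ["AC-", "AD*", "GC-", "GD-"]
for index, freqs in enumerate(position_frequencies(example_set), start=1):
    print(index, freqs)
```

In this made-up set, the first two positions are split 50/50 between two amino acids, which is the situation where the full frequency listing is more informative than the signature alone.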
Choosing a threshold (between 0 and 1.0)
Choose a specific threshold between 0 and 1.0, or run the program with the default threshold of 0. If you do not set a threshold, the majority signature is used. To count only the most conserved of the signature amino acids, set a threshold for the minimum degree of conservation of signature amino acids in the query set.
A threshold of 1.0 requires the signature amino acid to be present in every sequence in the query set. A threshold of 0.9 requires it to be present in at least 90% of the sequences in the query set. The default (0) simply uses the majority consensus.