This program is based on the method of Nei and Gojobori 1986, and incorporating a statistic developed in Ota and Nei 1994. An application of the SNAP package in HIV-1 research is described in a paper by Ganeshan et. al., J Virol 71: 663-677 (1997). You should be familiar with the Nei and Gojobori paper before using this program. The following references (at minimum) should be cited when using SNAP in a paper:
Alternatives:
An excellent suite of tools for studying natural selection can be found on Datamonkey (http://www.datamonkey.org/).
Input sequences must be codon-aligned nucleotides. The tool automatically recognizes several common sequence formats. For web-based SNAP, alignments smaller than 40 sequences the results will be displayed on our web site. For larger alignments (more than 40 sequences) you will receive a job title and reference number and results will be emailed to you.
WARNING: Unaligned sequences or aligned sequences copied from Word (containing non-ASCII characters) may cause the program to crash or give meaningless output.
Click here to return to web-based SNAP.
The Perl-based version of SNAP is written in perl and has been run on UNIX operating systems (SUN and LINUX). It should also work on DOS and MAC.
Usage: SNAP.pl filename
Output: This program calculates the number of
synonymous vs. non-synonymous base substitutions as described in Nei and
Gojobori for all pairwise comparisons of sequences in an alignment. The number
of synonymous and non-synonymous codon changes are counted, as well as the
number of potential synonymous and non-synonymous changes when comparing two
sequences. Ambiguous codons or codons with insertions are excluded from the
tally of compared codons. The overall sequence distances are calculated as well
as a codon by codon summary.
The examples below are based on input of 4 HIV protease sequences:
Imagine you have TTA in one sequence, and CTT in the other. Two bases are
different. TTA encodes Leu, as does CTT. If you assume that the bases have
changed one at a time, there are two paths:
NOTE2: Some thoughts on statistical comparisons of the output: One must be
wary when doing typical statistical analysis of these values. One thing I have
found to be very common are distributions of values that are NOT close to a
Gaussian distribution, so you should either check to see if you have a Gaussian
distribution, or default to the use of non-parametric statistics, like a
Wilcoxon rank sum test.
Also, if one uses the full column of
values for all pairwise comparisons (say all values of dn for one set, compared
to all values for another set) there is a non-independence of points issue to
be considered. An alternative is the use of a sequence like a consensus or a
best estimate of an ancestral sequence as the first sequence in the alignment,
and then just use the comparison of the first sequence to all others rather
than all pairwise comparisons. Sequence 0 is the first in the set, so that
would be all lines that start with 0.
NOTE3: NA and mutational saturation.
If either ds or dn is NA or 0, the ds/dn ratio is not calculated. Two distance matrices that are compatible input for PHYLIP's neighbor program are created,
based on either the ds or dn values. A 5 sequence example is shown below, for
dn values: For every codon in each pairwise comparison, it is identified as "identity" if there is no change,
" synon " if the nucleotides change but the encoded amino acid does not, " nonsyn " if
the nucleotides change as well as the encoded amino acid, " indel " if there is an
insertion or deletion, " ambiguous " if there is an N. Indel supersedes N if both
are present. The information in "codon.data" is the average
behavior of each codon for all pairwise comparisons for indels, syn, and nonsyn
mutations. In very variable positions, the value can be over 1, because
there are 3 positions in each codon, and more than one change can occur. USAGE: SNAPstats.pl codons.[pid] The program SNAPstats.pl will yield
the variance and standard deviation for the average
ds, dn, ps and pn
values for the data set, based on the strategy described by
Ota and Nei.
Please read and cite this
paper if you use this program.
(I am grateful to Dr. Yumi Yamaguchi-Kabata of Kyoto University for
graciously encouraging us to add this statistic to our site).
The tool give several output files, including both a summary file and a statistics file. Why are the dN and dS values different in the two files?
Can SNAP be implemented to use the mitochondrial genetic code?
Input: Sequences should be provided
with codons aligned, in caps, and in frame.
They should be provided in table format:
Seq1
Seq2
The first field is the sequence name, the
second field the sequence, then return to a new line for the second sequence.
Single or double insertions will throw the sequence out of frame, and
care must be taken to keep codons intact.
A single insertion could be treated in the following way to keep the codons ACT and GGC intact:
ACTTGCC
ACTGGC
The letter N should be used to represent ambiguous bases.
A dash, '-' for insertions, Only A C G T N - are allowed.
Output (web-based and Perl versions)
92UG037, RF, 92BRO25, ELI
SUMMARY OF THE SYNONYMOUS AND NONSYNONYMOUS INFORMATION
______________________________________________________________
"summary.[pid]":
Compare Sequences_names Sd Sn S N ps pn ds dn ds/dn
0 1 92UG037 RF 18.00 9.00 68.50 228.50 0.26 0.04 0.32 0.04 8.00
0 2 92UG037 92BRO25 22.50 10.50 69.00 228.00 0.33 0.05 0.43 0.05 9.00
0 3 92UG037 ELI 17.00 10.00 68.50 228.50 0.25 0.04 0.30 0.05 6.68
1 2 RF 92BRO25 20.50 11.50 68.50 228.50 0.30 0.05 0.38 0.05 7.33
1 3 RF ELI 11.00 5.00 68.00 229.00 0.16 0.02 0.18 0.02 8.22
2 3 92BRO25 ELI 18.50 10.50 68.50 228.50 0.27 0.05 0.33 0.05 7.06
Averages of all pairwise comparisons: ds = 0.62, dn =
0.11, ds/dn = 6.60
Averages of the first sequence compared to others: ds
= 0.49, dn = 0.09, ds/dn = 7.09
Compare:
Sequences_names:
Sd:
Sn:
S:
N:
ps:
pn:
ds:
dn:
ds/dn:
NOTE1: An example of how Sd and Sn is determined for a single codon:
TTA (L) --> TTT (F) --> CTT (L).
From first path, count = .5syn + .5syn
From second path, count = .5nonsyn + .5nonsyn
So this two base change would be 1 synonymous, 1 non-synonymous.
If ps or pn has a value >= .75, saturation has been reached and
a Jukes-Cantor transformation cannot be done, so the value of NA is returned.
Sd == 0 && Sn != 0: ps < 1/S; ds < -3/4*ln(1-4/3*ps) ≈ 1/S
Sd != 0 && Sn == 0: pn < 1/N; dn < -3/4*ln(1-4/3*pn) ≈ 1/N
Sd == 0 && Sn == 0: ps < 1/S, pn < 1/N; ds/dn = undef, ps/pn = undef
ps > 0.74: ds > 3.2
pn > 0.74: dn > 3.2
_________________________
PHYLOGENETIC TREE INPUT
_________________________
5
92UG037 0.0000 0.0400 0.0500 0.0500 0.2000
RF 0.0400 0.0000 0.0500 0.0200 0.2200
92BRO25 0.0500 0.0500 0.0000 0.0500 0.2200
ELI 0.0500 0.0200 0.0500 0.0000 0.2200
ANT70 0.2000 0.2200 0.2200 0.2200 0.000
The PHYLIP neighbor output:
--3--2
Between And Length
------- --- ------
3 92UG037 0.01375
3 2 0.00875
2 1 0.01375
1 RF 0.00833
1 ELI 0.01167
2 92BRO25 0.02625
3 ANT70 0.18625
These are examples of neighbor joining trees that can be produced using the
"dsdist.[pid]" and "dndist.[pid]" input files.
Synonymous Substitution Tree For Protease
Nonsynonymous Substitution Tree For Protease
__________________________________
BACKGROUND ABOUT THE DATA SET
__________________________________
"background.[pid]" for the four sequence example used in "summary.[pid]":
The input file has 4 sequences
Sequence Length: 297
Compare
Sequences_names Codons Compared Ambiguous Indels Ns
0 1 92UG037 RF 99 99 0 0 0
0 2 92UG037 92BRO25 99 99 0 0 0
0 3 92UG037 ELI 99 99 0 0 0
1 2 RF 92BRO25 99 99 0 0 0
1 3 RF ELI 99 99 0 0 0
2 3 92BRO25 ELI 99 99 0 0 0
Codons:
Compared:
Ambiguous:
(Indel + N's) = Ambiguous
Indels:
Ns:
NOTE:There were no indels or Ns in the protease alignment file.
_____________________
ALL CODON COMPARISONS
_____________________
"codons.[pid]"
The Nei and Gojobori syn/non-syn value for each codon
is recorded.
This is comparison 0 x 1: 92UG037 U455
Codon# class 1 2 aa1 aa2 syn non
1 identity CCT CCT P P 2 identity CAA CAA Q Q
3 identity ATC ATC I I 4 identity ACT ACT T T
5 identity CTT CTT L L 6 identity TGG TGG W W
7 identity CAA CAA Q Q 8 identity CGA CGA R R
9 synon CCT CCC P P 1.00 0.00
10 identity CTT CTT L L 11 identity GTC GTC V V
12 identity ACA ACA T T 13 identity GTA GTA V V
14 synon AAG AAA K K 1.00 0.00
15 identity ATA ATA I I
16 synon GGG GGA G G 1.00 0.00
17 identity GGA GGA G G 18 identity CAG CAG Q Q
19 synon CTA CTG L L 1.00 0.00
20 nonsynon AAA ATA K I 0.00 1.00
21 nonsynon AAA GAA K E 0.00 1.00 ...
The "codons.[pid]" file is input for the perl scripts "codons-xyplot.pl", and for
"SNAPstats.pl".
USAGE: codons-xyplot.pl codons.[pid]
This will create an output called "codon.data".
"codon.xy" is an input file for the program xyplot.
"codon.xy" includes the file "codon.data"
and makes a plot of this data if the program xyplot
is available.
USAGE: xyplot -ps codon.xy > codon.ps
The number of codons is: 99
The average behavior for 153 comparisons:
Cumulative Per Codon
codon indel syn nonsyn indel syn nonsyn aa
1 0.00 0.00 0.00 0.00 0.00 0.00 P
2 0.00 0.47 0.00 0.00 0.47 0.00 Q
3 0.00 0.47 0.00 0.00 0.00 0.00 I
4 0.00 0.47 0.21 0.00 0.00 0.21 P
5 0.00 0.47 0.21 0.00 0.00 0.00 L
6 0.00 0.47 0.32 0.00 0.00 0.11 W
7 0.00 0.73 0.74 0.00 0.25 0.42 D
8 0.00 0.93 0.74 0.00 0.21 0.00 R
9 0.00 1.32 0.74 0.00 0.39 0.00 P
10 0.00 1.92 1.03 0.00 0.60 0.29 I
11 0.00 2.22 1.03 0.00 0.29 0.00 V
These are some examples of xyplots for synonymous and nonsynonymous changes for the protease region
of HIV-1 pol:
Cumulative Synonymous and Nonsynonymous substitutions in protease
Codon-by-codon Synonymous and Nonsynonymous substitutions in protease
Statistics
Frequently-asked questions about SNAP
The summary page just reports the simple averages, while the stats file refers to the paper by Ota
and Nei 1994 which has a subtle correction. You should read the paper if you want to use the stats values.
The version we currently provide cannot be used for alternative genetic codes. We hope to be able to provide a version with this feature in the future.