HIV Databases HIV Databases home HIV Databases home
HIV sequence database



Readme file for SNAP

Written by: Bette Korber, T10 MS K710, LANL, Los Alamos, NM 87545


BACKGROUND and REFERENCES:

This program is based on the method of Nei and Gojobori 1986, and incorporating a statistic developed in Ota and Nei 1994. An application of the SNAP package in HIV-1 research is described in a paper by Ganeshan et. al., J Virol 71: 663-677 (1997). You should be familiar with the Nei and Gojobori paper before using this program. The following references (at minimum) should be cited when using SNAP in a paper:


SNAP Input and Output

The Input instructions below refer to the perl-based version of SNAP. The Output examples apply to both perl-based and web-based SNAP. Click here to return to web-based SNAP.

The Perl-based version of SNAP is written in perl and has been run on UNIX operating systems (SUN and LINUX). It should also work on DOS and MAC.

USAGE: SNAP.pl filename

INPUT: Sequences should be provided with codons aligned, in caps, and in frame.

They should be provided in table format:
Seq1ACTGCCTTTGGC...
Seq2ACTGCCTATGGG...

The first field is the sequence name, the second field the sequence, then return to a new line for the second sequence.

Single or double insertions will throw the sequence out of frame, and care must be taken to keep codons intact.

A single insertion could be treated in the following way to keep the codons ACT and GGC intact:

ACTTGCC ==> ACTT--GCC
ACTGGC ==> ACT---GCC

The letter N should be used to represent ambiguous bases.
A dash, '-' for insertions, Only A C G T N - are allowed.

OUTPUT:
This program calculates the number of synonymous vs. non-synonymous base substitutions as described in Nei and Gojobori for all pairwise comparisons of sequences in an alignment. The number of synonymous and non-synonymous codon changes are counted, as well as the number of potential synonymous and non-synonymous changes when comparing two sequences. Ambiguous codons or codons with insertions are excluded from the tally of compared codons. The overall sequence distances are calculated as well as a codon by codon summary.

The pid (process ID) is appended to the output filename,i.e., "outputfilename.[pid]".

The examples below are based on input of 4 HIV protease sequences:
92UG037, RF, 92BRO25, ELI

_________________________________________________________________

SUMMARY OF THE SYNONYMOUS AND NONSYNONYMOUS INFORMATION _________________________________________________________________

"summary.[pid]":
CompareSequences_namesSdSnSNpspndsdnds/dn
0192UG037RF18.009.0068.50228.500.260.040.320.048.00
0292UG03792BRO2522.5010.5069.00228.000.330.050.430.059.00
0392UG037ELI17.0010.0068.50228.500.250.040.300.056.68
12RF92BRO2520.5011.5068.50228.500.300.050.380.057.33
13RFELI11.005.0068.00229.000.160.020.180.028.22
2392BRO25ELI18.5010.5068.50228.500.270.050.330.057.06

Averages of all pairwise comparisons: ds = 0.62, dn = 0.11, ds/dn = 6.60
Averages of the first sequence compared to others: ds = 0.49, dn = 0.09, ds/dn = 7.09

Compare:Lists the two sequences compared, starting with 0 (4 sequences are seqs 0-3)
Sequences_names:The names of the two sequences being compared.
Sd:The number of observed synonymous substitutions
Sn:The number of observed non-synonymous substitutions
S:The number of potential synonymous substitutions (the average for the two compared sequences)
N:The number of potential non-synonymous substitutions (the average for the two compared sequences)
ps:The proportion of observed synonymous substitutions: Sd/S
pn:The proportion of observed non-synonymous substitutions: Sn/N
ds:The Jukes-Cantor correction for multiple hits of ps
dn:The Jukes-Cantor correction for multiple hits of pn
ds/dn:The ratio of synonymous to non-synonymous substitutions

NOTE1: An example of how Sd and Sn is determined for a single codon:

Imagine you have TTA in one sequence, and CTT in the other. Two bases are different. TTA encodes Leu, as does CTT. If you assume that the bases have changed one at a time, there are two paths:

TTA (L) --> CTA (L) --> CTT (L) or
TTA (L) --> TTT (F) --> CTT (L).

From first path, count = .5syn + .5syn
From second path, count = .5nonsyn + .5nonsyn
So this two base change would be 1 synonymous, 1 non-synonymous.

NOTE2: Some thoughts on statistical comparisons of the output: One must be wary when doing typical statistical analysis of these values. One thing I have found to be very common are distributions of values that are NOT close to a Gaussian distribution, so you should either check to see if you have a Gaussian distribution, or default to the use of non-parametric statistics, like a Wilcoxon rank sum test.

Therefore the averages given at the bottom are only meant as a crude guide.

Also, if one uses the full column of values for all pairwise comparisons (say all values of dn for one set, compared to all values for another set) there is a non-independence of points issue to be considered. An alternative is the use of a sequence like a consensus or a best estimate of an ancestral sequence as the first sequence in the alignment, and then just use the comparison of the first sequence to all others rather than all pairwise comparisons. Sequence 0 is the first in the set, so that would be all lines that start with 0.

NOTE3: NA and mutational saturation.
If ps or pn has a value >= .75, saturation has been reached and a Jukes-Cantor transformation cannot be done, so the value of NA is returned.

If either ds or dn is NA or 0, the ds/dn ratio is not calculated.

_______________________

PHYLOGENETIC TREE INPUT
_______________________

Two distance matrices that are compatible input for PHYLIP's neighbor program are created, based on either the ds or dn values. A 5 sequence example is shown below, for dn values:

"dndist.17985" (a PHYLIP neighbor input file):
5
92UG0370.00000.04000.05000.05000.2000
RF0.04000.00000.05000.02000.2200
92BRO250.05000.05000.00000.05000.2200
ELI0.05000.02000.05000.00000.2200
ANT700.20000.22000.22000.22000.000

The PHYLIP neighbor output:
+92UG037
!
!+RF
!+--1
--3--2+ELI
!!
!+92BRO25
!
+----------ANT70

BetweenAndLength
----------------
392UG0370.01375
320.00875
210.01375
1RF0.00833
1ELI0.01167
292BRO250.02625
3ANT700.18625


These are examples of neighbor joining trees that can be produced using the "dsdist.[pid]" and "dndist.[pid]" input files.

Synonymous Substitution Tree For Protease

Nonsynonymous Substitution Tree For Protease
__________________________________

BACKGROUND ABOUT THE DATA SET
__________________________________

"background.[pid]" for the four sequence example used in "summary.[pid]":
The input file has 4 sequences
Sequence Length: 297

Compare Sequences_namesCodonsComparedAmbiguousIndelsNs
0192UG037RF9999000
0292UG03792BRO259999000
0392UG037ELI9999000
12RF92BRO259999000
13RFELI9999000
2392BRO25ELI9999000

Codons:Total number of codons in the input file (number of bases/3)
Compared:(Codons - Ambiguous)
Ambiguous:Codons that contain either dashes "-" for indels or N's.
(Indel + N's) = Ambiguous
Indels:Codons that contain dashes "-" for indels
Ns:Codons that contain N's.

NOTE:There were no indels or Ns in the protease alignment file.

_____________________

ALL CODON COMPARISONS
_____________________
"codons.[pid]"

For every codon in each pairwise comparison, it is identified as "identity" if there is no change, " synon " if the nucleotides change but the encoded amino acid does not, " nonsyn " if the nucleotides change as well as the encoded amino acid, " indel " if there is an insertion or deletion, " ambiguous " if there is an N. Indel supersedes N if both are present.


The Nei and Gojobori syn/non-syn value for each codon is recorded.

This is comparison 0 x 1: 92UG037 U455
Codon#class12aa1aa2synnon
1identityCCTCCTPP
2identityCAACAAQQ
3identityATCATCII
4identityACTACTTT
5identityCTTCTTLL
6identityTGGTGGWW
7identityCAACAAQQ
8identityCGACGARR
9synonCCTCCCPP1.000.00
10identityCTTCTTLL
11identityGTCGTCVV
12identityACAACATT
13identityGTAGTAVV
14synonAAGAAAKK1.000.00
15identityATAATAII
16synonGGGGGAGG1.000.00
17identityGGAGGAGG
18identityCAGCAGQQ
19synonCTACTGLL1.000.00
20nonsynonAAAATAKI0.001.00
21nonsynonAAAGAAKE0.001.00...

The "codons.[pid]" file is input for the perl scripts "codons-xyplot.pl", and for "SNAPstats.pl".

USAGE: codons-xyplot.pl codons.[pid]

This will create an output called "codon.data".

"codon.xy" is an input file for the program xyplot.
"codon.xy" includes the file "codon.data" and makes a plot of this data if the program xyplot is available.

USAGE: xyplot -ps codon.xy > codon.ps

The information in "codon.data" is the average behavior of each codon for all pairwise comparisons for indels, syn, and nonsyn mutations. In very variable positions, the value can be over 1, because there are 3 positions in each codon, and more than one change can occur.


The number of codons is: 99
The average behavior for 153 comparisons:
CumulativePer Codon
codonindelsynnonsynindelsynnonsynaa
10.000.000.000.000.000.00P
20.000.470.000.000.470.00Q
30.000.470.000.000.000.00I
40.000.470.210.000.000.21P
50.000.470.210.000.000.00L
60.000.470.320.000.000.11W
70.000.730.740.000.250.42D
80.000.930.740.000.210.00R
90.001.320.740.000.390.00P
100.001.921.030.000.600.29I
110.002.221.030.000.290.00V

These are some examples of xyplots for synonymous and nonsynonymous changes for the protease region of HIV-1 pol:

Cumulative Synonymous and Nonsynonymous substitutions in protease

Codon-by-codon Synonymous and Nonsynonymous substitutions in protease

STATISTICS:


USAGE: SNAPstats.pl codons.[pid]

The program SNAPstats.pl will yield the variance and standard deviation for the average ds, dn, ps and pn values for the data set, based on the strategy described by Ota and Nei. Please read and cite this paper if you use this program. (I am grateful to Dr. Yumi Yamaguchi-Kabata of Kyoto University for graciously encouraging us to add this statistic to our site).

Other programs of interest may be found at Dr. Yasuo Ina's ftp site: ftp.dna.affrc.go.jp
- /pub/unix/syn/new1(Ina1)
- /pub/unix/syn/new2(Ina2)
------------------------------------------------------------------------------------
FUTURE PLANS:
Add a better model of evolutionary distance, incorporate links.
------------------------------------------------------------------------------------
----------------------------------------------------------------------------

last modified: Thu Aug 2 11:49 2012


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy

Dept of Health & Human Services Los Alamos National Institutes of Health