HIV sequence database



RIP 3.0 Explanation

Reference

When using RIP in a publication, please cite:

A computer program designed to screen rapidly for HIV type 1 intersubtype recombinant sequences.
Siepel AC, Halpern AL, Macken C, Korber BT.
AIDS Res Hum Retroviruses. 1995 Nov;11(11):1413-6.
PMID: 8573400


Query

Background

Background alignment options:

HIV-1 Consensus Alignment (Default)

The RIP default background consists of a single representative sequence for each of subtypes A1, A2, B, C, D, F1, F2, G, H, J, and K. Optionally, you may check the box to include CRF01_AE. These are near-full-length genome sequences. Most are consensus sequences; some are single reference sequences of the subtype.

The consensus sequences were created using Consensus Maker. Details available on request.

Custom Background

The RIP custom background contains a large assortment of consensus sequences and subtype reference sequences for HIV-1. RIP allows the user to select from a list of these sequences.

Choose as few sequences as possible to get cleaner results. RIP cannot include more than 26 sequences, as it has only 26 colors for the plot.

The RIP Custom Background alignment is available for download on the Alignments page. It is updated periodically with new CRFs and updated consensus sequences.

User-provided Alignment

The "Use your own alignment" option allows you to submit your own background alignment. The background sequences must be aligned to each other, but do not need to be aligned with the query sequence; RIP will align your query to the background alignment. The query must not be included in the background file. With this option, RIP can also be used to analyze non-HIV sequences. The total number of background sequences is limited to 26.

Download: HIV-2 reference sequences. This file of reference sequences can be used as a background set for RIP to examine HIV-2 recombination: HIV-2 reference set (Fasta format).

Alignment quality

RIP aligns your query sequence to the background using the program "align0". This usually works well, but if your RIP output looks unusual, check the alignment.

Note that you can download the alignment of your query with the background, fine tune it if necessary, and then resubmit this alignment to RIP.

 


RIP Options

Window size

The window size is chosen by the user. Window size must be smaller than the length of the query. The window is moved in increments of one nucleotide from left to right in the alignment. A Hamming distance (p-distance) is calculated for each window.

Choice of window size is important, as it affects the sensitivity with which recombinants are detected. On the one hand, a small window size may introduce artifacts (small regions that appear to be of another subtype, but are not). On the other hand, an overly large window size may mask true regions of recombination.
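The sliding-window calculation described above can be sketched as follows. This is a minimal illustration, not RIP's own code: the function name `window_distances` is hypothetical, the two sequences are assumed to be already aligned to the same length, and a gap is simply compared like any other character.

```python
def window_distances(query, reference, window):
    """Return the p-distance (mismatch fraction) for every window,
    moving one alignment column at a time from left to right."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    if not 0 < window <= len(query):
        raise ValueError("window must fit inside the alignment")

    # One flag per column: True where the two sequences differ.
    mismatch = [q != r for q, r in zip(query, reference)]

    # Count mismatches in the first window, then update the count
    # incrementally as the window slides by one column.
    count = sum(mismatch[:window])
    distances = [count / window]
    for start in range(1, len(query) - window + 1):
        count += mismatch[start + window - 1] - mismatch[start - 1]
        distances.append(count / window)
    return distances


query = "AATCGTAAA---TGGCATAGTA"
ref2 = "AAA---ATTACCTGGCATAGTA"
distances = window_distances(query, ref2, 3)
```

Plotting `1 - d` for each value gives the similarity curve for one background sequence. A smaller `window` makes each value depend on fewer columns, which is why short windows are noisier.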

Confidence threshold

The best match within each window is qualified by a measure of confidence, at a level chosen by the user. It is obtained by comparing the distance from the query to the best-matching reference sequence with the distance to the second-best-matching reference sequence. Confidence is calculated using a z-test, assuming:
  1. that each site evolves independently according to the same process; and
  2. that the binomial distribution that theoretically results from the use of Hamming distances can be approximated by a normal distribution.

Note that measurements with respect to overlapping windows are not independent; for this reason and others, the measure of confidence is approximate and is used only for heuristic purposes.
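To make the two assumptions concrete, here is one possible z-test under the normal approximation to the binomial. The exact statistic RIP uses is not given on this page, so everything here is an assumption for illustration: the function name `match_confidence` is hypothetical, and the sketch treats the two distances as if they were independent.

```python
from math import sqrt
from statistics import NormalDist


def match_confidence(d_best, d_second, window):
    """Approximate confidence that the best match is really closer to the
    query than the runner-up (illustrative only, not RIP's formula)."""
    # Under the binomial model, a p-distance d over `window` sites has
    # variance d * (1 - d) / window. Assumption: the two variances add.
    variance = (d_best * (1 - d_best) + d_second * (1 - d_second)) / window
    difference = d_second - d_best
    if variance == 0:
        # Degenerate case: both distances are exactly 0 or 1.
        return 1.0 if difference > 0 else 0.5
    z = difference / sqrt(variance)
    # One-sided probability from the standard normal distribution.
    return NormalDist().cdf(z)


confidence = match_confidence(0.05, 0.10, 400)
significant = confidence >= 0.9  # compare against the user's threshold
```

In this sketch the same pair of distances gives a higher confidence with a larger window, which matches the intuition that short windows rarely separate two similar background sequences.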

Gap handling and window plotting

The user has 4 options for handling gaps in the alignment:

  1. Gaps are left in place, i.e., no gapstripping. Plots window values throughout the alignment.
  2. Gaps are left in place. No window value is plotted when the center of window is in a gap of the query sequence.
  3. Default: Global gapstripping. Plots window values only for remaining positions.
  4. Global gapstripping. Reinserts blanks in plot where gaps have been removed.

Examples of Gap Handling Options

Position 1234567890123456789012
   Query AATCGTAAA---TGGCATAGTA
   Ref 1 AATCTTAAA---TGAAACGATA
   Ref 2 AAA---ATTACCTGGCATAGTA

 Window1 ---
 Window2  ---
 Window3   ---
 Window4    ---
 Window10         ---

With a window size of 3 nucleotides, the first point in the plot will be in position 2. All four gap-handling options give the same result for this window, i.e., the query is a perfect match to Ref 1 and at distance = 1/3 from Ref 2 (one mismatch out of three positions compared).

With options 1 and 2, window 2 will compare the query sequence ATC with Ref 1's ATC and Ref 2's AA-, and plot the corresponding similarity values in position 3 of the graph (Similarity = Match Fraction = 1 - distance). Here, gaps are treated as a 5th nucleotide character. Hence, window 2 will have distance = 0 to Ref 1 and distance = 2/3 to Ref 2. Similarly, windows 3 and 4 will have values plotted in positions 4 and 5.

For window 10, option 1 will continue to plot the similarity value (perfect match to Ref 1), while option 2 will leave a blank in the graph to indicate that there is a gap in the query sequence. Windows 9 and 11 would also be blank with option 2, because their centers (positions 10 and 12) fall in the same query gap.
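The blanking rule of option 2 can be checked with a short sketch. The function name `blank_windows` is hypothetical, and the choice of center column, `(window - 1) // 2` columns after the window start, is an assumption chosen to agree with the examples on this page (window size 3 plots at position 2; window size 400 plots at position 200).

```python
def blank_windows(query, window, gap="-"):
    """Return the 1-based numbers of the windows whose center column
    is a gap in the aligned query sequence (option 2 leaves these blank)."""
    offset = (window - 1) // 2  # assumed distance from window start to center
    blanks = []
    for start in range(len(query) - window + 1):
        if query[start + offset] == gap:
            blanks.append(start + 1)  # window numbers start at 1
    return blanks


print(blank_windows("AATCGTAAA---TGGCATAGTA", 3))  # prints [9, 10, 11]
```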

Options 3 and 4 gapstrip the above alignment. The resulting alignment looks like this:

Query AATAAATGGCATAGTA
Ref 1 AATAAATGAAACGATA
Ref 2 AAAATTTGGCATAGTA
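Global gapstripping as used in options 3 and 4 amounts to dropping every alignment column in which any sequence has a gap. A minimal sketch (the function name `gapstrip` is hypothetical, and "-" is assumed to be the only gap character):

```python
def gapstrip(sequences, gap="-"):
    """Remove every column that contains a gap in at least one sequence."""
    # zip(*sequences) walks the alignment column by column.
    keep = [i for i, column in enumerate(zip(*sequences)) if gap not in column]
    return ["".join(seq[i] for i in keep) for seq in sequences]


alignment = [
    "AATCGTAAA---TGGCATAGTA",  # Query
    "AATCTTAAA---TGAAACGATA",  # Ref 1
    "AAA---ATTACCTGGCATAGTA",  # Ref 2
]
for row in gapstrip(alignment):
    print(row)
# AATAAATGGCATAGTA
# AATAAATGAAACGATA
# AAAATTTGGCATAGTA
```

Because a gap in any one sequence removes the column from all of them, adding more background sequences tends to shorten the gapstripped alignment.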

In option 3, windows scan the remaining alignment and plot similarity values throughout.

With option 4, the regions that were stripped out will be reinserted with blanks in the graph. The resulting alignment would look like this:

  Query AAT---AAA---TGGCATAGTA
  Ref 1 AAT---AAA---TGAAACGATA
  Ref 2 AAA---ATT---TGGCATAGTA

Window1 ---
Window2  --   -
Window3   -   --
Window4       ---

Note that windows that include gaps will not use gaps as information; instead the next nucleotide after the gap will be used (which is the next nucleotide in the gapstripped alignment).
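To see which original columns each window covers once the gap columns have been skipped, the kept columns can be listed and then windowed. This is only a sketch of the bookkeeping; the function name `window_columns` is hypothetical.

```python
def window_columns(sequences, window, gap="-"):
    """For each window over the gapstripped alignment, list the original
    1-based alignment columns that the window actually compares."""
    # Columns that survive gapstripping, numbered as in the original alignment.
    kept = [i + 1 for i, column in enumerate(zip(*sequences)) if gap not in column]
    return [kept[s:s + window] for s in range(len(kept) - window + 1)]


alignment = [
    "AATCGTAAA---TGGCATAGTA",  # Query
    "AATCTTAAA---TGAAACGATA",  # Ref 1
    "AAA---ATTACCTGGCATAGTA",  # Ref 2
]
for number, columns in enumerate(window_columns(alignment, 3)[:4], start=1):
    print(number, columns)
# 1 [1, 2, 3]
# 2 [2, 3, 7]
# 3 [3, 7, 8]
# 4 [7, 8, 9]
```

Windows 2 and 3 jump from column 3 to column 7, which is the "next nucleotide after the gap" behavior described above.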

Scoring of multistate character matches

Sometimes the query or the background sequences contain IUPAC multistate character codes. How do you want to score a comparison involving such codes? For example, if in a certain column of the alignment the query sequence has "A" and one of the background sequences has "R" (meaning both A and G are known from that position), do you want to score this as a partial match (1/2) or as a mismatch (0)? The former is the default. Check the "false" box if you want to score such comparisons as mismatches. The complete scoring matrix for partial matches looks like this:
      A      C      G      T      M      R      W      S      Y      K      B      D      H      V      N
A     1      -      -      -     .50    .50    .50     -      -      -      -     .33    .33    .33     1     
C     -      1      -      -     .50     -      -     .50    .50     -     .33     -     .33    .33     1     
G     -      -      1      -      -     .50     -     .50     -     .50    .33    .33     -     .33     1     
T     -      -      -      1      -      -     .50     -     .50    .50    .33    .33    .33     -      1     
M    .50    .50     -      -      1     .50    .50    .50    .50     -     .33    .33    .66    .66     1     
R    .50     -     .50     -     .50     1     .50    .50     -     .50    .33    .66    .33    .66     1     
W    .50     -      -     .50    .50    .50     1      -     .50    .50    .33    .66    .66    .33     1     
S     -     .50    .50     -     .50    .50     -      1     .50    .50    .66    .33    .33    .66     1     
Y     -     .50     -     .50    .50     -     .50    .50     1     .50    .66    .33    .66    .33     1     
K     -      -     .50    .50     -     .50    .50    .50    .50     1     .66    .66    .33    .33     1     
B     -     .33    .33    .33    .33    .33    .33    .66    .66    .66     1     .66    .66    .66     1     
D    .33     -     .33    .33    .33    .66    .66    .33    .33    .66    .66     1     .66    .66     1     
H    .33    .33     -     .33    .66    .33    .66    .33    .66    .33    .66    .66     1     .66     1     
V    .33    .33    .33     -     .66    .66    .33    .66    .33    .33    .66    .66    .66     1      1     
N     1      1      1      1      1      1      1      1      1      1      1      1      1      1      1    

where
M = AC
R = AG
W = AT
S = GC
Y = CT
K = GT

B = CGT  (not A)
D = AGT  (not C)
H = ACT  (not G)
V = ACG  (not T/U)

N = ACGT
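The page does not state the rule behind the matrix, but one simple rule reproduces every entry: the number of shared bases divided by the size of the larger code, with N always scoring 1. The sketch below encodes that inferred rule; the name `partial_match` is hypothetical, and the table appears to truncate to two decimals (.33, .66) where this code returns exact fractions.

```python
# Bases represented by each IUPAC code, as listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def partial_match(a, b):
    """Score a comparison between two IUPAC codes (inferred rule, see text)."""
    if a == "N" or b == "N":
        return 1.0  # the N row and column of the matrix are all 1
    bases_a, bases_b = set(IUPAC[a]), set(IUPAC[b])
    shared = len(bases_a & bases_b)
    return shared / max(len(bases_a), len(bases_b))


print(partial_match("A", "R"))  # prints 0.5
print(partial_match("M", "K"))  # prints 0.0
```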

 


RIP Output

The results page begins with a summary of the parameters used in this RIP run:

WindowSize = 400, Significance threshold = 0.9, GapOption = 1, Multistate characters = yes

Next is a Download button that allows you to retrieve your query aligned to the background sequences.

Auto-simplify RIP results

After running RIP, it may be obvious that only a few (2 or 3, for example) of the sequences in the background are the "best match" sequences. You can ask RIP to automatically rerun the analysis with only these best match sequences in order to make the distance plots less cluttered, and thus easier to read. Simply press the "Rerun" button.

Graphical output

Three graphs showing different distance measurements between the query and the various background sequences are presented. A typical similarity plot might look like this:

[Sample image: RIP similarity plot]

The x-axis (k) represents the query sequence position at the center of the moving window. That is why the first point is at position 200, which is half the window size of 400.

The y-axis, s(k), shows the similarity between that window of the query sequence and each of the background sequences. In the sample plot, CONSENSUS G (dark blue) is the background sequence with the highest similarity to the query at the start; it begins with a similarity of 0.9 and falls to a similarity of about 0.77 near position 600.

The two bars across the top of the graph represent the "best match" (lower bar) and the significance of this match (upper bar). The "best match" sequence is the background sequence with the highest similarity to the query. The upper bar is colored only at positions where the best match is significantly better than the second-best match. In the example above, you can see that around position 1700 the best match switches from "red" (A1) to "green" (J); however, there are several positions where neither sequence is significantly the best match.

Alignment-style output

Following the graphical output is an alignment of the query to the background, one block of which might look like this:

 841: 900  [ 799: 855]
 query: 11_cpx.NG.94.NG3670b:  AATGGCAGTCTAGCAGAAGAAGAGGTAAGGAT...TAGATCTGAAAACATCACAAACAAT
   a :  CON_A1              :  ----------------------------T---...------------T------R-----
   b :  A2.CY.94CY017_41    :  -------------------G--G-AA--TA--GAT------------T--T---------
   c :  CON_B               :  ---------------------------GTA--...---------C--TT----GG-----
   d :  CON_C               :  -----T--C---------------A---TA--...------------TC-G---------
   e :  CON_D               :  ------------------------A---TA--...------------TC-------T---
   f :  CON_F1              :  --------C--------------TA---TA--...C------C----T---T--G-T---
   g :  CON_F2              :  --------C--------------TA---TA--...------------T---T--G-T---
   h :  CON_G               :  ---------T-------------AA---TA--...------------T------G-----
   i :  CON_H               :  -----A--C----------D-C----C-TA--...-------A----T---T--G-----
   j :  J.SE.94.SE7022      :  ---------G---------G---CA---TA--...------------T---T--G-----
   k :  K.CD.97.EQTB11C     :  --------C---------------A---TT--...---G------G-T--T-----G---
   l :  CON_01_AE           :  ------------------------A---TA--...C-----------TC-----------
        best match          :  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
        significant         :  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This style of output is called "output-aligned" because the background sequences (labeled a through l) are shown aligned to the query, and a letter is shown only at positions where a background sequence differs from the query. When a background position agrees with the query, a "-" character is shown. Gaps are represented by "." characters. Sixty characters of the alignment are shown in each block; this example shows alignment positions 841 to 900. The second set of numbers, "[ 799: 855]", shows the absolute position in the query sequence, i.e., the position not counting gaps.

At the bottom of the alignment is the "best match" line. In this example, the window centered at every position from 841 to 900 was a best match to sequence "a", which is "CON_A1", the A1 consensus sequence in RIP's standard background. You can get a feeling that this is true just by seeing fewer "mutations" in the CON_A1 line relative to the other sequences in the background. But note that you are seeing only 60 characters in this block, whereas the window itself was 400 characters wide.

The final line in this block shows that the match between the query and the CON_A1 sequence was significantly better than the match score with any other background sequence. When the match is not significant, the "^" symbol disappears.
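The display convention for a single background line can be sketched as follows. This is inferred from the sample block rather than taken from RIP itself: the name `difference_line` is hypothetical, and it is an assumption that a gap in the background sequence is always printed as "." even where the query also has a gap.

```python
def difference_line(query, background, gap_chars="-."):
    """Render one background sequence relative to the query:
    "-" where it agrees, "." where it has a gap, its own letter otherwise."""
    rendered = []
    for q, b in zip(query, background):
        if b in gap_chars:
            rendered.append(".")  # gap in the background sequence
        elif b == q:
            rendered.append("-")  # agrees with the query
        else:
            rendered.append(b)    # differs: show the background character
    return "".join(rendered)


print(difference_line("AAT---TAG", "ATTGATTAC"))  # prints -T-GAT--C
print(difference_line("AAT---TAG", "AAT---TAG"))  # prints ---...---
```

The first call mirrors line "b" of the sample block, where a background sequence has nucleotides (GAT) at columns in which the query has a gap.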

last modified: Tue Dec 3 11:44 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 