Each RIP job requires 2 inputs from the user: a single DNA query sequence, and a background alignment. There are 3 options for providing the background alignment.
If your results reveal a recombinant HIV-1 sequence, you can create an illustration of the recombinant with the Recombinant HIV-1 Drawing Tool.
The RIP default background consists of consensus sequences for subtypes A1, B, C, D, F1, F2, G, H, and CRF01. These are near-full-length-genome sequences. In addition to the consensuses, the subtypes A2, J, and K are each represented by one sequence of the appropriate subtype.
The consensus sequences were constructed from our master alignment of complete genomes. This alignment was edited to remove all recombinant genomes other than CRF01_AE. Also removed were the CPZ sequences. The resulting alignment still contained almost 400 genomes. Consensuses were constructed for each subtype group of this alignment using our consensus making tools. In the rare instances in which there were equal numbers of differing nucleotides in a column, e.g., 3 "A" and 3 "G", the consensus sequence reflects this fact by using the IUPAC multistate character code for the equally abundant nucleotides. In the case of "A and G" this code is "R". Because most sequences are somewhat shorter than true full-length genomes, there is missing information at their beginnings and ends. Commonly short sequences are padded with gaps at beginning and end to make them the same length as the longest sequence in the alignment. But it would be wrong to make "gap" (-) the consensus in these regions as shown below.
CONSENSUS ---------------GTgatgtcgACCRAGCGA ... ~9400 nucs ... GAGAGC---CGATCGaTGCTGATGC---
SEQ1      ACTAGCTGTGATGTCGTGATGTCGACCGAGCGA ...    ...     ... GAGAGC---CGATCGATGCTGATGCAGC
SEQ2      ---------------GTGATGTCGACCGAGCGA ...    ...     ... GAGAGC---CGATCGGTGCTGATGC---
SEQ3      -----------------GATGTCGACCAAGCGA ...    ...     ... GAGAGCGCTCGATCGCTGCTGATGC---
SEQ4      ------------------------ACCAAGCGA ...    ...     ... GAGAGC---CGATCGATGCTGATG----
To remedy this undesirable situation we constructed consensuses for the beginnings and ends of the background alignment only from the long sequence(s) and ignored gaps when making the consensus; the consensus in those regions reflects only the long sequence, SEQ1. Note that internal gaps (in contrast to end gaps) can be the consensus.
CONSENSUS ACTAGCTGTGATGTCGTgatgtcgACCRAGCGA ... ~9400 nucs ... GAGAGC---CGATCGaTGCTGATGCAGC
SEQ1      ACTAGCTGTGATGTCGTGATGTCGACCGAGCGA ...    ...     ... GAGAGC---CGATCGATGCTGATGCAGC
SEQ2      ---------------GTGATGTCGACCGAGCGA ...    ...     ... GAGAGC---CGATCGGTGCTGATGC---
SEQ3      -----------------GATGTCGACCAAGCGA ...    ...     ... GAGAGCGCTCGATCGCTGCTGATGC---
SEQ4      ------------------------ACCAAGCGA ...    ...     ... GAGAGC---CGATCGATGCTGATG----
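The consensus rules described above (majority vote per column, an IUPAC code for ties, end-padding gaps ignored, internal gaps allowed to win) can be sketched in a few lines of Python. This is only an illustration, not the consensus-making tool itself: the function names are hypothetical, input is assumed to be uppercase, the lowercase letters seen in the example consensus are not reproduced, and the fallback to "N" when a tie involves a gap is an assumption of this sketch.

```python
from collections import Counter

# IUPAC codes for sets of equally abundant nucleotides.
IUPAC_TIE_CODES = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def end_gap_mask(seq, gap="-"):
    """True at positions that are leading or trailing padding gaps."""
    n = len(seq)
    lead = n - len(seq.lstrip(gap))
    trail = n - len(seq.rstrip(gap))
    return [i < lead or i >= n - trail for i in range(n)]

def consensus(seqs, gap="-"):
    """Column-wise consensus that ignores end gaps but keeps internal gaps."""
    masks = [end_gap_mask(s, gap) for s in seqs]
    out = []
    for col in range(len(seqs[0])):
        counts = Counter(s[col] for s, m in zip(seqs, masks) if not m[col])
        if not counts:              # every sequence is padding here
            out.append(gap)
            continue
        top = max(counts.values())
        winners = frozenset(ch for ch, n in counts.items() if n == top)
        if len(winners) == 1:
            out.append(next(iter(winners)))
        else:
            # Assumption: ties that involve a gap fall back to "N".
            out.append(IUPAC_TIE_CODES.get(winners, "N"))
    return "".join(out)

# Left-hand end of the example alignment shown above.
example = [
    "ACTAGCTGTGATGTCGTGATGTCGACCGAGCGA",
    "---------------GTGATGTCGACCGAGCGA",
    "-----------------GATGTCGACCAAGCGA",
    "------------------------ACCAAGCGA",
]
print(consensus(example))
```

In the example, the leading columns are taken from SEQ1 alone, and the column where two sequences have G and two have A becomes "R".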
The RIP custom background contains consensus sequences, as described above, aligned to the HIV database subtype reference alignment. The user selects sequences from this list to become the background for the RIP analysis. Note, the consensus sequences have not been constructed from the 3 or 4 sequences in the subtype reference set, but from the complete genome alignment as described above.
The "Use your own alignment as background" option allows you to submit your own background alignment. Be sure that the background alignment is aligned with the query but does not contain the query sequence; the query must be in a separate file. If the query is left in the background alignment, the query will match itself perfectly, which hides the comparison with the real background sequences. Using this option, RIP can be used to analyze non-HIV sequences. Note that the total number of background sequences is limited to 50.
The window size is chosen by the user. Window size must be smaller than the length of the query. The window is moved in increments of one nucleotide residue from left to right in the alignment. A Hamming distance (p-distance) is calculated for each window.
Choice of window size is important, as it will affect the sensitivity of the detection of recombinants. On the one hand, using a small window size may introduce artifacts (small regions that appear to be of another subtype, but are not). On the other hand, using an overly-large window size may mask the presence of legitimate regions of recombination.
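As a rough illustration of the moving window, the following Python sketch computes the match fraction (1 minus the p-distance) between a query and one background sequence for every window position. It is not RIP's code: the function name is hypothetical, gaps are simply compared like any other character, and the rule used to place each point (window start plus half of the window size minus one, rounded down) is an assumption chosen to agree with the examples given elsewhere on this page.

```python
def window_similarity(query, ref, window):
    """Return (plot_position, similarity) for each window, moving one column at a time.

    Positions are 1-based alignment columns. Similarity is the fraction of
    columns in the window where query and ref carry the same character.
    """
    if len(query) != len(ref):
        raise ValueError("query and reference must be aligned to the same length")
    if not 0 < window < len(query):
        raise ValueError("window size must be smaller than the query length")
    points = []
    for start in range(len(query) - window + 1):
        q = query[start:start + window]
        r = ref[start:start + window]
        matches = sum(a == b for a, b in zip(q, r))
        center = start + 1 + (window - 1) // 2   # assumed plotting position
        points.append((center, matches / window))
    return points

query = "AATCGTAAA---TGGCATAGTA"
ref1 = "AATCTTAAA---TGAAACGATA"
ref2 = "AAA---ATTACCTGGCATAGTA"
for name, ref in (("Ref 1", ref1), ("Ref 2", ref2)):
    print(name, window_similarity(query, ref, 3)[:4])
```

A window of 3 is far too small for real data and is used here only to keep the example readable.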
The user has 4 options for handling gaps in the alignment:
Examples of Gap Handling Options
Position  1234567890123456789012
Query     AATCGTAAA---TGGCATAGTA
Ref 1     AATCTTAAA---TGAAACGATA
Ref 2     AAA---ATTACCTGGCATAGTA
Window1   ---
Window2    ---
Window3     ---
Window4      ---
Window10           ---
With a window size of 3 nucleotides, the first point in the plot will be in position 2. For this first window, all of the gap-handling options give the same result, i.e., the query will be a perfect match to Ref 1 and distance = 1/3 away from Ref 2 (one mismatch out of three positions compared).
With options 1 and 2, window 2 will compare the query sequence ATC with Ref 1's ATC and Ref 2's AA-, and plot the corresponding similarity values in position 3 of the graph (Similarity = Match Fraction = 1 - distance). Here, gaps are treated as a 5th nucleotide character. Hence, window 2 will have distance = 0 to Ref 1 and distance = 2/3 to Ref 2. Similarly, windows 3 and 4 will have values plotted in positions 4 and 5.
For window 10, option 1 will continue to plot the similarity value (perfect match to Ref 1), while option 2 would leave a blank in the graph to indicate that there is a gap in the query sequence. Also windows 9 and 11 would be blank with option 2.
Options 3 and 4 gapstrip the above alignment. The resulting alignment looks like this:
Query AATAAATGGCATAGTA Ref 1 AATAAATGAAACGATA Ref 2 AAAATTTGGCATAGTA
In option 3, windows scan the remaining alignment and plot similarity values throughout.
With option 4, the regions that were stripped out will be reinserted with blanks in the graph. The resulting alignment would look like this:
Query     AAT---AAA---TGGCATAGTA
Ref 1     AAT---AAA---TGAAACGATA
Ref 2     AAA---ATT---TGGCATAGTA
Window1   ---
Window2    --   -
Window3     -   --
Window4         ---
Note that windows that include gaps will not use gaps as information; instead the next nucleotide after the gap will be used (which is the next nucleotide in the gapstripped alignment).
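The gapstripping used by options 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not RIP's implementation: the function names are hypothetical, and a column is assumed to be removed whenever any sequence has a gap in it, which is what the example above shows. The second function reports which original alignment columns a given window covers after stripping.

```python
def gapstrip(seqs, gap="-"):
    """Remove every column in which at least one sequence has a gap.

    Returns the stripped sequences and the 1-based original column
    numbers that were kept.
    """
    kept = [i for i in range(len(seqs[0])) if all(s[i] != gap for s in seqs)]
    stripped = ["".join(s[i] for i in kept) for s in seqs]
    return stripped, [i + 1 for i in kept]

def window_columns(kept_columns, window_number, window):
    """Original columns covered by the given (1-based) window after gapstripping."""
    start = window_number - 1
    return kept_columns[start:start + window]

alignment = [
    "AATCGTAAA---TGGCATAGTA",  # Query
    "AATCTTAAA---TGAAACGATA",  # Ref 1
    "AAA---ATTACCTGGCATAGTA",  # Ref 2
]
stripped, columns = gapstrip(alignment)
for seq in stripped:
    print(seq)
for number in (1, 2, 3, 4):
    print("Window", number, "covers columns", window_columns(columns, number, 3))
```

Window 2, for instance, covers original columns 2, 3, and 7: the window skips the stripped columns and continues with the next nucleotide after the gap.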
   A    C    G    T    M    R    W    S    Y    K    B    D    H    V    N
A  1    -    -    -    .50  .50  .50  -    -    -    -    .33  .33  .33  1
C  -    1    -    -    .50  -    -    .50  .50  -    .33  -    .33  .33  1
G  -    -    1    -    -    .50  -    .50  -    .50  .33  .33  -    .33  1
T  -    -    -    1    -    -    .50  -    .50  .50  .33  .33  .33  -    1
M  .50  .50  -    -    1    .50  .50  .50  .50  -    .33  .33  .66  .66  1
R  .50  -    .50  -    .50  1    .50  .50  -    .50  .33  .66  .33  .66  1
W  .50  -    -    .50  .50  .50  1    -    .50  .50  .33  .66  .66  .33  1
S  -    .50  .50  -    .50  .50  -    1    .50  .50  .66  .33  .33  .66  1
Y  -    .50  -    .50  .50  -    .50  .50  1    .50  .66  .33  .66  .33  1
K  -    -    .50  .50  -    .50  .50  .50  .50  1    .66  .66  .33  .33  1
B  -    .33  .33  .33  .33  .33  .33  .66  .66  .66  1    .66  .66  .66  1
D  .33  -    .33  .33  .33  .66  .66  .33  .33  .66  .66  1    .66  .66  1
H  .33  .33  -    .33  .66  .33  .66  .33  .66  .33  .66  .66  1    .66  1
V  .33  .33  .33  -    .66  .66  .33  .66  .33  .33  .66  .66  .66  1    1
N  1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
where
M = AC
R = AG
W = AT
S = GC
Y = CT
K = GT
B = CGT (not A)
D = AGT (not C)
H = ACT (not G)
V = ACG (not T/U)
N = ACGT
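The scores in the table above appear to follow a simple pattern: the number of nucleotides two codes have in common, divided by the size of the larger code, with N scoring 1 against everything and "-" standing for no match. That pattern is inferred from the table values and is not a documented rule, so the Python sketch below (with a hypothetical function name) should be read as a way to reproduce the table rather than as a description of RIP's code. The table shows two decimals without rounding up, so 2/3 appears as .66.

```python
# Nucleotides represented by each IUPAC code, as listed above.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "GC", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_score(a, b):
    """Partial match score between two IUPAC characters (inferred rule)."""
    if a == "N" or b == "N":
        return 1.0                      # N is scored as a full match in the table
    set_a, set_b = set(IUPAC_SETS[a]), set(IUPAC_SETS[b])
    return len(set_a & set_b) / max(len(set_a), len(set_b))

for pair in (("A", "A"), ("A", "C"), ("A", "M"), ("M", "H"), ("M", "B"), ("N", "K")):
    print(pair, match_score(*pair))
```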
The results page begins with a summary of the parameters used in this RIP run:
WindowSize = 400, Significance threshold = 0.9, GapOption = 1, Multistate characters = yes
Next is a download button that allows you to retrieve your query aligned to the background sequences:
Download file of query aligned to background:
The colored curves trace the similarity between the query and each sequence in the background. Each plotted point is the similarity value for one window, placed at the center of that window. That is why the first point is at position 200, half the window size of 400. In the sample plot, CONSENSUS G (dark blue) is the background sequence with the highest similarity to the query; it begins with a similarity of 0.9 and falls to a similarity of about 0.77 near position 600. We call the background sequence with the highest similarity to the query the "best match" sequence, and we represent the best match sequence in the graph as the lower of the two horizontal colored bars near the top of the graph. This is the "best match line", and it quickly shows which of the background sequences is the most similar to the query. Above the best match line is another colored line, which records whether the best match is also significantly better than the second best match. You can see that around position 1700 the best match switches from "red" (A1) to "green" (J); however, the last few red positions are not significantly the best match.
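The best match line can be thought of as taking, at each plotted position, the background sequence with the highest similarity. The Python sketch below illustrates only that idea. The function name and the sample numbers are hypothetical, ties are resolved in favor of the sequence listed first (an assumption), and the significance test behind the upper line is not described on this page, so it is not modelled.

```python
def best_match_line(similarities):
    """For each position, name the background sequence most similar to the query.

    `similarities` maps a background name to its list of per-position
    similarity values; all lists must have the same length.
    """
    names = list(similarities)
    length = len(similarities[names[0]])
    # max() keeps the first name on ties, so earlier entries win ties.
    return [max(names, key=lambda n: similarities[n][i]) for i in range(length)]

# Made-up similarity values for three consecutive positions.
sample = {
    "CON_A1": [0.90, 0.80, 0.70],
    "J.SE.94.SE7022": [0.85, 0.80, 0.75],
}
print(best_match_line(sample))
```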
Following the graphical output is an alignment of the query to the background, one block of which might look like this:
841: 900 [ 799: 855]
query: 11_cpx.NG.94.NG3670b: AATGGCAGTCTAGCAGAAGAAGAGGTAAGGAT...TAGATCTGAAAACATCACAAACAAT
a : CON_A1 : ----------------------------T---...------------T------R-----
b : A2.CY.94CY017_41 : -------------------G--G-AA--TA--GAT------------T--T---------
c : CON_B : ---------------------------GTA--...---------C--TT----GG-----
d : CON_C : -----T--C---------------A---TA--...------------TC-G---------
e : CON_D : ------------------------A---TA--...------------TC-------T---
f : CON_F1 : --------C--------------TA---TA--...C------C----T---T--G-T---
g : CON_F2 : --------C--------------TA---TA--...------------T---T--G-T---
h : CON_G : ---------T-------------AA---TA--...------------T------G-----
i : CON_H : -----A--C----------D-C----C-TA--...-------A----T---T--G-----
j : J.SE.94.SE7022 : ---------G---------G---CA---TA--...------------T---T--G-----
k : K.CD.97.EQTB11C : --------C---------------A---TT--...---G------G-T--T-----G---
l : CON_01_AE : ------------------------A---TA--...C-----------TC-----------
best match : aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
significant : ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This style of output is called "output-aligned" because the background sequences (labeled a through l) are shown aligned to the query, and only in those positions where they differ is the difference shown by a letter. When a background position agrees with the query, a "-" character is shown. Gaps are represented by "." characters. Sixty columns of the alignment are shown in each block; in this example, alignment positions 841 to 900. The second set of numbers, "[ 799: 855]", shows the absolute position in the query sequence, i.e., the position not counting gaps. At the bottom of the alignment is the "best match" line. In this example, the center of the window at every position 841-900 was a best match to sequence "a", which is "CON_A1", the A1 consensus sequence in RIP's standard background. You can get a feeling that this is true just by seeing fewer "mutations" in the CON_A1 line relative to other sequences in the background. But note that you are only seeing 60 columns in this block, whereas the window itself was 400 characters wide. The final line in this block shows that the match between the query and the CON_A1 sequence was significantly better than the match score with any other background sequence. When the match is not significant, the "^" symbol disappears.
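Two details of this output format can be illustrated with a short Python sketch: how a background row is rendered relative to the query, and how the bracketed ungapped positions relate to the alignment columns. The function names are hypothetical, the input alignment is assumed to use "-" for gaps, and the rendering rule for a column where only the query has a gap (the background letter is shown) is inferred from row b of the sample block.

```python
def difference_row(query, background, gap="-"):
    """Render a background row: '.' for a gap, '-' where it agrees with the query,
    and the background letter where it differs."""
    row = []
    for q, b in zip(query, background):
        if b == gap:
            row.append(".")
        elif b == q:
            row.append("-")
        else:
            row.append(b)
    return "".join(row)

def ungapped_range(query_block, first_ungapped, gap="."):
    """Ungapped query positions covered by a block that starts at `first_ungapped`."""
    residues = sum(ch != gap for ch in query_block)
    return first_ungapped, first_ungapped + residues - 1

print(difference_row("AC-GT", "AT-GA"))

# Query line of the sample block: 60 columns, 3 of them gaps.
block = "AATGGCAGTCTAGCAGAAGAAGAGGTAAGGAT...TAGATCTGAAAACATCACAAACAAT"
print(len(block), ungapped_range(block, 799))
```

The sample block spans 60 alignment columns but only 57 query residues, which is why columns 841 to 900 correspond to ungapped positions 799 to 855.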