A recombinant is a genetic sequence that carries regions from two genetically distinct parental strains. In retroviruses like HIV, recombination is frequent and increases sample diversity. Therefore, undetected recombinants can significantly bias estimates of mutation rates and confound phylogenetic tree reconstructions. Recombination results in contiguous stretches of the linear DNA sequence being inherited, with random mutations, from each parent. Our program checks for this pattern of inheritance in every triplet in a given alignment. Specifically, RAPR identifies the sites that differ in exactly one of the three sequences and applies the Wald-Wolfowitz Runs Test to check if in any of the three sequences these sites cluster more than expected by random mutations.
Let A and B be the putative parental strains and C the recombinant one. Each position in C is either identical to A but not B, identical to B but not A, identical to both A and B, or different from either one. In the two latter scenarios we say that the site is non-informative. Define an “A run” as the maximal non-empty segment of adjacent positions in C that are either non-informative or identical to A but not B, and likewise a “B run” the set of contiguous sites that are either non-informative or identical to B but not A. Note that, “A runs” and “B runs” alternate and, in general, there are regions between them consisting entirely of sites where all three sequences agree or all three disagree. Runs in these regions cannot be labeled “A” or “B”, but this ambiguity does not affect the statistic, which depends only on the number of runs of each type, and not on their length or location.
Let m be the total number of positions where C is more similar to A; n the total number of sites where it is more similar to B; and r the total number of “A” or “B” runs. The probability of observing r runs depends only on the number of sites counted as m or n. The probability that there are at most r runs just by a chance ordering of independent mutations is given by:
where a total of fR(r;m,n) gives the probability of r runs arising from a process that produces two symbols with frequencies m and n. Explicitly,
where, ⌈a⌉ and ⌊a⌋ represent the ceiling (i.e., smallest integer greater than or equal) and floor (i.e., the largest integer less than or equal) of a, respectively.
RAPR can read any nucleotide sequence file, which can either be pasted in the text box or uploaded through the "Browse" button. Sequences should be aligned. If you are not sure if your sequences are aligned, please ran them through our QuickAlign tool first.
In order to detect recombinants, RAPR tests every possible set of three sequences in the alignment. Multiple testing correction is then addressed using a false discovery rate (FDR) approach (Benjamini and Hochberg, 1995). Users can specify the FDR threshold (default is set at 0.1) and only sequences reaching a significance below that threshold will be considered recombinants. When the alignments contains different user-specified time points, the FDR threshold will be applied to each time point individually.
In some instances, it is interesting to know whether recombination happened between specific subsets of sequences in the input alignment. We call these subsets "lineages." Users can specify lineages by either designating Transmitted/Founders (TFs) in the alignment or manually.
A transmitted/founder (TF) is the founder variant from which a viral population establishes after a complete bottleneck (Keele et al., PNAS, 2008). For example, such bottlenecks are often observed in heterosexual and placental transmissions, or after a strong selection sweep. In these cases the variants from the donor that establish the infection in the recipient are called TFs. When the new infection is sampled early, prior to the onset of immune pressure, the TF often coincides with the CONSENSUS of the viral population (or the CONSENSUS of a subset of the viral population when multiple TFs are present).
When studying recombination, if the TFs are known, they can be constrained to be considered potential parents only and not recombinants. This is useful because sometimes these sequences originated as recombinants in the donor (i.e. prior to transmission or in a previous bottleneck) and therefore they could be detected as an ancestral recombination event. In order to avoid this, and have RAPR only detect recombination events that originated after transmission and/or the bottleneck under consideration, the user has the option to specify the number of TFs in the TF box. All TFs should be at the beginning of the alignment and then RAPR will consider the first "x" sequences (where x is the number the user writes in the TF box) to be potential parents only and not recombinants. When the user chooses this option, then the software will automatically assign lineages according to which TF the sequences are closest to in the phylogenetic tree.
If there are no TFs, or if the user wants all the sequences in the alignment to be considered as potential recombinants, then this number should be 0. In the latter case, the first sequence in the alignment will be used as the outgroup sequence for the output figures: specifically, this will be the sequence the tree will be rooted on and also the reference sequence for the highlighter plots (see the Output section).
When the input alignment has a substracture that is not dependent on whether or not there are TFs, users may wish to assign different sets of sequences to different lineages independently of phylogenetic distances. For example in transmission pairs, the user may wish to assign one "lineage" to the donor and a different one to the recipient. RAPR has two options to manually assign lineages: the different lineages can be encoded in the sequence name (as a substring delimited by certain characters or the first number of characters), or they can be specified with a separate text file as shown in the following example:
LinA: 703010010.5b 703010010.5c 703010010.5.0009.ipe019.gp1 703010010.5.0009.ipe019.gp4 703010010.5.0009.ipe019.gp5 703010010.5.0009.ipe019.gp8 703010010.5.0009.ipe019.gp10 703010010.5.0026.ipe019.gp3_ambcode 703010010.5.0026.ipe019.gp4_ambcode 703010010.5.0026.ipe019.gp8_ambcode 703010010.5.0072.ipe019.5A1 703010010.5.0072.ipe019.5B1 703010010.5.0072.ipe019.5B2 703010010.5.0072.ipe019.5B7 703010010.5.0072.ipe019.5B9 703010010.5.0072.ipe019.5C4 703010010.5.0072.ipe019.5C5 703010010.5.0188.ipe019.gp1 703010010.5.0188.ipe019.gp12_ambcode 703010010.5.0188.ipe019.gp5_ambcode 703010010.5.0188.ipe019.gp7 703010010.5.0188.ipe019.gp8_ambcode 703010010.5.0188.ipe019.gp9 LinB: 703010010.5a 703010010.5.0009.ipe019.gp2 703010010.5.0009.ipe019.gp3 703010010.5.0009.ipe019.gp6 703010010.5.0009.ipe019.gp7 703010010.5.0009.ipe019.gp9 703010010.5.0026.ipe019.gp2 703010010.5.0026.ipe019.gp5 703010010.5.0026.ipe019.gp6 703010010.5.0026.ipe019.gp7 703010010.5.0026.ipe019.gp10 703010010.5.0072.ipe019.5A2 703010010.5.0072.ipe019.5A4 703010010.5.0072.ipe019.5B4 703010010.5.0072.ipe019.5B6 703010010.5.0072.ipe019.5B10 703010010.5.0072.ipe019.5C2 703010010.5.0188.ipe019.gp10 703010010.5.0188.ipe019.gp11 703010010.5.0188.ipe019.gp13 703010010.5.0188.ipe019.gp2 703010010.5.0188.ipe019.gp3 703010010.5.0188.ipe019.gp6_ambcode
NOTE: When lineages are manually entered, no TFs are assigned. All sequences are expected to have an assigned lineage. NAs will cause an error message.
RAPR can handle multiple time points and serial data when these are available (otherwise, select "No time points available"). In these cases, assuming that the sampling is dense, RAPR prohibits parents appearing at times of sampling later than the putative recombinant. A caveat to this approach is that some earlier lineages may in fact be latent or escape sampling and then reappear at later time points. When in doubt, the user should consider multiple runs, with and without specifying sequence time points.
The user can specify the different time points automatically (when they are part of a specific field in the sequence names) or manually. Below is a description of the different ways the user can input the time information for each case.
NOTE: Time points should be specified as numbers. No other characters, except for the 0-9 digits, are allowed in this field. TFs should not have a time point. Additionally, sequences missing a time point will be treated as TFs and considered only as potential parents, not recombinants.
If the different time points are delimited by a specific character in the sequence name, then the user can specify such character and the field number. For example, the default naming of sequences from the LANL database encodes the sample year as the 3rd field in the sequence name (e.g., B.US.08.sequence_name). For such sequences, the user can input the time point by writing "3" in the field box and "." as the delimiting character.
Alternatively, if the time points are at the very beginning of the sequence names, then the user can specify how many characters to be read in the names.
In order to manually specify the different time points, the user needs to group the sequences as illustrated by the example below, where each time point is specified as a list of sequence names. Each sequence must be on a separate line, and groups are separated by an empty line. The first item in a group will be taken as the time point of the group and should be numeric followed by a colon. Any sequences that are not present in any group will be considered to be in the 'NA' group. If an 'NA' group is specified, all sequences in that group will be considered to be TFs.
The following can be pasted in with the Sample Input for testing the Grouped Sequence Names option:
1: 703010010.5.0009.ipe019.gp2_1 703010010.5.0009.ipe019.gp3_1 703010010.5.0009.ipe019.gp4_1 703010010.5.0009.ipe019.gp5_1 703010010.5.0009.ipe019.gp8_1 703010010.5.0009.ipe019.gp9_1 703010010.5.0026.ipe019.gp10_1 703010010.5.0026.ipe019.gp2_1 703010010.5.0026.ipe019.gp3_ambcode_1 703010010.5.0026.ipe019.gp4_ambcode_1 703010010.5.0026.ipe019.gp5_1 703010010.5.0026.ipe019.gp6_1 703010010.5.0026.ipe019.gp7_1 703010010.5.0026.ipe019.gp8_ambcode_1 3: 703010010.5.0072.ipe019.5A1_1 703010010.5.0072.ipe019.5A2_1 703010010.5.0072.ipe019.5A4_1 703010010.5.0072.ipe019.5B10_1 703010010.5.0072.ipe019.5B1_1 703010010.5.0072.ipe019.5B2_1 703010010.5.0072.ipe019.5B4_1 703010010.5.0072.ipe019.5B6_1 703010010.5.0072.ipe019.5B7_1 703010010.5.0072.ipe019.5B9_1 703010010.5.0072.ipe019.5C2_1 703010010.5.0072.ipe019.5C4_1 703010010.5.0072.ipe019.5C5_1 7: 703010010.5.0188.ipe019.gp10_1 703010010.5.0188.ipe019.gp11_1 703010010.5.0188.ipe019.gp12_ambcode_1 703010010.5.0188.ipe019.gp13_1 703010010.5.0188.ipe019.gp1_1 703010010.5.0188.ipe019.gp2_1 703010010.5.0188.ipe019.gp3_1 703010010.5.0188.ipe019.gp5_ambcode_1 703010010.5.0188.ipe019.gp6_ambcode_1 703010010.5.0188.ipe019.gp7_1 703010010.5.0188.ipe019.gp8_ambcode_1 703010010.5.0188.ipe019.gp9_1 NA: 703010010.5a_3.NA 703010010.5b_2.NA 703010010.5c_2.NA
In order to avoid detecting duplicate recombinants, prior to running all recombination detection statistics, the input alignment is run through the ElimDupes tool. For every group of identical sequences, the tool will keep only one and append to the sequence name the characters "_x" where x is the number of instances that sequence was found. For example, if seq1, seq2, and seq3 are three identical sequences in the alignment, ElimDupes will keep only the first one and it will rename it seq1_3.
If the input alignment is already in this format, and all sequences are unique, the user should select the second option, "Yes, all sequences are unique and sequence names are in ElimDupes format." If all sequences in the alignment are unique but the alignment is NOT in ElimDupes format, then the user should choose the third option, "Yes, all sequences are unique but sequence names are NOT in ElimDupes format."
Once the Wald-Wolfowitz statistic is run on all possible triplets in the alignment, RAPR employs several additional steps to find the "true" recombinants among all possible putative ones that meet the FDR multiple testing correction threshold.
Using a greedy algorithm, sequences are grouped into closely related clusters such that differences within each cluster are likely to have arisen by mutation. Some clusters will have evolved directly from a TF. Others will represent recombination events that were indicated in the initial comparison of all sets of three sequences. For each recombinant cluster, RAPR calculates the minimum Hamming distance (HD) from the closest cluster, and compares it to the minimum number of mutations diverging from the parental strains. Based on which of the two numbers is lower, the cluster is “explained” as having arisen from another cluster by mutation, or as a de novo recombination of sequences belonging to other clusters. When a recombination event is called, the sequence in the cluster with the least number of mutations from the parental strains, the earliest parents, or with the lowest p-value is deemed the initial recombinant, and all other sequences in that cluster that share the same breakpoints are considered “descendants.” At most one sequence per cluster is called a recombinant: the rest are assumed to be descendants of this recombinant sequences.
NOTE: In serial data, sparse sampling may cause an early sequence to be called a descendent of a later time sequence. Within a cluster, RAPR will first try to call the earliest possible time point sequence a recombinant, but sparse sampling may result in loss of significance for that particular recombination event. Therefore, RAPR will call the recombinant the sequence with the strongest p-value from the Wald-Wolfowitz test. If such sequence is at a later time point and the same cluster contains earlier sequences that are called "descendants", the likely explanation is that the actual recombination event happened at an earlier time point but one (or both) true parents were not sampled.
In the realistic case of multiple complementary recombination events, or when one of the parental strains escapes sampling, a triplet will often yield a low p-value when parents and child swap role. This leads to loops in the deduced recombination history leading to a chain of daughters ultimately leading back to a parent. To correct for these loops, RAPR replaces the least preferred recombination from the loop with an alternate one chosen in the same preference order, or if no such choice is admissible, by merging similarity clusters and restarting the resolution process.
Since the region between runs typically is ambiguous about which run it belongs to, our method cannot pinpoint the exact breakpoints. Instead, for each recombinant, RAPR calculates a set of intervals in which the breakpoints are most likely to have happened. Each breakpoint interval (run) is examined, and the run-test p-value is recalculated after removing the run. If this new p-value is higher, i.e., less significant than the overall p-value, we conclude that the run is contributing to the significance of the recombination, and mark the breakpoint as significant. Both significant and not-significant breakpoints are listed in the output file and in the final figures.
For each time point in the alignment, RAPR calculates two frequencies: the number of new breakpoints per sequence and the number of new mutations per sequence, where by “new mutation” we mean any nucleotide change that is not present in any of the TFs. Each mutation is counted only at the time it appears for the first time. Breakpoints are also counted only the first time they appear. For simplicity, in this calculation we replace each breakpoint interval with its midpoint.
Each row in this table represents a time point. The first column names are the TFs as chosen by the user and, for each time point, counts of how many sequences in each TF lineage have been found are shown. The next column shows the total number of sequences per time point. Finally, the last four columns show the number of new breakpoints, the number of new breakpoints per sequence (frequency), the number of new mutations, and the number of new mutations per sequence (frequency). Mutations are considered new if they are not found in any of the TFs.
This table lists all recombinants found in the alignment. Notice, however, that descendants of recombinants are not listed here. In order to find those sequences, please refer to the breakpoint figures below. For each recombinant listed in the first column, this table displays the following information: how many mutations are found in the recombinant but not in any of the two parents (muts); how many mutations are found in the recombinant but not in the closest sequence in the alignment (minHD); which TF the recombinant is closest to (linrec); the first parent (parent1) and which TF it is closest to (linpar1); the second parent (parent2) and which TF it is closest to (linpar2); finally the last column lists all sequences that are closest (minimum HD) with the recombinant.
Notice that in general muts < minHD, unless the recombinant has descendents. TFs are listed with alphabetic letters A-Z according to the order they appear in the alignment.
NOTE: There are rare instances in which RAPR may call recombinant a sequence that's only 1 or 2 mutations away from one of the TFs. Because of the nature of the RunsTest, if the mutation is shared with another sequence in the alignment, this may cause a very low p-value, hence the recombinant call. While this is not a technical mistake, it is up to the user to decide whether a shared mutation would be more likely to have arised from two independent mutational events or from a recombinatorial event. When such an instance occurs, RAPR will list the sequence among the recombinants but it will also issue a warning so the user can decide whether to consider it a recombinant or not.
All sequences are assigned a closest TF according to their branch lengths on the phylogenetic tree (see figure below). TFs are labeled alphabetically according to which order they appear in the input alignment. Based on this, each recombinant is classified as a "X,Y" event, where X is the TF closest to the first parent, and Y is the TF closest to the second parent. For every time point, this table reports: the X,Y type of recombinantion (RecEvent); the cumulative number of sequences up to that time point (so, for example, at the second time point, nseqs would be the number of sequences at the first time point plus the number of sequences at the second time point); the total number of X,Y recombinants found at that time point (nevents); the cumulative number of sequences closest to TF X (nlin1); the cumulative number of sequences closest to TF Y (nlin2); and, finally, the recombination frequency defined as follows: recfreq = (nseqs*nevents)/(nlin1*nlin2).
RAPR outputs figures to better visualize the recombination and breakpoint patterns found in the alignment. Below is an explanation of each figure.
As explained earlier, in order to avoid detecting duplicate recombinants, the input alignment is run through the ElimDupes tool. Therefore, only unique sequences are represented in each figure, and the characters "_x" have been appeded at the end of each sequence name, where x is the multiplicity of the sequence in the alignment (i.e. how many identical copies were found in the alignment). For example, if seq1, seq2, and seq3 are three identical sequences in the alignmet, ElimDupes will keep only the first one and it will rename it seq1_3. Please keep this in mind when downloading the figures from the RAPR output. Sequences that may seem missing are probably represented by their identical part.
In this figure, sequences are ordered by time point. RAPR assumes that time increases as the user-defined time points (which are required to be numerical) grow larger. Therefore, smaller time points are drawn at the top, and time increases toward the bottom. Each line represents one sequence. Gray lines are regular sequences (not recombinants). Recombinants are composed of colored segments, and the segments are colored according to the closest parental TF. The boxes between colored segments represent ranges where the breakpoints are most likely to fall (for an explanation of how breakpoints are calculated go to the section Determination of Breakpoints). Gray boxes indicate breakpoint intervals that are statistically singificant.
The left panel shows the phylogenetic tree (as created by PhyML), with the leaves color-coded by time point and the TF lineages marked by colored dots. The right panel is the same as the Breakpoint Figure, but this time sequences are in the same order as the tree leaves on the left, not by time point.
This figure is created through the tool Highlighter using all TFs as master sequences. As a consequence, only mutations that are not present in any of the TFs are marked. Red bars represent nonsynonimous mutations, green bars represent synonimous mutations and blue basr represent mutations that are synonimous with respect to one TF, but nonsynonimous with respect to another. For more details, read the Highlighter Explanation.
This figure is also created through the tool Highlighter using all TFs as master sequences, but this time instead of marking silent vs. non-silent mutations, nucleotide based changes that are not found in any of the TFs are marked and color-coded according to nucleotide (red for T, darkgreen for A, darkblue for C, and yellow for G). For more details, read the Highlighter Explanation.
RAPR employs several other tools to produce the full output described above. The data files it creates during these intermediate steps are available to the user for download in the last section of the output. These files are: time points (the list of sequences together with the time point each sequence belongs to; if the user did not provide any time point information, then these are all 1's); unique sequences (the fasta file containing only the unique sequences with the names in ElimDupes format); duplicate summary (another file from the ElimDupes output, this time summarizing which sequences are identical and have been omitted); newick tree (the newick file produved by PhyML); and break point file (a list of recombinants and their breakpoints; each breakpoint is represented by a range of positions in which it is most likely to have occurred; asterisks indicate breakpoint intervals that are statistically significant).