The Variable Region Characteristics tool characterizes a peptide excised from a protein alignment, reporting its length, its number of N-linked glycosylation sites (the sequence pattern NX[ST], where X can be any amino acid except P), and its net charge. It is a general tool: by specifying alignment positions, you can characterize regions excised from any protein alignment. It was originally designed, however, to characterize the hypervariable regions in HIV-1 Env, so if the input is an HIV Env alignment, the tool can automatically excise either the full V1, V2, V3, V4, and V5 loops or the hypervariable regions within them. How the boundaries of the hypervariable regions are treated in the alignment is critical for a meaningful result; for an example of the issues to consider, please see the section below entitled "Alignment Considerations".
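As an illustration of the NX[ST] pattern described above, here is a minimal Python sketch of how such sites could be counted in an excised peptide. It is not the tool's own code. The function name `count_sequons`, the removal of gap characters before matching, and the counting of overlapping sites are all assumptions made for this example.

```python
import re

# N, then any character except P, then S or T.
# The lookahead lets overlapping sites (e.g. in "NNTT") each be counted.
# Assumption: the tool's handling of overlapping sites is not documented here.
SEQUON = re.compile(r"(?=N[^P][ST])")


def count_sequons(peptide: str, gap_chars: str = "-.") -> int:
    """Count NX[ST] sites (X != P) in a peptide.

    Assumption: alignment gap characters are removed before matching.
    """
    ungapped = "".join(c for c in peptide.upper() if c not in gap_chars)
    return len(SEQUON.findall(ungapped))


print(count_sequons("NKSNPS"))  # 1: NKS matches, NPS is excluded because X is P
print(count_sequons("NG"))      # 0: a partial site is not counted
```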
The tool requires a set of aligned protein sequences. If you enter Alignment Positions in the Select Positions section of the submission form, a reference sequence is optional. If you choose regions (full loop or hypervariable) from the Select Regions section, your aligned sequences must include the HXB2 sequence.
Paste or upload a set of aligned protein sequences. Fasta, IG, and table formats, as well as other common ASCII formats, are accepted. If you will be selecting a pre-specified loop or hypervariable region in the Select Regions section, your input must include HXB2, and "HXB2" (no quotes) must appear in the label of the reference sequence (for example, HXB2 is called B_FR_83_HXB2_K03455 in the sample input). The HXB2 reference sequence is not included in the averaged summaries. If you want HXB2 to contribute to the average characteristics summaries, include the HXB2 sequence in the input file twice: once as the reference sequence, and once with a label that does not contain HXB2 (e.g. B_FR_83_K03455).
You can select positions by alignment position or reference position. By default the reference sequence is set to HXB2. Whatever reference sequence you use, that value (such as HXB2) must appear in the sequence name of one and only one sequence in order to uniquely identify the reference sequence. It does not have to be the first sequence in the set.
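The uniqueness rule above can be sketched in a few lines of Python. The function name `find_reference` is hypothetical, and treating the match as a case-sensitive substring test is an assumption; the tool's actual matching rules may differ.

```python
def find_reference(names, tag="HXB2"):
    """Return the index of the single sequence whose name contains `tag`.

    Assumption: matching is a case-sensitive substring test.
    """
    matches = [i for i, name in enumerate(names) if tag in name]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one sequence name containing {tag!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


names = ["C_BR_92_BR025_d_U52953", "B_FR_83_HXB2_K03455"]
print(find_reference(names))  # 1: the reference need not be the first sequence
```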
By default the net charge is computed with KRH = (+) and DE = (-). You can instead choose to compute it with KR = (+) and DE = (-), because whether or not H is positively charged depends on its context within the protein. The net charge is simply the sum of the charges. For example, if a peptide contains 2 Ds, its net charge is -2. If it contains 2 Ds, 2 Es, and 2 Ks, its net charge is again -2 (that is, -4 + 2).
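The arithmetic above can be written as a short Python sketch. The function name `net_charge` and the `histidine_positive` parameter are hypothetical names chosen for this example, not part of the tool.

```python
def net_charge(peptide: str, histidine_positive: bool = True) -> int:
    """Sum of residue charges: K, R (and optionally H) = +1; D, E = -1."""
    positive = set("KRH") if histidine_positive else set("KR")
    negative = set("DE")
    seq = peptide.upper()
    # Gap characters and all other residues contribute nothing.
    return sum(c in positive for c in seq) - sum(c in negative for c in seq)


print(net_charge("DD"))        # -2
print(net_charge("DDEEKK"))    # -2
print(net_charge("HK"))        # 2 with KRH = (+)
print(net_charge("HK", False)) # 1 with KR = (+)
```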
The tool can optionally produce a summary by prefix, such as clade or country, if the sequence names carry that information as a prefix followed by a separator character (_ . - or *). For example, the sample input has the clade at the start of each sequence name, followed by an underscore, such as "B_". In the following example, V1 hypervariable regions were selected and "Include prefix summary" was checked. The B sequences and the C sequences would each be grouped for summary statistics, and the scores would be averaged within each group. Suppose the names in your input file looked like this, and you requested a summary of V1 hypervariable regions based on prefixes:
B_FR_83_HXB2_K03455          MRVK---EKYQHL...
B_TH_90_BK132_AY173951       MRVKEIRKNCQHL...
B_NL_00_671_00T36_AY423387   MKVKGIRKNYQLL...
B_US_98_1058_11_AY331295     MRVKGIRRNCQHS...
C_BR_92_BR025_d_U52953       MRVEGIQRNWKQW...
C_IN_95_95IN21068_AF067155   MRVRGILRNYQQW...
C_ET_86_ETH2220_U46016       MKVMGIQRNCQQW...
C_ZA_04_04ZASK146_AY772699   MRVRGILRNWPQW...
Your output would look like this:
Selected Region: V1_Hypervariable, HXB2 Positions: 132 - 152
And for each group, and for all sequences combined (excluding the HXB2 reference strain), you will get a summary of the averages:
Prefix Summary
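The prefix grouping described in this example can be sketched in Python as follows. This is an illustrative sketch, not the tool's implementation. The function names `prefix_of` and `group_by_prefix` are hypothetical, and splitting at the first separator character is an assumption.

```python
import re
from collections import defaultdict

# The separator characters listed above: underscore, period, hyphen, asterisk.
SEPARATORS = re.compile(r"[_.\-*]")


def prefix_of(name: str) -> str:
    """Everything before the first separator character (assumed rule)."""
    return SEPARATORS.split(name, maxsplit=1)[0]


def group_by_prefix(names, reference_tag="HXB2"):
    """Group sequence names by prefix, leaving out the reference sequence."""
    groups = defaultdict(list)
    for name in names:
        if reference_tag in name:
            continue  # the reference is excluded from the averaged summaries
        groups[prefix_of(name)].append(name)
    return dict(groups)


names = [
    "B_FR_83_HXB2_K03455", "B_TH_90_BK132_AY173951",
    "B_NL_00_671_00T36_AY423387", "B_US_98_1058_11_AY331295",
    "C_BR_92_BR025_d_U52953", "C_IN_95_95IN21068_AF067155",
    "C_ET_86_ETH2220_U46016", "C_ZA_04_04ZASK146_AY772699",
]
groups = group_by_prefix(names)
print({prefix: len(members) for prefix, members in groups.items()})
# {'B': 3, 'C': 4}
```

Per-group averages of length, glycosylation site count, and net charge would then be computed over the members of each group.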
If you specify a variable loop or hypervariable region in the check box, the following regions will be excised. Please note that the hypervariable region boundaries are somewhat arbitrary; at the database we have found them to be good markers of the transition between regions that are variable but can be aligned reasonably, and regions that are very distinctive and differ in length and content in HIV from almost all infections. We have included these regions for convenience, and because we have used them for population studies in the past. But if a different set of boundaries better represents your data and the hypothesis you are testing (for example, a longitudinal study sampled within a subject may have indel regions that are much more narrowly constrained), you should set your own boundaries based on your alignment.
The V1 loop includes positions 131-157 in HXB2, and is bounded by a disulfide bond between the cysteines (C) at its base. The program will identify C131 and C157 in HXB2, and excise the region of the alignment that corresponds to the specified region of HXB2. The V1 loop is highlighted in blue in the HXB2 Env protein fragment shown below. The hypervariable region in V1 (the region where the alignment begins to break down), as it is found in HXB2, is marked in red; in HXB2 it spans T132 to G152. There is extreme length variation in such regions, and the program will extract everything between, but not including, the more readily aligned C131 and E153 in HXB2 that bound the hypervariable region.
The V2 region begins where V1 ends, starting at S158 and continuing through C196 in HXB2. Like V1, V2 is bounded by a disulfide bond; however, C196 is linked to the C at position 126, giving a "rabbit ear" structure to the region. The V2 hypervariable region is marked in red; in HXB2 it spans D185 to S190.
V3 is bounded by C296 and C331 using HXB2 numbering. It does not have a hypervariable region.
V4 is bounded by C385 and C418 using HXB2 numbering. The V4 hypervariable region is marked in red; in HXB2 it spans F396 to G410.
The V5 loop is defined based on the gp120 structure, and is not bounded by a Cys disulfide bridge at its base; it is located at positions N460 to R469 in HXB2. The V5 hypervariable region is marked in red; in HXB2 it spans N460 to S465.
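For reference, the HXB2 boundaries listed above can be collected into a small Python lookup table. The names `ENV_REGIONS` and `flanking_positions` are hypothetical. The flanking positions for V1 (131 and 153) and V2 (184 and 191) are stated in this document; using the immediately adjacent positions as flanks for V4 and V5 is an assumption.

```python
# HXB2 numbering, inclusive (start, end), as listed in the text above.
ENV_REGIONS = {
    "V1": {"loop": (131, 157), "hypervariable": (132, 152)},
    "V2": {"loop": (158, 196), "hypervariable": (185, 190)},
    "V3": {"loop": (296, 331), "hypervariable": None},  # no hypervariable region
    "V4": {"loop": (385, 418), "hypervariable": (396, 410)},
    "V5": {"loop": (460, 469), "hypervariable": (460, 465)},
}


def flanking_positions(region: str):
    """HXB2 positions just outside a hypervariable region.

    Assumption: the flanks are the residues immediately before and after
    the hypervariable span (documented here for V1 and V2 only).
    """
    span = ENV_REGIONS[region]["hypervariable"]
    if span is None:
        raise ValueError(f"{region} has no hypervariable region")
    return span[0] - 1, span[1] + 1


print(flanking_positions("V1"))  # (131, 153)
print(flanking_positions("V2"))  # (184, 191)
```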
Alignment Considerations (example in V2)
If hypervariable regions are simply excised from an alignment at the positions noted in HXB2, the full extent of the region will not be captured in proteins whose hypervariable sections are longer than that of HXB2. Depending on the input alignment, even sequences with shorter hypervariable regions than HXB2 may not be fully represented.
The V2 hypervariable region in HXB2 spans D185 to S190, and is highlighted in red here:
If the alignment columns that span D185 to S190 in HXB2 are extracted, the following peptides would be pulled from the alignment, and most of the hypervariable region would be missed in most sequences.
If instead we excise the region lying between the two more conserved, "alignable" positions just outside the hypervariable stretch in HXB2 (I184 and Y191), including the gaps inserted in HXB2 to maintain the alignment, the full region is captured, and we get very different, and appropriate, results:
To do this analysis properly, first make sure the alignment is sensible in the boundary regions. Because insertions often carry direct repeats, at least in part, and regions vary extensively in length (Wood et al., PLoS Pathog. 2009 May;5(5):e1000414.), multiple alignment programs can often give grossly inappropriate results in the hypervariable regions of HIV Env, particularly when challenged with a very large and diverse data set as input. Second, if you use the automated settings we have developed, we will excise the hypervariable regions by taking the full region between the residues we feel are more readily aligned, as in the corrected example above. Please take care that the values for the loops and hypervariable regions are in accord with the boundaries you wish to use, given your alignment and the hypothesis you are exploring. The alignments in these regions are subjective. If the boundaries we have pre-defined based on the global database alignment are not the most appropriate for your data (for example, a within-subject sequence set may be readily aligned with narrower indel regions than the population as a whole, so more focused boundaries may be more appropriate), then please use the manual input to specify the alignment positions you wish to have extracted and summarized, rather than the automated settings.
Finally, as a cautionary note, N-linked glycosylation sites require 3 amino acids, and partial sites are not included in the summaries.
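The difference between excising by flanking positions and excising by the reference span can be illustrated with a small Python sketch. This is a toy example under stated assumptions, not the tool's code. The function names are hypothetical, the reference sequence is assumed to start at residue 1, and the five-residue reference stands in for HXB2.

```python
GAP_CHARS = "-."


def reference_columns(ref_aligned: str) -> dict:
    """Map 1-based reference residue number -> 0-based alignment column.

    Assumption: the aligned reference starts at residue 1.
    """
    columns = {}
    residue = 0
    for col, char in enumerate(ref_aligned):
        if char not in GAP_CHARS:
            residue += 1
            columns[residue] = col
    return columns


def excise_between(alignment: dict, ref_name: str, left: int, right: int) -> dict:
    """Ungapped peptides strictly between two reference flanking residues.

    Columns where the reference has gaps are kept, so insertions relative
    to the reference are captured.
    """
    cols = reference_columns(alignment[ref_name])
    start, end = cols[left] + 1, cols[right]
    return {
        name: "".join(c for c in seq[start:end] if c not in GAP_CHARS)
        for name, seq in alignment.items()
    }


# Toy alignment: "other" has a two-residue insertion relative to the reference.
alignment = {"ref": "ACD--EF", "other": "ACDKKEF"}
print(excise_between(alignment, "ref", 3, 4))
# {'ref': '', 'other': 'KK'}
```

In this toy case no reference residue lies between positions 3 and 4, so an excision keyed only to reference residues would miss the insertion entirely, while the flank-based excision recovers it.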