The Variable Region Characteristics tool characterizes features within a peptide excised from a protein alignment, summarizing and reporting the following peptide characteristics: length, number of N-linked glycosylation sites (the sequence pattern NX[ST], where X can be any amino acid except P), and net charge. It is a general tool that can be used to characterize regions excised from any protein alignment when the user specifies alignment positions, but it was originally designed to characterize the hypervariable regions in HIV-1 Env. Thus, if an HIV Env alignment is in the input, we have automated the excision of either the full V1, V2, V3, V4, and V5 loops, or the hypervariable regions within them. How the boundaries of the hypervariable regions are treated in the alignment is critical for a meaningful result; for an example of issues that should be considered, please see the section below entitled "Alignment Considerations".
The tool requires a set of aligned protein sequences along with an optional reference sequence. If you enter Alignment Positions to be selected in the Select Positions section of the submission form, then a reference sequence is optional. If you choose regions (full loop or hypervariable) from the Select Regions section, then your aligned sequences must include the HXB2 sequence.
Paste or upload a set of aligned protein sequences. Fasta, IG, table, and other common sequence formats are accepted. If you will be selecting a pre-specified loop or hypervariable region in the Select Region section, then your input must include HXB2, and "HXB2" (no quotes) must appear in the name of the reference sequence (for example, HXB2 is called B_FR_83_HXB2_K03455 in the sample input). The HXB2 reference sequence will not be included in the averaged summaries; if you want to include HXB2 in the average characteristics summaries, then include the HXB2 sequence in the input file twice, once as a reference sequence, and once with a label that does not include HXB2 (e.g. B_FR_83_K03455).
By default, the net charge is computed with KRH = (+) and DE = (-). However, you can switch the net charge computation method to KR = (+), DE = (-), because whether or not H is positively charged is context dependent within proteins. The net charge is simply the sum of the charges. For example, if a peptide contains 2 D's, its net charge = -2. If it contains 2 D's, 2 E's, and 2 K's, its net charge is again -2 (-4 from the acidic residues, +2 from the lysines).
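The charge rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's own code; the function name `net_charge` and the `count_his` switch are hypothetical names chosen for this example, and gap characters are assumed to contribute no charge.

```python
def net_charge(peptide: str, count_his: bool = True) -> int:
    """Sum residue charges: K, R (and optionally H) count +1; D and E count -1."""
    # count_his=True mirrors the default KRH = (+); False mirrors KR = (+).
    positive = set("KRH") if count_his else set("KR")
    negative = set("DE")
    charge = 0
    for residue in peptide.upper():
        if residue in positive:
            charge += 1
        elif residue in negative:
            charge -= 1
        # Any other character (neutral residues, gaps) contributes nothing.
    return charge


print(net_charge("DD"))                    # the two-D example from the text
print(net_charge("DDEEKK"))                # 2 D's, 2 E's, 2 K's
print(net_charge("HK", count_his=False))   # H ignored under the KR method
```

Under these assumptions, both worked examples from the text evaluate to -2, and a peptide such as HK changes from +2 to +1 when the KR method is selected.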
Sites with the pattern NP[ST] are not included in the Glyco column. However, they are counted in their own NP[ST] column.
Since sites in close proximity to each other may be hindered from being glycosylated at the same time, two overlapping sites are counted as one in the Glyco column (Gavel and von Heijne 1990; Go et al. 2015). The NN[ST][ST] column displays the number of overlapping glycosylation sites in the region.
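As a rough sketch of how the three glycosylation columns relate to each other, the patterns can be counted with regular expressions. This is not the tool's implementation: the function name `glyco_counts` is hypothetical, gaps are assumed to be stripped before matching, and the NN[ST][ST] value is assumed to be the number of occurrences of that pattern. Two sequons can only overlap when they form NN[ST][ST], so subtracting that count from the total number of sequons counts each overlapping pair once.

```python
import re


def glyco_counts(peptide: str) -> dict:
    """Count sequons in one excised peptide (assumption: gaps are removed first)."""
    seq = peptide.replace("-", "").upper()
    # Lookaheads let adjacent, overlapping sequons both be found.
    sequons = len(re.findall(r"(?=N[^P][ST])", seq))
    overlaps = len(re.findall(r"(?=NN[ST][ST])", seq))
    np_sites = len(re.findall(r"(?=NP[ST])", seq))
    return {
        "Glyco": sequons - overlaps,   # an overlapping pair counts as one site
        "NP[ST]": np_sites,            # reported separately, never in Glyco
        "NN[ST][ST]": overlaps,
    }


print(glyco_counts("NNST"))
print(glyco_counts("NPSNGT"))
```

In this sketch, NNST holds two overlapping sequons and yields Glyco = 1 with NN[ST][ST] = 1, while NPSNGT yields Glyco = 1 and NP[ST] = 1.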
You can select the regions of the alignment to analyze in three possible ways:
The tool can optionally produce a summary by prefix, such as clade or country, if the sequence names have such information as a prefix followed by a separator character (_ . - or *). For example, the sample input has the clade in the sequence names followed by an underscore, such as the "B" in B_FR_83_HXB2_K03455. In the following example, V1 hypervariable regions were selected, and "Include prefix summary" was checked. The B sequences and the C sequences would each be grouped together for summary statistics, and the scores would be averaged for each group. Suppose your input names looked like this, from the input file, and you requested a summary of V1 hypervariable regions based on prefixes:
B_FR_83_HXB2_K03455 MRVK---EKYQHL...
B_TH_90_BK132_AY173951 MRVKEIRKNCQHL...
B_NL_00_671_00T36_AY423387 MKVKGIRKNYQLL...
B_US_98_1058_11_AY331295 MRVKGIRRNCQHS...
C_BR_92_BR025_d_U52953 MRVEGIQRNWKQW...
C_IN_95_95IN21068_AF067155 MRVRGILRNYQQW...
C_ET_86_ETH2220_U46016 MKVMGIQRNCQQW...
C_ZA_04_04ZASK146_AY772699 MRVRGILRNWPQW...
Your output would look like this:
Selected Region: V1_Hypervariable, HXB2 Positions: 132 - 152
And for each group, and for all sequences combined (excluding the HXB2 reference strain), you will get a summary of the averages:
Prefix Summary
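The grouping step can be sketched as follows. This is a minimal illustration, not the tool's code: the helper names `prefix_of` and `prefix_summary` are hypothetical, the prefix is assumed to be everything before the first separator character, and the region lengths used below are made-up placeholder numbers rather than real results for these sequences.

```python
import re
from collections import defaultdict
from statistics import mean

SEPARATORS = r"[_.\-*]"  # the separator characters listed above: _ . - *


def prefix_of(name: str) -> str:
    """Return the part of a sequence name before the first separator."""
    return re.split(SEPARATORS, name, maxsplit=1)[0]


def prefix_summary(values: dict) -> dict:
    """Average one per-sequence value by prefix, skipping the HXB2 reference."""
    groups = defaultdict(list)
    for name, value in values.items():
        if "HXB2" in name:  # the reference is excluded from the averages
            continue
        groups[prefix_of(name)].append(value)
    summary = {prefix: mean(vals) for prefix, vals in groups.items()}
    summary["All"] = mean(v for vals in groups.values() for v in vals)
    return summary


# Placeholder lengths, for illustration only.
lengths = {
    "B_FR_83_HXB2_K03455": 21,
    "B_TH_90_BK132_AY173951": 20,
    "B_NL_00_671_00T36_AY423387": 30,
    "C_BR_92_BR025_d_U52953": 10,
}
print(prefix_summary(lengths))
```

With these placeholder values, the B group averages the two non-reference B sequences, the C group contains a single sequence, and the "All" row averages every sequence except HXB2.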
Optionally, you can give your analysis a name.
If you specify a variable loop or hypervariable region in the check box, you will get the following regions excised. Please note that the hypervariable regions are somewhat arbitrary; at the database we have found these boundaries to be good markers of the transition between regions that are variable but can be aligned reasonably, and regions that are very distinctive and differ in length and content in HIV from almost all infections. We have included these regions for convenience, and because we have used them for population studies in the past. But if a different set of boundaries better represents your data and the hypothesis you are testing (for example, a longitudinal study sampled within a subject may have indel regions that are much more narrowly constrained), you should set your own boundaries based on your alignment.
The V1 loop includes positions 131-157 in HXB2, and is bounded by a disulfide bond between the cysteines (C) at its base. The program will identify C131 and C157 in HXB2, and excise the region of the alignment that corresponds to the specified region of HXB2. The V1 loop is highlighted in blue in the HXB2 Env protein fragment shown below. The hypervariable region in V1 (the region where the alignment begins to break down), as it is found in HXB2, is marked in red; in HXB2 it spans T132 through G152. There is extreme length variation in such regions, and the program will extract everything between, but not including, the more readily aligned C131 and E153 in HXB2 that bound the hypervariable region. The highlighted positions have also been included in the tool's output so that glycosylation sites across the boundary are accounted for.
The V2 region begins where V1 ends, starting at S158 and continuing through C196 in HXB2. Like V1, V2 is bounded by Cys bonds; however, C196 is linked with the C at 126, giving a "rabbit ear" structure to the region. The V2 hypervariable region is marked in red; in HXB2 it spans D185 through S190. The highlighted positions have also been included in the tool's output so that glycosylation sites across the boundary are accounted for.
V3 is bounded by C296 and C331 using HXB2 numbering. The highlighted positions have also been included in the tool's output so that glycosylation sites across the boundary are accounted for. V3 does not have a hypervariable region.
V4 is bounded by C385 and C418 using HXB2 numbering. The V4 hypervariable region is marked in red; in HXB2 it spans F396 through G410:
The V5 loop is defined based on the gp120 structure, and is not bounded by Cys disulfide bridges at its base; it is located at positions N460 to R469 in HXB2. The highlighted positions have also been included in the tool's output so that glycosylation sites across the boundary are accounted for. The V5 hypervariable region is marked in red; in HXB2 it spans N460 through S465:
(example in V2)
If hypervariable regions based on the positions noted in HXB2 are simply excised from an alignment, the extent of the region in other proteins with longer hypervariable sections than HXB2 will not be captured, and depending on the input alignment, even sequences with shorter hypervariable regions than HXB2 may not be fully represented.
The V2 hypervariable region in HXB2 spans D185 through S190, and is highlighted in red here:
If the regions from the alignment that span D185 and S190 in HXB2 are extracted, the following peptides would be pulled from the alignment, and most of the hypervariable regions in most sequences would be missed.
If instead we excise the region lying between the two more conserved "alignable" positions just outside the bounds of the hypervariable stretch in HXB2, that is, between I184 and Y191, including the gaps inserted in HXB2 to maintain the alignment, the full region is captured, and we get very different, and appropriate, results:
To do this analysis properly, first make sure the alignment is sensible in the boundary regions. Because insertions often in part carry direct repeats, and regions vary in length extensively (Wood et al., PLoS Pathog. 2009 May;5(5):e1000414.), multiple alignment programs often can give grossly inappropriate results in the hypervariable regions of HIV Env, particularly when the multiple alignment program is challenged with a very large and diverse data set as input. Second, if you use the automated settings we have developed, we will excise the hypervariable regions by taking the full region between the residues we feel are more readily aligned, as in the corrected example above. Please take care that the values for the loops and hypervariable regions are in accord with the boundaries you wish to use, given your alignment and the hypothesis you are exploring. The alignments in these regions are subjective. If the boundaries we have pre-defined and selected based on the global database alignment are not most appropriate for your data (for example, a within-subject sequence set may be readily aligned with narrower indel regions than the population as a whole, and more focused boundaries may be more appropriate), then please use the manual input to specify the alignment positions you wish to have extracted and summarized rather than the automated settings.
Finally, as a cautionary note, N-linked glycosylation sites require 3 amino acids, and partial sites are not included in the summaries.
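The excision between two flanking, readily aligned positions described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the function names `hxb2_column` and `excise_between` are hypothetical, the reference row is assumed to start at residue 1 unless `ref_start` says otherwise, and the alignment is a toy example rather than real Env sequence.

```python
def hxb2_column(ref_aligned: str, position: int, ref_start: int = 1) -> int:
    """Return the alignment column that holds a given reference residue number."""
    residue_number = ref_start - 1
    for column, char in enumerate(ref_aligned):
        if char != "-":  # gaps do not advance the reference numbering
            residue_number += 1
            if residue_number == position:
                return column
    raise ValueError(f"position {position} not found in reference")


def excise_between(alignment: dict, ref_name: str, left: int, right: int,
                   ref_start: int = 1) -> dict:
    """Excise everything strictly between two flanking reference positions."""
    ref = alignment[ref_name]
    lo = hxb2_column(ref, left, ref_start)
    hi = hxb2_column(ref, right, ref_start)
    # Columns where the reference is gapped are kept, so insertions in other
    # sequences are captured; gaps are then stripped from each peptide.
    return {name: seq[lo + 1:hi].replace("-", "") for name, seq in alignment.items()}


# Toy alignment: the reference has a 2-residue gap just inside the left flank.
toy = {
    "REF_HXB2": "AC--DEFG",
    "seq1":     "ACKKDEFG",
}
print(excise_between(toy, "REF_HXB2", left=2, right=5))
```

In the toy alignment, cutting only the columns under reference residues 3 and 4 would return DE for seq1 and miss its insertion, whereas taking everything between the flanking residues 2 and 5 returns KKDE.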
Optionally, you can include information about the sequences in your analysis, in order to separately analyze the sequences with/without a particular feature. For example, our sample input for the tool includes feature data of "1" or "0" for each sequence to indicate that the sequence's IC50 value is above or below a particular threshold. The tool can take up to 5 features.
Indicate whether your feature data are space-delimited, comma-delimited, or tab-delimited.
If this box is unchecked (default), the program will give an error message if any of the names are mismatched between the alignment and the feature data. If this box is checked, the program will run, but will only produce feature-associated data for the sequences with matching names.
If your feature data consists of binary values, then choose the first option here. Use "1" to indicate true, "0" to indicate false, and "-1" or "NA" to indicate that the value should be ignored.
If your data consists of continuous values, the second option tells the program to change the continuous values into binary values. Values at or above the median will be assigned the value of "1", and values below the median will be assigned "0". Use "NA" to ignore the value.
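The median split described above can be sketched as follows. This is only an illustration: the function name `binarize_by_median` is hypothetical, and the median is assumed to be computed over the non-"NA" values only, which the text does not state explicitly.

```python
from statistics import median


def binarize_by_median(values: list) -> list:
    """Convert continuous feature values (given as strings) to 1/0; "NA" becomes None."""
    numeric = [float(v) for v in values if v != "NA"]
    cutoff = median(numeric)  # assumption: ignored values do not affect the median
    result = []
    for v in values:
        if v == "NA":
            result.append(None)                       # ignored value
        else:
            result.append(int(float(v) >= cutoff))    # at or above the median -> 1
    return result


print(binarize_by_median(["0.5", "2.0", "NA", "1.0", "3.0"]))
```

Here the median of the four usable values is 1.5, so 2.0 and 3.0 become 1, 0.5 and 1.0 become 0, and the "NA" entry is carried through as an ignored value.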
Indicate whether the first row of feature data is a header row.