HIV sequence database

SUDI Explanation




SUDI (Subtyping Distance Tool) is designed to help a user determine whether a newly identified cluster of HIV-1 sequences should most appropriately be considered a new subtype, a new sub-subtype, or part of a previously defined subtype, based on a comparison with the level of similarity found among previously defined HIV-1 M group subtypes. Because absolute levels of similarity depend on the precise gene regions under consideration, the time of sampling of the background samples in an ever-diverging epidemic, and the specific alignment, we did not want to try to set absolute criteria for intra- and inter-subtype distances. If the novel sequences under consideration are recombinant, include only regions between breakpoints in this analysis. Before using this tool, it is important that you are very familiar with the phylogenetic relationships between your novel sequences and the background set of sequences, and that potential regions of inter-subtype recombination have been defined.


SUDI accepts as input either an alignment or the "outfile" of a PHYLIP tree building program. We left the style of input open so that users have the option of starting with a sequence alignment (see example alignment) and building a tree as part of the analysis; the default tree for the program is a PHYLIP Neighbor Joining tree with an F84 (DNAML) model. The sequence alignment will be gapstripped.
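To make the gapstripping step concrete, here is a minimal Python sketch. It is not SUDI's own code: it assumes that gapstripping means dropping every alignment column in which at least one sequence has a gap, and the function name `gapstrip` and the gap characters `-` and `.` are illustrative choices.

```python
def gapstrip(alignment, gap_chars="-."):
    """Return a copy of the alignment without any column that contains a gap.

    `alignment` maps sequence names to aligned sequences of equal length.
    Assumption: a column is removed if ANY sequence has a gap in it.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    # zip(*seqs) walks the alignment column by column.
    kept_columns = [
        column for column in zip(*seqs)
        if not any(char in gap_chars for char in column)
    ]

    # Transpose the kept columns back into one string per sequence.
    if kept_columns:
        stripped = ["".join(chars) for chars in zip(*kept_columns)]
    else:
        stripped = ["" for _ in seqs]
    return dict(zip(names, stripped))


example = {
    "OUTGROUP": "ACG-TA",
    "A_U455":   "AC-TTA",
    "U_seq1":   "ACGTTA",
}
stripped_example = gapstrip(example)
```

In this toy alignment the third and fourth columns each contain a gap, so only four columns survive.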

If you would prefer to use a more complex model or another kind of tree building strategy, or would like to use a tool other than PHYLIP, then first create your own tree. If it is a PHYLIP tree, it can be used directly as input (see example outfile). If it is not, then use the tree you have created as a basis to create a user-defined tree with PHYLIP, and then use the PHYLIP outfile as the input for SUDI.

Here is a link to PHYLIP.


If you are submitting an alignment, you must include an outgroup as the first sequence in the set. The outgroup will not be included in the final subtype distance analysis. If you are submitting an outfile, you need to specify the number of the "base" node on the tree, which you can find by inspecting the outfile; all sequences to the right of that node in the outfile will be compared. The default value for the node is 1, but this may not be appropriate for your data, so check your outfile. In our example outfile, the appropriate reference node is node 20.


It is critical that sequences used in this analysis be named appropriately, or the analysis WILL NOT WORK!

    The subtype, sub-subtype, circulating recombinant form, or, more generally, the grouping of a sequence should be indicated by a character or two, followed by an underscore and then the sequence name. Examples: A_U455, B_RF, F1_93BR020, and 02_IBNG refer to subtype A, subtype B, sub-subtype F1, and a CRF02 sequence, respectively. You can choose to specify the grouping with whatever characters are appropriate for your background set; you may essentially name your background clusters according to your needs.
    The new cluster of sequences under consideration should be labeled "U_seqID", where seqID is the name of the sequence. The U stands for "unknown" and identifies the set of query sequences (view example outfile).
    The outgroup must be labeled "OUTGROUP" if you are submitting an alignment instead of the outfile. See example alignment.
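Because a misnamed sequence breaks the analysis, it can help to check labels before submitting. The following Python sketch applies the naming rules above; the function `parse_label` and its return format are hypothetical, and the limit of two characters for the group prefix is our reading of "a character or two", not a documented SUDI check.

```python
def parse_label(label):
    """Split a sequence label into (role, group, sequence_id).

    role is "outgroup", "query" (the U_ cluster), or "background".
    Assumption: the group prefix is one or two characters long and is
    separated from the sequence name by the FIRST underscore.
    """
    if label == "OUTGROUP":
        return ("outgroup", None, label)

    group, separator, seq_id = label.partition("_")
    if not separator or not group or not seq_id:
        raise ValueError(f"{label!r} is not of the form GROUP_seqID")
    if len(group) > 2:
        raise ValueError(f"group prefix {group!r} is longer than two characters")

    role = "query" if group == "U" else "background"
    return (role, group, seq_id)


labels = ["OUTGROUP", "A_U455", "B_RF", "F1_93BR020", "02_IBNG", "U_seq1"]
parsed = [parse_label(label) for label in labels]
```

Splitting on the first underscore only means that sequence names may themselves contain underscores.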


Because of the general design of this tool, it can be used for non-HIV sequences, such as HCV. However, the default settings for the groups to be compared against are based on the HIV-1 subtype nomenclature as of the year 1999.

All subtypes and sub-subtypes should be listed under:


Based on the tree, histograms will be generated showing the range of intra-subtype distances, inter-subtype distances, and sub-subtype distances. The category to which a given pairwise distance is assigned (intra-subtype, inter-subtype, or sub-subtype) depends on how the sequences were labeled (A_, B_, ...) and how the clusters were defined.
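The assignment of pairwise distances to categories can be sketched as follows. This is an illustration under stated assumptions, not SUDI's implementation: the mapping `sub_subtypes` (parent subtype to its sub-subtype labels) stands in for however the clusters are defined on the form, and two sequences with the identical group label are treated as an intra-subtype pair.

```python
from collections import defaultdict


def group_of(label):
    """Return the group prefix, i.e. the text before the first underscore."""
    return label.partition("_")[0]


def distance_category(group_a, group_b, sub_subtypes):
    """Classify a pair of groups as intra-, sub-, or inter-subtype.

    Assumption: `sub_subtypes` maps a parent subtype to the set of its
    sub-subtype labels, e.g. {"F": {"F1", "F2"}}.
    """
    if group_a == group_b:
        return "intra-subtype"
    for members in sub_subtypes.values():
        if group_a in members and group_b in members:
            return "sub-subtype"
    return "inter-subtype"


def bin_distances(pairs, sub_subtypes):
    """Collect distances per category, ready to be drawn as histograms."""
    bins = defaultdict(list)
    for label_a, label_b, distance in pairs:
        category = distance_category(
            group_of(label_a), group_of(label_b), sub_subtypes
        )
        bins[category].append(distance)
    return dict(bins)


# Made-up distances, for illustration only.
example_pairs = [
    ("A_U455", "A_other", 0.05),
    ("F1_93BR020", "F2_other", 0.09),
    ("A_U455", "B_RF", 0.15),
]
example_bins = bin_distances(example_pairs, {"F": {"F1", "F2"}})
```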

The cluster of sequences that the user is interested in, those sequences labeled "U", will be highlighted relative to the background set. The U intra-subtype distances will be shown, and the U inter-subtype distance relative to the subtype closest to U will be shown. This way the user can determine if the novel cluster should be broken into sub-subtypes, or be considered part of a previously defined subtype.
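One way to identify the subtype closest to U is sketched below. The choice of the mean pairwise distance as the measure of closeness is an assumption made for illustration (the tool's actual criterion is not stated here), and `closest_group` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def closest_group(pairs, query="U"):
    """Return (group, mean_distance) for the background group nearest the query.

    `pairs` is an iterable of (label_a, label_b, distance) with labels of the
    form GROUP_seqID; the outgroup is expected to have been removed already.
    Assumption: "closest" means the smallest MEAN query-to-group distance.
    """
    distances_by_group = defaultdict(list)
    for label_a, label_b, distance in pairs:
        group_a = label_a.partition("_")[0]
        group_b = label_b.partition("_")[0]
        # Keep only pairs with exactly one query sequence.
        if (group_a == query) != (group_b == query):
            background = group_b if group_a == query else group_a
            distances_by_group[background].append(distance)

    if not distances_by_group:
        raise ValueError("no query-to-background distances found")

    means = {group: mean(values) for group, values in distances_by_group.items()}
    best = min(means, key=means.get)
    return best, means[best]


# Made-up distances, for illustration only.
example_pairs = [
    ("U_1", "A_x", 0.12), ("U_2", "A_x", 0.14),
    ("U_1", "B_RF", 0.10), ("B_RF", "U_2", 0.11),
    ("U_1", "U_2", 0.03), ("A_x", "B_RF", 0.15),
]
nearest, nearest_mean = closest_group(example_pairs)
```

With these toy numbers the nearest group is B; the U-to-U and A-to-B pairs are ignored because they do not link the query cluster to a background group.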

The Plot Title field allows you to enter a title of up to 50 characters. This is optional.


The sample plots below illustrate Subtype and Sub-subtype behavior and show the output of the SAT process.

SUDI was written by B. Korber and R. Funkhouser; P. Rose assisted with the interactive Web interface.

last modified: Thu Nov 8 11:13 2007


Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2017 LANS, LLC All rights reserved | Disclaimer/Privacy
