HIV Databases
HIV sequence database

GenSig Explanation

A genetic signature is a statistical indicator that identifies specific amino acids in a protein that are likely to affect a particular function or binding interaction. Genetic signatures can be detected by finding a statistical correlation between particular genetic changes in a set of sequences and the resulting phenotypes.

This tool is designed to take a set of variable sequences with phenotypic data and identify specific amino acids that are most likely to be critical to the phenotype. For example, the sample input for the tool is a set of HIV-1 DNA sequences with correlated ID50 and ID80 neutralization data for one antibody. The resulting genetic signatures may indicate sites in the viral protein that are critical to antibody binding.

If you use these analysis strategies, please cite:


Analysis type

Strategy 1: Full phylogenetic and signature analysis

You will need to provide the DNA sequence alignment and phenotype data (e.g., antibody sensitivity data). The alignment may contain at most 800 sequences. Use this option if you want to prepare both a tree with maximum likelihood ancestor results and signature results.

Strategy 2: New signature analysis using phylogenetic analysis from a previous run

Please use this option if you are planning to use the same sequence data paired with new phenotype data, or to rerun with new signature options; the code will run faster and use less computational time on our servers. You will need to provide the run ID of a previous submission and select options. You can choose to re-use the same phenotype data, or provide a new set. The run ID is the ID that was generated when you ran the tool by strategy 1 and is displayed at the top of the results page. You should make a note of this ID if you intend to use the tree and maximum likelihood ancestor results prepared by strategy 1 to generate additional signature results.

Please note: Your past results remain on our server for 4 days from the time you first clicked the link provided in the "job done" email.

Sequence alignment and options


Ambiguity codes

The DNA sequence alignment can contain only A, C, T, G, N and - (gaps). If the alignment contains any other characters, you will be shown a list of the affected sequences and the positions where this check failed. You can then manually edit the alignment and run the tool again. If you wish to replace all the non-ACTGN nucleotides with N, you may check the "Replace with N" checkbox.
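The check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's own code: the function names, the dictionary representation of the alignment, and the case-insensitive comparison are assumptions made for the example.

```python
# Characters the tool accepts in the DNA alignment (from the text above).
ALLOWED = set("ACGTN-")

def find_invalid_positions(alignment):
    """Return {name: [(1-based position, character), ...]} for disallowed characters.

    `alignment` is a hypothetical {sequence name: aligned sequence} dict.
    Treating lower-case letters like upper-case ones is an assumption.
    """
    problems = {}
    for name, seq in alignment.items():
        bad = [(i + 1, ch) for i, ch in enumerate(seq) if ch.upper() not in ALLOWED]
        if bad:
            problems[name] = bad
    return problems

def replace_with_n(alignment):
    """Return a copy in which every disallowed character is replaced by N."""
    return {
        name: "".join(ch if ch.upper() in ALLOWED else "N" for ch in seq)
        for name, seq in alignment.items()
    }

alignment = {"A|Q252": "ACGTRY-N", "A|Q842": "ACGT--AC"}
print(find_invalid_positions(alignment))
print(replace_with_n(alignment)["A|Q252"])
```

Running a check like this locally before submission may save a round trip when the alignment contains IUPAC ambiguity codes such as R or Y.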

Non-allowable characters in names

Sequence names cannot contain characters that are reserved in Newick tree files: ; : ( ) , and #. If your names contain such characters, choose the option to replace them.
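If you prefer to clean the names yourself before uploading, the replacement can be sketched as follows. The replacement character (an underscore) and the function names are assumptions for illustration; the text above does not say which character the tool substitutes. Remember to apply the same renaming to the phenotype file so the names still match.

```python
# Characters reserved in Newick tree files, as listed above.
NEWICK_RESERVED = set(";:(),#")

def sanitize_name(name, replacement="_"):
    """Replace each Newick-reserved character; '_' is an assumed replacement."""
    return "".join(replacement if ch in NEWICK_RESERVED else ch for ch in name)

def sanitize_names(names, replacement="_"):
    """Map original names to cleaned names, refusing renames that collide."""
    mapping = {}
    owner = {}  # cleaned name -> original name that produced it
    for name in names:
        clean = sanitize_name(name, replacement)
        if clean in owner and owner[clean] != name:
            raise ValueError(f"{name!r} and {owner[clean]!r} both become {clean!r}")
        owner[clean] = name
        mapping[name] = clean
    return mapping

print(sanitize_names(["B.FR(83):HXB2", "A|Q252"]))
```

The collision check matters because two distinct names can become identical once reserved characters are replaced, which would make the tree leaves ambiguous.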

HIV-1 and HXB2

If you select the first option, the first sequence of your alignment file must be the HIV-1 reference sequence, HXB2. Including HXB2 enables us to provide you with results that include HXB2 nucleotide positions, amino acid positions, and the HIV region for each of the signature results.

This HXB2 sequence should be carefully aligned to the rest of the DNA alignment, and the exact name of this reference sequence should not be present in the phenotype file. (If you have actual phenotype data for HXB2, provide the data and another occurrence of HXB2 in the alignment with a different name.) All the sequences except the first sequence will be used for the tree, maximum likelihood ancestor, and signature analyses.

Select the second option if you are using this tool for finding genetic signatures for a non-HIV alignment (e.g., SIV or HCV) or you are not interested in the HXB2 reference positions and regions for your signatures. In this case, do not include any reference sequence in the alignment. All the sequence names are expected to be present in the phenotype file. All the sequences will be used for tree, maximum likelihood ancestor, and signature analysis.

Regions of interest

This option is only relevant to HIV with HXB2 included as a reference sequence. HIV often has overlapping reading frames. By default, the program considers every protein-encoding reading frame included in the nucleotide sequence. But if your phenotype data is relevant to only one of the encoded proteins, you should restrict your analysis to just that protein. For example, if your input is neutralizing antibody sensitivity data, then for clarity you should restrict your analysis to Env and exclude signatures in overlapping reading frames such as Tat and Rev, which might occur due to selection of a base in an overlapping codon.

Phenotype data and options

The phenotype data should be formatted in a table. We recommend tab-delimited files (.txt), but the tool will also accept space-delimited (.txt) or comma-delimited (.csv) values.

The first line of the phenotype data is the header. The header should start with the word "name", which labels the sequence name column, followed by a tab and then a tab-delimited list of feature names. These names will appear in the results under the feature column. The header ends with a newline character (\n). Each subsequent line follows the same layout: a sequence name followed by one value per feature.

Format of the Phenotype file


name    ID50Ab1 ID50Ab2
A|Q252  1   0   
A|Q842  0   0   
A|21020 1   1   
A|191_B 0   0   
Please note: The reference sequence should not be included in the phenotype file. (If you have actual phenotype data for HXB2, provide the data and another occurrence of HXB2 in the alignment with a different name.)
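As a rough illustration of the format above, the following sketch parses a phenotype table into a dictionary. It is not the tool's parser: the function names, the delimiter detection order (tab, then comma, then whitespace), and the strict column-count check are assumptions made for the example.

```python
def _split(line, delim):
    """Split one line on the delimiter; delim=None means runs of whitespace."""
    line = line.strip()  # tolerate trailing tabs or spaces, as in the example table
    fields = line.split(delim) if delim else line.split()
    return [f.strip() for f in fields]

def parse_phenotypes(text):
    """Return (feature names, {sequence name: {feature: value}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty phenotype file")
    # Assumed detection order: tab, then comma, then whitespace.
    delim = "\t" if "\t" in lines[0] else "," if "," in lines[0] else None
    header = _split(lines[0], delim)
    if header[0] != "name":
        raise ValueError('header must start with "name"')
    features = header[1:]
    table = {}
    for ln in lines[1:]:
        fields = _split(ln, delim)
        if len(fields) != len(header):
            raise ValueError(f"wrong number of columns: {ln!r}")
        table[fields[0]] = dict(zip(features, map(float, fields[1:])))
    return features, table

example = "name\tID50Ab1\tID50Ab2\nA|Q252\t1\t0\t\nA|Q842\t0\t0\n"
features, table = parse_phenotypes(example)
print(features, table["A|Q252"])
```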

Mismatched names

Names used in the phenotype file should match the names of the sequences in the DNA alignment exactly. If you check the "Ignore mismatch names between the query and phenotype files" checkbox, any names that are present in the phenotype file but not in the sequence file will be removed from the phenotype file. Similarly, any names that are present in the sequence file but not in the phenotype file will be added to the phenotype file with a value of -1 (-1 is equivalent to no data). If you do not check this checkbox and some names do not match, the tool will list them for you so that you can edit your input files and rerun the job.
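The behavior of the checkbox can be sketched as below. This is an illustration under stated assumptions, not the server's code: the function name and data structures are hypothetical, and `seq_names` is assumed to already exclude the HXB2 reference when it is used.

```python
def reconcile_names(seq_names, phenotypes, features, missing=-1):
    """Mimic the 'ignore mismatch' option described above.

    seq_names:  list of sequence names from the alignment
    phenotypes: {name: {feature: value}} from the phenotype file
    Returns (reconciled table, names dropped, names added with no data).
    """
    seq_set = set(seq_names)
    # Names only in the phenotype file are removed.
    dropped = sorted(n for n in phenotypes if n not in seq_set)
    # Names only in the alignment are added with -1 (no data) for every feature.
    added = [n for n in seq_names if n not in phenotypes]
    reconciled = {n: dict(phenotypes[n]) for n in seq_names if n in phenotypes}
    for n in added:
        reconciled[n] = {f: missing for f in features}
    return reconciled, dropped, added

table, dropped, added = reconcile_names(
    ["A|Q252", "A|Q842"],
    {"A|Q252": {"ID50Ab1": 1}, "A|21020": {"ID50Ab1": 0}},
    ["ID50Ab1"],
)
print(table, dropped, added)
```

Reviewing the dropped and added lists is worthwhile even when the option is enabled, since a typo in a name silently turns real data into a missing value.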

Statistical test

Values used in the phenotype file can be either continuous values or the discrete values 1, 0, and -1. Data with continuous values will be analyzed by the Wilcoxon test; data with discrete values will be analyzed by Fisher's exact test. The Wilcoxon test may give spurious results for data with many tied values, so be wary.
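Before submitting, it can help to check which test a phenotype column would receive and how many tied values it contains. The sketch below is an assumption about how such a check could look; the rule used here (a column is discrete only if every value is 1, 0, or -1) and the function names are illustrative and not taken from the tool.

```python
from collections import Counter

DISCRETE = {1.0, 0.0, -1.0}

def choose_test(values):
    """Return 'fisher' if every value is 1, 0 or -1, otherwise 'wilcoxon'."""
    return "fisher" if all(float(v) in DISCRETE for v in values) else "wilcoxon"

def tie_fraction(values):
    """Fraction of observations that share their value with another observation."""
    if not values:
        return 0.0
    counts = Counter(values)
    tied = sum(c for c in counts.values() if c > 1)
    return tied / len(values)

id50 = [50, 50, 3.2, 7.1]  # e.g. two values capped at the assay limit
print(choose_test(id50), tie_fraction(id50))
```

A high tie fraction, for example many titers reported at an assay's upper limit, is a hint that the Wilcoxon results should be read with the caution noted above.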

The values of 1, 0, -1 signify:

To perform Fisher's exact test on continuous input values, the values must first be converted to 1, 0, -1 values. Do this by checking one of the options explained below:

Signature options

Signature analysis

You may choose multiple analyses.

Maximum q-value

The q-value is the false discovery rate for a given set of p-values, and we present p and q values in the signature output.


Run ID

The run ID is the ID that was generated for you when you ran the tool by strategy 1. This ID is displayed at the top of the results page. You should make a note of this ID if you intend to use strategy 2 to rerun the tool for additional signature analyses.

Parameters used

This section lists the sequence options, phenotype options, and signature analysis options you selected when submitting the job.

Tree Results

The following files are available for download under tree results:

Maximum Likelihood Ancestor Results

The following files are available for download under maximum likelihood ancestor results.

Signature Results

For each of the signature analysis options selected, this section contains 3 tables. These tables contain results filtered by the maximum q-value you chose on the input page.

Note: Glycosylation analysis has only Table 1 and Table 3, and omits Table 2, because Table 2 and Table 3 contain opposite but identical results. Potential N-linked glycosylation motifs are either present (Y, for yes) or not present (N, for no).

If the table is found to be empty after the filter is applied, then there will be text stating "No results to report here".

Each of these tables contains columns, briefly described below:

Signature Files

The following files are available for download under signature results

NOTES regarding the strategy


Gaps inserted in the alignment to compensate for naturally occurring insertions and deletions are problematic both in the alignments and in maximum likelihood phylogeny character state reconstructions. We currently exclude positions from the alignment that are more than 10% gaps (we intend to eventually add an option to allow users to set this threshold), and thus the analyses focus only on positions in the alignment that are less likely to be problematic due to alignment issues.
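The 10% gap filter can be illustrated with the short sketch below. It is a simplified example rather than the tool's implementation: the function names are hypothetical, and applying the filter column by column to the alignment as given (rather than, say, per codon) is an assumption.

```python
def gappy_columns(seqs, max_gap_fraction=0.10):
    """Return 0-based indices of columns whose gap fraction exceeds the threshold."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences in an alignment must have equal length")
    n = len(seqs)
    return [
        col for col in range(length)
        if sum(s[col] == "-" for s in seqs) / n > max_gap_fraction
    ]

def drop_columns(seqs, columns):
    """Return the alignment with the given columns removed."""
    skip = set(columns)
    return ["".join(ch for i, ch in enumerate(s) if i not in skip) for s in seqs]

# Ten sequences: column 1 is exactly 10% gaps (kept), column 2 is 20% gaps (dropped).
seqs = ["ACG"] * 8 + ["A--", "AC-"]
cols = gappy_columns(seqs)
print(cols, drop_columns(seqs, cols))
```

Note that the comparison is strict, matching the "more than 10%" wording: a column with exactly 10% gaps is retained.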

In our original implementation of the code, we used maximum likelihood reconstructions of sequences, rather than the actual sequences, to represent sequences in the tree, including the leaf taxa. We modified this approach in an update on March 25, 2019, and now use the actual sequences for the simple Fisher's exact tests included in Table 1, as these counts are independent of the phylogeny. This enabled us to treat gaps as simply another character state for statistical comparisons.
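To make the Table 1 style of test concrete, here is a minimal pure-Python sketch of a Fisher's exact test on a 2x2 table of counts at one alignment position, with a gap counted like any other character state. It is an illustration only: the function names are hypothetical, the two-sided definition (summing all tables no more probable than the observed one) is an assumption, and the server's actual implementation may differ.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as probable as the observed one (small float slack).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def contingency(states, phenotypes, state):
    """Count (state & 1, state & 0, other & 1, other & 0), skipping -1 (no data)."""
    a = b = c = d = 0
    for s, ph in zip(states, phenotypes):
        if ph == -1:
            continue
        if s == state:
            if ph == 1:
                a += 1
            else:
                b += 1
        elif ph == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

# Character states at one position (including a gap) and discrete phenotypes.
states = list("NNK-N")
phenotypes = [1, 1, 0, 0, -1]
print(contingency(states, phenotypes, "-"))
print(fisher_exact_two_sided(3, 1, 1, 3))
```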

For Tables 2 and 3, the phylogenetically corrected signature analyses, we exclude gaps from counts of amino acids in a given position, as modeling the insertions and deletions represented by gaps in ancestral reconstructions is problematic.

Unusual ancestral reconstructions

On very rare occasions, the most likely ancestral state of a particular codon in the most recent common ancestor of two natural sequences will encode an amino acid that is not found among the input sequence data. Such amino acids are summarized in the output files of all amino acids tested, so occasionally an amino acid with zero counts in the sequences will be present, because it is in one or more ancestral states.

Q tests

The q-value is the false discovery rate for a given set of p-values. For the Wilcoxon test signature analysis, we use the R implementation of a q test, called qvalue, which follows the method of Storey and Tibshirani (2003).

To estimate the false discovery rate given the p-values derived using Fisher’s exact test, we use a modified q test as described in our original signature paper (Bhattacharya 2007). The modification is needed because Fisher’s exact test produces many discrete p-values that populate the distribution.

Consider N tests producing p-values {p_i, i = 1 ... N}. For any threshold p_thresh, we want to know the expected fraction of the tests with p_i ≤ p_thresh that are false positives. Let us denote the count of observed p_i ≤ p_thresh as N_obs.

The fraction of the N tests that are null distributed is f, i.e., N f tests should be insignificant. In that case, one would naively expect the number of false positives below p_thresh to be Poisson distributed with parameter N f p_thresh, so that the expected fraction of false positives is q = N f p_thresh / N_obs. Storey and Tibshirani (see above) showed that under some simple assumptions, one can find an estimator for f, and hence for q.
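The naive estimate described above can be written out as a short sketch. This illustrates only the basic formula, not the modified q test used for Fisher's exact p-values. The function name is hypothetical, the null fraction f is passed in rather than estimated (f = 1 is the most conservative choice), and capping the result at 1 is a convenience added for this example.

```python
def naive_q(p_values, p_thresh, null_fraction=1.0):
    """Naive false discovery rate estimate: N * f * p_thresh / N_obs.

    null_fraction is f, the assumed fraction of null-distributed tests;
    here it is supplied by the caller rather than estimated from the data.
    Returns None when no p-value falls at or below the threshold.
    """
    n = len(p_values)
    n_obs = sum(p <= p_thresh for p in p_values)
    if n_obs == 0:
        return None
    # Capping at 1 is an addition for this sketch, since q is a fraction.
    return min(1.0, n * null_fraction * p_thresh / n_obs)

p_values = [0.001, 0.004, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(naive_q(p_values, 0.05))
```

With ten tests and three p-values at or below 0.05, the expected number of false positives under f = 1 is 0.5, so roughly one in six of the three reported hits would be expected to be false.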

This analysis has a few issues:

The q threshold is a user-defined option with a default value of 0.2 (Bricault 2019). This is set high to be inclusive, and will identify associations of potential interest in a hypothesis-raising scenario. The q-values are calculated separately for Tables 1, 2, and 3, and for each column of input data.

Storey JD and Tibshirani R. Statistical significance for genome-wide experiments. Proceedings of the National Academy of Sciences, 2003. 100: 9440-9445.

Bhattacharya T, Daniels M, Heckerman D, Foley B, Frahm N, Kadie C, Carlson J, Yusim K, McMahon B, Gaschen B, Mallal S, Mullins JI, Nickle DC, Herbeck J, Rousseau C, Learn GH, Miura T, Brander C, Walker B, Korber B. Founder effects in the assessment of HIV polymorphisms and HLA allele associations. Science, 2007 Mar 16;315(5818):1583-6.

Bricault CA, Yusim K, Seaman MS, Yoon H, Theiler J, Giorgi EE, Wagh K, Theiler M, Hraber P, Macke JP, Kreider EF, Learn GH, Hahn BH, Scheid JF, Kovacs JM, Shields JL, Lavine CL, Ghantous F, Rist M, Bayne MG, Neubauer GH, McMahan K, Peng H, Chéneau C, Jones JJ, Zeng J, Ochsenbauer C, Nkolola JP, Stephenson KE, Chen B, Gnanakaran S, Bonsignori M, Williams LD, Haynes BF, Doria-Rose N, Mascola JR, Montefiori DC, Barouch DH, Korber B. HIV-1 Neutralizing Antibody Signatures and Application to Epitope-Targeted Vaccine Design. Cell Host Microbe, 2019 Jan 9;25(1):59-72.e8. doi: 10.1016/j.chom.2018.12.001.

last modified: Tue Apr 9 13:52 2019


Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
