GenSig Explanation

A genetic signature is a statistical indicator that identifies specific amino acids in a protein that are likely to affect a particular function or binding. Genetic signatures can be detected by finding a statistical correlation between particular genetic changes in a set of sequences and the resulting phenotypes.

This tool is designed to take a set of variable sequences with phenotypic data and identify specific amino acids that are most likely to be critical to the phenotype. For example, the sample input for the tool is a set of HIV-1 DNA sequences with correlated ID50 and ID80 neutralization data for one antibody. The resulting genetic signatures may indicate sites in the viral protein that are critical to antibody binding.

Analysis type

Strategy 1: Full phylogenetic and signature analysis

You will need to provide the DNA sequence alignment and phenotype data. Use this option if you want to prepare both a tree with maximum likelihood ancestor results and signature results.

Strategy 2: New signature analysis using phylogenetic analysis from a previous run

You will need to provide the run ID of a previous submission and select options. You can choose to re-use the same phenotype data or provide a new set. The run ID is the ID that was generated when you ran the tool with strategy 1; it is displayed at the top of the results page. Make a note of this ID if you intend to use the tree and maximum likelihood ancestor results prepared by strategy 1 to generate additional signature results. Re-using the tree results gives a faster result if you only want to change your phenotype data and/or signature options.

Please note: Your past results remain on our server for 4 days from the time you first clicked the link provided in the "job done" email.

DNA sequence alignment and options

Sequence Alignment

Requirements:

Ambiguity codes

The DNA sequence alignment can contain only A, C, T, G, N and - (gaps). If the alignment contains any other characters, you will be shown a list of the sequences and the positions where this test failed. You can then manually edit the alignment and run the tool again. If you wish to replace all the non-ACTGN nucleotides with N, check the "Replace with N" checkbox.
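The check described above can be approximated locally before uploading. The following is a minimal sketch, not the tool's own code: the function names are hypothetical, and treating lower-case letters as equivalent to upper-case is an assumption.

```python
# Characters the alignment may contain, per the requirement above.
ALLOWED = set("ACGTN-")


def find_invalid_positions(sequences):
    """Return {name: [(position, char), ...]} for disallowed characters.

    `sequences` maps sequence names to aligned DNA strings.
    Positions are 1-based alignment columns.
    Assumption: lower-case letters are treated like upper-case.
    """
    problems = {}
    for name, seq in sequences.items():
        bad = [(i + 1, ch) for i, ch in enumerate(seq)
               if ch.upper() not in ALLOWED]
        if bad:
            problems[name] = bad
    return problems


def replace_with_n(seq):
    """Replace every disallowed character with N (gaps are kept)."""
    return "".join(ch if ch.upper() in ALLOWED else "N" for ch in seq)
```

For example, `find_invalid_positions({"s2": "ACRTY-"})` reports the ambiguity codes R and Y at columns 3 and 5, and `replace_with_n("ACRTY-")` yields `ACNTN-`.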

Non-allowable characters in names

Sequence names cannot contain characters that are reserved characters for Newick tree files: ; : ( ) , #
If your names contain such characters, choose the option to replace them.
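If you prefer to clean the names yourself, something like the following sketch could be used. The underscore replacement is an assumption (this page does not say which character the tool substitutes), and `sanitize_name` is a hypothetical helper.

```python
import re

# The Newick-reserved characters listed above: ; : ( ) , #
NEWICK_RESERVED = re.compile(r"[;:(),#]")


def sanitize_name(name, replacement="_"):
    """Replace each Newick-reserved character in a sequence name.

    Assumption: "_" is used as the replacement character.
    """
    return NEWICK_RESERVED.sub(replacement, name)
```

If you rename sequences in the alignment this way, apply the same function to the names in the phenotype file so that the two files still match.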

HIV-1 and HXB2

If you select the first option, the first sequence of your alignment file must be the HIV-1 reference sequence, HXB2. Including HXB2 enables us to provide you with results that include HXB2 nucleotide positions, amino acid positions, and the HIV region for each of the signature results.

This HXB2 sequence should be aligned to the rest of the DNA alignment, and the exact name of this reference sequence should not be present in the phenotype file. (If you have actual phenotype data for HXB2, provide the data and another occurrence of HXB2 in the alignment with a different name.) All the sequences except the first sequence will be used for the tree, maximum likelihood ancestor, and signature analyses.

Select the second option if you are using this tool to find genetic signatures in a non-HIV alignment (e.g., SIV or HCV), or if you are not interested in the HXB2 reference positions and regions for your signatures. In this case, do not include any reference sequence in the alignment. All the sequence names are expected to be present in the phenotype file. All the sequences will be used for the tree, maximum likelihood ancestor, and signature analyses.

Phenotype data and options

The phenotype data should be formatted in a table. We recommend tab-delimited files (.txt), but the tool will also accept space-delimited (.txt) or comma-delimited (.csv) values.

In a tab-delimited phenotype file, the first line is the header. The header should start with the word "name", which labels the sequence name column, followed by a tab character (\t) and then the tab-delimited feature or phenotype names. These names will appear in the results under the feature column. The header ends with a new line character (\n). Each following line has the same layout: a sequence name followed by one value per feature.

Format of the Phenotype file

The phenotype file should be tab-delimited:
name[tab][feature1][tab][feature2]...
[seq 1][tab][value 1][tab][value 2]...
[seq 2][tab][value 3][tab][value 4]...
.
.
.
[seq n][tab][value n1][tab][value n2]...

Example:

name	ID50_geomean	ID80_geomean
A|T|T|F|2|KE|AF407152|Q259.d2.17	1	0	
A|T|T|F|2|KE|AF407160|Q842.d12	0	0	
A|T|NA|F|NA|KE|HM215275|21020_13	1	1	
A|T|T|T|NA|UG|HM215266|191084_B7-19	0	0	
A|T|T|T|NA|RW|HM215434|R18553_E1	0	0	
A|T|NA|NA|NA|TZ|HM215312|398-F1_F6_20	0	0	
A|T|T|T|NA|UG|HM215350|9004SS_A3_4	0	0	
A|T|T|F|2|KE|AF407158|Q769.d22	1	0	
Please note: The reference sequence should not be included in the phenotype file. (If you have actual phenotype data for HXB2, provide the data and another occurrence of HXB2 in the alignment with a different name.)
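A file in this layout can be read and checked with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, values are converted to floats, and trailing tabs (as in the example above) are tolerated.

```python
import csv
import io


def read_phenotypes(text):
    """Parse tab-delimited phenotype text into (features, table).

    `table` maps each sequence name to {feature: value}.
    Assumptions: all values are numeric; empty trailing fields
    caused by a trailing tab are ignored.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    if not header or header[0] != "name":
        raise ValueError('header must start with "name"')
    features = [h for h in header[1:] if h]
    table = {}
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank lines
        values = row[1:1 + len(features)]
        if len(values) != len(features) or "" in values:
            raise ValueError(f"wrong number of values for {row[0]}")
        table[row[0]] = {f: float(v) for f, v in zip(features, values)}
    return features, table
```

Reading the text of a phenotype file this way before submission makes it easy to spot rows with missing values or a malformed header.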

Mismatched names

Names used in the phenotype file should match the names of the sequences in the DNA alignment exactly. If you check the "Ignore mismatch names between the query and phenotype files" checkbox, any names that are present in the phenotype file but not in the sequence file will be removed from the phenotype file. Similarly, any names that are present in the sequence file but not in the phenotype file will be added to the phenotype file with a value of -1. If you do not check this checkbox and some names do not match, the tool will list them for you so that you can edit your input files and rerun the job.
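The two behaviors described above can be sketched as follows. This illustrates the stated rules and is not the tool's implementation; the function names and the dictionary layout (name mapped to {feature: value}) are assumptions.

```python
def list_mismatches(sequence_names, phenotypes):
    """Return (only_in_phenotype, only_in_alignment), each sorted."""
    seq_set = set(sequence_names)
    only_in_phenotype = sorted(n for n in phenotypes if n not in seq_set)
    only_in_alignment = sorted(n for n in sequence_names
                               if n not in phenotypes)
    return only_in_phenotype, only_in_alignment


def reconcile_names(sequence_names, phenotypes, features, missing_value=-1):
    """Mimic the "ignore mismatch" option described above.

    Names found only in the phenotype data are dropped; names found
    only in the alignment are added with `missing_value` per feature.
    """
    seq_set = set(sequence_names)
    reconciled = {n: dict(v) for n, v in phenotypes.items() if n in seq_set}
    for name in sequence_names:
        if name not in reconciled:
            reconciled[name] = {f: missing_value for f in features}
    return reconciled
```

When the alignment starts with the HXB2 reference, leave that name out of `sequence_names`, since the reference is not expected in the phenotype file.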

Statistical test

Values used in the phenotype file can be either continuous values or the discrete values 1, 0, -1. Data with continuous values will be analyzed by the Wilcoxon test; data with discrete values will be analyzed by Fisher's exact test. The Wilcoxon test may give spurious results for data with many tied values, so be wary.
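To see in advance which test a feature column is likely to receive, and how heavily tied a continuous column is, a rough check such as the one below may help. It is only a sketch: the function names are hypothetical, and the rule "all values in {1, 0, -1} means discrete" is an assumption based on the description above.

```python
from collections import Counter

DISCRETE_VALUES = {1.0, 0.0, -1.0}


def choose_test(values):
    """Return "fisher" for purely 1/0/-1 data, otherwise "wilcoxon"."""
    if all(float(v) in DISCRETE_VALUES for v in values):
        return "fisher"
    return "wilcoxon"


def tie_fraction(values):
    """Fraction of values that share their value with another entry.

    A high fraction suggests Wilcoxon results deserve extra caution.
    """
    counts = Counter(float(v) for v in values)
    tied = sum(c for c in counts.values() if c > 1)
    return tied / len(values) if values else 0.0
```

For instance, `tie_fraction([1.5, 1.5, 2.0, 3.0])` is 0.5, because two of the four values are tied.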

The values of 1, 0, -1 signify:

To perform Fisher's exact test, continuous values must be converted to 1, 0, -1 values. Do this by checking one of the options, explained below:

Signature options

Signature analysis

Your analysis can consider single sites, double sites, or functional domain sites at various depths. You may choose multiple analyses.

The following specific options are available:

Maximum q-value

The q-value is the false discovery rate for a given set of p-values, and we present p and q values in the signature output.
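This page does not state how the tool computes q-values. As an illustration only, the sketch below uses the Benjamini-Hochberg adjustment, one common way to turn a set of p-values into false-discovery-rate estimates, and then applies a maximum q-value filter. Both function names are hypothetical, and the tool's own method may differ.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: this stands in for the tool's q-value calculation,
    which is not described here.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def filter_by_max_q(pvalues, max_q):
    """Return indices of results whose q-value is at most `max_q`."""
    return [i for i, q in enumerate(bh_qvalues(pvalues)) if q <= max_q]
```

With p-values 0.01, 0.04, 0.03 and 0.20, the adjusted values are roughly 0.04, 0.053, 0.053 and 0.20, so a maximum q-value of 0.05 would keep only the first result.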

Results

Run ID

The run ID is the ID that was generated for you when you ran the tool with strategy 1. This ID is displayed at the top of the results page. Make a note of this ID if you intend to use strategy 2 to rerun the tool for additional signature analyses.

Parameters used

This section lists the sequence options, phenotype options, and signature analysis options you selected when submitting the job.

Tree Results

The following files are available for download under tree results:

Maximum Likelihood Ancestor Results

The following files are available for download under maximum likelihood ancestor results.

Signature Results

For each of the signature analysis options selected, this section contains 3 tables:

Note: Glycosylation analysis has only Table 1 and Table 3, and omits Table 2, because Table 2 and Table 3 contain opposite but identical results. These tables contain results filtered by the maximum q-value you chose on the input page.

If the table is found to be empty after the filter is applied, then there will be text stating "No results to report here".

Each of these tables contains columns, briefly described below:

Signature Files

The following files are available for download under signature results:

last modified: Wed Jan 16 08:41 2019


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
