For questions about immunology resources and tools, see Immunology Database FAQ.
This FAQ addresses questions about the HIV Sequence Database. We provide a variety of tools and information for researchers studying HIV and SIV. The main aim of this website is to provide easy access to our sequence database, alignments, and the tools and interfaces we have produced. The toolbar at the top of the page should help you navigate among these resources.
The HIV Sequence Database focuses on five primary goals:
The database staff includes molecular biologists, sequence analysts, computer technicians, post-docs, and graduate research assistants. We are part of the Theoretical Biology and Biophysics Group (T-6) at the Los Alamos National Laboratory. We are funded by the Division of AIDS of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the Department of Energy.
Our databases are organized around several areas of viral informatics. The affiliated databases are:
The information on this site is developed for researchers who study the AIDS virus and are seeking ways of defeating it. The information available here is not directly helpful to patients. We are not qualified to give medical advice of any kind; please discuss medical issues with your doctor. You can find links to more relevant websites on our Links Page.
Our sequence database receives all HIV-1, HIV-2, and SIV sequences that are deposited to GenBank. We retrieve these sequences monthly, so the most recent deposits may not be here yet. Beyond the information contained in GenBank records, we annotate the sequences with additional fields (see the questions below about annotation). We also provide a range of tools for understanding and working with these sequences.
This search interface finds all sequences that fit your criteria (e.g., all subtype B sequences from Thailand with names starting with 'H'), and allows you to download them either aligned or unaligned. If you want them aligned, you will get an alignment that has the length of the complete genome, and it can contain non-overlapping sequences (for example, if your retrieval contains both env and gag sequences). Alternatively, you can specify a region you are interested in (e.g., env, V3, or HXB2 nucleotide positions 5253-7640). In that case, the search returns only the sequences that both fit your criteria and contain that region, and the alignment covers only that region.
Additional information about sequence retrieval is available in the Search Help page.
One simple way is to search for the PubMed ID in the appropriate Publication field of the search interface. Another is to use other search criteria (e.g., accession number, author name, title word) to find one sequence from that article, and display its accession record (by clicking on the accession number in the search output). In the accession record, you will find a link called "display all sequences from this publication".
This can easily be done with our search interface by choosing the appropriate fields. There are predefined regions (genes and proteins) or you can use the genome map and find the coordinates. To find coordinates of a sequence, use the Sequence Locator Tool, which lets you paste in a sequence fragment and find its beginning and ending coordinates.
We try to link groups of sequences from a single patient by assigning them all to a unique patient ID. This unique patient ID that we assign is the number in parentheses. The other name/number is usually the sample name assigned by the authors (for example, "Patient_1"). For more details, please see the Search Help page.
This is an artifact of how we define "complete genome". A search for "complete genome" will include all sequences >7000 base pairs. These "complete" genomes are not always 100% complete; many have a small truncation of the 5' end of Gag. A search for "Gag" is limited to sequences that have a full-length Gag gene; those sequences that have a small truncation of Gag are omitted, and thus fewer sequences are obtained.
If you want to search for Gag sequences that include those sequences with small truncations of the 5' end, it is best to search using exact genome coordinates, with the 5' coordinate selected for the greatest truncation you are willing to accept.
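The difference between the two searches can be sketched as a pair of filters. This is a minimal illustration under stated assumptions, not the database's implementation: the `Record` type, its `start`/`end` fields (HXB2 coordinates), the function names, and the Gag coordinates used here (HXB2 790-2292) are all assumptions for the example, so check the genome map for the coordinates you actually need.

```python
from dataclasses import dataclass

# Assumed HXB2 coordinates of the gag gene; verify against the genome map.
GAG_START, GAG_END = 790, 2292
# "Complete genome" is defined above as any sequence longer than 7000 bp.
COMPLETE_GENOME_MIN_LENGTH = 7000


@dataclass
class Record:
    """Hypothetical sequence record with HXB2 start/end coordinates."""
    name: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Approximate length from coordinates (ignores insertions/deletions).
        return self.end - self.start + 1


def is_complete_genome(rec: Record) -> bool:
    return rec.length > COMPLETE_GENOME_MIN_LENGTH


def covers_region(rec: Record, region_start: int, region_end: int) -> bool:
    """True if the record spans the whole region."""
    return rec.start <= region_start and rec.end >= region_end


# Illustrative records only; not real database entries.
records = [
    Record("full", 1, 9719),
    Record("truncated_5prime", 800, 9600),  # misses the first 10 bases of gag
    Record("gag_only", 790, 2292),
]

complete = [r.name for r in records if is_complete_genome(r)]
full_gag = [r.name for r in records if covers_region(r, GAG_START, GAG_END)]
# Tolerate up to 20 missing bases at the 5' end by moving the 5' coordinate.
relaxed_gag = [r.name for r in records
               if covers_region(r, GAG_START + 20, GAG_END)]
```

In this sketch the truncated genome counts as "complete" but is missed by the full-length Gag filter, and it reappears once the 5' coordinate is shifted to the largest truncation you are willing to accept.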
Only the author (or owner) of an entry can update or modify the GenBank entry. Our database includes fields and comments that we add ourselves in GenBank-style entries. The fields we add are usually not reviewed by the authors. For example, we might add the health status of the patient, the date of sampling, the patient risk group, the phenotype of a viral culture from which a sequence is derived, subtype information, or additional references to a specific sequence entry. These added comments and fields come from our reading of the literature or analysis. Often, GenBank entries are not updated by their authors after the initial submission, but subsequent publications provide new and important information that pertains to a particular sequence, and we try to link this information.
Most information annotated to sequences is derived from publications and entered manually by our staff. As this is a time-consuming process, not all sequences are annotated. Some fields are blank because published papers do not always provide additional information.
GenBank requests this information for new HIV sequence submissions, so almost all entries now have this information. Please note that we distinguish between 'sampling country' and 'infection country'. This distinction can be important when, for example, a Somalian immigrant lives in Sweden and gets tested there: the sampling country is Sweden, but the likely infection country is Somalia. Filling out the infection country field is a bit of a judgement call, so this field should be regarded as 'likely infection country'.
We represent the country by a two letter country code based on the international naming convention (ISO 3166). These two letter codes are intuitive and short (for example, UG = Uganda, JP = Japan, etc.) so they can be easily linked to a sequence name for more informative representation in alignments and phylogenetic trees.
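As a rough illustration of linking a country code to a sequence name, the sketch below builds a label suitable for an alignment or tree. The label layout (fields joined by a dot), the function name `tree_label`, and the sample names are hypothetical; only the two-letter codes themselves come from ISO 3166.

```python
# A small illustrative subset of ISO 3166 two-letter country codes.
COUNTRY_CODES = {
    "Uganda": "UG",
    "Japan": "JP",
    "Sweden": "SE",
    "Thailand": "TH",
}


def tree_label(name, country, subtype=None, sep="."):
    """Build a label from optional subtype, country code, and sequence name.

    The field order and separator are assumptions made for this example.
    """
    code = COUNTRY_CODES[country]  # raises KeyError for unknown countries
    parts = [part for part in (subtype, code, name) if part]
    return sep.join(parts)


label_with_subtype = tree_label("sampleX", "Uganda", subtype="A")
label_without_subtype = tree_label("sampleY", "Japan")
```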
We try to only include information that is 'very likely or certain' to be true. This means, for example, that when a dual risk group is listed we do not include risk group information. If a paper states someone was probably infected in country X or country Y, we include that information only as a note. When someone was infected 'between 1989 and 1991' we do not include an infection year.
When we created our relational database, we combined comment lines that were linked to sequences by accession number from several sources in the older versions of our database. Thus information may be repeated. There are tens of thousands of entries in the database, and we felt it was more important to get all the information than to have it read smoothly.
No. Sequences must be deposited to one of the major sequence databases: GenBank, EMBL, or DDBJ. All HIV and SIV sequences deposited will automatically enter our database within approximately one month from their public release by any of these databases.
Prior to submission, we recommend that all HIV-1 sequences be run through the Quality Control tool. This tool will help you catch common HIV-1 sequence problems. It will also help you prepare the sequences for deposit to GenBank, if you wish.
Yes! Please contact us for details.
"M" is the main group of viruses in the HIV-1 global pandemic, and it contains multiple subtypes. N, O, and P are very distinctive forms of the virus originating from different transmissions from other primates into humans. CPZ are the primate viruses isolated from chimpanzees, which are the non-human primate viruses most closely related to HIV-1. For additional information, see HIV and SIV Subtype Nomenclature.
Subtypes are phylogenetically associated groups of HIV-1 or HIV-2 sequences. Sometimes the word "clade" is used to mean subtype. The sequences within any one subtype are more similar to each other than to sequences from different subtypes. These subtypes represent different lineages of HIV, and have some geographical associations. There are many ambiguities in the subtyping system; however, it describes genetic clustering patterns and provides a useful system for organizing viruses by genetic similarity. This topic is explained in detail in HIV and SIV Subtype Nomenclature.
Each year we gather a set of Subtype Reference Sequences that are considered to be representative of all of the subtypes of the HIV-1 M, N, and O groups. Larger sets of HIV/SIV Alignments of each gene and complete genomes, including the subtype reference sequences, are also available.
Subtype E was redesignated as CRF01_AE in 1998. It was originally described as subtype E based on envelope genes from isolates from southeast Asia. When gag genes and complete genomes from these isolates were sequenced, it was found that regions of the genome other than the env gene are more similar to the A subtype, so "subtype E" turned out to be a recombinant. Small fragments in the env region are still commonly called "E" because there they do appear to be completely separate from all other subtypes. The E subtype has only been clearly defined in the env region, and the evolutionary history and origin of this mosaic form remain controversial.
Multiple letters indicate that the sequence is a recombinant of parental viruses originating from 2 or more clades. For example, AGH indicates that it is thought that three subtypes recombined to form the sequenced virus: A, G, and H. The subtypes are listed alphabetically. The regions of the genome that are derived from a particular subtype are not indicated by the name.
When a number appears as part of a subtype, this refers to a circulating recombinant form (see next question).
CRFs are viruses whose complete genome has been shown to be recombinant or mosaic, consisting of some regions which cluster with one subtype and other regions of the genome which cluster with another subtype in phylogenetic analyses. CRFs are numbered sequentially in the order in which they are reported in the literature, starting with CRF01_AE, which is the new name of what used to be subtype E. The name of the isolate which was first sequenced and described is used to indicate the prototype of that CRF. This is done because there can be many different recombinant genomes containing the same subtypes, but only some of them have the same recombination breakpoints, and are apparently derived from the same common ancestor. In order to classify a recombinant as a circulating recombinant form, it must be found and sequenced in at least 3 patients who were not directly epidemiologically linked. The structure of all defined CRFs can be found here: HIV-1 Circulating Recombinant Forms.
This topic is addressed in How the HIV Database Classifies Sequences.
The subtype shown may reflect the subtypes of 2 or more fragments from the same sample. For example, if we have env sequences of subtype A and gag sequences of subtype C from the same patient, we will usually label all sequences "AC", unless the authors specifically mention that the person was dually infected.
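The labeling rule described here can be sketched as follows. This is only an illustration of the logic, not the procedure the database staff actually run (much of that work is manual); the function name and the `dual_infection` flag are hypothetical, and the sketch handles only plain single-letter subtypes, not CRF names or sub-subtypes such as A1.

```python
def combined_subtype(fragment_subtypes, dual_infection=False):
    """Return the subtype label to show for each fragment of one sample.

    fragment_subtypes maps a region name to its subtype letter(s),
    e.g. {"env": "A", "gag": "C"}. Only plain letters are supported.
    """
    if dual_infection:
        # Authors reported a dual infection: keep each fragment's own subtype.
        return dict(fragment_subtypes)
    # Otherwise pool the letters and list them alphabetically, as is done
    # for recombinant names.
    letters = sorted({letter
                      for subtype in fragment_subtypes.values()
                      for letter in subtype})
    label = "".join(letters)
    return {region: label for region in fragment_subtypes}


pooled = combined_subtype({"env": "A", "gag": "C"})
separate = combined_subtype({"env": "A", "gag": "C"}, dual_infection=True)
```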
When a short region of sequence has a subtype designation, one should be aware that the subtype designation often refers only to that fragment of sequence, and the virus that it is derived from may be recombinant.
There are some exceptions. Sometimes the sequence is known to come from an isolate from which other fragments are also sequenced; in that case, we try to indicate both subtypes. For example, if we have an env sequence that is subtype A and a gag sequence that is subtype B, we try to assign 'AB' to both. However, because of the manual effort involved, we don't always manage to do this consistently.
A consensus sequence is a sequence of the most common nucleotide or amino acid at each position in an alignment. We generally use a 50% cut-off, such that at least 50% of the sequences have the same character at this position, or else we replace the character with a question mark (CONS.a in the example below). Another way to create a consensus is to take the most frequently occurring character, even if it is not the majority (CONS.b in the example below).
CONS.a  ACG?A?CAT?CTATCAGT
CONS.b  ACGTAGCATACTATCAGT
------  ------------------
SEQ1    ACGTAGCATGCTATCAGT
SEQ2    ACGTAGCATGCTATCAGT
SEQ3    ACGTACCATCCTATCAGA
SEQ4    ACGAAACATCCTATCAGT
SEQ5    ACGAATCATACTATCAGT
SEQ6    ACGGATCATACTATCAGT
SEQ7    ACGCACCATACTATCAGT
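Both kinds of consensus can be computed column by column. The following is a minimal sketch, not the code behind our tools: the function name and parameters are made up for the example, and ties between equally frequent characters are broken by first occurrence in the alignment, which is an assumption (the text above does not specify a tie-breaking rule).

```python
from collections import Counter


def consensus(alignment, threshold=None, unresolved="?"):
    """Column-wise consensus of equal-length aligned sequences.

    threshold=None -> most frequent character in each column (plurality);
                      ties go to the character seen first (an assumption).
    threshold=0.5  -> the top character must occur in at least 50% of the
                      sequences, otherwise `unresolved` is emitted.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        # max() returns the first maximal key, so ties go to the
        # character that appears first in the column.
        top = max(counts, key=counts.get)
        if threshold is not None and counts[top] / n < threshold:
            result.append(unresolved)
        else:
            result.append(top)
    return "".join(result)


# The seven sequences from the example above.
sequences = [
    "ACGTAGCATGCTATCAGT",
    "ACGTAGCATGCTATCAGT",
    "ACGTACCATCCTATCAGA",
    "ACGAAACATCCTATCAGT",
    "ACGAATCATACTATCAGT",
    "ACGGATCATACTATCAGT",
    "ACGCACCATACTATCAGT",
]

cons_a = consensus(sequences, threshold=0.5)  # ACG?A?CAT?CTATCAGT
cons_b = consensus(sequences)                 # ACGTAGCATACTATCAGT
```

With the 50% cut-off, positions 4, 6, and 10 become question marks because no character reaches 4 of the 7 sequences; the plurality version still reports a character there.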
Consensus sequences are built from an alignment. The alignment itself might be dominated by one type of sequence, such as subtype B sequences from the United States. So in general a consensus sequence is not the same as the common ancestor of the sequences, although in some cases it can approximate an ancestral sequence.
Many sequence analysis tools use multiple sequence alignments as the input data for analysis. There are many common sequence formats, and it is relatively simple to convert from one format to another. Most program packages provide scripts or programs to aid in file format conversion. Most of our tools accept multiple formats, and most of our output files are FastA format. We also provide a Format Converter tool. If your sequences are in a format that our programs cannot handle, there are other web-based sequence conversion tools available.
We provide a variety of alignment sources. These include both premade alignments, and tools you can use to produce an alignment of your own sequences.
Alignments often contain symbols and ambiguity codes. See: Codes and Symbols in Sequence Alignments.
First, read the information on the web page for that tool, including the Explanation file, if available. Pay particular attention to the input format of your data. Is your data in one of the common sequence formats? Do your sequences need to be aligned (or codon-aligned) for this tool? Do your sequences (or sequence names) contain spaces, line breaks, or any non-standard characters? Does the tool work using the Sample Input? If you cannot find the source of the problem, please contact us for help. Send your input file, if possible. Sometimes our tools break, and we are very glad to be informed about the problem. We are happy to help you with troubleshooting your specific problem.
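Several of these checks can be run on your input before you submit it. The sketch below is a hypothetical pre-flight check for FASTA-formatted nucleotide input and is not part of any tool on this site: the function names and the set of accepted characters (IUPAC nucleotide codes plus `-` and `?`) are assumptions, and individual tools may accept more or less than this.

```python
import re

# Assumed alphabet: IUPAC nucleotide codes plus gap and unknown symbols.
VALID_CHARS = set("ACGTUNRYSWKMBDHV-?")


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(text, require_aligned=False):
    """Return a list of human-readable problems found in the input."""
    problems = []
    records = parse_fasta(text)
    if not records:
        problems.append("no FASTA records found")
    for name, seq in records:
        if re.search(r"\s", name):
            problems.append(f"{name!r}: name contains whitespace")
        bad = sorted(set(seq.upper()) - VALID_CHARS)
        if bad:
            problems.append(f"{name!r}: unexpected characters {bad}")
    if require_aligned and len({len(seq) for _, seq in records}) > 1:
        problems.append("sequences differ in length (not aligned)")
    return problems


sample = ">seq 1\nACGT\n>seq2\nACG!\n"
sample_problems = check_input(sample, require_aligned=True)
```

An empty list means none of these particular problems were detected; it does not guarantee that a given tool will accept the file.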
Please contact us. We may be able to get your results for you and send them to you. We may also be able to add a mailback feature to the tool to allow larger runs.
There are many tools that can help you identify the subtype/clade. Each tool has advantages and disadvantages, so it's a good idea to try more than one.
We have 2 tools to help you localize a sequence. One is HIV/SIV Sequence Locator, which determines the position number boundaries of a stretch of sequence, or alternatively identifies the stretch of sequence which corresponds to specified positions. You can also use this tool to identify which coordinates you need to use to find the region in the database that corresponds to your sequence.
The QuickAlign tool will also give you the reference coordinates of your sequence. In addition, it will show you the alignment of your sequence with a large number of sequences from the database.
If you have an analysis question and our code doesn't give you the right output, we will try to adapt the code to your needs if we can. Write to the e-mail address below and let us know. If you have something you would like to do to analyze HIV sequences, and can't find the computer code you need to do it, write to us. We will consider writing the programs if we feel they will be generally useful, or we may be able to point you in the right direction if we are aware of code that already exists.
Maybe. Our tool scripts are built to run as web tools, so running them locally usually requires some modification and/or additional scripts. We are able to provide scripts for many of our tools, but they may require considerable additional work on your part. If you are interested, contact us.