


HIV Sequence Database Frequently Asked Questions

For questions about immunology resources and tools, see the Immunology Database FAQ.

 

Site overview

What can I find on this website?

This FAQ addresses questions about the HIV Sequence Database. We provide a variety of tools and information for researchers studying HIV and SIV. The main aim of this website is to provide easy access to our sequence database, alignments, and the tools and interfaces we have produced. The toolbar at the top of the page should help you navigate among these resources.

The HIV Sequence Database focuses on five primary goals:

HIV Database staff

Who are you?

The database staff includes molecular biologists, sequence analysts, computer technicians, post-docs, and graduate research assistants. We are part of the Theoretical Biology and Biophysics Group (T-6) at the Los Alamos National Laboratory. We are funded by the Division of AIDS of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the Department of Energy.

Why are there several separate databases here?

Our databases are organized around several areas of viral informatics. The affiliated databases are:

I am HIV positive; does this site have information useful to me?

The information on this site is developed for researchers who study the AIDS virus and are seeking ways of defeating it. The information available here is not directly helpful to patients. We are not qualified to give medical advice of any kind; please discuss medical issues with your doctor. You can find links to more relevant websites on our Links Page.

 

Sequence retrieval

What sequences can I find here, and why would I use this database to retrieve them?

Our sequence database receives all HIV-1, HIV-2, and SIV sequences that are deposited in GenBank. We retrieve these sequences monthly, so the most recent deposits may not be here yet. In addition to the information contained in GenBank records, we annotate the sequences with further information (see the questions below about annotation). We also provide a range of tools for understanding and working with these sequences.

How do I retrieve a specific region of sequences?

You can search for sequences by either their common name or their accession number, using our search interface.

This search interface finds all sequences that fit your criteria (e.g., all subtype B sequences from Thailand with names starting with 'H'), and allows you to download them either aligned or unaligned. If you download them aligned, the alignment has the length of the complete genome and can contain non-overlapping sequences (for example, if your retrieval contains both env and gag sequences). Alternatively, you can specify a region of interest (e.g., env, V3, or HXB2 nucleotide positions 5253-7640). In that case, the search returns only sequences that both fit your criteria and contain that region, and the alignment covers only that region.

Additional information about sequence retrieval is available in the Search Help page.
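
As an illustration of what region-based retrieval involves, the following is a minimal Python sketch, not the database's actual implementation. It assumes an alignment held as a dictionary of equal-length strings that includes a gapped reference sequence; the function names and the toy data are hypothetical. It maps 1-based reference positions to alignment columns and keeps only the sequences that have at least one residue in the region (the real search may apply a stricter rule for "contains").

```python
def reference_columns(aligned_ref, start, end, gap="-"):
    """Map 1-based, inclusive reference positions start..end to a
    (first_column, last_column_exclusive) slice of the alignment."""
    position = 0       # ungapped reference position reached so far
    first_col = None
    for col, char in enumerate(aligned_ref):
        if char == gap:
            continue
        position += 1
        if position == start:
            first_col = col
        if position == end:
            if first_col is None:
                break  # start was never reached (start < 1 or start > end)
            return first_col, col + 1
    raise ValueError("region lies outside the reference sequence")


def extract_region(alignment, ref_name, start, end, gap="-"):
    """Slice every sequence to the region and drop sequences that are all gaps there."""
    first, last = reference_columns(alignment[ref_name], start, end, gap)
    region = {}
    for name, seq in alignment.items():
        piece = seq[first:last]
        if piece.strip(gap):   # at least one non-gap character in the region
            region[name] = piece
    return region


# Toy alignment; "REF" stands in for a gapped reference such as HXB2.
toy_alignment = {
    "REF": "AC-GTAC-GT",
    "S1":  "ACTGTACAGT",
    "S2":  "------CAGT",
}
toy_region = extract_region(toy_alignment, "REF", 3, 5)
```

Here `toy_region` keeps REF and S1 and drops S2, which has no residues at reference positions 3 to 5.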

How do I retrieve a set of sequences from a specific paper?

One simple way is to search for the PubMed ID in the appropriate Publication field of the search interface. Another is to use other search criteria (e.g., accession number, author name, title word) to find one sequence from that article, and display its accession record (by clicking on the accession number in the search output). In the accession record, you will find a link called "display all sequences from this publication".

How do I obtain an alignment of all sequences of a particular gene with a particular subtype or country of origin?

This can easily be done with our search interface by choosing the appropriate fields. There are predefined regions (genes and proteins) or you can use the genome map and find the coordinates. To find coordinates of a sequence, use the Sequence Locator Tool, which lets you paste in a sequence fragment and find its beginning and ending coordinates.

What is that "Patient code" that's listed in the output? Why are there two numbers there?

We try to link groups of sequences from a single patient by assigning them all to a unique patient ID. This unique patient ID that we assign is the number in parentheses. The other name/number is usually the sample name assigned by the authors (for example, "Patient_1"). For more details, please see the Search Help page.

A search limited to "complete genome" yields more sequences than the same search limited to "Gag" only. Why?

This is an artifact of how we define "complete genome". A search for "complete genome" will include all sequences >7000 base pairs. These "complete" genomes are not always 100% complete; many have a small truncation of the 5' end of Gag. A search for "Gag" is limited to sequences that have a full-length Gag gene; those sequences that have a small truncation of Gag are omitted, and thus fewer sequences are obtained.

If you want to search for Gag sequences that include those sequences with small truncations of the 5' end, it is best to search using exact genome coordinates, with the 5' coordinate selected for the greatest truncation you are willing to accept.
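
The difference between the two searches can be sketched in a few lines of Python. This is an illustrative model only. It assumes each sequence is described by the first and last HXB2 position it covers, approximates sequence length by that span, and uses the HXB2 gag coordinates 790-2292, which are not stated in this FAQ. The record names are hypothetical.

```python
# HXB2 coordinates of the gag gene; an assumption added for this sketch.
GAG_START, GAG_END = 790, 2292
# The FAQ defines "complete genome" as any sequence longer than 7000 bases.
COMPLETE_GENOME_MIN = 7000


def is_complete_genome(start, end):
    """True if the HXB2 span start..end (1-based, inclusive) exceeds 7000 bases."""
    return end - start + 1 > COMPLETE_GENOME_MIN


def covers(start, end, region_start, region_end):
    """True if the sequence spans the whole region."""
    return start <= region_start and end >= region_end


# Hypothetical records: name -> (first, last) HXB2 position covered.
records = {
    "full_length": (1, 9719),
    "gag_truncated": (830, 9600),   # missing the first 40 bases of gag
}

complete = [n for n, (s, e) in records.items() if is_complete_genome(s, e)]
full_gag = [n for n, (s, e) in records.items() if covers(s, e, GAG_START, GAG_END)]
# Relaxing the 5' coordinate to 850 accepts truncations of up to 60 bases.
relaxed_gag = [n for n, (s, e) in records.items() if covers(s, e, 850, GAG_END)]
```

Both records count as "complete genome", only one passes the full-length Gag test, and both are found again once the 5' coordinate is relaxed.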

 

Sequence annotation

What is different about HIV Database accession entries and GenBank entries?

Only the author (or owner) of an entry can update or modify the GenBank entry. Our database includes fields and comments that we add ourselves in GenBank-style entries. The fields we add are usually not reviewed by the authors. For example, we might add the health status of the patient, the date of sampling, the patient risk group, the phenotype of a viral culture from which a sequence is derived, subtype information, or additional references to a specific sequence entry. These added comments and fields come from our reading of the literature or analysis. Often, GenBank entries are not updated by their authors after the initial submission, but subsequent publications provide new and important information that pertains to a particular sequence, and we try to link this information.

Why is added information available for some but not all sequences?

Most information annotated to sequences is derived from publications and entered manually by our staff. As this is a time-consuming process, not all sequences are annotated. Some fields are blank because published papers do not always provide additional information.

How does your database enter the country information?

GenBank requests this information for new HIV sequence submissions, so almost all entries now have this information. Please note that we distinguish between 'sampling country' and 'infection country'. This distinction can be important when, for example, a Somalian immigrant lives in Sweden and gets tested there: the sampling country is Sweden, but the likely infection country is Somalia. Filling out the infection country field is a bit of a judgement call, so this field should be regarded as 'likely infection country'.

What are the abbreviations for each country?

We represent the country by a two-letter country code based on the international naming convention (ISO 3166). These two-letter codes are intuitive and short (for example, UG = Uganda and JP = Japan), so they can be easily linked to a sequence name for more informative representation in alignments and phylogenetic trees.

How reliable is information about risk group, infection date, country, etc.?

We try to only include information that is 'very likely or certain' to be true. This means, for example, that when a dual risk group is listed we do not include risk group information. If a paper states someone was probably infected in country X or country Y, we include that information only as a note. When someone was infected 'between 1989 and 1991' we do not include an infection year.

Why is it that the comment lines in the Los Alamos database accession entries are not always smooth reading?

When we created our relational database, we combined comment lines that were linked to sequences by accession number from several sources in the older versions of our database. Thus information may be repeated. There are tens of thousands of entries in the database, and we felt it was more important to get all the information than to have it read smoothly.

Can I submit HIV sequences directly to your database?

No. Sequences must be deposited to one of the major sequence databases: GenBank, EMBL, or DDBJ. All HIV and SIV sequences deposited will automatically enter our database within approximately one month from their public release by any of these databases.

Prior to submission, we recommend that all HIV-1 sequences be run through the Quality Control tool. This tool will help you catch common HIV-1 sequence problems. It will also help you prepare the sequences for deposit to GenBank, if you wish.

I have some patient/sequence data related to HIV sequences I deposited to GenBank. Can I send you this information to add to your database?

Yes! Please contact us for details.

 

Subtypes and Recombinants

What are M, N, O, P and CPZ sequences?

"M" is the main group of viruses in the HIV-1 global pandemic, and it contains multiple subtypes. N, O, and P are very distinctive forms of the virus originating from different transmissions from other primates into humans. CPZ are the primate viruses isolated from chimpanzees, which are the non-human primate viruses most closely related to HIV-1. For additional information, see HIV and SIV Subtype Nomenclature.

What are subtypes?

Subtypes are phylogenetically associated groups of HIV-1 or HIV-2 sequences. Sometimes the word "clade" is used to mean subtype. The sequences within any one subtype are more similar to each other than to sequences from different subtypes. These subtypes represent different lineages of HIV, and have some geographical associations. There are many ambiguities in the subtyping system; however, it describes genetic clustering patterns and provides a useful system for organizing viruses by genetic similarity. This topic is explained in detail in HIV and SIV Subtype Nomenclature.

Each year we gather a set of Subtype Reference Sequences that are considered to be representative of all of the subtypes of the HIV-1 M, N, and O groups. Larger sets of HIV/SIV Alignments of each gene and complete genomes, including the subtype reference sequences, are also available.

Why can I no longer find subtype E in the nucleotide or protein alignments in the database?

Subtype E was redesignated as CRF01_AE in 1998. It was originally described as subtype E based on envelope genes from isolates from southeast Asia. When gag genes and complete genomes from these isolates were sequenced, it was found that regions of the genome other than the env gene are more similar to the A subtype, so "subtype E" turned out to be a recombinant. Small fragments in the env region are still commonly called "E" because there they do appear to be completely separate from all other subtypes. The E subtype has only been clearly defined in the env region, and the evolutionary history and the origin of this mosaic form remain controversial.

What do multiple letters representing a subtype mean?

Multiple letters indicate that the sequence is a recombinant of parental viruses originating from 2 or more clades. For example, AGH indicates that it is thought that three subtypes recombined to form the sequenced virus: A, G, and H. The subtypes are listed alphabetically. The regions of the genome that are derived from a particular subtype are not indicated by the name.

When a number appears as part of a subtype, this refers to a circulating recombinant form (see next question).
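
The naming rule for multi-letter subtypes can be sketched in Python as follows. The function name is hypothetical and this is only an illustration of the alphabetical-listing convention described above, not the database's classification code.

```python
def recombinant_label(parental_subtypes):
    """Join the distinct parental subtypes in alphabetical order.

    The label records which subtypes are present, not which genome
    regions derive from each of them.
    """
    return "".join(sorted(set(parental_subtypes)))


label = recombinant_label(["G", "A", "H"])   # "AGH"
```

Because the label is order-independent and ignores duplicates, the same parental subtypes always produce the same name regardless of where they occur in the genome.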

What is a CRF?

CRFs are viruses whose complete genome has been shown to be recombinant or mosaic, consisting of some regions which cluster with one subtype and other regions of the genome which cluster with another subtype in phylogenetic analyses. CRFs are numbered sequentially in the order in which they are reported in the literature, starting with CRF01_AE, which is the new name of what used to be subtype E. The name of the isolate which was first sequenced and described is used to indicate the prototype of that CRF. This is done because there can be many different recombinant genomes containing the same subtypes, but only some of them have the same recombination breakpoints, and are apparently derived from the same common ancestor. In order to classify a recombinant as a circulating recombinant form, it must be found and sequenced in at least 3 patients who were not directly epidemiologically linked. The structure of all defined CRFs can be found here: HIV-1 Circulating Recombinant Forms.

How does the HIV database classify sequences and recombinants?

This topic is addressed in How the HIV Database Classifies Sequences.

Sometimes a sequence is labeled as a recombinant, but seems to be a pure subtype. Why?

The subtype shown may reflect the subtypes of 2 or more fragments from the same sample. For example, if we have env sequences of subtype A and gag sequences of subtype C from the same patient, we will usually label all sequences "AC", unless the authors specifically mention that the person was dually infected.

Why are subtypes specified for sequences that are gene fragments when they might be embedded in a recombinant genome?

When a short region of sequence has a subtype designation, one should be aware that the subtype designation often refers only to that fragment of sequence, and the virus that it is derived from may be recombinant.

There are some exceptions. Sometimes the sequence is known to come from an isolate from which other fragments have also been sequenced; in that case, we try to indicate both subtypes. For example, if we have an env sequence that is subtype A and a gag sequence that is subtype B, we try to assign 'AB' to both. However, because of the manual effort involved, we don't always manage to do this consistently.

 

Alignments

What is a "consensus sequence" and how is it made?

A consensus sequence is a sequence of the most common nucleotide or amino acid at each position in an alignment. We generally use a 50% cut-off: if at least 50% of the sequences share the same character at a position, that character is used; otherwise the position is shown as a question mark (CONS.a in the example below). Another way to create a consensus is to take the most frequently occurring character, even if it is not the majority (CONS.b in the example below).

CONS.a  ACG?A?CAT?CTATCAGT  
CONS.b  ACGTAGCATACTATCAGT 
------  ------------------ 
SEQ1    ACGTAGCATGCTATCAGT 
SEQ2    ACGTAGCATGCTATCAGT 
SEQ3    ACGTACCATCCTATCAGA 
SEQ4    ACGAAACATCCTATCAGT 
SEQ5    ACGAATCATACTATCAGT 
SEQ6    ACGGATCATACTATCAGT 
SEQ7    ACGCACCATACTATCAGT 

Consensus sequences are built from an alignment. The alignment itself might be dominated by one type of sequence, such as subtype B sequences from the United States. So in general a consensus sequence is not the same as the common ancestor of the sequences, although in some cases it can approximate an ancestral sequence.
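
The two consensus rules described above can be sketched in Python as follows. This is a minimal illustration rather than the code behind the Consensus Maker tools. The function name, the `cutoff` parameter, and the tie-breaking behaviour (the character seen first among equally frequent ones wins) are assumptions of this sketch.

```python
from collections import Counter


def consensus(sequences, cutoff=0.5, unknown="?"):
    """Build a consensus from equal-length aligned sequences.

    cutoff=0.5 keeps a character only if at least half of the sequences
    share it (the CONS.a rule); cutoff=0.0 always takes the most frequent
    character (the CONS.b rule).
    """
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*sequences):
        # most_common(1) returns the first-seen character when counts tie
        char, count = Counter(column).most_common(1)[0]
        result.append(char if count / len(column) >= cutoff else unknown)
    return "".join(result)


example = [
    "ACGTAGCATGCTATCAGT",
    "ACGTAGCATGCTATCAGT",
    "ACGTACCATCCTATCAGA",
    "ACGAAACATCCTATCAGT",
    "ACGAATCATACTATCAGT",
    "ACGGATCATACTATCAGT",
    "ACGCACCATACTATCAGT",
]
cons_a = consensus(example)              # 50% cut-off
cons_b = consensus(example, cutoff=0.0)  # most frequent character
```

Applied to the seven example sequences, `cons_a` and `cons_b` reproduce the CONS.a and CONS.b lines shown above. Gap characters are treated like any other character in this sketch.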

To make a consensus from your own sequences, we provide Consensus Maker Tools. We also provide premade HIV-1 Subtype Consensus Sequences.

What are the Intelligenetics, Mase, FastA, and other sequence formats?

Many sequence analysis tools use multiple sequence alignments as the input data for analysis. There are many common sequence formats, and it is relatively simple to convert from one format to another. Most program packages provide scripts or programs to aid in file format conversion. Most of our tools accept multiple formats, and most of our output files are FastA format. We also provide a Format Converter tool. If your sequences are in a format that our programs cannot handle, there are other web-based sequence conversion tools available.

Which alignment is best for my purpose?

We provide a variety of alignment sources. These include both premade alignments, and tools you can use to produce an alignment of your own sequences.

What are all these strange symbols in the alignments?

Alignments often contain symbols and ambiguity codes. See: Codes and Symbols in Sequence Alignments.

How can I make a printable alignment for publication?

Try SeqPublish.

 

Tools

Where can I get an overview of all of your tools and what they do?

The Tools Index lists all of our tools, with brief descriptions of what they do. We also provide links to relevant tools on other websites in our list of External Tools.

I tried one of your tools and it failed. What should I do?

First, read the information on the web page for that tool, including the Explanation file, if available. Pay particular attention to the input format of your data. Is your data in one of the common sequence formats? Do your sequences need to be aligned (or codon-aligned) for this tool? Do your sequences (or sequence names) contain spaces, line breaks, or any non-standard characters? Does the tool work using the Sample Input? If you cannot find the source of the problem, please contact us for help. Send your input file, if possible. Sometimes our tools break, and we are very glad to be informed about the problem. We are happy to help you with troubleshooting your specific problem.
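
Some of these input checks can be run locally before submitting data. The Python sketch below assumes FastA input; the set of "standard" name characters (letters, digits, underscore, period, hyphen) is an assumption of this sketch and not a rule published by the database, and the function names are hypothetical.

```python
import re

# Assumed definition of a "safe" sequence name; adjust to the tool you use.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.\-]+")


def read_fasta(text):
    """Parse FastA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(text):
    """Return a list of human-readable problems; an empty list means none found."""
    records = read_fasta(text)
    if not records:
        return ["no FastA records found"]
    problems = []
    for name, _ in records:
        if not SAFE_NAME.fullmatch(name):
            problems.append(f"{name!r}: name contains spaces or non-standard characters")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append("sequences differ in length, so they are not aligned")
    elif lengths.pop() % 3 != 0:
        problems.append("alignment length is not a multiple of 3, so it is not codon-aligned")
    return problems
```

A length that is a multiple of three is only a necessary condition for a codon alignment, so passing this check does not guarantee that a tool requiring codon-aligned input will accept the file.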

I am running a large data set, and the tool (or browser) is timing out before the results come back. What can I do?

Please contact us. We may be able to get your results for you and send them to you. We may also be able to add a mailback feature to the tool to allow larger runs.

How can I determine the subtype/clade of my HIV sequences?

There are many tools that can help you identify the subtype/clade. Each tool has advantages and disadvantages, so it's a good idea to try more than one.

How do I find where my sequence is located in a gene or protein (for example, where are the boundaries of a PCR primer or a CTL epitope)?

We have 2 tools to help you localize a sequence. One is HIV/SIV Sequence Locator, which determines the position number boundaries of a stretch of sequence, or alternatively identifies the stretch of sequence which corresponds to specified positions. You can also use this tool to identify which coordinates you need to use to find the region in the database that corresponds to your sequence.

The QuickAlign tool will also give you the reference coordinates of your sequence. In addition, it will show you the alignment of your sequence with a large number of sequences from the database.
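
To show what "position number boundaries" means, here is a deliberately simplified Python sketch. It assumes the fragment matches the reference exactly, whereas the tools above are described as working with real, divergent sequences and therefore have to do more than an exact search. The function name and the toy reference are hypothetical.

```python
def locate(fragment, reference):
    """Return the 1-based, inclusive (start, end) of the first exact match
    of fragment in reference, or None if there is no exact match."""
    if not fragment:
        raise ValueError("fragment must not be empty")
    index = reference.upper().find(fragment.upper())
    if index == -1:
        return None
    return index + 1, index + len(fragment)


toy_reference = "ACGTACGT"
coords = locate("gta", toy_reference)   # (3, 5)
```

The returned coordinates are the kind of start and end positions that can then be entered in the search interface to retrieve the corresponding region.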

Is it possible to get the group at the Los Alamos database to modify programs or write additional code?

If you have an analysis question and our code doesn't give you the right output, we will try to adapt the code to your needs if we can. Write to the e-mail address below and let us know. If you have something you would like to do to analyze HIV sequences, and can't find the computer code you need to do it, write to us. We will consider writing the programs if we feel they will be generally useful, or we may be able to point you in the right direction if we are aware of code that already exists.

Can I obtain one of your web tools as a script to run locally?

Maybe. Our tool scripts are built to run as web tools, so running them locally usually requires some modification and/or additional scripts. We can provide scripts for many of our tools, but they may require considerable additional work on your part. If you are interested, contact us.

last modified: Thu Oct 24 11:34 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC. All rights reserved.