HIV sequence database

HELP for the Search Interface




Sequence Information


Upload a text file of accession numbers, one accession per line.

To search for a range of accession numbers, use the format X12345..X23456.
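As an illustration of the range format, the sketch below splits a query such as X12345..X23456 into its endpoints and tests whether an accession falls inside it. The function names are hypothetical, and the comparison rule (same letter prefix, numeric part between the endpoints) is an assumption; this help page does not specify how the database expands a range.

```python
import re

def split_accession(acc):
    """Split an accession such as 'X12345' into ('X', 12345)."""
    m = re.fullmatch(r"([A-Za-z_]+)(\d+)", acc.strip())
    if m is None:
        raise ValueError(f"unrecognized accession: {acc!r}")
    return m.group(1).upper(), int(m.group(2))

def parse_range(query):
    """Parse 'X12345..X23456' into (prefix, low, high)."""
    start, end = query.split("..")
    p1, n1 = split_accession(start)
    p2, n2 = split_accession(end)
    if p1 != p2:
        raise ValueError("range endpoints must share a prefix")
    return p1, min(n1, n2), max(n1, n2)

def in_range(acc, query):
    """Assumed rule: same prefix and numeric part within the endpoints."""
    prefix, low, high = parse_range(query)
    p, n = split_accession(acc)
    return p == prefix and low <= n <= high
```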

You can also enter a list of space-separated accessions.

Sequence name

This is usually the isolate or clone name for a sequence, and may be the way a sequence is referred to in publications. This field also searches the GenBank Locus Name field.

Sequence length

The length of the nucleotide sequence in base pairs.

Sampling year

The year in which the sample was obtained. If the year of sampling is not specified exactly by the authors, the data in the database may be a range of years. You can choose whether or not to include these data by using the "exact" checkbox.


When the "exact" box is checked, your search will return only sequences sampled in the exact year(s) specified. When unchecked, your search will return sequences sampled within any range of years that includes the year(s) specified.
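The effect of the "exact" checkbox can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes each sequence stores its sampling year as a (start, end) pair, with both values equal when the year is known exactly.

```python
def year_matches(record_start, record_end, query_year, exact):
    """Return True if a record's sampling year (or year range) matches the query.

    record_start / record_end: assumed storage of the sampling year range;
    the two are equal when the authors gave an exact year.
    """
    if exact:
        # Only records with a single, exactly known year can match.
        return record_start == record_end == query_year
    # Otherwise any range that includes the query year matches.
    return record_start <= query_year <= record_end
```

For example, a sequence annotated as sampled in 1994-1996 matches a search for 1995 only when "exact" is unchecked.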

Sampling country

The country in which the sample was taken. We use 2-letter Country Codes.


The organism sequenced (e.g., HIV-1, HIV-2, SIV). The choice of virus will determine the choices available in the Subtype field.


In this field, you can search on multiple subtypes by clicking the ones you want. To select non-adjacent fields, use 'ctrl-click' instead of 'shift-click'. Note that if your search is limited to a specific genomic region, you may bring up some recombinant sequences that are not of the selected subtype in that region.

Include recombinants

Select the 'include recombinants' checkbox if you want to include recombinants of the chosen subtype(s). You must have one or more subtypes selected. The output will include both CRFs and URFs that include segment(s) of the chosen subtype. This option works well for finding sequences that have segments of the pure subtypes (A-L), but less well for finding sequences that include segments of specific CRFs. For finding recombinants of CRFs, we recommend using Advanced Search.

For more information on the subtype and CRF classifications, see:
Overview of HIV-1, HIV-2, and SIV subtype nomenclature
How the HIV database classifies sequence subtypes
Overview of primate immunodeficiency viruses lists SIV subtypes
Circulating Recombinant Forms lists currently recognized CRFs
HIV-1 M group nomenclature (1999)



More Sequence Information

SE id

A unique identifying number assigned sequentially to each sequence as it is imported into the LANL HIV Database.

GB Create Date (YYYY)

The year in which the sequence was entered in GenBank.

Isolate name

This field is usually the same as the "sequence name".

Clone name

The clone number of the sequence. This field is only used for sets of cloned samples.

Sample tissue

The tissue from which the sample was derived. Categories include: plasma, PBMC, blood, brain, CSF, semen, cervix, feces, etc.

Culture method

For samples derived from PBMC. Categories are: cultured, uncultured, primary, expanded, co-cultured.

Drug naive

Check box to select only sequences that were sampled prior to the patient receiving any drug treatment. Sequences are annotated as drug naive only when there is certainty that the patient has not been treated. If you want to select for sequences from drug-treated (non-naive) patients, you will need to use the Advanced Search.


When you search on "Comment", the comment fields in both the sequence entries and the patient records will be searched.

Coreceptor and phenotype

These fields are annotated based on biological data only, not based on presumed usage inferred from sequences. For information about these fields, see articles:
Biological and Molecular Aspects of HIV-1 Coreceptor Usage
Coreceptor Use by Primate Lentiviruses

RIP Subtype (Advanced Search only)

The Advanced Search includes a field that stores a precalculated RIP subtype, automatically generated by the Recombinant Identification Program (RIP). These RIP results are generated using default parameters, using background sequences for subtypes A-K and CRF01_AE. The window size used is 400 for sequences >600 bp, and 300 for sequences 350-599 bp. No result is available for sequences <350 bp.
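The length rule above can be written as a small lookup. The function name is hypothetical, and the treatment of a sequence of exactly 600 bp (which the text leaves unspecified) is an assumption.

```python
def rip_window_size(length_bp):
    """Window size used for the precalculated RIP subtype, by sequence length.

    Returns None when no RIP result is available (<350 bp).
    Assumption: a sequence of exactly 600 bp is grouped with the longer ones.
    """
    if length_bp < 350:
        return None
    if length_bp < 600:
        return 300
    return 400
```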

The purpose of this field is to provide a heads-up for possible errors or omissions in the main Subtype field. While our main Subtype field is manually curated, the RIP subtype is not, and it is generally less reliable.

Note: the precalculated RIP results do not include most of the CRFs, so, for example, a CRF02_AG sequence will be subtyped as A1G or A1 or G. Also note that RIP is prone to artifacts, so you may need to rerun RIP and examine the graphic output in order to use this information effectively.



Find sequences for a specific gene or region

Most sequences in the database are internally pre-aligned, and their starting and ending positions are stored; these positions are compared to your region of interest. Currently, these functions are only available for HIV-1 and SIVcpz.

Genomic Region

You can limit your search to a specific genomic region of the virus; the HXB2 coordinates used by the search interface are shown below. The "complete genome" category yields all sequences over 7000 base pairs, regardless of exact coordinates. Thus, some sequences obtained as "complete genome" may lack parts of LTR, gag, or nef.

  Fragment                  HXB2 coordinates
  complete genome           any >7000 bp
  5' LTR                    1 - 634
  5' LTR R                  456 - 551
  5' LTR U3                 1 - 455
  5' LTR U5                 552 - 634
  TAR                       453 - 513
  Gag-Pol                   790 - 5096
  Gag                       790 - 2292
  p17 (matrix)              790 - 1185
  p24 (capsid)              1186 - 1878
  p7 (nucleocapsid)         1921 - 2085
  p6                        2134 - 2292
  Pol CDS                   2085 - 5096
  p51 (RT)                  2550 - 3869
  p15 (RNAse H)             3870 - 4229
  p31 (integrase)           4230 - 5096
  protease                  2253 - 2549
  Vif CDS                   5041 - 5619
  Vpr CDS                   5559 - 5850
  Tat CDS (plus intron)     5831 - 8469
  Tat exon 1                5831 - 6045
  Tat exon 2                8379 - 8469
  Rev CDS (plus intron)     5970 - 8653
  Rev exon 1                5970 - 6045
  Rev exon 2                8379 - 8653
  Vpu CDS                   6062 - 6310
  Env CDS                   6225 - 8795
  V1                        6615 - 6692
  V2                        6693 - 6812
  V3                        7110 - 7217
  V4                        7377 - 7478
  V5                        7602 - 7634
  RRE                       7710 - 8061
  gp41                      7758 - 8795
  gp120                     6225 - 7757
  Nef CDS                   8797 - 9417
  3' LTR                    9086 - 9719
  3' LTR R                  9541 - 9636
  3' LTR U3                 9086 - 9540
  3' LTR U5                 9637 - 9719

Start/End Coordinates

You can also choose to search for sequences based on your own coordinates. For example, the pulldown menu does not include the 3' LTR, but it can be obtained by specifying its coordinates. HXB2 reference coordinates for many regions of interest can be obtained from the Reference Sequence Coordinate Search.

Include fragments of minimum length __

By default, all sequences obtained from a genomic region search will span the entire region selected. If you want to include sequences that only partly cover the selected region, check the "Include fragments" box and enter a minimum length for the overlap of the included fragments. Do not use symbols such as > in this box.
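The difference between the default behavior and the "Include fragments" option can be sketched as follows. The function and parameter names are hypothetical, and the overlap calculation is an assumption about how the minimum length is applied; coordinates are inclusive HXB2 positions.

```python
def covers_region(seq_start, seq_end, region_start, region_end, min_fragment=None):
    """Decide whether a sequence is returned for a genomic region search.

    With min_fragment=None (the default behavior), the sequence must span
    the entire region. Otherwise it only needs to overlap the region by at
    least min_fragment bases (assumed interpretation of "minimum length").
    """
    if min_fragment is None:
        return seq_start <= region_start and seq_end >= region_end
    overlap = min(seq_end, region_end) - max(seq_start, region_start) + 1
    return overlap >= min_fragment
```

Using the V3 coordinates from the table (7110 - 7217), a sequence covering 7150 - 7300 is excluded by default but is included with a minimum fragment length of 50, since it overlaps V3 by 68 bases.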

One sequence/patient

When searching by genomic region, your results page will include the option to select "One sequence/patient". This option will select, for each patient, the first sequence that appears on the results page. (Since results are sorted alpha-numerically by accession, the selected sequence will be the first accession in alpha-numeric order.) This function is dependent on the correct database linkage of sequences by patient; some newly-received sequences may not yet be linked, and some patient identities between studies may have been missed. You may need to use additional tools to remove closely-related sequences.
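A minimal sketch of this selection, assuming the results are available as (accession, patient id) pairs. The function name is hypothetical, plain string sorting stands in for the alpha-numeric ordering of the results page, and keeping every sequence that has no patient link is an assumption, not documented behavior.

```python
def one_sequence_per_patient(records):
    """records: iterable of (accession, patient_id); patient_id may be None."""
    seen = set()
    selected = []
    for accession, patient_id in sorted(records, key=lambda r: r[0]):
        if patient_id is None:
            # Assumption: sequences not linked to a patient are all kept.
            selected.append(accession)
            continue
        if patient_id not in seen:
            # First accession (in sorted order) seen for this patient.
            seen.add(patient_id)
            selected.append(accession)
    return selected
```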



Combine database sequences with your own sequence alignment

When you enter a sequence alignment here, you will limit your database search to the genomic region of your input sequences. Your sequences must already be aligned, and it is helpful if they are all approximately the same length.

Ragged ends

When you paste or upload an alignment, the interface normally defines the genomic region of your search by taking the genome coordinates of the first sequence in your alignment. If you choose "ragged ends", the interface will determine the coordinates of all the sequences in the alignment, and it will use the lowest 5' coordinate and the highest 3' coordinate. Choosing this option will prevent the interface from using the wrong coordinates in cases where your first sequence is shorter than the others. However, this option makes searches significantly slower.
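The two ways of deriving the search region can be sketched as follows, assuming the genome coordinates of each aligned sequence have already been determined. The function name and the input representation are hypothetical.

```python
def search_region(coords, ragged_ends=False):
    """coords: list of (start, end) genome coordinates, in alignment order.

    Default: use the coordinates of the first sequence only.
    Ragged ends: use the lowest 5' coordinate and the highest 3' coordinate
    found across all sequences in the alignment.
    """
    if not ragged_ends:
        return coords[0]
    return (min(start for start, _ in coords), max(end for _, end in coords))
```

If the first sequence in the alignment is shorter than the others, the default region is narrower than the alignment as a whole, which is the case the "ragged ends" option is meant to handle.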



Publication Information

Publication ID

The publication ID is a unique number assigned to each publication by the HIV Database.

Author Last Name

This search assumes an 'and', so if you search for 'smith jones' you will retrieve all sequences for which both Smith and Jones are in the author list. Do not include initials or first names. Author names are taken directly from GenBank; we do not correct mistakes in the sequence records.
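The implied 'and' can be illustrated with the sketch below. The function name is hypothetical, and whole-name, case-insensitive matching is an assumption; the database may match names differently.

```python
def authors_match(query, author_last_names):
    """Return True only if every space-separated term is in the author list."""
    terms = query.lower().split()
    names = {name.lower() for name in author_last_names}
    return all(term in names for term in terms)
```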

PubMed ID

This field restricts your search to sequences from a published paper specified by its PubMed ID.

Title and Journal

Search with any word or set of words from the Title of the paper or from the name of the Journal itself.



Patient Information

Patient id

A unique number, assigned by this database, that links sequences from a single, unique patient.

Patient code

The patient identifier is displayed in searches as a two-part number, for example "P1(19555)". The first part is the code name or number by which the patient is identified in publication(s). The second part is an internal number assigned by our database, the patient ID. A patient code such as "P1" may refer to more than one patient, but the sequence records associated with "19555" are specific to a single patient.
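If you process downloaded results, the two parts of the displayed identifier can be separated as in the sketch below. The function name is hypothetical, and the pattern assumes the label always ends with the numeric patient ID in parentheses, as in the example above.

```python
import re

def parse_patient_label(label):
    """Split a label such as 'P1(19555)' into (patient_code, patient_id)."""
    m = re.fullmatch(r"(.+)\((\d+)\)", label.strip())
    if m is None:
        raise ValueError(f"unexpected patient label: {label!r}")
    return m.group(1), int(m.group(2))
```

Grouping sequences by the numeric patient ID rather than by the code avoids merging different patients that share a code such as "P1".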

Not all sequences in the database have an assigned patient code/id; if your search for a patient code fails, try entering the code in the "Sequence Name" field (sequence names often contain a patient code).

The search algorithm used for this field is different from the other fields, in that a space (for example, "Patient 2") is not interpreted as AND. Instead, the entire string, including spaces, is used as the search term.

Risk factor

The risk factor describes the risk activity by which the patient most likely was infected. Dual risk factors are not recorded. The risk factor must be established with reasonable certainty to be recorded in this field.

SG - homosexual
SB - bisexual
SM - male sex with male
SH - heterosexual
SW - sex worker
SU - sexual transmission, unspecified type
PH - hemophiliac
PB - Blood transfusion
PI - IV drug use
MB - Mother-baby
NO - Nosocomial
EX - Experimental
NR - not recorded (or unknown)
OT - other

Infection year

This is the year in which the patient was infected. The year is only recorded when it is known with some certainty.

HLA information

When the checkbox is selected, you will get only sequences from patients with any known HLA data, and this HLA information will be displayed.

The HLA field can be searched for specific HLA types using the Advanced Search. To search the field, enter a space-separated list. Wildcard searches using * do not work because HLA data often contain the * character.

Patient sex

Categories are: M and F.

Days from seroconversion

The number in this field indicates the estimated number of days between the patient's seroconversion and the date the sample was taken for sequencing. For samples taken before seroconversion, negative numbers are used. If the source data were given in weeks or months, these numbers have been converted to days.

Please note: Days from infection or seroconversion are almost always estimates, and different studies use different methods and definitions. In many studies these estimates are very rough. We have attempted to translate these values into a single system for study cross-comparisons, but please use these fields with caution; go back to the original papers to confirm the study-specific timing definitions.

In cases where studies give data that is vague, but possibly useful, a text entry may appear in this field. The following text entries may appear:

“Pre-seroconversion”: sample was taken before seroconversion, but the exact number of days is unknown.
“Early”: <1 year after seroconversion
“Late”: ≥2 years after seroconversion
“No data”: the same meaning as a blank field

Days from infection

The number in this field indicates an estimate of the number of days from the time the patient was infected with the virus until the sample was taken for sequencing. Post-infection dates are relatively rare. Most often they are known when a patient seeks medical treatment for acute illness shortly after having a sexual encounter with a stranger. We use this field when the primary author presents the data as post infection in the original citation. Please note: Days from infection or seroconversion are always estimates, and different studies use different methods and definitions.

If you are interested in sequences from a particular timepoint relative to infection or seroconversion, it may be wise to perform two separate searches, one on the ‘days from seroconversion’ field and one on the ‘days from infection’ field, as most data are recorded in one field or the other, not both. For example, if you want sequences that are either <90 days post-infection or <30 days post-seroconversion, two separate searches are needed.


If the patient was enrolled in a named project or cohort, it is recorded in this field.

Patient health

The health status of the patient at the time of sampling. Categories are: acute infection, asymptomatic, chronic, symptomatic, AIDS, and deceased.

Vaccine status

The patient's exposure to vaccination at the time of sampling. Categories are: preventative placebo, preventative breakthrough, therapeutic placebo, therapeutic pre-treatment, and therapeutic post-treatment.



More Patient Information

Patient age

The age of the patient in integer days when the sample was taken. You can use y for year and m for month. For example, to select for sequences from patients under 18 years, enter either "<6575" or "<18y".
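The equivalence of "<6575" and "<18y" can be reproduced with the sketch below. The function name is hypothetical. The conversion factors (365.25 days per year, one twelfth of that per month, rounded up) are assumptions chosen to agree with the example above, and the set of accepted comparison operators is also assumed.

```python
import math
import re

DAYS_PER_YEAR = 365.25               # assumption: gives 18y -> 6575 days
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # assumption

def age_query_to_days(query):
    """Convert an age query such as '<18y' into (operator, days)."""
    m = re.fullmatch(r"\s*(<=|>=|<|>|=)?\s*(\d+)\s*([ymYM]?)\s*", query)
    if m is None:
        raise ValueError(f"unrecognized age query: {query!r}")
    op = m.group(1) or "="
    value = int(m.group(2))
    unit = m.group(3).lower()
    if unit == "y":
        days = math.ceil(value * DAYS_PER_YEAR)
    elif unit == "m":
        days = math.ceil(value * DAYS_PER_MONTH)
    else:
        days = value  # already in days
    return op, days
```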

Viral load

The plasma viral load in units of copies/ml of plasma. A viral load of "1" indicates that the viral load was below the limit of detection (usually <50 copies/ml).

CD4 count

The CD4+ T-cell count at the time of sampling, in absolute counts of cells/µl.

CD8 count

The CD8+ T-cell count at the time of sampling, in absolute counts of cells/µl.


The rate of disease progression of the patient, if recorded by the study. In most cases, the authors' definition is used, although the definitions may vary by study. Categories are:

RP - rapid progressor
P - normal progressor
SP - slow progressor
LTNP - long-term non-progressor
EC - elite controller

# patient sequences

This field can limit your search to patients with multiple sequences in the database. For example, if you enter ">9" in this field, your output will include only sequences from patients with 10 or more sequences in the database. This option is particularly useful for searches of intrapatient sequence sets. For more options for intrapatient searches, see Intrapatient Search Interface.
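The same filter can be applied to a downloaded result set, as in the sketch below. The function name and the (accession, patient id) input format are hypothetical, and counts here are taken over the supplied records, not over the whole database.

```python
from collections import Counter

def filter_by_patient_sequence_count(records, more_than):
    """Keep accessions whose patient has more than `more_than` sequences.

    records: list of (accession, patient_id) pairs.
    """
    counts = Counter(patient_id for _, patient_id in records)
    return [acc for acc, patient_id in records if counts[patient_id] > more_than]
```

Calling it with more_than=9 mirrors entering ">9" in the search field.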

# patient timepoints

This field can limit your search to patients with multiple time points. For example, if you enter ">2" in this field, your output will include only sequences from patients with sequences in the database that contain information on at least 3 of the following: "Days from first Sample", "Days from treatment start", "Days from treatment end", "Days from Seroconversion" or "Days from Infection". This option is particularly useful for searches of intrapatient sequence sets. For more options for intrapatient searches, see Intrapatient Search Interface.

Cluster name

A cluster is a group of two or more epidemiologically-linked patients. A cluster ID links two or more patient IDs in the database. Each cluster is assigned a name, which is not necessarily unique. (For example, there may be more than one "chain1" cluster in the database.) Clusters are assigned only when both the publication and the sequences themselves indicate epidemiological linkage of the patients.

Cluster transmission type

The cluster transmission type describes the mode of transmission of the virus among all patients in the cluster. For example, clusters with Heterosexual transmission are pairs or chains of patients linked by heterosexual transmission of the virus. In many cases, a cluster will have more than one transmission type. For example, a cluster consisting of a heterosexual couple and their infected child would have both Heterosexual and Mother->Child transmission types.

Cluster comment

The cluster comment field gives information about the cluster of epidemiologically-linked patients.

Fiebig stage

Fiebig stage is a staging system for early HIV infection. This field can be searched from the Advanced Search and the Intrapatient Search interfaces.

Fiebig stage              Duration in days (range)   Cumulative duration (range)
Eclipse                   10 (7,21)                  10 (7,21)
1 (vRNA+)                 7 (5,10)                   17 (13,28)
2 (p24Ag+)                5 (4,8)                    22 (18,34)
3 (ELISA+)                3 (2,5)                    25 (22,37)
4 (Western Blot +/-)      6 (4,8)                    31 (27,43)
5 (Western Blot +, p31-)  70 (40,122)                101 (71,154)
6 (Western Blot +, p31+)  open-ended
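The central values in the cumulative column are running sums of the per-stage durations, which the sketch below reproduces. The variable and function names are hypothetical. The ranges in parentheses are not simple sums of the per-stage ranges, so only the central values are computed.

```python
from itertools import accumulate

# Central duration in days for each stage, taken from the table above.
FIEBIG_DURATIONS = [
    ("Eclipse", 10),
    ("1", 7),
    ("2", 5),
    ("3", 3),
    ("4", 6),
    ("5", 70),
]

def cumulative_durations(stages):
    """Map each stage name to the cumulative days through the end of that stage."""
    names = [name for name, _ in stages]
    totals = accumulate(days for _, days in stages)
    return dict(zip(names, totals))
```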

References for Fiebig staging system:

  • Fiebig et al. 2003. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17(13): 1871-1879.
  • Keele et al. 2008. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 105(21):7552-7.



Geographical Information

Sampling country

Although this is geographical information, the search field is located at the top under Sequence Information for convenience.

Sampling city

The city/province/state/region in which the sample was obtained.

Infection country

This field records the country in which the patient was infected. This field is filled in only when the infection country differs from the sampling country and the infection country is known specifically and with high certainty.

Infection city

If the infection city is known to be different from the city where the sample was collected, that is noted here. This field rarely contains data.

Geographic region

This is a way to retrieve all sequences from (for example) the African continent without having to search for each country separately. For a list of countries included in each region, see:
Definitions of the HIV Database geographic regions



Amino Acid Motif Search

Restricts the search to sequences containing a specified amino acid motif, such as YCVHQRIEIKDTK. The output can be downloaded as nucleotides or amino acids.

Boolean searches with "and" and "or" work normally. Like other fields, a space means "and". So the query "KKE ESK" returns sequences that contain both KKE and ESK. In contrast, the query KKE*ESK will return sequences that contain both tripeptides, but in the specified order (KKE 5' of ESK).

To search for a motif where amino acids are separated by an exact number of residues, use underscore. For example, if you want to find sequences with the HLA-A1 motif xxDExxxxxY, enter the query __DE_____Y
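To check a motif query against your own protein sequences, the syntax described above can be approximated with regular expressions. This is a sketch under stated assumptions: the function names are hypothetical, "_" is taken to mean exactly one residue, "*" any number of residues (including none), and a space means "and". The explicit "and"/"or" keywords are not handled.

```python
import re

def motif_to_regex(motif):
    """Translate one motif term into a compiled regular expression."""
    parts = []
    for ch in motif:
        if ch == "_":
            parts.append(".")    # exactly one residue of any kind
        elif ch == "*":
            parts.append(".*")   # any stretch of residues, order preserved
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def motif_query_matches(query, protein):
    """Space-separated terms must all be found somewhere in the protein."""
    return all(motif_to_regex(term).search(protein) for term in query.split())
```

With this translation, "KKE ESK" matches a sequence containing both tripeptides in either order, while "KKE*ESK" matches only when KKE comes before ESK.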


Restricts the motif search to a particular gene for which amino acid information is available.



Output Options

Problematic sequences

This field marks sequences that users usually want to exclude from a retrieval. Our default excludes these sequences from searches, but users may choose to include them if desired. Our criteria are very conservative, so that we have very few false positives. Thus there are some unlabeled sequences that are still problematic, and you still need to check for problem sequences!

  1. N: Non-ACTG characters

    High content of non-ACTG characters, meeting one of the following criteria:

    • more than 100 consecutive non-ACTG characters
    • >7% non-ACTG characters for sequences of length <1000
    • >5% non-ACTG characters for sequences of length 1000-2999
    • >3% non-ACTG characters for sequences of length 3000 or above.

    While direct sequences will naturally contain some IUPAC ambiguity characters, sequences annotated as N have such a high fraction that multiple alignment programs and other analysis programs have trouble with them. All incoming data are automatically screened.

  2. C: Contaminant

    Likely contamination with a laboratory strain. If a major part of a study set is contaminated, we may label the full set with C. In other cases, we choose a very conservative (high) level of similarity before we mark a particular sequence C. Different genomic regions have different standards; for example it is harder to detect potential contamination in pol than it is in env. In short sequences, it is particularly difficult to say with certainty if a sequence is a contaminant or not. Contaminants cannot be reliably annotated through an automatic screen, so many potential contaminant sequences are still unmarked.

  3. H: Hypermutant

    We screen all incoming sequences for extreme cases of G->A hypermutation, and mark these sequences with H. Hypermutated sequences can carry substitutions not found in viable viruses, so such sequences alter phylogenetic tree branch lengths and complicate the determination of appropriate evolutionary models. For additional information about hypermutation, see the Hypermut Tool.

  4. S: Synthetic

    A synthetic sequence does not represent a naturally-occurring viral sequence. There are many ways that this can occur, including:

    • sequences containing non-HIV/SIV components
    • sequences altered to change codon usage
    • patent sequences for which we cannot determine their origin
    • sequences where the author has accidentally concatenated two sequences into one
    • sequences where the author has accidentally produced a DNA sequence reverse-translated from protein

  5. D: Deletion

    A sequence containing an artifactual deletion of >100 nucleotides. These sequences often occur when an author puts together 2 sequences from a single sample (for example, a protease and an RT sequence), but omits some intervening sequence. Sequences that represent viruses with naturally-occurring deletions are not annotated in this category.

  6. T: Tiny

    A tiny sequence (< 50 bp).

  7. R: Reverse complement

    A sequence that was deposited as its reverse complement.

% non-ACGT

Percentage of non-ACGT characters in the nucleotide sequence.
Example: to restrict your search to sequences with less than 0.5% non-ACGT character content use <0.5 in the input box.
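To pre-screen your own sequences, the percentage of non-ACGT characters and the thresholds listed under "N: Non-ACTG characters" can be sketched as follows. The function names are hypothetical, and counting every character other than A, C, G, and T (including any gap characters) is an assumption; the database's own screen may differ.

```python
import re

def percent_non_acgt(seq):
    """Percentage of characters in seq that are not A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return 0.0
    non_acgt = sum(1 for ch in seq if ch not in "ACGT")
    return 100.0 * non_acgt / len(seq)

def is_flagged_n(seq):
    """Apply the length-dependent 'N' criteria listed above."""
    seq = seq.upper()
    # More than 100 consecutive non-ACGT characters.
    if re.search(r"[^ACGT]{101,}", seq):
        return True
    pct = percent_non_acgt(seq)
    if len(seq) < 1000:
        return pct > 7
    if len(seq) < 3000:
        return pct > 5
    return pct > 3
```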


last modified: Thu Sep 2 11:25 2021


Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
