HIV sequence database

Contamination in Conserved Regions

NOTE: We have created a web interface called SNAP to create trees based on only synonymous changes.

The low variability of protease and reverse transcriptase and the occurrence of drug resistance-associated mutations make detecting problems more challenging, especially in short sequences. Phylogenetic analysis of the protease gene sometimes shows patient intermixing that is not necessarily caused by contamination. But even in these genes it is possible to get more information about the sequence quality.

In addition to the methods suggested on the previous page, here are some other analyses that can help exclude contamination:

Below is an example of the use of synonymous substitutions to validate a published sequence dataset. The set consists of 421 clonal culture-derived sequences from varying numbers of samples from 21 patients. A BLAST search revealed no indication of lab strain contamination. Four individual sequences clustered with sequences from other patients and were probably contaminants or sample mix-ups. This is an excellent result for a study of this size.
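The kind of check described above, looking for single sequences that sit with another patient's sequences, can be roughly approximated without building a full tree. The Python sketch below is only an illustration and not the procedure used for this dataset. It assumes the sequences are already aligned, uses a simple uncorrected p-distance, and flags any sequence whose closest relative comes from a different patient. The function names and the (patient, sequence) key layout are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap or an ambiguity code are ignored.
    Returns 1.0 (treated as maximally distant) if no site can be compared.
    """
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                diffs += 1
    return diffs / compared if compared else 1.0


def flag_foreign_neighbors(seqs):
    """Return the keys of sequences whose nearest neighbor is from another patient.

    seqs maps (patient_id, sequence_id) -> aligned nucleotide sequence.
    Ties are resolved in favor of whichever sequence is encountered first.
    """
    flagged = []
    for key, seq in seqs.items():
        best = None  # (distance, key of nearest neighbor)
        for other_key, other_seq in seqs.items():
            if other_key == key:
                continue
            d = p_distance(seq, other_seq)
            if best is None or d < best[0]:
                best = (d, other_key)
        if best is not None and best[1][0] != key[0]:
            flagged.append(key)
    return flagged
```

A flagged sequence is only a candidate for closer inspection. As the rest of this page shows, unexpected clustering in protease can also have a legitimate biological explanation.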

In addition, samples from some patients showed unexpected clustering. A neighbor-joining tree containing three 'well-behaved' patients (N, P, U) and one 'strange' patient (K) looked like this:

Sequences from patient K form two separate clusters, separated by sequences from two other patients. This pattern may be caused by the emergence of a drug-resistant strain of virus: the virus isolated at week 60 was highly resistant to indinavir, while the viruses from weeks 0 and 18 were not. To diminish the influence of amino acid-driven selection, we looked at only synonymous changes, using the program SNAP, which was developed here at Los Alamos; the phylogenetic analysis programs MEGA (for PC) and Phylowin (for UNIX) can do the same thing. The tree below is based on only synonymous changes.

All sequences from patient K now cluster together, which makes it very plausible that they are indeed from the same patient, rather than a sample mix-up or cross-contaminant.
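To make the idea of "only synonymous changes" concrete, here is a minimal Python sketch that classifies codon differences between two aligned, in-frame sequences. It is a deliberately simplified illustration and does not reproduce SNAP, MEGA, or Phylowin. It only classifies codons that differ at a single nucleotide and skips codons with gaps, ambiguity codes, or more than one difference. Full methods such as Nei-Gojobori also handle multi-hit codons and normalize by the number of synonymous and nonsynonymous sites. The function name `count_changes` is hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINOS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def count_changes(seq_a, seq_b):
    """Count synonymous and nonsynonymous single-nucleotide codon differences.

    Both sequences must be aligned and start in reading frame.
    Returns (synonymous, nonsynonymous, skipped), where skipped counts
    differing codons that contain gaps or ambiguity codes, or that differ
    at more than one position.
    """
    a, b = seq_a.upper(), seq_b.upper()
    syn = nonsyn = skipped = 0
    for i in range(0, min(len(a), len(b)) - 2, 3):
        codon_a, codon_b = a[i:i + 3], b[i:i + 3]
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            skipped += 1
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff > 1:
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, skipped
```

For example, `count_changes("ATGGCCAAATTT", "ATGGCTAGATTC")` returns `(2, 1, 0)`: GCC to GCT and TTT to TTC leave the amino acid unchanged, while AAA to AGA replaces lysine with arginine. A distance built from the synonymous count alone is largely insensitive to amino acid changes selected by drug pressure, which is why sequences from one patient can regroup in the synonymous tree.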

Not all cases are as clear-cut. In this dataset, six patients showed unexpected clustering. In three of these, the sequences came together in the synonymous tree and were therefore most likely legitimate. In one other case, the existence of a plasma sequence from the same patient showed that the outlying cluster was valid. The two remaining cases could not be resolved; in view of the overall quality of the dataset and in the absence of strong evidence to the contrary, we considered them likely to be valid.

We thank Dr. Jon Condra and Dr. Andrew Leigh-Brown for help with the analysis and further consideration of the protease sequence set.


last modified: Tue Apr 22 12:07 2008


Operated by Triad National Security, LLC for the U.S. Department of Energy's National Nuclear Security Administration
© Copyright Triad National Security, LLC. All Rights Reserved | Disclaimer/Privacy
