Carla Kuiken and Bette Korber
MS K710, Los Alamos National Laboratory, Los Alamos, NM 87545
1. Create a phylogenetic tree that includes all the sequences in the study. Common signs of trouble are:
A phylogenetic tree can clarify the relations between the sequences. If you have lab strain contamination or sample mix-ups between two patients, a phylogenetic tree will likely show it. Once you have your sequences aligned, use our web site to generate a simple neighbor-joining tree (Saitou and Nei, 1987) to check for potential problems. Neighbor-joining is a computationally fast phylogenetic method that can easily handle hundreds of sequences in a single tree.
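As a rough first pass before building the tree, the same kind of problem can sometimes be spotted from pairwise distances alone. The sketch below is not neighbor-joining; it only asks whether each sequence's closest relative in the alignment comes from the same patient. The function names, the sequence labels, and the toy sequences are all hypothetical, and the input is assumed to be already aligned (equal lengths, "-" for gaps).

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.
    Columns with a gap in either sequence are ignored."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbor_flags(seqs, patient_of):
    """seqs: name -> aligned sequence; patient_of: name -> patient label.
    Returns (name, nearest) pairs where the closest sequence belongs
    to a different patient."""
    flagged = []
    for name in seqs:
        others = [n for n in seqs if n != name]
        nearest = min(others, key=lambda n: p_distance(seqs[name], seqs[n]))
        if patient_of[nearest] != patient_of[name]:
            flagged.append((name, nearest))
    return flagged


# Hypothetical toy alignment: E3 is labeled as patient E but resembles patient G.
seqs = {
    "E1": "ACGTACGTAC",
    "E2": "ACGTACGTAT",
    "E3": "TTGAACCTGG",
    "G1": "TTGAACCTGA",
    "G2": "TTGAACCTGA",
}
patient_of = {"E1": "E", "E2": "E", "E3": "E", "G1": "G", "G2": "G"}

for name, nearest in nearest_neighbor_flags(seqs, patient_of):
    print(f"{name} (patient {patient_of[name]}) is closest to {nearest} "
          f"(patient {patient_of[nearest]})")
```

A flagged sequence is only a candidate for closer inspection; the tree itself remains the better guide, because nearest-neighbor checks ignore the overall cluster structure.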
2. Compare your sequences to all published sequences (BLAST search).
BLAST is a program that finds sequences with similarity to the query sequence (Altschul et al., 1990). The output can be ordered by the degree of similarity. If your sequence is very similar to a published strain, especially a lab strain that is used for in vitro studies, it is likely that you have contamination. Even if your sequence is not identical to the lab strain, watch out for in vitro recombination, in which only part of the sequence matches the lab strain and the other part is derived from your patient sample (see Example 2 below). On our web site you can compare your sequences against all Genbank entries, which include the very latest sequences, or against the Los Alamos HIV database, which can lag behind Genbank a bit but contains more background information about the sequences.
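If the search is run locally with BLAST+ and saved in the standard 12-column tabular format (-outfmt 6), the hits can be screened for lab strains with a few lines of code. This is a minimal sketch under stated assumptions: the subject identifiers in LAB_STRAINS, the query names, and the identity and length cutoffs are all hypothetical, and an appropriate cutoff depends on the region being sequenced.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Hypothetical subject identifiers for lab strains in the search database.
LAB_STRAINS = {"HXB2", "LAI", "pNL43"}


def lab_strain_hits(tabular_text, min_identity=98.0, min_length=100):
    """Return (query, strain, percent identity, qstart, qend) for every
    hit to a lab strain that passes the (assumed) cutoffs."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        if (rec["sseqid"] in LAB_STRAINS
                and float(rec["pident"]) >= min_identity
                and int(rec["length"]) >= min_length):
            hits.append((rec["qseqid"], rec["sseqid"], float(rec["pident"]),
                         int(rec["qstart"]), int(rec["qend"])))
    return hits


# Hypothetical two-line result: one query matches HXB2 over part of its length.
example = "\n".join([
    "\t".join(["pt59", "HXB2", "100.000", "310", "0", "0",
               "301", "610", "6800", "7109", "1e-160", "573"]),
    "\t".join(["pt12", "refX", "93.500", "600", "39", "0",
               "1", "600", "6500", "7099", "0.0", "890"]),
])

for query, strain, identity, qstart, qend in lab_strain_hits(example):
    print(f"{query}: {identity:.1f}% identical to {strain} "
          f"over query positions {qstart}-{qend}")
```

The query coordinates are reported because a high-identity match that covers only part of the query is the pattern expected from in vitro recombination.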
3. Look carefully at the alignments, and pay attention to patient signature patterns.
Signature patterns (Korber and Myers, 1992) often help to show what is typical and atypical for a patient, and thus help to reveal sequences that don't seem to belong with a patient. The usefulness of signature patterns can be seen in the contamination example 3 below. You can use the VESPA program to find signature patterns, but often a simple alignment is sufficient to spot suspicious sequences.
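The idea behind a signature pattern can be illustrated with a simple frequency comparison. The sketch below is not the VESPA algorithm; it assumes that a signature site is an alignment column where a clear majority of the patient's sequences share a residue that differs from the consensus of a background set. The function names, the 0.8 frequency cutoff, and the toy sequences are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Most common residue in each column of an alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def signature_sites(patient_seqs, background_seqs, min_freq=0.8):
    """Return {position: residue} for columns where the patient's majority
    residue differs from the background consensus and is present in at
    least min_freq of the patient's sequences."""
    background = consensus(background_seqs)
    sites = {}
    for i, col in enumerate(zip(*patient_seqs)):
        residue, count = Counter(col).most_common(1)[0]
        if residue != background[i] and count / len(col) >= min_freq:
            sites[i] = residue
    return sites


def signature_match(seq, sites):
    """Fraction of the patient's signature sites that seq carries."""
    if not sites:
        return 0.0
    return sum(seq[i] == residue for i, residue in sites.items()) / len(sites)


# Hypothetical toy alignments (amino acids, already aligned).
background = ["CTRPNNNTRK", "CTRPNNNTRK", "CTRPGNNTRK"]
patient = ["CTRPSNNTRR", "CTRPSNNTRR", "CTRPSNNTKR"]

sites = signature_sites(patient, background)
print(sites)
print(signature_match("CTRPSNNTRR", sites))  # sequence that fits the patient
print(signature_match("CTRPNNNTRK", sites))  # sequence that lacks the signature
```

A sequence attributed to the patient that carries few or none of the patient's signature residues deserves a closer look, as in Example 3 below.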
4. Keep a background set of sequences that are commonly used in your laboratory for comparison.
BLAST searches can detect contamination by common lab strains whose sequences are entered in Genbank, but contamination with other genetic material that was recently used in your lab may go undetected. Aligning sequences that look suspicious with other sequences that your lab has produced may bring this type of contamination to light.
Some examples of contamination.
Example 1: A set of C1-C3 sequences containing LAI/HXB2 contamination and sample mix-ups (partial set, published). This partial dataset contains several examples of possible contamination. The tree is shown in Figure 2. Signs of trouble in this tree:
1. Sequences from patient F are spread over three clusters. One cluster is very similar to HXB2/LAI and is probably a laboratory contaminant. The distinctness of the other two clusters suggests either dual infection, contamination with an isolate for which there is no sequence in Genbank, or a mix-up with a patient who is not in the study.
2. Patients E and G both have a single sequence that clusters tightly with the other patient, suggesting a sample mix-up or mislabeling.
Example 2: This dataset contains three sequences, labeled 59, 77, and 65 in Figures 3 and 4, that are the result of in vitro recombination between the viral DNA from the patient and LAI/HXB2 DNA. In the tree (Figure 3), these three sequences clearly cluster with the LAI clone. The alignment (Figure 4) shows very clearly that in vitro recombination has occurred: the three recombinant sequences match the LAI sequence perfectly in the latter half of the alignment, whereas the other sequences in the study do not.
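A pattern like the one in Example 2 can also be looked for numerically by comparing a suspect sequence with the lab strain and with a reference from the same patient in successive windows along the alignment. The following is a minimal sketch, not a full recombination detection method; the function names, window size, step, and toy sequences are hypothetical, and all three sequences are assumed to come from the same alignment.

```python
def window_identity(a, b, start, size):
    """Fraction of identical sites in one window, ignoring gap columns."""
    pairs = [(x, y) for x, y in zip(a[start:start + size], b[start:start + size])
             if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


def recombination_profile(query, lab, patient, size=10, step=10):
    """For each window, return (start, identity to lab strain,
    identity to patient reference)."""
    return [(start,
             window_identity(query, lab, start, size),
             window_identity(query, patient, start, size))
            for start in range(0, len(query) - size + 1, step)]


def windows_closer_to_lab(profile):
    """Window start positions where the query resembles the lab strain
    more than the patient reference."""
    return [start for start, lab_id, pat_id in profile if lab_id > pat_id]


# Hypothetical toy data: the query follows the patient reference in the
# first half and the lab strain in the second half.
patient_ref = "ACGT" * 10
lab_strain = "ACGA" * 10
query = patient_ref[:20] + lab_strain[20:]

profile = recombination_profile(query, lab_strain, patient_ref)
for start, lab_id, pat_id in profile:
    print(f"window {start:2d}: lab {lab_id:.2f}  patient {pat_id:.2f}")
print(windows_closer_to_lab(profile))
```

A run of consecutive windows that match the lab strain perfectly, next to windows that match the patient, is the signal to examine by eye in the alignment.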
Example 3: This set was generated to study CTL epitope variation, and consists of partially overlapping sequence fragments of variable length. Phylogenetic analysis was impossible because the sequences had too little overlap to create a tree, but a BLAST search and an alignment with the most similar Genbank sequence showed extensive contamination with pNL43. Figure 5 shows the sequences aligned to pNL43. Yellow bars indicate sequences whose best BLAST match was to pNL43 and which are thus considered contaminants. Signature patterns, shown as colored rectangles, are characteristic of each patient and are notably absent from the contaminant sequences. Although identity of one individual sequence to a lab strain would not conclusively prove contamination, all evidence taken together is very strong:
The special case of conserved genes.
Cited references:
Selected references on contamination and its consequences:
Learn GH Jr, Korber BT, Foley B, Hahn BH, Wolinsky SM, Mullins JI., Maintaining the integrity of human immunodeficiency virus sequence databases, J. Virol. 1996 Aug;70(8):5720-5730.
Korber BT, Learn G, Mullins JI, Hahn BH, Wolinsky S., Protecting HIV databases, Nature 1995 Nov 16;378(6554):242-244.
Frenkel LM, Mullins JI, Learn GH et al., Genetic Evaluation of Suspected Cases of Transient HIV-1 Infection of Infants, Science 1998 May 15;280(5366):1073-1077.
McClure MO, Bieniasz PD, Weber JN, Tedder RS, O'Shea S, Banatvala JE, Tudor-Williams G, Simmonds P, Holmes EC., HIV clearance in an infant?, Nature 1995 Jun 22;375(6533):637-638
R. Schuurman, L. Demeter, P. Reichelderfer, J. Tijnagel, T. de Groot, C. Boucher on behalf of the ENVA laboratories, the Sequencing Working Group and participating laboratories, World-wide Evaluation of DNA Sequencing Approaches for the Identification of Drug Resistance Mutations in the HIV-1 Reverse Transcriptase, Proceedings of the 5th annual Conference on Retroviruses, Abstract # 532.
On our web site you can build a tree and do a BLAST search with your sequences. We include some tips for identifying problem sequences in conserved regions of the HIV genome such as protease or reverse transcriptase (RT).
What degree of similarity indicates contamination depends on the gene or region being analyzed. For example, RT sequences are much less variable than V3 sequences. Figure 1 shows the frequency distribution of similarity scores for different genes. The population from which the samples have been obtained will also influence the degree of similarity to be expected. Sets of clonal sequences from different tissues in a single patient will tend to be more similar than sets from different persons in a clustered outbreak, which in turn will tend to be more similar than sequences from geographically dispersed locations.
Figure 1. Frequency distribution of similarity scores for different genes.
Figure 2. LAI/HXB2 contamination and sample mix-ups.
Figure 3. Three recombinant sequences clustering with LAI.
Figure 4. Alignment of the samples in Figure 3, showing a perfect match with LAI in the shaded region.
Figure 5. Alignment of samples to pNL43. Yellow bars: contaminant sequences (best BLAST match to pNL43); other colors: patient signature patterns.
Figure 6. Neighbor-joining tree based on all amino acids.
Figure 7. Neighbor-joining tree based on synonymous changes only.