NOTE: We have created a web interface called SNAP to create trees based on only synonymous changes.
The low variability of protease and reverse transcriptase and the occurrence of drug resistance-associated mutations make detecting problems more challenging, especially in short sequences. Phylogenetic analysis of the protease gene sometimes shows patient intermixing that is not necessarily caused by contamination. But even in these genes it is possible to get more information about the sequence quality.
In addition to the methods suggested on the previous page, here are some other analyses that can help exclude contamination:
If sequences from a patient cluster with sequences from another patient, this makes contamination very likely. If they do not cluster with other sequences from the set, this makes contamination less plausible (but doesn't exclude it).
If patient sequences do not cluster together in a phylogenetic tree, check if the separation is associated with drug resistance; it can be the result of selection rather than contamination.
Making a tree without the resistance-associated mutations can sometimes resolve the problem.
If a sequence from one patient clusters with sequences from another patient, check if they were PCR amplified on the same day. If so, the sequence is more likely to be a contaminant.
Make a tree based on synonymous substitutions only. This can in some cases reunite all sequences from one patient into the same cluster and confirm their validity.
Look for signature patterns that are characteristic of a patient and see if they are preserved in the outliers in question.
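The suggestion above to make a tree without the resistance-associated mutations can be approximated by masking the affected codons in a codon-aligned alignment before building the tree. The following is a minimal sketch, not a tool used in the original analysis. The function name mask_codons and the toy sequence are hypothetical, and the list of codon positions to mask has to come from a current drug resistance mutation list for the gene being studied.

```python
def mask_codons(seq, codon_positions, mask_char="N"):
    """Return seq with the codons at the given 1-based positions masked.

    Assumes seq is in frame, i.e. codon 1 starts at the first nucleotide.
    Positions that fall outside the sequence are ignored.
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    for pos in codon_positions:
        if 1 <= pos <= len(codons):
            # Keep the original length so the alignment stays in register.
            codons[pos - 1] = mask_char * len(codons[pos - 1])
    return "".join(codons)


# Toy five-codon sequence; positions 2 and 4 are illustrative only.
toy = "ATGCCTCAGGTCACT"
print(mask_codons(toy, [2, 4]))  # ATGNNNCAGNNNACT
```

Masking rather than deleting the codons keeps every sequence the same length, so the masked alignment can be passed to the same tree-building program as the original.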
Below is an example of the use of synonymous substitutions to validate a published sequence dataset. The set consists of 421 clonal culture-derived sequences from varying numbers of samples from 21 patients. A BLAST search revealed no indication of lab strain contamination. Four single sequences clustered with sequences from other patients, and were probably contaminants or mix-ups. This is an excellent result for a study of this size.
In addition, samples from some patients show unexpected clustering. A neighbor-joining tree containing three 'well-behaved' (N, P, U) and one 'strange' patient (K) looked like this:
Sequences from patient K form two separate clusters, separated by sequences from two other patients. It is possible that this behavior is caused by the emergence of a drug-resistant strain of virus. The virus isolated at week 60 was highly resistant to indinavir, whereas the virus from weeks 0 and 18 was not. To diminish the influence of amino acid-driven selection, we looked at only synonymous changes, using the program SNAP that was developed here at Los Alamos; the phylogenetic analysis programs MEGA (for PC) and Phylowin (for UNIX) can do the same thing. The tree below is based on only synonymous changes.
All sequences from patient K now cluster together, which makes it very plausible that they are indeed from the same patient, rather than a sample mix-up or cross-contaminant.
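To illustrate what "only synonymous changes" means, the sketch below compares two codon-aligned sequences and counts how many differing codons leave the amino acid unchanged and how many alter it. It is a deliberately simplified illustration and does not reproduce the calculation performed by SNAP, MEGA or Phylowin, which estimate substitutions per site rather than counting codons. The function name and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned and in frame. Codons containing gaps
    or ambiguity codes are skipped, as is any trailing partial codon.
    """
    synonymous = nonsynonymous = 0
    usable = min(len(seq_a), len(seq_b)) // 3 * 3
    for i in range(0, usable, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Toy example: CCT/CCC (Pro) and CAG/CAA (Gln) are synonymous; GTC/ATC is Val to Ile.
print(count_codon_differences("ATGCCTCAGGTC", "ATGCCCCAAATC"))  # (2, 1)
```

If two clusters from one patient differ mainly in the nonsynonymous count, selection (for example by drug pressure) is a plausible explanation for their separation; a large synonymous difference points more toward distinct viral sources.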
Not all cases are as clear-cut. In this dataset, six patients showed unexpected clustering. In three of these, the sequences came together in the synonymous tree, and therefore were most likely legitimate. In one other case, the existence of a sequence from plasma from the same patient showed that the outlying cluster was valid. Two other cases couldn't be resolved; in view of the overall quality of the dataset and in the absence of strong evidence to the contrary, we considered them likely to be valid.
We thank Dr. Jon Condra and Dr. Andrew Leigh-Brown for help with the analysis and further consideration of the protease sequence set.