HIV sequence database

PCOORD (Principal Coordinate Analysis) is a procedure to find meaningful patterns in sequence data with no a priori knowledge about them. The procedure attempts to summarize the variation in the sequences in a limited number of axes or dimensions. A 'dimension' is basically a combination of positions in a sequence that behave similarly (for example, position 133 usually has an A when position 250 has a G).

One way to describe the process of finding these dimensions is as follows. If we have a two-dimensional swarm of data points, then we need two dimensions (the X and Y axes) to describe the variation in our data. However, if the swarm is very elongated and the points lie almost on a straight line, then one dimension along that line captures nearly all of the variation, even though the data are recorded in two. PCOORD uses a mathematical method to find the best way to describe a multi-dimensional dataset in a smaller number of dimensions, which are linear combinations of the original dimensions.
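The elongated-swarm example can be tried out with a minimal sketch of classical principal coordinate analysis (Gower's double-centering followed by an eigendecomposition). This is not the PCOORD program's own code; the function name `pcoord`, its `n_axes` parameter, and the synthetic data are assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

def pcoord(dist, n_axes=2):
    """Classical principal coordinate analysis of a square distance matrix.

    Hypothetical helper for illustration; not the PCOORD program itself.
    Returns (coordinates, fraction of variation explained per axis).
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Gower's double centering: B = -1/2 * J * D^2 * J, with J = I - 11^T / n
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the leading axes; clip negative eigenvalues, which can arise from
    # rounding or from distances that are not Euclidean
    lam = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(lam)
    explained = lam / eigvals[eigvals > 0].sum()
    return coords, explained

# Synthetic elongated swarm: 50 points lying almost on a straight line
rng = np.random.default_rng(0)
t = rng.uniform(-5, 5, size=50)
points = np.column_stack([t, 0.5 * t + rng.normal(0, 0.1, size=50)])

# Pairwise Euclidean distances between the points
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, explained = pcoord(dist)
print(explained)  # the first axis should carry almost all of the variation
```

Because the swarm is nearly one-dimensional, the first axis accounts for almost all of the variation and the second axis adds very little. With sequence data, `dist` would instead hold the pairwise sequence distances.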

The dimensions are not necessarily biologically meaningful, but they can be. Quite frequently, some dimensions that are extracted correspond to an epidemiological variable or some other feature of the data. The patterns that are found using PCOORD usually can be seen in a phylogenetic tree as well, but they may be much less pronounced there.

The results from PCOORD are to some extent influenced by which distance scoring method is used. For nucleotides, PCOORD computes simple Hamming distances. For amino acids, a similar same/different scoring scheme is available, called ID distances. Also implemented is the Smith and Smith (1990) scoring method, which results in Euclidean distances. Their scoring matrix looks like this:

    D  0
    E  1 0
    K  2 2 0
    R  2 2 1 0
    H  2 2 1 1 0
    N  2 2 2 2 2 0
    Q  2 2 2 2 2 1 0
    S  2 2 2 2 2 2 2 0
    T  2 2 2 2 2 2 2 1 0
    I  3 3 3 3 3 3 3 3 3 0
    L  3 3 3 3 3 3 3 3 3 1 0
    V  3 3 3 3 3 3 3 3 3 1 1 0
    F  3 3 3 3 3 3 3 3 3 2 2 2 0
    W  3 3 3 3 3 3 3 3 3 2 2 2 1 0
    Y  3 3 3 3 3 3 3 3 3 2 2 2 1 1 0
    C  3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 0
    M  3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 0
    A  3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
    G  3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 0
    P  3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
       D E K R H N Q S T I L V F W Y C M A G P
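The scoring schemes described above can be sketched in a few lines. The function names `hamming` and `smith_score` are hypothetical, and the sketch makes two assumptions that the text does not confirm: sequences are already aligned to equal length, and every mismatching character (including a gap) counts as one difference. The matrix values are copied from the table above. How PCOORD combines per-position Smith and Smith scores into a distance between two sequences is not stated here, so only the per-residue lookup is shown.

```python
def hamming(a, b):
    """Count the positions at which two aligned sequences differ.

    Works for nucleotides (Hamming) and, as a same/different score,
    for amino acids (ID distances).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

# Residue order and lower triangle of the Smith and Smith (1990) matrix,
# one digit per entry, copied from the table in the text
ORDER = "DEKRHNQSTILVFWYCMAGP"
LOWER = [
    "0",                   # D
    "10",                  # E
    "220",                 # K
    "2210",                # R
    "22110",               # H
    "222220",              # N
    "2222210",             # Q
    "22222220",            # S
    "222222210",           # T
    "3" * 9 + "0",         # I
    "3" * 9 + "10",        # L
    "3" * 9 + "110",       # V
    "3" * 9 + "2220",      # F
    "3" * 9 + "22210",     # W
    "3" * 9 + "222110",    # Y
    "3" * 9 + "2222220",   # C
    "3" * 9 + "22222220",  # M
    "3" * 17 + "0",        # A
    "3" * 17 + "10",       # G
    "3" * 19 + "0",        # P
]
INDEX = {aa: i for i, aa in enumerate(ORDER)}

def smith_score(a, b):
    """Look up the score for a pair of amino acids (the matrix is symmetric)."""
    i, j = INDEX[a], INDEX[b]
    if i < j:
        i, j = j, i  # only the lower triangle is stored
    return int(LOWER[i][j])

# Pairwise Hamming distances for three short aligned nucleotide sequences
seqs = ["ACGTAC", "ACGTTC", "TCGTTC"]
matrix = [[hamming(a, b) for b in seqs] for a in seqs]
print(matrix)                 # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(smith_score("K", "R"))  # 1
print(smith_score("A", "P"))  # 3
```

Under the same/different scheme every substitution costs the same, whereas the Smith and Smith matrix scores an exchange between similar residues (such as K and R) lower than an exchange between dissimilar ones (such as A and P).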

The PCOORD program can label each sequence with a single character (a number, a letter, or a symbol such as * or ^). To use this feature, you need a file with one character for each sequence. In the dimension plot, the point representing each sequence will then be marked with the corresponding character.

The Principal Coordinate Analysis method is very similar to Principal Component Analysis. The method was developed by J.C. Gower. The PCOORD program suite was developed by Des Higgins (then at the European Molecular Biology Laboratory, EMBL), and adapted for UNIX machines by Jack Leunissen of the CAOS/CAMM institute in Nijmegen, The Netherlands.

For more detailed information, you can view the manual for the Spacer code, which is the basis of our PCOORD tool.

**References:**

Higgins DG (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. Comput Appl Biosci 8(1):15-22.

Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-328.

Smith RF, Smith TF (1990) Automatic generation of primary sequence patterns from sets of related protein sequences. Proc Natl Acad Sci USA 87(1):118-122.

last modified: Wed Mar 17 14:42 2010