HIV sequence database

**Introduction**

A heatmap is a graphical way of displaying a table of numbers by using colors to represent numerical values. For example, low values might tend towards cool blue tones while higher values tend towards hotter orange and red tones. Heatmaps also use a clustering algorithm to rearrange the rows and columns of the table so that similar rows, and similar columns, are grouped together, with their similarity represented by a dendrogram (separate dendrograms for rows and for columns). This web tool uses the "heatmap.2" function of the gplots package for the statistical environment R (R: A Language and Environment for Statistical Computing).

This sort of 2-dimensional clustering was originally used for analysis of gene expression array data (for example, Hastie, T., R. Tibshirani, and J. Friedman. 2001. In: The elements of statistical learning: data mining, inference, and prediction, p. 453-480. Springer-Verlag, New York, N.Y.). A heatmap plot is broadly applicable to any problem where it is beneficial to arrange numeric values in a 2-dimensional array according to like-behavior. We have found this organization of data useful for interpreting neutralizing antibody data, where panels of sera or monoclonal antibodies are tested against panels of Envelopes (for example, Binley et al., J Virol. 2004 Dec;78(23):13232-52); the heatmap graphically groups Envelopes with similar antibody sensitivities and simultaneously groups antibodies with similar neutralization profiles. It is also a useful method for interpreting other quantitative immunological data, such as CD8 T-cell ELISpot results.

Input

Required format:

- A table of M rows by N columns.
- Values delimited by spaces or tabs.
- Text format; the filename should have a suffix ".txt", as in "MyData.txt".
- Each row should start with a label for the row, followed by N values. (**Note:** the header row has one value fewer than the other rows have!)
- Each column should start with a label for the column, followed by M values.
- Missing values should be denoted by "NA" or "ND" (standing for "Not Available").

**Size:** very large input matrices tend to produce errors. If you get an error message, try your input again without doing bootstrapping, or try clustering by only rows or columns, not both. If errors persist, try smaller subsets of your input data to verify that your data are correctly formatted and to determine if a particular part of your data may be causing the error.

Here is a small sample table illustrating the format:

    BB8 BB12 BB28 BB55 BB70 BB106 Pool_B
    Du123.6_C_SA 207 352 165 84 147 198 182
    Du151.2_C_SA 196 1555 2529 818 241 487 518
    Du156.12*_C_SA 369 NA 238 336 406 258 231
    Du172.17*_C_SA 429 884 499 196 549 550 315
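The format rules can be checked before uploading with a small parser. The following is a minimal Python sketch, not the tool's own code; the function name `parse_table` and the choice to store missing values as NaN are assumptions made for illustration.

```python
import math

def parse_table(text):
    """Parse a whitespace-delimited table whose header row has one
    field fewer than the data rows (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    col_labels = lines[0].split()
    row_labels, values = [], []
    for ln in lines[1:]:
        fields = ln.split()
        # Each data row is a label followed by one value per column.
        if len(fields) != len(col_labels) + 1:
            raise ValueError(
                f"row {fields[0]!r}: expected {len(col_labels)} values, "
                f"got {len(fields) - 1}"
            )
        row_labels.append(fields[0])
        # "NA" and "ND" mark missing values; store them as NaN.
        values.append(
            [math.nan if f in ("NA", "ND") else float(f) for f in fields[1:]]
        )
    return col_labels, row_labels, values

SAMPLE = """\
BB8 BB12 BB28 BB55 BB70 BB106 Pool_B
Du123.6_C_SA 207 352 165 84 147 198 182
Du151.2_C_SA 196 1555 2529 818 241 487 518
Du156.12*_C_SA 369 NA 238 336 406 258 231
Du172.17*_C_SA 429 884 499 196 549 550 315
"""

cols, rows, data = parse_table(SAMPLE)
```

Here `cols` holds the 7 column labels, `rows` the 4 row labels, and `data` a 4 by 7 list of numbers with the single "NA" entry stored as NaN.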

Output

Four outputs are produced:

- a heatmap with row and column dendrograms.
- a bootstrap dendrogram that represents the stability of a row or column in the heatmap dendrogram.
- a clustering dendrogram representing the clustering of the row data, or of the column data, without a heatmap (i.e., without a color representation of the tabular data).
- the input data (this is useful if the user selects a below-threshold and/or an above-threshold and therefore the program changes those values in the input data to the assigned thresholds -- see below for explanation).

The output format is pdf. In addition to the pdf file, Heatmap provides a link to the R script used to produce the output.

Options

**Use log transformation**

If selected, the software will take the log of the data. If your data has any zero values, then taking the log will result in an error.
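The zero-value problem can be caught before submitting the data. The sketch below is illustrative only: the page does not say which log base the tool uses, so base 10 is an assumption, as are the function name `log_transform` and the handling of missing values as NaN.

```python
import math

def log_transform(values):
    """Return the log10 of every value in a list of rows.
    Missing entries (NaN) pass through unchanged; zero or negative
    entries raise an error instead of producing a bad plot."""
    out = []
    for row in values:
        new_row = []
        for v in row:
            if math.isnan(v):
                new_row.append(v)
            elif v <= 0:
                raise ValueError(f"cannot take the log of {v}")
            else:
                new_row.append(math.log10(v))
        out.append(new_row)
    return out

logged = log_transform([[100.0, 1.0, 1000.0]])
```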

**Use threshold value**

The program allows users to input a below-threshold and/or an above-threshold value. Any values beyond the given threshold value will be presented as a white box. In addition, the input dataset will be changed and all values below the below-threshold or above the above-threshold will be set to the threshold.

For example, if a below-threshold of 20 is given, then all values less than or equal to 20 will be considered below the threshold. The program will create a copy of the input data in which all such values are changed to 20, and these values will be represented by white boxes in the heatmap. **Note:** if you want only values strictly beyond the threshold to be white boxes, but not values equal to the threshold itself, then set a slightly smaller threshold (for below-thresholds) or a slightly greater threshold (for above-thresholds). For example, if your detection threshold is 20, but 20 is still a valid value in your data, set the threshold to 19. Then all below-threshold values will be changed to 19 and represented by white boxes in the heatmap, while values of 20 will appear in the next color.

Analogously, if you set the above-threshold to 1,000, for example, all values greater than or equal to 1,000 in the dataset will be changed to 1,000 and shown as white boxes in the heatmap. If you wish 1,000 to still be considered a valid entry and not a white box, then set the above-threshold to a value slightly greater than 1,000.
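The threshold rules can be summarized in a few lines of Python. This is a sketch of the behavior described above, not the tool's own code; the function name `apply_thresholds` and the boolean "white box" mask are assumptions made for illustration.

```python
def apply_thresholds(values, below=None, above=None):
    """Clamp values to the given thresholds.
    Returns (clamped, white), where white[i][j] is True for cells
    that would be drawn as white boxes."""
    clamped, white = [], []
    for row in values:
        new_row, mask = [], []
        for v in row:
            if below is not None and v <= below:
                # At or below the below-threshold: set to the threshold.
                new_row.append(below)
                mask.append(True)
            elif above is not None and v >= above:
                # At or above the above-threshold: set to the threshold.
                new_row.append(above)
                mask.append(True)
            else:
                new_row.append(v)
                mask.append(False)
        clamped.append(new_row)
        white.append(mask)
    return clamped, white

data = [[10, 20, 21, 1000, 1500]]
clamped_20, white_20 = apply_thresholds(data, below=20, above=1000)
clamped_19, white_19 = apply_thresholds(data, below=19, above=1001)
```

With thresholds of 20 and 1000, the values 20 and 1000 themselves become white boxes; with 19 and 1001 they keep their own colors, which is the adjustment the note above recommends.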

**Cluster Method**

To compute a dendrogram, (a) a distance method and (b) a cluster method need to be specified. A heatmap re-orders the rows and columns separately so that similar data are grouped together. A dendrogram shows the similarity of the rows, and a separate dendrogram shows the similarity of the columns. Although the row dendrogram and the column dendrogram are shown simultaneously on the heatmap, they are computed independently of each other.

Three choices are provided for cluster method:

- Complete (default),
- Average,
- Ward.

"Complete" is the default method. The first two choices, complete and average, are widely used. Other choices may be useful in special situations where it's suspected that the clusters are not compact spherical clusters, or to more fully explore the cluster structure.

Reference:

J.H. Ward, Journal of the American Statistical Association 58(301):236-244 (1963).
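To make the difference between the complete and average methods concrete, here is a naive Python sketch of agglomerative clustering. It is an illustration under stated assumptions, not the tool's implementation (the tool relies on R): the function names are hypothetical, Euclidean distance is used, and Ward's method is omitted for brevity.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(c1, c2, rows, method):
    """Distance between two clusters, given as lists of row indices."""
    pairwise = [euclidean(rows[i], rows[j]) for i in c1 for j in c2]
    if method == "complete":
        return max(pairwise)                  # farthest pair of members
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown method: {method}")

def agglomerate(rows, method="complete"):
    """Repeatedly merge the two closest clusters.
    Returns the merges in order as (cluster_a, cluster_b, height)."""
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(
                clusters[p[0]], clusters[p[1]], rows, method
            ),
        )
        height = linkage_distance(clusters[a], clusters[b], rows, method)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

rows = [[0.0], [1.0], [5.0], [7.0]]
complete_merges = agglomerate(rows, "complete")
average_merges = agglomerate(rows, "average")
```

Both methods group rows 0 and 1 first, then rows 2 and 3. They differ in the height of the final merge: 7.0 for complete (the farthest pair) and 5.5 for average (the mean of all cross-cluster pairs).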

**Distance Method**

- Euclidean (default): The usual straight-line distance, i.e., the square root of the sum of squared differences between corresponding elements of vectors X and Y.
- Manhattan: Sum of the absolute values of the differences of the elements of vectors X and Y.
- Binary: Replaces elements of vectors X and Y with "1" if the element is nonzero, and with "0" otherwise. The distance is the proportion of elements that mismatch, i.e., a percentage Hamming distance.
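The three distances can be written out directly. The Python sketch below follows the descriptions in the list above and is not the tool's code; the function names are chosen for illustration. One caveat: R's own `dist(..., method = "binary")` computes the mismatch proportion only over positions where at least one of the two vectors is nonzero, so its values can differ from the simple proportion shown here.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def binary(x, y):
    """Proportion of positions where exactly one of the two vectors
    is nonzero (as described on this page)."""
    mismatches = sum((a != 0) != (b != 0) for a, b in zip(x, y))
    return mismatches / len(x)

x = [0, 3, 0]
y = [4, 0, 0]
d_euclidean = euclidean(x, y)   # sqrt(16 + 9 + 0)
d_manhattan = manhattan(x, y)   # 4 + 3 + 0
d_binary = binary(x, y)         # 2 mismatching positions out of 3
```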

**Color palettes**

A heatmap represents the numerical values in a table of numbers by colors. One popular palette is the Brewer Color palette where lighter, less saturated colors such as light yellows represent small values, while darker, more saturated colors such as deep reds represent large values. (See RColorBrewer.pdf for details.) The available palettes are:

- Brewer palette (default): small values are lighter and less saturated colors, e.g., light yellow, progressing through darker more saturated colors such as reds and deep browns.
- heat colors: small values are red, progressing to higher values as oranges and yellows
- topo colors: small values are blue, progressing to greens, then yellows at higher values
- red/green: small values are red, progressing through black, to higher values as green
- ember colors: small values are blue, progressing to oranges and reds for higher values.

**Higher values**

The option "High intensity color" will display high data values with high intensity colors (e.g., red) and low values with low intensity colors (e.g., yellow). The option "Low intensity color" will reverse the usual color assignments.

**Number of colors**

The number of colors in which the data will be represented using the color palette selected. This value will be ignored if color key ranges are specified.

**Color key ranges**

You can customize the color key ranges (optional). Follow the examples below.

- When no threshold values are given, color key range should be: minimum data value,...,maximum data value, for example: 5,60,120,240,366
- When a below-threshold value is given (<20 or 20), the color key range should be: below-threshold value (20),...,maximum data value, for example: 20,60,120,240,366
- When an above-threshold value is given (>1000 or 1000), the color key range should be: minimum data value,...,above-threshold value (1000), for example: 20,150,500,750,1000
- If specified, color key ranges will override the "Number of colors" option.

- If using log values, do not use zero in the color key ranges.
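A color key range can be read as a list of break points that divide the data into bins, one color per bin, which is why it replaces the "Number of colors" setting. The Python sketch below illustrates that reading. The function name `color_bin` is hypothetical, and the choice of which bin receives a value lying exactly on a break is an assumption; the tool may assign such values differently.

```python
from bisect import bisect_right

def color_bin(value, breaks):
    """Return the 0-based index of the color bin containing value.
    A list of k breaks defines k - 1 bins (and so k - 1 colors)."""
    if value < breaks[0] or value > breaks[-1]:
        raise ValueError(f"{value} is outside the color key range")
    # A value equal to an interior break goes to the higher bin here;
    # the maximum is folded into the last bin.
    return min(bisect_right(breaks, value) - 1, len(breaks) - 2)

breaks = [5, 60, 120, 240, 366]   # example range from the list above
n_colors = len(breaks) - 1
bins = [color_bin(v, breaks) for v in (5, 59.9, 60, 200, 366)]
```

With the example range 5,60,120,240,366 there are four bins, so four colors are used regardless of the "Number of colors" value.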

**Labels and margins**

Large tables may have so much text associated with the labels of the rows and the columns that it's hard to fit the text on the heatmap. One can try adjusting the character size of the labels via the Column Label Size and Row Label Size values, along with the Bottom and Right Margin values. Failing all else, one can always change the input data to use abbreviations instead of full text.

With label sizes set at 7%, Heatmap will accommodate over 1,000 rows and 700 columns. At 20%, about 500 rows and 300 columns.

**Bootstraps**

A common procedure to assess the stability of a clustering is to resample the data with replacement, a.k.a. bootstrap resampling.

__Standard Bootstrap__

The original data matrix is used to compute a dendrogram by row and/or column. The matrix is then resampled some number of times, and the resampled dendrograms are compared with the original for support. For the row-wise dendrogram, resampling consists of making a new matrix by randomly sampling columns of data, with replacement, and then computing the row-wise dendrogram. For the column-wise dendrogram, rows are sampled from the original matrix, again with replacement. The degree of support is computed using the prop.clades method of the R software ape package.
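The resampling procedure for the row-wise dendrogram can be sketched in Python as follows. This is a simplified stand-in, not the tool's code: the tool uses R and the ape package, whereas here the clustering is a naive complete-linkage routine with Euclidean distance, support is counted by exact clade matching, and all function names are hypothetical.

```python
import math
import random
from itertools import combinations

def row_clades(matrix):
    """Complete-linkage clustering of the rows. Returns every internal
    node of the dendrogram as a frozenset of row indices."""
    def dist(i, j):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j]))
        )
    clusters = [frozenset([i]) for i in range(len(matrix))]
    clades = set()
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda p: max(dist(i, j) for i in p[0] for j in p[1]),
        )
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        clades.add(merged)
    return clades

def bootstrap_support(matrix, n_boot=100, seed=0):
    """Fraction of column-resampled dendrograms that contain each
    clade of the original row dendrogram."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    original = row_clades(matrix)
    counts = dict.fromkeys(original, 0)
    for _ in range(n_boot):
        # Sample column indices with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = [[row[c] for c in cols] for row in matrix]
        for clade in row_clades(resampled) & original:
            counts[clade] += 1
    return {clade: counts[clade] / n_boot for clade in original}

matrix = [
    [1, 2, 1, 2],
    [2, 1, 2, 1],
    [100, 101, 100, 102],
    [101, 100, 102, 100],
]
support = bootstrap_support(matrix, n_boot=50)
```

In this toy matrix the two groups of rows are far apart in every column, so each clade is recovered in every resample. Real data will usually give lower support for at least some nodes.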

__% Bootstrap Support__

To see only statistically supported nodes, specify a desired degree of support. Resampling support values will then be shown only for nodes whose support is at or above that threshold.

__pvclust__

This option uses the pvclust package to perform bootstrapping. One can separately bootstrap the row data, or the column data. Pvclust produces a row or a column dendrogram with values at nodes in the dendrogram representing the stability of the clusters associated with the node. Pvclust attempts to address the bias in normal bootstrap resampling by employing a multi-scale bootstrap resampling approach. Normal bootstrap resampling values are represented in the output by green letters "bp", for "bootstrap probability". The multi-scale bootstrap resampling probabilities are represented in the output by red letters "au", for "approximately unbiased", and are generally preferred over the "bp" bootstrap probabilities. The threshold for red boxed clusters is 95% probability.

NOTE: Bootstraps can take many minutes to complete, depending on the number of iterations. To avoid browser timeouts, all results requiring bootstraps will be e-mailed.

References for bootstrapping:

Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.

The color sidebar options allow you to add a color bar vertically or horizontally. These sidebars can be used to illustrate any possible grouping pertaining to your samples. For example, if the viruses in your data came from two different species, you could make each species clearly identifiable with a different color in the sidebar.

The colors in the sidebar may be assigned according to characters in the sample names, or by a grouping specified in a text file. The bar can be colored by a default palette (red - orange - yellow) based on frequency, or by user-selected colors (shown below) manually assigned to each group.

Red, Orange, Yellow, Yellow green, Green, Blue, Royal blue, Dark orchid, Deep pink, Hot pink, Light pink, Brown, Khaki, Lemon chiffon, Teal, Medium turquoise, Sky blue, Medium purple, Medium slate blue, Grey

The best way to understand this option is to try it. The following text can be used to add a horizontal color sidebar to the Sample Input. You can copy/paste the example below into the box for "paste grouped column labels".

    Group1: virus_1 virus_2 virus_3 virus_4 virus_5 virus_6 virus_7 virus_8 virus_9 virus_10 virus_11
    Group2: virus_12 virus_13 virus_14 virus_15 virus_16 virus_17
    Group3: virus_18 virus_19 virus_20
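The grouping text can be read as a mapping from each label to its group. The Python sketch below shows one way to interpret it and is not the tool's parser. It assumes that any token ending in a colon starts a new group, and that "based on frequency" means the largest group receives the first color of the default palette; both are assumptions, as are the function names.

```python
from collections import Counter

DEFAULT_PALETTE = ["red", "orange", "yellow"]

def parse_groups(text):
    """Map each label to its group name. A token ending in ':' is
    assumed to start a new group."""
    groups = {}
    current = None
    for token in text.split():
        if token.endswith(":"):
            current = token[:-1]
        elif current is None:
            raise ValueError(f"label {token!r} appears before any group")
        else:
            groups[token] = current
    return groups

def sidebar_colors(groups, palette=DEFAULT_PALETTE):
    """Assign palette colors to groups, largest group first
    (an assumed reading of 'based on frequency')."""
    by_size = [g for g, _ in Counter(groups.values()).most_common()]
    color_of = {g: palette[i % len(palette)] for i, g in enumerate(by_size)}
    return {label: color_of[g] for label, g in groups.items()}

text = "Group1: virus_1 virus_2 virus_3 Group2: virus_4 virus_5 Group3: virus_6"
groups = parse_groups(text)
colors = sidebar_colors(groups)
```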

Acknowledgements

This tool uses R software. Thanks to the R Team:

R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.

last modified: Tue Aug 20 10:17 2013