Heatmap Explanation   Hierarchical Clustering

Introduction

A heatmap is a graphical way of displaying a table of numbers by using colors to represent numerical values. For example, low values might tend towards cool blue tones while higher values tend towards hotter orange and red tones. Heatmaps also rearrange the rows and columns of the table: a clustering algorithm groups similar rows together, and similar columns together, with their similarity represented by a dendrogram (separate dendrograms for rows and for columns). This web tool uses the "heatmap.2" function of the gplots package for R, a language and environment for statistical computing.

This sort of 2-dimensional clustering was originally used for analysis of gene expression array data (for example, Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, p. 453-480. Springer-Verlag, New York, N.Y.). A heatmap plot is broadly applicable to any problem where it is beneficial to arrange numeric values in a 2-dimensional array so that rows and columns with similar behavior sit together. We have found this organization of data useful for interpreting neutralizing antibody data, where panels of sera or monoclonal antibodies are tested against panels of Envelopes (for example, Binley et al., J Virol. 2004 Dec;78(23):13232-52); the heatmap graphically groups Envelopes with similar antibody sensitivities and simultaneously groups antibodies with similar neutralization profiles. It is also a useful method for interpreting other quantitative immunological data, such as CD8 T-cell ELISpot results.

Input
Heatmap no longer accepts non-numerical input. With the exception of the header row and the row labels in the first column, all entries in the input file should be numerical. If some of your data has been entered in the form <X, where X is the threshold, please change your input data. For example, if your detection threshold is 20, and you have some cells entered as <20, change them to 19 if you want all values strictly lower than 20 to be considered below the threshold, or change them to 20 if you want all values lower than or equal to 20 to be considered below the threshold.

Required format:

Here is a small sample table illustrating the format:

BB8	BB12	BB28	BB55	BB70	BB106	Pool_B
Du123.6_C_SA	207	352	165	84	147	198	182
Du151.2_C_SA	196	1555	2529	818	241	487	518
Du156.12*_C_SA	369	NA	238	336	406	258	231
Du172.17*_C_SA	429	884	499	196	549	550	315
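
As a rough illustration of this layout, the sketch below parses a trimmed copy of the sample table using only Python's standard library. It assumes tab-separated fields, a header row that holds only the column labels (so it has one field fewer than the data rows), and that NA marks a missing value. The function name parse_table is hypothetical and is not part of the Heatmap tool.

```python
import csv
import io

# Trimmed copy of the sample table: tab-separated, header has column labels only.
SAMPLE = (
    "BB8\tBB12\tBB28\n"
    "Du123.6_C_SA\t207\t352\t165\n"
    "Du156.12*_C_SA\t369\tNA\t238\n"
)

def parse_table(text):
    """Return (column_labels, row_labels, values); 'NA' cells become None."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    row_labels, values = [], []
    for fields in body:
        label, cells = fields[0], fields[1:]
        # Every data row needs exactly one value per column label.
        if len(cells) != len(header):
            raise ValueError(
                f"row {label!r} has {len(cells)} values, expected {len(header)}"
            )
        row_labels.append(label)
        values.append([None if c == "NA" else float(c) for c in cells])
    return header, row_labels, values

cols, names, data = parse_table(SAMPLE)
print(cols)
print(names)
print(data)
```

A check like this can catch non-numerical cells such as <20 before upload, because float() raises a ValueError on them.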
Output

Three outputs are produced:

The output format is PDF. In addition to the PDF file, Heatmap provides a link to the R script used to produce the output.

Options

Use log transformation

If selected, the software will take the log of the data. If your data has any zero values, then taking the log will result in an error.
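
The reason zeros are a problem is that the logarithm of zero is undefined. The sketch below shows one way to check for this before transforming a table. It assumes a base-10 logarithm (this page does not state which base the tool uses), and the function name log_transform is hypothetical.

```python
import math

def log_transform(matrix):
    """Return a log10-transformed copy; reject zero or negative entries."""
    out = []
    for r, row in enumerate(matrix):
        new_row = []
        for c, value in enumerate(row):
            if value <= 0:
                # log(0) is undefined, so report where the offending cell is.
                raise ValueError(
                    f"cannot take log of {value} at row {r}, column {c}"
                )
            new_row.append(math.log10(value))
        out.append(new_row)
    return out

transformed = log_transform([[10, 100], [1000, 1]])
print(transformed)

try:
    log_transform([[10, 0]])
except ValueError as err:
    zero_error = str(err)
    print(zero_error)
```

If the data contain zeros, one common workaround is to replace them with a small positive value (for example, a detection threshold) before selecting the log option.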

Use threshold value

The program allows users to input a below-threshold and/or an above-threshold value. Any value at or beyond a given threshold will be presented as a white box. In addition, the program works on a copy of the input dataset in which all values at or below the below-threshold, or at or above the above-threshold, are set to the threshold.

For example, if a below-threshold of 20 is given, then all values less than or equal to 20 will be considered below the threshold. The program will create a copy of the input data in which all such values are changed to 20, and they will be represented by white boxes in the heatmap. Note: if you want only values strictly beyond the threshold to be white boxes, but not the threshold value itself, then you have to set a slightly smaller (for below-thresholds) or slightly greater (for above-thresholds) threshold. For example, if your detection threshold is 20, but 20 is still a valid value in your data, you would need to set the threshold to 19. Then all below-threshold values would be changed to 19 and represented by white boxes in the heatmap, while values of 20 will fall in the next color box.

Analogously, if you set the above-threshold to 1,000, for example, all values greater than or equal to 1,000 in the dataset will be changed to 1,000 and shown as white boxes in the heatmap. If you wish 1,000 to still be considered a valid entry and not a white box, then set the above-threshold to a value slightly greater than 1,000.
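
The threshold rules described above can be sketched in a few lines. This is only an illustration of the described behavior, not the tool's own code; the function name apply_thresholds and the returned mask are hypothetical.

```python
def apply_thresholds(matrix, below=None, above=None):
    """Return (clamped_copy, white_mask).

    Values <= below are set to below, values >= above are set to above,
    and white_mask is True wherever a cell would be drawn as a white box.
    """
    clamped, white_mask = [], []
    for row in matrix:
        new_row, mask_row = [], []
        for value in row:
            if below is not None and value <= below:
                new_row.append(below)
                mask_row.append(True)
            elif above is not None and value >= above:
                new_row.append(above)
                mask_row.append(True)
            else:
                new_row.append(value)
                mask_row.append(False)
        clamped.append(new_row)
        white_mask.append(mask_row)
    return clamped, white_mask

data = [[5, 19, 20, 21], [999, 1000, 1500, 300]]

# Threshold of 20: the value 20 itself becomes a white box.
clamped_20, white_20 = apply_thresholds(data, below=20, above=1000)
print(clamped_20)
print(white_20)

# Threshold of 19: 20 stays a regular colored cell.
clamped_19, white_19 = apply_thresholds(data, below=19)
print(clamped_19)
print(white_19)
```

Comparing the two calls shows why a slightly smaller below-threshold is needed when the threshold value itself is valid data.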

Cluster Method

To compute a dendrogram, (a) a distance method and (b) a cluster method need to be specified. A heatmap re-orders the rows and columns separately so that similar data are grouped together. A dendrogram shows the similarity of the rows, and a separate dendrogram shows the similarity of the columns. Although the row dendrogram and the column dendrogram are shown simultaneously on the heatmap, they are computed independently of each other.

Three choices are provided for cluster method:

"Complete" is the default method. The first two choices, complete and average, are widely used. Other choices may be useful in special situations where it's suspected that the clusters are not compact spherical clusters, or to more fully explore the cluster structure.

Reference:
Ward, J.H. (1963) "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, 58(301), 236-244.
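
To make the role of the cluster method concrete, here is a minimal pure-Python sketch of agglomerative clustering with complete linkage. It assumes Euclidean distance between rows and is written for clarity rather than speed; it is not the R code the tool runs, and the function names are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(rows):
    """Cluster rows bottom-up; return merges as (members_a, members_b, height)."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: the distance between two clusters is the
            # largest distance between any member of one and any of the other.
            height = max(euclidean(rows[i], rows[j]) for i in a for j in b)
            if best is None or height < best[2]:
                best = (a, b, height)
        a, b, height = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges

rows = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = complete_linkage(rows)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

The sequence of merges and their heights is what a dendrogram draws. Average linkage would differ only in using the mean of the pairwise distances instead of the maximum.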

Distance Method

Color palettes

A heatmap represents the numerical values in a table of numbers by colors. One popular palette is the Brewer Color palette where lighter, less saturated colors such as light yellows represent small values, while darker, more saturated colors such as deep reds represent large values. (See RColorBrewer.pdf for details.) The available palettes are:

Higher values

The option "High intensity color" will display high data values with high intensity colors (e.g., red) and low values with low intensity colors (e.g., yellow). The option "Low intensity color" will reverse the usual color assignments.

Number of colors

This option sets the number of colors, taken from the selected color palette, used to represent the data. This value will be ignored if color key ranges are specified.
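
One way to picture this option is as a binning step: the data range is cut into as many intervals as there are colors, and each value takes the color of its interval. The sketch below assumes equal-width intervals, which is an assumption made for illustration since this page does not describe how the tool places its breaks. The function name color_index is hypothetical.

```python
def color_index(value, lo, hi, n_colors):
    """Map a value in [lo, hi] to a color index 0..n_colors-1 (equal-width bins)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if n_colors < 1:
        raise ValueError("n_colors must be at least 1")
    fraction = (value - lo) / (hi - lo)
    index = int(fraction * n_colors)
    # The maximum would land on index n_colors, so keep it in the last bin;
    # values outside [lo, hi] are pushed into the first or last bin.
    return max(0, min(index, n_colors - 1))

values = [0, 24, 25, 50, 99, 100]
indices = [color_index(v, 0, 100, 4) for v in values]
print(indices)
```

With more colors, values that are close together are more likely to be drawn in different shades; with fewer colors, the picture is coarser but easier to read against the color key.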

Color key ranges

You can customize the color key ranges (optional). Follow the examples below.

Labels and margins

Large tables may have so much text associated with the labels of the rows and the columns that it's hard to fit the text on the heatmap. One can try adjusting the character size of the labels via the Column Label Size and Row Label Size values, along with the Bottom and Right Margin values. If all else fails, one can always change the input data to use abbreviations instead of full text.

With label sizes set at 7%, Heatmap will accommodate over 1,000 rows and 700 columns; at 20%, it will accommodate about 500 rows and 300 columns.

Bootstraps

A common procedure to assess the stability of a clustering is to resample the data with replacement, a.k.a. bootstrap resampling.

Standard Bootstrap

The original data matrix is used to compute a dendrogram by row and/or column. The matrix is then resampled some number of times, and each resampled dendrogram is compared with the original to measure support. For the row-wise dendrogram, resampling consists of making a new matrix by randomly sampling columns of data, with replacement, and then building the row-wise dendrogram. For the column-wise dendrogram, rows are sampled from the original matrix, again with replacement. The degree of support is computed using the prop.clades function of the R package ape.
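
The resampling step can be sketched as follows. This is a simplified illustration, not the tool's R code: instead of comparing whole dendrograms as prop.clades does, it only counts how often the closest pair of rows in the original matrix is still the closest pair in each resampled matrix. Euclidean distance is assumed, and all function names are hypothetical.

```python
import math
import random
from itertools import combinations

def resample_columns(matrix, rng):
    """Build one bootstrap replicate by drawing columns with replacement."""
    n_cols = len(matrix[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    replicate = [[row[j] for j in picks] for row in matrix]
    return replicate, picks

def closest_pair(matrix):
    """Return the indices of the two rows with the smallest Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    pairs = combinations(range(len(matrix)), 2)
    return min(pairs, key=lambda p: dist(matrix[p[0]], matrix[p[1]]))

def bootstrap_support(matrix, n_boot, seed=0):
    """Percent of replicates in which the original closest pair is recovered."""
    rng = random.Random(seed)
    original = closest_pair(matrix)
    hits = 0
    for _ in range(n_boot):
        replicate, _picks = resample_columns(matrix, rng)
        if closest_pair(replicate) == original:
            hits += 1
    return original, 100.0 * hits / n_boot

matrix = [[1, 2, 3, 4], [2, 3, 4, 5], [50, 60, 70, 80]]
pair, support = bootstrap_support(matrix, n_boot=200)
print(pair, support)
```

Rows are resampled in the same way for the column-wise dendrogram; in this sketch that amounts to transposing the matrix before calling bootstrap_support.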

% Bootstrap Support

To see only statistically supported nodes, specify a desired degree of support; resampling support values will then be shown only for nodes with support at or above that threshold.

pvclust

This option uses the pvclust package to perform bootstrapping. One can separately bootstrap the row data, or the column data. Pvclust produces a row or a column dendrogram with values at nodes in the dendrogram representing the stability of the clusters associated with the node. Pvclust attempts to address the bias in normal bootstrap resampling by employing a multi-scale bootstrap resampling approach. Normal bootstrap resampling values are represented in the output by green letters "bp", for "bootstrap probability". The multi-scale bootstrap resampling probabilities are represented in the output by red letters "au", for "approximately unbiased", and are generally preferred over the "bp" bootstrap probabilities. The threshold for red boxed clusters is 95% probability.

NOTE: Bootstraps can take many minutes to complete, depending on the number of iterations. To avoid browser timeouts, all results requiring bootstraps will be e-mailed.

References for bootstrapping:
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.

Add Sidebar(s)

These options allow you to add a color bar vertically or horizontally. These sidebars can be used to illustrate any possible grouping pertaining to your samples. For example, if the viruses in your data came from two different species, you could make each species clearly identifiable with a different color in the sidebar.

The colors in the sidebar may be assigned according to characters in the sample names, or by a grouping specified in a text file. The bar can be colored by a default palette (red - orange - yellow) based on frequency, or by user-selected colors (shown below) manually assigned to each group.

Red, Orange, Yellow, Yellow green, Green
Blue, Royal blue, Dark orchid, Deep pink, Hot pink
Light pink, Brown, Khaki, Lemon chiffon, Teal
Medium turquoise, Sky blue, Medium purple, Medium slate blue, Grey

The best way to understand this option is to try it. The following text can be used to add a horizontal color sidebar to the Sample Input. You can copy/paste the example below into the box for "paste grouped column labels".

Group1:
virus_1
virus_2
virus_3
virus_4
virus_5
virus_6
virus_7
virus_8
virus_9
virus_10
virus_11

Group2:
virus_12
virus_13
virus_14
virus_15
virus_16
virus_17

Group3:
virus_18
virus_19
virus_20
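
For readers preparing such a grouping programmatically, the sketch below shows how the format above can be read: a line ending in a colon starts a new group, and each following non-blank line is a label in that group. The parsing rules are inferred from the example, and the function name parse_groups and the color assignment are hypothetical illustrations rather than the tool's behavior.

```python
def parse_groups(text):
    """Map each label to its group name; a line ending in ':' starts a group."""
    label_to_group = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate groups visually
        if line.endswith(":"):
            current = line[:-1]
        elif current is None:
            raise ValueError(f"label {line!r} appears before any group name")
        else:
            label_to_group[line] = current
    return label_to_group

EXAMPLE = """Group1:
virus_1
virus_2

Group2:
virus_12
"""

groups = parse_groups(EXAMPLE)

# Give each group one color, in order of first appearance (illustrative choice).
palette = ["red", "orange", "yellow"]
order = list(dict.fromkeys(groups.values()))
colors = {label: palette[order.index(g)] for label, g in groups.items()}
print(groups)
print(colors)
```

Each label in the grouping should match a row or column label in the input table so that the sidebar color lines up with the right sample.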

 

Acknowledgements

This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.

 

last modified: Tue Aug 20 10:17 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
