HIV sequence database



Heatmap

Introduction: A heatmap is a graphical way of displaying a table of numbers by using colors to represent the numerical values. For example, low values might tend towards cool blue tones while higher values tend towards hotter orange and red tones. Heatmaps also rearrange the rows and columns of the table so that similar rows, and similar columns, are grouped together, with their similarity represented by a dendrogram (separate dendrograms for rows and for columns). This web tool uses the "heatmap.2" function of the gplots package for the statistical environment R (R: A Language and Environment for Statistical Computing).

This sort of 2-dimensional clustering was originally used for analysis of gene expression array data (see, e.g., Hastie, T., R. Tibshirani, and J. Friedman. 2001. In: The elements of statistical learning, data mining inference and prediction, p. 453-480. Springer-Verlag, New York, N.Y.). It is broadly applicable to any problem where it is beneficial to arrange numeric values in a 2-dimensional array according to like behavior. We have found this way of organizing data useful for interpreting neutralizing antibody data where panels of sera or monoclonal antibodies are tested against panels of Envelopes (Binley et al., J Virol. 2004 Dec;78(23):13232-52): it groups Envelopes with similar antibody sensitivities and simultaneously groups antibodies with similar neutralization profiles. It is also a useful method for interpreting other quantitative immunological data, such as CD8 T-cell ELISpot results.

Input: Data should be an M row by N column table with values delimited by spaces or tabs. The filename should have a suffix ".txt", as in "MyData.txt". Each row should start with a label for the row, followed by N values. Each column should start with a label for the column, followed by M values. Missing values should be denoted with the two letter code "NA" (standing for "Not Available").

Here is a small sample table illustrating the format:
BB8	BB12	BB28	BB55	BB70	BB106	Pool_B
Du123.6_C_SA	207	352	165	84	147	198	182
Du151.2_C_SA	196	1555	2529	818	241	487	518
Du156.12*_C_SA	369	426	238	336	406	258	231
Du172.17*_C_SA	429	884	499	196	549	550	315
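
The input format described above can be read with a few lines of code. The following is a minimal Python sketch, not the tool's own parser: the function name parse_table is hypothetical, labels are assumed to contain no spaces, and the header row is assumed to hold only column labels (as in the sample table). The "NA" cell in the example is inserted for illustration and does not appear in the sample data.

```python
import math

def parse_table(text):
    """Parse a space- or tab-delimited table whose first line holds the column labels."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    col_labels = lines[0].split()
    row_labels, values = [], []
    for ln in lines[1:]:
        fields = ln.split()
        label, cells = fields[0], fields[1:]
        if len(cells) != len(col_labels):
            raise ValueError(
                f"row {label!r} has {len(cells)} values, expected {len(col_labels)}"
            )
        row_labels.append(label)
        # "NA" marks a missing value; it is stored as NaN
        values.append([math.nan if c == "NA" else float(c) for c in cells])
    return row_labels, col_labels, values

# Illustrative excerpt of the sample table, with one cell replaced by NA
sample = """\
BB8\tBB12\tBB28
Du123.6_C_SA\t207\t352\t165
Du151.2_C_SA\t196\tNA\t2529
"""
rows, cols, data = parse_table(sample)
```

A row with too few or too many values raises an error here, which is one simple way to catch a misaligned table before clustering.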
Output: Three outputs are available: 1) a heatmap with row and column dendrograms; 2) a bootstrap dendrogram that represents the stability of a row, or a column, heatmap dendrogram; 3) a clustering dendrogram that shows the clustering of the row data, or of the column data, without the heatmap itself (i.e. without the color representation of the tabular data).

Output format: The output is a PDF file. In addition to the PDF file, Heatmap provides a link to the R script used to produce the output.

Color palettes: A heatmap represents the numerical values in a table by colors. The available palettes are (Brewer is the default): 1) Brewer palette: small values are lighter, less saturated colors such as cornsilks, yellows, and oranges, progressing to darker, more saturated colors such as deep reds and browns for large values. 2) heat colors: small values are red, progressing to oranges and yellows for higher values. 3) topo colors: small values are blue, progressing through greens and yellows to cornsilk for higher values. 4) red/green: small values are red, progressing through black to green for higher values. 5) ember colors: small values are blue, progressing to oranges and reds for higher values.
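
To make the idea of a palette concrete, here is a minimal Python sketch of a red/green style mapping that runs from red through black to green. It is an illustration only: the function name red_green and the linear interpolation are assumptions, and the colors the tool actually produces come from R and may differ.

```python
def red_green(value, lo, hi):
    """Map value in [lo, hi] to an (r, g, b) tuple: red at lo, black at the midpoint, green at hi."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Rescale to t in [-1, 1]; values outside [lo, hi] are clamped
    t = 2.0 * (value - lo) / (hi - lo) - 1.0
    t = max(-1.0, min(1.0, t))
    if t < 0:
        return (round(255 * -t), 0, 0)  # lower half: shades of red
    return (0, round(255 * t), 0)       # upper half: shades of green

low_color = red_green(0, 0, 100)
mid_color = red_green(50, 0, 100)
high_color = red_green(100, 0, 100)
```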

Dendrograms: A heatmap reorders the rows, and separately the columns, of the data so that similar data are grouped together. One dendrogram shows the similarity of the rows, and a separate dendrogram shows the similarity of the columns. Although the row dendrogram and the column dendrogram are shown simultaneously on the heatmap, they are computed independently of each other. To compute a dendrogram, (a) a distance metric and (b) an agglomerative method need to be specified.

Distance Metrics: 1) Euclidean (default): the square root of the sum of squared differences between corresponding elements of vectors X and Y. 2) Manhattan: the sum of the absolute values of the differences between corresponding elements of vectors X and Y. 3) Binary: each element of vectors X and Y is replaced with "1" if it is nonzero and with "0" otherwise; the distance is the proportion of elements that mismatch, i.e. a percentage Hamming distance.
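
The three distance metrics can be written out directly. The Python sketch below follows the definitions as worded in this document; the function names are hypothetical and missing values are not handled. Note that R's built-in dist(..., method = "binary") divides by the number of positions where at least one element is nonzero rather than by the full vector length, so its values can differ from this sketch.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def binary(x, y):
    """Proportion of positions where exactly one of the two elements is nonzero."""
    mismatches = sum((a != 0) != (b != 0) for a, b in zip(x, y))
    return mismatches / len(x)

d_euc = euclidean([0, 0], [3, 4])
d_man = manhattan([0, 0], [3, 4])
d_bin = binary([1, 0, 2, 0], [0, 0, 5, 3])  # positions 1 and 4 mismatch
```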

Agglomerative Method: 1) Complete (default), 2) Average, 3) Ward. The first two choices, complete and average, are widely used. Other choices may be useful in special situations where it is suspected that the clusters are not compact spherical clusters, or to explore the cluster structure more fully.
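
As an illustration of how the agglomerative method affects a dendrogram, the following Python sketch performs naive hierarchical clustering with complete or average linkage and Euclidean distance. It is a teaching sketch under stated assumptions, not the routine the tool runs in R: the function names are hypothetical, Ward's method is omitted, and the example points are invented.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(c1, c2, points, method):
    """Distance between two clusters, each given as a list of point indices."""
    pair = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "complete":
        return max(pair)              # farthest pair of members
    if method == "average":
        return sum(pair) / len(pair)  # mean over all cross-cluster pairs
    raise ValueError(f"unsupported method: {method}")

def agglomerate(points, method="complete"):
    """Return the merge history as (members_a, members_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Evaluate every pair of current clusters and merge the closest one
        candidates = [
            (cluster_distance(clusters[a], clusters[b], points, method), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        height, a, b = min(candidates)
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Invented example: two tight pairs that are far from each other
points = [[0, 0], [0, 1], [5, 5], [5, 7]]
merges_complete = agglomerate(points, "complete")
merges_average = agglomerate(points, "average")
```

Both methods merge the same pairs in this example; only the height of the final merge differs, because complete linkage uses the largest cross-cluster distance and average linkage uses the mean.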

References:

Ward, J.H. (1963) "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, 58(301), 236-244.

Bootstraps: A common procedure to assess the stability of a clustering is to bootstrap the data. This web tool uses the pvclust package to perform bootstrapping. One can separately bootstrap the row data or the column data. Pvclust produces a row or a column dendrogram in which the values at each node represent the stability of the cluster associated with that node. Pvclust attempts to address the bias in normal bootstrap resampling by employing a multi-scale bootstrap resampling approach. Normal bootstrap resampling values are shown in the output in green and labeled "bp", for "bootstrap probability". The multi-scale bootstrap resampling probabilities are shown in the output in red and labeled "au", for "approximately unbiased"; they are generally preferred over the "bp" bootstrap probabilities. The threshold for red boxed clusters is 95% probability.
NOTE: A bootstrap can take many minutes to complete, depending on the number of iterations. To avoid browser timeouts, all results requiring bootstraps are emailed to the user. Please provide a valid email address so that results can be mailed back.
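
The ordinary bootstrap probability ("bp") can be illustrated with a short Python sketch: the measured features are resampled with replacement, the items are reclustered, and each original cluster is scored by how often it reappears. This sketch does not reproduce pvclust or its multi-scale "au" values; the function names, the complete-linkage and Euclidean choices, the replicate count, and the example data are all assumptions made for illustration.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def complete_linkage_clusters(points):
    """Return every cluster (as a frozenset of item indices) formed during clustering."""
    clusters = [frozenset([i]) for i in range(len(points))]
    formed = set()
    while len(clusters) > 1:
        candidates = [
            (max(euclidean(points[i], points[j])
                 for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(candidates)
        merged = clusters[a] | clusters[b]
        formed.add(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return formed

def bootstrap_probability(points, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each original cluster reappears."""
    rng = random.Random(seed)
    n_features = len(points[0])
    original = complete_linkage_clusters(points)
    counts = {c: 0 for c in original}
    for _ in range(n_boot):
        # Resample feature positions with replacement; the items stay fixed
        idx = [rng.randrange(n_features) for _ in range(n_features)]
        resampled = [[p[k] for k in idx] for p in points]
        replicate = complete_linkage_clusters(resampled)
        for c in original:
            if c in replicate:
                counts[c] += 1
    return {c: counts[c] / n_boot for c in original}

# Invented example: items 0 and 1 are alike, items 2 and 3 are alike
points = [[1, 2, 1, 2], [2, 1, 2, 1], [50, 51, 50, 52], [51, 50, 52, 50]]
bp = bootstrap_probability(points)
```

Because the two groups in this invented example are separated in every feature, each cluster reappears in every replicate; real data will usually give values below 1 for at least some nodes.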

References:

Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.

Labels: Large tables may have so much text in the row and column labels that it is hard to fit the text on the heatmap. One can try adjusting the character size of the labels via the Column Label Size and Row Label Size values, along with the Bottom and Right Margin values. Failing all else, one can always change the input data to use abbreviations instead of full text.

Missing Values

Use "NA" (no quotes) for missing values.

Log Data

If "Use log data" is checked, the software will take the log of the data. If your data is already in log form, you should not check this box.
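
A log transform that leaves missing values untouched might look like the Python sketch below. The base of the logarithm used by the tool is not stated in this document, so base 10 is an assumption here, as are the function name and the choice to reject non-positive values.

```python
import math

def log_transform(values):
    """Take log10 of each number, passing NaN (a missing value) through unchanged."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(v)  # missing stays missing
        elif v <= 0:
            raise ValueError(f"cannot take the log of non-positive value {v}")
        else:
            out.append(math.log10(v))  # base 10 is an assumption
    return out

logged = log_transform([1.0, 10.0, 1000.0, math.nan])
```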

Acknowledgements

This tool uses R software. Thanks to the R Team:

R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.


last modified: Mon Nov 10 14:35 2008


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2006 LANSLLC All rights reserved | Disclaimer/Privacy
