Heatmap
Introduction: A heatmap is a graphical way of displaying a table of numbers
by using colors to represent the numerical values. For example, low values
might tend towards cool blue tones while higher values tend to hotter
orange and red tones. Heatmaps also re-arrange the rows and columns of the
table so that similar rows, and similar columns, are grouped together, with
their similarity represented by a dendrogram (separate dendrograms for rows
and for columns). This web tool uses the "heatmap.2" function of the gplots
package for the statistical environment
R: A Language and Environment for Statistical Computing.
This sort of 2-dimensional clustering was originally used for analysis of gene
expression array data (see, e.g., Hastie, T., R. Tibshirani, and J. Friedman. 2001. The
elements of statistical learning: data mining, inference, and prediction, p. 453-480.
Springer-Verlag, New York, N.Y.). It is broadly applicable to any problem
where it is beneficial to arrange numeric values in a 2-dimensional array
according to similar behavior. We have found this way of organizing data
useful for interpreting neutralizing antibody data where panels of sera or
monoclonal antibodies are tested against panels of Envelopes (Binley et al., J
Virol. 2004 Dec;78(23):13232-52), to group Envelopes with similar antibody
sensitivities and simultaneously group antibodies with similar neutralization
profiles. It is also a useful method for interpreting other quantitative immunological
data, such as CD8 T-cell EliSpot results.
Input: Data should be an M row by N column table with values delimited by
spaces or tabs. The filename should have the suffix ".txt", as in
"MyData.txt". The first line should hold the N column labels. Each of the
following M lines should start with a label for the row, followed by N
values. Missing values should be denoted with the two-letter code "NA"
(standing for "Not Available").
Here is a small sample table illustrating the format:
BB8 BB12 BB28 BB55 BB70 BB106 Pool_B
Du123.6_C_SA 207 352 165 84 147 198 182
Du151.2_C_SA 196 1555 2529 818 241 487 518
Du156.12*_C_SA 369 426 238 336 406 258 231
Du172.17*_C_SA 429 884 499 196 549 550 315
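As an illustration of the format above, here is a minimal Python sketch that reads such a table. It is not part of the tool: the function name parse_table is hypothetical, and the sketch assumes, as in the sample, that the first line holds only the column labels and that "NA" marks a missing value (stored here as NaN).

```python
import math

def parse_table(text):
    """Parse a whitespace-delimited table whose first line holds the column labels."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    col_labels = lines[0].split()
    row_labels, values = [], []
    for ln in lines[1:]:
        fields = ln.split()
        label, cells = fields[0], fields[1:]
        if len(cells) != len(col_labels):
            raise ValueError(
                f"row {label!r} has {len(cells)} values, expected {len(col_labels)}"
            )
        row_labels.append(label)
        # "NA" marks a missing value; it is stored as NaN in this sketch.
        values.append([math.nan if c == "NA" else float(c) for c in cells])
    return row_labels, col_labels, values

sample = """\
BB8 BB12 BB28 BB55 BB70 BB106 Pool_B
Du123.6_C_SA 207 352 165 84 147 198 182
Du151.2_C_SA 196 1555 2529 818 241 487 518
Du156.12*_C_SA 369 426 238 336 406 258 231
Du172.17*_C_SA 429 884 499 196 549 550 315
"""
rows, cols, data = parse_table(sample)
```

A row with too few or too many values raises an error, which is a cheap way to catch a mis-delimited file before uploading it.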
Output: Three outputs are available: 1) a heatmap with row and column
dendrograms; 2) a bootstrap dendrogram that represents the stability of a row
or column heatmap dendrogram; 3) a clustering dendrogram that shows the
clustering of the row data, or of the column data, without a heatmap, i.e.
without a color representation of the tabular data.
Output format: The output format is pdf. In addition to the pdf file, Heatmap
provides a link to the R script used to produce the output.
Color palettes: A heatmap represents the numerical values in a table by colors. The available palettes are (Brewer is the default):
1) Brewer palette: small values are lighter, less saturated colors such as cornsilk, yellows, and oranges, progressing to darker, more saturated colors such as deep reds and browns for large values.
2) heat colors: small values are red, progressing to oranges and yellows for higher values.
3) topo colors: small values are blue, progressing to greens, then yellows, to cornsilk at higher values.
4) red/green: small values are red, progressing through black to green for higher values.
5) ember colors: small values are blue, progressing to oranges and reds for higher values.
Dendrograms: A heatmap re-orders the rows, and separately the columns, of
the data so that similar data are grouped together. A dendrogram shows the
similarity of the rows, and a separate dendrogram shows the similarity of
the columns. Although the row dendrogram and the column dendrogram are shown
simultaneously on the heatmap, they are computed independently of each
other. To compute a dendrogram, (a) a distance metric and (b) an
agglomerative method need to be specified. Distance Metrics: 1) Euclidean
(default): the square root of the sum of squared differences between the
elements of vectors X and Y. 2) Manhattan: the sum of the absolute
values of the differences between the elements of vectors X and Y. 3) Binary:
replaces elements of vectors X and Y with "1" if the element is nonzero,
and with "0" otherwise. The distance is the proportion of elements that
mismatch, i.e. a percentage Hamming distance.
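The three distance metrics can be sketched in a few lines of Python. This is an illustration of the definitions given above, not the tool's code; the function names are hypothetical and missing values are not handled. The binary distance follows the wording of this page (proportion of mismatching positions); R's own dist(method = "binary") instead divides by the number of positions where at least one vector is nonzero, so its values can differ when both vectors share zeros.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def binary(x, y):
    """Proportion of positions whose zero/nonzero status differs."""
    mismatches = sum((a != 0) != (b != 0) for a, b in zip(x, y))
    return mismatches / len(x)

x = [1, 0, 4, 0]
y = [4, 0, 0, 0]
d_euclidean = euclidean(x, y)   # differences are -3, 0, 4, 0
d_manhattan = manhattan(x, y)
d_binary = binary(x, y)         # only the third position mismatches
```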
Agglomerative Method: 1) Complete (default), 2) Average, 3) Ward. The first
two choices, complete and average, are widely used. Other choices may be
useful in special situations where the clusters are suspected not to be
compact and spherical, or to explore the cluster structure more fully.
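The difference between the complete and average methods lies only in how the distance between two clusters is derived from the pairwise distances of their members: complete uses the largest, average uses the mean. The following naive Python sketch shows this on four one-dimensional points. It is a teaching example under those assumptions, not the algorithm R uses internally; the function names are hypothetical and Ward's method is omitted.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(c1, c2, points, method):
    """Distance between two clusters, each given as a list of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported method: {method}")

def agglomerate(points, method="complete"):
    """Return the merge history as (cluster_a, cluster_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

points = [[0.0], [1.0], [5.0], [7.0]]
complete_heights = [h for _, _, h in agglomerate(points, "complete")]
average_heights = [h for _, _, h in agglomerate(points, "average")]
```

Both methods merge the same pairs here, but the final merge of {0, 1} with {5, 7} happens at height 7.0 under complete linkage and 5.5 under average linkage, which is why the same data can give dendrograms of different shape.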
References:
Ward, J.H. (1963) "Hierarchical grouping to optimize an objective function", Journal of the American
Statistical Association, 58(301), 236-244.
Bootstraps: A common procedure to assess the stability of a clustering is to bootstrap the data. This web tool uses the
pvclust package to perform bootstrapping. One can separately bootstrap the
row data, or the column data. Pvclust produces a row or a column dendrogram in which the value at each node represents the stability of the
cluster associated with that node. Pvclust attempts to address the bias in normal bootstrap resampling by employing a multi-scale bootstrap resampling
approach. Normal bootstrap resampling values are represented in the output by green letters "bp", for "bootstrap probability". The multi-scale
bootstrap resampling probabilities are represented in the output by red letters "au", for "approximately unbiased", and are generally preferred over
the "bp" bootstrap probabilities. The threshold for red boxed clusters is 95% probability.
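The idea behind the ordinary "bp" value can be sketched in Python as follows. This is a deliberately simplified illustration and not what pvclust does internally: it resamples the columns with replacement and counts how often a given pair of rows is still the closest pair, rather than re-running a full hierarchical clustering, and it does not attempt the multi-scale "au" correction. The function names, the number of replicates, and the fixed seed are all assumptions made for the example.

```python
import math
import random

def closest_pair(rows):
    """Return the index pair (i, j) of the two rows with the smallest Euclidean distance."""
    best = None
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

def bootstrap_probability(rows, pair, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is still the closest pair of rows."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    hits = 0
    for _ in range(n_boot):
        # Resample column indices with replacement, then rebuild every row.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = [[row[c] for c in cols] for row in rows]
        if closest_pair(resampled) == pair:
            hits += 1
    return hits / n_boot

rows = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [9.0, 8.0, 7.0, 6.0],
    [5.0, 9.0, 1.0, 7.0],
]
bp = bootstrap_probability(rows, (0, 1))
```

In this toy data the first two rows differ by 0.1 in every column, so they remain the closest pair in every replicate and the estimate is 1.0. Real data usually give values below 1, and more replicates give a steadier estimate at the cost of run time.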
NOTE: Bootstrapping can take many minutes to complete, depending on the number of iterations.
To avoid browser timeouts, all results requiring bootstraps will be emailed to the user.
Please provide a valid email address so that results can be mailed back.
References:
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap
resampling", Annals of Statistics, 32, 2616-2641.
Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical
clustering of microarray data: How accurate are these clusters?", The Fifteenth International
Conference on Genome Informatics 2004, P034.
Labels: Large tables may have so much text associated with the labels of
the rows and the columns that it's hard to fit the text on the heatmap.
One can try adjusting the character size of the labels via the Column Label
Size and Row Label Size values, along with the Bottom and Right Margin values.
Failing all else, one can always change the input data to use abbreviations
instead of full text.
Missing Values
Use "NA" (no quotes) for missing values.
Log Data
If "Use log data" is checked, the software will take the log of the data. If your data is already in log form, you
should not check this box.
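As a rough picture of what a log transform does to a table, the Python sketch below takes the logarithm of every value and passes missing values through unchanged. The function name log_transform is hypothetical, and the base (10 here) is an assumption, since this page does not say which base the tool uses. Zero or negative values have no logarithm, so the sketch rejects them.

```python
import math

def log_transform(values, base=10.0):
    """Return a new table with the logarithm of each value; NaN (missing) is kept as NaN."""
    result = []
    for row in values:
        new_row = []
        for v in row:
            if math.isnan(v):
                new_row.append(math.nan)  # missing value stays missing
            elif v <= 0:
                raise ValueError(f"cannot take the log of non-positive value {v}")
            else:
                new_row.append(math.log(v, base))
        result.append(new_row)
    return result

data = [[100.0, math.nan], [1000.0, 10.0]]
logged = log_transform(data)
```

Applying the transform to data that are already in log form compresses the values a second time, which is why the box should stay unchecked in that case.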
Acknowledgements
This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0,
www.R-project.org.
last modified: Mon Nov 10 14:35 2008