


Heatmap Explanation   Kmeans clustering

Introduction: A heatmap is a graphical way of displaying a table of numbers by using colors to represent the numerical values. For example, low values might tend towards yellow tones while higher values tend towards hotter orange and red tones. The clustering algorithm groups related rows and/or columns together by similarity. This web tool uses a modified version of "heatmap.2" from the gplots package of the statistical environment R (R: A Language and Environment for Statistical Computing).

Kmeans clustering is performed on the rows and columns, and the rows/columns that fall in the same cluster are represented by the distinct colors on the row/column sidebars. Clustering is done by bootstraps or by noise data, or both.

Input

Required format:

Here is a small sample table illustrating the format:

6535.3_B_USA	QH0692.42_B_Trinidad	SC422661.8_B_Trinidad	PVO.4_B_Italy	AC10.0.29_B_USA
713080024	97	183	83	10	10
702010440	441	288	759	230	155
713080038	10	73	24	10	10
713080046	10	10	10	10	66
704010210	245	315	157	34	257
702010293	1209	221	44	26	NA
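
The sample above suggests that the header row lists only the column labels, so it has one field fewer than the data rows, whose first field is the row label. The following is a minimal Python sketch of reading such a table under that assumption; the function name parse_heatmap_table is hypothetical and is not part of the web tool.

```python
import csv
import io

def parse_heatmap_table(text):
    """Parse a tab-delimited table whose header row holds only column labels.

    Assumption (inferred from the sample): every data row starts with a row
    label and then has one value per column label. "NA" marks a missing value.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    col_labels = rows[0]
    row_labels, values = [], []
    for r in rows[1:]:
        if len(r) != len(col_labels) + 1:
            raise ValueError(
                f"row {r[0]!r} has {len(r) - 1} values, expected {len(col_labels)}"
            )
        row_labels.append(r[0])
        # Missing values are kept as None so later steps can skip them.
        values.append([None if v == "NA" else float(v) for v in r[1:]])
    return col_labels, row_labels, values

# A cut-down version of the sample table.
sample = (
    "6535.3_B_USA\tQH0692.42_B_Trinidad\tAC10.0.29_B_USA\n"
    "713080024\t97\t183\t10\n"
    "702010293\t1209\t221\tNA\n"
)
cols, row_ids, data = parse_heatmap_table(sample)
print(cols)
print(row_ids)
print(data)
```
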
Output

You will receive:

Options

Use log transformation

If selected, the software will take the log values of your data. If your data has any values of zero, then taking the log will result in an error.
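
To see why zeros are a problem, here is a small Python sketch of a log transformation that checks for non-positive values before transforming. The base of the logarithm used by the tool is not stated here; base 10 is an assumption made only for this illustration, and log_transform is a hypothetical name.

```python
import math

def log_transform(values):
    """Return a log10-transformed copy of a table (list of rows).

    None marks a missing value and is passed through unchanged.
    Base 10 is an assumption; the tool's actual base is not documented here.
    """
    out = []
    for row in values:
        new_row = []
        for v in row:
            if v is None:
                new_row.append(None)
            elif v <= 0:
                # The log of zero (or a negative number) is undefined.
                raise ValueError(f"cannot take the log of {v}")
            else:
                new_row.append(math.log10(v))
        out.append(new_row)
    return out

print(log_transform([[10, 100, None]]))  # [[1.0, 2.0, None]]
```

Replacing zeros with a small positive value, or applying a below-threshold first, are common ways to avoid the error.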

Use threshold value

The program allows users to input a below-threshold and/or an above-threshold value. Any value at or beyond a given threshold will be shown as a white box. In addition, the program works on a modified copy of the input data in which all values below the below-threshold or above the above-threshold are set to the respective threshold.

For example, if a below-threshold of 20 is given, then all values less than or equal to 20 will be considered below the threshold. The program will create a copy of the input data where all such values are changed to 20, and these values will be represented by white boxes in the heatmap. Note: if you want all values strictly less than the threshold to be white boxes, but not the threshold itself, then you have to set a slightly smaller (for below-thresholds) or slightly greater (for above-thresholds) threshold. For example, if your detection threshold is 20, but 20 is still part of your data, you would need to set the threshold to 19. Then all below-threshold values would be changed to 19 and represented by white boxes in the heatmap, while values of 20 will be shown in the next color.

Analogously, if you set the above-threshold to 1,000, for example, all values greater than or equal to 1,000 in the dataset will be changed to 1,000 and shown as white boxes in the heatmap. If you wish 1,000 to still be considered a valid entry and not a white box, then set the above-threshold to a value slightly greater than 1,000.
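
The threshold rules described above can be sketched in Python as follows. This is an illustration of the described behavior, not the tool's own code; the function name apply_thresholds and the treatment of missing values (left untouched, not white) are assumptions.

```python
def apply_thresholds(values, below=None, above=None):
    """Clamp a table to the given thresholds.

    Returns (clamped, white): a modified copy of the table, and a mask that
    is True wherever a cell would be drawn as a white box.
    """
    clamped, white = [], []
    for row in values:
        clamped_row, white_row = [], []
        for v in row:
            if v is None:
                # Assumption: missing values are not affected by thresholds.
                clamped_row.append(None)
                white_row.append(False)
            elif below is not None and v <= below:
                clamped_row.append(below)
                white_row.append(True)
            elif above is not None and v >= above:
                clamped_row.append(above)
                white_row.append(True)
            else:
                clamped_row.append(v)
                white_row.append(False)
        clamped.append(clamped_row)
        white.append(white_row)
    return clamped, white

row = [10, 20, 21, 999, 1000, 1500]
clamped, white = apply_thresholds([row], below=20, above=1000)
print(clamped[0])  # [20, 20, 21, 999, 1000, 1000]
print(white[0])    # [True, True, False, False, True, True]

# Lowering the below-threshold to 19 keeps 20 as a regular colored cell.
clamped19, white19 = apply_thresholds([[10, 20]], below=19)
print(clamped19[0], white19[0])  # [19, 20] [True, False]
```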

Number of clusters

The number of clusters required on the rows/columns using kmeans clustering.

Cluster support threshold

This threshold applies to clustering based on both bootstrapping and noise data. For a given number of clusters (K), the K clusters are considered resolved at a given support threshold if every cluster has at least one member that falls in that cluster in more than the threshold fraction of the resamplings.

With the system default, the tool starts at a threshold of 90% and decreases it in steps of 10% until K clusters are resolved or the threshold drops to 50%. To look for clusters at a specific threshold instead, say 95%, set this option.
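
The default search over thresholds can be sketched in Python as below. The data structure (a dictionary mapping each cluster to the support percentage of each of its members) and the function names are hypothetical, and whether 50% itself is tested is an assumption; the sketch treats it as the last threshold tried.

```python
def clusters_resolved(support, threshold):
    """True if every cluster has at least one member whose support
    (percent of resamplings in which it fell in that cluster) exceeds
    the threshold."""
    return all(
        any(pct > threshold for pct in members.values())
        for members in support.values()
    )

def find_support_threshold(support, start=90, stop=50, step=10):
    """Lower the threshold from `start` in steps of `step` until the
    clusters are resolved. Returns the threshold, or None if the clusters
    are still unresolved at `stop` (assumed inclusive)."""
    threshold = start
    while threshold >= stop:
        if clusters_resolved(support, threshold):
            return threshold
        threshold -= step
    return None

# Hypothetical support values (percent) for K = 2 clusters.
support = {
    "cluster1": {"virus_a": 95, "virus_b": 60},
    "cluster2": {"virus_c": 72, "virus_d": 40},
}
print(find_support_threshold(support))  # 70
```

In this example cluster2 has no member above 90% or 80%, so the clusters are first resolved at 70%.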

Color palettes

A heatmap represents the numerical values in a table of numbers by colors. One popular palette is the Brewer Color palette where lighter, less saturated colors such as light yellows represent small values, while darker, more saturated colors such as browns and deep reds represent large values. The available palettes are:

Higher values

The option "High intensity color" will display high data values with high intensity colors (e.g., red) and low values with low intensity color (e.g., yellow). The option "Low intensity color" will reverse the usual color assignments.

Number of colors

The number of colors in which the data will be represented using the color palette selected. This value will be ignored if color key ranges are specified.
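
As an illustration of what the number of colors means, the Python sketch below assigns a value to one of N color bins. Equal-width bins between the smallest and largest value are an assumption made for this example; the tool's actual binning is not described here, and color_bin is a hypothetical name.

```python
def color_bin(value, lo, hi, n_colors):
    """Return the index (0 .. n_colors - 1) of the color bin for `value`,
    assuming equal-width bins spanning the data range [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    index = int((value - lo) / (hi - lo) * n_colors)
    # Values at the upper edge (or outside the range) go to the nearest bin.
    return min(max(index, 0), n_colors - 1)

print(color_bin(0, 0, 100, 5))    # 0, the lowest color
print(color_bin(50, 0, 100, 5))   # 2, the middle color
print(color_bin(100, 0, 100, 5))  # 4, the highest color
```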

Color key ranges

You can customize the color key ranges (optional). Follow the examples below.

Margin and Label sizes

Large tables may have so much text associated with the labels that it's hard to fit the text on the heatmap. One can try adjusting the margins or the character size for the labels via the Row and Column label size values. Failing all else, one can always change the input data to use abbreviations instead of full text.

With label sizes set at 7%, Heatmap will accommodate over 1,000 rows and 700 columns. At 20%, it will accommodate about 500 rows and 300 columns.

# of resamplings

For bootstraps, this value is the number of data samples (iterations) on which kmeans clustering is done. For noise, it is the number of noise (repeated) data samples on which kmeans clustering is done.

Cluster Method

Cluster heatmap by bootstrapping: A common procedure to assess the stability of a clustering is to bootstrap the data. This web tool uses the bootstrap function of the R software to perform bootstrapping. For details on this function, see B. Efron & R.J. Tibshirani (1993): An introduction to the bootstrap. Chapman & Hall. The kmeans function is bootstrapped to get the number of clusters on rows and columns.
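
The resampling step of a bootstrap can be sketched in Python as follows. This is not the R bootstrap function used by the tool; it only illustrates the general idea, assuming that rows are clustered and columns are resampled with replacement. The name bootstrap_columns is hypothetical, and the clustering step itself is left out.

```python
import random

def bootstrap_columns(values, n_resamplings, seed=0):
    """Yield `n_resamplings` bootstrap replicates of a table.

    Each replicate has the same shape as the input, with columns drawn
    at random with replacement from the original columns.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_cols = len(values[0])
    for _ in range(n_resamplings):
        picks = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield [[row[j] for j in picks] for row in values]

data = [[1, 2, 3],
        [4, 5, 6]]
for replicate in bootstrap_columns(data, n_resamplings=3):
    # A clustering function (for example kmeans) would be run on each replicate.
    print(replicate)
```

Counting how often each row lands in the same cluster across the replicates gives the support values that the cluster support threshold is compared against.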

Cluster heatmap by applying noise data: Noise data represent repeated measurements. Kmeans clustering is done on the input data combined with random values generated from a normal distribution with 2 standard deviations. The standard deviation is calculated from the noise data.

NOTE: Clustering by either method can take many minutes to complete, depending on the number of resamplings. To avoid browser timeouts, all results will be e-mailed.

Input Noise Data

Noise Data should be a table of M rows by 2 columns in a .txt file with values delimited by spaces or tabs. Each column should start with a label for the column, followed by M values. Each row should contain only 2 values and no row labels. Two or more rows should have the same first value but different second values. Missing values should be denoted by "NA" (standing for "Not Available") or "ND".

Here is a small sample table illustrating the format:

isolate	id50
6535.3/702010293	1514
6535.3/702010293	1519
6535.3/702010440	205
6535.3/702010440	222
6535.3/704010210	489
6535.3/704010210	612
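
To illustrate how repeated measurements in this format can yield a standard deviation, the Python sketch below groups rows by their first value and computes the sample standard deviation of each group. How the tool itself combines the replicates is not described here, so this is only one plausible reading; replicate_sd is a hypothetical name.

```python
import statistics
from collections import defaultdict

def replicate_sd(noise_text):
    """Group the second column by the first and return the sample
    standard deviation of each group with at least two values."""
    lines = [ln for ln in noise_text.splitlines() if ln.strip()]
    groups = defaultdict(list)
    for ln in lines[1:]:  # the first line holds the column labels
        label, value = ln.split()  # values are delimited by spaces or tabs
        if value in ("NA", "ND"):
            continue  # skip missing values
        groups[label].append(float(value))
    return {
        label: statistics.stdev(vals)
        for label, vals in groups.items()
        if len(vals) >= 2
    }

noise = """isolate\tid50
6535.3/702010293\t1514
6535.3/702010293\t1519
6535.3/702010440\t205
6535.3/702010440\t222
"""
sd = replicate_sd(noise)
for label, value in sd.items():
    print(label, round(value, 2))
```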

Add Sidebar(s)

These options allow you to add a color bar vertically or horizontally. These sidebars can be used to illustrate any possible grouping pertaining to your samples. For example, if the viruses in your data came from two different species, you could make each species clearly identifiable with a different color in the sidebar.

The colors in the sidebar may be assigned according to characters in the sample names, or by a grouping specified in a text file. The bar can be colored by a default palette (red - orange - yellow) based on frequency, or by user-selected colors (shown below) manually assigned to each group.

Red Orange Yellow Yellow green Green
Blue Royal blue Dark orchid Deep pink Hot pink
Light pink Brown Khaki Lemon chiffon Teal
Medium turquoise Sky blue Medium purple Medium slate blue Grey

The best way to understand this option is to try it. The following text can be used to add a horizontal color sidebar to the Sample Input. You can copy/paste the example below into the box for "paste grouped column labels".

Group1:
virus_1
virus_2
virus_3
virus_4
virus_5
virus_6
virus_7
virus_8
virus_9
virus_10
virus_11

Group2:
virus_12
virus_13
virus_14
virus_15
virus_16
virus_17

Group3:
virus_18
virus_19
virus_20
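
The grouping format above can be read as a list of group headers, each ending in a colon and followed by its labels. The Python sketch below parses that format and assigns one color per group. The function names are hypothetical, and assigning the default red, orange, yellow colors in the order the groups appear is an assumption made for the example.

```python
def parse_groups(text):
    """Parse 'GroupName:' headers followed by one label per line."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate groups
        if line.endswith(":"):
            current = line[:-1]
            groups[current] = []
        elif current is None:
            raise ValueError(f"label {line!r} appears before any group header")
        else:
            groups[current].append(line)
    return groups

def sidebar_colors(groups, palette=("red", "orange", "yellow")):
    """Map every label to the color of its group (groups in input order)."""
    colors = {}
    for i, labels in enumerate(groups.values()):
        for label in labels:
            colors[label] = palette[i % len(palette)]
    return colors

text = """Group1:
virus_1
virus_2

Group2:
virus_12

Group3:
virus_18
"""
groups = parse_groups(text)
colors = sidebar_colors(groups)
print(groups)
print(colors)
```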

 

Acknowledgements

This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.

last modified: Mon Aug 19 15:44 2013


Questions or comments? Contact us at seq-info@lanl.gov.

 
Operated by Los Alamos National Security, LLC, for the U.S. Department of Energy's National Nuclear Security Administration
Copyright © 2005-2012 LANS LLC All rights reserved | Disclaimer/Privacy
