Kmeans clustering is performed on the rows and columns, and rows or columns that fall in the same cluster are shown in the same color on the row/column sidebars. Clustering can be based on bootstrapping, on noise data, or on both.
Required format:
Here is a small sample table illustrating the format:
6535.3_B_USA QH0692.42_B_Trinidad SC422661.8_B_Trinidad PVO.4_B_Italy AC10.0.29_B_USA
713080024 97 183 83 10 10
702010440 441 288 759 230 155
713080038 10 73 24 10 10
713080046 10 10 10 10 66
704010210 245 315 157 34 NA
702010293 1209 221 44 26 NA
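As an illustration only, the sample table can be read with a few lines of Python. The layout assumed here is inferred from the sample rather than stated by the tool: a header line of column labels, then one line per row holding a row label followed by one value per column. The function name `parse_table` and the handling of "NA"/"ND" as missing values are assumptions made for this sketch.

```python
SAMPLE = """\
6535.3_B_USA QH0692.42_B_Trinidad SC422661.8_B_Trinidad PVO.4_B_Italy AC10.0.29_B_USA
713080024 97 183 83 10 10
702010440 441 288 759 230 155
713080038 10 73 24 10 10
713080046 10 10 10 10 66
704010210 245 315 157 34 NA
702010293 1209 221 44 26 NA
"""

MISSING = {"NA", "ND"}  # assumed markers for missing values


def parse_table(text):
    """Return (column_labels, {row_label: [values]}) from whitespace-delimited text."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    columns = lines[0]
    rows = {}
    for fields in lines[1:]:
        label, cells = fields[0], fields[1:]
        if len(cells) != len(columns):
            raise ValueError(f"row {label!r} has {len(cells)} values, "
                             f"expected {len(columns)}")
        # Missing values become None; everything else is parsed as a number.
        rows[label] = [None if c in MISSING else float(c) for c in cells]
    return columns, rows


columns, rows = parse_table(SAMPLE)
```

Checking that every row has as many values as there are column labels catches the most common formatting mistake before the data is submitted.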
You will receive:
Use data transformation
If selected, the software will take the log value or square root of your data. If your data has any values of zero, then taking the log will result in an error.
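The following sketch shows why zeros are a problem for the log transformation. It is not the tool's code: the function name `transform`, the use of base-10 logarithms, and the pass-through of missing values are all assumptions made for illustration.

```python
import math


def transform(values, method="log"):
    """Apply a log (base 10 assumed) or square-root transform to a list of numbers.

    Missing values (None) are passed through unchanged.
    """
    present = [v for v in values if v is not None]
    if method == "log":
        # The logarithm is undefined for zero (and for negative numbers).
        if any(v <= 0 for v in present):
            raise ValueError("log transform requires all values to be > 0")
        func = math.log10
    elif method == "sqrt":
        if any(v < 0 for v in present):
            raise ValueError("square root requires all values to be >= 0")
        func = math.sqrt
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [None if v is None else func(v) for v in values]
```

If your data contains zeros, one option is to replace them with a small positive value (for example via a below-threshold) before choosing the log transformation.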
Use threshold value
The program allows users to enter a below-threshold and/or an above-threshold value. Any value at or beyond a given threshold is shown as a white box. In addition, the program works on a modified copy of the input dataset in which all values below the below-threshold are set to the below-threshold and all values above the above-threshold are set to the above-threshold.
For example, if the below-threshold is 20, then all values less than or equal to 20 are considered below the threshold. The program creates a copy of the input data in which all such values are changed to 20, and all of them are represented by white boxes in the heatmap. Note: if you want only values strictly less than the threshold to be white boxes, but not the threshold itself, set a slightly smaller threshold (for below-thresholds) or a slightly greater one (for above-thresholds). For example, if your detection threshold is 20 but 20 is still a valid value in your data, set the threshold to 19. All below-threshold values are then changed to 19 and shown as white boxes, while values of 20 fall in the next color box.
Analogously, if you set the above-threshold to 1,000, all values greater than or equal to 1,000 in the dataset are changed to 1,000 and shown as white boxes in the heatmap. If you want 1,000 to remain a valid entry rather than a white box, set the above-threshold to a value slightly greater than 1,000.
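The threshold rule described above can be sketched in Python as follows. This is a minimal illustration, not the tool's implementation. The function name `apply_thresholds` is hypothetical, and the treatment of missing values (passed through and not marked white) is an assumption.

```python
def apply_thresholds(values, below=None, above=None):
    """Return (clamped_values, white_flags) for a list of numbers.

    A value <= below is set to `below`; a value >= above is set to `above`.
    Both kinds are flagged as white boxes.
    """
    clamped, white = [], []
    for v in values:
        if v is None:                      # missing value: left untouched (assumption)
            clamped.append(None)
            white.append(False)
        elif below is not None and v <= below:
            clamped.append(below)
            white.append(True)
        elif above is not None and v >= above:
            clamped.append(above)
            white.append(True)
        else:
            clamped.append(v)
            white.append(False)
    return clamped, white


# With a below-threshold of 20, the value 20 itself becomes a white box ...
at_20 = apply_thresholds([10, 20, 21], below=20)
# ... while a below-threshold of 19 keeps 20 as a regular colored value.
at_19 = apply_thresholds([10, 20, 21], below=19)
```

Here `at_20` is `([20, 20, 21], [True, True, False])` and `at_19` is `([19, 20, 21], [True, False, False])`, which matches the advice to pick a threshold slightly beyond the value you want to keep.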
Number of clusters
The number of clusters required on the rows/columns using kmeans clustering.
Cluster support threshold
This threshold applies to clustering based on both bootstrapping and noise data. For a given number of clusters (K), the K clusters are considered resolved at a given support threshold if every cluster has at least one member that falls in that cluster in more than the threshold fraction of the resamplings.
With the system default, the tool starts at a threshold of 90% and decreases it in steps of 10% until K clusters are resolved or the threshold reaches 50%. If you set this option, the tool instead tries to resolve the clusters at the specific threshold you choose, for example 95%.
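The default search over support thresholds can be sketched as below. This is a reading of the description above, not the tool's actual code. The input `counts[i][k]` (the number of resamplings in which item i fell in cluster k), both function names, and the choice to still test the 50% level are assumptions.

```python
def clusters_resolved(counts, n_resamplings, threshold_pct):
    """True if every cluster has at least one member that fell in it
    in more than threshold_pct percent of the resamplings."""
    n_clusters = len(counts[0])
    needed = n_resamplings * threshold_pct / 100
    return all(
        any(row[k] > needed for row in counts)
        for k in range(n_clusters)
    )


def find_support_threshold(counts, n_resamplings,
                           start_pct=90, stop_pct=50, step_pct=10):
    """Try 90%, 80%, ... down to 50%; return the first threshold at which
    all clusters are resolved, or None if none is."""
    for pct in range(start_pct, stop_pct - 1, -step_pct):
        if clusters_resolved(counts, n_resamplings, pct):
            return pct
    return None


# Three items, K = 2 clusters, 100 resamplings (made-up counts).
counts = [[95, 5], [40, 60], [10, 90]]
support = find_support_threshold(counts, n_resamplings=100)
```

In this made-up example the clusters are not resolved at 90%, because no item falls in the second cluster more than 90 times, but they are resolved at 80%, so `support` is 80.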
Color palettes
A heatmap represents the numerical values in a table of numbers by colors. One popular palette is the Brewer Color palette where lighter, less saturated colors such as light yellows represent small values, while darker, more saturated colors such as browns and deep reds represent large values. The available palettes are:
Higher values
The option "High intensity color" will display high data values with high intensity colors (e.g., red) and low values with low intensity color (e.g., yellow). The option "Low intensity color" will reverse the usual color assignments.
Number of colors
The number of colors in which the data will be represented using the color palette selected. This value will be ignored if color key ranges are specified.
Color key ranges
You can customize the color key ranges (optional). Follow the examples below.
Margin and Label sizes
Large tables may have so much text associated with the labels that it's hard to fit the text on the heatmap. One can try adjusting the margins or the character size for the labels via the Row and Column label size values. Failing all else, one can always change the input data to use abbreviations instead of full text.
With label sizes set at 7%, Heatmap will accommodate over 1,000 rows and 700 columns. At 20%, about 500 rows and 300 columns.
# of resamplings
For bootstraps, this value is the number of data samples (iterations) on which kmeans clustering is done. For noise, it is the number of noise (repeated) data samples on which kmeans clustering is done.
Cluster Method
Cluster heatmap by bootstrapping: A common procedure to assess the stability of a clustering is to bootstrap the data. This web tool uses the bootstrap function of the R software to perform bootstrapping. See Efron and Tibshirani [B. Efron & R.J. Tibshirani (1993): An introduction to the bootstrap. Chapman & Hall] for details on this function. The kmeans function is bootstrapped to get the number of clusters on rows and columns.
Cluster heatmap by applying noise data: Noise data consists of repeated measurements. Kmeans clustering is applied to the input data combined with random values generated from a normal distribution with 2 standard deviations. The standard deviation is calculated from the noise data.
NOTE: Clustering by either method can take many minutes to complete, depending on the number of resamplings. To avoid browser timeouts, all results will be e-mailed.
Input Noise Data
Noise Data should be a table of M rows by 2 columns in a .txt file with values delimited by spaces or tabs. Each column should start with a label for the column, followed by M values. Each row should contain only 2 values and no row label. Two or more rows should have the same first value but different second values. Missing values should be denoted by "NA" (standing for "Not Available") or "ND".
Here is a small sample table illustrating the format:
isolate id50
6535.3/702010293 1514
6535.3/702010293 1519
6535.3/702010440 205
6535.3/702010440 222
6535.3/704010210 489
6535.3/704010210 612
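To show how repeated measurements of this kind can yield a standard deviation, the sketch below groups the rows by their first value and computes the sample standard deviation within each group. The tool's exact calculation is not described here, so the per-group approach and the function name `replicate_sd` are assumptions for illustration only.

```python
import statistics
from collections import defaultdict

NOISE = """\
isolate id50
6535.3/702010293 1514
6535.3/702010293 1519
6535.3/702010440 205
6535.3/702010440 222
6535.3/704010210 489
6535.3/704010210 612
"""


def replicate_sd(text):
    """Group rows by their first value and return the sample standard
    deviation of the second value within each group."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    groups = defaultdict(list)
    for key, value in lines[1:]:          # skip the header line
        if value not in ("NA", "ND"):     # ignore missing measurements
            groups[key].append(float(value))
    # A standard deviation needs at least two measurements per group.
    return {k: statistics.stdev(v) for k, v in groups.items() if len(v) >= 2}


sd_by_sample = replicate_sd(NOISE)
```

This also makes clear why two or more rows must share the same first value: a single measurement gives no information about its variability.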
These options allow you to add a color bar vertically or horizontally. These sidebars can be used to illustrate any possible grouping pertaining to your samples. For example, if the viruses in your data came from two different species, you could make each species clearly identifiable with a different color in the sidebar.
The colors in the sidebar may be assigned according to characters in the sample names, or by a grouping specified in a text file. The bar can be colored by a default palette (red - orange - yellow) based on frequency, or by user-selected colors (shown below) manually assigned to each group.
Red, Orange, Yellow, Yellow green, Green, Blue, Royal blue, Dark orchid, Deep pink, Hot pink, Light pink, Brown, Khaki, Lemon chiffon, Teal, Medium turquoise, Sky blue, Medium purple, Medium slate blue, Grey
The best way to understand this option is to try it. The following text can be used to add a horizontal color sidebar to the Sample Input. You can copy/paste the example below into the box for "paste grouped column labels".
Group1: virus_1 virus_2 virus_3 virus_4 virus_5 virus_6 virus_7 virus_8 virus_9 virus_10 virus_11
Group2: virus_12 virus_13 virus_14 virus_15 virus_16 virus_17
Group3: virus_18 virus_19 virus_20
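For readers who want to check a grouping before pasting it, the sketch below parses text in the format of the example into a mapping from group name to labels. The function name `parse_groups` is hypothetical, and the parsing rule (any token ending in a colon starts a new group) is inferred from the example rather than documented by the tool.

```python
GROUPS_TEXT = """\
Group1: virus_1 virus_2 virus_3 virus_4 virus_5 virus_6 virus_7 virus_8 virus_9 virus_10 virus_11
Group2: virus_12 virus_13 virus_14 virus_15 virus_16 virus_17
Group3: virus_18 virus_19 virus_20
"""


def parse_groups(text):
    """Return {group_name: [labels]} from 'Name: label label ...' text."""
    groups = {}
    current = None
    for token in text.split():
        if token.endswith(":"):           # a token like 'Group1:' opens a new group
            current = token[:-1]
            groups[current] = []
        elif current is None:
            raise ValueError(f"label {token!r} appears before any group name")
        else:
            groups[current].append(token)
    return groups


groups = parse_groups(GROUPS_TEXT)
# Reverse lookup: which group (and hence sidebar color) each label belongs to.
group_of = {label: name for name, labels in groups.items() for label in labels}
```

A reverse lookup such as `group_of` makes it easy to spot labels that are missing from the grouping or that appear in more than one group.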
This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.