Center for Computational Diagnostics, IU School of Medicine, Indianapolis, IN, USA

Abstract

Data visualization plays a critical role in interpreting the results of proteomic experiments. Heat maps are particularly useful for this task, as they allow us to find quantitative patterns across proteins and biological samples simultaneously. The quality of a heat map can be vastly improved by understanding the options available to display and organize the data in the heat map.

This tutorial illustrates how to optimize heat maps for proteomics data by incorporating known characteristics of the data into the image. First, the concepts used to guide the creation of heat maps are demonstrated. Then, these concepts are applied to two types of analysis: visualizing spectral features across biological samples, and presenting the results of tests of statistical significance. For all examples we provide details of computer code in the open-source statistical programming language R, which can be used by biologists and clinicians with little statistical background.

Heat maps are a useful tool for presenting quantitative proteomic data organized in a matrix format. Understanding and optimizing the parameters used to create the heat map can vastly improve both the appearance and the interpretation of heat map data.

Background

Heat maps are an efficient method of visualizing complex data sets organized as matrices. In a biological context, a typical matrix is created by arranging the data such that each column contains the data from a single sample and each row corresponds to a single feature (e.g. a spectrum, peptide, or protein).

Correlation and interaction matrices are also common.

A heat map performs two actions on a matrix. First, it reorders the rows and columns so that rows (and columns) with similar profiles are closer to one another, causing these profiles to be more visible to the eye. Second, each entry in the data matrix is displayed as a color, making it possible to view the patterns graphically. Multiple methods exist to accomplish these two tasks. The purpose of this tutorial is to demonstrate how these methods can be optimized for specific types of matrices. To accomplish this, we describe a few common methods in detail, and demonstrate how these methods are implemented in the open-source statistical programming language R.

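Heat map functions.

| Function | Package | Version |
| --- | --- | --- |
| heatmap | stats | R version 2.12.1 |
| heatmap.2 | gplots | 2.8.0 |
| heatmap.plus | heatmap.plus | 1.13 |
| heatmap_plus | Heatplus | 1.16.0 |
|  | Heatplus | 1.16.0 |

The heat map functions described in this tutorial.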

The file contains all the source code necessary to reproduce the figures in this tutorial.


**The file summarizes features available in each heat map function**.


A larger version of Figure


Heat map components

A heat map is the combination of two independent procedures applied to a data matrix. The first procedure reorders the columns and rows of the data in order to make patterns more visible to the eye. The second procedure translates a numerical matrix into a color image. Here, we use a series of illustrative examples to introduce the concepts from each procedure, and show how they impact the final heat map.
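The two procedures can be carried out by hand in base R, which makes their independence explicit. The following is a minimal sketch on a small simulated matrix; the data and object names are illustrative assumptions and are not taken from the tutorial's supplementary code.

```r
# Simulated 6 x 5 matrix: rows are features, columns are samples (assumed data)
set.seed(1)
x <- matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("feature", 1:6), paste0("sample", 1:5)))

# Procedure 1: reorder rows and columns by hierarchical clustering
row_order <- hclust(dist(x))$order
col_order <- hclust(dist(t(x)))$order
x_reordered <- x[row_order, col_order]

# Procedure 2: translate the reordered numbers into a color image
pdf(NULL)  # null graphics device, so the sketch runs without a display
image(t(x_reordered), col = heat.colors(12), axes = FALSE)
dev.off()
```

The reordering only permutes rows and columns; the values themselves are unchanged, so either step can be modified without affecting the other.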

Data reordering

Data reordering plays a critical role in demonstrating patterns in the data. The goal of data reordering is to place columns (or rows) with similar profiles near one another so that shared profiles become more visible. Most heat maps use an agglomerative hierarchical clustering algorithm to group the data, and display this information using a dendrogram. An agglomerative hierarchical clustering algorithm starts with each object in its own cluster and, at each step, merges the two most similar clusters until only a single cluster remains.

A dendrogram is a common method of graphically displaying the output of hierarchical clustering. At the bottom, each line corresponds to each object (clusters of size 1). When two clusters are merged, a line is drawn connecting the two clusters at a height corresponding to how similar the clusters are. The order of the objects is chosen to ensure that at the point where two clusters are merged, no other clusters are between them, but this ordering is not unique. When two clusters are merged, the choice of which cluster is on the left and which is on the right is arbitrary.
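As a small illustration of merge heights and of the non-unique leaf order, the sketch below clusters nine one-dimensional values with base R and then flips the dendrogram. The values and names are illustrative assumptions.

```r
# Nine one-dimensional observations (illustrative values)
obs <- c(a = 1, b = 3, c = 3, d = 4, e = 5, f = 9, g = 15, h = 16, i = 16)
hc <- hclust(dist(obs))          # complete linkage is hclust's default

hc$height                        # one merge height per merge (n - 1 in total)
hc$order                         # left-to-right order of the leaves

# Reversing the branches gives a different but equally valid leaf order
dend <- as.dendrogram(hc)
flipped <- rev(dend)

pdf(NULL)                        # null graphics device
plot(dend)
plot(flipped)
dev.off()
```

Both plots describe the same sequence of merges at the same heights; only the left-to-right arrangement differs.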

Inherent to this procedure is the ability to measure the similarity between clusters, that is, to represent similarity with a measurement of distance. In fact, many hierarchical clustering algorithms only look at distances between data points, never at the original data. Two types of distance measurements are important: the distance between individual observations (distance), and the distance between two clusters of observations (agglomeration).

Distance

A distance metric is a non-negative number which measures the difference between two objects. A value of 0 denotes no difference, with higher values corresponding to larger differences. The most common measure of distance calculates the difference in location, with 0 indicating that the two objects are at the same location. This is known as Euclidean distance, and is the default for all heat map functions.
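For concreteness, Euclidean distance is the square root of the summed squared differences between two profiles. The sketch below computes it by hand and with `dist()` from base R; the two vectors are made-up values.

```r
a <- c(1, 2, 3, 4)               # illustrative profile
b <- c(2, 4, 6, 8)               # illustrative profile

by_hand <- sqrt(sum((a - b)^2))  # square root of the summed squared differences
by_dist <- dist(rbind(a, b))     # dist() uses Euclidean distance by default

by_hand
by_dist
```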

For biological data, the most dominant variation in the data often occurs across the features (rows) of the data matrix. Normally, these differences are not interesting, especially in LC-MS/MS data where the intensity of a protein or peptide may be due to many different causes. Rather, it is the changes in protein (or peptide) concentration across the spectra that are of interest. Using Euclidean distance, this variability can cause features with similar profiles to be treated as more distant than features with different profiles but similar mean intensities (Figure 1).


**Distance Measures**. A simulated example of distance measurements using 4 measurements on 8 samples. On the original scale, measurements in A and C are closest in location, while A and B are the most correlated. On the standardized scale, correlation distance does not change, but measurements A and B are now very similar in location. Note that D has a large negative correlation with A and B, so its correlation distance (using **Equation 2**) is low.

One solution is to use a distance metric based on the correlation between profiles instead of change in location. Correlation measures the degree to which two variables increase and decrease together, with a range of [-1,1]. More extreme values demonstrate a strong relationship and values close to 0 indicate a weaker (or non-existent) relationship. The sign indicates whether the two variables increase together (positive) or one increases when the other decreases (negative). However, by default this is not a distance metric because it includes negative values, and increasingly similar patterns are represented by values further from zero instead of closer to 0. Two different methods are used to convert correlations to correlation distances.

Using **Equation 2**, if two variables have a strong relationship, they have a closer distance, regardless of whether they are both up-regulated together, or if one is up-regulated when the other is down-regulated. In **Equation 3**, two variables must have a strong positive relationship to have a close distance: they must both be up-regulated together. Both definitions are useful in proteomic data sets where the actual measurements are not important, but the change in measurement from spectrum to spectrum is. Correlation distance captures whether the profile of up-regulation and down-regulation across spectra is the same for two proteins.
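The two conversions can be sketched in base R. The forms used here, 1 - |r| for the sign-insensitive distance and 1 - r for the sign-sensitive one, are common choices that match the behavior described above; they are an assumption and may differ from the exact definitions in Equations 2 and 3 (for example by a scaling constant). The data are made up.

```r
# Rows are features, columns are samples (illustrative values)
x <- rbind(A = c(1, 2, 3, 4, 5),
           B = c(10, 20, 30, 40, 50),   # same profile as A, different scale
           D = c(5, 4, 3, 2, 1))        # mirror image of A
r <- cor(t(x))                          # correlations between rows

# Assumed form 1: a strong relationship of either sign counts as "close"
d_abs <- as.dist(1 - abs(r))
# Assumed form 2: only a strong positive relationship counts as "close"
d_signed <- as.dist(1 - r)

d_abs
d_signed
```

With the first form, A and D are at distance 0 even though they move in opposite directions; with the second form they are as far apart as possible.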

The second strategy for handling the variability across features is to standardize each row (feature) so that it has a mean of 0 and a standard deviation of 1. This removes systematic differences between different features with the same profile, so that proteins with the same profile have a small Euclidean distance. This strategy is illustrated by Figure 1.
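A minimal sketch of row standardization in base R follows. Because `scale()` works on columns, the matrix is transposed before and after. The three rows are illustrative values, not the data behind the figure.

```r
x <- rbind(A = c(1, 2, 3, 4, 5),
           B = c(10, 20, 30, 40, 50),   # same profile as A, larger intensities
           C = c(3, 3, 2, 3, 4))        # different profile, similar intensities to A

d_raw <- dist(x)                        # on the original scale, A is closer to C than to B
x_std <- t(scale(t(x)))                 # each row now has mean 0 and standard deviation 1
d_std <- dist(x_std)                    # after standardization, A and B coincide

d_raw
d_std
```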

The default distance function used by all heat map implementations is Euclidean distance, and can be modified using the

Agglomeration

Agglomeration is the process by which clusters are merged into larger clusters and, more importantly, the process of determining which clusters should be merged. Unfortunately, measuring the distance between clusters is more complicated than measuring the distance between objects. Agglomeration methods must be compatible with the distance metric, because it is possible to merge two objects, two clusters, or a cluster and an object at most stages in the algorithm. It also must produce consistent results: the height of a cluster should never be smaller than the heights of the two clusters which were merged to create it. Several algorithms have been developed which meet these properties, of which two are especially common.

The default agglomeration method used by the heat map functions is called complete linkage. For two clusters, X and Y, it is calculated as

$$d(X, Y) = \max_{i,j} \, d(x_i, y_j)$$

for all $x_i \in X$ and $y_j \in Y$.
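Complete linkage can be computed directly as the largest pairwise distance between members of the two clusters. The sketch below uses the one-dimensional values quoted in the Agglomeration Methods caption; the helper name `complete_linkage` is hypothetical.

```r
# Hypothetical helper: largest pairwise distance between members of X and Y
complete_linkage <- function(X, Y) max(abs(outer(X, Y, "-")))

low  <- c(1, 3, 3, 4, 5)
high <- c(15, 16, 16)

complete_linkage(9, high)    # 7: distance from {9} to the larger values
complete_linkage(9, low)     # 8: distance from {9} to the smaller values
complete_linkage(low, high)  # 15
```

Because 7 is the smallest of the three distances, complete linkage merges 9 with the larger values.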


**Agglomeration Methods**. An illustration of the difference between (**a**) complete linkage and (**b**) the Ward method of agglomeration. When merging 3 groups into 2 using complete linkage, 16 - 9 = 7 < 9 - 1 = 8, so 9 is grouped with the larger numbers. Using the Ward method, the groups {1, 3, 3, 4, 5}, {9, 15, 16, 16} produce a total within-cluster sum of squares of 42.8, while {1, 3, 3, 4, 5, 9}, {15, 16, 16} produce 37.5, so the second merge strategy is used.

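The Ward method merges the pair of clusters whose union gives the smallest total within-cluster sum of squares,

$$\sum_{j} \sum_{x_i \in C_j} (x_i - \bar{x}_j)^2$$

where $\bar{x}_j$ is the mean of cluster $C_j$.

As an example, consider the three clusters {1, 3, 3, 4, 5}, {9} and {15, 16, 16} shown in the Agglomeration Methods figure.

There are three ways to take these 3 clusters and merge them into 2: merge {9} with the larger values, merge {9} with the smaller values, or merge the smaller and larger values and leave {9} on its own.

Calculating the sum of squares for the first two merges gives 42.8 and 37.5, respectively.

The second merge produces a smaller sum of squares, so the Ward method groups 9 with the smaller values.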
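The values 42.8 and 37.5 quoted in the Agglomeration Methods caption can be reproduced as total within-cluster sums of squared deviations from each cluster mean. The helper name `within_ss` is hypothetical.

```r
# Hypothetical helper: total within-cluster sum of squared deviations
within_ss <- function(...) {
  sum(vapply(list(...), function(cl) sum((cl - mean(cl))^2), numeric(1)))
}

low  <- c(1, 3, 3, 4, 5)
mid  <- 9
high <- c(15, 16, 16)

within_ss(low, c(mid, high))   # 42.8:    9 merged with the larger values
within_ss(c(low, mid), high)   # 37.5:    9 merged with the smaller values
within_ss(c(low, high), mid)   # 300.875: smaller and larger values merged
```

The smallest total is 37.5, so under this criterion 9 joins the smaller values, the opposite of the complete linkage result.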

The purpose of reordering the data is to cluster rows and columns with similar profiles so that patterns among the features and spectra can be easily observed. The most important consideration in this process is ensuring that the distances efficiently measure the similarity across spectra in biologically meaningful ways, i.e. without being influenced by systematic differences in features caused by technical aspects of detection via mass spectrometry. This can be accomplished by standardizing the data or using correlation distance. A good agglomeration method will cause patterns to be easily discerned across features and spectra. The same method may not be ideal for all data sets, so it is important to explore several to see what works best. All heat map functions default to using the

This function is specified using the
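As one way to change both choices at once, the base R `heatmap` function accepts a distance function through its `distfun` argument and a clustering function through its `hclustfun` argument. The sketch below is an assumption-laden example: it uses simulated data, 1 - r as the correlation distance, and `"ward.D2"`, one of the Ward variants offered by `hclust` in current versions of R (the method name differs in older versions).

```r
set.seed(2)
x <- matrix(rnorm(60), nrow = 10)     # simulated 10 features x 6 samples

# heatmap() calls distfun(x) for rows and distfun(t(x)) for columns,
# so the function must return distances between the rows of its argument
cor_dist <- function(m) as.dist(1 - cor(t(m)))          # assumed distance form
ward     <- function(d) hclust(d, method = "ward.D2")   # assumed Ward variant

pdf(NULL)                             # null graphics device
hm <- heatmap(x, distfun = cor_dist, hclustfun = ward)
dev.off()

hm$rowInd                             # row order used in the image
hm$colInd                             # column order used in the image
```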

Image representation

Image representation is the process of mapping the intensity range of the data to a color palette. A mapping assigns a specific range of values to a particular color; for example, suppose we map all numbers in the range (5,8) to green. Mappings are constant across the entire data set: any value between 5 and 8 in any column or row of the data matrix is mapped to green. Similar to the problem with distance calculations, a mapping which uses the original data is likely to be dominated by differences in the range of each feature. In the color mapping figure, this makes it hard to see on the original scale that rows **A** and **B** have a similar pattern across samples. Unless the actual numerical values in the data matrix have an explicit meaning, row scaling is usually advisable, and the heat map functions typically do this by default.
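Row scaling is simple enough to do by hand, which helps clarify what the color mapping actually sees. The sketch below uses two made-up rows that share a profile but differ in intensity by a factor of 100.

```r
x <- rbind(A = c(1, 2, 3, 4),
           B = c(100, 200, 300, 400))   # same profile as A, 100 times larger

# Subtract each row's mean, then divide by each row's standard deviation
x_scaled <- (x - rowMeans(x)) / apply(x, 1, sd)
x_scaled                                # rows A and B are now identical
```

In the base R `heatmap` function this step is controlled by the `scale` argument, which accepts `"row"`, `"column"` or `"none"`.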


**Color mapping**. Heat maps produced using the simulated data from Figure 1 and Euclidean distance. In **(a)**, the colors are mapped to the original data, in **(b) **the colors are mapped to row-scaled data. Using row-scaled data, it is much easier to see that the patterns in A and B are the same. Using the original data, the differences in intensity between each row dominate the image. By default, data is row-scaled.

Color mapping

Once any scaling has been performed, color mapping assigns breaks to the data range. Breaks are the transition points between one color in the palette and the next. By default, the data range is divided into

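In the presence of outliers, equally spaced bins are often inefficient, as seen in panel (a) of the Breaks figure. Alternatives include choosing breaks at percentiles of the data, so that each bin holds roughly the same number of points, or placing the most extreme values in a single bin and spacing the remaining breaks evenly.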


**Breaks**. Breaks are assigned to 1000 randomly generated values. In **(a)**, the breaks are spaced evenly across the entire data range. In **(b)**, breaks are chosen to ensure that roughly the same number of data points fall within each bin. In **(c)**, the top 1% of the data is placed in a single bin while the remainder is placed in equally spaced bins.
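The three break strategies described in the caption can be sketched in base R. The exponential distribution, the number of bins and the object names are assumptions chosen only to produce skewed data with a long right tail.

```r
set.seed(3)
v <- rexp(1000)                 # 1000 skewed values (assumed distribution)
n_bins <- 10

# (a) equally spaced breaks across the whole data range
equal_breaks <- seq(min(v), max(v), length.out = n_bins + 1)

# (b) breaks at quantiles, so each bin holds about the same number of points
quantile_breaks <- quantile(v, probs = seq(0, 1, length.out = n_bins + 1))

# (c) the top 1% in a single bin, the rest in equally spaced bins
cut99 <- quantile(v, 0.99)
top_breaks <- c(seq(min(v), cut99, length.out = n_bins), max(v))

counts_equal    <- table(cut(v, equal_breaks, include.lowest = TRUE))
counts_quantile <- table(cut(v, quantile_breaks, include.lowest = TRUE))
counts_top      <- table(cut(v, top_breaks, include.lowest = TRUE))
counts_equal
counts_quantile
counts_top
```

With equally spaced breaks, most of the data falls in the first few bins, so most of the palette is spent on a handful of extreme values.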

Color mapping is controlled by the

Color palette

The color palette is the set of colors used to represent the values of the data matrix. This is normally chosen to gradually shift from one color representing low values to a second color representing high values, sometimes by way of a third color representing intermediate values. In the default scheme, low values are represented by red and high values are represented by yellow using the


**Color palettes**. A selection of pre-defined color palettes available in R.
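Palettes in R are simply character vectors of colors. The sketch below shows one built-in palette and one custom three-color palette made with `colorRampPalette` from base R; the choice of 11 colors is arbitrary.

```r
heat.colors(5)                      # a built-in palette, returned as color codes

# A custom palette that passes through white between blue and red
bwr <- colorRampPalette(c("blue", "white", "red"))(11)
bwr
bwr[6]                              # with 11 colors, the middle entry is white
```

An odd number of colors keeps a single neutral color in the middle of a three-color palette.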

Several packages in R, including

While the choice of color palette is largely personal preference, two considerations are worth mentioning. First, although the green-black-red color scheme is extremely common due to its relation to red/green channels in microarray experiments, it cannot be interpreted by the color blind. For this reason, it should be avoided. Second, dark colors can be harder to distinguish from one another compared to light colors, as evidenced by comparing the

The

Extras

Although the basic components discussed above are shared by all heat map functions, the implementation of other features varies significantly. While it is beyond the scope of this tutorial to provide an in-depth review of all the features of all the functions, three features in particular require mentioning.

Color key

A color key is used to show the map between the data range (after scaling) and the colors. For any matrix, this can be useful in demonstrating which color(s) represent smaller values and which represent larger values. It becomes far more important when the data values have an explicit meaning. For example, a correlation matrix contains values between -1 and 1. When all the correlations contained in the matrix are greater than 0, using a blue-white-red color scheme without careful consideration of the breaks could produce a misleading visualization. By including a color key, such a mistake can be caught and corrected.
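The correlation example can be made concrete by comparing breaks fixed to the meaningful range with breaks taken from the observed range. The sketch below uses simulated, strongly positive correlations; the data and names are illustrative assumptions.

```r
set.seed(4)
# Five variables sharing a strong common component, so all correlations are positive
shared <- 3 * rnorm(40)
r <- cor(matrix(rnorm(200), ncol = 5) + shared)

pal <- colorRampPalette(c("blue", "white", "red"))(20)

# Breaks fixed to [-1, 1]: the white region always sits at a correlation of 0
fixed_breaks <- seq(-1, 1, length.out = length(pal) + 1)
# Breaks from the observed range: white sits mid-range, which here is far from 0
data_breaks <- seq(min(r), max(r), length.out = length(pal) + 1)

pdf(NULL)                              # null graphics device
image(r, col = pal, breaks = fixed_breaks)
dev.off()
```

With `data_breaks`, the smallest positive correlations would be drawn in blue and could be misread as negative.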

The functions

Group labels

Group labels provide the final piece of information for a heat map. They allow us to incorporate known group memberships (e.g. the disease group or gender for a sample, the protein membership of a peptide, or the annotation of a protein) into the heat map picture. This information can be used to determine (1) whether groups of samples or features with the same group membership tend to cluster together and (2) if subgroups of samples or features with the same group membership have a distinct profile.

Each heat map function has unique methods available for displaying group information, so it is useful to demonstrate each function separately. Consider a simulated data set with 24 samples and 10 features. The 24 samples are each associated with two different grouping variables: one which takes on 2 values (e.g. male/female) and one with 3 values (e.g. 3 disease groups). For the purpose of illustration, 5 features are associated with each variable.

The heatmap and heatmap.2 functions each display a single color bar, as shown in panels **(a)** and **(b)** of the Group labels figure. To display more than two groups on the same heat map, the


**Group labels**. Rows and columns can be labeled using all heat map functions, but the implementation varies. The heatmap (a) and heatmap.2 (b) functions are limited to displaying a single color bar. The heatmap.plus (c) function can display a matrix of color bars. The heatmap_plus (d) function can display a data frame of binary variables. While heatmap.2 and heatmap.plus can produce a rectangular image, both heatmap and heatmap_plus produce a square image as output.
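A single color bar can be added in the base R `heatmap` function through its `ColSideColors` argument, which takes one color per column. The sketch below simulates 24 samples and 10 features with a two-level grouping; the data, group labels and colors are illustrative assumptions rather than the simulated data set used for the figure.

```r
set.seed(5)
x <- matrix(rnorm(240), nrow = 10, ncol = 24,
            dimnames = list(paste0("f", 1:10), paste0("s", 1:24)))
sex <- rep(c("F", "M"), each = 12)            # hypothetical two-level grouping
x[1:5, sex == "M"] <- x[1:5, sex == "M"] + 2  # five features tied to the grouping

# One color per column, given in the original column order
side_cols <- ifelse(sex == "F", "purple", "orange")

pdf(NULL)                                     # null graphics device
hm <- heatmap(x, ColSideColors = side_cols)
dev.off()

sex[hm$colInd]                                # group labels in displayed order
```

The colors are supplied in the original column order; the function rearranges them together with the columns.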

The

Layout

The image layout determines the amount of space in the graphics window devoted to the dendrograms, group labels, image matrix, color key (if applicable) and margins. The

Example

To demonstrate the use of heat maps in an analysis, we will use the Prostate2000Peaks data set.

Data preparation

Before any analysis can begin, the data must be formatted appropriately. At the very least, the data must be organized into a

Excluding the zeros, the range of the data suggests working on the log_{2} scale. Since log_{2}(0) = -∞, the zeros must be removed or replaced with a different value before applying the log transformation. Because the data reordering and color mapping steps are performed independently, different strategies can be used for each component of the heat map. Initially, we replace each zero with an

The distance function determines how robust the heat map function is to missing data. At a minimum, it is necessary to have at least one observed value in common for two samples or features to calculate a distance. In the

Example 1: Simultaneous clustering of samples and features

In the

While entire books have been written on the subject of missing data (e.g.,

Missing data affects row standardization because the center and standard deviation of the data are determined by the observed data. When the smallest values in the row are unobserved, these estimates are biased. In particular, the calculated standard deviation is likely to underestimate the true standard deviation, which magnifies differences in the observed data (after standardization). Alternatively, replacing the missing data with a low value can have a noticeable impact on the resulting heat map. Choosing a replacement value without understanding how the data was quantified will lead to bias. In this data set, the data is standardized while treating unobserved values as missing.
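Treating unobserved values as missing during the log transformation and row standardization can be sketched as follows. The small matrix is made up, and zeros are assumed to mark unobserved values.

```r
# Illustrative intensities; zeros are assumed to mark unobserved values
x <- rbind(c(0.5, 0,    2, 8),
           c(0,   0.25, 1, 4),
           c(16,  4,    1, 0))

x[x == 0] <- NA                      # treat zeros as missing
lx <- log2(x)                        # NA stays NA, so no -Inf is produced

# Row standardization using only the observed values in each row
row_mean <- rowMeans(lx, na.rm = TRUE)
row_sd   <- apply(lx, 1, sd, na.rm = TRUE)
lx_std   <- (lx - row_mean) / row_sd
lx_std
```

Each row's mean and standard deviation are estimated from the observed values alone, which is the source of the bias described above when the lowest values are the ones that are missing.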

To measure the distance between two objects, the two objects must have at least 1 measurement in common. So to measure the distance between two samples, there must be one feature in which both samples have non-missing data, and two features must have one sample on which both are measured. In this particular data set, this restriction would greatly reduce the number of features we could display. Because this would greatly diminish the value of the heat map, because we know that the unobserved values are low, and because we are primarily interested in visualization (and not a statistical test), we can justify the decision to impute the missing data to calculate the distance between rows. The missing values are replaced with a value of -10 on the log_{2} scale, which corresponds to approximately 0.001 on the original scale (much lower than the minimum observed value of 0.006). Both dendrograms are created independently of the heat map using correlation distance and the Ward method of agglomeration. Creating them independently allows us to use the data matrix with missing values in the call to
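The idea of clustering on imputed data while drawing the matrix with its missing values can be sketched with the base R `heatmap` function, which accepts precomputed dendrograms through `Rowv` and `Colv`. The simulated data, the 1 - r correlation distance and the `"ward.D2"` method are assumptions; the replacement value of -10 follows the text.

```r
set.seed(6)
lx <- matrix(rnorm(80, mean = 2), nrow = 10)   # simulated log2 intensities
lx[cbind(1:8, 1:8)] <- NA                      # a few unobserved values

imputed <- lx
imputed[is.na(imputed)] <- -10                 # low value, used for clustering only

cor_dist <- function(m) as.dist(1 - cor(t(m))) # assumed correlation distance
row_dend <- as.dendrogram(hclust(cor_dist(imputed),    method = "ward.D2"))
col_dend <- as.dendrogram(hclust(cor_dist(t(imputed)), method = "ward.D2"))

pdf(NULL)                                      # null graphics device
# The matrix with missing values is drawn; the dendrograms fix the ordering
hm <- heatmap(lx, Rowv = row_dend, Colv = col_dend, na.rm = TRUE)
dev.off()
```

The imputed matrix never reaches the image, so the replacement value affects only the ordering of rows and columns, not the colors.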

For this heat map, gray is chosen to represent missing values. The breaks are specified to group the top and bottom 0.2% of the data into separate bins, with the rest of the data placed into equally spaced bins. The final heat map is shown in Figure


**Features and samples in the Prostate2000Peaks data set**. The spectral features (rows) and samples (columns) from the

Interpretation

We focus on interpreting the behavior of the samples in Figure

Example 2: Presenting significance results

Most statistical analyses involve one or more tests of statistical significance. In a mass spectrometry data set, the same test is usually performed separately for each feature, or on the group of spectra coming from the same protein. When multiple tests are performed on each protein or feature, a heat map can be used to organize and display the results. For the

The most important component of presenting significance results is the map between the matrices and the color palette. Specifically, a 3-color palette (e.g. the blue-white-red palette used here) should have a natural interpretation: white should indicate non-significance, while progressively more saturated values of blue and red should indicate increased significance, with the color dependent upon the sign of the


**Pitfalls in using t-statistic based breaks**. Three possible problems when deriving a color scale based on

that is, taking the log of the

Typically, a

This color scale is independent of the distribution of the

**p-value based breaks**. A color scale based on

Creating this type of heat map requires the following procedure:

1. Calculate the

2. Calculate the matrix of displayed values using Eq.13.

3. Calculate the inner breaks by applying Eq.13 to the p-values above.

4. Calculate the minimum break as the minimum display value or -7, whichever is smaller. Calculate the maximum break as the maximum display value or 8, whichever is larger. Put all of these values into a single monotonically increasing breaks vector.

5. Create the heat map. Note that the option

Interpretation

We see the significance results for the


**Significance results for the Prostate2000Peaks data set**. A heat map is used to display the significance results when performing pairwise comparisons between the disease groups, using the color key in Figure 9.

Competing interests

The author declares no competing interests.

Acknowledgements

I would like to thank Olga Vitek for her guidance and support, without whom this tutorial would not have been possible.

This article has been published as part of