creating a heatmap where the data has NaN values in it


I am trying to create a heatmap using the heatmap.2 function (from the gplots package). My data has lots of NaN values in it, and what I would like to do is the following: every time there is a NaN value, simply color the cell light grey (or some other neutral color, maybe white), and give all of the other values (which are log2 expression) a standard green/yellow/red coloring scheme. Here is the code that I have been using:

heatmap.2(as.matrix(foo2[rowSums (abs(foo2)) != 0,]),
          col = redgreen,
          margins = c(12, 22),
          trace = "none", 
          xlab = "Comparison",
          lhei = c(2, 8),
          scale = c("none"),
          symbreaks = min(foo2 = 0, na.rm = TRUE),
          na.color = "blue",
          cexRow = 0.5,
          cexCol = .7,
          main = "DE geness",
          Colv = F)

This works well when there are no NaN values, but when the data has NaNs, I get an error which says:

Error in hclustfun(distfun(x)) : 
  NA/NaN/Inf in foreign function call (arg 11)

Essentially, I would like to have it ignore the NaNs in the data. I am not sure how to handle this. Any help would be greatly appreciated.


There are 5 answers

kdauria On

I can't reproduce the problem. The code below works just fine. All of the NaN values are colored blue.

library(gplots)
mat = matrix( rnorm(25), 5, 5)
mat[c(1,6,8,11,15,20,22,24)] = NaN

heatmap.2( mat,
           col = colorpanel(100,"red","yellow","green"),
           margins = c(12, 22),
           trace = "none", 
           xlab = "Comparison",
           lhei = c(2, 8),
           scale = c("none"),
           symbreaks = min(mat, na.rm=TRUE),
           na.color="blue",
           cexRow = 0.5, cexCol = 0.7,
           main = "DE genes", 
           dendrogram = "row", 
           Colv = FALSE )


Kahina On

I am not an expert in coding at all, but I have been learning to make heatmaps in R and I kept getting the same error message for my NA data. It turned out the reason was that there were NA values in the first column of my data, and R did not like that at all. So I added an extra column filled with 1s and it worked! I hope someone will find this useful.

Kahina
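
A possible explanation for why the extra column helps, as a minimal sketch (the toy matrix and the column name pad are made up for illustration): dist() returns NA for a pair of rows only when the two rows share no column in which both are non-missing, and a constant column gives every pair of rows one shared column.

```r
# Toy matrix: the two rows have no column where both are non-NA.
mat <- matrix(c(NA, NA,  1,  2,
                 3,  4, NA, NA),
              nrow = 2, byrow = TRUE)

# With no shared column, dist() cannot compute a distance.
d_plain <- dist(mat)
print(anyNA(d_plain))   # TRUE

# Append a constant column (hypothetical name "pad"), as described above.
padded <- cbind(mat, pad = 1)

# Every pair of rows now shares the constant column.
d_padded <- dist(padded)
print(anyNA(d_padded))  # FALSE
```

Note that the constant column also shows up as an extra column in the heatmap, and that distances between such rows rest on very little shared data, so the resulting dendrogram should be read with caution.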

Jana Obajdin On

I apologise if this seems like I am oversimplifying, but I know I would appreciate a simplified post like this (since I am no expert in R). I found this the easiest method so far, and I'll show it with my data.

My data ranges from 0 to 114 in a data matrix with a lot of NA values, so what I did first was replace all NA values with -1 (below the range of my dataset):

library(magrittr)  # provides the %>% pipe
x <- mymatrix %>% replace(is.na(.), -1)

Then I set the breaks in heatmap.2(). If you want your NA values to be, let's say, "black" and the rest of the values to use a colour palette, build the breaks with seq(). Since my data ranges from 0 to 114, the sequence runs up to 114 in increments of 1, and -1 is put in front of it for the NAs. One caveat: the colour bins are closed on the right, so with breaks of (-1, 0, 1, 2, ...) a genuine 0 falls into the same first bin as -1 and is drawn black too. To keep the zeros out of that bin, the 0 break is replaced with -0.5, so the breaks look like (-1, -0.5, 1, 2, 3, ...). The first colour is "black" for the -1 values (the NAs), followed by 114 colours from the bluered palette for the remaining values.

library(gplots)  # heatmap.2() and bluered()
value_breaks <- seq(from = 1, to = 114, by = 1)
# 116 breaks -> 115 colours: "black" for [-1, -0.5], then 114 palette colours
heatmap.2(x, col = c("black", bluered(114)),
      trace = "none", density.info = "none", breaks = c(-1, -0.5, value_breaks))

I hope this is helpful!

0range On

Just a suggestion for a practical solution in addition to posdef's very instructive answer:

Since distfun is only used to determine the structure of the dendrogram, you can simply replace the NAs in the distance matrix with a value that is a bit higher than the maximum of the non-NA values.

For this, we need a new distance function (one that wraps the normal dist function and just replaces NAs):

dist_no_na <- function(mat) {
    edist <- dist(mat)
    edist[which(is.na(edist))] <- max(edist, na.rm=TRUE) * 1.1 
    return(edist)
}

and make use of this function in the heatmap.2 call:

heatmap.2(mat, ..., dendrogram = "row", Colv = FALSE, na.color = "black", distfun = dist_no_na)
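
To try the wrapper without drawing a heatmap, the sketch below applies it to a small made-up matrix with one all-NA row and passes the result to hclust directly. The matrix values and the seed are arbitrary assumptions; dist_no_na is repeated only so that the block runs on its own.

```r
# Wrapper from above, repeated so this block is self-contained.
dist_no_na <- function(mat) {
    edist <- dist(mat)
    edist[which(is.na(edist))] <- max(edist, na.rm = TRUE) * 1.1
    return(edist)
}

set.seed(1)                          # arbitrary seed, only for repeatability
mat <- matrix(rnorm(20), nrow = 5)   # made-up 5 x 4 data
mat[3, ] <- NA                       # one row with no usable values

d_raw   <- dist(mat)                 # distances involving row 3 are NA
d_fixed <- dist_no_na(mat)           # NAs replaced by 1.1 * largest distance

print(anyNA(d_raw))
print(anyNA(d_fixed))

# hclust accepts the patched distances.
hc <- hclust(d_fixed)
print(hc$height)
```

With the default complete linkage, the all-NA row should join the tree last, at a height of 1.1 times the largest computable distance.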

Properties

This is of course not a perfect solution. It assigns numerical distance values to pairs of vectors for which there is no basis on which a (Euclidean?) distance can be computed. However, it does have some desirable properties.

  1. The heatmap.2 function works :-)

  2. Rows that contain only NAs, for instance, are split from the main branch first (which reflects the issue at hand nicely).

  3. I am not entirely certain what effect it has to replace NA values that are caused by other properties of the matrix. posdef pointed out that there may be such NA values. In posdef's example, there are two rows for which there is no pair of non-NA entries in the same column, i.e. it is impossible to determine a Euclidean distance. In this case, it is probably still appropriate to reflect this as a particularly large distance, larger than all those that can be computed numerically.

I would not choose a replacement value much larger than the non-NA maximum. (The chosen value in the code above is 10% larger.) This would increase the distance of the split-off point of all-NA rows to the following split-off points (the relevant part of the dendrogram) and may make the relevant part of the dendrogram difficult to see.

posdef On

TL;DR: The issue is likely due to the distfun that heatmap.2 delegates to, not to the heatmap.2 function itself. The default dist function tries to calculate the distance between your data points, and if the distance calculation returns an NA, the clustering function cannot handle that.


The longer version:

I have recently experienced the same issue as the OP, and had to dig in quite a bit to understand why the problem wasn't reproducible for others.

The essential issue is as follows: by default, heatmap.2 uses hclust as its hclustfun parameter and dist as its distfun parameter. The error message clearly states that it is hclustfun (which in this case defaults to hclust) that does not like the NAs.
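
One way to confirm this outside of heatmap.2 is to run the two default steps by hand. This is a minimal sketch, assuming a made-up matrix with one all-NA row; the exact wording of the error (for example the argument number) may differ between R versions.

```r
set.seed(42)                         # arbitrary seed
mat <- matrix(rnorm(12), nrow = 4)   # made-up 4 x 3 data
mat[2, ] <- NA                       # row with no usable values

d <- dist(mat)                       # the default distance function
print(anyNA(d))                      # TRUE: pairs involving row 2 are NA

# Apply the default clustering function to those distances
# and capture the error instead of stopping.
res <- tryCatch(hclust(d), error = function(e) e)
print(inherits(res, "error"))
if (inherits(res, "error")) print(conditionMessage(res))
```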

The next bit of information is this: even though the data matrix includes NAs, the results of dist (which are passed to hclust) might be free of NAs, which is the case in @kdauria's answer. See below:

> library(gplots)
> mat = matrix( rnorm(25), 5, 5)
> mat[c(1,6,8,11,15,20,22,24)] = NaN
> 
> heatmap.2( mat,
+            col = colorpanel(100,"red","yellow","green"),
+            margins = c(12, 22),
+            trace = "none", 
+            xlab = "Comparison",
+            lhei = c(2, 8),
+            scale = c("none"),
+            symbreaks = min(mat, na.rm=TRUE),
+            na.color="blue",
+            cexRow = 0.5, cexCol = 0.7,
+            main = "DE genes", 
+            dendrogram = "row", 
+            Colv = FALSE )
> ?dist
> mat
           [,1]       [,2]        [,3]        [,4]       [,5]
[1,]        NaN        NaN         NaN -1.10103187 -1.4396185
[2,] -0.8821449  1.4891180  0.41956063 -0.06442867        NaN
[3,] -2.5912928        NaN -0.56603029 -0.55177559 -2.0313602
[4,]  0.8348197  0.2199583  0.06318663  1.59697764        NaN
[5,] -0.2632078 -1.2193110         NaN         NaN  0.8618543
> dist(mat)
         1        2        3        4
2 2.317915                           
3 1.276559 2.623637                  
4 6.032933 3.050821 5.283828         
5 5.146250 4.392798 5.871684 2.862324

The random-valued matrix does not reproduce the problem because dist never returns an NA for it. Which brings me to the question: what does it take to get NAs from dist?


My data had some outlying large values, which I thought were the reason; however, I only managed to reproduce the problem by adding a row of NAs:

> mat = matrix(rnorm(49), 7, 7)
> mat[c(3,17,28, 41)] = mat[c(3,17,28, 41)] * 100000
> mat
              [,1]        [,2]          [,3]          [,4]        [,5]          [,6]       [,7]
[1,] -6.175928e-01  1.68691561 -1.233250e+00 -7.355322e-01 -0.37392178  3.559804e-01  1.7536137
[2,]  6.680429e-01  0.90590237 -1.375424e+00  5.842512e-01 -0.09376548 -3.556098e-01 -1.2926535
[3,] -3.739372e+04 -1.74534887 -2.241643e+05 -2.209226e-01 -0.86769435 -4.590908e-01  1.6306854
[4,] -1.283405e+00  0.20698245  3.635557e-01  3.673208e-01 -0.12339047  1.119922e+00  0.4301094
[5,] -5.430687e-02 -0.75219479  2.609126e+00 -1.340564e-01  0.54016622  2.885021e-01  0.9237946
[6,] -8.395116e-01  0.03675002  2.455545e+00  4.432025e-02 -0.86194910  1.302758e+05  0.6062505
[7,]  1.817036e-01 -1.46137388 -1.853179e+00 -2.177306e+03  2.36763806 -2.273134e+00  1.2440088
> dist(mat)
             1            2            3            4            5            6
2 3.726858e+00                                                                 
3 2.272605e+05 2.272606e+05                                                    
4 2.966078e+00 3.537475e+00 2.272620e+05                                       
5 4.787577e+00 5.039154e+00 2.272644e+05 3.016614e+00                          
6 1.302754e+05 1.302762e+05 2.619559e+05 1.302747e+05 1.302755e+05             
7 2.176576e+03 2.177895e+03 2.272705e+05 2.177679e+03 2.177179e+03 1.302963e+05
> mat = rbind(mat[1:4, ], rep(NA,7), mat[5:6, ])
> mat
              [,1]        [,2]          [,3]        [,4]        [,5]          [,6]       [,7]
[1,] -6.175928e-01  1.68691561 -1.233250e+00 -0.73553223 -0.37392178  3.559804e-01  1.7536137
[2,]  6.680429e-01  0.90590237 -1.375424e+00  0.58425125 -0.09376548 -3.556098e-01 -1.2926535
[3,] -3.739372e+04 -1.74534887 -2.241643e+05 -0.22092261 -0.86769435 -4.590908e-01  1.6306854
[4,] -1.283405e+00  0.20698245  3.635557e-01  0.36732078 -0.12339047  1.119922e+00  0.4301094
[5,]            NA          NA            NA          NA          NA            NA         NA
[6,] -5.430687e-02 -0.75219479  2.609126e+00 -0.13405635  0.54016622  2.885021e-01  0.9237946
[7,] -8.395116e-01  0.03675002  2.455545e+00  0.04432025 -0.86194910  1.302758e+05  0.6062505
> dist(mat)
             1            2            3            4            5            6
2 3.726858e+00                                                                 
3 2.272605e+05 2.272606e+05                                                    
4 2.966078e+00 3.537475e+00 2.272620e+05                                       
5           NA           NA           NA           NA                          
6 4.787577e+00 5.039154e+00 2.272644e+05 3.016614e+00           NA             
7 1.302754e+05 1.302762e+05 2.619559e+05 1.302747e+05           NA 1.302755e+05
> heatmap.2( mat,
+            col = colorpanel(100,"red","yellow","green"),
+            margins = c(12, 22),
+            trace = "none", 
+            xlab = "Comparison",
+            lhei = c(2, 8),
+            scale = c("none"),
+            symbreaks = min(mat, na.rm=TRUE),
+            na.color="blue",
+            cexRow = 0.5, cexCol = 0.7,
+            main = "DE genes", 
+            dendrogram = "row", 
+            Colv = FALSE )
Error in hclustfun(distfun(x)) : 
  NA/NaN/Inf in foreign function call (arg 11)

However, the problem does not appear to be specific to the case where a row is composed entirely of NAs. For example:

> mat
              [,1]        [,2]          [,3]       [,4]       [,5]          [,6]       [,7]
[1,]           NaN         NaN           NaN        NaN         NA -7.531027e-01  0.2238252
[2,]  3.210084e-01 -1.55702840  2.777516e-01  0.2176875  1.3310334 -9.621561e-01        NaN
[3,]  1.159837e+05  0.04480172 -1.649482e+04        NaN  2.4748122  8.446133e-01 -0.4240776
[4,] -8.584051e-01         NaN           NaN  1.0557713 -1.0855826 -5.638023e-02 -0.3789979
[5,]            NA          NA -2.539003e-01 -0.4552776  0.3856384            NA         NA
[6,]           NaN  1.31986556           NaN -1.0393147 -1.9197183 -1.434064e+00  0.6334569
[7,]           NaN -0.42180912           NaN -0.8023476 -0.8264077  4.471358e+04  0.5046408
> dist(mat)
             1            2            3            4            5            6
2 5.531033e-01                                                                 
3 3.225471e+00 1.386143e+05                                                    
4 1.723619e+00 3.913983e+00 1.534332e+05                                       
5           NA 1.949799e+00 3.085851e+04 3.945524e+00                          
6 1.486699e+00 6.010961e+00 6.905415e+00 3.743585e+00 4.449179e+00             
7 8.365286e+04 5.915178e+04 5.914939e+04 5.915058e+04 2.358664e+00 5.290752e+04
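
In the last matrix, rows 1 and 5 have no column in which both are non-missing, which appears to be what produces the single NA in the distance output. A minimal sketch for locating such pairs, using made-up data and a hypothetical helper named na_pairs:

```r
# Hypothetical helper: list the row pairs for which dist() returns NA.
na_pairs <- function(mat) {
    d <- as.matrix(dist(mat))                  # full symmetric matrix
    # Keep each pair once by looking only above the diagonal.
    idx <- which(is.na(d) & upper.tri(d), arr.ind = TRUE)
    colnames(idx) <- c("row_a", "row_b")
    idx
}

# Made-up data: rows 1 and 3 share no non-NA column.
mat <- rbind(c(NA, NA,  1,  2),
             c( 1,  2,  3,  4),
             c( 5,  6, NA, NA))

pairs_found <- na_pairs(mat)
print(pairs_found)
```

Pairs found this way can then be dealt with before clustering, for example by dropping one of the rows or by supplying a custom distfun.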