apply prepends space for logical


Having a strange issue here with apply and R 3.0.1.

I have a huge dataframe with text, numbers and logical values. The logical values are converted to chr when I use apply, but because R allows something like TRUE == "TRUE" that isn't a problem.

But apply seems to prepend a space to some logical values, and TRUE == " TRUE" is no longer TRUE (it evaluates to FALSE, and as.logical(" TRUE") returns NA). Of course, I could do

sapply(cuelist[,4],FUN=function(logicalvalue) as.logical(sub("^ +", "", logicalvalue)))

but that isn't nice and I still don't know why R does that.

df <- data.frame(test=c("a","b","<",">"),logi=c(TRUE,FALSE,FALSE,TRUE))
apply(df, MARGIN=1, function(listelement) print(listelement) )

Interestingly, in this example the spaces only appear at [2,1] and [2,4] of the result.

> version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          0.1
year           2013
month          05
day            16
svn rev        62743
language       R
version.string R version 3.0.1 (2013-05-16)
nickname       Good Sport

Edit: same behaviour on R version 2.15.0 (2012-03-30)

Edit2: My dataframe looks like this

> df
  test  logi
1    a FALSE
2    b FALSE
3    <  TRUE
4    >  TRUE

> str(df)
'data.frame':   4 obs. of  2 variables:
 $ test: Factor w/ 4 levels "<",">","a","b": 3 4 1 2
 $ logi: logi  FALSE FALSE TRUE TRUE

There are 2 answers

A5C1D2H2I1M1N2O1R2T1 (best answer)

In a way, the problem is with apply, but more precisely, the problem is with as.matrix and how it handles logical values.

Here are a few examples to help elaborate on the query I had for Karl.

First, let's create four data.frames to do some tests on.

  1. Your original data.frame, to demonstrate the behavior.
  2. A data.frame with a varying number of characters in the "test" column, to look into Karl's explanation of what's going on.
  3. A data.frame with some numbers, to help us start to understand what actually seems to be going on.
  4. A data.frame where your "logi" column is explicitly created with as.character.

df1 <- data.frame(test = c("a","b","<",">"),
                  logi = c(TRUE,FALSE,FALSE,TRUE))
df2 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = c(TRUE,FALSE,FALSE,TRUE))
df3 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = c(TRUE,FALSE,FALSE,TRUE),
                  num = c(1, 12, 123, 2))
df4 <- data.frame(test = c("aa","b","<",">>"), 
                  logi = as.character(c(TRUE,FALSE,FALSE,TRUE)))

Now, let's use as.matrix on each of them.

This has a space before TRUE.

as.matrix(df1)
#      test logi   
# [1,] "a"  " TRUE"
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  " TRUE"

This has a space before TRUE, but the "test" column remains unaffected. Hmm.

as.matrix(df2)
#      test logi   
# [1,] "aa" " TRUE"
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">>" " TRUE"

Ahh... This has a space before TRUE and spaces before the shorter numbers. So it seems that R pads every value in a non-character column to the width of the widest value in that column: "TRUE" is padded to the five characters of "FALSE", and "1" to the three characters of "123". Again, the first "test" column remains unaffected.

as.matrix(df3)
#      test logi    num  
# [1,] "aa" " TRUE" "  1"
# [2,] "b"  "FALSE" " 12"
# [3,] "<"  "FALSE" "123"
# [4,] ">>" " TRUE" "  2"

Things seem fine here once you tell R that the logi column is a character column.

as.matrix(df4)
#      test logi   
# [1,] "aa" "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">>" "TRUE" 

For what it's worth, sapply doesn't seem to have that problem.

sapply(df1, as.matrix)
#      test logi   
# [1,] "a"  "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  "TRUE" 

Update

In the R Public chat room, Joshua Ulrich points to format being the culprit. as.matrix uses as.vector for factors, which converts them to character (try str(as.vector(df1$test)) to see what I mean). For the other non-character columns it uses format, but unfortunately it offers no way to pass along any of format's arguments, one of which is trim (set to FALSE by default).

Compare the following:

A <- c(TRUE, FALSE)

format(A)
# [1] " TRUE" "FALSE"
format(A, trim = TRUE)
# [1] "TRUE"  "FALSE"
format(as.character(A))
# [1] "TRUE " "FALSE"
format(as.factor(A))
# [1] "TRUE " "FALSE"
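
To tie this back to as.matrix, here is a minimal sketch that mimics the per-column conversion described above. The helper name pad_like_as_matrix is made up for illustration and is not part of R; it only approximates what as.matrix does with the columns of a mixed-type data.frame.

```r
# Hypothetical helper (not part of base R): converts one column roughly the
# way the update above describes as.matrix doing it.
pad_like_as_matrix <- function(x) {
  if (is.character(x)) return(x)          # character columns are left alone
  if (is.factor(x)) return(as.vector(x))  # factors become plain character
  format(x)                               # everything else is padded by format()
}

logi <- c(TRUE, FALSE, FALSE, TRUE)
num  <- c(1, 12, 123, 2)

pad_like_as_matrix(logi)
# [1] " TRUE" "FALSE" "FALSE" " TRUE"
pad_like_as_matrix(num)
# [1] "  1" " 12" "123" "  2"
pad_like_as_matrix(factor(c("aa", "b")))
# [1] "aa" "b"

# With trim = TRUE the padding disappears.
format(num, trim = TRUE)
# [1] "1"   "12"  "123" "2"
```

This matches the as.matrix(df3) output shown earlier: logicals and numbers are padded to a common width, while the factor column is not.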

So, how can you easily convert the logical columns to character before apply or as.matrix gets to them? Maybe something like this (though I would suggest creating a backup of your data first):

df1[sapply(df1, is.logical)] <- lapply(df1[sapply(df1, is.logical)], as.character)
df1
#   test  logi
# 1    a  TRUE
# 2    b FALSE
# 3    < FALSE
# 4    >  TRUE
as.matrix(df1)
#      test logi   
# [1,] "a"  "TRUE" 
# [2,] "b"  "FALSE"
# [3,] "<"  "FALSE"
# [4,] ">"  "TRUE" 
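
If converting the columns up front is not convenient, another option is to strip the padding after as.matrix has run. This is only a sketch: it assumes R 3.2.0 or later, where trimws is available in base R (on older versions, such as the asker's 3.0.1, sub("^ +", "", m) removes the leading spaces in the same way), and df_pad is just an illustrative name.

```r
# Illustrative data, same shape as the question's example.
df_pad <- data.frame(test = c("a", "b", "<", ">"),
                     logi = c(TRUE, FALSE, FALSE, TRUE))

m <- as.matrix(df_pad)
m[] <- trimws(m)   # assigning into m[] keeps the dimensions and dimnames

m[, "logi"]
# [1] "TRUE"  "FALSE" "FALSE" "TRUE"
as.logical(m[, "logi"])
# [1]  TRUE FALSE FALSE  TRUE
```

Note that this also strips leading and trailing whitespace from genuine text values, so it is only safe when that whitespace carries no meaning.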
Karl Forner

It is definitely due to apply, which converts the data frame to a matrix so that all elements have the same type. Here that type is character, so the logicals are converted to it. TRUE gets converted to " TRUE" to match the number of characters of "FALSE":

"FALSE"
" TRUE"

To convince yourself:

as.matrix(df)

Instead, you could use one of the a*ply functions from the plyr package, e.g.

a_ply(df, 1, print)
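
A base R alternative, sketched here as an assumption rather than something either answer proposes, is to iterate over row indices (or over the columns in parallel) so the data frame is never converted to a character matrix and the logical column keeps its type. The stringsAsFactors = FALSE argument and the names kept and flags are choices made for this illustration only.

```r
df <- data.frame(test = c("a", "b", "<", ">"),
                 logi = c(TRUE, FALSE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# Loop over row indices: df[i, ] is a one-row data.frame, so as.matrix()
# is never called and logi is still a logical.
kept <- character(0)
for (i in seq_len(nrow(df))) {
  row <- df[i, ]
  if (row$logi) kept <- c(kept, row$test)
}
kept
# [1] "a" ">"

# Or walk the columns in parallel; logi arrives as a real logical each time.
flags <- mapply(function(test, logi) is.logical(logi) && logi,
                df$test, df$logi, USE.NAMES = FALSE)
flags
# [1]  TRUE FALSE FALSE  TRUE
```

Row-by-row loops are slower than vectorised code on a huge data frame, so this is best kept for cases where the per-row function really needs values of several types.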