I've created a data frame from some climate data, and want to analyse it by three variables, each a factor with three levels. I was having trouble graphing one of the variables, so I worked backwards to find the error, and found the following:
> levels(batch3$site)
[1] " Site_1" " Site_2" " Site_3"
> levels(batch3$RCP)
[1] "00" "26" "85"
> levels(batch3$time)
[1] "baseline " "baseline "
[3] "baseline " "end "
[5] "end " "end "
[7] "mid " "mid "
[9] "mid "
As you can see, each level of the third factor is for some reason split into three. I've tried converting the columns back to character and trimming any empty spaces using the following:
batch3 %>%
mutate_if(is.character, trimws)
But no luck there; the results were identical. All three factors have the same sort of distribution through the dataset, but this problem only exists with the 'time' column, and the only thing I can see is the large empty spaces in those levels, which seem resistant to removal.
I did try to recreate the problem by making a smaller data frame, but the issue didn't arise, and the original dataset has about 480 lines of data, so not something I can easily cut and paste.
test_df <- data.frame(a = as.factor(c('Site1', 'Site2', 'Site3','Site1', 'Site2', 'Site3','Site1', 'Site2', 'Site3')),
b = as.factor(c('00','00','00','26','26','26','85','85','85')),
c = as.factor(c('baseline', 'mid','end','baseline', 'mid','end','baseline', 'mid','end')),
d = c(45, 56, 54, 57, 59, 77, 37, 21, 44))
I would have expected to get the following:
> levels(test_df$a)
[1] "Site1" "Site2" "Site3"
> levels(test_df$b)
[1] "00" "26" "85"
> levels(test_df$c)
[1] "baseline" "end" "mid"
So I'm at a bit of a loss really, although the issue seems to be related to the empty space in the factor levels. Any advice on how to approach this?
The test data set and (seemingly) the real data set differ in that the test data set does not have whitespace. You don't show how you are reading in the data, whether it is coming from a CSV file or some other source, but you are on the right track in thinking that the whitespace is the issue.
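The key issue is that when a factor is created, its levels are the distinct strings exactly as they appear in the data. Your strings include whitespace, so the whitespace is part of the levels, and strings that differ only in the amount of whitespace become separate levels.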
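To illustrate the point, here is a minimal sketch using a made-up vector (`raw` and `f` are hypothetical names, and the padding is only a guess at what the real data contains). It also shows a likely reason the `mutate_if(is.character, trimws)` attempt had no effect: the columns are factors, not character vectors, so the predicate never matches them. Separately, the result of that pipe would need to be assigned back to `batch3` to have any lasting effect.

```r
# Hypothetical strings that differ only in the amount of trailing whitespace
raw <- c("baseline ", "baseline  ", "baseline   ",
         "mid ", "mid  ", "end ")

f <- factor(raw)

# Every distinct string, whitespace included, becomes its own level
nlevels(f)
print(levels(f))

# A factor is not a character vector, so an is.character predicate skips it
is.character(f)
is.factor(f)

# Trimming the levels shows that whitespace is the only difference
trimmed <- unique(trimws(levels(f)))
print(trimmed)
```

If `trimws()` leaves some levels unchanged in the real data, the padding may be something other than spaces, tabs, or newlines (for example a non-breaking space), which its default `whitespace` pattern does not cover.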
Let's do this in base R
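A minimal sketch of one base R approach, under assumptions: `batch3_demo` is a small hypothetical stand-in for your `batch3`, and the padding is assumed to be ordinary spaces. The same line could be repeated for the `site` column, which also carries a leading space.

```r
# Hypothetical stand-in for batch3 (the real data has about 480 rows)
batch3_demo <- data.frame(
  site  = factor(c(" Site_1", " Site_2", " Site_3",
                   " Site_1", " Site_2", " Site_3")),
  time  = factor(c("baseline ", "mid ", "end ",
                   "baseline  ", "mid  ", "end  ")),
  value = c(45, 56, 54, 57, 59, 77)
)

# Before: six levels, because the padding differs between rows
nlevels(batch3_demo$time)

# Convert to character, trim the whitespace, then rebuild the factor
batch3_demo$time <- factor(trimws(as.character(batch3_demo$time)))

# After: the duplicates have collapsed into three levels
print(levels(batch3_demo$time))
```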
What that does is: