R Spread function with duplicates- still can't get to work after adding transient row

149 views Asked by At

trying to get the spread() function to work with duplicates in the key column- yes, this has been covered before but I can't seem to get it to work and I've spent the better part of a day on it (somewhat new to R).

I have two columns of data. The first column 'snowday' represents the first day of a winter season, with the corresponding snow depth in the 'depth' column. This is several years of data (~62 years). So there should be sixty two years of first, second, third, etc days for the snowday column- this produces duplicates in snowday:

    snowday row depth
       1   1     0
       1   2     0
       1   3     0
       1   4     0
       1   5     0
       1   6     0
...

      75 4633    24
      75 4634     4
      75 4635     6
      75 4636    20
      75 4637    29
      75 4638     1

I added a "row" column to make the data frame more transient (which I vaguely understand to be hones so 1:4638 rows is the total measurements taken over ~62 years at 75 days per year . Now i'd like to spread it wide:

wide <- spread(seasondata, key = snowday, value =  depth, fill = 0)

and i get all zeros:

row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
 1 0 0 0 0 0 0 0 0 0  0  0   0  0 0
 2 0 0 0 0 0 0 0 0 0  0  0   0  0 0
 3 0 0 0 0 0 0 0 0 0  0  0   0  0 0

what I want it to look like is something like this (the columns are defined by the "snowday" and the row values are the various depths recorded on for that particular day over the various years- e.g. days 1 through 11 :

   1 2 3 4 5 6 7 8 9 10 11 12 13 14 
   2 1 3 4 0 0 1 0 2  8  9 19  0 3
   0 8 0 0 0 4 0 6 6  0  1  0  2 0
   3 5 0 0 0 2 0 1 0  2  7  0 12 4

I think I'm fundamentally missing something here- I've tried working through drop=TRUE or convert = TRUE, and the output values are either all zeros or NA's depending on how I tinker. Also, all values in the data.frame(seasondata) are integers. Any thoughts?

1

There are 1 answers

1
Weihuang Wong On

It seems to me what you wish to do is to split up the the depth column according to values of snowday, and then bind all the 75 columns together.

There is a complication, in that 62*75 is not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by filling short columns up with NAs.

I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.

set.seed(1)
seasondata <- data.frame(
  snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
  depth = round(runif(10, 0, 10), 0))
#    snowday depth
# 1        1     3
# 2        1     4
# 3        1     6
# 4        2     9
# 5        2     2
# 6        2     9
# 7        3     9
# 8        3     7
# 9        4     6
# 10       4     1

We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).

m <- max(table(seasondata$snowday))

Now, we use the by function to split up depth by values of snowdays, and fill short columns with NAs, and finally cbind all the columns together:

out <- do.call(cbind, 
  by(seasondata$depth, seasondata$snowday,
    function(x) {
      c(x, rep(NA, m - length(x)))
    }
  )
)
out
#      1 2  3  4
# [1,] 3 9  9  6
# [2,] 4 2  7  1
# [3,] 6 9 NA NA

Using spread:

You can use spread if you wish. In this case, you have to define row correctly. row should be 1 for the first first snowday (snowday == 1), 2 for the second first snowday, etc. row should also be 1 for the first second snowday, 2 for the second second snowday, etc.

seasondata$row <- unlist(sapply(rle(seasondata$snowday)$lengths, seq_len))
seasondata
#    snowday depth row
# 1        1     3   1
# 2        1     4   2
# 3        1     6   3
# 4        2     9   1
# 5        2     2   2
# 6        2     9   3
# 7        3     9   1
# 8        3     7   2
# 9        4     6   1
# 10       4     1   2

Now we can use spread:

library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
#   row 1 2  3  4
# 1   1 3 9  9  6
# 2   2 4 2  7  1
# 3   3 6 9 NA NA