Is there a function within dcast that allows me to include additional conditions?

58 views Asked by At

I'm trying to create a wide format dataset that would include only some of the long format data. This is data from learners going through an online learning module in which they sometimes get "stuck" in a screen, therefore have multiple attempts recorded for that screen.

lesson_long <- data.frame (id  = c(4256279, 4256279, 4256279, 4256279, 4256279, 4256279, 4256308, 4256308, 4256308, 4256308),
                           screen = c("survey1", "survey1", "survey1", "survey1", "survey2", "survey2", "survey1", "survey1", "survey2", "survey2"),
                           question_attempt = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1),
                           variable = c("age", "country", "age", "country", "education", "course", "age", "country", "education", "course"),
                           response = c(0, 5, 20, 5, 3, 2, 18, 5, 4, 1 ))

.

id       screen     question_attempt variable response
4256279  survey1            1           age       0
4256279  survey1            1         country     5
4256279  survey1            2           age       20
4256279  survey1            2         country     5
4256279  survey2            1        education    3
4256279  survey2            1         course      2
4256308  survey1            1           age       18
4256308  survey1            1         country     5
4256308  survey2            1        education    4
4256308  survey2            1         course      1

For my analyses I need to include only their response in their last attempt in each screen (or response on their max question_attempt - sometimes they have up to 8 or 9 attempts in each screen). All previous attempts will be dismissed and I don't need to have the screen name in the final dataset. The final wide format would look like this:

id        age  country education course
4256279   20     5         3         2
4256308   18     5         4         1

I've been trying to do this with just dcast (unsuccessfully):

lesson_wide <- dcast(lesson_long, `id` ~ variable, value.var = "response", fun.aggregate = max("question_attempt"), fill=0)

The fun.aggregate is obviously not working as I made it up... But is there a solution for this? Or perhaps I need an additional step to select the data before using dcast? But how would do this if that's the solution?

Curious to see your answers. Thanks in advance!

1

There are 1 answers

3
Ronak Shah On BEST ANSWER

You can order the data by id, screen and question_attempt and select the last value of each question_attempt.

library(data.table)

setDT(lesson_long)

dcast(lesson_long[order(id, screen, question_attempt)], 
      id~variable, value.var = 'response', fun.aggregate = last, fill = NA)

#        id age country course education
#1: 4256279  20       5      2         3
#2: 4256308  18       5      1         4

Similarly, using dplyr and tidyr :

library(dplyr)

lesson_long %>%
  arrange(id, screen, question_attempt) %>%
  tidyr::pivot_wider(names_from = variable, values_from = response, 
                     id_cols = id, values_fn = last)