How to split a dataframe into smaller ones using one unique variable?

55 views Asked by At

I am working with a teacher absences file. I need to flag any instance of 5+ consecutive absences.

I have a dataframe like the one below. How do I split this data into new dataframes by teacher name?

absences <- data.frame(
  staffid = c("123","456","789","101","121", "123", "123", "123", "123"), 
  name = c("Kara","Barbie","Sam","Jane","Chris", "Kara", "Kara", "Kara", "Kara"), 
  date = c(as.Date("2022-08-31"), as.Date("2022-09-01"), as.Date("2022-09-01"), 
           as.Date("2022-09-02"), as.Date("2022-09-07"), as.Date("2022-09-01"), 
           as.Date("2022-09-02"), as.Date("2022-09-06"), as.Date("2022-09-07")),
  schoolday = c(1, 2, 2, 3, 5, 2, 3, 4, 5))

I tried the code below:

absences_new <- absences %>% nest(.by = name) 

which gave me

enter image description here

Sorry, I am new to R and do not know what the mini-data frames in the output are called.

I also tried:

X <- split(absences, absences$name)

which gave me a result more like what I'm looking for, but it is in a format I don't know how to work with.

I also tried:

teachername <- (unique(absences$name))
splitdata <- split(absences, teachername)

but this gave me an error of "data length is not a multiple of split variable."

What I want for my output is something like what this would make:

Kara <- data.frame(
  staffid = c("123","123", "123", "123", "123"), 
  name= c("Kara", "Kara", "Kara", "Kara", "Kara"), 
  date = c(as.Date("2022-08-31"), as.Date("2022-09-01"),  
           as.Date("2022-09-02"), as.Date("2022-09-06"), 
           as.Date("2022-09-07")),
  schoolday = c(1, 2, 3, 4, 5))


Sam <- data.frame(
  staffid = c("789"), 
  name= c("Sam"), 
  date = c(as.Date("2022-09-01")),
  schoolday = c(2))

Then, my plan is to take these mini data frames and scan for any consecutive days.

Thank you!!

2

There are 2 answers

0
Sarab On

This code gets you datasets for each teacher:

library(tidyverse)

teacher_names <- absences %>% distinct(name) %>% pull()

getting_teacher_dfs <- function(teacher_name) {
    absences %>% 
    filter(name == teacher_name)
}

teacher_datasets <- map(teacher_names, getting_teacher_dfs)

You can then pull each dataset like this.

teacher_datasets[[1]]

Overall, though, you probably should look into using group_by feature in dplyr instead of creating this many new datasets.

0
langtang On

If you are looking for consecutive integers in schoolday, you can make a function that returns TRUE if any set of consecutive integers meets/exceed length n, like this this

consecutive_n <- function(d,n=5) {
  any(sapply(split(d, cumsum(c(1, diff(d)!= 1))),length)>=n)
}

And then apply that by each name

absences %>% 
  arrange(date) %>%
  reframe(absent_5 = consecutive_n(schoolday), .by = name)

Output:

    name absent_5
1   Kara     TRUE
2 Barbie    FALSE
3    Sam    FALSE
4   Jane    FALSE
5  Chris    FALSE