How to non-randomly sample every n rows in dplyr?


I'd like to do something like sample_n() in dplyr, except that I don't want the sampling to be random: I want to take every nth row.

Is there a way to do this?

For example, I want to get every 10th row of the airquality dataset after ordering by Month and Day. Expected output:

   Ozone Solar.R Wind Temp Month Day
      NA     194  8.6   69     5  10
      11      44  9.7   62     5  20
     115     223  5.7   79     5  30
      71     291 13.8   90     6   9
      12     120 11.5   73     6  19
      NA      31 14.9   77     6  29

    ...

There are 3 answers

jmk On

You would really want to be subsetting using a sequence.

mtcars[seq(1, nrow(mtcars), 10), ]

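Replace both occurrences of mtcars with your data.frame, and replace 10 with the interval n you want. Note that this sequence starts at row 1, so it returns rows 1, 11, 21, and so on; start it at 10 instead, as in seq(10, nrow(mtcars), 10), to get rows 10, 20, 30, and so on.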
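To make the starting offset concrete, here is a minimal base R sketch applied to the airquality data from the question. The names aq, step, from_first and from_nth are illustrative only and not part of the original answer.

```r
# Order the data as in the question (airquality ships with base R).
aq <- airquality[order(airquality$Month, airquality$Day), ]

step <- 10  # keep one row out of every `step` rows

# Starting the sequence at 1 keeps rows 1, 11, 21, ...
from_first <- aq[seq(1, nrow(aq), by = step), ]

# Starting the sequence at `step` keeps rows 10, 20, 30, ...
from_nth <- aq[seq(step, nrow(aq), by = step), ]

head(from_nth$Day)
```

The second variant is the one that should line up with the expected output in the question, since that output begins at Day 10 of month 5.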

akrun On

If you intend to split a data.frame into consecutive blocks of 'n' rows and randomly draw 'n1' rows from each block, create a grouping variable for every 'n' rows and use sample_n. (That is how I interpreted the question; feel free to correct me.)

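library(dplyr)
n <- 6   # block size: number of consecutive rows per group
n1 <- 3  # number of rows to draw at random from each block
df1 %>%
    # gl() labels consecutive blocks of n rows as 1, 2, 3, ...
    group_by(gr = as.numeric(gl(n(), n, n()))) %>%
    sample_n(n1)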

data

set.seed(24)
df1 <- as.data.frame(cbind(rn=1:40, matrix(sample(0:10,3*40, 
            replace=TRUE), ncol=3)) ) 
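As a rough illustration of what the grouping variable looks like, the base R sketch below builds the same block labels with integer division and compares them to the gl() call used above. The names n_rows, block and block_gl are made up for this example; 40 matches the number of rows in df1.

```r
n <- 6        # block size, as in the answer
n_rows <- 40  # number of rows in df1

# Rows 1-6 get label 1, rows 7-12 get label 2, and so on.
block <- (seq_len(n_rows) - 1) %/% n + 1

# The same labels as produced by the gl() expression in the answer.
block_gl <- as.numeric(gl(n_rows, n, n_rows))

table(block)
```

Because 40 is not a multiple of 6, the last block holds only 4 rows. With n1 = 3 that is still enough to sample from, but a larger n1 would presumably make sample_n() fail on that final block unless replace = TRUE is set.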
saladi On

If you have a dataframe of ordered data that you'd like to sample, you can filter on row_number:

library(tidyverse)

airquality %>%
  arrange(Month, Day) %>%
  filter(row_number() %% 10 == 0) %>%
  head()
#>   Ozone Solar.R Wind Temp Month Day
#> 1    NA     194  8.6   69     5  10
#> 2    11      44  9.7   62     5  20
#> 3   115     223  5.7   79     5  30
#> 4    71     291 13.8   90     6   9
#> 5    12     120 11.5   73     6  19
#> 6    NA      31 14.9   77     6  29

Because the data are not grouped by month, every 10th row of the whole data frame is retained, so Day drifts from 10, 20, 30 to 9, 19, 29 once the 31 days of May have passed. Grouping by Month gets around this:

airquality %>%
  arrange(Month, Day) %>%
  group_by(Month) %>%
  filter(row_number() %% 10 == 0) %>%
  head()
#> # A tibble: 6 x 6
#> # Groups:   Month [2]
#>   Ozone Solar.R  Wind  Temp Month   Day
#>   <int>   <int> <dbl> <int> <int> <int>
#> 1    NA     194  8.60    69     5    10
#> 2    11      44  9.70    62     5    20
#> 3   115     223  5.70    79     5    30
#> 4    39     323 11.5     87     6    10
#> 5    13     137 10.3     76     6    20
#> 6    NA     138  8.00    83     6    30

Of course, we could have just used filter(Day %% 10 == 0), but one doesn't always have such nice numbers to work with!
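An alternative sketch of the same idea uses slice() with explicit row positions instead of a modulo filter. It assumes dplyr is installed; step, every_nth and per_month are illustrative names, not taken from the answers above.

```r
library(dplyr)

step <- 10  # keep one row out of every `step` rows

# Ungrouped: positions 10, 20, 30, ... of the whole ordered data frame.
every_nth <- airquality %>%
  arrange(Month, Day) %>%
  slice(seq_len(n() %/% step) * step)

# Grouped: positions 10, 20, 30 within each month.
per_month <- airquality %>%
  arrange(Month, Day) %>%
  group_by(Month) %>%
  slice(seq_len(n() %/% step) * step) %>%
  ungroup()

head(every_nth)
head(per_month)
```

Building the positions with seq_len(n() %/% step) * step rather than seq(step, n(), by = step) is a deliberate choice here: it yields an empty selection instead of an error when a group has fewer than step rows.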