Aggregate discrete frequency table in smaller table with fewer intervals

45 views Asked by At

I have a frequency table that contains 800 rows. Here is a similar table for example:

table<- data.frame(number = c(1:10),
                   units = c(800, 780,500, 430, 200, 189, 110, 86, 54, 31))

I would like to create a table with fewer rows, with intervals having roughly the same amount of units, respecting the fact that the intervals must be discrete and based on the numbers column. How can I do this in R?

If the above is not possible, some help with aggregating the table with personalised intervals would be very appreciated.

1

There are 1 answers

1
int 21h -- Glory to Ukraine -- On

It seems that you are looking for binning. Try cut_interval() from {ggplot2}, it will create several groups (bins) of your values, then look at levels.

library(ggplot2)
table <- data.frame(number = c(1:10),
                    units = c(800, 780,500, 430, 200, 189, 110, 86, 54, 31))
cut_interval(table$units, 3)

Another similar function is cut_number

EDIT:

I apologise, turned out that that function is pretty useless here.

It's still not completely clear, what is your desired output.

I have 2 solutions. Scenario 1. You just want to keep, let's say, every 3rd row. Here, you can simply subset with filter:

library(dplyr)
    
n_rows <- 3 # put your value here
table |>
filter(number %in% seq(min(number), max(number), by = n_rows))

Output:

# A tibble: 4 × 2
  number units
   <dbl> <dbl>
1      1   800
2      4   430
3      7   110
4     10    31

Scenario 2, based on bins - taking one value out of a bin:

bins <- binst::create_bins(table$units,
seq(min(table$units), max(table$units),
(max(table$units) - min(table$units))/n_rows))

table1 <- table |>
mutate(groups = n_rows*(number-1) %/% n_rows+1)

cbind(table1, bins = bins) |>
group_by(bins) |>
summarize(units = max(units)) |>
left_join(table, by = "units") |>
select(number, units) |>
arrange(number)

Output:

# A tibble: 4 × 2
  number units
   <int> <dbl>
1      1   800
2      3   500
3      5   200
4     10    31

Choose the scenario you like. You can also provide the desired output to avoid the confusion.