Reduce nrows of large data frame to nrows of smaller data frame when dimensions are not divisible


I have two data frames. One has ~133 rows and the other ~4337 rows. Each has two columns containing the same type of information: sun elevation in the first column and radiance in the second. I would like to reduce the number of rows of the large data frame to the number of rows in the small data frame so that I can proceed with analysis without getting dimension errors. I do not want to combine them into a single data frame.

The thing is, I don't want to lose any data. On further inspection, I realized that I also can't take means, because means are not physically meaningful for my data.

I've been trying to find something in dplyr or reshape2 that will do this, but have not had luck so far.

Notes:

  • Dimensions in the example are smaller than my real world dimensions for simplicity

The solution presented here appears to be close: Calculate the mean of every 13 rows in data frame in R

However, I'm running into rounding problems that leave me with too many or too few rows in the resulting data frame.

Code example trying to implement the above-linked solution:

set.seed(123)
df1 <- data.frame(sunel = sample(c(-6:4), 133, replace = TRUE),
                  rad = sample(c(1000:500000), 133, replace = TRUE))
df2 <- data.frame(sunel = sample(c(-15:15), 4337, replace = TRUE),
                  rad = sample(c(100:5000000), 4337, replace = TRUE))
# Keep only the sun-elevation range shared with df1
df2a <- df2[df2$sunel >= -6 & df2$sunel <= 4, ]

# Average groups of n consecutive rows (adapted from the linked answer)
n <- (nrow(df2a) %/% 133) - 1
df3 <- aggregate(df2a,
                 list(rep(1:((nrow(df2a) %/% n) + 1), each = n, len = nrow(df2a))),
                 mean)
nrow(df1)
# [1] 133
nrow(df2a)
# [1] 1520
nrow(df3)
# [1] 150
min(df1$sunel);max(df1$sunel)
# [1] -6
# [1] 4
min(df2a$sunel);max(df2a$sunel)
# [1] -6
# [1] 4
min(df3$sunel);max(df3$sunel)
# [1] -3.2
# [1] 1.9

I have tried changing n, but due to rounding it results in either ~130 rows (too few) or too many (as shown in the example). It is also important for me to maintain roughly the same range of sunel, and the range in df3 is not acceptable.
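
For what it's worth, the rounding itself can be sidestepped by fixing the number of groups at nrow(df1) and letting the group sizes differ by at most one row, so no rows are dropped. A minimal sketch follows; it still implies aggregating each group (here with a mean, which I've said is not physically meaningful for my data), so it only fixes the row count:

k <- nrow(df1)
# rep_len() cycles 1..k across all rows; sorting makes the groups
# contiguous, with sizes differing by at most one row
grp <- sort(rep_len(seq_len(k), nrow(df2a)))
length(unique(grp))
# [1] 133
nrow(aggregate(df2a, list(grp), mean))
# [1] 133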

Here is a hacky workaround I've found using caret. I would appreciate any advice on a more elegant solution.

library(caret)

# Target proportion: nrow(df1) / nrow(df2a)
133/1520
# [1] 0.0875

inTrain <- createDataPartition(df2a$sunel, p = .0875, list = FALSE)
nrow(inTrain)
# [1] 135 #Nope

inTrain <- createDataPartition(df2a$sunel, p = .0874, list = FALSE)
nrow(inTrain)
# [1] 135 #Still nope

inTrain <- createDataPartition(df2a$sunel, p = .086, list = FALSE)
nrow(inTrain)
# [1] 133 #Awesome

df3a <- df2a[inTrain, ]
nrow(df3a)
# [1] 133
min(df3a$sunel); max(df3a$sunel)
# [1] -6
# [1] 4
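
A deterministic alternative sketch that avoids hand-tuning p: sort df2a by sunel and take nrow(df1) evenly spaced rows. The first and last rows are always kept, so the full -6 to 4 range is preserved:

# Evenly spaced subsample that keeps the extremes (a sketch, not a
# definitive solution)
df2s <- df2a[order(df2a$sunel), ]
idx <- round(seq(1, nrow(df2s), length.out = nrow(df1)))
df3a <- df2s[idx, ]
nrow(df3a)
# [1] 133
min(df3a$sunel); max(df3a$sunel)
# [1] -6
# [1] 4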

There are 2 answers

Answer from Joel Lafond Lapalme:

I suggest you bootstrap: http://www.ats.ucla.edu/stat/r/library/bootstrap.htm

Resampling is your way to get a representative sample of your big dataset!
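
A minimal sketch of that idea in base R (a plain row resample; set replace = FALSE if you want a subsample without repeated rows):

# Resample df2a down to nrow(df1) rows
set.seed(42)  # arbitrary seed, for reproducibility only
idx <- sample(nrow(df2a), size = nrow(df1), replace = TRUE)  # bootstrap draw
df3boot <- df2a[idx, ]
nrow(df3boot)
# [1] 133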

Answer from jmk:

Would the simecol package and its approxTime function hold the answer for you? It might be too restrictive for your dataset, and you would need to work out the interpolations for your xout vector.
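
A rough sketch of that approach, assuming approxTime() accepts a data frame whose first column is the ordering variable (here, sunel standing in for time); note that duplicated sunel values will be collapsed by the underlying approx():

library(simecol)  # provides approxTime()

# Interpolate df2a's columns at the 133 sun elevations in df1
df2s <- df2a[order(df2a$sunel), ]
df3i <- approxTime(df2s, xout = sort(df1$sunel))
nrow(df3i)
# [1] 133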