Fast way to create dataframe in a rbind() style

62 views Asked by At

I have two survey datasets I'm working with. The first one (survey1) is a population/household survey (like the American CPS or the European EU-SILC) with around 150,000 observations; the second one (survey2) is a household budget survey (like the European HBS) with around 50,000 observations. I have to impute household expenditures from survey2 into survey1: for this, I created a distance-based matching between the two surveys using household characteristics, where every household in survey1 is matched to one household in survey2.

I have a pretty big dataset (over 2 million rows) based on survey2 with three columns: household ID, product ID, and expenditure value (the dataframe is called survey2_expenditures). I'm trying to create a similar dataset for survey1, where I would have, for each household ID in survey1, the imputed expenditure value for each product ID from the matched household in survey2. So for example, if household 12 in survey1 were matched to household 23 in the survey2, this new dataset would have household 12's ID in the first column, and household 23's product IDs and expenditure values in the other columns. That is, if in the original survey2 dataset we have:

> survey2_expenditures %>% filter(id == 23)
   id product     value 
1  23    6001 81.700000
2  23    7001 50.286667
3  23    2400 88.356667
4  23   24022 33.973333
5  23   30001 160.00000
6  23   30002 24.380000
7  23   30014 57.910000

The new dataset I'm trying to create would have:

> survey1_expenditures %>% filter(id == 12)
   id product     value 
1  12    6001 81.700000
2  12    7001 50.286667
3  12    2400 88.356667
4  12   24022 33.973333
5  12   30001 160.00000
6  12   30002 24.380000
7  12   30014 57.910000

Since the first survey is around 3x larger than the second one, I would expect this new dataset to have around 6 million rows.

I've tried using a for loop to get the get the data from survey2's dataset and to create the new survey1 expenditure dataset. First I created an empty dataframe with only the three column names:

survey1_expenditures <- data.frame(matrix(ncol=3,nrow=0, 
                                       dimnames=list(NULL, c("id", "product", "value"))))

Then I ran the following for loop, where matching is the dataframe containing the original household IDs from survey1 in one column and the matched household IDs from survey2 in the other:

for(id_survey1 in matching$id_survey1){ #looping through each household ID in the survey1
  id_survey2 <- matching$id_survey2[id_survey1 ] #get match for that household
  
  matched_expenditures <- survey2_expenditures %>% #
    filter(id == id_survey2) %>% #filter rows from matched household
    mutate(id = id_survey1) %>% #substitute survey2's ID with survey1's ID
    select(id, product, value) #select only ID, product ID and expenditure value
  
  survey1_expenditures <- rbind(survey1_expenditures, matched_expenditures)
}

Even though it seems to work, this is very, very slow. I've also tried creating a list of dataframes and then rbinding them with data.table's rbindlist(), but it is also very slow. Is there a faster to construct the dataset I want?

1

There are 1 answers

0
Rui Barradas On

You can match columns value on each data set with a join.
But in order to keep only id from the 1st data set and product from the second, select those columns first, then join the data sets.

survey2_expenditures <- read.table(text = 
"id product     value 
1  23    6001 81.700000
2  23    7001 50.286667
3  23    2400 88.356667
4  23   24022 33.973333
5  23   30001 160.00000
6  23   30002 24.380000
7  23   30014 57.910000"
, header = TRUE)

survey1_expenditures <- read.table(text = 
"id product     value 
1  12    6001 81.700000
2  12    7001 50.286667
3  12    2400 88.356667
4  12   24022 33.973333
5  12   30001 160.00000
6  12   30002 24.380000
7  12   30014 57.910000"
, header = TRUE)

library(dplyr)

left_join(
  survey1_expenditures %>% select(id, value),
  survey2_expenditures %>% select(product, value),
  by = "value"
)[c(1L, 3L, 2L)]
#>   id product     value
#> 1 12    6001  81.70000
#> 2 12    7001  50.28667
#> 3 12    2400  88.35667
#> 4 12   24022  33.97333
#> 5 12   30001 160.00000
#> 6 12   30002  24.38000
#> 7 12   30014  57.91000

Created on 2023-10-06 with reprex v2.0.2