I have two survey datasets I'm working with. The first one (survey1) is a population/household survey (like the American CPS or the European EU-SILC) with around 150,000 observations; the second one (survey2) is a household budget survey (like the European HBS) with around 50,000 observations. I have to impute household expenditures from survey2 into survey1: for this, I created a distance-based matching between the two surveys using household characteristics, where every household in survey1 is matched to one household in survey2.
I have a pretty big dataset (over 2 million rows) based on survey2 with three columns: household ID, product ID, and expenditure value (the dataframe is called survey2_expenditures). I'm trying to create a similar dataset for survey1, where I would have, for each household ID in survey1, the imputed expenditure value for each product ID from the matched household in survey2. So for example, if household 12 in survey1 were matched to household 23 in the survey2, this new dataset would have household 12's ID in the first column, and household 23's product IDs and expenditure values in the other columns. That is, if in the original survey2 dataset we have:
> survey2_expenditures %>% filter(id == 23)
id product value
1 23 6001 81.700000
2 23 7001 50.286667
3 23 2400 88.356667
4 23 24022 33.973333
5 23 30001 160.00000
6 23 30002 24.380000
7 23 30014 57.910000
The new dataset I'm trying to create would have:
> survey1_expenditures %>% filter(id == 12)
id product value
1 12 6001 81.700000
2 12 7001 50.286667
3 12 2400 88.356667
4 12 24022 33.973333
5 12 30001 160.00000
6 12 30002 24.380000
7 12 30014 57.910000
Since the first survey is around 3x larger than the second one, I would expect this new dataset to have around 6 million rows.
I've tried using a for loop to get the get the data from survey2's dataset and to create the new survey1 expenditure dataset. First I created an empty dataframe with only the three column names:
survey1_expenditures <- data.frame(matrix(ncol=3,nrow=0,
dimnames=list(NULL, c("id", "product", "value"))))
Then I ran the following for loop, where matching is the dataframe containing the original household IDs from survey1 in one column and the matched household IDs from survey2 in the other:
for(id_survey1 in matching$id_survey1){ #looping through each household ID in the survey1
id_survey2 <- matching$id_survey2[id_survey1 ] #get match for that household
matched_expenditures <- survey2_expenditures %>% #
filter(id == id_survey2) %>% #filter rows from matched household
mutate(id = id_survey1) %>% #substitute survey2's ID with survey1's ID
select(id, product, value) #select only ID, product ID and expenditure value
survey1_expenditures <- rbind(survey1_expenditures, matched_expenditures)
}
Even though it seems to work, this is very, very slow. I've also tried creating a list of dataframes and then rbinding them with data.table's rbindlist(), but it is also very slow. Is there a faster to construct the dataset I want?
You can match columns
valueon each data set with a join.But in order to keep only
idfrom the 1st data set andproductfrom the second, select those columns first, then join the data sets.Created on 2023-10-06 with reprex v2.0.2