I have 2 dataframes, which I need to join using the fuzzyjoin function. I've tried performing the function on the whole dataframes but do not have enough memory to do so. One of the dataframes [UPRN] acts as source data holding a unique identifier for addresses, the other [Address] holds addresses that needs to be matched to the unique identifier.
I'm aware there's a fair few questions relating to the below but none I've found seem to be answering my query.
I'm looking to split [Address] which is roughly 45000 rows, into manageable chunks (read smaller dataframes) of say 5000 rows, just by row position. I want to then use these small dataframes to then fuzzyjoin to the [UPRN] dataframe. For example [Address1] reads the first 5000 rows, applies the fuzzyjoin and outputs [Join1], then [Address2] reads rows 5001 to 10000, applies the fuzzyjoin and outputs [Join2] and so on.
A small example of what I'm after with the splitting below;
> Address
Street Town PostCode
742 Evergreen Terrace Springfield SP12 HS1
84 Evergreen Terrace Springfield SP14 DH9
....3 to 4999 skipped
23 Evergreen Terrace Springfield SP19 IA18
3230 Evergreen Terrace Springfield SP2 K43
**Function to split [Address]**
> Address1
Street Town PostCode
742 Evergreen Terrace Springfield SP12 HS1
84 Evergreen Terrace Springfield SP14 DH9
...3 to 5000 skipped
> Address2
Street Town PostCode
23 Evergreen Terrace Springfield SP19 IA18
3230 Evergreen Terrace Springfield SP2 K43
...5003 to 10000 skipped
I then want to sequentially join the Address1 to UPRN, and then Address2 to UPRN, outputting to either individual files (which I can then append) or outputting to the same file. The join function I have already, just need a way to call each separate dataframe. How would I go about doing such a thing? Which functions should I be looking for?
If you split (e.g. with
base::split
ordplyr::group_split
) your Address data frame into a list of data frames, then you can callpurrr::map
on the list.purrr::map(list_of_dfs, ~fuzzy_join(x=., y=UPRN, by = "Street"))
Your result will be a list of data frames each fuzzyjoined with UPRN. You can then call
bind_rows
(or you could domap_dfr
) to get all the results in the same data frame again.