I have 2 dataframes, which I need to join using the fuzzyjoin function. I've tried performing the function on the whole dataframes but do not have enough memory to do so. One of the dataframes [UPRN] acts as source data holding a unique identifier for addresses, the other [Address] holds addresses that needs to be matched to the unique identifier.

I'm aware there's a fair few questions relating to the below but none I've found seem to be answering my query.

I'm looking to split [Address] which is roughly 45000 rows, into manageable chunks (read smaller dataframes) of say 5000 rows, just by row position. I want to then use these small dataframes to then fuzzyjoin to the [UPRN] dataframe. For example [Address1] reads the first 5000 rows, applies the fuzzyjoin and outputs [Join1], then [Address2] reads rows 5001 to 10000, applies the fuzzyjoin and outputs [Join2] and so on.

A small example of what I'm after with the splitting below;

> Address
Street                   Town            PostCode
742 Evergreen Terrace    Springfield     SP12 HS1
84 Evergreen Terrace     Springfield     SP14 DH9
....3 to 4999 skipped
23 Evergreen Terrace     Springfield     SP19 IA18
3230 Evergreen Terrace   Springfield     SP2 K43


**Function to split [Address]**
> Address1
Street                   Town            PostCode
742 Evergreen Terrace    Springfield     SP12 HS1
84 Evergreen Terrace     Springfield     SP14 DH9
...3 to 5000 skipped

> Address2
Street                   Town            PostCode
23 Evergreen Terrace     Springfield     SP19 IA18
3230 Evergreen Terrace   Springfield     SP2 K43
...5003 to 10000 skipped

I then want to sequentially join the Address1 to UPRN, and then Address2 to UPRN, outputting to either individual files (which I can then append) or outputting to the same file. The join function I have already, just need a way to call each separate dataframe. How would I go about doing such a thing? Which functions should I be looking for?

1

There are 1 answers

0
Arthur Yip On BEST ANSWER

If you split (e.g. with base::split or dplyr::group_split) your Address data frame into a list of data frames, then you can call purrr::map on the list.

purrr::map(list_of_dfs, ~fuzzy_join(x=., y=UPRN, by = "Street"))

Your result will be a list of data frames each fuzzyjoined with UPRN. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.