I'm trying to label Addresses an Entity ID number based on whether they've been in a transaction together.
The idea is that if an address is in a transaction with another address, it is assumed that all addresses in that transaction, as well as all addresses in future transactions with these addresses, are owned by the same entity.
I'm currently running this on fairly large dataset (~150-180 million obs) in SQL using a loop, but I feel like R's data.table can tackle this much faster and with simpler syntax, I'm just not sure how to do it. Any help is much appreciated!
Here is an example:
DT <- data.table(Address=c('A','B','C','A','D','C','E'), Transaction=c(1,1,2,3,3,4,4))
Address Transaction
A 1
B 1
C 2
A 3
D 3
C 4
E 4
The outcome I'm looking for would look like this:
Address Transaction Entity
A 1 1
B 1 1
C 2 2
A 3 1
D 3 1
C 4 2
E 4 2
I would proceed by iterating over the addresses. For each address, find its fellows/neighbors and give them all have the same entity ID. The mapping from address to entity can be stored in a separate table and mapped onto the transactions table after the loop is finished.
I'm pretty sure this works, but more general example data would help.
As you are faced with millions of records, you'll probably want to look into tweaking this to be a little faster. The costliest step is the computation of
fellows
.You're pretty much tagging connected components of a graph (see graph-theory). Think of the Addresses as nodes and each Transaction as adding
The R
igraph
package may be of interest; more likely than not, it computes connected components. (I'm not familiar enough with it to know.)Generally, if your data set is growing, keeping this variable up-to-date will be a real headache, and I think you should see if you can do without it.