Connecting components of a graph using data.table

168 views Asked by At

I'm trying to label Addresses an Entity ID number based on whether they've been in a transaction together.

The idea is that if an address is in a transaction with another address, it is assumed that all addresses in that transaction, as well as all addresses in future transactions with these addresses, are owned by the same entity.

I'm currently running this on fairly large dataset (~150-180 million obs) in SQL using a loop, but I feel like R's data.table can tackle this much faster and with simpler syntax, I'm just not sure how to do it. Any help is much appreciated!

Here is an example:

DT <- data.table(Address=c('A','B','C','A','D','C','E'), Transaction=c(1,1,2,3,3,4,4))

Address  Transaction
      A            1
      B            1
      C            2
      A            3
      D            3
      C            4
      E            4

The outcome I'm looking for would look like this:

Address  Transaction  Entity
      A            1       1
      B            1       1
      C            2       2
      A            3       1
      D            3       1
      C            4       2
      E            4       2
1

There are 1 answers

3
Frank On BEST ANSWER

I would proceed by iterating over the addresses. For each address, find its fellows/neighbors and give them all have the same entity ID. The mapping from address to entity can be stored in a separate table and mapped onto the transactions table after the loop is finished.

setkey(DT,Address)
AE  <- DT[,.(Address=unique(Address),Entity=NA_integer_)]
eid <- 0L
for (a in AE$Address){

    ts      <- DT[.(a)]$Transaction
    fellows <- DT[Transaction %in% ts, unique(Address)]
    f_ent   <- AE[.(fellows)][!is.na(Entity), if (.N > 0) min(Entity) else 0L]

    a_ent <- if (f_ent == 0L){
        eid = eid + 1L
    } else {
        f_ent
    }

    AE[.(fellows),Entity:=a_ent]
}

DT[AE,Entity:=Entity][order(Transaction)]
#    Address Transaction Entity
# 1:       A           1      1
# 2:       B           1      1
# 3:       C           2      2
# 4:       A           3      1
# 5:       D           3      1
# 6:       C           4      2
# 7:       E           4      2

I'm pretty sure this works, but more general example data would help.


As you are faced with millions of records, you'll probably want to look into tweaking this to be a little faster. The costliest step is the computation of fellows.

You're pretty much tagging connected components of a graph (see ). Think of the Addresses as nodes and each Transaction as adding

  • new Addresses and
  • new links between Addresses.

The R igraph package may be of interest; more likely than not, it computes connected components. (I'm not familiar enough with it to know.)

Generally, if your data set is growing, keeping this variable up-to-date will be a real headache, and I think you should see if you can do without it.