Creating dyad-pair averages in R


I want to compute pairwise average prices of commodities produced by countries. My data looks like this:

df <- data.frame(country    = c("US; UK; FI", "CN; IT; US; GR", "UK; US"),
                 product_id = c(1, 2, 3),
                 price      = c(300, 500, 200))

I want to transform the data so that each row is a dyad (a pair of countries) with the average price of the products they both produce. Something like this:

Ctr_1 Ctr_2 Avg_Price
US    UK    250
US    FI    300
US    CN    500
US    IT    500
UK    FI    300
UK    US    250
CN    IT    500
CN    US    500
CN    GR    500
IT    CN    500
IT    US    500
IT    GR    500
GR    CN    500
GR    IT    500
GR    US    500

I tried reshaping the data to long format:

library(data.table)

setDT(df)

df1 <- df[, .(country = unlist(strsplit(country, "; "))), by = .(product_id)]

But I don't know how to proceed from here. Any help would be really appreciated. The real data also has a year variable, and the idea is to aggregate pairwise per year to create a panel dataset.


There are 2 answers

mt1022 (best answer)
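# long format: one row per (product_id, price, country); assumes setDT(df) from the question
df1 <- df[, .(country = strsplit(country, '; ')[[1]]), by = .(product_id, price)]

# build all country pairs with a cross join (CJ), keeping product_id and price of c1
df2 <- df1[CJ(country, c2 = country),
           on = .(country), allow.cartesian = TRUE][country < c2]  # keep each unordered pair once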

# join product_id and price of c2, then get average
res <- df1[df2, on = .(country = c2, product_id), nomatch = 0][
  , .(avg_price = mean(price)), by = .(c1 = country, c2 = i.country)]

res
#    c1 c2 avg_price
# 1: GR CN       500
# 2: IT CN       500
# 3: US CN       500
# 4: UK FI       300
# 5: US FI       300
# 6: IT GR       500
# 7: US GR       500
# 8: US IT       500
# 9: US UK       250
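
One way to sanity-check a single row of this result is to recompute one dyad directly from the original data. The sketch below assumes the example `df` from the question and uses only base R; `dyad_mean` is a hypothetical helper written for illustration, not part of data.table.

```r
# Example data from the question
df <- data.frame(country    = c("US; UK; FI", "CN; IT; US; GR", "UK; US"),
                 product_id = c(1, 2, 3),
                 price      = c(300, 500, 200))

# Hypothetical helper: mean price over the products that list both countries
dyad_mean <- function(d, a, b) {
  members <- strsplit(as.character(d$country), "; ", fixed = TRUE)
  both <- vapply(members, function(m) all(c(a, b) %in% m), logical(1))
  mean(d$price[both])
}

dyad_mean(df, "US", "UK")  # products 1 and 3: (300 + 200) / 2 = 250
dyad_mean(df, "US", "FI")  # product 1 only: 300
```

This scans every product for each dyad, so it is only meant for spot checks, not for building the full table.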
jblood94

From long format, a non-equi join on country will give the pairs for each product_id. However, non-equi joins don't work with character columns, so we first get a country index. After the join, get the average price with a grouping operation:

df1 <- df[
  ,.(country = unlist(strsplit(country, "; ")), price = price),
  by = .(product_id)
]

df1[,ctr_id := match(country, unique(country))][
  df1,
  on = .(product_id = product_id, ctr_id > ctr_id),
  .(Ctr_1 = i.country, Ctr_2 = x.country, price = price),
  nomatch = 0
][,.(Avg_Price = mean(price)), .(Ctr_1, Ctr_2)]
#>    Ctr_1 Ctr_2 Avg_Price
#> 1:    US    UK       250
#> 2:    US    FI       300
#> 3:    UK    FI       300
#> 4:    CN    IT       500
#> 5:    CN    GR       500
#> 6:    IT    GR       500
#> 7:    US    CN       500
#> 8:    US    IT       500
#> 9:    US    GR       500

Alternatively, we can get the combinations while doing the strsplit:

library(RcppAlgos)

df[
  ,{
    m <- comboGeneral(sort(strsplit(country, "; ")[[1]]), 2)
    .(Ctr_1 = m[,1], Ctr_2 = m[,2], price = price)
  }, product_id
][,.(Avg_Price = mean(price)), .(Ctr_1, Ctr_2)]
#>    Ctr_1 Ctr_2 Avg_Price
#> 1:    FI    UK       300
#> 2:    FI    US       300
#> 3:    UK    US       250
#> 4:    CN    GR       500
#> 5:    CN    IT       500
#> 6:    CN    US       500
#> 7:    GR    IT       500
#> 8:    GR    US       500
#> 9:    IT    US       500
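
The question also mentions a year variable and a per-year panel. A minimal base-R sketch of that extension follows, assuming a `year` column whose values here are invented for illustration. It swaps in `combn` from base R for `comboGeneral`, and the names `pairs` and `panel` are hypothetical.

```r
# Example data from the question plus a HYPOTHETICAL year column (assumed values)
df <- data.frame(country    = c("US; UK; FI", "CN; IT; US; GR", "UK; US"),
                 product_id = c(1, 2, 3),
                 price      = c(300, 500, 200),
                 year       = c(2000, 2000, 2001))

# One row per (product, unordered country pair).
# Sorting the countries makes US-UK and UK-US the same dyad.
pairs <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  ctr <- sort(unique(strsplit(as.character(df$country[i]), "; ", fixed = TRUE)[[1]]))
  if (length(ctr) < 2) return(NULL)  # a single country forms no dyad
  m <- t(combn(ctr, 2))              # one row per pair, two columns
  data.frame(year = df$year[i], Ctr_1 = m[, 1], Ctr_2 = m[, 2],
             price = df$price[i])
}))

# Average price per dyad and year
panel <- aggregate(price ~ year + Ctr_1 + Ctr_2, data = pairs, FUN = mean)
names(panel)[names(panel) == "price"] <- "Avg_Price"
panel
```

With these assumed years, the UK-US dyad appears once per year (300 in 2000, 200 in 2001) rather than being pooled to 250. In the data.table versions above, the same effect should follow from adding `year` to the final grouping, provided it is carried through the earlier steps.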