I have a data frame with 3 columns, namely node1, node2 and cluster which was extracted from getLinkCommunities in linkcomm package in R. Here is a reproductive script.
community <- data.frame(
node1 = c(
"st_sub_main_th=hira", "roo_main=2", "st_con_rt=sub-room", "roo_main=1", "roo_main=1",
"roo_main=1", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma",
"st_th=hira", "st_th=hira", "roo_main=1", "st_th=hira", "st_th=hira", "st_con_rt=main-room",
"st_con_rt=main-room", "st_con_tr=terrace", "roo_main=4", "roo_main=4", "roo_main=4",
"roo_main=4", "st_th=tsuma", "st_th=tsuma", "st_sub_main_th=hira", "st_sub_main_th=hira",
"st_sub_main_th=hira", "st_sub_main_th=hira", "roo_main=2", "roo_main=2", "roo_main=2",
"st_con_tr=direct", "st_con_tr=direct"),
node2 = c(
"st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=direct",
"st_con_rt=sub-room", "st_adsb=add", "st_con_rt=sub-room", "st_con_tr=terrace",
"st_adsb=add", "roo_main=1", "st_con_rt=main-room", "st_con_rt=main-room",
"st_con_tr=terrace", "st_adsb=add", "st_con_tr=terrace", "st_adsb=add", "st_adsb=add",
"st_th=tsuma", "st_con_rt=main-room", "st_con_tr=terrace", "st_adsb=add",
"st_con_rt=main-room", "st_adsb=add", "roo_main=2", "st_con_tr=direct",
"st_con_rt=sub-room", "st_adsb=add", "st_con_tr=direct", "st_con_rt=sub-room",
"st_adsb=add", "st_con_rt=sub-room", "st_adsb=add"
),
cluster = c(
1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7
)
)
On the other hand, I have an original data frame which was used for association analysis.
df <- data.frame(
isstilt = c(NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA),
st_con_rt = c(
"sub-room", "main-room", "sub-room", "sub-room", "sub-room", "sub-room", NA, "main-room",
"main-room", "main-room", "main-room", NA, "sub-room", "sub-room", "main-room", "sub-room",
NA, NA, "main-room", "sub-room"
),
st_con_tr = c(
"direct", "terrace", "direct", "direct", "direct", "direct", NA, "direct", "terrace", "direct",
"terrace", NA, "terrace", "direct", "terrace", "terrace", NA, NA, "direct", "terrace"
),
st_th = c(NA, "hira", NA, NA, NA, NA, NA, "tsuma", "tsuma", "tsuma", "tsuma", NA, NA, NA, "hira", NA, NA, NA, "hira", NA),
st_adsb = c("add", "sub", "sub", "add", "add", "sub", NA, "add", "add", "sub", "add", NA, "add", "add", "add", "add", NA, NA, "add", "add"),
st_sub_main_th = c("tsuma", NA, "hira", "hira", "hira", "other", NA, NA, NA, NA, NA, NA, "hira", "hira", NA, "tsuma", NA, NA, NA, "hira"),
st_sub2_main_th = c(
NA, "hira", "tsuma", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "hira", NA, NA, NA, NA
),
roo_main = c(3, 1, 2, 2, 1, 7, 1, 2, 4, 4, 4, 7, 2, 2, 1, 2, 7, 2, 1, 4)
)
node1 and node2 in community contains some of columns in df at the left of the equal sign and choices at the right of the equal sign. I want to join cluster information to df based on those conditions, but it seems quite complicated for me.
I suppose the basic process may be like this:
Convert wide to long format Bind
node1andnode2into one columnruleConvert (split)
ruleinto two columns (elementandchoice) For instance, st_sub_main_th=hira will be split intost_sub_main_thandhira.Compare two data frames and add
clustertodfCheck allelementandchoicebyclusterand add cluster number todfif it meet all conditions by cluster. Now I cannot imagine how to implement it. For instance,cluster=1contains following 6 conditions.element choice st_sub_main_th hira
st_con_tr terrace roo_main 2
st_con_tr terrace st_con_rt sub-room st_con_tr terrace
I need to add 1 to new cluster column to corresponded rows in df.
The 1st and 2nd operation will be done by this:
community <- community %>% pivot_longer(-cluster, names_to="node", values_to="rule") %>%
select(-node) %>% separate_wider_delim(rule, "=", names=c("element","choice")) %>% data.frame()
Now I got a table which contains conditions for joining.
> community
cluster st_sub_main_th st_con_tr roo_main st_con_rt st_adsb st_th
1 1 hira terrace 2 sub-room <NA> <NA>
2 2 <NA> direct 1 sub-room add <NA>
3 3 tsuma terrace <NA> sub-room add <NA>
4 4 <NA> <NA> 1 main-room <NA> hira
5 5 <NA> terrace <NA> main-room add hira
6 6 <NA> terrace 4 main-room add tsuma
7 7 hira direct 2 sub-room add <NA>
And I tried superkey approach based on the comment as well as the previous post.
df %>% select(-building) %>% mutate(cluster = {
match(
do.call(paste, .),
com %>% mutate(across(-cluster, ~ if_else(is.na(.x), "", .x))) %>%
{do.call(paste, select(., -cluster))}
)
}) %>% mutate(cluster=as.factor(cluster)) %>%
cbind(df_com %>% select(building)) %>% select(building, everything())
I also tried more simple approach.
## Extract columns in data frame
col <- comunity %>% select(-cluster) %>% colnames()
## Join
df %>% left_join(com, by=setNames(col, col), na_matches="never")
The first approach returned NA for all records and the latter approach found some matches but did not work as expected.
One issue is that community data frame contains NA in some rows, but it does NOT mean that the corresponded element should be NA in df. This NA indicates that this column should be omitted from comparison process.
I appreciate your suggestions.
EDIT
I suppose I could solve the issue thanks to this post.
community %>%
## Split into single row data frame
split(seq(nrow(.))) %>%
## Select columns without NA and inner join
map_dfr(~ select_if(.x, ~ !any(is.na(.))) %>%
inner_join(df,.)) %>%
## Select columns
select(building, cluster) %>%
## Right join with original data frame
right_join(df, by="building")
If I'm understanding correctly, you want to use
communitydata.frame to include the specific rules to match on. In other words, if yourdfhas extra information not found incommunityfor a given cluster, then it should still match and join.If that is true, you could try something a bit different. For the
communitydata, remove duplicates and include an additional column to indicate the required number of matches for that cluster.Also, for your
dfdata, remove missing values, and include arow_number()column to track rows and reassemble your wide data in the end.Then you can join these two data sources by
elementandchoice. Then, you canfilterand keep results where for a givenclusterand row there are enough matches found across the required rules. The output will include the matched choices found as well.Please let me know if this gives the desired results.
Output