I have a data frame with 3 columns, namely node1
, node2
and cluster
which was extracted from getLinkCommunities
in linkcomm
package in R
. Here is a reproductive script.
community <- data.frame(
node1 = c(
"st_sub_main_th=hira", "roo_main=2", "st_con_rt=sub-room", "roo_main=1", "roo_main=1",
"roo_main=1", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma",
"st_th=hira", "st_th=hira", "roo_main=1", "st_th=hira", "st_th=hira", "st_con_rt=main-room",
"st_con_rt=main-room", "st_con_tr=terrace", "roo_main=4", "roo_main=4", "roo_main=4",
"roo_main=4", "st_th=tsuma", "st_th=tsuma", "st_sub_main_th=hira", "st_sub_main_th=hira",
"st_sub_main_th=hira", "st_sub_main_th=hira", "roo_main=2", "roo_main=2", "roo_main=2",
"st_con_tr=direct", "st_con_tr=direct"),
node2 = c(
"st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=direct",
"st_con_rt=sub-room", "st_adsb=add", "st_con_rt=sub-room", "st_con_tr=terrace",
"st_adsb=add", "roo_main=1", "st_con_rt=main-room", "st_con_rt=main-room",
"st_con_tr=terrace", "st_adsb=add", "st_con_tr=terrace", "st_adsb=add", "st_adsb=add",
"st_th=tsuma", "st_con_rt=main-room", "st_con_tr=terrace", "st_adsb=add",
"st_con_rt=main-room", "st_adsb=add", "roo_main=2", "st_con_tr=direct",
"st_con_rt=sub-room", "st_adsb=add", "st_con_tr=direct", "st_con_rt=sub-room",
"st_adsb=add", "st_con_rt=sub-room", "st_adsb=add"
),
cluster = c(
1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7
)
)
On the other hand, I have an original data frame which was used for association analysis.
df <- data.frame(
isstilt = c(NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA),
st_con_rt = c(
"sub-room", "main-room", "sub-room", "sub-room", "sub-room", "sub-room", NA, "main-room",
"main-room", "main-room", "main-room", NA, "sub-room", "sub-room", "main-room", "sub-room",
NA, NA, "main-room", "sub-room"
),
st_con_tr = c(
"direct", "terrace", "direct", "direct", "direct", "direct", NA, "direct", "terrace", "direct",
"terrace", NA, "terrace", "direct", "terrace", "terrace", NA, NA, "direct", "terrace"
),
st_th = c(NA, "hira", NA, NA, NA, NA, NA, "tsuma", "tsuma", "tsuma", "tsuma", NA, NA, NA, "hira", NA, NA, NA, "hira", NA),
st_adsb = c("add", "sub", "sub", "add", "add", "sub", NA, "add", "add", "sub", "add", NA, "add", "add", "add", "add", NA, NA, "add", "add"),
st_sub_main_th = c("tsuma", NA, "hira", "hira", "hira", "other", NA, NA, NA, NA, NA, NA, "hira", "hira", NA, "tsuma", NA, NA, NA, "hira"),
st_sub2_main_th = c(
NA, "hira", "tsuma", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "hira", NA, NA, NA, NA
),
roo_main = c(3, 1, 2, 2, 1, 7, 1, 2, 4, 4, 4, 7, 2, 2, 1, 2, 7, 2, 1, 4)
)
node1
and node2
in community
contains some of columns in df
at the left of the equal sign and choices at the right of the equal sign. I want to join cluster
information to df
based on those conditions, but it seems quite complicated for me.
I suppose the basic process may be like this:
Convert wide to long format Bind
node1
andnode2
into one columnrule
Convert (split)
rule
into two columns (element
andchoice
) For instance, st_sub_main_th=hira will be split intost_sub_main_th
andhira
.Compare two data frames and add
cluster
todf
Check allelement
andchoice
bycluster
and add cluster number todf
if it meet all conditions by cluster. Now I cannot imagine how to implement it. For instance,cluster=1
contains following 6 conditions.element choice st_sub_main_th hira
st_con_tr terrace roo_main 2
st_con_tr terrace st_con_rt sub-room st_con_tr terrace
I need to add 1
to new cluster
column to corresponded rows in df
.
The 1st and 2nd operation will be done by this:
community <- community %>% pivot_longer(-cluster, names_to="node", values_to="rule") %>%
select(-node) %>% separate_wider_delim(rule, "=", names=c("element","choice")) %>% data.frame()
Now I got a table which contains conditions for joining.
> community
cluster st_sub_main_th st_con_tr roo_main st_con_rt st_adsb st_th
1 1 hira terrace 2 sub-room <NA> <NA>
2 2 <NA> direct 1 sub-room add <NA>
3 3 tsuma terrace <NA> sub-room add <NA>
4 4 <NA> <NA> 1 main-room <NA> hira
5 5 <NA> terrace <NA> main-room add hira
6 6 <NA> terrace 4 main-room add tsuma
7 7 hira direct 2 sub-room add <NA>
And I tried superkey
approach based on the comment as well as the previous post.
df %>% select(-building) %>% mutate(cluster = {
match(
do.call(paste, .),
com %>% mutate(across(-cluster, ~ if_else(is.na(.x), "", .x))) %>%
{do.call(paste, select(., -cluster))}
)
}) %>% mutate(cluster=as.factor(cluster)) %>%
cbind(df_com %>% select(building)) %>% select(building, everything())
I also tried more simple approach.
## Extract columns in data frame
col <- comunity %>% select(-cluster) %>% colnames()
## Join
df %>% left_join(com, by=setNames(col, col), na_matches="never")
The first approach returned NA
for all records and the latter approach found some matches but did not work as expected.
One issue is that community
data frame contains NA
in some rows, but it does NOT mean that the corresponded element
should be NA
in df
. This NA
indicates that this column should be omitted from comparison process.
I appreciate your suggestions.
EDIT
I suppose I could solve the issue thanks to this post.
community %>%
## Split into single row data frame
split(seq(nrow(.))) %>%
## Select columns without NA and inner join
map_dfr(~ select_if(.x, ~ !any(is.na(.))) %>%
inner_join(df,.)) %>%
## Select columns
select(building, cluster) %>%
## Right join with original data frame
right_join(df, by="building")
If I'm understanding correctly, you want to use
community
data.frame to include the specific rules to match on. In other words, if yourdf
has extra information not found incommunity
for a given cluster, then it should still match and join.If that is true, you could try something a bit different. For the
community
data, remove duplicates and include an additional column to indicate the required number of matches for that cluster.Also, for your
df
data, remove missing values, and include arow_number()
column to track rows and reassemble your wide data in the end.Then you can join these two data sources by
element
andchoice
. Then, you canfilter
and keep results where for a givencluster
and row there are enough matches found across the required rules. The output will include the matched choices found as well.Please let me know if this gives the desired results.
Output