Identify records which fulfill complex join conditions with NA value in R

83 views Asked by At

I have a data frame with 3 columns, namely node1, node2 and cluster which was extracted from getLinkCommunities in linkcomm package in R. Here is a reproductive script.

community <- data.frame(
  node1 = c(
    "st_sub_main_th=hira", "roo_main=2", "st_con_rt=sub-room", "roo_main=1", "roo_main=1",
    "roo_main=1", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma", "st_sub_main_th=tsuma",
    "st_th=hira", "st_th=hira", "roo_main=1", "st_th=hira", "st_th=hira", "st_con_rt=main-room",
    "st_con_rt=main-room", "st_con_tr=terrace", "roo_main=4", "roo_main=4", "roo_main=4",
    "roo_main=4", "st_th=tsuma", "st_th=tsuma", "st_sub_main_th=hira", "st_sub_main_th=hira",
    "st_sub_main_th=hira", "st_sub_main_th=hira", "roo_main=2", "roo_main=2", "roo_main=2",
    "st_con_tr=direct", "st_con_tr=direct"),
  node2 = c(
    "st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=terrace", "st_con_tr=direct",
    "st_con_rt=sub-room", "st_adsb=add", "st_con_rt=sub-room", "st_con_tr=terrace",
    "st_adsb=add", "roo_main=1", "st_con_rt=main-room", "st_con_rt=main-room",
    "st_con_tr=terrace", "st_adsb=add", "st_con_tr=terrace", "st_adsb=add", "st_adsb=add",
    "st_th=tsuma", "st_con_rt=main-room", "st_con_tr=terrace", "st_adsb=add",
    "st_con_rt=main-room", "st_adsb=add", "roo_main=2", "st_con_tr=direct",
    "st_con_rt=sub-room", "st_adsb=add", "st_con_tr=direct", "st_con_rt=sub-room",
    "st_adsb=add", "st_con_rt=sub-room", "st_adsb=add"
  ),
  cluster = c(
    1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7
  )
)

On the other hand, I have an original data frame which was used for association analysis.

df <- data.frame(
  isstilt = c(NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA),
  st_con_rt = c(
    "sub-room", "main-room", "sub-room", "sub-room", "sub-room", "sub-room", NA, "main-room",
    "main-room", "main-room", "main-room", NA, "sub-room", "sub-room", "main-room", "sub-room",
    NA, NA, "main-room", "sub-room"
  ),
  st_con_tr = c(
    "direct", "terrace", "direct", "direct", "direct", "direct", NA, "direct", "terrace", "direct",
    "terrace", NA, "terrace", "direct", "terrace", "terrace", NA, NA, "direct", "terrace"
  ),
  st_th = c(NA, "hira", NA, NA, NA, NA, NA, "tsuma", "tsuma", "tsuma", "tsuma", NA, NA, NA, "hira", NA, NA, NA, "hira", NA),
  st_adsb = c("add", "sub", "sub", "add", "add", "sub", NA, "add", "add", "sub", "add", NA, "add", "add", "add", "add", NA, NA, "add", "add"),
  st_sub_main_th = c("tsuma", NA, "hira", "hira", "hira", "other", NA, NA, NA, NA, NA, NA, "hira", "hira", NA, "tsuma", NA, NA, NA, "hira"),
  st_sub2_main_th = c(
    NA, "hira", "tsuma", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "hira", NA, NA, NA, NA
  ),
  roo_main = c(3, 1, 2, 2, 1, 7, 1, 2, 4, 4, 4, 7, 2, 2, 1, 2, 7, 2, 1, 4)
)

node1 and node2 in community contains some of columns in df at the left of the equal sign and choices at the right of the equal sign. I want to join cluster information to df based on those conditions, but it seems quite complicated for me.

I suppose the basic process may be like this:

  1. Convert wide to long format Bind node1 and node2 into one column rule

  2. Convert (split) rule into two columns (element and choice) For instance, st_sub_main_th=hira will be split into st_sub_main_th and hira.

  3. Compare two data frames and add cluster to df Check all element and choice by cluster and add cluster number to df if it meet all conditions by cluster. Now I cannot imagine how to implement it. For instance, cluster=1 contains following 6 conditions.

    element choice st_sub_main_th hira
    st_con_tr terrace roo_main 2
    st_con_tr terrace st_con_rt sub-room st_con_tr terrace

I need to add 1 to new cluster column to corresponded rows in df.

The 1st and 2nd operation will be done by this:

community <- community %>% pivot_longer(-cluster, names_to="node", values_to="rule") %>%
    select(-node) %>% separate_wider_delim(rule, "=", names=c("element","choice")) %>% data.frame()

Now I got a table which contains conditions for joining.

> community
  cluster st_sub_main_th st_con_tr roo_main st_con_rt st_adsb st_th
1       1           hira   terrace        2  sub-room    <NA>  <NA>
2       2           <NA>    direct        1  sub-room     add  <NA>
3       3          tsuma   terrace     <NA>  sub-room     add  <NA>
4       4           <NA>      <NA>        1 main-room    <NA>  hira
5       5           <NA>   terrace     <NA> main-room     add  hira
6       6           <NA>   terrace        4 main-room     add tsuma
7       7           hira    direct        2  sub-room     add  <NA>

And I tried superkey approach based on the comment as well as the previous post.

df %>% select(-building) %>% mutate(cluster = {
    match(
      do.call(paste, .),
      com %>% mutate(across(-cluster, ~ if_else(is.na(.x), "", .x))) %>%
        {do.call(paste, select(., -cluster))}
      )
    }) %>% mutate(cluster=as.factor(cluster)) %>%
      cbind(df_com %>% select(building)) %>% select(building, everything())

I also tried more simple approach.

## Extract columns in data frame
    col <- comunity %>% select(-cluster) %>% colnames()
    
    ## Join
    df %>% left_join(com, by=setNames(col, col), na_matches="never")

The first approach returned NA for all records and the latter approach found some matches but did not work as expected. One issue is that community data frame contains NA in some rows, but it does NOT mean that the corresponded element should be NA in df. This NA indicates that this column should be omitted from comparison process.

I appreciate your suggestions.

EDIT

I suppose I could solve the issue thanks to this post.

community %>%
      ## Split into single row data frame
      split(seq(nrow(.))) %>%
      ## Select columns without NA and inner join
      map_dfr(~ select_if(.x, ~ !any(is.na(.))) %>%
              inner_join(df,.)) %>%
      ## Select columns
      select(building, cluster) %>%
      ## Right join with original data frame
      right_join(df, by="building")
1

There are 1 answers

0
Ben On

If I'm understanding correctly, you want to use community data.frame to include the specific rules to match on. In other words, if your df has extra information not found in community for a given cluster, then it should still match and join.

If that is true, you could try something a bit different. For the community data, remove duplicates and include an additional column to indicate the required number of matches for that cluster.

Also, for your df data, remove missing values, and include a row_number() column to track rows and reassemble your wide data in the end.

Then you can join these two data sources by element and choice. Then, you can filter and keep results where for a given cluster and row there are enough matches found across the required rules. The output will include the matched choices found as well.

Please let me know if this gives the desired results.

library(tidyverse)

com_key <- community %>%
  pivot_longer(-cluster, names_to = "node", values_to = "rule") %>%
  select(-node) %>%
  separate(rule, into = c("element", "choice"), sep = "=") %>%
  group_by(cluster) %>%
  distinct(element, .keep_all = TRUE) %>%
  mutate(match = n()) 

df_key <- df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn, names_to = "element", values_to = "choice", values_transform = as.character) %>%
  drop_na(choice)

left_join(df_key, com_key, by = c("element", "choice"), relationship = "many-to-many") %>%
  drop_na(cluster) %>%
  group_by(rn, cluster) %>%
  filter(n() == match) %>%
  pivot_wider(id_cols = c(rn, cluster), names_from = element, values_from = choice) %>%
  arrange(rn)

Output

      rn cluster st_con_rt st_th roo_main st_con_tr st_adsb st_sub_main_th
   <int>   <dbl> <chr>     <chr> <chr>    <chr>     <chr>   <chr>         
 1     2       4 main-room hira  1        NA        NA      NA            
 2     4       7 sub-room  NA    2        direct    add     hira          
 3     5       2 sub-room  NA    1        direct    add     NA            
 4     9       6 main-room tsuma 4        terrace   add     NA            
 5    11       6 main-room tsuma 4        terrace   add     NA            
 6    13       1 sub-room  NA    2        terrace   NA      hira          
 7    14       7 sub-room  NA    2        direct    add     hira          
 8    15       4 main-room hira  1        NA        NA      NA            
 9    15       5 main-room hira  NA       terrace   add     NA            
10    16       3 sub-room  NA    NA       terrace   add     tsuma         
11    19       4 main-room hira  1        NA        NA      NA