How to detect and remove recursive structure in hierarchical data in R?

112 views Asked by At

here is an example data like mine. Think of it as a dataset of user activity on a website.

time from to
t0 A a
t1 a b
t2 a c
t3 a d
t4 b x1
t5 b x2
t6 c y1
t7 x1 a

I want to generate hierarchical data. But like last row in dataset, there is infinite loop.

Parent is a then child is b.

Parent is b then child is x1.

Parent is x1 then child is a again. So here there is an infinite loop. So I can't use my original data because of that.

So my question is, how can i remove rows like the end?

Thank you.

I've tried to write some loop functions. But i couldn't.

PS: Edit example data

1

There are 1 answers

1
Allan Cameron On BEST ANSWER

Your data frame can be thought of as representing the edgelist of a directed graph. The rows you describe as causing an 'infinite loop' are called the feedback arc set, and the operation of removing these rows leaves the maximal acyclic subgraph of your graph, which is what you are looking for.

Using the igraph package, you can explicitly turn your data frame into a graph, find the edge(s) that cause an infinite loop using feedback_arc_set, subtract these from your graph, then convert the resulting directed acyclic graph back into a data frame:

library(igraph)

g <- graph_from_data_frame(df[c(2:3, 1)])
g <- g - feedback_arc_set(g)
new_df <- igraph::as_data_frame(g)[c(3, 1:2)]

new_df
#>   time from to
#> 1   t0    A  a
#> 2   t1    a  b
#> 3   t2    a  c
#> 4   t3    a  d
#> 5   t4    b x1
#> 6   t5    b x2
#> 7   t6    c y1

Note that this has correctly identified time t7 as the row causing a feedback arc and removed it from your data frame.


Data in reproducible format

df <- structure(list(time = c("t0", "t1", "t2", "t3", "t4", "t5", "t6", 
"t7"), from = c("A", "a", "a", "a", "b", "b", "c", "x1"), to = c("a", 
"b", "c", "d", "x1", "x2", "y1", "a")), class = "data.frame", row.names = c(NA, 
-8L))