I have the following dataframe, which contains all the paths within a tree after going through all nodes. For each jump between nodes, a row is created where "dist" is the number of jumps so far, "node" is the current node, and "path" is the path so far.
dist | node | path
0 | 1 | [1]
1 | 2 | [1,2]
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
At the end I just want to have a dataframe containing the complete paths without the intermediate steps:
dist | node | path
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
I also tried storing the path column as a string ("1;2;3") and checking which rows are substrings of other rows, but I could not find a way to do that.
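To make the comparison concrete, here is a minimal plain-Python sketch of the prefix check I have in mind. It is not Spark code, and the helper names is_strict_prefix and complete_paths are made up for illustration. It compares the paths element-wise rather than as strings, since a string check would wrongly treat "1;2" as a substring of "1;21".

```python
# Rows from the example above, as (dist, node, path) tuples.
rows = [
    (0, 1, [1]),
    (1, 2, [1, 2]),
    (1, 5, [1, 5]),
    (2, 3, [1, 2, 3]),
    (2, 4, [1, 2, 4]),
]


def is_strict_prefix(short, long):
    """True if `short` is a proper prefix of `long`, compared element-wise."""
    return len(short) < len(long) and long[:len(short)] == short


def complete_paths(rows):
    """Keep only rows whose path is not a strict prefix of another row's path."""
    paths = [path for _, _, path in rows]
    return [
        row for row in rows
        if not any(is_strict_prefix(row[2], other) for other in paths)
    ]


for row in complete_paths(rows):
    print(row)
```

This compares every row with every other row, so it only illustrates the logic and would not scale to a large dataframe as written.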
I found my old code and created an adapted example for your problem. I used the Spark graph library GraphFrames for this. The path can be determined by a Pregel-like message aggregation loop.
Here is the code. First, import all modules
Then create a sample dataset
For visualisation run
With this message aggregation algorithm you find the paths you were looking for. If you set the flag show_steps to True, the results of each step are shown, which helps in understanding how it works. It then shows the correct results
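To illustrate the idea independently of Spark, here is a rough plain-Python sketch of such a message aggregation loop. It is not the GraphFrames code from this answer: the edge list, the function name find_paths, and the dictionary-based "messages" are assumptions made for illustration, and it assumes a tree (each node has at most one parent, no cycles). In every iteration each node on the frontier sends its path so far to its children, and a node without children reports its path as complete.

```python
from collections import defaultdict

# Hypothetical sample tree as directed (src, dst) edges; node 1 is the root.
edges = [(1, 2), (1, 5), (2, 3), (2, 4)]


def find_paths(edges, show_steps=False):
    """Propagate paths from the roots along the edges, one hop per iteration."""
    children = defaultdict(list)
    has_parent = set()
    for src, dst in edges:
        children[src].append(dst)
        has_parent.add(dst)

    # Roots are nodes without an incoming edge; each starts with its own path.
    nodes = {n for edge in edges for n in edge}
    frontier = {n: [n] for n in sorted(nodes - has_parent)}
    complete = []
    step = 0

    while frontier:
        if show_steps:
            print(f"step {step}: {frontier}")
        messages = {}
        for node, path in frontier.items():
            if not children[node]:
                # No outgoing edge: this node is a leaf, so its path is complete.
                complete.append(path)
            for child in children[node]:
                # "Send" the path so far to the child, which appends itself.
                messages[child] = path + [child]
        frontier = messages
        step += 1

    return complete


for path in find_paths(edges, show_steps=True):
    print(path)
```

The loop stops once no node receives a message any more, which on a tree happens after as many iterations as the tree is deep.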
To get your final dataframe, you can join the result back to the original dataframe or extract the first and last elements of the array into separate columns.
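As a small illustration of the second option, the sketch below derives the columns from the path alone in plain Python. The column name "start" and the function name to_rows are made up for this example. In PySpark, I would expect element_at(col, 1), element_at(col, -1) and size(col) from pyspark.sql.functions to do the equivalent on an array column, but I have not run that against this example.

```python
# Complete paths as produced by the path-finding step.
paths = [[1, 5], [1, 2, 3], [1, 2, 4]]


def to_rows(paths):
    """Derive start node, end node and distance from each path."""
    return [
        {
            "start": path[0],        # first element: where the path begins
            "node": path[-1],        # last element: the node the path ends at
            "dist": len(path) - 1,   # number of jumps along the path
            "path": path,
        }
        for path in paths
    ]


for row in to_rows(paths):
    print(row)
```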
I suppose you could write the same algorithm with the GraphFrames Pregel API.
P.S.: The algorithm in this form might cause problems if the graph has loops or backward-directed edges. I had another algorithm to first clean up loops and cycles.
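The clean-up algorithm itself is not shown here. As a stand-in, this is a minimal plain-Python sketch of a check you could run beforehand to see whether the edge list contains a cycle at all. The function name has_cycle is made up, and it uses an ordinary depth-first search, not anything from GraphFrames.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) edges has a cycle."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)  # every node starts as UNSEEN

    def visit(node):
        state[node] = IN_PROGRESS
        for child in children[node]:
            if state[child] == IN_PROGRESS:
                # Reached a node that is still on the current DFS path.
                return True
            if state[child] == UNSEEN and visit(child):
                return True
        state[node] = DONE
        return False

    nodes = {n for edge in edges for n in edge}
    return any(state[n] == UNSEEN and visit(n) for n in sorted(nodes))


print(has_cycle([(1, 2), (1, 5), (2, 3), (2, 4)]))
print(has_cycle([(1, 2), (2, 3), (3, 1)]))
```

The search is recursive, so for very deep graphs it would hit Python's recursion limit and an iterative version would be needed.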