find most visited node in a graph


There are N nodes in a graph, connected by exactly N-1 edges. There is exactly one shortest path from any node to any other node. The nodes are numbered from 1 to N. We are given Q queries, each of which specifies a source node and a destination node. Find the most visited node after traveling those Q paths. For example, say Q=3 and the 3 queries are:

1 5

2 4

3 1

So we travel from node 1 to node 5, then from node 2 to node 4, then from node 3 to node 1. Finally, we find the most visited node after the Q queries. Finding every path and incrementing the count of every visited node is a naive approach. The interviewer asked me to optimize it.
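For reference, the naive approach mentioned above might look like the following minimal sketch. The function name `most_visited_naive`, the edge-list input format, and the tie-breaking rule (lowest-numbered node wins) are assumptions made for illustration; the problem statement specifies none of them.

```python
from collections import deque

def most_visited_naive(n, edges, queries):
    """Walk every query path explicitly and count visits: O(N) per query."""
    # Adjacency lists; index 0 is unused because nodes are numbered from 1.
    adj = [[] for _ in range(n + 1)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    visits = [0] * (n + 1)
    for src, dst in queries:
        # BFS from src, recording each node's predecessor on the way.
        prev = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                break
            for nxt in adj[node]:
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        # Walk back from dst to src, counting each node on the path once.
        node = dst
        while node is not None:
            visits[node] += 1
            node = prev[node]

    # Assumption: ties go to the lowest-numbered node.
    best = max(range(1, n + 1), key=lambda v: visits[v])
    return best, visits
```

On a hypothetical tree with edges 1-2, 1-3, 2-4, 2-5 and the three example queries, nodes 1 and 2 are each visited twice, so this sketch returns node 1. The total cost is O(N·Q), which is what the optimization is meant to avoid.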

There is 1 answer

ruakh On

Optimization often involves tradeoffs; in some cases one algorithm is unambiguously better than another, but in other cases one algorithm is better in one respect (e.g. time) and a different algorithm is better in a different respect (e.g. memory usage).

In your case, I'm guessing that your interviewer was looking for an approach that optimizes for the amount of processing that has to be done after you start receiving queries, even if this means you have to do more preprocessing on the graph. My reason for saying this is the term "query"; it's quite common to optimize a data source for "online" querying. (Of course, (s)he probably didn't expect you to decide on your own that this tradeoff was OK; rather, (s)he was likely hoping for a conversation about the different sorts of tradeoffs.)

So, with that in mind . . .

  • I see that you've already tagged your question with [tree] and [least-common-ancestor], so you've presumably already made the biggest observations, namely:
    • The graph is a tree. We can arbitrarily select a "root", such that every other node has a "parent", a nonzero "depth", one or more "ancestors", etc.
    • Once we've done that, the shortest path from node a to node b consists of node a, node b, all ancestors of a that aren't ancestors of b, all ancestors of b that aren't ancestors of a, and their "least common ancestor". (This remains true if a is an ancestor of b or vice versa: if a is an ancestor of b, then it's the least common ancestor of a and b, and vice versa. It even remains true if a and b are the same.)
  • So, we can do the following preprocessing:
    • Represent the graph as a mapping from each node to a list of its neighbors. (Since the nodes are numbered from 1 to N, this mapping is an array of N lists.)
    • Choose a root node.
    • Calculate and store each node's "parent" and "depth". (We can do this in O(N) time using depth-first search or breadth-first search.)
    • For each pair of nodes, calculate and store their "least common ancestor". (We can do this in total time O(N^2) using the results of the previous step and memoization: the least common ancestor of two distinct nodes is the same as that of the shallower node and the deeper node's parent, so each pair costs O(1) once the pairs nearer the root are known.)
    • Initialize a mapping from each node to the number of times that it's the endpoint of a path, and a mapping from each node to the number of times that it's the least common ancestor of the endpoints of a path. (Note: if a given path is from a single node to itself, then we will count that as twice that it's the endpoint of a path, as well as once that it's the least common ancestor of the endpoints.)
  • For each query, update the two mappings. We can do this in O(1) time per query, for a total of O(Q) time.
  • Finally:
    • Do a post-order traversal of the graph, computing for each node the number of paths that visited it. The logic for this is as follows: the total number of paths that visited node a is equal to the sum of the numbers of paths that visited each of its children, minus the sum of the numbers of times that each of its children was the least common ancestor of a path's endpoints (those paths never climb as far as a), plus the number of times that a itself was an endpoint, minus the number of times that a itself was the least common ancestor of a path's endpoints (to cancel out double-counting).
    • Return the node for which the previous step returned the greatest number. If multiple nodes are tied for greatest, then . . . I dunno, the problem statement was vague about this, you'll need to ask for requirements.

Overall, this requires O(N^2) preprocessing, O(1) real-time processing per query (O(Q) in total), and O(N) postprocessing.
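The preprocessing, per-query updates, and post-order pass described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `most_visited`, the edge-list input, the choice of node 1 as root, and the tie-breaking rule (lowest-numbered node wins) are all mine, and the all-pairs table is filled iteratively in a parents-first order rather than with recursive memoization.

```python
def most_visited(n, edges, queries, root=1):
    """Return (best_node, visits) for a tree with nodes numbered 1..n."""
    # Adjacency lists; index 0 is unused because nodes are numbered from 1.
    adj = [[] for _ in range(n + 1)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from the root: parent, depth, and a parents-first order.
    parent = [0] * (n + 1)
    depth = [0] * (n + 1)
    seen = [False] * (n + 1)
    seen[root] = True
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for nxt in adj[node]:
            if not seen[nxt]:
                seen[nxt] = True
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                stack.append(nxt)

    # All-pairs LCA table in O(N^2). Because parents precede children in
    # `order`, the entry each pair depends on is already filled in.
    lca = [[0] * (n + 1) for _ in range(n + 1)]
    for a in order:
        for b in order:
            if a == b:
                lca[a][b] = a
            elif depth[a] >= depth[b]:
                lca[a][b] = lca[parent[a]][b]
            else:
                lca[a][b] = lca[a][parent[b]]

    # O(1) work per query: update the two counters.
    endpoint = [0] * (n + 1)
    lca_count = [0] * (n + 1)
    for a, b in queries:
        endpoint[a] += 1
        endpoint[b] += 1  # a path from a node to itself counts it twice
        lca_count[lca[a][b]] += 1

    # Children-first pass: reversed `order` visits descendants before ancestors.
    visits = [0] * (n + 1)
    for node in reversed(order):
        # visits[node] already holds the contributions of its children.
        visits[node] += endpoint[node] - lca_count[node]
        if node != root:
            # Only paths whose LCA is not `node` continue up to its parent.
            visits[parent[node]] += visits[node] - lca_count[node]

    # Assumption: ties go to the lowest-numbered node.
    best = max(range(1, n + 1), key=lambda v: visits[v])
    return best, visits
```

On a hypothetical tree with edges 1-2, 1-3, 2-4, 2-5 and the three example queries, this yields two visits each for nodes 1 and 2 and one visit each for nodes 3, 4, and 5. The table costs O(N^2) memory as well as time, which is part of the tradeoff discussed earlier.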

If N is quite large, and we expect it to be possible that only a small subset of nodes is visited even once, then we can speed up the postprocessing by ignoring unvisited parts of the tree. This involves maintaining a set of nodes that were endpoints of paths, and then doing the postprocessing in "bottom-up" fashion, starting at the deepest such nodes, and moving "parentward" from a given node only if the number of paths that visited that node is greater than the number of times it was a least common ancestor (that is, only if at least one path continues upward). If we denote the number of distinct endpoints by P and the number of distinct visited nodes by M, then this can be done in something like O(P log P + M).
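One way this bottom-up pass might be written is sketched below. It assumes that `parent` and `depth` are the per-node arrays from preprocessing and that `endpoint` and `lca_count` are dictionaries holding entries only for nodes touched by some query; the function name `sparse_visit_counts` and these input shapes are hypothetical. It sorts the P endpoints by depth once and merges them with a FIFO queue of parents reached by climbing, which stays deepest-first on its own.

```python
from collections import deque

def sparse_visit_counts(parent, depth, endpoint, lca_count):
    """Return {node: visit count} for visited nodes only."""
    visits = dict(endpoint)
    # O(P log P): distinct endpoints, deepest first.
    pending = deque(sorted(endpoint, key=lambda v: -depth[v]))
    # Parents reached by climbing. Nodes are processed deepest-first and each
    # climb goes up exactly one level, so this queue is deepest-first too.
    climbed = deque()
    scheduled = set(endpoint)

    while pending or climbed:
        # Take whichever queue has the deeper node at its front.
        if not climbed or (pending and depth[pending[0]] >= depth[climbed[0]]):
            node = pending.popleft()
        else:
            node = climbed.popleft()

        as_lca = lca_count.get(node, 0)
        visits[node] -= as_lca            # cancel double-counting at this node
        upward = visits[node] - as_lca    # paths that continue to the parent
        if upward > 0:
            p = parent[node]
            visits[p] = visits.get(p, 0) + upward
            if p not in scheduled:
                scheduled.add(p)
                climbed.append(p)
    return visits
```

For example, on a hypothetical chain 1-2-3-4-5-6 rooted at node 1 with the single query (5, 6), only nodes 5 and 6 are ever examined; nodes 1 through 4 do not appear in the result at all.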