I'm trying to get my head around an issue with the theory of implementing PageRank with MapReduce.
I have the following simple scenario with three nodes: A B C.
The adjacency list is:
A { B, C }
B { A }
The PageRank for B for example is equal to:
(1-d)/N + d ( PR(A) / C(A) )
N = total number of pages in the graph
PR(A) = PageRank of incoming link A
C(A) = number of outgoing links from page A
I am fine with the overall scheme and how the mapper and reducer would work, but I cannot see how C(A) would be known to the reducer at the time of the calculation. When the reducer computes the PageRank of B by aggregating the incoming links to B, how does it know the number of outgoing links from each of those pages? Does this require a lookup in some external data source?
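We evaluate PR iteratively: PR(x) = Sum(PR(a) * weight(a), a in in_links), where weight(a) = 1 / C(a). No external lookup is needed: the mapper for page a receives a's out-link list, so it knows C(a) and can emit PR(a) / C(a) to each out-link, leaving the reducer to sum the contributions it receives.
If the mapper also passes each page's out-link list through to the reducer, the output has the same form as the input, so we can repeat the job until convergence.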
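As a rough illustration of one such iteration, here is a minimal in-memory sketch in Python rather than a real MapReduce job. The names `map_page`, `reduce_page`, `run_iteration` and the value of `DAMPING` are made up for this example, N is taken to be the total number of pages, and the rank of pages with no out-links (C here) is simply dropped, which a real implementation would need to handle.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed value for d; not given in the question


def map_page(page, rank, out_links):
    """Mapper: called once per page, so C(page) == len(out_links) is known here."""
    # Pass the link structure through so the output can feed the next iteration.
    yield page, ("links", out_links)
    # Split this page's rank evenly among the pages it links to.
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def reduce_page(page, values, num_pages):
    """Reducer: sums pre-divided contributions; it never needs C(a) itself."""
    out_links = []
    incoming = 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            incoming += payload
    rank = (1 - DAMPING) / num_pages + DAMPING * incoming
    return page, rank, out_links


def run_iteration(state):
    """Simulate map, shuffle and reduce. state maps page -> (rank, out_links)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for page, (rank, out_links) in state.items():
        for key, value in map_page(page, rank, out_links):
            grouped[key].append(value)
    new_state = {}
    for page, values in grouped.items():
        _, rank, links = reduce_page(page, values, len(state))
        new_state[page] = (rank, links)
    return new_state


if __name__ == "__main__":
    # The graph from the question: A -> B, C; B -> A; C has no out-links.
    state = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["A"]), "C": (1 / 3, [])}
    for _ in range(20):
        state = run_iteration(state)
    for page in sorted(state):
        print(page, state[page][0])
```

Because `run_iteration` returns the same page -> (rank, out_links) shape it takes in, it can be called in a loop until the ranks stop changing by more than some tolerance.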