Creating hierarchy tree from dictionary of pages' contents

2.3k views Asked by At

The following key:value pairs are 'page' and 'page contents'.

{
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

For any given 'item' how could I find the path(s) to said item? With my very limited knowledge of data structures in most cases, I'm assuming this would be a hierarchy tree. Please correct me if I'm wrong!

UPDATE: My apologies, I should have been more clear about the data and my expected outcome.

Assuming 'page-a' is an index, each 'page' is literally a page appearing on a website, where as each 'item' is something like a product page that would appear on Amazon, Newegg, etc.

Thus, my expected output for 'item-d' would be a path (or paths) to that item. For example (delimiter is arbitrary, for illustration here): item-d has the following paths:

page-a > page-b > page-e > item-d
page-a > page-c > item-d

UPDATE2: Updated my original dict to provide more accurate and real data. '.html' added for clarification.

3

There are 3 answers

7
Alex Martelli On BEST ANSWER

Here's a simple approach -- it's O(N squared), so, not all that highly scalable, but will serve you well for a reasonable book size (if you have, say, millions of pages, you need to be thinking about a very different and less simple approach;-).

First, make a more usable dict, mapping page to set of contents: e.g., if the original dict is d, make another dict mud as:

mud = dict((p, set(d[p]['contents'].split())) for p in d)

Then, make the dict mapping each page to its parent pages:

parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)

Here, I'm using lists of parent pages (sets would be fine too), but that's OK for pages with 0 or 1 parents as in your example, too -- you'll just be using an empty list to mean "no parent", else a list with the parent as the one and only item. This should be an acyclic directed graph (if you're in doubt, you can check, of course, but I'm skipping that check).

Now, given a page, finding the paths up its parent(s) to a parentless-parent ("root page") simply require "walking" the parent dict. E.g., in the 0/1 parent case:

path = [page]
while parent[path[-1]]:
  path.append(parent[path[-1]][0])

If you can clarify your specs better (ranges of number of pages per book, number of parents per page, and so on), this code can no doubt be refined, but as a start I hope it can help.

Edit: as the OP clarified that cases with > 1 parent (and so, multiple paths) are indeed of interest, let me show how do deal with that:

partial_paths = [ [page] ]
while partial_paths:
  path = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      partial_paths.append(path + [p])
  else:
    # we've reached a root (parentless node)
    print(path)

Of course, instead of printing, you can yield each path when it reaches a root (making the function whose body this is into a generator), or otherwise treat it in whatever way you require.

Edit again: a commenter is worried about cycles in the graph. If that worry's warranted, it's not hard to keep track of nodes already seen in a path and detect and warn about any cycles. Fastest is to keep a set alongside each list representing a partial path (we need the list for ordering, but checking for membership is O(1) in sets vs O(N) in lists):

partial_paths = [ ([page], set([page])) ]
while partial_paths:
  path, pset = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      if p in pset:
        print('Cycle: %s (%s)' % (path, p))
        continue
      partial_paths.append((path + [p], pset.union([p])))
  else:
    # we've reached a root (parentless node)
    print('Path: %s' % (path,))

It's probably worthwhile, for clarity, packing the list and set representing a partial path into a small utility class Path with suitable methods.

3
extraneon On

EDIT With the question explained a bit better I think the following might be what you need, or could at least provide something of a starting point.

data = {
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':\
                    'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

def findSingleItemInData(item):
    return map( lambda x: (item, x), \
                [key for key in data if data[key]['contents'].find(item) <> -1])

def trace(text):
    searchResult = findSingleItemInData(text)
    if not searchResult:
        return text

    retval = [] 
    for item in searchResult:
        retval.append([text, trace(item[-1])]) 

    return retval

print trace('product-d.html')

OLD

I don't really know what you expect to see, but maybe something like this will work.

data = {
   'page-a':{'contents':'page-b page-c'},
   'page-b':{'contents':'page-d page-e'},
   'page-c':{'contents':'item-a item-b item-c item-d'},
   'page-d':{'contents':'item-a item-c'},
   'page-e':{'contents':'item-b item-d'}
}

itemToFind = 'item-c'

for key in data:
  for index, item in enumerate(data[key]['contents'].split()):
    if item == itemToFind:
      print key, 'at position', index

It would be easier, and I think more correct, if you'd use a slightly different data structure:

 data = {
   'page-a':{'contents':['page-b', 'page-c']},
   'page-b':{'contents':['page-d', 'page-e']},
   'page-c':{'contents':['item-a', 'item-b item-c item-d']},
   'page-d':{'contents':['item-a', 'item-c']},
   'page-e':{'contents':['item-b', 'item-d']}
 }

Then you wouldn't need to split.

Given that last case, it can even be expressed a bit shorter:

for key in data:
    print [ (key, index, value) for index,value in \
             enumerate(data[key]['contents']) if value == 'item-c' ]

And even shorter, with the empty lists removed:

print filter(None, [[ (key, index, value) for index,value in \ 
       enumerate(data[key]['contents']) if value == 'item-c' ] for key in data])

That should be a single line, but I used \ as line break indicator so it can be read without scrollbars.

0
jfs On

Here's an illustration for your question. It is easier to reason about graphs when you have a picture.

First, abbreviate the data:

#!/usr/bin/perl -pe
s/section-([a-e])\.html/uc$1/eg; s/product-([a-e])\.html/$1/g

Result:

# graph as adj list
DATA = {
  'A':{'contents':'B C D'},
  'B':{'contents':'D E'},
  'C':{'contents':'a b c d'},
  'D':{'contents':'a c'},
  'E':{'contents':'b d'},
  'a':{'contents':''},
  'b':{'contents':''},
  'c':{'contents':''},
  'd':{'contents':''}
}

Convert to graphviz's format:

with open('data.dot', 'w') as f:
    print >> f, 'digraph {'
    for node, v in data.iteritems():
        for child in v['contents'].split():
            print >> f, '%s -> %s;' % (node, child),
        if v['contents']: # don't print empty lines
            print >> f
    print >> f, '}'

Result:

digraph {
A -> C; A -> B; A -> D;
C -> a; C -> b; C -> c; C -> d;
B -> E; B -> D;
E -> b; E -> d;
D -> a; D -> c;
}

Plot the graph:

$ dot -Tpng -O data.dot

data.dot