python transitivity between dictionaries

1k views Asked by At

I have a list like the following in python (the real one is huge and I cannot do this only by looking at it):

original1=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
          ['[email protected]', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
          ['[email protected]', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
          ['[email protected]', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
          ['[email protected]', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
          ['[email protected]', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]

I split it between data and names to work with data:

datos=original1[-(len(original1)-1):len(original1)]

I need to do a dictionary that has all the duplicates together, considering email and tel, but I need to apply transitivity: since line 0= line 2 if we consider email, but also line 1 if we consider tel, and line 1= line 3 if we consider email again, I need to get that all candidates in this case are 0,1,2 and 3, while 4 is alone.

I created the following code:

from collections import defaultdict
email_to_indices = defaultdict(list) 
phone_to_indices = defaultdict(list)

for idx, row in enumerate(datos): 
    email = row[0].lower() 
    phone = row[1]
    email_to_indices[email].append(idx) 
    phone_to_indices[phone].append(idx)

So now I need to apply transitivity rules, to get together 0 to 3 and alone 4.

If you print

print 'email', email_to_indices
print 'phone', phone_to_indices

You get:

email defaultdict(, {'[email protected]': [0, 2],'[email protected]': [1, 3], '[email protected]': [4]})

phone defaultdict(, {'1': [0, 1], '3': [3], '2': [2], '4': [4]})

Don't know how to get the union of those considering the transitive property. I need to get something like:

first_group: [0, 1, 2 , 3]
second_group: [4]

Thanks!

2

There are 2 answers

0
shaktimaan On

This is another approach:

When you are building the email_to_indices dictionary, you can store the phone number of that row as the values and then have the phone_to_indices have the index of the row. That way we are creating a email_to_indices to phone_to_indices to index of the row map.

With that modification and basic set operations I am able to get what you want exactly:

from collections import defaultdict

email_to_indices = defaultdict(list)
phone_to_indices = defaultdict(list)
combined = defaultdict(set)

original=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
          ['[email protected]', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
          ['[email protected]', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
          ['[email protected]', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
          ['[email protected]', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
          ['[email protected]', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]


for idx, row in enumerate(original[1:], start=1):
    email = row[0].lower()
    phone = row[1]
    email_to_indices[email].append(phone) # Here is what I changed
    phone_to_indices[phone].append(idx)

random_key = 0
for idx, row in enumerate(original[1:], start=1):
    grouped_rows = []
    if row[0].lower() in email_to_indices:
        for phone_no in email_to_indices[row[0].lower()]:
            grouped_rows.extend(phone_to_indices[phone_no])

    if len(combined[random_key]) > 0 and len(set(grouped_rows).intersection(combined[random_key])) > 0:
        combined[random_key].update(set(grouped_rows))
    elif len(combined[random_key]) > 0:
        random_key += 1
        combined[random_key].update(set(grouped_rows))
    else:
        combined[random_key].update(set(grouped_rows))

print combined

This gives:

defaultdict(<type 'set'>, {0: set([1, 2, 3, 4]), 1: set([5])})
2
Anton Savin On

Here you have a graph, or Bipartite graph to be more precise. The nodes are of two types: emails and phones. Two nodes are connected if there exists a record with that email and phone. Or we even can say that the record itself is the edge which connects two nodes.

The task is to find Connected components of this graph. By following the links you can find algorithms which can do it in linear time.

Of course some quick and dirty solutions can also be invented, and even may be deemed appropriate if your dataset is small enough.

You can find some Python implementations here: Python connected components

UPDATE: Here is an example of how you can construct the graph:

graph = {};
EMAIL = "email";
PHONE = "phone";

for rec in datos:
    graph.setdefault((EMAIL, rec[0]), set()).add((PHONE, rec[1]));
    graph.setdefault((PHONE, rec[1]), set()).add((EMAIL, rec[0]));

print "\n".join("%s: %s" % (str(node), str(linkedNodes)) for (node, linkedNodes) in graph.iteritems());

So every node has a type (EMAIL or PHONE, they can actually be just integers, for example 0 and 1, I made them strings only for nice printing) and a value. Graph is a dictionary with nodes as keys and sets of connected nodes as values.