I have a list like the following in Python (the real one is huge, so I cannot do this just by looking at it):
original1=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
['[email protected]', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
['[email protected]', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
['[email protected]', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
['[email protected]', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
['[email protected]', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]
I split off the header row so I can work with the data rows only:
datos = original1[1:]
I need to build a dictionary that groups all the duplicates together, matching on email and tel, but I also need to apply transitivity: row 0 matches row 2 by email and row 1 by tel, and row 1 matches row 3 by email, so rows 0, 1, 2 and 3 should all end up in the same group, while row 4 is on its own.
I created the following code:
from collections import defaultdict
email_to_indices = defaultdict(list)
phone_to_indices = defaultdict(list)
for idx, row in enumerate(datos):
    email = row[0].lower()
    phone = row[1]
    email_to_indices[email].append(idx)
    phone_to_indices[phone].append(idx)
Now I need to apply transitivity so that rows 0 to 3 end up together and row 4 stays alone.
If you print:
print 'email', email_to_indices
print 'phone', phone_to_indices
You get:
email defaultdict(<type 'list'>, {'[email protected]': [0, 2], '[email protected]': [1, 3], '[email protected]': [4]})
phone defaultdict(<type 'list'>, {'1': [0, 1], '3': [3], '2': [2], '4': [4]})
I don't know how to get the union of these groups while taking the transitive property into account. I need to get something like:
first_group: [0, 1, 2 , 3]
second_group: [4]
Thanks!
This is another approach:
When you are building the email_to_indices dictionary, you can store the phone number of each row as the values, and then have phone_to_indices hold the index of the row. That way we are creating an email_to_indices to phone_to_indices to row-index map. With that modification and basic set operations I am able to get exactly what you want:
This gives:
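As a minimal sketch of the approach described above (not the answer's original code): each email maps to the set of phones seen with it, each phone maps to the set of row indices, and overlapping index sets are merged with set operations. The function name group_duplicates and the example.com addresses are hypothetical placeholders chosen for illustration; the phone values and row order follow the question's sample data.

```python
from collections import defaultdict

def group_duplicates(rows):
    """Group row indices that are linked, directly or transitively,
    by a shared email (column 0) or a shared phone (column 1)."""
    email_to_phones = defaultdict(set)   # email -> phones seen with it
    phone_to_indices = defaultdict(set)  # phone -> row indices
    for idx, row in enumerate(rows):
        email = row[0].lower()
        phone = row[1]
        email_to_phones[email].add(phone)
        phone_to_indices[phone].add(idx)

    groups = []  # pairwise disjoint sets of row indices
    for phones in email_to_phones.values():
        # Every row reachable through any phone used with this email.
        current = set()
        for phone in phones:
            current |= phone_to_indices[phone]
        # Absorb every existing group that overlaps the new one.
        remaining = []
        for group in groups:
            if group & current:
                current |= group
            else:
                remaining.append(group)
        remaining.append(current)
        groups = remaining
    return sorted(sorted(group) for group in groups)

# Placeholder emails (assumed); phones and row order as in the question.
datos = [
    ['a@example.com', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
    ['b@example.com', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
    ['a@example.com', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
    ['b@example.com', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
    ['c@example.com', '4', '2014-08-10 00:00:06', 'Myace c', 'animales'],
]

groups = group_duplicates(datos)
print(groups)  # [[0, 1, 2, 3], [4]]
```

Because the groups are kept disjoint, a single pass over the existing groups is enough for each email. For very large inputs a union-find (disjoint-set) structure would likely scale better than repeatedly rebuilding the group list.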