I have a set of attributes A = {a1, a2, ..., an} and a set of clusters C = {c1, c2, ..., ck}. I also have a set of correspondences COR, which is a subset of A x C with |COR| << |A x C|. Here is a sample set of correspondences:
COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}
Now, I want to generate all the subsets of COR such that the pairs in each subset represent an injective function from set A to set C. Let's call each such subset a mapping. The valid mappings from the above set COR would be
m1 = {(a1, c1), (a3, c3), (a4, c4)}
and m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
m1 is interesting here because adding any of the remaining elements from COR to m1 would violate either the definition of a function or the condition of being injective. For instance, if we add the pair (a1, c2) to m1, m1 would no longer be a function, and if we add (a2, c1) to m1, it would cease to be an injective function.
So, I am interested in code snippets or an algorithm that I can use to generate all such mappings. Here is what I have tried so far in Python:
import collections
import itertools

corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]

# Clusters and attributes that occur in more than one pair
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count > 1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count > 1]

# One group of pairs per repeated cluster or attribute
conflicting_sets = []
for c in rep_clusters:
    conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
    conflicting_sets.append([p for p in corr if p[0] == a])

# Pairs that are not part of any conflict
non_conflicting = corr
for s in conflicting_sets:
    non_conflicting = non_conflicting - set(s)

m = set()
# Pick exactly one pair from every conflicting group
for p in itertools.product(*conflicting_sets):
    print(p, 'product', len(p))
    p_attribs = set([k[0] for k in p])
    p_clusters = set([k[1] for k in p])
    print(len(p_attribs), len(p_clusters))
    if len(p) == len(p_attribs) and len(p) == len(p_clusters):
        m.add(frozenset(set(p).union(non_conflicting)))
print(m)
As expected, the code produces m2 but not m1, because m1 will never be generated by itertools.product. Can anyone guide me on this? I would also like some guidance on performance, because the actual sets will be larger than the COR set used here and may contain many more conflicting sets.
A simpler definition of your requirements is:
I'm also assuming any a<x> or c<y> is unique.
Here's a solution:
The test is_injective_function checks whether the provided set f represents a valid injective function, by collecting all the values from the domain and range of the function and checking that both contain only unique values.
The generator takes an f, and if it represents a valid injective function, it checks that none of the elements that were removed from the original corr to reach f can be added back in while still having it represent a valid injective function. If that's the case, it yields f as a valid result.
If f isn't a valid injective function to begin with, it tries removing each of the elements in f in turn and generates any valid injective functions from each of those subsets.
Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.
Output:
Note that there are several approaches to deduplicating a list of non-hashable values. This approach turns every set in the list into a frozenset to make it hashable, turns the list into a set to remove duplicates, then turns the contents back into sets and returns the result as a list.
You can avoid removing duplicates at the end by keeping track of which reduced subsets have already been tried, which may perform better depending on your actual data set:
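A sketch of that bookkeeping, assuming the subsets are kept as frozensets so they can be stored in a set. The names maximal_injective_functions_seen and seen are illustrative, not taken from the text above.

```python
def maximal_injective_functions_seen(corr):
    """Like the first version, but never explores the same subset twice."""
    corr = frozenset(corr)
    seen = set()  # subsets that have already been explored

    def is_injective_function(f):
        domain = [a for a, _ in f]
        image = [c for _, c in f]
        return len(set(domain)) == len(domain) and len(set(image)) == len(image)

    def generate(f):
        if f in seen:
            return  # already handled via another removal order
        seen.add(f)
        if is_injective_function(f):
            # Yield f only if no removed pair can be added back.
            if not any(is_injective_function(f | {p}) for p in corr - f):
                yield set(f)
        else:
            for p in f:
                yield from generate(f - {p})

    # Every subset is visited at most once, so no deduplication is needed.
    return list(generate(corr))
```

Because each subset is explored once, each maximal mapping is also yielded once. The price is the memory held by seen, which can itself grow exponentially with the number of conflicting pairs.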
This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.
I was annoyed by the slowness of the above solution after a comment asked whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes to solve that), so here's a faster solution that runs in under 1 second for 100 elements with 15 conflicts. The execution time still grows exponentially, so it has its limits:
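One way to get a speed-up of that kind, offered as a sketch and not necessarily the approach behind the timing quoted above, is to split corr into groups of pairs that are linked by a shared attribute or cluster, enumerate the maximal choices inside each group by backtracking, and combine the groups with itertools.product. All function and variable names below are my own, and I have not benchmarked it.

```python
from collections import defaultdict
from itertools import product


def maximal_injective_functions_fast(corr):
    pairs = sorted(set(corr))

    # Index pairs by attribute and by cluster (tagged, in case names overlap).
    by_key = defaultdict(list)
    for a, c in pairs:
        by_key[('attr', a)].append((a, c))
        by_key[('clus', c)].append((a, c))

    # Split the pairs into connected groups of mutually linked conflicts.
    visited, components = set(), []
    for start in pairs:
        if start in visited:
            continue
        visited.add(start)
        stack, comp = [start], []
        while stack:
            a, c = stack.pop()
            comp.append((a, c))
            for q in by_key[('attr', a)] + by_key[('clus', c)]:
                if q not in visited:
                    visited.add(q)
                    stack.append(q)
        components.append(sorted(comp))

    def maximal_choices(comp):
        """All maximal conflict-free selections within one group."""
        out = []

        def extend(i, chosen, used_a, used_c):
            if i == len(comp):
                # Maximal only if every pair is chosen or blocked.
                if all(a in used_a or c in used_c for a, c in comp):
                    out.append(frozenset(chosen))
                return
            a, c = comp[i]
            if a not in used_a and c not in used_c:
                extend(i + 1, chosen + [(a, c)], used_a | {a}, used_c | {c})
            extend(i + 1, chosen, used_a, used_c)  # skip this pair

        extend(0, [], frozenset(), frozenset())
        return out

    per_component = [maximal_choices(comp) for comp in components]
    # A maximal mapping takes one maximal selection from every group.
    return [set().union(*combo) for combo in product(*per_component)]
```

A pair that conflicts with nothing forms a group of its own with exactly one choice, so it ends up in every mapping. The work inside a group is still exponential in the size of that group, and the number of results is the product of the per-group counts, so large tangled conflicts remain expensive.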