Generating Maximal Subsets of a Set Under Some Constraints in Python

142 views Asked by At

I have a set of attributes A= {a1, a2, ...an} and a set of clusters C = {c1, c2, ... ck} and I have a set of correspondences COR which is a subset of A x C and |COR|<< A x C. Here is a sample set of correspondences

COR = {(a1, c1), (a1, c2), (a2, c1), (a3, c3), (a4, c4)}

Now, I want to generate all the subsets of COR such that each pair in the subset represents an injective function from set A to set C. Let's call each of such subset a mapping then the valid mappings from the above set COR would be
m1 = {(a1, c1), (a3, c3), (a4, c4)} and m2 = {(a1, c2), (a2, c1), (a3, c3), (a4, c4)}
m1 is interesting here because adding any of the remaining elements from COR to m1 would either violate the definition of the function or it would violate the condition of being an injective function. For instance, if we add the pair (a1,c2) to m1, m1 would not be a function anymore and if we add (a2,c1) to m1, it will cease to be an injective function. So, I am interested in some code snippets or algorithm that I can use to generate all such mappings. Here is what I have tried so far in python import collections import itertools

corr = set({('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')})
clusters = [c[1] for c in corr]
attribs = [a[0] for a in corr]
rep_clusters = [item for item, count in collections.Counter(clusters).items() if count>1]
rep_attribs = [item for item, count in collections.Counter(attribs).items() if count>1]
conflicting_sets = []
for c in rep_clusters:
    conflicting_sets.append([p for p in corr if p[1] == c])
for a in rep_attribs:
    conflicting_sets.append([p for p in corr if p[0] == a])
non_conflicting  = corr
for s in conflicting_sets:
    non_conflicting = non_conflicting - set(s)
m = set()
for p in itertools.product(*conflicting_sets):
    print(p, 'product', len(p))
    p_attribs = set([k[0] for k in p])
    p_clusters = set([k[1] for k in p])
    print(len(p_attribs), len(p_clusters))
    if len(p) == len(p_attribs) and len(p) == len(p_clusters):
        m.add(frozenset(set(p).union(non_conflicting)))
print(m)

And as expected the code produces m2 but not m1 because m1 will not be generated from itertools.product. Can anyone guide me on this? I would also like some guidance on performance because the actual sets would be larger than COR set used here and may contain many more conflicting sets.

1

There are 1 answers

3
Grismar On BEST ANSWER

A simpler definition of your requirements is:

  • You have a set of unique tuples.
  • You want to generate all subsets for which:
    • all of the first elements of the tuples are unique (to ensure a function);
    • and all of the second elements are unique (to ensure injectivity).
  • Your title suggests you only want the maximal subsets, i.e. it must be impossible to add any additional elements from the original set without breaking the other requirements.

I'm also assuming any a<x> or c<y> is unique.

Here's a solution:

def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    def generate_from(f):
        if is_injective_function(f):
            for r in corr - f:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                yield from generate_from(f - {c})

    return list(map(set, set(map(frozenset, generate_from(corr)))))


# representing a's and c's as strings, as their actual value doesn't matter, as long as they are unique
print(get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))

The test is_injective_function checks if the provided set f represents a valid injective function, by getting all the values from the domain and range of the function and checking that both only contain unique values.

The generator takes an f, and if it represents an injective valid function, it checks to see that none of the elements that have been removed from the original corr to reach f can be added back in while still having it represent an injective valid function. If that's the case, it yields f as a valid result.

If f isn't an injective valid function to begin with, it will try to remove each of the elements in f in turn and generate any injective valid functions from each of those subsets.

Finally, the whole function removes duplicates from the resulting generator and returns it as a list of unique sets.

Output:

[{('a1', 'c1'), ('a3', 'c3'), ('a4', 'c4')}, {('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4'), ('a1', 'c2')}]

Note, there's several approaches to deduplicating a list of non-hashable values, but this approach turns all the sets in the list into a frozenset to make them hashable, then turns the list into a set to remove duplicates, then turns the contents into sets again and returns the result as a list.

You can prevent removing duplicates at the end by keeping track of what removed subsets have already been tried, which may perform better depending on your actual data set:

def get_maximal_subsets(corr):
    def is_injective_function(f):
        if not f:
            return False
        f_domain, f_range = zip(*f)
        return len(set(f_domain)) - len(f_domain) + len(set(f_range)) - len(f_range) == 0

    previously_removed = []

    def generate_from(f, removed: set = None):
        previously_removed.append(removed)
        if removed is None:
            removed = set()
        if is_injective_function(f):
            for r in removed:
                if is_injective_function(f | {r}):
                    break
            else:
                yield f
        else:
            for c in f:
                if removed | {c} not in previously_removed:
                    yield from generate_from(f - {c}, removed | {c})

    return list(generate_from(corr))

This is probably a generally better performing solution, but I liked the clean algorithm of the first one better for explanation.

I was annoyed by the slowness of the above solution after the comment asking whether it scales up to 100 elements with ~15 conflicts (it would run for many minutes to solve it), so here's a faster solution that runs under 1 second for 100 elements with 15 conflicts, although the execution time still goes up exponentially, so it has its limits):

def injective_function_conflicts(f):
    if not f:
        return {}
    conflicts = defaultdict(set)
    # loop over the product f x f
    for x in f:
        for y in f:
            # for each x and y that have a conflict in any position
            if x != y and any(a == b for a, b in zip(x, y)):
                # add x to y's entry and y to x's entry
                conflicts[y].add(x)
                conflicts[x].add(y)
    return conflicts


def get_maximal_partial_subsets(conflicts, off_limits: set = None):
    if off_limits is None:
        off_limits = set()

    while True and conflicts:
        # pop elements from the conflicts, using them now, or discarding them if off-limits
        k, vs = conflicts.popitem()
        if k not in off_limits:
            break
    else:
        # nothing left in conflicts that's not off-limits
        yield set()
        return

    # generate each possible result from the rest of the conflicts, adding the conflicts vs for k to off_limits
    for sub_result in get_maximal_partial_subsets(dict(conflicts), off_limits | vs):
        # these results can have k added to them, as all the conflicts with k were off-limits
        yield sub_result | {k}
    # also generated each possible result from the rest of the conflicts without k's conflicts
    for sub_result in get_maximal_partial_subsets(conflicts, off_limits):
        # but only yield as a result if adding k itself to it would actually cause a conflict, avoiding duplicates
        if sub_result and injective_function_conflicts(sub_result | {k}):
            yield sub_result


def efficient_get_maximal_subsets(corr):
    conflicts = injective_function_conflicts(corr)
    final_result = list((corr - set(conflicts.keys())) | result
                        for result in get_maximal_partial_subsets(dict(conflicts)))
    print(f'size of result and conflict: {len(final_result)}, {len(conflicts)}')
    return final_result


print(efficient_get_maximal_subsets(corr={('a1', 'c1'), ('a1', 'c2'), ('a2', 'c1'), ('a3', 'c3'), ('a4', 'c4')}))