Consider a list of permutations (order-relevant combinations) of the form :
(1 2 3)
(1 2 4)
(5 2 3)
(5 2 4)
I need to find the smallest number of generating sets for this permutation group. For example, given the permutations above,
(1 2 3,4)
(5 2 3,4)
Is not an optimal solution. The optimal solution is (1,5 2 3,4).
You will notice that this solution contains sets A={1, 5} B={2} and C={3,4} The original list of permutations is the ordered Cartesian product of these sets: A X B X C.
I would like to partition the list of permutations into the fewest possible groups expressed as sets A, B, and C whose product contains all permutations in the list. The final answer is usually, but not always, a list of a list of sets, since it is not always possible to reduce the list of permutations down to a single list of generating sets. That is, it is usually the case that the product of sets A, B, and C account for some of the permutations in the list, but sets D, E, and F are required to account for other permutations in the list and so on.
My crude attempt at solving this problem involved checking to see if I had a match on any of the two permutation slots in the list and recursively merging them. If so, I merged those two combinations, i.e.
(1 2 3)
(1 2 4)
produce
(1 2 3,4)
Unfortunately, the merge order of such combinations is not associative. An ideal solution would merge these combinations in such a way that the final list of sets would subsume the largest possible number of permutations.
To demonstrate the problem with associativity, take this example, which cannot reduce to fewer than two lists of generating sets:
(1 2 4)
(1 2 3)
(1 2 5)
(5 2 3)
(5 2 4)
Suppose you were to recursively merge these according to the following rule, “If any two columns match, I merge by preserving those columns as-is. I then merge the two sets in the third column to produce the new set. After the merger of these two rows, I throw out the two original rows, so they are not re-merged or double-counted.” The order of these merges is significant. Given the list of permutations above, if I merge (1 2 3) and (1 2 4), I get (1 2 3,4). Now, how do I conduct the next merge to optimize the generating set? Suppose I look at (1 2 5) and see that it matches (1 2 3,4) on two columns, I perform the merge and get (1 2 3,4,5). All appears to be well. However, I then merge (5 2 3) and (5 2 4) which results in (5 2 3,4). I compare (5 2 3,4) and (1 2 3,4,5). I do not have a two-column match, so I stop merging. I would have reduced the output to (5 2 3,4) and (1 2 3,4,5).
Now suppose that I merged in a different order. I merge (1 2 3) and (1 2 4) to produce (1 2 3,4). Then I merge (5 2 3) and (5 2 4) to produce (5 2 3,4). I see that I have a match between these two products. I then merge (1 2 3,4) and (5 2 3,4) to produce (1,5 2 3,4). The final list of generating sets is (1,5 2 3,4) and (1 2 5). So you see that the merge order has produced two different answers: (1,5 2 3,4) and (1 2 5) vs. (5 2 3,4) and (1 2 3,4,5).
In this case I would probably settle for either answer, since the same number (2) of lists of generating sets occurs in each answer. However, (1,5 2 3,4) and (1 2 5) is slightly preferable, because (1,5 2 3,4) subsumes the largest possible number of combinations. Suppose, however, that we have a list of 900 combinations. The merge order of a bottom-up solution to the problem will cause you to reduce the problem in a non-optimal way where optimization is the smallest possible count on the list of the list of sets. It's hard to know what the merge order is without looking ahead through every possible merge path, which is not any more efficient than the brute force method of finding matches, which I have also tried.
I have also attempted the brute force method. Why is the efficiency of the brute force method unacceptable? That method first finds a list of unique integers across all columns. It then generates the power set of every possible combination of these integers. It does so for column/set A, column/set B, and column/set C. It then orders these sets by size (biggest to smallest), computes the Cartesian product of each set across every other set in other columns, then it loops through these Cartesian products, which are keyed by their generating sets to check if the Cartesian product matches a list of permutations from the input. This is roughly O(n^3) where n=the number of combinations in the input list. If I could do this in O(n^2) even, it would be a win compared to where it is now.
As far as the memory considerations go. Let’s assume that the domain for each slot is the integers 1-12. The number of distinct combinations across three slots is 12!/3! You’re looking at over 79 million raw combinations. That’s before they are even partitioned into sets by Google’s Guava Collection APIs (which I’d highly recommend BTW). We could probably somehow generate the set lazily, and I get the sense that Google does, but the memory constraints are still large.
I'm really looking for an algorithm to solve this problem. However, if there's a Java library or method that will take care of this with minimal pain, I'm up for that too. Thanks.
Thanks everyone for your insights and answers. I'd like to credit this answer to John Kolen (http://www.johnfkolen.com) who has presented a feasible solution as follows:
A Greedy Algorithm for Maximal Coverage of Triples
Input: A set, with subsets A x B x C, of triples, where A, B, and C are sets of integers.
Output: A set of n triples (A_i, B_i, C_i), where A_i in A, B_i in B, and C_i in C, and Union_i A_i x B_i x C_i = X and Intersection_i A_i x B_i x C_i = empty.
Algorithm
The algorithm can be broken into three parts: finding pair covers, finding triple covers, and finally constructing a total cover.
Finding covers of pairs from a subset of B x C involves building a map from elements of B to sets of C. Essentially a small prefix tree, or trie, this data structure makes it easy to find the maximal cover a set of prefixes. For instance, if b_1->C_1 and b_2->C_2, then the maximal cover involving b_1 and b_2 will be <[b_1,b_2],C_1 intersection C_2>.
Once we can find maximal pairs, then it's possible to construct maximal triples. This time, however, the prefix (A) maps to a set of pairs (BxC). By using the previous method it's possible to find examine all subsets of A and their matching pair coverage over the suffixes.
The greedy approach finds the locally best cover and adds it to the solution set. The triples captured by the current cover are removed from consideration, and the process continues until the locally best cover is a singleton. The remaining triples are added to the solution.
The relevant code is attached.