How to subset a binary matrix to maximize “uniqueness”?

85 views Asked by At

I’m trying to optimize a matrix by subsetting rows and then performing a calculation on the subsetted rows. The calculation is representing each of the columns exactly once and having as few duplicates as possible.

To be clear, the parameters to optimize are the following:

  • p = Number of columns detected (More is better)
  • q = Number of duplicate rows (Less is better)
  • r = Penalty on including a duplicate row to increase p

Uniqueness would be defined as p - q*r

Here is a simple example where there is a clear answer where rows [2,3,4] will be chosen (row 0 is dropped because row 2 is better):

A = [
[1,0,0,0,0],
[0,0,0,0,0],
[1,0,1,0,0],
[0,1,0,1,0],
[0,0,0,0,1],
]

p = 5 (all columns are represented) q = 0 (no duplicates)

There will not always be a perfect combination and sometimes it will need to add a penalty (include a duplicate q) to add to the number of columns represented (p). Weighting this will be important which will be done by r.

B = [
[1,1,0,1,0],
[0,0,0,0,0],
[1,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
]

Best combination is [0,2,3]

P = 5 (all columns detected) q = 1 (1 duplicate)

Lastly, there will sometimes be 2 or more subsets that have the best combination so it should include all best subsets:

C = [
[1,0,0,1,0],
[0,1,1,0,1],
[0,1,0,0,1],
[1,0,0,1,0],
[0,0,0,0,1],
]

In this one, I spot a few good options: [0,1], [1,3]

Other than doing bruteforce of all combinations of rows, can someone help me understand how to start implement this while leveraging any of the algorithms in Scikit-Learn, SciPy, NumPy, or Platypus in Python?

More specifically, what algorithms can I use to optimize the rows in a NumPy array that maximizes p, minimize q weighted by r (e.g., score = p - q*r)?

Here's some of my test code:

from platypus import NSGAII, Problem, Real

A = np.asarray([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
])

def objective(x):
    x = np.asarray(x)
    selected_rows = A[x >= 0.5]
    p = len(set(selected_rows.flatten()))
    q = len(selected_rows) - len(set([tuple(row) for row in selected_rows]))
    return [p, q * 0.1]

problem = Problem(5, 2)
problem.types[:] = [Real(0, 1)] * 5
problem.function = objective
problem.directions[:] = [Problem.MAXIMIZE, Problem.MINIMIZE]

algorithm = NSGAII(problem)
algorithm.run(10000)

for solution in algorithm.result:
    print(f"x = {solution.variables} \tp = {solution.objectives[0]:.2f} \tq*r = {solution.objectives[1]:.2f}")
1

There are 1 answers

1
David Eisenstat On

Since you need all optimal solutions, the CP-SAT solver from the OR-Tools library seems like the right tool here (disclaimer: the developers of OR-Tools are my colleagues, but it’s free and open-source and can be installed as easily as pip3 install ortools).

I’ve provided complete working Python below, though I haven’t tried it on nontrivial instances. The idea is to formulate a mathematical program with a Boolean variable xi for each row i, indicating whether row i belongs to the subset, and a Boolean variable yj for each column j, indicating whether that column is covered. I’ve rewritten your objective to the equivalent

maximize (1/r + 1) ∑j yj − ∑i weight(i) xi,

since the CP-SAT solver prefers to deal in integers. The other idea here is that, instead of detecting duplicates explicitly, we add a bonus to the coverage reward that exactly offsets the first “duplicate”.

The constraints are simple: for each column, some row that covers it belongs to the subset, or the column is uncovered. We could strengthen this constraint to an either/or, but we don’t need to, since there’s no reason to declare a covered column as uncovered.

from collections import defaultdict
from ortools.sat.python import cp_model


class SolutionCollector(cp_model.CpSolverSolutionCallback):
    def __init__(self, row_vars):
        super().__init__()
        self._row_vars = row_vars
        self.solutions = []

    def on_solution_callback(self):
        self.solutions.append(
            [i for (i, row_var) in enumerate(self._row_vars) if self.Value(row_var)]
        )


def solve(matrix):
    model = cp_model.CpModel()
    objective = 0
    col_index_to_row_vars = defaultdict(list)
    row_index_to_var = []
    for i, row in enumerate(matrix):
        row_var = model.NewBoolVar(f"x{i}")
        objective -= sum(row) * row_var
        for j, entry in enumerate(row):
            if entry:
                col_index_to_row_vars[j].append(row_var)
        row_index_to_var.append(row_var)
    for j, row_vars in col_index_to_row_vars.items():
        col_var = model.NewBoolVar(f"y{j}")
        objective += 11 * col_var
        model.AddBoolOr(row_vars + [col_var.Not()])

    # We would love to find all optimal solutions in one go, but CP-SAT does not
    # support that. Find the optimal objective value.
    model.Maximize(objective)
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status != cp_model.OPTIMAL:
        raise Exception("did not find optimal solution")
    optimal_objective_value = solver.Value(objective)

    # Add the optimal objective value as a constraint and clear the objective.
    model.Add(objective == optimal_objective_value)
    model.Proto().ClearField("objective")

    # Find all (optimal) solutions.
    solver = cp_model.CpSolver()
    solution_collector = SolutionCollector(row_index_to_var)
    solver.parameters.enumerate_all_solutions = True
    status = solver.Solve(model, solution_collector)
    if status != cp_model.OPTIMAL:
        raise Exception("did not find all solutions")
    return solution_collector.solutions


A = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
print("A", solve(A))

B = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
print("B", solve(B))

C = [
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
print("C", solve(C))

Output:

A [[2, 3, 4], [1, 2, 3, 4]]
B [[0, 3, 4], [0, 1, 3, 4]]
C [[0, 1], [1, 3]]