I’m trying to optimize over subsets of a matrix’s rows: select rows so that each column is represented at least once, with as few duplicates as possible.
To be clear, the parameters to optimize are the following:
- p = Number of columns detected (More is better)
- q = Number of duplicate rows (Less is better)
- r = Penalty per duplicate, weighing the cost of including a duplicate against the gain of increasing p
The score to maximize would then be p - q*r
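To make the score concrete, here’s how I’d compute it for a candidate subset of rows (counting a “duplicate” as a 1 in a column that is already covered, which is how the examples below tally q):

```python
import numpy as np

def score(A, rows, r=0.1):
    """p = columns covered at least once; q = extra 1s in already-covered columns."""
    sub = np.asarray(A)[list(rows)]
    counts = sub.sum(axis=0)        # how many chosen rows cover each column
    p = int((counts > 0).sum())     # columns represented at least once
    q = int(counts.sum()) - p      # duplicate coverage beyond the first
    return p - q * r

A = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
print(score(A, [2, 3, 4]))  # rows 2,3,4 cover every column with no overlap -> 5.0
```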
Here is a simple example with a clear answer: rows [2,3,4] should be chosen (row 0 is dropped because row 2 covers column 0 plus an extra column):
A = [
[1,0,0,0,0],
[0,0,0,0,0],
[1,0,1,0,0],
[0,1,0,1,0],
[0,0,0,0,1],
]
p = 5 (all columns are represented), q = 0 (no duplicates)
There will not always be a perfect combination; sometimes a duplicate (increasing q) must be accepted in order to raise the number of columns represented (p). The weight r controls that trade-off.
B = [
[1,1,0,1,0],
[0,0,0,0,0],
[1,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
]
Best combination is [0,2,3]
p = 5 (all columns detected), q = 1 (1 duplicate)
Lastly, there will sometimes be two or more subsets tied for the best score, and the result should include all of them:
C = [
[1,0,0,1,0],
[0,1,1,0,1],
[0,1,0,0,1],
[1,0,0,1,0],
[0,0,0,0,1],
]
In this one I spot two equally good options: [0,1] and [1,3]
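For reference, a brute-force pass over all row combinations of C (exactly what I’d like to avoid on larger matrices) that collects every tied-optimal subset:

```python
import numpy as np
from itertools import combinations

C = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
])

def score(sub, r=0.1):
    counts = sub.sum(axis=0)
    p = (counts > 0).sum()           # columns covered
    return p - r * (counts.sum() - p)  # penalize redundant 1s

best, winners = float("-inf"), []
for k in range(1, len(C) + 1):
    for rows in combinations(range(len(C)), k):
        s = score(C[list(rows)])
        if s > best + 1e-9:           # strictly better: reset the tie list
            best, winners = s, [rows]
        elif abs(s - best) <= 1e-9:   # tie: keep it
            winners.append(rows)
print(best, winners)
```

This confirms that [0,1] and [1,3] tie for the best score.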
Other than brute-forcing all combinations of rows, can someone help me understand how to start implementing this while leveraging any of the algorithms in Scikit-Learn, SciPy, NumPy, or Platypus in Python?
More specifically, what algorithms can I use to select rows of a NumPy array that maximize p and minimize q weighted by r (i.e., score = p - q*r)?
Here's some of my test code:
import numpy as np
from platypus import NSGAII, Problem, Real

A = np.asarray([
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
])

def objective(x):
    # Threshold the real-valued decision vector into a row-selection mask.
    selected_rows = A[np.asarray(x) >= 0.5]
    # p: number of columns covered by at least one selected row.
    p = int(selected_rows.any(axis=0).sum())
    # q: number of exact duplicate rows among the selection.
    q = len(selected_rows) - len({tuple(row) for row in selected_rows})
    return [p, q * 0.1]

problem = Problem(5, 2)
problem.types[:] = [Real(0, 1)] * 5
problem.function = objective
problem.directions[:] = [Problem.MAXIMIZE, Problem.MINIMIZE]

algorithm = NSGAII(problem)
algorithm.run(10000)

for solution in algorithm.result:
    print(f"x = {solution.variables}\tp = {solution.objectives[0]:.2f}\tq*r = {solution.objectives[1]:.2f}")
Since you need all optimal solutions, the CP-SAT solver from the OR-Tools library seems like the right tool here (disclaimer: the developers of OR-Tools are my colleagues, but it’s free and open-source and can be installed as easily as
pip3 install ortools). I’ve provided complete working Python below, though I haven’t tried it on nontrivial instances. The idea is to formulate a mathematical program with a Boolean variable x_i for each row i, indicating whether row i belongs to the subset, and a Boolean variable y_j for each column j, indicating whether that column is covered. I’ve rewritten your objective to the equivalent
maximize 10p - q
(your score scaled by 10, taking r = 0.1), since the CP-SAT solver prefers to deal in integers. The other idea here is that, instead of detecting duplicates explicitly, we add a bonus to the coverage reward that exactly offsets the first “duplicate”.
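Scaling by 10 doesn’t change which subsets win; a quick sanity check that the integer objective ranks hypothetical (p, q) pairs the same way as p - 0.1*q:

```python
# (p, q) pairs for some hypothetical candidate subsets.
cands = [(5, 0), (5, 1), (5, 2), (4, 0), (3, 1)]
by_float = sorted(cands, key=lambda t: t[0] - 0.1 * t[1], reverse=True)
by_int = sorted(cands, key=lambda t: 10 * t[0] - t[1], reverse=True)
assert by_float == by_int  # identical ranking under both objectives
```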
The constraints are simple: for each column, some row that covers it belongs to the subset, or the column is uncovered. We could strengthen this constraint to an either/or, but we don’t need to, since there’s no reason to declare a covered column as uncovered.
Output: