Is there a way to define a complex objective function in an R optimizer?

638 views Asked by At

In R, I am trying to optimize the following : choose rows which maximize the number of columns whose sum exceeds a certain value which varies by column + some other basic constraints on the row selections.

Is there anything out there in R which allows you to incorporate logic into an objective function? ie maximize countif ( sum(value column) > target value for column ) over ~10k columns choosing 5 rows with ~ 500 row choices.

Simple example: Grab the combo of 4 rows below whose col sum exceeds the target more frequently than any other combo of 4 rows.

  +--------+------+------+------+------+------+------+------+------+------+-------+
    |   x    | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
    +--------+------+------+------+------+------+------+------+------+------+-------+
    | row1   |   82 |   73 |   50 |   11 |   76 |   12 |   46 |   64 |    5 |    44 |
    | row2   |    2 |   33 |   35 |   55 |   52 |   18 |   13 |   86 |   72 |    39 |
    | row3   |   94 |    5 |   10 |   21 |   90 |   62 |   54 |   54 |    7 |    17 |
    | row4   |   27 |   10 |   28 |   87 |   27 |   83 |   62 |   56 |   54 |    86 |
    | row5   |   17 |   50 |   34 |   30 |   80 |    7 |   96 |   91 |   32 |    21 |
    | row6   |   73 |   75 |   32 |   71 |   37 |    1 |   13 |   76 |   10 |    34 |
    | row7   |   98 |   13 |   87 |   49 |   27 |   90 |   28 |   75 |   55 |    21 |
    | row8   |   45 |   54 |   25 |    1 |    3 |   75 |   84 |   76 |    9 |    87 |
    | row9   |   40 |   87 |   44 |   20 |   97 |   28 |   88 |   14 |   66 |    77 |
    | row10  |   18 |   28 |   21 |   35 |   22 |    9 |   37 |   58 |   82 |    97 |
    | target |  200 |  100 |  125 |  135|  250 |  89 |  109 |  210|  184 |   178 |
    +--------+------+------+------+------+------+------+------+------+------+-------+

EDIT + Update: I implemented the following using ompr, ROI, and some Big M logic.

nr <- 10 # number of rows
nt <- 15 # number of target columns
vals <- matrix(sample.int(nr*nt, nr*nt), nrow=nr, ncol=nt)

targets <- vector(length=nt)
targets[1:nt] <- 4*mean(vals)


model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
  add_variable(A[j], j = 1:nt, type = "binary") %>%
  add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
  add_constraint(s[j] <= 9999999*A[j], j =1:nt)%>%
  add_constraint(s[j] >= A[j], j =1:nt)%>%
  add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%    
    set_objective(sum_expr(-9999999*A[j], i = 1:nr, j = 1:nt), "max")

model <- solve_model(model,with_ROI(solver = "glpk"))

The model works great for small problems, including those where no solution exists which exceeds the target of every column.

However, the above returns Infeasible when I change the number of columns to even just 150. Given that I tested various scenarios on my smaller example, my hunch is that my model definition is OK...

Any suggestions as to why this is infeasible? Or maybe a more optimal way to define my model?

4

There are 4 answers

2
Karsten W. On

This is IMHO a linear programming modeling question: Can we formulate the problem as a "normalized" linear problem that can be solved by, for example, ompr or ROI (I would add lpSolveAPI)?

I believe it is possible, though I do not have the time to provide the full formulation. Here are some ideas:

As parameters, i.e. fixed values, we have

nr <- 10 # number of rows
nt <- 10 # number of target columns
vals <- matrix(sample.int(100, nr*nt), nrow=nr, ncol=nt)
targets <- sample.int(300, nt)

The decision variables we are interested in are x[1...nr] as binary variables (1 if the row is picked, 0 otherwise).

Obviously, one constraint would be sum(x[i],i)==4 -- the numbers of rows we pick.

For the objective, I would introduce auxilliary variables, such as

y[j] = 1, if sum_{i=1..nr} x[i]*vals[i,j]>= targets[j]

(and 0 otherwise) for j=1...nt. Now this definition of y is not compatible with linear programming and needs to be linearized. If we can assume that val[i,j] and targets[j] are greater or equal to zero, then we can define y[j] as binary variables like this:

x'vals[,j]-t[j]*y[j] >= 0

(x'y is meant as inner product, i.e. sum(x[i]*y[i], i).) In the case x'vals[,j]>=t[j], the value y[j]==1 is valid. In the case x'vals[,j]<t[j], y[j]==0 is enforced.

With the objective max sum(y[j],j), we should get a proper formulation of the problem. No big-M required. But additional assumptions on non-negativity introduced.

0
AirSquid On

This isn't quite what you asked as it is cast in python, but perhaps it will show you the approach to doing this with Integer Programming. You should be able to replicate this in R as there are bindings in R for several solvers, including CBC, which is the one I'm using below, which is suitable for Integer Programs.

I'm also using pyomo to frame up the math model for the solver. I think with a little research, you could find equivalent way to do this in R. The syntax at the start is just to ingest the data (which I just pasted into a .csv file). The rest should be readable.

The good/bad...

This solves almost immediately for your toy problem. It can be shown that 5 rows can exceed all column totals.

For many more columns, it can bog down greatly. I did a couple tests with large matrices of random numbers.... This is very challenging for the solver because it cannot identify "good" rows easily. I can get it to solve for 500x100 with random values (and the total row randomized and multiplied by 5 (the number of selections....just to make it challenging) only in reasonable time by relaxing the tolerance on the solution.

If you really have 10K columns, there are only a couple ways this could work... 1. You have several rows that can cover all the column totals (solver should discover this quickly) or 2. there is some pattern (other than random noise) to the data/totals that can guide the solver, and 3. Using a large ratio based gap (or time limit)

import pyomo.environ as pyo
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", header=None)  # this is the data from the post

# uncomment this below for a randomized set of data
# df = pd.DataFrame(
#     data = np.random.random(size=(500,100)))
# df.iloc[-1] = df.iloc[-1]*5

# convert to dictionary
data = df.iloc[:len(df)-1].stack().to_dict()
col_sums = df.iloc[len(df)-1].to_dict()

limit = 5  # max number or rows selected

m = pyo.ConcreteModel('row picker')

### SETS
m.R = pyo.Set(initialize=range(len(df)-1))
m.C = pyo.Set(initialize=range(len(df.columns)))

### Params
m.val = pyo.Param(m.R, m.C, initialize=data)
m.tots = pyo.Param(m.C, initialize=col_sums)

### Variables
m.sel = pyo.Var(m.R, domain=pyo.Binary)  # indicator for which rows are selected
m.abv = pyo.Var(m.C, domain=pyo.Binary)  # indicator for which column is above total

### OBJECTIVE
m.obj = pyo.Objective(expr=sum(m.abv[c] for c in m.C), sense=pyo.maximize)

### CONSTRAINTS
# limit the total number of selections...
m.sel_limit = pyo.Constraint(expr=sum(m.sel[r] for r in m.R) <= limit)

# link the indicator variable to the column sum 
def c_sum(m, c):
    return sum(m.val[r, c] * m.sel[r] for r in m.R) >= m.tots[c] * m.abv[c]
m.col_sum = pyo.Constraint(m.C, rule=c_sum)

### SOLVE
print("...built... solving...")
solver = pyo.SolverFactory('cbc', options={'ratio': 0.05})
result = solver.solve(m)
print(result)

### Inspect answer ...
print("rows to select: ")
for r in m.R:
    if m.sel[r]:
        print(r, end=', ')

print("\ncolumn sums from those rows")
tots = [sum(m.val[r,c]*m.sel[r].value for r in m.R) for c in m.C]
print(tots)
print(f'percentage of column totals exceeded:  {len([1 for c in m.C if m.abv[c]])/len(m.C)*100:0.2f}%')

Yields:

Problem: 
- Name: unknown
  Lower bound: -10.0
  Upper bound: -10.0
  Number of objectives: 1
  Number of constraints: 11
  Number of variables: 20
  Number of binary variables: 20
  Number of integer variables: 20
  Number of nonzeros: 10
  Sense: maximize
Solver: 
- Status: ok
  User time: -1.0
  System time: 0.0
  Wallclock time: 0.0
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics: 
    Branch and bound: 
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box: 
      Number of iterations: 0
  Error rc: 0
  Time: 0.013128995895385742
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

rows to select: 
0, 2, 3, 8, 9, 
column sums from those rows
[261.0, 203.0, 153.0, 174.0, 312.0, 194.0, 287.0, 246.0, 214.0, 321.0]
percentage of column totals exceeded:  100.00%
[Finished in 845ms]

Edit:

I see your edit follows similar pattern to the above solution.

The reason you are getting "INFEASIBLE" for larger instantiations is that your Big-M is no longer big enough when the values are bigger and more are summed. You should pre-analyze your matrix and set BIG_M to be the maximal value in your target row, which will be big enough to cover any gap (by inspection). That will keep you feasible without massive overshoot on BIG_M which has consequences also.

I tweaked a few things on your r model. My r syntax is terrible, but try this out:

model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
  add_variable(A[j], j = 1:nt, type = "binary") %>%
  add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
  add_constraint(s[j] <= BIG_M*A[j], j =1:nt)%>%
  # NOT NEEDED:  add_constraint(s[j] >= A[j], j =1:nt)%>%
  # DON'T include A[j]:  add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%   
  add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + s[j] >= targets[j], j=1:nt) %>%  
  # REMOVE unneded indexing for i:  set_objective(sum_expr(A[j], i = 1:nr, j = 1:nt), "min")
  # and just minimize.  No need to multiply by a large constant here.
  set_objective(sum_expr(A[j], j = 1:nt), "min")

model <- solve_model(model,with_ROI(solver = "glpk"))
2
anymous.asker On

What you want to solve here is called a "mixed integer program", and there's lots of (mostly commercial) software designed around it.

Your typical R functions such as optim are hardly any good for it due to the kind of constraints, but you can use specialized software (such as CBC) as long as you are able to frame the problem in a standard MIP structure (in this case the variables to optimize are binary variables for each row in your data).

As an alternative, you could also look at the package nloptr with its global derivate-free black-box optimizers, in which you can enter a function like this (setting bounds on the variables) and let it optimize it with some general-purpose heuristics.

0
Enrico Schumann On

You could try a Local-Search algorithm. It may give you only a "good" solution; but in exchange it is highly flexible.

Here is a sketch. Start with an arbitrary valid solution x, for instance for your example data

x <- c(rep(TRUE, 4), rep(FALSE, 6))
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Define an objective function:

obj_fun <- function(x, table, target, ...) {
    -sum(colSums(table[x, ]) >= target)
}

Given a table and a target vector, it selects the rows defined in x and calculates the number of rowsums that match or exceed the target. I write -sum because I'll use an implementation that minimises an objective function.

-obj_fun(x, table, target)
## [1] 7

So, for the chosen initial solution, 7 column sums are equal to or greater than the target.

Then you'll need a neighbourhood function. It takes a solution x and returns a slightly changed version (a "neighbour" of the original x). Here is a neighbour function that changes a single row in x.

nb <- function(x, ...) {
    true  <- which( x)
    false <- which(!x)
  
    i <-  true[sample.int(length( true), size = 1)]
    j <- false[sample.int(length(false), size = 1)]
    x[i] <- FALSE
    x[j] <- TRUE
    x
}


x
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

nb(x)
## [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
##     ^^^^^                                      ^^^^

Here is your data:

library("orgutils")
tt <- readOrg(text = "
    |   x    | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
    |--------+------+------+------+------+------+------+------+------+------+-------+
    | row1   |   82 |   73 |   50 |   11 |   76 |   12 |   46 |   64 |    5 |    44 |
    | row2   |    2 |   33 |   35 |   55 |   52 |   18 |   13 |   86 |   72 |    39 |
    | row3   |   94 |    5 |   10 |   21 |   90 |   62 |   54 |   54 |    7 |    17 |
    | row4   |   27 |   10 |   28 |   87 |   27 |   83 |   62 |   56 |   54 |    86 |
    | row5   |   17 |   50 |   34 |   30 |   80 |    7 |   96 |   91 |   32 |    21 |
    | row6   |   73 |   75 |   32 |   71 |   37 |    1 |   13 |   76 |   10 |    34 |
    | row7   |   98 |   13 |   87 |   49 |   27 |   90 |   28 |   75 |   55 |    21 |
    | row8   |   45 |   54 |   25 |    1 |    3 |   75 |   84 |   76 |    9 |    87 |
    | row9   |   40 |   87 |   44 |   20 |   97 |   28 |   88 |   14 |   66 |    77 |
    | row10  |   18 |   28 |   21 |   35 |   22 |    9 |   37 |   58 |   82 |    97 |
    | target |  200 |  100 |  125 |   135|  250  |  89 |  109 |   210|  184 |   178 |
")


table  <- tt[1:10, -1]
target <- tt[11,   -1]

Run the search; in this case, with an algorithm called "Threshold Accepting". I use the implementation in package NMOF (which I maintain).

library("NMOF")
x0 <- c(rep(TRUE, 4), rep(FALSE, 6))
sol <- TAopt(obj_fun,
             list(neighbour = nb,     ## neighbourhood fun
          x0 = sample(x0),    ## initial solution
          nI = 1000,          ## iterations
                  OF.target = -ncol(target)  ## when to stop
                 ),
             target = target,
             table = as.matrix(table))

rbind(Sums = colSums(table[sol$xbest, ]), Target = target)       
##        col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
## Sums    222  206  216  135  252  148  175  239  198   181
## Target  200  100  125  135  250   89  109  210  184   178

As I said, this is a only sketch, and depending on how large and important your actual problem is, there are a number of points to consider:

  • most importantly: nI sets the number of search iterations. 1000 is the default, but you'll definitely want to play around with this number.

  • there may be cases (i.e. datasets) for which the objective function does not provide good guidance: if selecting different rows does not change the number of columns for which the target is met, the algorithm cannot judge if a new solution is better than the previous one. Thus, adding more-continuous guidance (e.g. via some distance-to-target) may help.

  • updating: the computation above actually does a lot that's not necessary. When a new candidate solution is evaluated, there would actually be no need to recompute the full column sums. Instead, only adjust the previous solution's sums by the changed rows. (For a small dataset, this won't matter much.)