Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.

I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.

Linear Regression and OLS

If b = A*x, where A is a matrix, x is a vector of the unknown regression coefficients, and b is the output, I want to find the x that minimizes

||b - Ax||^2

If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then

Coordinate Descent:

x[j]  <--  A[j]^T * (b - A[-j] * x[-j])

I'm following along with these notes (pages 9-10), but the derivation is simple calculus.
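
Spelling that calculus out (writing A_j for A[j] and A_{-j} for A[-j]): setting the derivative with respect to x_j to zero gives

\frac{\partial}{\partial x_j} \left\| b - A_{-j}x_{-j} - A_j x_j \right\|^2
    = -2\, A_j^{\top}\left( b - A_{-j}x_{-j} - A_j x_j \right) = 0
\quad\Longrightarrow\quad
x_j = \frac{A_j^{\top}\left( b - A_{-j}x_{-j} \right)}{\left\| A_j \right\|^2}
    = A_j^{\top}\left( b - A_{-j}x_{-j} \right),

where the last step uses the normalization ||A[j]||^2 = 1.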

The notes point out that, instead of recomputing A[j]^T * (b - A[-j] * x[-j]) from scratch at every step, a faster update is

Fast Coordinate Descent:

x[j]  <--  A[j]^T*r + x[j]

where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
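
One can sanity-check the equivalence of the two updates numerically with a throwaway snippet like the following (the dimensions and variable names here are mine, chosen only for illustration):

using LinearAlgebra

A = randn(20, 5); A ./= sqrt.(sum(abs2, A, dims=1))   # unit-norm columns
b = randn(20); x = randn(5); j = 3

slow = dot(A[:,j], b - A[:, (1:5) .!= j] * x[(1:5) .!= j])
fast = dot(A[:,j], b - A*x) + x[j]
slow ≈ fast   # true, up to floating-point roundoff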

My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.

Julia code

Below is Julia code for the built-in reference solver and the two coordinate-descent update rules:

using LinearAlgebra   # for dot

function OLS_builtin(A, b)
    x = A \ b          # least-squares solution via the built-in solver
    return x
end

function OLS_coord_descent(A, b)
    N, P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        for j in 1:P
            # regress the partial residual on column j (columns are unit-norm)
            x[j] = dot(A[:,j], b - A[:, (1:P) .!= j] * x[(1:P) .!= j])
        end
    end
    return x
end

function OLS_coord_descent_fast(A, b)
    N, P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x                 # residual, computed once per sweep
        for j in 1:P
            x[j] += dot(A[:,j], r)  # r is held fixed for the whole sweep
        end
    end
    return x
end

Example of the problem

I generate data with the following:

using Statistics   # for mean

n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])

β = append!(β_nz, zeros(Float64, p - length(β_nz)))
X = randn(n, p); X .-= mean(X, dims=1); X ./= sqrt.(sum(abs2, X, dims=1))   # centered, unit-norm columns
y = X*β + σ*randn(n); y .-= mean(y)

Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y) returns exponentially large values for the regression coefficients.

When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
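
A minimal way to reproduce this comparison (the norm-based check below is just one possible way to measure agreement):

using LinearAlgebra

x_builtin = OLS_builtin(X, y)
x_cd      = OLS_coord_descent(X, y)
x_fast    = OLS_coord_descent_fast(X, y)

norm(x_cd - x_builtin)     # small: coordinate descent agrees with the built-in solver
norm(x_fast - x_builtin)   # blows up for p = 50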

Conjecture

Since the methods agree in the regime p << n, I think the algorithm is formally correct but numerically unstable. Does anyone have any thoughts on whether this guess is right, and if so, how to correct the instability while retaining most of the performance gains of the fast version of the algorithm?

1 Answer

Answer by Dan Getz (accepted):

The quick answer: you forgot to update r after each x[j] update. The following is the fixed function, which behaves like OLS_coord_descent:

function OLS_coord_descent_fast(A, b)
    N, P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j in 1:P
            x[j] += dot(A[:,j], r)
            r -= A[:,j]*dot(A[:,j], r)   # Add this line: restores r = b - A*x after updating x[j]
        end
    end
    return x
end
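
As a side note (not part of the original answer): dot(A[:,j], r) is computed twice per coordinate above. Caching it once gives the same iteration slightly cheaper; a sketch, with the function name OLS_coord_descent_fast2 made up for illustration:

using LinearAlgebra

function OLS_coord_descent_fast2(A, b; cycles = 1000)
    N, P = size(A)
    x = zeros(P)
    for cycle in 1:cycles
        r = b - A*x                 # full residual, refreshed once per sweep
        for j in 1:P
            Δ = dot(A[:,j], r)      # coordinate correction, computed once
            x[j] += Δ
            r .-= Δ .* A[:,j]       # keep r consistent with the updated x[j]
        end
    end
    return x
end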