Logistic regression on huge dataset


I need to run a logistic regression on a huge dataset (many GBs of data). I am currently using Julia's GLM package for this. The regression works on subsets of the data, but I run out of memory when I try to fit it on the full dataset.

Is there a way to compute logistic regressions on huge, non-sparse datasets without using a prohibitive amount of memory? I thought about splitting the data into chunks, fitting a regression on each chunk, and somehow aggregating the results, but I'm not sure that would give valid estimates.


There are 5 answers

IainDunning

I have not personally used it, but the StreamStats.jl package is designed for this use case. It supports linear and logistic regression, as well as other streaming statistics.

Vincent Zoonekynd

Vowpal Wabbit is designed for that: linear models when the data (or even the model) does not fit in memory.

You can do the same thing by hand, using stochastic gradient descent (SGD): write down the loss function of your logistic regression (the negative log-likelihood), minimize it just a bit on a chunk of the data (perform a single gradient descent step), do the same thing on another chunk, and continue. After several passes over the data, you should have a good solution. It works better if the data arrives in a random order.
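For concreteness, here is a minimal Julia sketch of that by-hand approach (not tied to any particular package). It assumes each chunk arrives as a tuple (X, y) with one observation per row of X and 0/1 labels in y; data_chunks is a hypothetical iterator over such tuples.

# Minimal hand-rolled SGD for logistic regression (sketch).
sigmoid(z) = 1 / (1 + exp(-z))

function sgd_logistic!(w, data_chunks; lr = 0.01, epochs = 5)
    for _ in 1:epochs
        for (X, y) in data_chunks            # ideally visit chunks in random order
            p = sigmoid.(X * w)              # predicted probabilities
            g = X' * (p .- y) / length(y)    # gradient of the mean negative log-likelihood
            w .-= lr .* g                    # one gradient step per chunk
        end
    end
    return w
end

# usage (n_predictors = number of columns in X):
# w = sgd_logistic!(zeros(n_predictors), data_chunks)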

Another idea (ADMM, I think), similar to what you suggest, is to split the data into chunks and minimize the loss function on each chunk. Of course, the solutions on the different chunks will not agree. To address this, you change each chunk's objective by adding a small penalty for the difference between that chunk's solution and the average solution, and re-optimize everything. After a few iterations the solutions become closer and closer and eventually converge. This has the added advantage of being parallelizable.
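Below is a simplified consensus-style sketch of that idea (full ADMM also maintains dual variables, which are omitted here). chunks is again a hypothetical vector of (X, y) tuples with 0/1 labels; each inner loop takes plain gradient steps on the chunk's loss plus the penalty toward the current average.

# Simplified consensus sketch of the chunk-splitting idea.
function consensus_logistic(chunks, n_predictors; rho = 1.0, lr = 0.01,
                            inner_steps = 50, outer_iters = 20)
    ws = [zeros(n_predictors) for _ in chunks]   # one solution per chunk
    wbar = zeros(n_predictors)                   # average ("consensus") solution
    for _ in 1:outer_iters
        for (k, (X, y)) in enumerate(chunks)
            w = ws[k]
            for _ in 1:inner_steps
                p = 1 ./ (1 .+ exp.(-X * w))                        # predicted probabilities
                g = X' * (p .- y) / length(y) .+ rho .* (w .- wbar) # loss gradient + penalty toward wbar
                w .-= lr .* g
            end
        end
        wbar = sum(ws) / length(ws)              # pull the chunk solutions together
    end
    return wbar
end

Since each chunk's update only needs the current average solution, the loop over chunks parallelizes naturally.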

mynameisvinn

Several scikit-learn estimators implement partial_fit, which allows batch-wise training on large, out-of-core datasets; for logistic regression, use SGDClassifier with the logistic loss. Such models learn incrementally from data that doesn't fit into main memory.

Example:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss')     # logistic regression via SGD ('log' in scikit-learn < 1.1)
classes = np.array([0, 1])               # partial_fit needs all classes up front; assuming 0/1 labels
for batch_x, batch_y in some_generator:  # lazily read data in chunks
    clf.partial_fit(batch_x, batch_y, classes=classes)

Tom Breloff

Keep an eye on Josh Day's awesome package OnlineStats. In addition to tons of online algorithms for statistics, regression, classification, dimensionality reduction, and distribution estimation, we are also actively working on porting the missing functionality from StreamStats and merging the two packages.

Also, I've been working on a very experimental package, OnlineAI (built on OnlineStats), which will bring some of these online algorithms into the machine learning space.

joshday

To add to Tom's answer, OnlineStats.jl has a statistical learning type (StatLearn) which relies on stochastic approximation algorithms, each of which uses O(1) memory. Logistic regression and support vector machines are available for binary response data. The model can be updated with new batches of data, so you don't need to load your whole dataset at once. It's also extremely fast. Here's a basic example:

using OnlineStats, StatsBase
o = StatLearn(n_predictors, LogitMarginLoss())

# load batch 1
fit!(o, (x1, y1))

# load batch 2
fit!(o, (x2, y2))

# load batch 3
fit!(o, (x3, y3))
# ...and so on for any remaining batches

coef(o)
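If the batches themselves live in a file that is too big to load, you can stream them in as you go. The following is only a rough sketch: the file name huge.csv, the batch size of 10,000 rows, and the assumption of a headerless comma-separated file whose last column holds the labels (coded ±1 for a margin-based loss like LogitMarginLoss) are all placeholders.

using OnlineStats, StatsBase

o = StatLearn(n_predictors, LogitMarginLoss())   # same model as above

for batch in Iterators.partition(eachline("huge.csv"), 10_000)
    rows = [parse.(Float64, split(line, ',')) for line in batch]
    data = permutedims(reduce(hcat, rows))       # one observation per row
    x = data[:, 1:end-1]                         # feature columns
    y = data[:, end]                             # label column
    fit!(o, (x, y))                              # update the model with this batch
end

coef(o)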