I need to run a logistic regression on a huge dataset (many GBs of data). I am currently using Julia's GLM package for this. Although the regression works on subsets of the data, I run out of memory when I try to fit it on the full dataset.
Is there a way to compute logistic regressions on huge, non-sparse datasets without using a prohibitive amount of memory? I thought about splitting the data into chunks, running a regression on each chunk, and somehow aggregating the results, but I'm not sure that would give valid estimates.
I have not personally used it, but the StreamStats.jl package is designed for this use case. It supports linear and logistic regression, as well as other streaming statistics.
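For reference, and independent of any particular package, here is a minimal sketch of the chunk-by-chunk idea: keep a single coefficient vector and update it with stochastic gradient descent as each chunk is loaded, so only one chunk ever has to be in memory. The names (`sgd_update!`), the chunk size, the learning rate, and the random toy data are assumptions for illustration, not the StreamStats.jl API; check that package's README for its actual interface.

```julia
using LinearAlgebra, Random

sigmoid(z) = 1 / (1 + exp(-z))

# One pass of stochastic gradient descent over a single chunk of rows.
# X is an n×p matrix of features, y a vector of 0/1 labels; w is mutated in place.
function sgd_update!(w, X, y; lr = 1e-3)
    for i in 1:size(X, 1)
        xi = @view X[i, :]
        err = sigmoid(dot(w, xi)) - y[i]   # gradient of the per-row log-loss
        w .-= lr .* err .* xi
    end
    return w
end

# Toy driver: each loop iteration stands in for loading the next chunk from disk.
Random.seed!(1)
p = 5                     # number of features (assumed for the toy example)
w = zeros(p)              # coefficients carried across chunks
true_w = randn(p)
for chunk in 1:100
    X = randn(1_000, p)                            # stand-in for a chunk read from disk
    y = Float64.(rand(1_000) .< sigmoid.(X * true_w))
    sgd_update!(w, X, y)
end
@show w true_w
```

With a suitable learning rate (or a decaying schedule), this converges toward the same coefficients a full-data fit would give, which is essentially what streaming/online regression packages do under the hood; it avoids the pitfalls of naively averaging separately fitted per-chunk models.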