I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns) but every table operation seems to take a huge amount of time to compile but is fast to run.
I have given an example with code below in which a 5 x 5000 DataFrame is generated and it gets stuck on the unpack line (line 3). When I run the same code for a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.
This is a crazy amount of compilation time for a seemingly simple task and I would like to know how I can reduce this. Thank you, Jack
using MLJ
using DataFrames
[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];
output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)
[line 2] @time df = DataFrames.DataFrame(arr, :auto)
output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)
[line 3] @time y, X = unpack(df, ==(:x1));
does not finish running
It's not unexpected that the Julia compiler struggles with very wide DataFrames, which have (potentially) heterogeneous column types. That said I'm not sure why this has to be a problem for this operation - I've checked with MLJ maintainers who can hopefully chime in.
In the meantime you can simply do
which is instantaneous (Note
select!will dropx1from your underlying data, if you want to copy data useselectinstead)