I am using h2o to carry out some modelling, and having tuned the model, i would now like it to be used to carry out a lot of predictions approx 6bln predictions/rows, per prediction row it needs 80 columns of data
The dataset I have already broken down the input dataset down so that it is in about 500 x 12 million row chunks each with the relevant 80 columns of data.
However to upload a data.table
that is 12 million by 80 columns to h2o takes quite a long time, and doing it 500 times for me is taking a prohibitively long time...I think its because it is parsing the object first before it is uploaded.
The prediction part is relatively quick in comparison....
Are there any suggestions to speed this part up? Would changing the number of cores help?
Below is an reproducible example of the issues...
# Load libraries
library(h2o)
library(data.table)
# start up h2o using all cores...
localH2O = h2o.init(nthreads=-1,max_mem_size="16g")
# create a test input dataset
temp <- CJ(v1=seq(20),
v2=seq(7),
v3=seq(24),
v4=seq(60),
v5=seq(60))
temp <- do.call(cbind,lapply(seq(16),function(y){temp}))
colnames(temp) <- paste0('v',seq(80))
# this is the part that takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))
#|======================================================================| 100%
# user system elapsed
#357.355 6.751 391.048
Since you are running H2O locally, you want to save that data as a file and then use:
This will have each thread read their parts of the file in parallel. If you run H2O on a separate server, then you would need to copy the data to a location that the server can read from (most people don't set the servers to read from the file system on their laptops).
as.h2o()
serially uploads the file to H2O. Withh2o.importFile()
, the H2O server finds the file and reads it in parallel.It looks like you are using version 2 of H2O. The same commands will work in H2Ov3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf