Datalab kernel crashes because of data set size. Is load balancing an option?


I am currently running the virtual machine with the highest memory, n1-highmem-32 (32 vCPUs, 208 GB memory).

My data set is around 90 GB, but it has the potential to grow in the future.

The data is stored in many zipped CSV files. I am loading the data into a sparse matrix in order to perform some dimensionality reduction and clustering.
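A simplified sketch of what the loading and clustering step looks like (the file layout, dtypes, and parameters here are illustrative, not the exact code):

```python
import glob

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

# Read each zipped CSV and convert it to a sparse block so the full
# dense data never has to sit in memory at once.
blocks = []
for path in glob.glob("data/*.csv.gz"):  # illustrative file layout
    chunk = pd.read_csv(path, compression="gzip")
    blocks.append(sparse.csr_matrix(chunk.to_numpy(dtype=np.float32)))

# Stack the blocks into one large sparse matrix.
X = sparse.vstack(blocks, format="csr")

# TruncatedSVD accepts sparse input directly, without densifying it.
reduced = TruncatedSVD(n_components=50).fit_transform(X)

# Cluster in the reduced space.
labels = MiniBatchKMeans(n_clusters=10).fit_predict(reduced)
```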


1 Answer

Answered by Chris Meyers

The Datalab kernel runs on a single machine. Since you are already running on a 208 GB RAM machine, you may have to switch to a distributed system to analyze the data.

If the operations you are doing on the data can be expressed as SQL, I'd suggest loading the data into BigQuery, which Datalab has a lot of support for. Otherwise you may want to convert your processing pipeline to use Dataflow (which has a Python SDK). Depending on the complexity of your operations, either of these may be difficult, though.
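For example, once the data is loaded into a BigQuery table, Datalab can push the heavy work (filtering, joins, aggregations) into BigQuery and pull only the much smaller result back into the notebook. A minimal sketch using the google.datalab.bigquery module bundled with Datalab (the project, dataset, and column names are placeholders):

```python
import google.datalab.bigquery as bq

# Run the aggregation inside BigQuery; only the result set
# comes back into the notebook's memory.
query = bq.Query('''
  SELECT feature_a, feature_b, COUNT(*) AS n
  FROM `my-project.my_dataset.my_table`   -- placeholder table
  GROUP BY feature_a, feature_b
''')

df = query.execute().result().to_dataframe()
```

For processing that does not map well to SQL, an Apache Beam pipeline run on Dataflow plays the same role: you express the transformation once and the service distributes it across workers, rather than relying on a single large VM.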