We are trying to run a Google Cloud Dataflow job in the cloud but we keep getting "java.lang.OutOfMemoryError: Java heap space".
We are trying to process 610 million records from a BigQuery table and write the processed records to 12 different outputs (main + 11 side outputs).
We have tried scaling up to 64 n1-standard-4 instances, but we still get the error.
The Xmx value on the VMs seems to be set at ~4 GB (-Xmx3951927296), even though the instances have 15 GB of memory. Is there any way of increasing the Xmx value?
The job ID is 2015-06-11_21_32_32-16904087942426468793
You can't directly set the heap size. Dataflow, however, scales the heap size with the machine type. You can pick a machine with more memory by setting the flag "--machineType". The heap size should increase linearly with the total memory of the machine type.
Dataflow deliberately limits the heap size to avoid negatively impacting the shuffler.
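To get a feel for what a larger machine type might buy you, here is a rough sketch that extrapolates from the numbers in the question. It assumes the heap stays a fixed fraction of machine memory (my reading of "linearly"; the real ratio may differ), and the class name, method name, and the memory sizes for n1-highmem-4 and n1-standard-8 are not from the question. Treat the results as estimates only.

```java
// Rough heap estimate for other machine types. ASSUMPTION: the heap is a
// fixed fraction of machine memory, taken from the one data point in the
// question. HeapEstimate and estimateHeapBytes are hypothetical names.
final class HeapEstimate {
    // Observed in the question: -Xmx3951927296 on an n1-standard-4 (15 GB).
    static final long OBSERVED_XMX_BYTES = 3_951_927_296L;
    static final double OBSERVED_MACHINE_GB = 15.0;

    /** Estimated heap in bytes for a machine with the given memory in GB. */
    static long estimateHeapBytes(double machineGb) {
        return Math.round(OBSERVED_XMX_BYTES * (machineGb / OBSERVED_MACHINE_GB));
    }

    public static void main(String[] args) {
        // Memory sizes per machine type, in GB (assumed, not from the question).
        System.out.println("n1-standard-4 (15 GB): " + estimateHeapBytes(15.0));
        System.out.println("n1-highmem-4  (26 GB): " + estimateHeapBytes(26.0));
        System.out.println("n1-standard-8 (30 GB): " + estimateHeapBytes(30.0));

        // Approximate heap limit of the JVM running this code, for comparison.
        System.out.println("This JVM's max heap:   " + Runtime.getRuntime().maxMemory());
    }
}
```

Logging `Runtime.getRuntime().maxMemory()` from inside your own code on a worker is one way to confirm what heap the JVM actually received after changing the machine type.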
Is your code explicitly accumulating values from multiple records in memory? Do you expect 4GB to be insufficient for any given record?
Dataflow's memory requirements should scale with the size of individual records and the amount of data your code is buffering in memory. Dataflow's memory requirements shouldn't increase with the number of records.
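To illustrate the difference between buffering and not buffering, here is a minimal plain-Java sketch (deliberately not using the Dataflow SDK, and not taken from your job). All names are hypothetical. The first method copies every record into a list before using it, so its memory grows with the record count; the second keeps only a running total, so its memory stays constant no matter how many records pass through.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical example: the same aggregation with and without buffering.
final class BufferingVsStreaming {
    /** Buffers every value first: memory use grows with the number of records. */
    static long sumBuffered(Iterator<Long> records) {
        List<Long> buffer = new ArrayList<>();
        while (records.hasNext()) {
            buffer.add(records.next());
        }
        long total = 0;
        for (long value : buffer) {
            total += value;
        }
        return total;
    }

    /** Keeps only a running total: memory use is independent of record count. */
    static long sumStreaming(Iterator<Long> records) {
        long total = 0;
        while (records.hasNext()) {
            total += records.next();
        }
        return total;
    }

    /** Stand-in record source that lazily yields the values 1..count. */
    static Iterator<Long> records(final long count) {
        return new Iterator<Long>() {
            private long next = 1;

            @Override
            public boolean hasNext() {
                return next <= count;
            }

            @Override
            public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return next++;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(sumBuffered(records(1000)));
        System.out.println(sumStreaming(records(1000)));
    }
}
```

If your code has a pattern like the first method, for example collecting elements into a list or map held in a field, that is the kind of accumulation that can exhaust the heap as the input grows.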