Google Compute Engine supports RAM disks - see here.
I am developing a project that will reuse existing code which manipulates local files.
For scalability, I am going to use Dataflow.
The files are in GCS, and I will send them to the Dataflow workers for manipulation.
I was thinking of improving performance by using RAM disks on the workers: copy the files from GCS directly to the RAM disk and manipulate them there.
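To illustrate the idea, on a Linux worker this would amount to something like the following (the bucket name, paths, and size are only placeholders for what I have in mind):

```shell
# Create a 2 GB tmpfs-backed RAM disk (requires root on the worker)
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk

# Copy the input files from GCS into RAM, run the existing
# local-file code against them, then write the results back to GCS
gsutil cp gs://my-bucket/input/* /mnt/ramdisk/
# ... run the existing local-file manipulation code on /mnt/ramdisk ...
gsutil cp /mnt/ramdisk/output/* gs://my-bucket/output/
```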
I have failed to find any example of such a capability.
Is this a valid solution, or should I avoid this kind of "trick"?
It is not possible to use a RAM disk as the disk type for the workers, since a RAM disk is set up at the OS level. The only disk types available for the workers are standard persistent disks (pd-standard) and SSD persistent disks (pd-ssd). Of these, SSD is definitely faster. You can also try adding more workers or using a faster CPU to process your data faster.
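For example, you can request SSD persistent disks for the workers by setting the worker disk type pipeline option when launching the job. A sketch for a Beam Python pipeline on Dataflow, where the project, region, zone, bucket, and script names are placeholders:

```shell
# Launch a Beam Python pipeline on Dataflow with pd-ssd worker disks.
# The --worker_disk_type value is a full Compute Engine disk type URL.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp \
  --worker_disk_type=compute.googleapis.com/projects/my-project/zones/us-central1-a/diskTypes/pd-ssd
```

For a Java pipeline the equivalent option is `--workerDiskType`, with the same URL format.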
For comparison, I tried running the same job with standard and SSD disks, and it turned out to be about 13% faster with SSD than with the standard disk. But take note that I only tested the quick start from the Dataflow docs.
Using SSD (3m 54s elapsed time):
Using Standard Disk (4m 29s elapsed time):