Why does the Google Cloud click-to-deploy Hadoop workflow require picking a size for the local persistent disk even if you plan to use the Hadoop connector for Cloud Storage? The default size is 500 GB. I was thinking that if it does need some disk, it should be much smaller. Is there a recommended persistent disk size when using the Cloud Storage connector with Hadoop on Google Cloud?
"Deploying Apache Hadoop on Google Cloud Platform
The Apache Hadoop framework supports distributed processing of large data sets across a clusters of computers.
Hadoop will be deployed in a single cluster. The default deployment creates 1 master VM instance and 2 worker VMs, each having 4 vCPUs, 15 GB of memory, and a 500-GB disk. A temporary deployment-coordinator VM instance is created to manage cluster setup.
The Hadoop cluster uses a Cloud Storage bucket as its default file system, accessed through the Google Cloud Storage connector. Visit the Cloud Storage browser to find or create a bucket that you can use in your Hadoop deployment.
Apache Hadoop on Google Compute Engine, Click to Deploy form fields:
- ZONE: us-central1-a
- WORKER NODE COUNT
- CLOUD STORAGE BUCKET: Select a bucket
- HADOOP VERSION: 1.2.1
- MASTER NODE DISK TYPE: Standard Persistent Disk
- MASTER NODE DISK SIZE (GB)
- WORKER NODE DISK TYPE: Standard Persistent Disk
- WORKER NODE DISK SIZE (GB)"
The three big uses of persistent disks (PDs) are:

- HDFS block storage, if you use HDFS rather than the Cloud Storage connector as your file system
- logs written by the Hadoop daemons and your applications
- intermediate (shuffle) data that MapReduce spills to local disk

Due to the layout of directories, persistent disks will also be used for other items such as job data (JARs, auxiliary data distributed with the application, etc.), but those could just as easily use the boot PD.
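To make that directory layout concrete, here is a minimal sketch against the Hadoop 1.x Configuration API showing the properties that typically decide what lands on an attached PD. The /mnt/pd1 mount point and the exact values are assumptions for illustration, not what the deployment actually writes:

```java
import org.apache.hadoop.conf.Configuration;

public class PdLayoutSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Scratch space: MapReduce spills shuffle data here when it
        // cannot hold intermediate output in memory.
        conf.set("mapred.local.dir", "/mnt/pd1/hadoop/mapred/local"); // hypothetical mount

        // HDFS block storage: only matters if you use HDFS at all
        // instead of the Cloud Storage connector as the file system.
        conf.set("dfs.data.dir", "/mnt/pd1/hadoop/dfs/data"); // hypothetical mount

        // Miscellaneous temporary data; often small enough for the boot PD.
        conf.set("hadoop.tmp.dir", "/mnt/pd1/hadoop/tmp"); // hypothetical mount

        for (String key : new String[] {"mapred.local.dir", "dfs.data.dir", "hadoop.tmp.dir"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}
```

(Daemon log locations are set separately, via HADOOP_LOG_DIR in hadoop-env.sh, rather than through Configuration.)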
Bigger persistent disks are almost always better due to the way GCE scales IOPS and throughput with disk size [1]. 500 GB is probably a good starting point for profiling your applications and usage. If you don't use HDFS, find that your applications don't log much, and don't spill to disk when shuffling, then a smaller disk can probably work well.
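As a back-of-the-envelope illustration of why size matters: if a standard PD's performance ceiling scales linearly with capacity, halving the disk halves the ceiling. The per-GB rates below are placeholders, not authoritative figures; see [1] for the real numbers:

```java
public class PdScalingSketch {
    public static void main(String[] args) {
        // Placeholder per-GB rates for illustration only; the actual
        // figures are in the GCE persistent disk documentation [1].
        final double READ_IOPS_PER_GB = 0.75;
        final double THROUGHPUT_MBPS_PER_GB = 0.12;

        for (int sizeGb : new int[] {100, 250, 500}) {
            System.out.printf("%4d GB -> ~%5.0f read IOPS, ~%4.1f MB/s%n",
                    sizeGb,
                    sizeGb * READ_IOPS_PER_GB,
                    sizeGb * THROUGHPUT_MBPS_PER_GB);
        }
    }
}
```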
If you find that you don't actually want or need any persistent disk, then bdutil [2] also exists as a command-line script that can create clusters with more configurability and customizability.
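For completeness: when the Cloud Storage connector is your default file system, jobs address gs:// paths through the ordinary Hadoop FileSystem API, and none of that data touches a PD. Here is a minimal sketch, assuming the connector JAR and its auth/project settings are already in place (as they are on a deployed cluster); the bucket name is made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsListSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the connector for the gs:// scheme; on a click-to-deploy
        // cluster this is already set in core-site.xml.
        conf.set("fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");

        // "my-bucket" is a placeholder; use a bucket from the Cloud Storage browser.
        Path root = new Path("gs://my-bucket/");
        FileSystem fs = root.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(root)) {
            System.out.println(status.getPath());
        }
    }
}
```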