I deployed Spark (1.3.1) with yarn-client on a Hadoop (2.6) cluster using bdutil. By default, the instances are created with ephemeral external IPs, and so far Spark has worked fine. Due to some security concerns, and assuming the cluster is accessed internally only, I removed the external IPs from the instances; after that, spark-shell will not even run. It seems it cannot communicate with YARN/Hadoop and just gets stuck indefinitely. Only after I added the external IPs back did spark-shell start working properly.
My question is: are external IPs on the nodes required to run Spark over YARN, and why? If yes, are there any concerns regarding security, etc.? Thanks!
Short Answer
You need external IP addresses to access GCS, and default bdutil settings set GCS as the default Hadoop filesystem, including for control files. Use
./bdutil -F hdfs ... deploy
to use HDFS as the default instead. Security shouldn't be a concern when using external IP addresses unless you've added overly permissive rules to the firewall rules in your GCE network config.
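If you want to sanity-check those rules, listing them is enough; for example (assuming you have the gcloud CLI installed and the cluster runs on the default network):
# Review the project's firewall rules; anything opening 0.0.0.0/0 beyond the
# stock default-allow-* rules deserves a second look.
gcloud compute firewall-rules list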
EDIT: At the moment there appears to be a bug where we set
spark.eventLog.dir
to a GCS path even if the default_fs is hdfs. I filed https://github.com/GoogleCloudPlatform/bdutil/issues/35 to track this. In the meantime, just manually edit /home/hadoop/spark-install/conf/spark-defaults.conf on your master (you might need
sudo -u hadoop vim.tiny /home/hadoop/spark-install/conf/spark-defaults.conf
to have edit permissions on it) to set spark.eventLog.dir to
hdfs:///spark-eventlog-base
or something else in HDFS, and run
hadoop fs -mkdir -p hdfs:///spark-eventlog-base
to get it working.
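If you'd rather script that change, here's a minimal sketch of the same two steps (run on the master; it assumes the default bdutil install path and that spark-defaults.conf already contains a spark.eventLog.dir line to replace):
# Point spark.eventLog.dir at HDFS instead of GCS (adjust the path if you prefer another location).
sudo -u hadoop sed -i \
  's|^spark\.eventLog\.dir.*|spark.eventLog.dir hdfs:///spark-eventlog-base|' \
  /home/hadoop/spark-install/conf/spark-defaults.conf
# Create the event-log directory in HDFS so Spark can write to it.
hadoop fs -mkdir -p hdfs:///spark-eventlog-base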
Long Answer
By default, bdutil also configures Google Cloud Storage as the "default Hadoop filesystem", which means that control files used by Spark and YARN require access to Google Cloud Storage. Additionally, external IPs are required in order to access Google Cloud Storage.
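You can check what your cluster currently treats as the default filesystem from any node; for example (assuming the Hadoop binaries are on your PATH):
# Prints the default filesystem URI, e.g. a gs:// bucket or an hdfs://<master> address.
hdfs getconf -confKey fs.defaultFS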
I did manage to partially repro your case after manually configuring intra-network SSH; during startup, the Spark shell gets stuck as soon as it calls
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start
which, as expected, tries to contact Google Cloud Storage and fails because there's no GCS access without external IPs.
To get around this, you can simply use
-F hdfs
when creating your cluster to use HDFS as your default filesystem; in that case everything should work intra-cluster even without external IP addresses. In that mode, you can still continue to use GCS whenever you have external IP addresses assigned, by specifying full
gs://bucket/object
paths as your Hadoop arguments. However, note that as long as you've removed the external IP addresses, you won't be able to use GCS unless you also configure a proxy server and funnel all data through it; the GCS config for that is fs.gs.proxy.address.
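For example (a sketch only; your-bucket and proxy-host:3128 below are placeholders, not values from your deployment):
# With external IPs attached, explicit gs:// paths keep working alongside the HDFS default:
hadoop fs -ls gs://your-bucket/some/path
# Without external IPs, GCS is only reachable if you funnel traffic through a proxy:
hadoop fs -D fs.gs.proxy.address=proxy-host:3128 -ls gs://your-bucket/some/path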
In general, there's no need to worry about security simply because your nodes have external IP addresses, unless you've opened up new permissive rules in your "default" network firewall rules in Google Compute Engine.