Spark Master does not start with DSE 4.7 and OpsCenter 5.1.3

I recently upgraded from DSE 4.6.3 to 4.7, and now I am having trouble running Spark. The problem seems to be that the Spark Master is not configured properly. I use OpsCenter 5.1.3 and started a three-node Analytics cluster. Strangely, the nodes initially had SPARK_ENABLED=0, and I had to set it to 1 manually. Even so, the Spark Master is still not configured properly. In /var/log/cassandra/system.log, I get a long stream of:

INFO  [SPARK-WORKER-INIT-0] 2015-06-13 21:59:54,027  SparkWorkerRunner.java:49 - Spark Master not ready at (no configured master)
INFO  [SPARK-WORKER-INIT-0] 2015-06-13 21:59:55,028  SparkWorkerRunner.java:49 - Spark Master not ready at (no configured master)
INFO  [SPARK-WORKER-INIT-0] 2015-06-13 21:59:56,028  SparkWorkerRunner.java:49 - Spark Master not ready at (no configured master)

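For reference, this is roughly how I enabled Spark on each node (a packaged install; the file location is my assumption and may differ for a tarball install):

    # /etc/default/dse (packaged install)
    SPARK_ENABLED=1    # was 0 after the upgrade to 4.7

    # restart DSE so the node comes back up as an Analytics/Spark node
    sudo service dse restart
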
When I try to run dse spark, I get the following error:

java.io.IOException: Spark Master address cannot be retrieved. This really should not be happening with DSE 4.7+ unless your cluster is over 50% down or booted up in the last minute.
    at com.datastax.bdp.plugin.SparkPlugin.getMasterAddress(SparkPlugin.java:257)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
    at com.sun.jmx.mbeanserver.StandardMBe

My Analytics DC has been up for a few days, and no nodes are booting. This issue has been blocking development for the last few days, and I am considering downgrading back to DSE 4.6.3 just so I can run my Spark jobs again. Any help whatsoever is appreciated.
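
For completeness, these are the basic liveness checks I ran (just the stock commands, output omitted):

    # Analytics nodes should be reported Up/Normal (UN)
    nodetool status

    # DSE's view of the ring, including the workload (Analytics) per node
    dsetool ring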

UPDATE:

I am looking into the condition that at least 50% of the Analytics nodes must be up for the Spark Master to start. After examining system.log on DSE startup, I noticed that Gossip still seems to think some old nodes are part of the cluster and are DOWN. For instance:

INFO  [GossipStage:1] 2015-06-14 03:18:05,587  Gossiper.java:968 - InetAddress /172.31.23.17 is now DOWN
INFO  [GossipStage:1] 2015-06-14 03:18:05,614  Gossiper.java:968 - InetAddress /172.31.16.58 is now DOWN
INFO  [GossipStage:1] 2015-06-14 03:18:05,647  Gossiper.java:968 - InetAddress /172.31.24.25 is now DOWN
INFO  [GossipStage:1] 2015-06-14 03:18:05,687  Gossiper.java:968 - InetAddress /172.31.24.147 is now DOWN

These are nodes that I took offline earlier. I have purged them from the system.peers table, but Gossip still seems to count them as part of the cluster. The phantom presence of these nodes would push the cluster past 50% down. However, purging the persisted gossip state requires a full cluster shutdown.
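
For the record, this is roughly how I inspected the gossip state and purged the stale peers (the IPs are the phantom nodes from the log above; treat the exact statements as a sketch rather than a verified fix):

    # dump what gossip still knows about, including old/dead endpoints
    nodetool gossipinfo

    # delete the phantom entries from system.peers on each live node
    cqlsh -e "DELETE FROM system.peers WHERE peer = '172.31.23.17';"
    cqlsh -e "DELETE FROM system.peers WHERE peer = '172.31.16.58';"
    cqlsh -e "DELETE FROM system.peers WHERE peer = '172.31.24.25';"
    cqlsh -e "DELETE FROM system.peers WHERE peer = '172.31.24.147';"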
