Cloudera Quick Start VM lacks Spark 2.0 or greater


To test and learn current Spark functionality, developers need a recent Spark version, since many APIs and methods from before version 2.0 are deprecated or removed in newer releases. This poses a real challenge: developers are forced to install Spark manually, which wastes a considerable amount of development time.

How do I use a later version of Spark on the Quickstart VM?

There is 1 answer

swapnil shashank On

No one should waste the setup time that I wasted, so here is the solution.

SPARK 2.2 Installation Setup on Cloudera VM

Step 1: Download the QuickStart VM from the link:

Prefer the VMware image, as it is the easiest to use; that said, all the platform options are viable.

The tar file is around 5.4 GB. You need to provide a business email address, as the form does not accept personal email addresses.

Step 2: The virtual machine requires around 8 GB of RAM; please allocate sufficient memory to avoid performance glitches.

Step 3: Open a terminal and switch to the root user:

su root
 password: cloudera

Step 4: Cloudera provides Java 1.7.0_67, which is old and does not meet our needs. To avoid Java-related exceptions, install Java 8 with the following commands:

Downloading Java:

wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz

Switch to the /usr/java/ directory with the “cd /usr/java/” command.

Copy the downloaded Java tar file into the /usr/java/ directory.

Untar it with “tar -zxvf jdk-8u131-linux-x64.tar.gz” (note the filename is jdk-8u131, matching the download above).

Open the profile file with the command “vi ~/.bash_profile”

Export JAVA_HOME, pointing at the new Java directory:

export JAVA_HOME=/usr/java/jdk1.8.0_131

Save and Exit.

To apply the change in the current shell, run:

source ~/.bash_profile
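The profile edit above can be tried out safely first against a temporary file. This is a minimal sketch: the PATH line is my addition (not in the steps above) so that “java -version” picks up the new JDK; once it looks right, append the same export lines to the real ~/.bash_profile.

```shell
# Stage the JAVA_HOME export in a temp profile before touching ~/.bash_profile.
# /usr/java/jdk1.8.0_131 is the directory the jdk-8u131 tarball unpacks to.
profile=$(mktemp)
echo 'export JAVA_HOME=/usr/java/jdk1.8.0_131' >> "$profile"
# My addition: put the new JDK's bin/ first on PATH so "java" resolves to it.
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> "$profile"
source "$profile"
echo "$JAVA_HOME"
```

If the echoed path is /usr/java/jdk1.8.0_131, append the same two export lines to ~/.bash_profile.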

The Cloudera VM ships with Spark 1.6 by default. However, the 1.6 APIs are old and no longer match production environments, so we need to download and manually install Spark 2.2.

Switch to /opt/ directory with the command:

cd /opt/

Download spark with the command:

wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Untar the spark tar with the following command:

tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz

We need to define some environment variables as default settings. Open the configuration file with the following command:

vi /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh

Paste the following configurations in the file:

SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m

Save and exit.
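If you want to sanity-check these settings before editing the real conf file, here is a small sketch that stages them in a temp file and confirms they parse; copy the same lines into /opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh afterwards.

```shell
# Stage the spark-env.sh settings in a temp file and verify they source cleanly.
conf=$(mktemp)
cat > "$conf" <<'EOF'
SPARK_MASTER_IP=192.168.50.1
SPARK_EXECUTOR_MEMORY=512m
SPARK_DRIVER_MEMORY=512m
SPARK_WORKER_MEMORY=512m
SPARK_DAEMON_MEMORY=512m
EOF
source "$conf"
echo "$SPARK_MASTER_IP $SPARK_EXECUTOR_MEMORY"
```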

Start Spark with the following command:

/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-all.sh

Export SPARK_HOME:

export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7/

Change the permissions of the /tmp/hive directory so it is writable (Spark SQL uses it as its Hive scratch directory):

chmod 777 -R /tmp/hive

Now try “spark-shell” (or the full path /opt/spark-2.2.0-bin-hadoop2.7/bin/spark-shell if plain “spark-shell” is not on your PATH); it should work.
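One caveat: the SPARK_HOME export above only lasts for the current shell. A sketch of making it permanent, again staged in a temp profile first; the PATH line is my addition so that plain “spark-shell” resolves from anywhere:

```shell
# Stage SPARK_HOME in a temp profile; append the same lines to ~/.bash_profile
# on the VM so the setting survives new shells.
profile=$(mktemp)
echo 'export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7' >> "$profile"
# My addition: put Spark's bin/ on PATH so "spark-shell" is found by name.
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> "$profile"
source "$profile"
echo "$SPARK_HOME/bin/spark-shell"
```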