How to read and write Hive tables from Apache Spark on IBM Analytics Engine

I would like to use Spark on IBM Analytics Engine from a Jupyter notebook in Watson Studio to read and write Hive tables, but it isn't clear how to do this from the Spark documentation alone. The reason it isn't clear is that IBM Analytics Engine comes pre-configured with Spark and does not provide root access.

I have found some posts online for generic Hadoop that describe creating a hive-site.xml with the metastore location, but it isn't clear how this translates to the IBM Analytics Engine environment. For example: Unable to write data on hive using spark
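
As far as I can tell, the generic approach in those posts amounts to pointing Spark at an existing Hive metastore, along these lines (the thrift host and port below are placeholders from those posts, not Analytics Engine values):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    // Placeholder metastore URI taken from the generic examples
    hc.setConf("hive.metastore.uris", "thrift://metastore-host:9083")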

1 Answer

Accepted answer, from Chris Snow:

Here is an example using Scala.

Skip the first two steps if you are running Spark directly on the Analytics Engine cluster.

  1. Create a Watson Studio project and associate it with an Analytics Engine service.

  2. Create a Scala notebook that uses the Analytics Engine Spark service.

  3. Enter the following Scala code to create a HiveContext and list the Hive databases, if any (a Spark 2.x equivalent using SparkSession appears in the notes after these steps):

    import org.apache.spark.sql.hive.HiveContext
    val hc = new HiveContext(sc)

    // uncomment and adjust the next line if you are using
    // Compose mysql for the hive metastore

    /*
       hc.setConf("hive.metastore.warehouse.dir",
         "mysql://admin:<password>@<host>:32023/compose")
    */

    import hc.implicits._
    val df = hc.sql("show databases")
    df.show
  4. Create a DataFrame from static values for testing:
    val test_df = Seq(
      (8, "bat"),
      (64, "mouse"),
      (-27, "horse")
    ).toDF("number", "word")
  5. Write the DataFrame to a Hive table (a variant with an explicit format is shown in the notes after these steps):
    test_df.write.mode("overwrite").saveAsTable("src")
  6. Now verify that you are able to read the table from Spark (the notes after these steps show the same read via the table API):
    val read_df = hc.sql("select * from src")
    read_df.show
  7. Open a Hive session and verify that you can query the table from Hive:
    select * from src;
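
A few notes on the steps above:

Spark 2.x (step 3): HiveContext is deprecated as of Spark 2.0. If your cluster runs Spark 2.x, a rough equivalent using SparkSession looks like the following sketch (not Analytics Engine-specific):

    import org.apache.spark.sql.SparkSession

    // Build (or reuse) a session with Hive support enabled; in many
    // notebook environments a Hive-enabled `spark` session already exists.
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()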
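
The write (step 5): saveAsTable also accepts an explicit format and a database-qualified table name if you want more control. A minimal sketch; the Parquet format and the default database here are illustrative choices, not requirements:

    // Write the test DataFrame as a Parquet-backed table in the
    // `default` database, rather than relying on the defaults
    test_df.write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("default.src")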
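
The read (step 6): the same read can be done without SQL, using the table API on the same hc from step 3:

    // Equivalent to "select * from src", via the DataFrame API
    val read_df2 = hc.table("src")
    read_df2.show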