How to persist a DataFrame to a Hive table?


I use CentOS on the Cloudera QuickStart VM. I created an sbt-managed Spark application following the other question How to save DataFrame directly to Hive?.

build.sbt

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.2"

I'd like to use a DataFrame as a Hive table as follows:

 recordDF.registerTempTable("mytempTable")
 hiveContext.sql("create table productstore as select * from mytempTable"); 

I noticed that I am getting the error:

The root scratch dir: /tmp/hive should be writable. Current permissions are: rwx------

I followed other questions and did chmod 777 on /tmp/hive in HDFS.

It then occurred to me that Spark was using the local filesystem /tmp/hive, not the one in HDFS.

So I did the chmod on the local filesystem as well.

Now I am getting the error:

org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/productstore is not a directory or unable to create one)

I'd like to store the DataFrame in the HDFS Hive warehouse.

1 Answer

Jacek Laskowski (best answer)

There are two issues in play here.

Issue #1 - Permissions

On CentOS (or any other Unix flavour, e.g. another Linux distribution or macOS), just do the following:

chmod -R 777 /tmp/hive

(The directory should be writable by anyone on the OS anyway, since it is a temporary directory.)
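If you are unsure which filesystem Spark is actually touching (local vs HDFS, as in your question), a quick check from the Spark shell could look like the following sketch. This is not part of the fix itself; it assumes the default Hadoop configuration is on the classpath and that /tmp/hive already exists on both filesystems:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()

// Look at /tmp/hive on both the local filesystem and the default (HDFS) filesystem
Seq(FileSystem.getLocal(conf), FileSystem.get(conf)).foreach { fs =>
  val status = fs.getFileStatus(new Path("/tmp/hive"))
  println(s"${fs.getUri} /tmp/hive -> ${status.getPermission}")
}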

Issue #2 - Hive Metastore Directory

Since you work with the Cloudera QuickStart VM, you use a pre-2.0 version of Spark (1.5.2, per your build.sbt).

The issue is that no path is specified when persisting the DataFrame to a Hive table.

The directory /user/hive/warehouse is used by default. To avoid writing there, define the path option when saving to a Hive table:

df.write.option("path", "[path-here]").saveAsTable("tableName")
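For context, here is a minimal sketch of how that line fits into the HiveContext-based API you already use. The sample data and the HDFS path below are placeholders, not taken from your setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("persist-to-hive"))
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Placeholder data standing in for your recordDF
val recordDF = sc.parallelize(Seq(("p1", 10), ("p2", 20))).toDF("product", "qty")

// Writing with an explicit HDFS path keeps the data out of the local
// file:/user/hive/warehouse directory from the error message.
recordDF.write
  .option("path", "hdfs:///user/hive/warehouse/productstore")
  .saveAsTable("productstore")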

As of Spark 2.0, the above line will write to a local Hive metastore (using Derby) in the current directory, under spark-warehouse, as specified by the spark.sql.warehouse.dir Spark property.

To upgrade, define your Spark dependencies using 2.0.2 (not 1.5.2):

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"

(That single line is all you really need for Spark SQL with Hive support.)

You can then use hive.metastore.warehouse.dir or spark.sql.warehouse.dir to set the Spark warehouse and point it to the HDFS location where the other Hive tables live.
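For completeness, a minimal Spark 2.x sketch of that setup might look like this. It is only an illustration: it assumes the spark-hive module is also on the classpath (required for enableHiveSupport) and that the usual /user/hive/warehouse HDFS location applies; the sample data is made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("persist-to-hive")
  .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

// Placeholder data standing in for your recordDF
val recordDF = Seq(("p1", 10), ("p2", 20)).toDF("product", "qty")

// With the warehouse pointed at HDFS, the table lands next to the
// other Hive tables instead of a local spark-warehouse directory.
recordDF.write.mode("overwrite").saveAsTable("productstore")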