I use CentOS on the Cloudera QuickStart VM. I created an sbt-managed Spark application following another question, How to save DataFrame directly to Hive?.
build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.2"
I'd like to use a DataFrame as a Hive table as follows:
recordDF.registerTempTable("mytempTable")
hiveContext.sql("create table productstore as select * from mytempTable");
I noticed that I am getting the error:
The root scratch dir: /tmp/hive should be writable. Current permissions are: rwx------
Following other questions, I set chmod 777 on /tmp/hive in HDFS.
It then occurred to me that Spark was using the local filesystem's /tmp/hive, so I did the chmod on the local filesystem as well.
Now I am getting this error:
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/productstore is not a directory or unable to create one)
I'd like to store a DataFrame in HDFS hive warehouse.
There are two issues in play here.
Issue #1 - Permissions
On CentOS (or any other Unix flavour, e.g. another Linux distribution or macOS), just make the local /tmp/hive directory writable for everyone, e.g. chmod -R 777 /tmp/hive (it should have been writable for anyone on the OS anyway, since it is a temporary directory).
Issue #2 - Hive Metastore Directory
Since you work with the Cloudera QuickStart VM, you use a pre-2.0 version of Spark (1.6.3 perhaps?).
The issue is due to no path being specified when persisting a DataFrame to a Hive table. The directory /user/hive/warehouse is used by default; to avoid writing to that directory, define the path option when saving to a Hive table, using either the option method or save with the path option (see the sketch below).
with the path option.As of Spark 2.0, the above line will write to a local Hive metastore (using Derby) which is in the current directory as
spark-warehouse
as specified byspark.sql.warehouse.dir
To upgrade, define your Spark dependencies using 2.0.2 (not 1.5.2); just one single line, shown below, is all you really need to have Spark SQL with Hive support.
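Based on the spark-hive line already in your build.sbt, that single dependency would presumably look like this (it pulls in spark-core and spark-sql transitively):

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.2"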
You can then use hive.metastore.warehouse.dir or spark.sql.warehouse.dir to set the Spark warehouse and point it to the HDFS location where the other Hive tables live.
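A minimal sketch of that, assuming a Spark 2.x SparkSession with Hive support (the warehouse URI below is only an example for a typical HDFS layout):

import org.apache.spark.sql.SparkSession

// Build a Hive-enabled session and point the warehouse at HDFS instead of the local spark-warehouse dir.
val spark = SparkSession.builder()
  .appName("productstore")
  .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse") // example HDFS warehouse location
  .enableHiveSupport()
  .getOrCreate()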