spark ETL and spark thrift server

388 views Asked by At

Some details:

  • Spark SQL (version 3.2.1)
  • Driver: Hive JDBC (version 2.3.9)

ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads

BI tool is connect via odbc driver

After activating Spark Thrift Server I'm unable to run pyspark script using spark-submit as they both use the same metastore_db

error:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3acaa384, see the next exception for details.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
        ... 140 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /tmp/metastore_db.

I need to be able to run PySpark (Spark ETL) while having spark thrift server up for BI tool queries. Any workaround for it?

Thanks!

1

There are 1 answers

0
Luis Estrada On BEST ANSWER

In my case the solution was to move the metastore_db to a database server like MySql (in my case) or Postgresql.

You will have to configure $SPARK_HOME/conf/hive-site.xml and include your jdbc driver in $SPARK_HOME/jars path

hive-site.xml example for MySQL connection

<configuration>
  <!-- Hive Execution Parameters -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://XXX.XXX.XXX.XXX:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=FALSE&amp;autoReconnect=true&amp;nullCatalogMeansCurrent=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>YOUR_USER</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>YOUR_PASSWORD</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.http.endpoint</name>
  <value>cliservice</value>
</property>


<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
<description/>
</property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateTables</name>
    <value>true</value>
  </property>

</configuration>