Netezza Drivers not available in Spark (Python Notebook) in DataScienceExperience


I have project code in a Python notebook that ran fine when Spark was hosted in Bluemix.

We are running the following code to connect to Netezza (on premises), which worked fine in Bluemix:

VT = sqlContext.read.format('jdbc').options(
    url='jdbc:netezza://169.54.xxx.x:xxxx/BACC_PRD_ISCNZ_GAPNZ',
    user='XXXXXX',
    password='XXXXXXX',
    dbtable='GRACE.CDVT_LIVE_SPARK',
    driver='org.netezza.Driver').load()

However, after migrating to Data Science Experience, we are getting the following error. I have established the Secure Gateway and it's working fine, but this code does not run. I think the issue is with the Netezza driver. If that is the case, is there a way to explicitly import the class/driver so the above code can be executed? Please advise on how we can address the issue.

Error Message:


/usr/local/src/spark20master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61     def deco(*a, **kw):
62         try:
---> 63             return f(*a, **kw)
64         except py4j.protocol.Py4JJavaError as e:
65             s = e.java_exception.toString()

/usr/local/src/spark20master/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317                 raise Py4JJavaError(
318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
320             else:
321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o212.load.
: java.lang.ClassNotFoundException: org.netezza.driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:844)
at java.lang.ClassLoader.loadClass(ClassLoader.java:823)
at java.lang.ClassLoader.loadClass(ClassLoader.java:803)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at scala.Option.foreach(Option.scala:257)

There are 3 answers

Roland Weber

Notebooks in Bluemix and notebooks in DSX (Data Science Experience) currently use the same backend, so they have access to the same pre-installed drivers. Netezza isn't among them. As Chris Snow pointed out, users can install additional JARs and Python packages into their service instances.

You probably created a new service instance for DSX and have not yet installed the user JARs and packages that the old one had. It's a one-time setup, and therefore easy to forget when you've been using the same instance for a while. Execute these commands in a Python notebook on the old Bluemix instance to check for user-installed items:

!ls -lF ~/data/libs
!pip freeze

Then install the missing things into your new instance on DSX.
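For example, a sketch of what that one-time setup might look like in a notebook on the new instance (the package name below is a placeholder for whatever `pip freeze` reported on the old instance):

# ensure the user JAR directory exists (the path DSX scans for user JARs)
!mkdir -p ${HOME}/data/libs
# reinstall each Python package from the old instance; 'some-package' is a placeholder
!pip install --user some-package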

Chris Snow

You can install a JAR file by adding a cell that uses an exclamation mark to run a Unix tool that downloads the file; in this example, wget:

!wget https://some.public.host/yourfile.jar -P ${HOME}/data/libs

After downloading the file you will need to restart your kernel.

Note that this approach assumes your JAR file is publicly available on the Internet.
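After the kernel restart, one way to sanity-check that the driver is on the classpath is to ask the JVM for the class directly. This is a sketch using PySpark's py4j gateway (`sc` is the notebook's SparkContext); it raises a ClassNotFoundException if the JAR was not picked up:

# should return the class object once the Netezza JAR is in ~/data/libs
sc._jvm.java.lang.Class.forName("org.netezza.Driver")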

charles gomes

There is another way to connect to Netezza: the ingest connector, which is enabled by default in DSX.

http://datascience.ibm.com/docs/content/analyze-data/python_load.html

from ingest import Connectors
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

NetezzaloadOptions = {
    Connectors.Netezza.HOST              : 'hostorip',
    Connectors.Netezza.PORT              : 'port',
    Connectors.Netezza.DATABASE          : 'databasename',
    Connectors.Netezza.USERNAME          : 'xxxxx',
    Connectors.Netezza.PASSWORD          : 'xxxx',
    Connectors.Netezza.SOURCE_TABLE_NAME : 'tablename'}

NetezzaDF = sqlContext.read.format("com.ibm.spark.discover").options(**NetezzaloadOptions).load()

NetezzaDF.printSchema()
NetezzaDF.show()
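Once it loads, NetezzaDF behaves like any other Spark DataFrame. For instance, a small follow-up sketch (the view name is arbitrary):

# register the DataFrame so it can be queried with Spark SQL
NetezzaDF.createOrReplaceTempView("netezza_table")
sqlContext.sql("SELECT COUNT(*) FROM netezza_table").show()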

Thanks,

Charles.