I am using Spark 1.3.1. In PySpark, I have created a DataFrame from an RDD and registered it as a table, something like this:
dataLen=sqlCtx.createDataFrame(myrdd, ["id", "size"])
dataLen.registerTempTable("tbl")
At this point everything is fine: I can run a "select" query against "tbl", for example "select size from tbl where id='abc'".
Then in a Python function I define something like:
def getsize(id):
    total = sqlCtx.sql("select size from tbl where id='" + id + "'")
    return total.take(1)[0].size
Still no problem at this point: I can call getsize("ab") and it returns a value.
The problem occurs when I invoke getsize inside an RDD transformation. Say I have an RDD named data consisting of (key, value) pairs; when I do
data.map(lambda x: (x[0], getsize("ab")))
it raises this error:
py4j.protocol.Py4JError: Trying to call a package
Any idea?
Spark doesn't support nested actions or transformations, and SQLContext is not accessible outside the driver, so what you're doing here simply cannot work. It is not exactly clear what you want here, but most likely a simple join, either on RDDs or DataFrames, should do the trick.
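As a sketch of that idea: instead of issuing a SQL lookup per record, join data against the (id, size) pairs on the key. The PySpark calls are shown in comments (names like dataDF are assumptions, not from the original); the function below mimics in plain Python what an RDD (key, value) inner join computes, so the semantics can be checked without a SparkContext.

```python
# Driver-side Spark equivalent (roughly):
#
#     result = data.join(myrdd)                      # RDD join on the key
#     # or, with DataFrames (dataDF is a hypothetical DataFrame for `data`):
#     # result = dataDF.join(dataLen, dataDF.id == dataLen.id)
#
# Plain-Python illustration of what rdd.join does on (key, value) pairs:
from collections import defaultdict

def pairwise_join(left, right):
    """Inner join of two (key, value) pair lists, like rdd.join."""
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    # For each left pair, emit (key, (left_value, right_value))
    # once per matching right value; keys missing on either side drop out.
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key[k]]

data = [("abc", "payload1"), ("xyz", "payload2")]   # sample (key, value) RDD contents
sizes = [("abc", 10), ("xyz", 20)]                  # sample (id, size) pairs
print(pairwise_join(data, sizes))
# [('abc', ('payload1', 10)), ('xyz', ('payload2', 20))]
```

The join runs entirely as a Spark transformation, so nothing in the worker closure needs SQLContext, which is what triggered the Py4JError.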