I'm trying to run a function across the cartesian product of two PySpark DataFrames with:
joined = dataframe1.rdd.cartesian(dataframe2.rdd)
collected = joined.collect()
for tuple in collected:
    print tuple
# new_rdd = joined.map(function_to_pass_in)
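For reference, each element of joined is a (Row, Row) pair, so the function I eventually want to map would look something like this (a hypothetical sketch; the "id" field names are made up):

def function_to_pass_in(pair):
    # pair is a (Row, Row) tuple from the cartesian product
    left, right = pair
    # combine one field from each side (these field names are placeholders)
    return (left.id, right.id)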
But when I run the print loop above, I get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-72-bf547304ed8b> in <module>()
     29 collected = joined.collect()
     30 for tuple in collected:
---> 31     print tuple
     32 
     33 # new_rdd = joined.map(function_to_pass_in)
/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in __repr__(self)
   1212         # call collect __repr__ for nested objects
   1213         return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                       for n in self.__FIELDS__))
   1215 
   1216     def __reduce__(self):
/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in <genexpr>((n,))
   1212         # call collect __repr__ for nested objects
   1213         return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                       for n in self.__FIELDS__))
   1215 
   1216     def __reduce__(self):
IndexError: tuple index out of range
Interestingly enough, the following code works without error:
joined = dataframe1.rdd.cartesian(dataframe2.rdd)
print joined.count()
for tuple in joined.collect():
    print tuple
Why does calling ".count()" on the resulting RDD first make the print loop work? Shouldn't it work without that? Am I missing something?
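For what it's worth, converting each Row to a plain tuple before taking the cartesian product seems to sidestep the Row __repr__ call shown in the traceback (a sketch I haven't verified beyond small data; Row is a tuple subclass, so the built-in tuple() works as the mapper):

joined_plain = dataframe1.rdd.map(tuple).cartesian(dataframe2.rdd.map(tuple))
for pair in joined_plain.collect():
    print pair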