Spark RDD Cartesian tuple index out of range


I'm trying to run a function across the cartesian product of two PySpark DataFrames with:

joined = dataframe1.rdd.cartesian(dataframe2.rdd)
collected = joined.collect()
for tuple in collected:
    print tuple

# new_rdd = joined.map(function_to_pass_in)

But I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-bf547304ed8b> in <module>()
    29 collected = joined.collect()
    30 for tuple in collected:
---> 31     print tuple
    32 
    33 # new_rdd = joined.map(function_to_pass_in)

/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in __repr__(self)
  1212             # call collect __repr__ for nested objects
  1213             return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                           for n in self.__FIELDS__))
  1215 
  1216         def __reduce__(self):

/opt/spark/spark-1.3.0/python/pyspark/sql/types.pyc in <genexpr>((n,))
  1212             # call collect __repr__ for nested objects
  1213             return ("Row(%s)" % ", ".join("%s=%r" % (n, getattr(self, n))
-> 1214                                           for n in self.__FIELDS__))
  1215 
  1216         def __reduce__(self):

IndexError: tuple index out of range

Interestingly enough, the following code works without error:

joined = dataframe1.rdd.cartesian(dataframe2.rdd)
print joined.count()
for tuple in joined.collect():
    print tuple

Why does calling ".count()" on my resulting RDD make this work? Shouldn't the first version work without that extra step? Am I missing something?
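
For what it's worth, one workaround I've tried does avoid the error entirely (a sketch only; it assumes the problem lies in the Row objects themselves, which I haven't confirmed): map each Row to a plain tuple before taking the cartesian product, so the collected pairs hold ordinary tuples instead of Rows.

    # Untested sketch: convert Rows to plain tuples first, so the
    # cartesian product carries ordinary tuples instead of Row objects
    # (Row subclasses tuple, so tuple(row) drops only the field names).
    joined = dataframe1.rdd.map(tuple).cartesian(dataframe2.rdd.map(tuple))
    for pair in joined.collect():
        print pair  # each pair is (tuple_from_df1, tuple_from_df2)

This sidesteps Row's __repr__ altogether, though, so it doesn't explain the original behaviour.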
