Trying to do something fairly straightforward with Blaze and my local Spark instance: load a csv file with Blaze's into() and then use Blaze's by().
Python 3.4
Spark 1.4.0
Blaze 0.8.0
csv (simple.csv)
id,car
1,Mustang
2,Malibu
3,Mustang
4,Malibu
5,Murano
code
import blaze as bz
rdd = bz.into(sc,"simple.csv")
simple = bz.Data(rdd)
simple.count() #gives me 5 so far so good
bz.by(simple.car, count=simple.id.count()) #throws an error
AttributeError: 'InteractiveSymbol' object has no attribute 'car'
Any ideas on what's going on here?
Side note: this works
simple_csv = bz.Data("simple.csv")
bz.by(simple_csv.car, count=simple_csv.id.count())
car count
0 Malibu 2
1 Murano 1
2 Mustang 2
And so does this
simple_csv.car.count_values()
car count
0 Malibu 2
2 Mustang 2
1 Murano 1
Gotta be the way I'm "loading" it into Spark, right?
You'll want to create a Spark DataFrame (formerly SchemaRDD) using a SQLContext instead of creating a "raw" RDD with the SparkContext. RDDs don't have named columns, which you would need in order for the by operation to succeed. This is why the InteractiveSymbol did not have a car attribute: it was stripped away in the process of creating the RDD.
Executing this in a Jupyter code cell:
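(a sketch of the cell; it assumes the com.databricks:spark-csv package has been added to Spark, e.g. with --packages, and the option names come from that package)
from pyspark import SparkContext
from pyspark.sql import SQLContext
import blaze as bz

sc = SparkContext("local", "blaze-demo")  # skip these two lines if sc/sqlContext already exist
sqlContext = SQLContext(sc)

# read the csv into a DataFrame with named columns, not a raw RDD
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("simple.csv")

simple = bz.Data(df)
simple.count()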
would produce a pyspark.sql.dataframe.DataFrame object, and execute a program on the Spark driver to count the rows. At this point, you should be able to compute the group-by as you were trying to before:
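(the same expression as in your question, now against the DataFrame-backed simple)
bz.by(simple.car, count=simple.id.count())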
BUT. There is a problem with Blaze, at least for me, as of today, running Blaze 0.9.0 with both Spark 1.6 and Spark 1.4.1. Likely, this is not the same problem you had in the first place, but it is preventing me from reaching a working solution. I tried dropping Jupyter and running in a pyspark session directly. To do so yourself, you can omit a few of the lines above, since pyspark automatically creates sc and sqlContext:
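(again a sketch, with the same spark-csv assumption as above)
import blaze as bz

# sc and sqlContext are created automatically by the pyspark shell
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("simple.csv")

simple = bz.Data(df)
bz.by(simple.car, count=simple.id.count())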
This produces an error. Even just trying to get an interactive view of simple like this also produces an error:
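(presumably just displaying the object at the prompt)
simple

Anyway, there seems to be some activity in the Blaze project on Github related to upgrading support for Spark 1.6, so hopefully they'll get this stuff fixed at that point.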