Understanding some basics of Spark SQL

I'm following http://spark.apache.org/docs/latest/sql-programming-guide.html

After typing:

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

I have some questions that I didn't see the answers to.

First, what is the $-notation? As in

 df.select($"name", $"age" + 1).show()

Second, can I get the data from just the second row, without knowing in advance what that row contains?

Third, how would you read in a color image with Spark SQL?

Fourth, I'm still not sure what the difference is between a Dataset and a DataFrame in Spark. The variable df is a DataFrame, so could I change "Michael" to the integer 5? Could I do that in a Dataset?

There are 2 answers

user7337271 (best answer)
  1. $ is not an annotation. It is a method call: with spark.implicits._ in scope, $"name" is a shortcut for new ColumnName("name").
  2. You wouldn't. Spark SQL has no notion of row indexing.
  3. You wouldn't. You can use the low-level RDD API with specific input formats (like the ones from the HIPI project) and then convert the result to a DataFrame.
  4. See "Difference between DataSet API and DataFrame".
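
To illustrate point 1, here is a minimal sketch in plain Scala of how a `$"..."` expression can turn into a method call. It does not use Spark at all: `ColumnName` below is a simplified, hypothetical stand-in for Spark's class of the same name, and `DollarSketch` is a name made up for this example. The idea is that an implicit class adds a method named `$` to `StringContext`, which is roughly what `import spark.implicits._` brings into scope (spark-shell does that import for you).

```scala
// Simplified stand-in for Spark's ColumnName (hypothetical, illustration only).
final case class ColumnName(name: String)

object DollarSketch {
  // Adds a method called `$` to StringContext, so that $"..." is
  // ordinary string-interpolator syntax resolved to a method call.
  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): ColumnName = ColumnName(sc.s(args: _*))
  }

  def main(args: Array[String]): Unit = {
    // The compiler rewrites $"name" to StringContext("name").$()
    val c = $"name"
    println(c) // prints ColumnName(name)
  }
}
```

In a standalone application (outside spark-shell), the real `$` only becomes available after `import spark.implicits._`, where `spark` is your SparkSession.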

Vishnu Subramanian

1) For question 1, the $ sign is a shortcut for referring to a column so that you can apply functions to it. For example:

df.select($"id".isNull).show

which can otherwise be written as

df.select(col("id").isNull)

2) Spark does not have row indexing, but for prototyping you can use df.take(10)(i), where i is the zero-based position of the row you want among the first 10. Note: the result can differ between runs, because the underlying data is partitioned and row order is not guaranteed.
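
If you need a row at a given position beyond quick prototyping, one possible approach (not the only one, and not something Spark SQL offers directly) is to drop to the RDD API and attach positions with zipWithIndex. This is a sketch under assumptions: `RowByPosition` and `rowAt` are names made up for this example, and the position reflects partition order at the time of the call, so it is only meaningful if the DataFrame has a stable order (for example, a single small file or an explicit sort).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowByPosition {
  // Returns the row at zero-based position `i`, or None if there is no such row.
  // Positions follow partition order, which Spark does not guarantee to be stable.
  def rowAt(df: DataFrame, i: Long): Option[Row] =
    df.rdd
      .zipWithIndex()                        // pairs each row with its position
      .filter { case (_, idx) => idx == i }  // keep only the requested position
      .map { case (row, _) => row }
      .collect()
      .headOption

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row-by-position")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.json("examples/src/main/resources/people.json")
    println(rowAt(df, 1L)) // the second row, wrapped in an Option

    spark.stop()
  }
}
```

Unlike df.take(10)(i), this does not throw when i is past the end of the data; it returns None instead.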