I am trying to load a CSV file into a distributed dataframe (ddf) while supplying a schema. The ddf loads, but the timestamp column contains only null values. I believe this happens because Spark expects timestamps in a particular format. So I have two questions:
1) How do I give Spark the format, or make it detect the format (like "MM/dd/yyyy' 'HH:mm:ss")?
2) If 1 is not an option, how do I convert the field (assuming I imported it as a String) to a timestamp?
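For Q1, I am imagining something along these lines (the "dateFormat" option here is a guess on my part; I don't know whether spark-csv actually supports it):

// Hypothetical: hand spark-csv the timestamp format as an option.
// Whether an option like "dateFormat" exists is exactly what I'm asking.
val df = sqlContext.load("com.databricks.spark.csv", schema,
  Map("path" -> "/path/to/file.csv",
      "header" -> "true",
      "dateFormat" -> "MM/dd/yyyy' 'HH:mm:ss"))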
For Q2, I have tried the following:
import org.apache.spark.sql.Row

def convert(row: Row): Row = {
  // Column 3 holds the timestamp string; getTimestamp is sketched below
  val d1 = getTimestamp(row(3))
  Row(row(0), row(1), row(2), d1)
}
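Here getTimestamp is a small helper along these lines (a sketch; it returns null when the field is empty or unparseable):

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Parse "MM/dd/yyyy HH:mm:ss" strings into java.sql.Timestamp,
// returning null for empty or malformed values.
def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x == null || x.toString.isEmpty) null
  else
    try new Timestamp(format.parse(x.toString).getTime)
    catch { case _: java.text.ParseException => null }
}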
val rdd1 = df.map(convert)
val df1 = sqlContext.createDataFrame(rdd1, schema1)
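For context, schema1 looks roughly like this (the column names are placeholders; the real point is that the last field is a nullable TimestampType, since getTimestamp can return null):

import org.apache.spark.sql.types.{StructField, StringType, StructType, TimestampType}

// Placeholder column names; the timestamp field is nullable because
// getTimestamp returns null for empty or malformed input.
val schema1 = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true),
  StructField("ts", TimestampType, nullable = true)
))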
The last step doesn't work because of the null values, which make it fail with errors like:
java.lang.RuntimeException: Failed to check null bit for primitive long value.
sqlContext.load, however, is able to load the CSV without any problems:
val df = sqlContext.load("com.databricks.spark.csv", schema, Map("path" -> "/path/to/file.csv", "header" -> "true"))