Timestamp issue when loading CSV to dataframe


I am trying to load a CSV file into a distributed dataframe (ddf) while supplying a schema. The ddf loads, but the timestamp column contains only null values. I believe this happens because Spark expects timestamps in a particular format. So I have two questions:

1) How do I tell Spark the format, or make it detect the format (for example "MM/dd/yyyy' 'HH:mm:ss")?

2) If 1 is not an option, how do I convert the field (assuming I imported it as a String) to a timestamp?

For Q2 I have tried the following:

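def convert(row: org.apache.spark.sql.Row): org.apache.spark.sql.Row = {
    import org.apache.spark.sql.Row
    // Parse column 3 with the expected pattern; keep null when the field is missing
    val format = new java.text.SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    val d1 = if (row.isNullAt(3)) null else new java.sql.Timestamp(format.parse(row.getString(3)).getTime)
    Row(row(0), row(1), row(2), d1)
}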

val rdd1 = df.map(convert)
val df1 = sqlContext.createDataFrame(rdd1,schema1)

The last step doesn't work because null values prevent it from finishing. I get errors like:

java.lang.RuntimeException: Failed to check null bit for primitive long value.

sqlContext.load, however, is able to load the CSV without any problems:

val df = sqlContext.load("com.databricks.spark.csv", schema, Map("path" -> "/path/to/file.csv", "header" -> "true"))
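
For reference, the null-tolerant parsing that Q2 is after might look like the sketch below. It is only an illustration under stated assumptions: the names TimestampParsing and parseTimestamp are hypothetical, it relies on plain java.text.SimpleDateFormat rather than any Spark API, and it is not confirmed to resolve the error above. Missing, empty, or malformed fields come back as None instead of throwing.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical helper, not part of Spark or spark-csv.
object TimestampParsing {
  // Pattern taken from the question.
  private val Pattern = "MM/dd/yyyy' 'HH:mm:ss"

  // Returns None for null, blank, or unparseable input.
  def parseTimestamp(raw: Any): Option[Timestamp] =
    Option(raw)
      .map(_.toString.trim)
      .filter(_.nonEmpty)
      .flatMap { s =>
        // SimpleDateFormat is not thread-safe, so build one per call.
        val fmt = new SimpleDateFormat(Pattern)
        fmt.setLenient(false) // reject out-of-range fields such as month 13
        Try(new Timestamp(fmt.parse(s).getTime)).toOption
      }
}
```

Inside a row-mapping function, TimestampParsing.parseTimestamp(row(3)).orNull would yield a value that can be null, which presumably requires the target schema field to be declared nullable.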