How to add a new day-of-week column based on another column in a DataFrame?

I have a field in a data frame that is currently formatted as a string (mm/dd/yyyy), and I want to create a new column in that data frame with the day-of-week name (e.g. Thursday) for that field. I've imported

import com.github.nscala_time.time.Imports._

but am not sure where to go from here.

1 Answer

Answer by zero323 (accepted):

Create a formatter:

val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")

Parse the date:

val dt = fmt.parseDateTime("09/11/2015")

Get the day of the week (with Joda-Time, four or more E pattern letters yield the full day name):

dt.toString("EEEEE")

Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution.
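A minimal sketch of that wrapping, assuming the df with a date_string column defined just below (dayOfWeekUdf is an illustrative name, not part of any API):

import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat // pulled in transitively by nscala-time
import sqlContext.implicits._ // for the $"..." column syntax

// Build the formatter inside the function so the closure stays serializable
val dayOfWeekUdf = udf { (s: String) =>
  DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(s).toString("EEEEE")
}

df.withColumn("dow", dayOfWeekUdf($"date_string"))

Still, there is no need for that, since HiveContext already provides all the required UDFs: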

import sqlContext.implicits._ // .toDF on an RDD needs the SQLContext implicits

val df = sc.parallelize(Seq(
   Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")

df.registerTempTable("df")

sqlContext.sql(
  """SELECT date_string,
        from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEEE') AS dow
      FROM df"""
).show

// +-----------+--------+
// |date_string|     dow|
// +-----------+--------+
// | 08/11/2015| Tuesday|
// | 09/11/2015|  Friday|
// | 09/12/2015|Saturday|
// +-----------+--------+

EDIT:

Since Spark 1.5 you can use the from_unixtime and unix_timestamp functions directly:

import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}

df.select(from_unixtime(
  unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEEE").alias("dow"))
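To keep the original column alongside for a quick check (same df as above; the $ syntax again relies on the sqlContext.implicits._ import):

df.select(
  $"date_string",
  from_unixtime(unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEEE").alias("dow")
).show

// produces the same Tuesday / Friday / Saturday rows as the SQL output above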