pyspark.pandas: Converting float64 column to TimedeltaIndex


I want to convert a numeric column that represents a timedelta in seconds to a ps.TimedeltaIndex (so that I can resample the dataset later):

import pyspark.pandas as ps

df = ps.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})
df.set_index(ps.to_timedelta(df.time, "s").to_numpy())

KeyError: '2000000000 nanoseconds'

I don't understand why this doesn't work.


There are 2 answers

ascripter On BEST ANSWER

The answer by @koedlt put me on the right track, but it is still missing the conversion to a TimedeltaIndex:

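import pyspark.pandas as ps

df = ps.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})
# Convert the seconds to timedeltas, then set the column as index by name
df["time"] = ps.to_timedelta(df.time, unit="s")
df.set_index("time", inplace=True)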

However, I also realised that the resample I mentioned actually requires a DatetimeIndex, so I should have asked for that. In that case, use ps.to_datetime(df.time, unit="s") instead of ps.to_timedelta.
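
For illustration, here is a minimal sketch of the DatetimeIndex route followed by a resample. It uses plain pandas rather than pyspark.pandas so that it runs without a Spark session. That swap is my assumption: the pandas-on-Spark calls of the same name (ps.to_datetime, set_index, resample) should behave analogously, but I have not verified this on every Spark version. The 2-second bin width is an arbitrary choice for the example.

```python
import pandas as pd

# Same sample data as in the question; "time" holds seconds as floats.
df = pd.DataFrame({"time": [2.0, 3.0, 4.0], "x": [4.5, 4.0, 3.5]})

# Interpret the seconds as offsets from the Unix epoch (1970-01-01).
df["time"] = pd.to_datetime(df["time"], unit="s")

# Pass the column name, not the column data, to set_index.
df = df.set_index("time")

# Downsample into 2-second bins (illustrative width) and average each bin.
resampled = df.resample("2s").mean()
print(resampled)
```

Note that the timestamps end up relative to 1970-01-01, which is fine as long as only the spacing between rows matters for the resampling.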

Koedlt On

This doesn't work because set_index requires column names for its keys argument, not the data of the columns.

So you could create a new column and then set it as the index by name:

import pyspark.pandas as ps
import pyspark.sql.functions as F

df = ps.DataFrame({"time": [2, 3, 4], "x": [4.5, 3, 3.5]})

df["time_nanoseconds"] = F.col("time") * 1e9
df.set_index("time_nanoseconds")

                  time    x                                                     
time_nanoseconds           
2.000000e+09         2  4.5
3.000000e+09         3  3.0
4.000000e+09         4  3.5