I would like to convert a datetime string to a timestamp in dask-cudf and then sort the dataframe by this column.
Example:
import cudf
import dask_cudf as ddf
# Sample data (replace with your actual data)
cdf = cudf.DataFrame({
    'city': ['Dallas', 'Bogota', 'Chicago', 'Juarez'],
    'timestamp': ['2019-12-29 14:15:08 UTC', '2019-12-30 10:30:15 UTC', '2019-12-31 18:45:30 UTC', '2020-01-01 03:20:45 UTC']
})
# Create a Dask-cuDF DataFrame
dask_df = ddf.from_cudf(cdf, npartitions=2)
def to_timestamp(x):
    import time
    import datetime
    element = datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S UTC")
    return datetime.datetime.timestamp(element)
dask_df['timestamp'] = dask_df['timestamp'].map_partitions(to_timestamp, meta=("timestamp", "str"))
dask_df.head()
I got this error:
TypeError: strptime() argument 1 must be str, not Series
How can I do this for a large dataframe in dask-cudf?
========== update ==========
I have tried this:
dask_df["timestamp"] = dask_df["timestamp"].map_partitions(to_timestamp, meta=("timestamp", "str"))
and got this error:
TypeError: strptime() argument 1 must be str, not Series
This map_partitions thread seems to cover all the tricks of using `map_partitions` on a row-by-row basis. Furthermore, you can refactor your function somewhat. The import statements can be moved outside of the function to save on loading time. You're only using `datetime` in the function, so you can skip importing `time`. The function could then look like this:
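A minimal sketch of that refactor (note that it still parses one string at a time, so handing it directly to `map_partitions` will again receive a whole Series rather than a string):

```python
import datetime

def to_timestamp(x):
    # Parse a single datetime string and return its POSIX timestamp (float seconds).
    element = datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S UTC")
    return datetime.datetime.timestamp(element)
```

To fix the original `TypeError`, the function would then need to be applied element-wise within each partition rather than to the partition object itself. Once the column holds numeric timestamps, sorting the frame by it should be possible with `dask_df.sort_values("timestamp")`.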