How to cast string to Duration dtype in Polars Python

329 views Asked by At

Below is a sample of my data in which i have ten string columns that i need to cast to duration then to seconds (int32)

I have tried using pl.str.strptime and cast(pl.Duration) but all failed

import polars as pl
df = pl.DataFrame(
    {
        "a" : ["24:00:00" , "00:45:00"]
    }
)
df.with_columns(
    pl.col("a").str.strptime(pl.Duration, "%H:%M:%S")
)

The first value should be 86400 and the second should be 2700.

I came from power query where I can just replace "24:00:00" with "1.0:00:00" and change the type to duration.

Any suggested solution?

Thanks in advance.

1

There are 1 answers

1
Mustafa Elec On

Based on @jqurious comment above, I have succeeded in converting it to duration by extract string to list then get list members and assign it to duratio then to int seconds as follow:

import polars as pl
df = pl.DataFrame(
    {
        "ab" : ["24:00:00" , "00:45:00"],
    }
)

df.with_columns(
    pl.col("ab").str.extract_all(r"\d+")
).with_columns(
    pl.duration(
        hours   = pl.col("ab").list.get(0),
        minutes = pl.col("ab").list.get(1),
        seconds = pl.col("ab").list.get(2),
    ).alias("Dur")
).with_columns(
    pl.col("Dur").dt.seconds().alias("Dur Sec")
)
shape: (2, 3)
┌────────────────────┬──────────────┬─────────┐
│ ab                 ┆ Dur          ┆ Dur Sec │
│ ---                ┆ ---          ┆ ---     │
│ list[str]          ┆ duration[μs] ┆ f64     │
╞════════════════════╪══════════════╪═════════╡
│ ["24", "00", "00"] ┆ 1d           ┆ 86400.0 │
│ ["00", "45", "00"] ┆ 45m          ┆ 2700.0  │
└────────────────────┴──────────────┴─────────┘

However, I couldn't apply this solution on multi column which is my data case:

df = pl.DataFrame(
    {
        "ab" : ["24:00:00" , "00:45:00"],
        "abc" : ["23:00:00" , "03:41:00"],
    }
)
df.with_columns(
    pl.col(["ab","abc"]).str.extract_all(r"\d+")
).with_columns(
    pl.duration(
        hours   = pl.col(["ab","abc"]).list.get(0),
        minutes = pl.col(["ab","abc"]).list.get(1),
        seconds = pl.col(["ab","abc"]).list.get(2),
    )
)
PanicException: assertion `left == right` failed
  left: 11
 right: 8

As a workaround, I have succeeded converting it directly to int seconds:

df = pl.DataFrame(
    {
        "ab" : ["24:00:00" , "00:45:00"],
        "abc" : ["23:00:00" , "03:41:00"],
    }
)
df.with_columns(
    pl.col(["ab","abc"]).str.extract_all(r"\d+")
).with_columns(
    (
        pl.col(["ab","abc"]).list.get(0).cast(pl.Int32) * 3600 +
        pl.col(["ab","abc"]).list.get(1).cast(pl.Int32) * 60 +
        pl.col(["ab","abc"]).list.get(2).cast(pl.Int32) 
    ).suffix("Dur sec")
)
shape: (2, 4)
┌────────────────────┬────────────────────┬───────────┬────────────┐
│ ab                 ┆ abc                ┆ abDur sec ┆ abcDur sec │
│ ---                ┆ ---                ┆ ---       ┆ ---        │
│ list[str]          ┆ list[str]          ┆ i32       ┆ i32        │
╞════════════════════╪════════════════════╪═══════════╪════════════╡
│ ["24", "00", "00"] ┆ ["23", "00", "00"] ┆ 86400     ┆ 82800      │
│ ["00", "45", "00"] ┆ ["03", "41", "00"] ┆ 2700      ┆ 13260      │
└────────────────────┴────────────────────┴───────────┴────────────┘