I have a pandas dataframe that contains multiple rows, each with a datetime and a sensor value. My goal is to add a column that holds the number of days until the sensor value next reaches or exceeds a given threshold.

For instance, for the data <2019-01-05 11:00:00, 200>, <2019-01-06 12:00:00, 250>, <2019-01-07 13:00:00, 300> I would want the additional column to look like [1 day, 0 days, 0 days] for thresholds between 200 and 250 and [2 days, 1 day, 0 days] for thresholds between 250 and 300.

I tried subsampling the dataframe with df_sub = df[df['sensor_value'] >= threshold], iterating over both dataframes, and finding the next timestamp in df_sub for each timestamp in df. However, this solution seems to be very inefficient, and I think that pandas might have some optimized way of calculating what I need.

In the following example code, I tried what I described above.

import pandas as pd
data = [{'time': '2019-01-05 11:00:00', 'sensor_value': 200},
        {'time': '2019-01-05 14:37:52', 'sensor_value': 220},
        {'time': '2019-01-05 17:55:12', 'sensor_value': 235},
        {'time': '2019-01-06 12:00:00', 'sensor_value': 250},
        {'time': '2019-01-07 13:00:00', 'sensor_value': 300},
        {'time': '2019-01-08 14:00:00', 'sensor_value': 250},
        {'time': '2019-01-09 15:00:00', 'sensor_value': 320}]
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])

def calc_rul(df, threshold):
    # calculate all datetime where the threshold is exceeded
    df_sub = sorted(df[df['sensor_value'] >= threshold]['time'].tolist())

    # variable to store all days
    remaining_days = []
    for v1 in df['time'].tolist():
        for v2 in df_sub:

            # if the exceeding date is the first in future calculate the days difference
            if v2 > v1:
                remaining_days.append((v2 - v1).days)
                break
            elif v2 == v1:
                remaining_days.append(0)
                break
        else:
            # no later exceedance exists; keep the list aligned with df
            remaining_days.append(None)
    df['RUL'] = pd.Series(remaining_days, index=df.index)

calc_rul(df, 300)

Expected output (output of the above sample):

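                 time  sensor_value  RUL
0 2019-01-05 11:00:00           200    2
1 2019-01-05 14:37:52           220    1
2 2019-01-05 17:55:12           235    1
3 2019-01-06 12:00:00           250    1
4 2019-01-07 13:00:00           300    0
5 2019-01-08 14:00:00           250    1
6 2019-01-09 15:00:00           320    0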

2 Answers

0
Quang Hoang On Best Solutions

Here's what I would do for one threshold:

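def calc_rul(df, thresh):
    # mark all rows whose value is greater than or equal to thresh
    # (column names follow the question's dataframe: 'time', 'sensor_value')
    markers = df['sensor_value'].ge(thresh)

    # keep the timestamp on the marked rows, NaT everywhere else
    df['last_day'] = df['time'].where(markers)

    # back fill so each row carries the next timestamp at which
    # the threshold is reached (assumes df is sorted by time)
    df['last_day'] = df['last_day'].bfill()

    # rows with no later exceedance end up as NaN
    df['RUL'] = (df['last_day'] - df['time']).dt.days

    # drop the helper column if necessary,
    # remove this line to better see how the code works
    df.drop('last_day', axis=1, inplace=True)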


calc_rul(df, 300)
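
A back-fill only works when the rows are already in time order. As an alternative sketch (not part of the answer above), the same lookup can be done with NumPy's `searchsorted`, which finds the first exceedance at or after each timestamp and does not depend on row order. The function name `calc_rul_searchsorted` is made up for illustration, and the sample data is copied from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-05 11:00:00', '2019-01-05 14:37:52', '2019-01-05 17:55:12',
        '2019-01-06 12:00:00', '2019-01-07 13:00:00', '2019-01-08 14:00:00',
        '2019-01-09 15:00:00',
    ]),
    'sensor_value': [200, 220, 235, 250, 300, 250, 320],
})


def calc_rul_searchsorted(df, threshold):
    """Whole days until the sensor value next reaches the threshold.

    Hypothetical helper for illustration; returns a Series aligned with df.
    """
    # Sorted timestamps at which the threshold is reached or exceeded
    hits = df.loc[df['sensor_value'] >= threshold, 'time'].sort_values().to_numpy()
    times = df['time'].to_numpy()

    # Index of the first hit that is >= each timestamp
    pos = np.searchsorted(hits, times, side='left')

    # Rows with no later hit stay NaT
    next_hit = np.full(len(times), np.datetime64('NaT'), dtype=times.dtype)
    found = pos < len(hits)
    next_hit[found] = hits[pos[found]]

    return pd.Series(next_hit - times, index=df.index).dt.days


df['RUL'] = calc_rul_searchsorted(df, 300)
print(df)
```

If the threshold is never reached again, the corresponding rows come out as NaN rather than being silently dropped.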
0
RenauV On

Instead of splitting the dataframe, you can use '.loc', which lets you filter rows and loop over your thresholds in the same way:

df['RUL'] = '[2 days, 1 day, 0 days]'
for threshold in threshold_list:
    df.loc[df['sensor_value'] > <your_rule>,'RUL'] = '[1 day, 0 days, 0 days]'

This technique avoids splitting the dataframe.
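
To make the idea of looping over several thresholds concrete, here is a minimal sketch that adds one column per threshold. It is an assumption about how this could be done, not the answerer's code: the helper name `add_rul_columns` and the `RUL_<threshold>` column names are invented for illustration, and the sample data comes from the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2019-01-05 11:00:00', '2019-01-05 14:37:52', '2019-01-05 17:55:12',
        '2019-01-06 12:00:00', '2019-01-07 13:00:00', '2019-01-08 14:00:00',
        '2019-01-09 15:00:00',
    ]),
    'sensor_value': [200, 220, 235, 250, 300, 250, 320],
})


def add_rul_columns(df, thresholds):
    """Return a copy of df with one hypothetical 'RUL_<threshold>' column per threshold."""
    # Back-filling needs the rows in time order
    out = df.sort_values('time').copy()
    for threshold in thresholds:
        # Timestamp on rows that reach the threshold, NaT elsewhere,
        # then back fill so each row sees the next such timestamp
        next_hit = out['time'].where(out['sensor_value'] >= threshold).bfill()
        out[f'RUL_{threshold}'] = (next_hit - out['time']).dt.days
    return out


result = add_rul_columns(df, [250, 300])
print(result)
```

Each pass over a threshold is vectorized, so the only Python-level loop is over the (usually short) list of thresholds.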