How to fill gaps in anomaly detection data using pandas?


Assume I have a pandas DataFrame that consists only of 0s and 1s, depending on whether an anomaly was detected or not:

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

Now I would like to fill gaps in the detection depending on their size. E.g. I only want to fill gaps that are 2 seconds or shorter. What is the correct way to do something like this?

I found these questions here: 1, 2, 3, but the solutions don't seem very straightforward. It feels like there should be a simpler way to solve an issue like this.

EDIT

Sorry for the imprecise question! A "gap" in my case is a short period where no anomaly was detected, inside a larger time range that was otherwise detected as an anomaly.

For the example input_data, the expected output would be a DataFrame with the following data:

[0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]

So in this example the single 0. inside the region of ones is replaced by a one. The zeros could also be replaced by NaNs if that helps. I just need to be able to specify the maximum length of gap that should be filled.
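For reference, here is how the expected output could be kept right next to the input, so that any proposed solution can be checked against it (the column name expected is just my choice):

import pandas as pd

# input together with the expected result, for checking candidate solutions
data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.],
                          'expected': [0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]},
                    index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))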


There are 4 answers

Answer by Smordy (BEST ANSWER)

I don't know if I understood you correctly, but to fill gaps in the detection that are 2 seconds or shorter, you can do this:

import pandas as pd

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

# Timestamp of the most recent detection up to each row, and of the next detection from each row on
times = input_data.index.to_series()
last_detection = times.where(input_data['my_event'] == 1).ffill()
next_detection = times.where(input_data['my_event'] == 1).bfill()

# Gap size = time between the detections surrounding each row
gap_size = (next_detection - last_detection).dt.total_seconds()

# Fill the zeros that sit inside a gap of 2 seconds or shorter
input_data.loc[(input_data['my_event'] == 0) & (gap_size <= 2), 'my_event'] = 1

print(input_data)
     my_event
2023-01-01 00:00:00       0.0
2023-01-01 00:00:01       0.0
2023-01-01 00:00:02       1.0
2023-01-01 00:00:03       1.0
2023-01-01 00:00:04       1.0
2023-01-01 00:00:05       1.0
2023-01-01 00:00:06       0.0
2023-01-01 00:00:07       0.0
2023-01-01 00:00:08       0.0
2023-01-01 00:00:09       1.0
2023-01-01 00:00:10       1.0
Answer by ragas

Here is an alternative answer. I check whether a row without a detection is directly preceded and followed by a detection, i.e. whether the surrounding detections are only 2 seconds apart; if so, the flag column is set and added to the original signal. I hope this is what you are looking for; from your question it's not entirely clear.

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))
df = df.reset_index().rename(columns={'index': 'date'})
df['flag'] = np.where(
    (df['my_event'] == 0.0) &
    (df['my_event'].shift(1) == 1.0) & 
    (df['my_event'].shift(-1) == 1.0),
    1,
    0
)
df['final'] = df['my_event'] + df['flag']
print(df)

 

    date  my_event  flag  final
0  2023-01-01 00:00:00       0.0     0    0.0
1  2023-01-01 00:00:01       0.0     0    0.0
2  2023-01-01 00:00:02       1.0     0    1.0
3  2023-01-01 00:00:03       1.0     0    1.0
4  2023-01-01 00:00:04       0.0     1    1.0
5  2023-01-01 00:00:05       1.0     0    1.0
6  2023-01-01 00:00:06       0.0     0    0.0
7  2023-01-01 00:00:07       0.0     0    0.0
8  2023-01-01 00:00:08       0.0     0    0.0
9  2023-01-01 00:00:09       1.0     0    1.0
10 2023-01-01 00:00:10       1.0     0    1.0
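Note that this flag only catches gaps that are exactly one row long. If you ever need to fill gaps of up to N rows with the same row-based idea, a run-length grouping is one way to generalize it. This is only a sketch continuing from the df above; max_gap_rows and the filled column are names I made up:

max_gap_rows = 1  # fill runs of zeros of at most this many rows

events = df['my_event']

# label each run of consecutive equal values, and get the length of each run
run_id = events.ne(events.shift()).cumsum()
run_len = events.groupby(run_id).transform('size')

# a run of zeros is a fillable gap if it is short enough and is not the
# first or last run (i.e. it has detections on both sides)
is_gap = (
    (events == 0)
    & (run_len <= max_gap_rows)
    & (run_id > run_id.iloc[0])
    & (run_id < run_id.iloc[-1])
)

df['filled'] = events.mask(is_gap, 1.0)

With max_gap_rows = 1 this reproduces the final column above.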
Answer by OCa

(1) Counting rows

Assuming all rows come with identical timesteps between them, as in the input data, a gap of at most 2 seconds means exactly one zero, not more, between two ones:

  • [1,0,1] gets filled as [1,1,1]
  • [1,0,0,1] stays as [1,0,0,1]

In that case, a rather simple one-liner exists using .shift:

import numpy as np
import pandas as pd

# input & expected data
data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.],
                          'expected': [0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]},
                    index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

# solution
data['filled'] = np.where((data['my_event'] == 1) | ((data['my_event'].shift(-1) == 1) & (data['my_event'].shift(1) == 1)), 1, 0)

Output:

                     my_event  expected  filled
2023-01-01 00:00:00       0.0       0.0       0
2023-01-01 00:00:01       0.0       0.0       0
2023-01-01 00:00:02       1.0       1.0       1
2023-01-01 00:00:03       1.0       1.0       1
2023-01-01 00:00:04       0.0       1.0       1
2023-01-01 00:00:05       1.0       1.0       1
2023-01-01 00:00:06       0.0       0.0       0
2023-01-01 00:00:07       0.0       0.0       0
2023-01-01 00:00:08       0.0       0.0       0
2023-01-01 00:00:09       1.0       1.0       1
2023-01-01 00:00:10       1.0       1.0       1

Only the row at 00:00:04 is filled, which is the desired output.
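To check that programmatically rather than by eye (reusing the columns defined above):

# the filled column should match the expected column in every row
assert (data['filled'] == data['expected']).all()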

(2) Alternatively, counting time

Now, without assuming a constant time step throughout the dataset, I have not found an easy-looking method, but here is one that works. Step by step:

  1. Get gap size: take the subset of rows with a detection (value 1 only), then compute the time difference from one row to the next. Since .diff() is applied to a column here rather than to the index, first copy the datetime index into a column.
  2. Merge both datasets
  3. Back-fill gap size information over each gap
  4. Substitute zeros with ones wherever the gap size is at most the chosen maximum interval.

It may look like a lot of steps, but on the upside, because this is not counting rows but datetime intervals, this method is robust against changes in acquisition frequency or missing time points.

Step 1 get gap size:

# continue from the part (1) frame (which also carries the 'expected' column) and
# copy the time from the index to a column for use by .diff()
input_data = data[['my_event', 'expected']].copy()
input_data['time'] = input_data.index

# View subset: rows with detection only
Diffs = input_data[['time']].loc[input_data['my_event']==1]

# Calculate time interval between consecutive detections
Diffs['gap_size'] = Diffs['time'].diff()

# Output:
Diffs
                                   time        gap_size
2023-01-01 00:00:02 2023-01-01 00:00:02             NaT
2023-01-01 00:00:03 2023-01-01 00:00:03 0 days 00:00:01
2023-01-01 00:00:05 2023-01-01 00:00:05 0 days 00:00:02
2023-01-01 00:00:09 2023-01-01 00:00:09 0 days 00:00:04
2023-01-01 00:00:10 2023-01-01 00:00:10 0 days 00:00:01

Step 2 merge both datasets

df = pd.concat([input_data, Diffs['gap_size']], axis=1).drop(['time'], axis=1)
df
                     my_event  expected        gap_size
2023-01-01 00:00:00       0.0       0.0             NaT
2023-01-01 00:00:01       0.0       0.0             NaT
2023-01-01 00:00:02       1.0       1.0             NaT
2023-01-01 00:00:03       1.0       1.0 0 days 00:00:01
2023-01-01 00:00:04       0.0       1.0             NaT
2023-01-01 00:00:05       1.0       1.0 0 days 00:00:02
2023-01-01 00:00:06       0.0       0.0             NaT
2023-01-01 00:00:07       0.0       0.0             NaT
2023-01-01 00:00:08       0.0       0.0             NaT
2023-01-01 00:00:09       1.0       1.0 0 days 00:00:04
2023-01-01 00:00:10       1.0       1.0 0 days 00:00:01

Step 3 back-fill

df['fill_gap_size'] = df['gap_size'].bfill()
df
                     my_event  expected        gap_size   fill_gap_size
2023-01-01 00:00:00       0.0       0.0             NaT 0 days 00:00:01
2023-01-01 00:00:01       0.0       0.0             NaT 0 days 00:00:01
2023-01-01 00:00:02       1.0       1.0             NaT 0 days 00:00:01
2023-01-01 00:00:03       1.0       1.0 0 days 00:00:01 0 days 00:00:01
2023-01-01 00:00:04       0.0       1.0             NaT 0 days 00:00:02
2023-01-01 00:00:05       1.0       1.0 0 days 00:00:02 0 days 00:00:02
2023-01-01 00:00:06       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:07       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:08       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:09       1.0       1.0 0 days 00:00:04 0 days 00:00:04
2023-01-01 00:00:10       1.0       1.0 0 days 00:00:01 0 days 00:00:01

Step 4 conditional fill with defined max interval

# define arbitrary max interval
max_interval = np.timedelta64(2, 's')

# duplicate the original signal
df['fill_event'] = df['my_event']

# running count of detections, used below to avoid filling rows before the first detection
df['cumsum'] = df['my_event'].cumsum()

# conditional substitution: the row is a zero, the gap is small enough,
# and at least one detection has already occurred
df.loc[(df['my_event']==0)
         & (df['fill_gap_size']<=max_interval)
         & (df['cumsum']>0), 'fill_event'] = 1

# remove temporary columns
df.drop(['gap_size','fill_gap_size','cumsum'],axis=1, inplace=True)
df
                     my_event  expected  fill_event
2023-01-01 00:00:00       0.0       0.0         0.0
2023-01-01 00:00:01       0.0       0.0         0.0
2023-01-01 00:00:02       1.0       1.0         1.0
2023-01-01 00:00:03       1.0       1.0         1.0
2023-01-01 00:00:04       0.0       1.0         1.0
2023-01-01 00:00:05       1.0       1.0         1.0
2023-01-01 00:00:06       0.0       0.0         0.0
2023-01-01 00:00:07       0.0       0.0         0.0
2023-01-01 00:00:08       0.0       0.0         0.0
2023-01-01 00:00:09       1.0       1.0         1.0
2023-01-01 00:00:10       1.0       1.0         1.0

So, fill_event == expected: success!
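For reuse, the four steps could be condensed into one helper function. This is only a sketch of the same idea; fill_gaps is a name I picked, and it assumes the events series has a datetime index:

import numpy as np
import pandas as pd

def fill_gaps(events: pd.Series, max_interval=np.timedelta64(2, 's')) -> pd.Series:
    """Fill runs of zeros whose surrounding detections are at most max_interval apart."""
    times = events.index.to_series()
    # steps 1-3: time between consecutive detections, back-filled over each gap
    gap_size = times[events == 1].diff().reindex(events.index).bfill()
    # step 4: fill zeros inside a small gap, ignoring rows before the first detection
    filled = events.copy()
    filled[(events == 0) & (gap_size <= max_interval) & (events.cumsum() > 0)] = 1
    return filled

input_data['fill_event'] = fill_gaps(input_data['my_event'])

Applied to the example, fill_event again matches expected.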

Answer by MZij

An easy and straightforward way to handle this is to let each row look at the two rows above and below it. From your question, I assume the rows are evenly spaced at 1-second intervals. Using pandas, this can be done quickly and efficiently with the .loc and .shift() methods.

The example below should produce the expected result from your question. It creates four new columns containing shifted copies of the my_event column, so that each row can easily look at the data in neighbouring rows. Conditional statements then determine which 0 should become a 1. Your example lacked a gap of two zeros, so I added some rows myself.

import pandas as pd

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.,]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:14', freq='s'))

# create new columns with shifted data so that each row can 'look' at the data in the rows above and below
input_data["shift1"] = input_data["my_event"].shift(1)
input_data["shift-1"] = input_data["my_event"].shift(-1)
input_data["shift2"] = input_data["my_event"].shift(2)
input_data["shift-2"] = input_data["my_event"].shift(-2)

# use .loc and conditional statements to find, for each row, the locations
# that have a gap of 1 or 2 and set these to value 1
idx_gap_1 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift1"] == 1) & (input_data["shift-1"] == 1)].index
idx_gap_2 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift1"] == 1) & (input_data["shift-2"] == 1)].index
idx_gap_3 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift-1"] == 1) & (input_data["shift2"] == 1)].index
input_data.loc[idx_gap_1.union(idx_gap_2).union(idx_gap_3), "my_event"] = 1


print(input_data["my_event"])

Result:

2023-01-01 00:00:00    0.0
2023-01-01 00:00:01    0.0
2023-01-01 00:00:02    1.0
2023-01-01 00:00:03    1.0
2023-01-01 00:00:04    1.0
2023-01-01 00:00:05    1.0
2023-01-01 00:00:06    0.0
2023-01-01 00:00:07    0.0
2023-01-01 00:00:08    0.0
2023-01-01 00:00:09    1.0
2023-01-01 00:00:10    1.0
2023-01-01 00:00:11    1.0
2023-01-01 00:00:12    1.0
2023-01-01 00:00:13    1.0
2023-01-01 00:00:14    1.0
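One possible refinement: the shift columns above are hard-coded for gaps of one or two rows. If the maximum gap length may change, the same "look at the neighbours" idea can be generated in a loop. A sketch, where n (the maximum gap length in rows) is a parameter I introduced:

n = 2  # maximum gap length in rows

event = input_data['my_event']
fill = pd.Series(False, index=event.index)

# a zero can be filled if there is a detection a rows before it and b rows after it
# with a + b <= n + 1, i.e. the surrounding detections are at most n + 1 rows apart
for a in range(1, n + 1):
    for b in range(1, n + 2 - a):
        fill |= (event == 0) & (event.shift(a) == 1) & (event.shift(-b) == 1)

input_data.loc[fill, 'my_event'] = 1

For n = 2 this generates exactly the three conditions used above: (1, 1), (1, 2) and (2, 1).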