I have a Pandas dataframe that contains two relevant columns: an event timestamp and a persistent unique ID of a device that completes the event.

I'd like to identify whether each device is still operating.

I do this by selecting a threshold numbered in days, and if the device has no event after this threshold (time elapsed since last event, for that device), the device is deemed defunct.

Eg, if the device has not reported an event within the last 14 days of the data, it is deemed defunct (although that countdown clock is reset if the device has an event after the 14 day threshold, eg if it was dormant for a month and then has an event again).

You can imagine that if I set the threshold to one day, I'd mistakenly classify many devices that are still active (but have not had events for one day) as defunct, and if I set the threshold to 90 days, I'd have many devices marked as operational when they're actually defunct (mean operational lifespan is about 20 days of operation).

I'd like to find the threshold that corresponds to a 95% probability that a device is defunct. That is, if a device disappears for X days, it has a 95% chance of not reappearing again within 90 days, based on historical data. I say 'within 90 days' because technically a device could reappear in 12 months, so I have to draw the line somewhere, and I only have about a year's worth of data.

Sample dataframe:

pd.DataFrame({"device_id": ['A','A','A','A','A'], "event_date":["2019-01-01", "2019-01-03", "2019-01-07", "2019-01-12", "2019-01-18"] })
   device_id     event_date
0          A   "2019-01-01"
1          A   "2019-01-03"
2          A   "2019-01-07"
3          A   "2019-01-12"
4          A   "2019-01-18"

Eg, for this dataset, you'd get a 50% probability of a device being defunct if you had set the threshold to six days, since there are two instances of 6+ days between events, and the device continues to operate after one of those instances.

Whereas if you set the threshold to seven days, you'd have a 100% probability the device is defunct. The actual dataframe has thousands of devices and dozens of dates per device.

I had previously done a similar computation on a smaller dataset in a spreadsheet, but that's not possible with my newer larger dataset.

I've simplified my dataset to a set of unique device ID - event date pairs, but that clearly falls short of the mark.

Desired output: I'd like a function that, given a threshold input by the user (X consecutive days of no events in the dataset), it will compute the probability of a device reappearing after not appearing if it had been absent for X consecutive days.

0 Answers