Prediction of equipment failure based on meteorological data


I have a dataset of timestamps with associated weather data at hourly resolution. The last column, "Failure", records whether the equipment failed in that hour. This is a classification task. The problem is that the classes 0 and 1 are extremely unbalanced: there are 4,946 hours in which no failure occurred and only 142 hours in which a failure occurred.
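To put the imbalance in numbers, here is a small Python sketch of the arithmetic. It uses only the two counts stated above; the variable names are illustrative. It also shows why plain accuracy may be a weak metric for this data: a model that always predicts 0 already scores very high while finding no failures at all.

```python
# Class counts taken from the question.
n_ok = 4946    # hours without a failure (class 0)
n_fail = 142   # hours with a failure (class 1)

total = n_ok + n_fail
failure_rate = n_fail / total

# Accuracy of a trivial baseline that always predicts "no failure".
baseline_accuracy = n_ok / total
# Recall of that baseline on the failure class: it never flags a failure.
baseline_recall = 0 / n_fail

print(f"total hours: {total}")                                # 5088
print(f"failure rate: {failure_rate:.2%}")                    # 2.79%
print(f"always-0 accuracy: {baseline_accuracy:.2%}")          # 97.21%
print(f"always-0 recall on failures: {baseline_recall:.2%}")  # 0.00%
```

Any accuracy figure is therefore best read against this roughly 97% baseline, together with precision and recall on the failure class.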

Here is the full dataset on Google Drive:

Full dataset

Preview of the first rows:

Data preview

I have tried many approaches. For example, I kept all 142 failure hours and randomly sampled 142 failure-free hours to get a balanced dataset, then used lagged predictor values. Accuracy was at most 60% with logistic regression, SVM, or Naive Bayes, and sometimes lower, apparently depending on which failure-free hours were sampled. Most recently I used the ADASYN algorithm from the smotefamily package, which made the dataset almost balanced. I then split the data 70/30 into a training set and a test set in the usual way. Accuracy on the test set was around 94 to 95% for a random forest model or the C5.0 algorithm. However, when I removed a few failure rows from the data before modeling and then tried to predict them, the result was catastrophic, at about 10%. The model could not predict newly arriving data despite its high accuracy on the test set, so I suspect overfitting.
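One possible reason for the gap between test accuracy and performance on new data is that the oversampling was applied before the 70/30 split, so synthetic failure rows derived from training rows can land in the test set. The following is a minimal Python sketch of the alternative ordering: split chronologically first, then oversample only the training part. It rests on several assumptions: the question's tools (smotefamily, C5.0) are R packages and Python is used here only for illustration; plain random duplication stands in for ADASYN; the toy data, the `wind` column, and the threshold rule are all made up and are not taken from the real dataset.

```python
import random

def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows so the test set lies strictly after the training set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def oversample_minority(rows, label_index=-1, seed=0):
    """Randomly duplicate failure rows until both classes have the same size.
    This is simple random oversampling, used here as a stand-in for ADASYN."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_index] == 1]
    majority = [r for r in rows if r[label_index] == 0]
    if not minority or len(minority) >= len(majority):
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def precision_recall(y_true, y_pred):
    """Precision and recall for the failure class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: (hour_index, wind, failure), already sorted by time.
data_rng = random.Random(42)
rows = []
for hour in range(1000):
    wind = data_rng.uniform(0, 30)
    failure = 1 if (wind > 25 and data_rng.random() < 0.3) else 0
    rows.append((hour, wind, failure))

# 1) Split by time first, so the test set contains only real, later hours.
train, test = chronological_split(rows)
# 2) Balance the training part only; the test set keeps its natural imbalance.
train_balanced = oversample_minority(train)

# A placeholder "model": flag a failure whenever wind exceeds a threshold.
y_true = [r[2] for r in test]
y_pred = [1 if r[1] > 25 else 0 for r in test]
precision, recall = precision_recall(y_true, y_pred)
print(f"train rows: {len(train)}, balanced train rows: {len(train_balanced)}")
print(f"test precision: {precision:.2f}, test recall: {recall:.2f}")
```

With this ordering the test set contains no synthetic or duplicated rows, so its precision and recall should be closer to what the model would achieve on newly arriving hours.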

Thank you for your answers.
