Python: Detect isolated edges in a histogram for outlier detection in time-series data


I am attempting to find outliers my own way. How? Plot the histogram and search for isolated edges, meaning edges with only a few counts whose neighboring edges have zero counts. Usually they sit at the far end of the histogram. Those could be outliers, so I want to detect and drop them. What kind of data is it? Time-series data coming from the field. Sometimes you see weird numbers (while the sensor data is around 50-100, outliers may be -10000 or 1000) when the sensors fail to communicate data in time and the data logger stores these weird numbers. They are momentary, may occur a few times in a year of data, and make up less than 1% of the total samples.

My code:

``````
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained is:
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])

# Repeat the last count so vals matches edges in length
# (np.histogram returns one fewer count than edges).
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace zero counts with NaN so that empty bins are never flagged.
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Flag isolated edges by count alone, say, < 50.
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# Plot the histogram counts against the edges.
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()
``````

Present output:

This output is not correct. Why? There is only one isolated edge, at the beginning (value 0). However, my code also flags the edges at about 43.67 and 46.58 as isolated, simply because their counts are low, even though they sit right next to populated edges.

``````
vedf =

edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     True
16  46.578640   11.0    True
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False
``````

Expected output:

``````
vedf =

edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     False
16  46.578640   11.0    False
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False
``````
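
The rule behind the expected output (a bin counts as isolated only if it is sparsely populated and both of its neighbors are empty) could be sketched as follows. This is only a minimal sketch, not a definitive solution: the function name `isolated_bins` and the parameter `max_count` are hypothetical, the threshold of 50 is taken from the code above, and it assumes that a missing neighbor at either end of the histogram should be treated as empty.

```python
import numpy as np

def isolated_bins(counts, max_count=50):
    """Flag bins that are non-empty, sparse, and have empty neighbors (hypothetical helper)."""
    counts = np.asarray(counts)
    # Pad with a zero on each side so the first and last bins
    # treat their missing neighbor as an empty bin (an assumption).
    padded = np.pad(counts, 1, constant_values=0)
    left_empty = padded[:-2] == 0   # count of the bin to the left
    right_empty = padded[2:] == 0   # count of the bin to the right
    sparse = (counts > 0) & (counts < max_count)
    return sparse & left_empty & right_empty

vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
flags = isolated_bins(vals)
print(np.flatnonzero(flags))  # indices of the isolated bins
```

With the counts from the question, only the first bin is flagged: the bins with counts 1 and 11 each have a non-empty neighbor, so they are not treated as isolated.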

Once I know that a specific edge is isolated, I can drop all the samples that fall into it.
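
Dropping those samples might look like the sketch below. It is written under several assumptions: the helper name `drop_isolated_samples` is hypothetical, the series contains no NaN values (`np.histogram` cannot derive a range from them), and the example data is synthetic, made up purely for illustration. Samples are mapped back to their bins with `np.digitize`, and the index is clipped so that the maximum value lands in the last bin, as it does in `np.histogram`.

```python
import numpy as np
import pandas as pd

def drop_isolated_samples(series, bins=20, max_count=50):
    """Return the series without samples that fall into isolated bins (hypothetical helper)."""
    values = series.to_numpy()
    counts, edges = np.histogram(values, bins=bins)
    # A bin is isolated if it is sparse and both neighbors are empty;
    # a missing neighbor at either end is treated as empty (an assumption).
    padded = np.pad(counts, 1, constant_values=0)
    isolated = ((counts > 0) & (counts < max_count)
                & (padded[:-2] == 0) & (padded[2:] == 0))
    # Map every sample to its bin index; clip so the maximum value
    # is assigned to the last bin rather than one past it.
    bin_idx = np.clip(np.digitize(values, edges) - 1, 0, len(counts) - 1)
    return series[~isolated[bin_idx]]

# Synthetic example data: 1000 plausible readings plus three stray zeros.
readings = pd.Series(np.concatenate([np.linspace(50, 58, 1000), [0.0, 0.0, 0.0]]))
cleaned = drop_isolated_samples(readings)
print(len(readings), len(cleaned))
```

Because a single very distant outlier (such as -10000) stretches the histogram range and can squeeze all valid data into one bin, it may be worth repeating the procedure until no isolated bins remain.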