Data cleanup that requires iterating over pandas.DataFrame 3 rows at a time


I have some large datasets of sensor readings where occasionally a row will be 0. The heuristic is quite simple: if the previous row and the next row were not 0, I assume this is a sensor glitch and I replace the row with the average of the two around it. For example, the 0 in 3, 0, 4 would become 3.5, while the 0s in 3, 0, 0, 2 would be left alone.

There are legitimate cases where sensor readings can be 0, so simply looking at 0s isn't an option.

So far, I have come up with the following method for cleaning it up:

data["x+1"] = data["x"].shift(1)
data["x+2"] = data["x"].shift(2)

res = data[["x", "x+1", "x+2"]].apply( 
  lambda x : (x[0] + x[2])/2 
             if ((x[0] > 0) and (x[1] <= 0) and (x[2] > 0) ) 
             else x[1], axis=1 )

data[x] = res.shift(-1)

This works in principle, and I prefer it to iterating over 3 zipped and shifted dataframes like so:

for row1, row2, row3 in zip( data.iterrows(), data.shift(1).iterrows(), data.shift(2).iterrows() ):
       ...

However, both of these methods take an eternity to process. I've read that apply isn't vectorized, and that it duplicates data in memory along the way.

I've also tried the following but it's just shy of properly working:

data.loc[ data["x"] == 0 , "x" ] = np.NaN
data["x"].fillna( method="ffill", limit=1, inplace=True)
data["x"].fillna( 0 )

This is lightning fast, but doesn't do what I would hope: with limit=1 it fills the first NaN of every run, whereas I'd want it to fill only when the gap is a single NaN.
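One way to keep this approach while restricting fills to isolated gaps, sketched under the assumption that data and the x column are as above (s and lone are names I made up): mark zeros as NaN, then fill only the NaNs whose immediate neighbours are both present. Note this treats any non-zero neighbour as valid, matching the prose description of the heuristic rather than the strictly-positive check in the apply version.

s = data["x"].mask(data["x"] == 0)                           # zeros -> NaN
lone = s.isna() & s.shift(1).notna() & s.shift(-1).notna()   # runs of exactly one NaN
s[lone] = (s.shift(1) + s.shift(-1)) / 2                     # average of the two neighbours
data["x"] = s.fillna(data["x"])                              # put the untouched zeros back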

I'm not sure how to make this solution scale to multi-gigabyte files. I'm currently resorting to awk to run through the files, but that isn't ideal: the awk code is less maintainable, and a bunch of other similar pre-processing already takes place in a Python program.
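If the files are too big to load at once, the same kind of vectorized cleanup can be streamed. A rough sketch, not a drop-in solution ("readings.csv", "cleaned.csv", the column name x, and the chunk size are all placeholders): carry the last two rows of each chunk into the next one, so that a zero sitting at a chunk boundary still sees both of its neighbours before it is written out.

import pandas as pd

def clean(df):
    prev_val = df["x"].shift(1)
    next_val = df["x"].shift(-1)
    glitch = (df["x"] <= 0) & (prev_val > 0) & (next_val > 0)
    df.loc[glitch, "x"] = (prev_val + next_val) / 2
    return df

carry = None
first = True
for chunk in pd.read_csv("readings.csv", chunksize=1_000_000):
    if carry is not None:
        chunk = pd.concat([carry, chunk], ignore_index=True)
    chunk = clean(chunk)
    start = 0 if first else 1          # position 0 repeats an already-written row
    chunk.iloc[start:-1].to_csv("cleaned.csv", mode="w" if first else "a",
                                header=first, index=False)
    carry = chunk.iloc[-2:]            # last row still needs its next neighbour
    first = False

if carry is not None:
    # the final row never gets a next neighbour, so it is written unchanged
    carry.iloc[1:].to_csv("cleaned.csv", mode="a", header=False, index=False)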

Any advice is appreciated.


1 Answer

Answer by Psidom (accepted):

You can vectorize it with the where function:

preV = data['x'].shift(1)
nexT = data['x'].shift(-1)
data['x'] = data['x'].where((data['x'] > 0) | (preV <= 0) | (nexT <= 0), (preV + nexT)/2)
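(Note that where keeps the original value wherever the condition is True and substitutes the second argument where it is False, so the condition lists the cases in which a row should be left alone: the value is already positive, or one of its neighbours is not.)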

With input:

data = pd.DataFrame({"x": [1,2,3,0,0,2,3,0,4,2,0,0,0,1]})

gives:

0     1.0
1     2.0
2     3.0
3     0.0
4     0.0
5     2.0
6     3.0
7     3.5              # 0 gets replaced here
8     4.0
9     2.0
10    0.0
11    0.0
12    0.0
13    1.0
Name: x, dtype: float64

Or you can create a logical index indicating the positions where the value should be replaced, and assign the average of the previous and next rows' values to those positions:

data.loc[(data['x'] <= 0) & (preV > 0) & (nexT > 0), "x"] = (preV + nexT)/2
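For completeness, a self-contained version of this second form (a sketch; the import, the explicit float dtype, and the comments are additions), using the same example input as above:

import pandas as pd

# float dtype up front, since the replacement values (e.g. 3.5) are floats
data = pd.DataFrame({"x": [1, 2, 3, 0, 0, 2, 3, 0, 4, 2, 0, 0, 0, 1]}, dtype=float)

preV = data["x"].shift(1)    # previous row's value
nexT = data["x"].shift(-1)   # next row's value

# .loc aligns the right-hand Series on the index, so the full
# (preV + nexT) / 2 Series can be assigned to the masked rows directly
data.loc[(data["x"] <= 0) & (preV > 0) & (nexT > 0), "x"] = (preV + nexT) / 2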