Data cleanup that requires iterating over pandas.DataFrame 3 rows at a time

I have some large datasets of sensor readings where occasionally a row will be 0. The heuristic is quite simple: if neither the previous row nor the next row is 0, I assume the 0 is a sensor glitch and replace it with the average of the two rows around it.

There are legitimate cases where sensor readings can be 0, so simply looking at 0s isn't an option.
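To illustrate on a made-up series (the values here are hypothetical), only a zero whose neighbours are both non-zero should be replaced:

import pandas as pd

s = pd.Series([1, 0, 2, 0, 0, 3])
# index 1: 0 sits between 1 and 2, so it's a glitch -> becomes (1 + 2) / 2 = 1.5
# indices 3 and 4: consecutive zeros, each has a zero neighbour -> both stay 0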

So far, I have come up with the following method for cleaning it up:

data["x+1"] = data["x"].shift(1)
data["x+2"] = data["x"].shift(2)

res = data[["x", "x+1", "x+2"]].apply( 
  lambda x : (x[0] + x[2])/2 
             if ((x[0] > 0) and (x[1] <= 0) and (x[2] > 0) ) 
             else x[1], axis=1 )

data[x] = res.shift(-1)

This works in principle, and I prefer it to iterating over 3 zipped and shifted dataframes like so:

for row1, row2, row3 in zip( data.iterrows(), data.shift(1).iterrows(), data.shift(2).iterrows() ):
       ...

However, both of these methods take an eternity on large files. I've read that apply isn't actually vectorized (it loops over rows in Python) and that it duplicates data in memory for its output.

I've also tried the following but it's just shy of properly working:

data.loc[data["x"] == 0, "x"] = np.nan
data["x"] = data["x"].ffill(limit=1)  # forward-fill at most one consecutive NaN
data["x"] = data["x"].fillna(0)       # put the remaining NaNs back to 0

This is lightning fast, but doesn't do what I hope: with limit=1 the fill simply stops after the first NaN of each run, whereas I want to fill a NaN only when it stands alone (runs of two or more should stay untouched).
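One way to restrict the fill to isolated NaNs (a minimal sketch, assuming readings are never negative; isolated is just an illustrative name) is to build the single-NaN mask explicitly:

import numpy as np

data.loc[data["x"] == 0, "x"] = np.nan
isna = data["x"].isna()
# a NaN is isolated when neither neighbour is NaN
isolated = isna & ~isna.shift(1, fill_value=False) & ~isna.shift(-1, fill_value=False)
data.loc[isolated, "x"] = (data["x"].shift(1) + data["x"].shift(-1)) / 2
data["x"] = data["x"].fillna(0)  # longer runs of zeros are put back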

I'm not sure what I can do to make this solution scale to multi-gigabyte files. I'm currently resorting to using awk to run through the files, but this isn't ideal because the code is less maintainable and because a bunch of other similar pre-processing already takes place in a Python program.
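For scale, one option might be to read the file in chunks and carry a two-row overlap across chunk boundaries so edge rows still see both neighbours (a rough sketch: readings.csv, cleaned.csv and the chunk size are placeholders, and CSV header handling is elided):

import pandas as pd

def clean_chunk(df):
    # same heuristic as above, vectorized with shifts
    prev = df["x"].shift(1)
    nxt = df["x"].shift(-1)
    mask = (df["x"] <= 0) & (prev > 0) & (nxt > 0)
    df.loc[mask, "x"] = (prev + nxt) / 2
    return df

carry = pd.DataFrame()
for chunk in pd.read_csv("readings.csv", chunksize=1_000_000):
    block = pd.concat([carry, chunk], ignore_index=True)
    block = clean_chunk(block)
    # the last two rows don't have a complete neighbourhood yet; hold them back
    block.iloc[:-2].to_csv("cleaned.csv", mode="a", header=False, index=False)
    carry = block.iloc[-2:]
clean_chunk(carry.reset_index(drop=True)).to_csv("cleaned.csv", mode="a", header=False, index=False)

A row fixed in one block becomes positive, so re-checking it in the next block's overlap is a no-op, and each row is written exactly once.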

Any advice is appreciated.

1 Answer

Accepted answer by Psidom:

You can vectorize it with the where function:

preV = data['x'].shift(1)   # value in the previous row
nexT = data['x'].shift(-1)  # value in the next row
# keep x wherever it is positive or either neighbour is <= 0;
# everywhere else (an isolated non-positive value) use the neighbours' average
data['x'] = data['x'].where((data['x'] > 0) | (preV <= 0) | (nexT <= 0), (preV + nexT)/2)

With input:

data = pd.DataFrame({"x": [1,2,3,0,0,2,3,0,4,2,0,0,0,1]})

gives:

0     1.0
1     2.0
2     3.0
3     0.0
4     0.0
5     2.0
6     3.0
7     3.5              # 0 gets replaced here
8     4.0
9     2.0
10    0.0
11    0.0
12    0.0
13    1.0
Name: x, dtype: float64

Or you can create a boolean mask marking the positions whose value should be replaced, and assign the average of the previous and next values to them:

data.loc[(data['x'] <= 0) & (preV > 0) & (nexT > 0), "x"] = (preV + nexT)/2