Fast row removal and addition in pandas.DataFrame without reallocation

709 views Asked by At

I'm using pandas.DataFrame to store 3 hours of sensor data sampled at the second interval. So each second, I'm adding a row and dropping rows older than 3 hours.

Currently, I'm doing it very inefficiently:

record = pd.DataFrame.from_records([record], index='Date')
if self.data.empty:
    #logger.debug('Creating data log')
    self.data = record
else:
    #logger.debug('Appending new record')
    self.data = self.data.append(record)
start = now - self.keepInMemory
self.data = self.data[self.data.index > start]

Namely, a new DataFrame is created, then it is appended, and then old records are removed. It is slow, inefficient, and certainly does a lot of memory reallocation.

What I'm looking for is:

  • Pre-allocated DataFrame
  • Remove old records (without reallocation)
  • Add new record

What is the most Panda-ish way to accomplish that?

Thank you.

P.s. The only relevant question on SO I managed to find was: deque in python pandas but it didn't help.

Update: Using DataFrame and not deque is a requirement, since other modules use self.data as a service for computing generic conditions, e.g. ('is std() of last 15 minutes differs from this of the first' and similar). To stress, it's not just for recording data, it's for providing ability to for other modules to compute various generic conditions efficiently and conveniently.

I suspect there might be a clever play with indices (e.g. data.index = np.roll(data.index,1)) and then replacing the last row inplace, but until now I could not figure out how to do that efficiently. New record has the same format as the rest, so it should be possible.

1

There are 1 answers

2
piRSquared On

work in progress

See comments below. I'll leave answer as it is until I can work something out. I don't want anyone to think that this solves the problem.


Consider the dataframe df with timeseries index tidx. tidx starts off with 70 days worth of dates.

tidx = pd.date_range('2011-03-01', periods=70)
df = pd.DataFrame(dict(A=np.arange(70)), tidx)

Suppose we get a new time stamp for which we will record new data. I happened to just add a day to the max of existing days, but this shouldn't matter. We can append a new row by assigning a series to df at row with index value equal to the new date. We use loc to do this.

This operation should be inplace fairly efficient.

new_index = df.index.max() + pd.offsets.Day()
df.loc[new_index] = pd.Series([99], df.columns)

Now we can define the amount of time you want to keep with a pd.offsets object. I chose 60 days for demonstration purposes. Three hours would have been pd.offsets.Hour(3). I find the index values that are too old and I drop them... again, inplace

keep = pd.offsets.Day(60)
drops = df.index[df.index < (df.index.max() - keep)]

df.drop(drops, inplace=True)

You should be able to apply this and should be more efficient then what you're doing.