I'm using pandas.DataFrame
to store 3 hours of sensor data sampled at the second interval. So each second, I'm adding a row and dropping rows older than 3 hours.
Currently, I'm doing it very inefficiently:
record = pd.DataFrame.from_records([record], index='Date')
if self.data.empty:
#logger.debug('Creating data log')
self.data = record
else:
#logger.debug('Appending new record')
self.data = self.data.append(record)
start = now - self.keepInMemory
self.data = self.data[self.data.index > start]
Namely, a new DataFrame is created, then it is appended, and then old records are removed. It is slow, inefficient, and certainly does a lot of memory reallocation.
What I'm looking for is:
- Pre-allocated DataFrame
- Remove old records (without reallocation)
- Add new record
What is the most Panda-ish way to accomplish that?
Thank you.
P.s. The only relevant question on SO I managed to find was: deque in python pandas but it didn't help.
Update: Using DataFrame and not deque is a requirement, since other modules use self.data
as a service for computing generic conditions, e.g. ('is std() of last 15 minutes differs from this of the first' and similar). To stress, it's not just for recording data, it's for providing ability to for other modules to compute various generic conditions efficiently and conveniently.
I suspect there might be a clever play with indices (e.g. data.index = np.roll(data.index,1))
and then replacing the last row inplace, but until now I could not figure out how to do that efficiently. New record has the same format as the rest, so it should be possible.
work in progress
See comments below. I'll leave answer as it is until I can work something out. I don't want anyone to think that this solves the problem.
Consider the dataframe
df
with timeseries indextidx
.tidx
starts off with 70 days worth of dates.Suppose we get a new time stamp for which we will record new data. I happened to just add a day to the max of existing days, but this shouldn't matter. We can append a new row by assigning a series to
df
at row with index value equal to the new date. We useloc
to do this.This operation should be
inplace
fairly efficient.Now we can define the amount of time you want to keep with a
pd.offsets
object. I chose 60 days for demonstration purposes. Three hours would have beenpd.offsets.Hour(3)
. I find the index values that are too old and Idrop
them... again,inplace
You should be able to apply this and should be more efficient then what you're doing.