I have a system that logs information about running processes. Each running process contains a series of steps that may or may not run in parallel. The system logs information about a process and its steps to two separate tables:
CREATE TABLE pid (
pid integer,
start_time timestamp,
end_time timestamp,
elapsed bigint,
aborted integer,
label char(30)
);
CREATE TABLE pid_step (
pid integer,
step integer,
start_time timestamp,
end_time timestamp,
elapsed bigint,
mem bigint,
...
);
The pid_step table contains a bunch of resource usage stats about each step, which I have simplified here to just the mem column, which logs the number of bytes of memory allocated for that step. I want to sample memory allocation by process label, perhaps at 5-second intervals, so I can plot it. I need a result similar to the following:
tick label mem
----------------------- ------ -----------
2014-11-04 05:37:40.0 foo 328728576
2014-11-04 05:37:40.0 bar 248436
2014-11-04 05:37:40.0 baz 1056144
2014-11-04 05:37:45.0 foo 1158807552
2014-11-04 05:37:45.0 bar 632822
2014-11-04 05:37:45.0 baz 854398
Since the logs only give me starting and ending timestamps for each process and step instead of a sample of resource usage at 5 second intervals, I need to find the most efficient way to determine which process steps were running at each 5 second interval (tick) and then aggregate their allocated memory. I've made 3 separate attempts that all produce the same results with varying levels of performance. For brevity's sake I'll put each query and its explain plan in a gist (https://gist.github.com/anonymous/3b57f70015b0d234a2de) but I'll explain my approach for each:
1. This was my first attempt and it is definitely the most intuitive and easiest to maintain. It cross joins the distinct process labels with generate_series to generate the 5-second ticks for each label, then left joins on the pid and pid_step tables. The left join creates a "zero-fill" effect and ensures we do not drop any ticks that have no associated data. Unfortunately, this approach performs the worst (see benchmark link below), and I believe it is due to the use of a hash join where the between t2.start_time and t2.end_time predicate is handled as a join filter instead of a join condition. (A sketch of this query's shape follows the list below.)
2. This was my second attempt, and it performs far better but is much less intuitive and maintainable. The "zero-fill" approach is the same as in query 1, but before performing the left join of pid and pid_step, I pre-compute the ticks that have associated data, based on the max process elapsed time and the process steps' start and end times. This allows for a sort merge join where both the tick and label predicates can be expressed as join conditions and no join filters are used.
3. This was my final attempt, and it performs the best, with about the same intuitiveness and maintainability as query 2. The optimization here is that I use the max process step elapsed time, which is guaranteed to be smaller than the max process elapsed time and therefore creates a smaller nested loop at the beginning of CTE t3.
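For reference, here is a hypothetical sketch of query 1's shape; the actual queries and their explain plans are in the gist above, so the aliases and time bounds here are placeholders:
-- Sketch of query 1: cross join labels with 5-second ticks, then
-- left join the processes and steps running at each tick.
-- The BETWEEN predicates are what end up as hash join filters.
SELECT t.tick, l.label, sum(s.mem) AS mem
FROM (SELECT DISTINCT label FROM pid) l
CROSS JOIN generate_series(
         timestamp '2014-11-04 05:37:40',
         timestamp '2014-11-04 06:00:00',
         interval '5 seconds') AS t(tick)
LEFT JOIN pid p
       ON p.label = l.label
      AND t.tick BETWEEN p.start_time AND p.end_time
LEFT JOIN pid_step s
       ON s.pid = p.pid
      AND t.tick BETWEEN s.start_time AND s.end_time
GROUP BY t.tick, l.label
ORDER BY t.tick, l.label;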
Ideally, I'd like the SQL to be as simple and maintainable as query 1 but perform as well as query 3. Is there anything I can do in the way of indexes or a slight rewrite of query 1 that would improve the performance?
Benchmark Results: https://i.stack.imgur.com/9SZWS.png
Here is a solution using the power of PostgreSQL ranges (SQLFiddle).
The sampling method is a good idea but is, in my opinion, an optimisation. Here is my solution:
Let's say we want to plot one day of data. We split this day into time slices, each lasting 5 seconds. For one process and one time slice, we want to retrieve the average memory of all steps that ran during those 5 seconds. So instead of sampling every 5 seconds (which can hide data spikes), we display an aggregation of the relevant data for those 5 seconds. The aggregation can be any available PostgreSQL aggregate function.
The first step is to generate these time slices (as you already did, but without using the range datatype):
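The original answer's code block was not preserved here; below is a minimal sketch, assuming we plot the (placeholder) day 2014-11-04:
-- Generate one tsrange per 5-second slice; '[)' makes the lower bound
-- inclusive and the upper bound exclusive.
SELECT tsrange(tick, tick + interval '5 seconds', '[)') AS time_slice
FROM generate_series(
       timestamp '2014-11-04',
       timestamp '2014-11-04' + interval '1 day',
       interval '5 seconds') AS tick;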
Notice that these slices do not overlap each other, as the lower bound is inclusive and the upper bound is exclusive.
Here is the tricky part: we don't want to change our table schema by removing start_time and end_time and creating a range column for this data. Fortunately, PostgreSQL allows indexes on expressions. With such an index, we are now able to use the wide variety of PostgreSQL range operators at a fraction of the processing cost; the only caveat is that, in order to use the index, our query must use the exact same expression.
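A plausible reconstruction of that index (the original code block was not preserved, so the exact tsrange form is an assumption carried through the rest of this answer):
-- Index each step's time span as an inclusive tsrange so that range
-- operators (&&, @>, ...) can use the index.
CREATE INDEX pid_step_range_idx
    ON pid_step USING gist (tsrange(start_time, end_time, '[]'));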
As you might already have guessed, the index will be used to join time slices and steps, since we join whenever a step's "virtual" range overlaps a time slice.
Here is the final query:
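The original query was likewise not preserved; here is a sketch consistent with the pieces above (sum matches the desired output from the question, but any aggregate such as avg would work):
WITH slices AS (
    SELECT tsrange(tick, tick + interval '5 seconds', '[)') AS time_slice
    FROM generate_series(
           timestamp '2014-11-04',
           timestamp '2014-11-04' + interval '1 day',
           interval '5 seconds') AS tick
)
SELECT lower(sl.time_slice) AS tick,
       p.label,
       sum(st.mem) AS mem  -- swap in avg/max/etc. as needed
FROM slices sl
JOIN pid_step st
  -- Same expression as the index, so the GiST index can be used.
  ON tsrange(st.start_time, st.end_time, '[]') && sl.time_slice
JOIN pid p
  ON p.pid = st.pid
GROUP BY sl.time_slice, p.label
ORDER BY tick, p.label;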
I hope the performance is still great on your production data (there are a lot of details that play into the query planning equation).