How to get a simple hash join query to perform as well as a complex sort merge query?

I have a system that logs information about running processes. Each running process contains a series of steps that may or may not run in parallel. The system logs information about a process and its steps to two separate tables:

CREATE TABLE pid (
  pid         integer,
  start_time  timestamp,
  end_time    timestamp,
  elapsed     bigint,
  aborted     integer,
  label       char(30)
);

CREATE TABLE pid_step (
  pid         integer,
  step        integer,
  start_time  timestamp,
  end_time    timestamp,
  elapsed     bigint,
  mem         bigint,
  ...
);

The pid_step table contains a bunch of resource usage stats about each step; I have simplified it here to just the mem column, which logs the number of bytes of memory allocated for that step. I want to sample memory allocation by process label, perhaps at 5 second intervals, so I can plot it. I need a result similar to the following:

tick                    label  mem
----------------------- ------ -----------
2014-11-04 05:37:40.0   foo      328728576
2014-11-04 05:37:40.0   bar         248436
2014-11-04 05:37:40.0   baz        1056144
2014-11-04 05:37:45.0   foo     1158807552
2014-11-04 05:37:45.0   bar         632822
2014-11-04 05:37:45.0   baz         854398

Since the logs only give me starting and ending timestamps for each process and step instead of a sample of resource usage at 5 second intervals, I need to find the most efficient way to determine which process steps were running at each 5 second interval (tick) and then aggregate their allocated memory. I've made 3 separate attempts that all produce the same results with varying levels of performance. For brevity's sake I'll put each query and its explain plan in a gist (https://gist.github.com/anonymous/3b57f70015b0d234a2de) but I'll explain my approach for each:

  1. This was my first attempt, and it is definitely the most intuitive and easiest to maintain. It cross joins the distinct process labels with generate_series to generate the 5 second ticks for each label, then left joins the pid and pid_step tables. The left join creates a "zero-fill" effect and ensures we do not drop any ticks that have no associated data. Unfortunately this approach performs the worst (see benchmark link below), and I believe that is due to the use of a hash join in which the between t2.start_time and t2.end_time predicate is handled as a join filter instead of a join condition. (A rough sketch of this query's shape follows this list.)

  2. This was my second attempt and it performs much better, but it is a lot less intuitive and harder to maintain. The "zero-fill" approach is the same as in query 1, but before performing the left join of pid and pid_step, I pre-compute the ticks that have associated data based on the max process elapsed time and the process steps' start and end times. This allows a sort merge join where both the tick and label predicates can be expressed as join conditions and no join filters are used.

  3. This was my final attempt and it performs the best, with about the same intuitiveness and maintainability as query 2. The optimization here is that I use the max process step elapsed time, which is guaranteed to be smaller than the max process elapsed time and therefore produces a smaller nested loop at the beginning of CTE t3.
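
To give a feel for query 1 without reproducing the whole gist, its general shape is roughly the following (a simplified sketch only; the tick bounds, aliases and zero-fill expression are illustrative, and the real query and plan are in the gist):

SELECT t1.tick, t1.label, coalesce(sum(t3.mem), 0) AS mem
FROM (
  -- one row per (tick, label): the "zero-fill" scaffold
  SELECT tick, label
  FROM generate_series('2014-11-04 05:37:40'::timestamp,
                       '2014-11-04 05:38:40'::timestamp,
                       '5 seconds'::interval) AS tick
  CROSS JOIN (SELECT DISTINCT label FROM pid) labels
) t1
LEFT JOIN pid t2
  ON  t2.label = t1.label
  AND t1.tick BETWEEN t2.start_time AND t2.end_time
LEFT JOIN pid_step t3
  ON  t3.pid = t2.pid
  AND t1.tick BETWEEN t3.start_time AND t3.end_time
GROUP BY t1.tick, t1.label
ORDER BY t1.tick, t1.label;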

Ideally, I'd like the SQL to be as simple and maintainable as query 1 but perform as well as query 3. Is there anything I can do in the way of indexes or a slight rewrite of query 1 that would improve the performance?

Benchmark Results: https://i.stack.imgur.com/9SZWS.png

1 Answer

Answered by Clément Prévost:

Here is a solution using the power of PostgreSQL ranges (SQLFiddle)

CREATE TABLE pid (
  pid         integer PRIMARY KEY,
  label       char(30)
);

CREATE TABLE pid_step (
  pid         integer,
  step        serial,
  start_time  timestamp,
  end_time    timestamp,
  mem         bigint,
  PRIMARY KEY (pid, step)
);

The sampling method is a good idea but is, in my opinion, an optimisation. Here is my solution:

Let's say we want to plot one day of data. We split this day into a number of time slices, each lasting 5 seconds. For one process and for one time slice, we want to retrieve the average memory of all steps that ran during those 5 seconds. So instead of sampling every 5 seconds (which can hide data spikes), we display an aggregation of the relevant data for those 5 seconds. The aggregation can use whichever PostgreSQL aggregate function is appropriate.

The first step is to generate these time slices (as you already did, but without using the range datatype):

-- list of time ranges of 5 seconds interval
-- inclusive lower bound, exclusive upper bound
SELECT 
  tsrange(tick, tick + '5 seconds'::interval, '[)') as time_range
FROM generate_series(
  '2001-02-16 21:28:30'::timestamp, 
  '2001-02-16 22:28:30'::timestamp, 
  '5 seconds'::interval
) AS tick

Notice that these slices do not overlap each other, since the lower bound is inclusive and the upper bound is exclusive.
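
A quick sanity check of the boundary behaviour (arbitrary timestamps, purely illustrative): a tick that falls exactly on a slice boundary belongs only to the slice that starts there, not to the one that ends there.

SELECT
  tsrange('2001-02-16 21:28:30', '2001-02-16 21:28:35', '[)')
    @> '2001-02-16 21:28:35'::timestamp AS in_previous_slice,  -- false
  tsrange('2001-02-16 21:28:35', '2001-02-16 21:28:40', '[)')
    @> '2001-02-16 21:28:35'::timestamp AS in_next_slice;      -- true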

Here is the tricky part: we don't want to change our table schema by removing start_time and end_time and creating a range column for this data. Fortunately, PostgreSQL allows indexes on expressions:

-- create index on the step's time range (exclusive lower and upper bounds)
CREATE INDEX pid_step_tstzrange_index ON pid_step 
USING gist (tsrange(start_time, end_time, '()'));

With this index, we are now able to use the wide variety of PostgreSQL range operators at a fraction of the processing cost. The only caveat is that, in order for the index to be used, our query must use the exact same expression.
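
For example, a predicate written with the exact same expression (the literal range on the right is arbitrary) lets the planner consider the GiST index, whereas a comparison written directly against start_time and end_time cannot use it:

SELECT *
FROM pid_step
WHERE tsrange(start_time, end_time, '()')
      && tsrange('2001-02-16 21:28:30'::timestamp,
                 '2001-02-16 21:28:35'::timestamp, '[)');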

As you might already have guessed, the index will be used to join time slices and steps, since the join condition is that a step's "virtual" range overlaps the time slice.

Here is the final query:

WITH

time_range AS (
  -- list of time ranges of 5 seconds interval
  -- inclusive lower bound, exclusive upper bound
  SELECT 
    tsrange(tick, tick + '5 seconds'::interval, '[)') as time_range
  FROM generate_series(
    '2001-02-16 21:28:30'::timestamp, 
    '2001-02-16 22:28:30'::timestamp, 
    '5 seconds'::interval
  ) AS tick
),

-- associate each pid_step with the matching time_range
-- aggregate the average memory usage for each pid for each time slice
avg_memory_by_pid_by_time_range AS (
  SELECT 
    time_range,
    pid,
    avg(mem) avg_memory
  FROM 
    time_range
    JOIN pid_step 
      ON tsrange(pid_step.start_time, pid_step.end_time, '()') && time_range.time_range
  GROUP BY
    time_range,
    pid
)

-- embellish the result with some additional data from pid
SELECT 
  lower(time_range) AS tick,
  pid.label AS label,
  trunc(avg_memory) AS mem
FROM
  avg_memory_by_pid_by_time_range
  JOIN pid ON avg_memory_by_pid_by_time_range.pid = pid.pid
ORDER BY
  lower(time_range),
  pid.label
;

I hope the performance is still great on your production data (there are a lot of details that play into the query planning equation).