Performing one-off calculations within a SQL query

268 views Asked by At

I have this query (edited for simplicity):

select to_timestamp(s.sampletimestamp/1000)
from sample s
where s.sampletimestamp >= extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10')*1000 and
s.sampletimestamp < extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10')*1000
order by s.sampletimestamp;

I notice that this executes much quicker by entering values for the time manually:

select to_timestamp(s.sampletimestamp/1000)
from sample s
where s.sampletimestamp >= 1376143200000 and
s.sampletimestamp < 1376229600000
order by s.sampletimestamp;

Where the time is an epoch time stamp in milliseconds. My guess is that the computer is evaluating the extract(EPOCH...) part for every record, whereas it only really needs to do that once.

Is there someway to keep the more human-readable form of the first query, whilst keeping the query as efficient as the second?

I am new to PostgreSQL (and entirely self-taught), so I think the problem I'm mostly suffering from is not knowing a particular keyword I should be putting into google - which I've already used as well as the PostgreSQL documentation.

Thanks in advance :)

EDIT1: Thanks for the very detailed replies. I suspect I'm probably in a different time zone to most of the respondants - I'll provide experimental evidence for this tomorrow (it's very late here).

EDIT2: To summarise the answer below, casting with 'bigint' does the trick. Replace:

where s.sampletimestamp >= extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10')*1000 and
s.sampletimestamp < extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10')*1000

With:

where s.sampletimestamp >= extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10')::bigint*1000 and
s.sampletimestamp < extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10')::bigint*1000
1

There are 1 answers

2
Craig Ringer On BEST ANSWER

What's happening here is that extract is implemented using the date_part function:

regress=> explain select count(1) from generate_series(1376143200000,1376143200000+1000000) x where x > extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10')*1000 and x <  extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10')*1000;
                                                                                                                                        QUERY PLAN                                                                                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=30.02..30.03 rows=1 width=0)
   ->  Function Scan on generate_series x  (cost=0.00..30.00 rows=5 width=0)
         Filter: (((x)::double precision > (date_part('epoch'::text, '2013-08-10 22:00:00+08'::timestamp with time zone) * 1000::double precision)) AND ((x)::double precision < (date_part('epoch'::text, '2013-08-11 22:00:00+08'::timestamp with time zone) * 1000::double precision)))
(3 rows)

date_part(text, timestamptz) is defined as stable not immutable:

regress=> \df+ date_part
                                                                                                                 List of functions
   Schema   |   Name    | Result data type |        Argument data types        |  Type  | Volatility |  Owner   | Language |                               Source code                                |                 Description                 
------------+-----------+------------------+-----------------------------------+--------+------------+----------+----------+--------------------------------------------------------------------------+---------------------------------------------
 ...
 pg_catalog | date_part | double precision | text, timestamp with time zone    | normal | stable     | postgres | internal | timestamptz_part                                                         | extract field from timestamp with time zone
 ...

and I'm pretty sure that'll prevent Pg from pre-computing the value and inlining it into the call. I'd need to dig deeper to be sure.

I believe the reasoning is that date_part on a timestamptz can be dependent on the value of the TimeZone setting. This isn't true for date_part('epoch', some_timestamptz) but the query planner doesn't understand at planning time that you're using that.

I'm still surprised that it doesn't get pre-computed, as the documentation states:

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement. This category allows the optimizer to optimize multiple calls of the function to a single call.

You can work around this apparent limitation by converting to a timestamp at UTC (or whatever the TZ the epoch of your times is in) first, with AT TIME ZONE 'UTC'. E.g.:

select count(1) 
from generate_series(1376143200000,1376143200000+1000000) x 
where x > extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10' AT TIME ZONE 'UTC')*1000 
and x <  extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10' AT TIME ZONE 'UTC')*1000;

This executes faster, though there's more time difference than I'd expect if it were just being calculated once:

regress=> select count(1) from generate_series(1376143200000,1376143200000+1000000) x where x > extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10')*1000 and x <  extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10')*1000;
  count  
---------
 1000000
(1 row)

Time: 767.629 ms

regress=> select count(1) from generate_series(1376143200000,1376143200000+1000000) x where x > extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10' AT TIME ZONE 'UTC')*1000 and x <  extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10' AT TIME ZONE 'UTC')*1000;
  count  
---------
 1000000
(1 row)

Time: 373.453 ms

regress=> select count(1) from generate_series(1376143200000,1376143200000+1000000) x where x > 1376143200000 and x <  1376229600000;
  count  
---------
 1000000
(1 row)

Time: 324.557 ms

It would be possible to remove this query optimizer limitation / add a feature to optimize this. The optimizer would need to recognize, probably at parse time, that extract('epoch', ...) is a special case and instead of invoking date_part('epoch, ...) invoke a special timestamptz_epoch(...) function that was immutable.

A bit of looking at perf top results shows that the timestamptz case has the following peaks:

 10.33%  postgres      [.] ExecMakeFunctionResultNoSets
  7.76%  postgres      [.] timesub.isra.1
  6.94%  postgres      [.] datebsearch
  5.58%  postgres      [.] timestamptz_part
  3.82%  postgres      [.] AllocSetAlloc
  2.97%  postgres      [.] ExecEvalConst
  2.68%  postgres      [.] downcase_truncate_identifier
  2.38%  postgres      [.] ExecEvalScalarVarFast
  2.23%  postgres      [.] slot_getattr
  1.99%  postgres      [.] DatumGetFloat8

wheras with the use of AT TIME ZONE we get:

 11.58%  postgres      [.] ExecMakeFunctionResultNoSets
  4.28%  postgres      [.] AllocSetAlloc
  4.18%  postgres      [.] ExecProject
  3.82%  postgres      [.] slot_getattr
  2.99%  libc-2.17.so  [.] __memmove_ssse3
  2.96%  postgres      [.] BufFileWrite
  2.80%  libc-2.17.so  [.] __memcpy_ssse3_back
  2.74%  postgres      [.] BufFileRead
  2.69%  postgres      [.] float8lt

and with the integer case:

  7.92%  postgres      [.] ExecMakeFunctionResultNoSets
  5.36%  postgres      [.] slot_getattr
  4.52%  postgres      [.] AllocSetAlloc
  4.02%  postgres      [.] ExecProject
  3.42%  libc-2.17.so  [.] __memmove_ssse3
  3.33%  postgres      [.] BufFileWrite
  3.31%  libc-2.17.so  [.] __memcpy_ssse3_back
  2.91%  postgres      [.] BufFileRead
  2.90%  postgres      [.] GetMemoryChunkSpace
  2.67%  postgres      [.] AllocSetFree

So you can see that the AT TIME ZONE version avoids the repeated timestamptz_part and datebsearch calls. The main difference between it and the integer case is float8lt; it looks like we're doing double precision comparisons instead of integer comparisons.

Sure enough, a cast takes care of it:

select count(1) 
from generate_series(1376143200000,1376143200000+1000000) x
where x > extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-11 00:00:00+10' AT TIME ZONE 'UTC')::bigint * 1000  
and x <  extract(EPOCH FROM TIMESTAMP WITH TIME ZONE '2013-08-12 00:00:00+10' AT TIME ZONE 'UTC')::bigint * 1000;

I don't have the time to pursue the enhancement to the optimizer discussed above at present, but it's something you might want to consider raising on the mailing list.